Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Antoine Pitrou
Hi Kou, Thanks for pushing for this! Le 06/06/2024 à 11:27, Sutou Kouhei a écrit : 4. Standardize Apache Arrow schema for statistics and transmit statistics via separated API call that uses the C data interface [...] I think that 4. is the best approach in these candidates. I

[Discuss][C++] Switch to mimalloc by default?

2024-06-05 Thread Antoine Pitrou
Hello, Arrow C++ features a MemoryPool abstraction that allows using different allocators interchangeably. Several MemoryPool implementations are provided with Arrow C++ (though one can also build their own): - a jemalloc-based implementation, currently the default on Linux - a

Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-06-04 Thread Antoine Pitrou
(Gang Wu, Antoine Pitrou, Wes McKinney) 9x +1 non-binding (Micah Kornfield, Felipe Oliveira Carvalho, Fokko Driesprong, Alenka Frim, Andy Grove, Raúl Cumplido, Sutou Kouhei, Jiashen Zhang, Rok Mihevc) Arrow: 6x +1 binding (Micah Kornfield, Antoine Pitrou, Andy Grove, Raúl Cumplido, Wes McKinney

Re: [C++] Thread deadlock in ObjectOutputStream

2024-05-29 Thread Antoine Pitrou
Hi Li! Sorry for the delay. It seems the problem lies here: https://github.com/apache/arrow/blob/9f5899019d23b2b1eae2fedb9f6be8827885d843/cpp/src/arrow/filesystem/s3fs.cc#L1858 The Future is marked finished with the ObjectOutputStream's mutex taken, and the Future's callback then triggers a
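The hazard described above (completing a future while a mutex is still held, so that a callback may need that same mutex) can be illustrated with a small Python sketch. This is not the Arrow C++ code: `ToyStream`, `on_done`, `is_closed` and `finish` are hypothetical names, and the sketch assumes the callback re-enters the object that owns the lock. It shows one common remedy, which is to detach the callbacks under the lock and invoke them only after releasing it.

```python
import threading


class ToyStream:
    """Toy object guarded by a non-reentrant lock (hypothetical, not Arrow's API)."""

    def __init__(self):
        self._mutex = threading.Lock()  # non-reentrant, like a plain mutex
        self._closed = False
        self._callbacks = []

    def on_done(self, callback):
        # Register a callback to run once the stream is finished.
        with self._mutex:
            if not self._closed:
                self._callbacks.append(callback)
                return
        # Already finished: run immediately, outside the lock.
        callback()

    def is_closed(self):
        with self._mutex:
            return self._closed

    def finish(self):
        # Update the state and detach the callbacks while holding the lock...
        with self._mutex:
            self._closed = True
            callbacks, self._callbacks = self._callbacks, []
        # ...but invoke them only after releasing it, so that a callback
        # which re-enters this object cannot block on a lock we still hold.
        for callback in callbacks:
            callback()


def demo():
    stream = ToyStream()
    seen = []
    # This callback re-enters the stream and takes the same lock.
    stream.on_done(lambda: seen.append(stream.is_closed()))
    stream.finish()
    return seen
```

If `finish` invoked the callbacks inside the `with` block instead, the `is_closed` call in the callback would wait forever on the lock held by its own thread.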

Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-05-29 Thread Antoine Pitrou
+1 (binding). Thanks for taking this up, Rok! Regards Antoine. Le 29/05/2024 à 16:14, Rok Mihevc a écrit : # sending this to both dev@arrow and dev@parquet Hi all, Following the ML discussion [1] I would like to propose a vote for parquet-cpp issues to be moved from Parquet Jira [2] to

Re: [DISCUSS] Apache Arrow LinkedIn page

2024-05-24 Thread Antoine Pitrou
Is it somehow possible to be a "member" of this account to indicate that we have PMC status, or is that not possible within the LinkedIn membership/permissions model? Le 24/05/2024 à 18:04, Ian Cook a écrit : Following the discussion [1] earlier this year about the status of the Apache

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou
[...] "min", "byte_width" and "distinct_count" but users can also use application specific keys. 3. If true, then the value is approximate or best-effort. VALUE_SCHEMA is a dense union with

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou
Le 23/05/2024 à 16:09, Felipe Oliveira Carvalho a écrit : Protocols that produce/consume statistics might want to use the C Data Interface as a primitive for passing Arrow arrays of statistics. This is also my opinion. I think what we are slowly converging on is the need for a spec to

Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Antoine Pitrou
Hi Kou, I agree with Dewey that this is overstretching the capabilities of the C Data Interface. In particular, stuffing a pointer as metadata value and decreeing it immortal doesn't sound like a good design decision. Why not simply pass the statistics ArrowArray separately in your

Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-05-14 Thread Antoine Pitrou
I think these flags should be advisory and consumers should be free to ignore them. However, some consumers apparently would benefit from them to more faithfully represent the producer's intention. For example, in Arrow C++, we could perhaps have an ImportDatum function whose actual return

Re: [VOTE][Format] UUID canonical extension type

2024-04-30 Thread Antoine Pitrou
+1 (binding) Le 19/04/2024 à 22:22, Rok Mihevc a écrit : Hi all, Following initial requests [1][2] and recent tangential ML discussion [3] I would like to propose a vote to add language for UUID canonical extension type to CanonicalExtensions.rst as in PR [4] and written below. A draft C++

Re: [VOTE][Format] JSON canonical extension type

2024-04-30 Thread Antoine Pitrou
+1 (binding) for the current proposal, i.e. with the RFC 8259 requirement and the 3 current String types allowed. Regards Antoine. Le 30/04/2024 à 19:26, Rok Mihevc a écrit : Hi all, thanks for the votes and comments so far. I've amended [1] the proposed language with the RFC 8259

Re: Unsupported/Other Type

2024-04-17 Thread Antoine Pitrou
o we could use this in that context). I think that I would still prefer a canonical extension type (with storage type null) over a new dedicated type. On Wed, Apr 17, 2024 at 5:39 AM Antoine Pitrou wrote: Ah! Well, I think this could be an interesting proposal, but someone should put a mor

Re: Unsupported/Other Type

2024-04-17 Thread Antoine Pitrou
Ah! Well, I think this could be an interesting proposal, but someone should put a more formal proposal, perhaps as a draft PR. Regards Antoine. Le 17/04/2024 à 11:57, David Li a écrit : For an unsupported/other extension type. On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote: What

Re: AW: Personal feedback on your last release on Apache Arrow ADBC 0.11.0

2024-04-17 Thread Antoine Pitrou
Out of curiosity, did you notice this by chance or do you have some kind of script that processes ASF mailing-list archives for possible voting irregularities? Regards Antoine. Le 17/04/2024 à 10:44, Christofer Dutz a écrit : When looking at whimsy, I can’t see any person named Sutou

Re: Unsupported/Other Type

2024-04-17 Thread Antoine Pitrou
ne-off nominal types for very specific use-cases? — Felipe On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou wrote: Yes, JSON and UUID are obvious candidates for new canonical extension types. XML also comes to mind, but I'm not sure there's much of a use case for it. Regards Antoine. Le 10/04/2024 à

Re: Unsupported/Other Type

2024-04-11 Thread Antoine Pitrou
:06 Antoine Pitrou wrote: Yes, JSON and UUID are obvious candidates for new canonical extension types. XML also comes to mind, but I'm not sure there's much of a use case for it. Regards Antoine. Le 10/04/2024 à 22:55, Wes McKinney a écrit : In the past we have discussed adding a canonical

Re: Unsupported/Other Type

2024-04-11 Thread Antoine Pitrou
Yes, JSON and UUID are obvious candidates for new canonical extension types. XML also comes to mind, but I'm not sure there's much of a use case for it. Regards Antoine. Le 10/04/2024 à 22:55, Wes McKinney a écrit : In the past we have discussed adding a canonical type for UUID and

Re: [RFC] Enabling data frames in disaggregated shared memory

2024-04-10 Thread Antoine Pitrou
Hello John, Arrow IPC files can be backed quite naturally by shared memory, simply by memory-mapping them for reading. So if you have some pieces of shared memory containing Arrow IPC files, and they are reachable using a filesystem mount point, you're pretty much done. You can see an
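As a rough illustration of memory-mapping a file for reading, here is a stdlib-only Python sketch. It only checks the `ARROW1` magic bytes that frame an Arrow IPC file; it does not parse the file. The helper names and the placeholder payload are assumptions made for the example, and the file written here is not a valid IPC file.

```python
import mmap
import os
import tempfile

ARROW_MAGIC = b"ARROW1"


def has_arrow_file_magic(path):
    """Map `path` read-only and check the leading and trailing magic bytes."""
    if os.path.getsize(path) < 2 * len(ARROW_MAGIC):
        return False  # too small (also avoids mapping an empty file)
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            size = len(mapped)
            # Slicing touches only the pages that back these byte ranges.
            head = mapped[:len(ARROW_MAGIC)]
            tail = mapped[size - len(ARROW_MAGIC):size]
            return head == ARROW_MAGIC and tail == ARROW_MAGIC


def write_placeholder_file(payload):
    """Write placeholder bytes framed by the magic; NOT a valid IPC file."""
    fd, path = tempfile.mkstemp(suffix=".arrow")
    with os.fdopen(fd, "wb") as f:
        f.write(ARROW_MAGIC + b"\x00\x00" + payload + ARROW_MAGIC)
    return path
```

The same mapping call works for any path reachable through a filesystem mount point, which is the situation the message describes.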

Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-04-09 Thread Antoine Pitrou
It seems that perhaps this discussion should be rebooted for each individual component, one at a time? Let's start with something simple and obvious, with some frequent contribution activity, such as perhaps Go? Le 09/04/2024 à 14:27, Joris Van den Bossche a écrit : I am also in favor

Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-04-07 Thread Antoine Pitrou
Le 28/03/2024 à 21:42, Jacob Wujciak a écrit : For Arrow C++ bindings like Arrow R and PyArrow having distinct versions would require additional work to both enable the use of different versions and ensure version compatibility is monitored and potentially updated if needed. We could simply

Re: [Format][Union] polymorphic vectors vs ADT style vectors

2024-04-02 Thread Antoine Pitrou
Thanks. The Arrow spec does support multiple union members with the same type, but not all implementations do. The C++ implementation should support it, though to my surprise we do not seem to have any tests for it. If the Java implementation doesn't, then you can probably open an issue

Re: [Format][Union] polymorphic vectors vs ADT style vectors

2024-04-02 Thread Antoine Pitrou
Can you explain what ADT means? Le 02/04/2024 à 11:31, Finn Völkel a écrit : Hi, my question primarily concerns the union layout described at https://arrow.apache.org/docs/format/Columnar.html#union-layout There are two ways to use unions: - polymorphic vectors (world 1) - ADT

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-25 Thread Antoine Pitrou
Regardless of whether they have different compression ratios, it doesn't explain why you would want a different compression *algorithm* altogether. The choice of a compression algorithm should basically be driven by two concerns: the acceptable space/time tradeoff (do you want to minimize

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-23 Thread Antoine Pitrou
Hello Andrei, Le 23/03/2024 à 13:23, Andrei Lazăr a écrit : At this very moment, specifying different compression algorithms per column is supported and in my use case it is extremely helpful, as I have some columns (mostly containing floats), for which a compression algorithm like Snappy

Re: ADBC - OS-level driver manager

2024-03-20 Thread Antoine Pitrou
Also, with ADBC driver implementations currently in flux (none of them has reached the "stable" status in https://arrow.apache.org/adbc/main/driver/status.html), it might be a disservice to users to implicitly fetch drivers from potentially outdated DLLs on the current system. Regards

Re: [ANNOUNCE] New Arrow committer: Bryce Mecum

2024-03-18 Thread Antoine Pitrou
Congratulations Bryce, and keep up the good work! Regards Antoine. Le 18/03/2024 à 03:21, Nic Crane a écrit : On behalf of the Arrow PMC, I'm happy to announce that Bryce Mecum has accepted an invitation to become a committer on Apache Arrow. Welcome, and thank you for your contributions!

Re: [VOTE] Release Apache Arrow 15.0.1 - RC0

2024-03-04 Thread Antoine Pitrou
I didn't run the release script but I'm +1 on this (binding). Regards Antoine. Le 04/03/2024 à 10:05, Raúl Cumplido a écrit : Hi, I would like to propose the following release candidate (RC0) of Apache Arrow version 15.0.1. This is a release consisting of 37 resolved GitHub issues[1].

Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-02-27 Thread Antoine Pitrou
want as many parties in the community as possible to be part of this. Thanks everyone. --Matt On Tue, Feb 27, 2024 at 12:48 PM Antoine Pitrou wrote: Hello, I'd really like to see more engagement and criticism from non-Voltron Data parties before this is formally adopted as an Arrow spec

Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-02-27 Thread Antoine Pitrou
Hello, I'd really like to see more engagement and criticism from non-Voltron Data parties before this is formally adopted as an Arrow spec. Regards Antoine. Le 27/02/2024 à 18:35, Matt Topol a écrit : Hey all, I'd like to propose a vote for us to officially adopt the protocol described

Re: [DISCUSS] Arrow 15.0.1 patch release

2024-02-14 Thread Antoine Pitrou
for today's bi-weekly call. Thanks, Raúl El mar, 13 feb 2024 a las 23:20, Antoine Pitrou escribió: Well, https://github.com/apache/arrow/issues/20379 makes me wonder if anyone is using the Java Dataset bridge seriously. Le 13/02/2024 à 21:10, Dane Pitkin a écrit : Hi all, Arrow Java identified

Re: [DISCUSS] Arrow 15.0.1 patch release

2024-02-13 Thread Antoine Pitrou
Well, https://github.com/apache/arrow/issues/20379 makes me wonder if anyone is using the Java Dataset bridge seriously. Le 13/02/2024 à 21:10, Dane Pitkin a écrit : Hi all, Arrow Java identified an issue[1] in the 15.0.0 release. There is an undefined symbol in the dataset module that

Re: [DISCUSS] Flight RPC: add 'fallback' URI scheme

2024-02-13 Thread Antoine Pitrou
ed semantics? If so, is there a way to include the original service in the list of locations without the implied precedence? Thanks, Joel On Mon, Feb 12, 2024 at 11:52 James Duong wrote: This seems like a good idea, and also improves consistency with clients that erroneously assumed that th

Re: [ANNOUNCE] Apache Arrow nanoarrow 0.4.0 Released

2024-02-12 Thread Antoine Pitrou
Hi Dewey, Le 12/02/2024 à 15:01, Dewey Dunnington a écrit : Apache Arrow nanoarrow is a small C library for building and interpreting Arrow C Data interface structures with bindings for users of the R programming language. Do you want to reconsider this sentence? It seems nanoarrow is

Re: [DISCUSS] Flight RPC: add 'fallback' URI scheme

2024-02-12 Thread Antoine Pitrou
Hello, This looks fine to me. Regards Antoine. Le 12/02/2024 à 14:46, David Li a écrit : Hello, I'd like to propose a slight update to Flight RPC to make Flight SQL work better in different deployment scenarios. Comments on the doc would be appreciated:

Re: [DISCUSS] Proposal to expand Arrow Communications

2024-02-07 Thread Antoine Pitrou
I think we should find a proper descriptive name for the "high-performance protocol", because "high-performance" is vague and context-dependent, and also spreads unnecessary confusion about existing alternatives such as regular Arrow IPC. I would for example propose "Dissociated Arrow IPC"

Re: [DISCUSS] Status and future of @ApacheArrow Twitter account

2024-01-27 Thread Antoine Pitrou
My 2 cents : I don't understand what an open source project gains by publishing on a microblogging platform. As for Twitter specifically, its recent governance changes would be good reason for terminating the @ApacheArrow account, IMHO. Regards Antoine. Le 27/01/2024 à 23:06, Bryce

Re: [IPC] Delta Dictionary Flag Clarification for Multi-Batch IPC

2024-01-25 Thread Antoine Pitrou
Hello, My own answers: 1) isDelta should be true only when a delta is being transmitted (to be appended to the existing dictionary with the same id); it should be false when a full dictionary is being transmitted (to replace the existing dictionary with the same id, if any) 2) yes, it
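The rule in answer 1) can be sketched as a small state update on the reader side. This is an illustrative model only: `apply_dictionary_batch` and the plain-dict registry are hypothetical, and raising an error for a delta that targets an unknown id is an assumption of this sketch rather than something stated in the message.

```python
def apply_dictionary_batch(dictionaries, dict_id, values, is_delta):
    """Update the reader-side dictionary registry for one dictionary batch.

    dictionaries: dict mapping dictionary id -> list of values (hypothetical model)
    is_delta:     True  -> append `values` to the existing dictionary
                  False -> replace the existing dictionary, if any
    """
    if is_delta:
        if dict_id not in dictionaries:
            # Assumption of this sketch: a delta needs something to extend.
            raise KeyError(f"delta for unknown dictionary id {dict_id}")
        dictionaries[dict_id] = dictionaries[dict_id] + list(values)
    else:
        dictionaries[dict_id] = list(values)
    return dictionaries[dict_id]
```

For example, sending `["a", "b"]` with `is_delta=False`, then `["c"]` with `is_delta=True`, leaves the reader with `["a", "b", "c"]`; a later full dictionary with the same id discards all three.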

Re: [DataFusion] New Blog Post -- DataFusion 34.0

2024-01-23 Thread Antoine Pitrou
Impressive, thank you! Le 23/01/2024 à 14:06, Andrew Lamb a écrit : If anyone is interested, here is a new blog post about the last 6 months in DataFusion[1] and where we are heading this year. Andrew [1]: https://arrow.apache.org/blog/2024/01/19/datafusion-34.0.0/

Re: [DISC] Improve Arrow Release verification process

2024-01-19 Thread Antoine Pitrou
Well, if the main objective is to just follow the ASF Release guidelines, then our verification process can be simplified drastically. The ASF indeed just requires: """ Every ASF release MUST contain one or more source packages, which MUST be sufficient for a user to build and test the

Re: [VOTE] Release Apache Arrow 15.0.0 - RC1

2024-01-17 Thread Antoine Pitrou
Go verification fails on Ubuntu 22.04: ``` # google.golang.org/grpc ../../gopath/pkg/mod/google.golang.org/grpc@v1.58.3/server.go:2096:14: undefined: atomic.Int64 note: module requires Go 1.19 # github.com/apache/arrow/go/v15/arrow/avro arrow/avro/reader_types.go:594:16: undefined:

Re: [DISCUSS] Semantics of extension types

2023-12-13 Thread Antoine Pitrou
Hi, For now, I would suggest that each implementation decides on their own strategy, because we don't have a clear idea of which is better (and extension types are probably not getting a lot of use yet). Regards Antoine. Le 13/12/2023 à 17:39, Benjamin Kietzman a écrit : The main

Re: Java, dictionary ids and schema equality

2023-12-09 Thread Antoine Pitrou
Hi Curt, Yes, it's a problem in the Java implementation of these tests. Ideally this should be fixed, but doing so would require some amount of scaffolding. Regards Antoine. Le 09/12/2023 à 21:47, Curt Hagenlocher a écrit : I've (mostly) fixed the C# implementation of dictionary IPC but

Re: [VOTE] Flight SQL as experimental

2023-12-08 Thread Antoine Pitrou
+1 (binding) Le 08/12/2023 à 20:42, David Li a écrit : Let's start a formal vote just so we're on the same page now that we've discussed a few things. I would like to propose we remove 'experimental' from Flight SQL and make it stable: - Remove the 'experimental' option from the Protobuf

Re: [DISCUSS] Conventions for transporting Arrow data over HTTP

2023-12-06 Thread Antoine Pitrou
Hi, While this looks like a nice start, I would expect more precise recommendations for writing non-trivial services. Especially, one question is how to send both an application-specific POST request and an Arrow stream, or an application-specific GET response and an Arrow stream. This

Re: [Discussion][Gandiva] Migration JIT engine from MCJIT to ORC v2

2023-12-06 Thread Antoine Pitrou
Given that MCJIT is deprecated and there doesn't seem to be a downside to the new APIs, migrating to ORC v2 sounds fine to me. Just a question: does it raise the minimum supported LLVM version? Regards Antoine. Le 05/12/2023 à 03:35, Yue Ni a écrit : Hi there, I'd like to initiate a

Re: CIDR 2024

2023-12-06 Thread Antoine Pitrou
For the sake of clarity, it seems this is talking about the Conference on Innovative Data Systems Research: https://www.cidrdb.org/cidr2024/ Regards Antoine. Le 06/12/2023 à 01:15, Wes McKinney a écrit : I will also be there. On Mon, Dec 4, 2023 at 12:58 PM Tony Wang wrote: I am Get

Re: Documentation of Breaking Changes

2023-11-21 Thread Antoine Pitrou
Hello, Le 21/11/2023 à 22:59, Chris Thomas a écrit : I apologize if this is not the appropriate venue for this request; if that's the case, please let me know where I should be asking: Earlier this month Dependabot flagged a security vulnerability with PyArrow which prompted us to do an

Re: [DISCUSS] Protocol for exchanging Arrow data over REST APIs

2023-11-20 Thread Antoine Pitrou
I also agree that an informal spec "how to efficiently transfer Arrow data over HTTP" makes sense. Probably with several aspects: - one-shot GET data - streaming GET - one-shot PUT or POST - streaming POST - non-Arrow prologue and epilogue (for example JSON-based metadata) - conventions for

Re: [ANNOUNCE] New Arrow PMC member: Raúl Cumplido

2023-11-13 Thread Antoine Pitrou
Welcome Raul, we're glad to have you! Regards Antoine. Le 13/11/2023 à 20:27, Andrew Lamb a écrit : The Project Management Committee (PMC) for Apache Arrow has invited Raúl Cumplido to become a PMC member and we are pleased to announce that Raúl Cumplido has accepted. Please join me in

Re: decimal64

2023-11-09 Thread Antoine Pitrou
/CanonicalExtensions.html On Thu, Nov 9, 2023, at 11:56, Antoine Pitrou wrote: Or they could trivially use an int64 column for that, since the scale is fixed anyway, and you're probably not going to multiply money values together. Le 09/11/2023 à 17:54, Curt Hagenlocher a écrit : If Arrow had a decimal64 type

Re: decimal64

2023-11-09 Thread Antoine Pitrou
, at 11:56, Antoine Pitrou wrote: Or they could trivially use an int64 column for that, since the scale is fixed anyway, and you're probably not going to multiply money values together. Le 09/11/2023 à 17:54, Curt Hagenlocher a écrit : If Arrow had a decimal64 type, someone could choose to use
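The int64 suggestion quoted above amounts to storing each amount as an integer count of the smallest unit. A minimal sketch, assuming a fixed scale of 4 fractional digits (the `SCALE` constant and the helper names are hypothetical):

```python
from decimal import Decimal

SCALE = 4  # assumed fixed number of fractional digits
INT64_MIN, INT64_MAX = -(2 ** 63), 2 ** 63 - 1


def to_scaled_int(amount):
    """Convert a decimal string to an integer number of 10**-SCALE units."""
    scaled = Decimal(amount) * (10 ** SCALE)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} has more than {SCALE} fractional digits")
    value = int(scaled)
    if not INT64_MIN <= value <= INT64_MAX:
        raise OverflowError(f"{amount} does not fit in an int64 at scale {SCALE}")
    return value


def from_scaled_int(value):
    """Convert the stored integer back to a Decimal."""
    return Decimal(value) / (10 ** SCALE)
```

Addition and subtraction of the stored integers are exact because every value shares the same scale. Multiplying two stored values would give a result at scale 2 * SCALE, which is the case the message sets aside.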

Re: decimal64

2023-11-09 Thread Antoine Pitrou
column knowing that there are edge cases where they may get an undesired result. On Thu, Nov 9, 2023 at 8:42 AM Antoine Pitrou wrote: Le 09/11/2023 à 17:23, Curt Hagenlocher a écrit : Or more succinctly, "111,111,111,111,111." will fit into a decimal64; would you prevent it

Re: decimal64

2023-11-09 Thread Antoine Pitrou
Le 09/11/2023 à 17:23, Curt Hagenlocher a écrit : Or more succinctly, "111,111,111,111,111." will fit into a decimal64; would you prevent it from being stored in one so that you can describe the column as "decimal(18, 4)"? That's what we do for other decimal types, see PyArrow below: ```
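The arithmetic behind the question can be checked with a stdlib-only sketch. `fits_decimal` is a hypothetical helper, not a PyArrow function; it tests whether a value can be represented in decimal(precision, scale) without rounding, which is the kind of validation the reply refers to.

```python
from decimal import Decimal


def fits_decimal(value, precision, scale):
    """True if `value` is exactly representable as decimal(precision, scale)."""
    # The default Decimal context keeps 28 significant digits, which is
    # enough for the 18-digit case illustrated here.
    unscaled = Decimal(value) * (10 ** scale)
    if unscaled != unscaled.to_integral_value():
        return False  # more fractional digits than `scale` allows
    return abs(int(unscaled)) < 10 ** precision


# 111,111,111,111,111 has 15 integer digits; at scale 4 the unscaled value
# needs 19 digits, so it exceeds precision 18 even though the unscaled
# integer itself is still smaller than 2**63.
fifteen_ones = "111111111111111"
```

So the unscaled value fits in 64 bits, but declaring the column as decimal(18, 4) leaves only 14 digits for the integer part.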

Re: [VOTE][FORMAT] Bulk ingestion support for Flight SQL

2023-11-09 Thread Antoine Pitrou
For the record, the correct PR link seems to be https://github.com/apache/arrow/pull/38385 Le 08/11/2023 à 21:49, David Li a écrit : Hello, Joel Lubi has proposed adding bulk ingestion support to Arrow Flight SQL [1]. This provides a path for uploading an Arrow dataset to a Flight SQL

CVE-2023-47248: PyArrow: Arbitrary code execution when loading a malicious data file

2023-11-08 Thread Antoine Pitrou
Severity: critical Affected versions: - PyArrow 0.14.0 through 14.0.0 Description: Deserialization of untrusted data in IPC and Parquet readers in PyArrow versions 0.14.0 to 14.0.0 allows arbitrary code execution. An application is vulnerable if it reads Arrow

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit : Is this buffer lengths buffer only present if the array type is Utf8View? IIUC, the proposal would add the buffer lengths buffer for all types if the schema's flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid the

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
Le 26/10/2023 à 18:59, Dewey Dunnington a écrit : That sounds a bit hackish to me. Including only *some* buffer sizes in array->buffers[array->n_buffers] special-cased for only two types (or altering the number of buffers required by the IPC format vs. the number of buffers required by the

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
Le 26/10/2023 à 17:45, Dewey Dunnington a écrit : The lack of buffer sizes is something that has come up for me a few times working with nanoarrow (which dedicates a significant amount of code to calculating buffer sizes, which it uses to do validation and more efficient copying). By the

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou
Le 26/10/2023 à 17:45, Dewey Dunnington a écrit : A potential alternative might be to allow any ArrowArray to declare its buffer sizes in array->buffers[array->n_buffers], perhaps with a new flag in schema->flags to advertise that capability. That sounds a bit hackish to me. I'd rather

Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-25 Thread Antoine Pitrou
Hello, We might want to keep the variadic buffers at the end and instead export the buffer sizes as buffer #2? Though that's mostly stylistic... Regards Antoine. Le 25/10/2023 à 18:36, Benjamin Kietzman a écrit : Hello all, The C ABI does not store buffer lengths explicitly, which

Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-23 Thread Antoine Pitrou
Welcome Xuwei! Le 23/10/2023 à 05:28, Sutou Kouhei a écrit : On behalf of the Arrow PMC, I'm happy to announce that Xuwei Fu has accepted an invitation to become a committer on Apache Arrow. Welcome, and thank you for your contributions!

Re: [Format] C Data Interface integration testing

2023-10-19 Thread Antoine Pitrou
active the community is being, I'm reasonably confident that they'll come to it soon :) Regards Antoine. Le 26/09/2023 à 14:46, Antoine Pitrou a écrit : Hello, We have added some infrastructure for integration testing of the C Data Interface between Arrow implementations. We are now testing

Re: Apache Arrow file format

2023-10-18 Thread Antoine Pitrou
The fact that they describe Arrow and Feather as distinct formats (they're not!) with different characteristics is a bit of a bummer. Le 18/10/2023 à 22:20, Andrew Lamb a écrit : If you are looking for a more formal discussion and empirical analysis of the differences, I suggest reading "A

Re: [VOTE][Format] C data interface format strings for Utf8View and BinaryView

2023-10-18 Thread Antoine Pitrou
+1 Le 18/10/2023 à 19:02, Benjamin Kietzman a écrit : Hello all, I propose "vu" and "vz" as format strings for the Utf8View and BinaryView types in the Arrow C data interface [1]. The vote will be open for at least 72 hours. [ ] +1 - I'm in favor of these new C data format strings [ ] +0 [ ]

Re: [ANNOUNCE] New Arrow PMC member: Jonathan Keane

2023-10-14 Thread Antoine Pitrou
Welcome to the PMC, Jon! Le 14/10/2023 à 19:42, David Li a écrit : Congrats Jon! On Sat, Oct 14, 2023, at 13:25, Ian Cook wrote: Congratulations Jonathan! On Sat, Oct 14, 2023 at 13:24 Andrew Lamb wrote: The Project Management Committee (PMC) for Apache Arrow has invited Jonathan Keane

Re: [DISCUSS][Swift] repo for swift similar to arrow-rs

2023-10-11 Thread Antoine Pitrou
 PM Antoine Pitrou wrote: Hi Alva, I'll let others give their opinions on the repo. Regards Antoine. Le 10/10/2023 à 19:25, Alva Bandy a écrit : Hi Antoine, Thanks for the reply. It would be great to get the Swift implementation added to the integration test. I have a task for adding

Re: [DISCUSS][Swift] repo for swift similar to arrow-rs

2023-10-10 Thread Antoine Pitrou
not looked into Julia’s implementation. Thank you, Alva Bandy On 2023/10/10 08:54:30 Antoine Pitrou wrote: Hello Alva, This is a reasonable request, but it might come with its own drawbacks as well. One significant drawback is that adding the Swift implementation to the cross-implementation integration

Re: [DISCUSS][Swift] repo for swift similar to arrow-rs

2023-10-10 Thread Antoine Pitrou
Hello Alva, This is a reasonable request, but it might come with its own drawbacks as well. One significant drawback is that adding the Swift implementation to the cross-implementation integration tests will be slightly more complicated. It is very important that all Arrow implementations

Re: [Vote][Format] (new proposal) C data interface format string for ListView and LargeListView arrays

2023-10-07 Thread Antoine Pitrou
+1 from me. But I also reiterate my plea that these existing parsers get fixed so as to entirely validate the format string instead of stopping early. Regards Antoine. Le 06/10/2023 à 23:26, Felipe Oliveira Carvalho a écrit : Hello, I'm writing to propose "+vl" and "+vL" as format
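The difference between a parser that stops early and one that validates the whole format string can be shown with a toy table. The sketch below covers only a handful of C data interface format strings (the list, list-view, string and binary variants); the type labels and both function names are hypothetical, and no real parser is claimed to be written this way.

```python
# A small subset of C data interface format strings; labels are illustrative.
KNOWN_FORMATS = {
    "u": "utf8",
    "U": "large_utf8",
    "z": "binary",
    "Z": "large_binary",
    "vu": "utf8_view",
    "vz": "binary_view",
    "+l": "list",
    "+L": "large_list",
    "+vl": "list_view",
    "+vL": "large_list_view",
}


def parse_lenient(fmt):
    """Return as soon as a known prefix matches, ignoring what follows."""
    for end in range(1, len(fmt) + 1):
        name = KNOWN_FORMATS.get(fmt[:end])
        if name is not None:
            return name  # trailing characters are silently accepted
    raise ValueError(f"unknown format string: {fmt!r}")


def parse_strict(fmt):
    """Accept the format string only if all of it is recognised."""
    try:
        return KNOWN_FORMATS[fmt]
    except KeyError:
        raise ValueError(f"unknown format string: {fmt!r}") from None
```

With this table, `parse_lenient("+lx")` reports a list type while `parse_strict("+lx")` raises, which is why early-stopping parsers can misread format strings they were never designed for.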

Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-06 Thread Antoine Pitrou
); +} else { + type_ = list_view(field); +} + } else { +return f_parser_.Invalid(); + } +} + return Status::OK(); } -- Felipe On Thu, Oct 5, 2023 at 5:26 PM Antoine Pitrou wrote: I don't think the parsing will be a problem even in C. It's not like

Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-05 Thread Antoine Pitrou
I don't think the parsing will be a problem even in C. It's not like you have to backtrack anyway. +1 from me on Felipe's proposal. Regards Antoine. Le 05/10/2023 à 20:33, Felipe Oliveira Carvalho a écrit : This mailing list thread is going to be the discussion. The union types also

Re: [VOTE] [Format] Add app_metadata to FlightInfo and FlightEndpoint

2023-10-03 Thread Antoine Pitrou
+1 from me. It might be worth spelling out whether any relationship is expected between the `app_metadata` for a FlightInfo and any of the corresponding `FlightEndpoint`s and `FlightData` chunks. Le 12/09/2023 à 17:48, Matt Topol a écrit : Hey all, I would like to propose adding a new

Re: [DISCUSS][C++] Raw pointer string views

2023-10-03 Thread Antoine Pitrou
Le 03/10/2023 à 01:36, Matt Topol a écrit : The cost of conversion is actually significantly higher than the actual overhead of simply accessing the values in either representation, leading to a high potential for bottleneck. For systems like Velox and DuckDB where it's important to be able

Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Antoine Pitrou
approach be willing to meet us in the middle and switch to an offset based encoding? This to me feels like it would be the best outcome for the ecosystem as a whole. Kind Regards, Raphael On 02/10/2023 13:50, Antoine Pitrou wrote: Le 01/10/2023 à 16:21, Micah Kornfield a écrit : I would also

Re: [VOTE][Format] Add ListView and LargeListView Arrays to Arrow Format

2023-10-02 Thread Antoine Pitrou
Hello, +1 and thanks for working on this! There'll probably be some minor comments to the format PR, but those don't deter from accepting these new layouts into the standard. Regards Antoine. Le 29/09/2023 à 14:09, Felipe Oliveira Carvalho a écrit : Hello, I'd like to propose adding

Re: [DISCUSS][C++] Raw pointer string views

2023-10-02 Thread Antoine Pitrou
Le 01/10/2023 à 16:21, Micah Kornfield a écrit : I would also assert that another way to reduce this risk is to add some prose to the relevant sections of the columnar format specification doc to clearly explain that a raw pointers variant of the layout, while not part of the official spec,

Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Antoine Pitrou
be clearly flagged as being non-Arrow compliant. It could be by naming (e.g. `arrow::non_arrow_string_view()`) or by specific namespacing (e.g. `non_arrow::raw_pointers_string_view()`). But they could also be provided by a distinct library. Regards Antoine. Le 28/09/2023 à 09:01, Antoine

Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Antoine Pitrou
Hi Ben, Le 27/09/2023 à 23:25, Benjamin Kietzman a écrit : @Antoine What this PR is creating is an "unofficial" Arrow format, with data types exposed in Arrow C++ that are not part of the Arrow standard, but are exposed as if they were. We already do this in every implementation of the

Re: [DISCUSS][C++] Raw pointer string views

2023-09-27 Thread Antoine Pitrou
Hello, What this PR is creating is an "unofficial" Arrow format, with data types exposed in Arrow C++ that are not part of the Arrow standard, but are exposed as if they were. Most users will probably not read the official format spec, but will simply trust the official Arrow

[Format] C Data Interface integration testing

2023-09-26 Thread Antoine Pitrou
Hello, We have added some infrastructure for integration testing of the C Data Interface between Arrow implementations. We are now testing the C++ and Go implementations, but the goal in the future is for all major implementations to be tested there (perhaps including nanoarrow). - PR to

Re: [DISCUSS][Gandiva] External function registry proposal

2023-09-25 Thread Antoine Pitrou
Hi Yue, Le 25/09/2023 à 18:15, Yue Ni a écrit : a CMake entrypoint (for example a function) making it easy for third-party projects to compile their own functions I can come up with a minimum CMake template so that users can compile C++ based functions, and I think if the integration

Re: [DISCUSS][Gandiva] External function registry proposal

2023-09-25 Thread Antoine Pitrou
Hello, Making Gandiva more extensible sounds like a worthwhile improvement. However, I'm not sure why we would need to choose a JSON-based format for this. Instead, I think Gandiva could simply provide the following two building blocks: 1. a CMake entrypoint (for example a function)

Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

2023-09-13 Thread Antoine Pitrou
Le 13/09/2023 à 02:37, Rok Mihevc a écrit : * **ragged_dimensions** = indices of ragged dimensions whose sizes may differ. Dimensions where all elements have the same size are called uniform dimensions. Indices are a subset of all possible dimension indices ([0, 1, ..,
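The distinction between ragged and uniform dimensions can be made concrete with a short sketch. In the proposal `ragged_dimensions` is declared as a type parameter; the hypothetical helper below instead derives the indices from a set of example shapes, purely to illustrate the definition.

```python
def ragged_dimensions(shapes):
    """Return the indices of dimensions whose size differs between tensors."""
    if not shapes:
        return []
    ndim = len(shapes[0])
    if any(len(shape) != ndim for shape in shapes):
        raise ValueError("all tensors must have the same number of dimensions")
    # A dimension is uniform when every tensor has the same size along it.
    return [i for i in range(ndim) if len({shape[i] for shape in shapes}) > 1]
```

For shapes `(2, 3, 4)` and `(2, 5, 4)`, only dimension 1 differs, so dimensions 0 and 2 are uniform.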

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Antoine Pitrou
Hi Li, Le 06/09/2023 à 17:55, Li Jin a écrit : Hello, I have been testing "What is the max rss needed to scan through ~100G of data in a parquet stored in gcs using Arrow C++". The current answer is about ~6G of memory which seems a bit high so I looked into it. What I observed during the

Re: Standardizing a Python PEP-249 extension to retrieve Arrow data

2023-09-05 Thread Antoine Pitrou
Hello Jonas, What is the standardization model you are after? PEP 249 is marked final and therefore won't be updated (except for minutiae such as typos, markup, etc.). Are you planning to submit a new PEP for this extension? If so, I would suggest starting a discussion on

Re: [VOTE][Format] Add Utf8View Arrays to Arrow Format

2023-08-24 Thread Antoine Pitrou
+1 on the format additions The implementations will probably need a bit more review back-and-forth. Regards Antoine. On 28/06/2023 at 21:34, Benjamin Kietzman wrote: Hello, I'd like to propose adding Utf8View arrays to the arrow format. Previous discussion in [1], columnar format

[Discuss][C++] A framework for contextual/implicit/ambient vars

2023-08-24 Thread Antoine Pitrou
Hello, Arrow C++ comes with execution facilities (such as thread pools, async generators...) meant to unlock higher performance by hiding IO latencies and exploiting several CPU cores. These execution facilities also obscure the context in which a task is executed: you cannot simply use

Re: [Discuss] Do we need a release verification script?

2023-08-22 Thread Antoine Pitrou
latbuffers output from the release package only. “Caches”, multi stage compilation etc should be ok. Best regards, Adam Lippai On Tue, Aug 22, 2023 at 10:40 Antoine Pitrou wrote: If the main impetus for the verification script is to comply with ASF requirements, probably the script can be

Re: [Discuss] Do we need a release verification script?

2023-08-22 Thread Antoine Pitrou
latbuffers output from the release package only. “Caches”, multi stage compilation etc should be ok. Best regards, Adam Lippai On Tue, Aug 22, 2023 at 10:40 Antoine Pitrou wrote: If the main impetus for the verification script is to comply with ASF requirements, probably the script can be made mu

Re: [Discuss] Do we need a release verification script?

2023-08-22 Thread Antoine Pitrou
cripts don't need much maintenance so we just continue the ceremony. However, I certainly don't think we would lose much/any test coverage if we stopped their use. Andrew On Tue, Aug 22, 2023 at 4:54 AM Antoine Pitrou wrote: Hello, Abiding by the Apache Software Foundation's guidelines,

Re: [VOTE] Release Apache Arrow 13.0.0 - RC3

2023-08-22 Thread Antoine Pitrou
+1 from me (binding). The verification script failed for me, but I consider it not a problem (see separate discussion thread). Regards Antoine. On 18/08/2023 at 10:00, Raúl Cumplido wrote: Hi, I would like to propose the following release candidate (RC3) of Apache Arrow version

[Discuss] Do we need a release verification script?

2023-08-22 Thread Antoine Pitrou
Hello, Abiding by the Apache Software Foundation's guidelines, every Arrow release is voted on and requires at least 3 "binding" votes to be approved. Also, every Arrow release vote is accompanied by a little ceremony where contributors and core developers run a release verification

Re: [VOTE] Release Apache Arrow 13.0.0 - RC3

2023-08-22 Thread Antoine Pitrou
Hello, It seems the verification instructions are not up to date? https://cwiki.apache.org/confluence/display/ARROW/How+to+Verify+Release+Candidates I've tried to run the suggested command: $ dev/release/verify-release-candidate.sh source 13.0.0 3 and I get the following error message: """

Re: Sort a Table In C++?

2023-08-17 Thread Antoine Pitrou
Or you can simply call the "sort_indices" compute function: https://arrow.apache.org/docs/cpp/compute.html#sorts-and-partitions On 17/08/2023 at 23:20, Ian Cook wrote: Li, Here's a standalone C++ example that constructs a Table and executes an Acero ExecPlan to sort it:

Re: [VOTE] Apache Arrow ADBC (API) 1.1.0

2023-08-16 Thread Antoine Pitrou
.) This vote will be open for at least 72 hours. [ ] +1 Adopt the ADBC 1.1.0 specification [ ] 0 [ ] -1 Do not adopt the specification because... Thanks to Sutou Kouhei, Matt Topol, Dewey Dunnington, Antoine Pitrou, Will Ayd, and Will Jones for feedback on the design and various work-in-progress PRs. [1

Re: [DISCUSS][Arrow] Extension metadata encoding design

2023-08-16 Thread Antoine Pitrou
ion of metadata to a string, different encoder-implementations might still produce non-comparable strings, resulting in falsely reported datatype mismatches, but at least avoiding the case of false positives. On Wed, Aug 16, 2023 at 5:19 PM Antoine Pitrou wrote: Hi Jeremy, A single key ma

Re: [DISCUSS][Arrow] Extension metadata encoding design

2023-08-16 Thread Antoine Pitrou
Hi Jeremy, A single key makes it easier for generic code to recreate extension types it does not know about. Here is an example in the C++ IPC layer: https://github.com/apache/arrow/blob/641201416c1075edfd05d78b539275065daac31d/cpp/src/arrow/ipc/metadata_internal.cc#L823-L845 Here is

Re: [Vote][Format] C Data Interface Format string for REE

2023-08-16 Thread Antoine Pitrou
+1 from me (binding). It would be nice to get approval from authors of other implementations such as Rust, C#, JavaScript... Thanks for doing this! On 16/08/2023 at 16:16, Matt Topol wrote: Hey All, As proposed by Felipe [1] I'm starting a vote on the proposed update to the Format

Re: [Format] C data interface format string for run-end encoded arrays

2023-08-15 Thread Antoine Pitrou
I think we should. Regards Antoine. On 15/08/2023 at 19:58, Matt Topol wrote: I'm in favor of this as the C Data format string. Though since this is technically a format/spec change do others think we should take a vote on this? --Matt On Tue, Aug 15, 2023, 12:19 PM Felipe Oliveira
