Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Felipe Oliveira Carvalho
I've been thinking about how to encode statistics on Arrow arrays and how to keep the set of statistics known by both producers and consumers (i.e. standardized). The statistics array(s) could be a map< // the column index or null if the statistics refer to whole table or batch column:

Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-05-29 Thread Felipe Oliveira Carvalho
+1 (non-binding) On Wed, 29 May 2024 at 11:30 Micah Kornfield wrote: > +1 (non-binding for Parquet, Binding for Arrow if that makes a difference) > > > > On Wed, May 29, 2024 at 7:15 AM Rok Mihevc wrote: > > > # sending this to both dev@arrow and dev@parquet > > > > Hi all, > > > > Following

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Felipe Oliveira Carvalho
I want to +1 what Dewey is saying here and add some comments. Sutou Kouhei wrote: > ADBC may be a bit larger to use only for transmitting statistics. ADBC has > statistics related APIs but it has more other APIs. It's impossible to keep the responsibility of communication protocols cleanly

Re: [ANNOUNCE] New Arrow committer: Dane Pitkin

2024-05-07 Thread Felipe Oliveira Carvalho
Great news. Congratulations Dane! On Tue, May 7, 2024 at 7:57 PM Vibhatha Abeykoon wrote: > > Congratulations Dane!!! > > Vibhatha Abeykoon > > > On Wed, May 8, 2024 at 4:02 AM Jacob Wujciak wrote: > > > Congrats! > > > > Am Di., 7. Mai 2024 um 23:19 Uhr schrieb Bryce Mecum > >: > > > > >

Re: [VOTE][Format] UUID canonical extension type

2024-04-29 Thread Felipe Oliveira Carvalho
Isn't that easily decodable from the UUID data itself? If you allow the version to be specified as metadata, you now have to validate and make sure it's consistent with the version encoded in the contents of the UUID column. And UUID versions are more of a concern for UUID generation than
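
To illustrate the point about the version being decodable from the UUID data itself, here is a minimal Python sketch. It assumes the column stores each UUID as 16 bytes in the standard big-endian RFC 4122 layout; the helper name `uuid_version` and the sample value are made up for this example and are not part of any Arrow API.

```python
import uuid


def uuid_version(raw: bytes) -> int:
    """Return the version nibble of a 16-byte RFC 4122 UUID (hypothetical helper)."""
    if len(raw) != 16:
        raise ValueError("a UUID is exactly 16 bytes")
    # In the big-endian layout the version is the high 4 bits of byte 6.
    return raw[6] >> 4


# Sample value chosen for illustration only (a version 4, RFC 4122 variant UUID).
sample = uuid.UUID("123e4567-e89b-42d3-a456-426614174000")
decoded = uuid_version(sample.bytes)
```

Since the version can be read straight from the bytes like this, a version recorded separately in metadata would be a second copy that has to be validated against the data.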

Re: Unsupported/Other Type

2024-04-11 Thread Felipe Oliveira Carvalho
The OP used UUID as an example. Would that be enough, or is the request for a flexible mechanism that allows the creation of one-off nominal types for very specific use cases? -- Felipe On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou wrote: > > Yes, JSON and UUID are obvious candidates for new

Re: [Format][Union] polymorphic vectors vs ADT style vectors

2024-04-02 Thread Felipe Oliveira Carvalho
Algebraic Data Types (Sums and Products) are very abstract. This means they don't fully specify a concrete/physical layout [1]: different physical layouts can match the same algebraic definition. As an in-memory data format specification, Arrow doesn't and shouldn't rigidly specify concretization

Re: [DISCUSS] Looking for feedback on my Rust library

2024-03-14 Thread Felipe Oliveira Carvalho
Two comments: —— Since this library is analogous to things like ADBC, ODBC, and JDBC, it’s more of a “driver” than a “connector”. This might make your life easier when explaining what it does. It’s not a black and white thing, but “connector” might imply networking to some people. I believe

Re: [DISCUSS] Status and future of @ApacheArrow Twitter account

2024-01-29 Thread Felipe Oliveira Carvalho
> I have found Twitter an extremely effective way for an open-source project to communicate with the “exo-community” — people who are interested in the project but not so invested that they join the email list. An open source project needs to perform pretty much all of the functions of a

Re: [ANNOUNCE] New Arrow committer: Felipe Oliveira Carvalho

2023-12-08 Thread Felipe Oliveira Carvalho
> wrote: > > > > > > > > > Congratulations, Felipe! > > > > > ________ > > > > > From: Daniël Heres > > > > > Sent: Thursday, December 7, 2023 2:59 PM > > > > > To: dev@arrow.apache.org >

Re: [ANNOUNCE] New Arrow PMC member: Raúl Cumplido

2023-11-13 Thread Felipe Oliveira Carvalho
Congratulations! Well deserved. On Mon, Nov 13, 2023 at 5:16 PM Neal Richardson wrote: > Congratulations! > > On Mon, Nov 13, 2023 at 3:10 PM Matt Topol wrote: > > > Congratulations Raul!! > > > > On Mon, Nov 13, 2023, 3:09 PM Antoine Pitrou wrote: > > > > > > > > Welcome Raul, we're glad to

Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-23 Thread Felipe Oliveira Carvalho
Congratulations Xuwei! — Felipe On Mon, 23 Oct 2023 at 10:26 Vibhatha Abeykoon wrote: > Congratulations Xuwei! > > On Mon, Oct 23, 2023 at 6:38 PM Weston Pace wrote: > > > Congratulations Xuwei! > > > > On Mon, Oct 23, 2023 at 3:38 AM wish maple > wrote: > > > > > Thanks kou and every nice

Re: [VOTE][Format] C data interface format strings for Utf8View and BinaryView

2023-10-18 Thread Felipe Oliveira Carvalho
+1 On Wed, Oct 18, 2023 at 2:49 PM Dewey Dunnington wrote: > +1! > > On Wed, Oct 18, 2023 at 2:14 PM Matt Topol wrote: > > > > +1 > > > > On Wed, Oct 18, 2023 at 1:05 PM Antoine Pitrou > wrote: > > > > > +1 > > > > > > Le 18/10/2023 à 19:02, Benjamin Kietzman a écrit : > > > > Hello all, > >

Re: Apache Arrow file format

2023-10-17 Thread Felipe Oliveira Carvalho
It’s not the best since the format is really focused on in-memory representation and direct computation, but you can do it: https://arrow.apache.org/docs/python/feather.html -- Felipe On Tue, 17 Oct 2023 at 23:26 Nara wrote: > Hi, > > Is it a good idea to use Apache Arrow as a file format?

Re: Language-specific discussion (with C# example)

2023-10-17 Thread Felipe Oliveira Carvalho
The Zulip is https://ursalabs.zulipchat.com/ On Tue, Oct 17, 2023 at 9:55 PM Will Jones wrote: > Hi Curt, > > I think the most visible place for now would be creating an issue for > discussion. > > In the future, if you and some others want to have a place to discuss C# > development, you

Re: [Vote][Format] (new proposal) C data interface format string for ListView and LargeListView arrays

2023-10-11 Thread Felipe Oliveira Carvalho
> > > > > But I also reiterate my plea that these existing parsers get fixed so > as > > > to entirely validate the format string instead of stopping early. > > > > > > Regards > > > > > > Antoine. > > > > > > > >

[Vote][Format] (new proposal) C data interface format string for ListView and LargeListView arrays

2023-10-06 Thread Felipe Oliveira Carvalho
Hello, I'm writing to propose "+vl" and "+vL" as format strings for list-view and large list-view arrays passing through the Arrow C data interface [1]. The previous proposal was considered a bad idea because existing parsers of these format strings might be looking at only the first `l` (or
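
The concern about parsers that stop early can be sketched in Python. Both functions below are hypothetical and not taken from any Arrow implementation. `naive_kind` stands in for a parser that only inspects the character right after `+`, and `strict_kind` for one that validates the whole string. `+l` and `+L` are the existing list and large-list format strings; `+vl` and `+vL` are the ones proposed in this thread.

```python
def naive_kind(fmt: str) -> str:
    """Hypothetical parser that looks only at the character after '+'."""
    if fmt.startswith("+l"):
        return "list"
    if fmt.startswith("+L"):
        return "large_list"
    return "unknown"


def strict_kind(fmt: str) -> str:
    """Hypothetical parser that matches the entire format string."""
    known = {
        "+l": "list",
        "+L": "large_list",
        "+vl": "list_view",
        "+vL": "large_list_view",
    }
    return known.get(fmt, "unknown")


# The earlier "+lv" proposal is silently misread as a plain list by the naive
# parser, while "+vl" is rejected as unknown instead of being misinterpreted.
misread = naive_kind("+lv")
rejected = naive_kind("+vl")
```

With `+vl`, an old parser fails loudly rather than treating list-view data as a regular list.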

Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-06 Thread Felipe Oliveira Carvalho
1 for +vl and > +vL. > > On Thu, Oct 5, 2023 at 6:40 PM Felipe Oliveira Carvalho > wrote: > > > > > Union format strings share enough properties that having them in the > > > same switch case doesn't result in additional complexity...lists and > > > list

Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-05 Thread Felipe Oliveira Carvalho
haracter version (i.e., > >> maybe +v and +V)? A single-character version is (slightly) easier to > >> parse in C. > >> > >> On Thu, Oct 5, 2023 at 2:00 PM Felipe Oliveira Carvalho > >> wrote: > >>> > >>> Hello, > >>

Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-05 Thread Felipe Oliveira Carvalho
arse the format string are already rather > unwieldy...it would be a nice quality-of-life improvement (although by > no means a required one) to use a separate character. > > On Thu, Oct 5, 2023 at 3:34 PM Felipe Oliveira Carvalho > wrote: > > > > This mailing

Re: [Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-05 Thread Felipe Oliveira Carvalho
where this discussion may have occurred...is there a reason > that +lv and +Lv were chosen over a single-character version (i.e., > maybe +v and +V)? A single-character version is (slightly) easier to > parse in C. > > On Thu, Oct 5, 2023 at 2:00 PM Felipe Oliveira Carvalho > wrote: >

[Vote][Format] C data interface format string for ListView and LargeListView arrays

2023-10-05 Thread Felipe Oliveira Carvalho
Hello, I'm writing to propose "+lv" and "+Lv" as format strings for list-view and large list-view arrays passing through the Arrow C data interface [1]. The vote will be open for at least 72 hours. [ ] +1 - I'm in favor of this new C Data Format string [ ] +0 [ ] -1 - I'm against adding this

Re: [VOTE][Format] Add ListView and LargeListView Arrays to Arrow Format

2023-10-02 Thread Felipe Oliveira Carvalho
> > There'll probably be some minor comments to the format PR, but those > > >> > don't deter from accepting these new layouts into the standard. > > >> > > > >> > Regards > > >> > > > >> > Antoine. > > >> >

Re: [VOTE][Format] Add ListView and LargeListView Arrays to Arrow Format

2023-09-29 Thread Felipe Oliveira Carvalho
sues as [1]? > > Kind Regards, > > Raphael Taylor-Davies > > [1]: https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy > > On 29/09/2023 13:09, Felipe Oliveira Carvalho wrote: > > Hello, > > > > I'd like to propose adding ListView and LargeListVie

[VOTE][Format] Add ListView and LargeListView Arrays to Arrow Format

2023-09-29 Thread Felipe Oliveira Carvalho
Hello, I'd like to propose adding ListView and LargeListView arrays to the Arrow format. Previous discussion in [1][2], columnar format description and flatbuffers changes in [3]. There are implementations available in both C++ [4] and Go [5]. I'm working on the integration tests which I will

Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Felipe Oliveira Carvalho
My take here is that Ben did an excellent job in hiding the fact that C++ has two variations of the format without leaking the pointer version via the interfaces through which Arrow arrays are communicated to other implementations. As things stand right now, there is no zero-copy transfer of

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread Felipe Oliveira Carvalho
> (a) stays pretty stable throughout the scan (stays < 1G), (b) keeps increasing during the scan (looks linear to the number of files scanned). I wouldn't take this to mean a memory leak but the memory allocator not paging out virtual memory that has been allocated throughout the scan. Could you

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-08-21 Thread Felipe Oliveira Carvalho
I marked the C++ implementation PR ready for review today and will soon be working on the Go implementation. https://github.com/apache/arrow/pull/35345 Note that, unlike Velox's ArrayVector, the Arrow implementation (ListView) also features a 64-bit version (LargeListView) to be

Re: [VOTE][Format] Add Utf8View Arrays to Arrow Format

2023-08-18 Thread Felipe Oliveira Carvalho
+1 (non-binding) — Felipe On Fri, 18 Aug 2023 at 18:48 Jacob Wujciak-Jens wrote: > +1 (non-binding) > > On Fri, Aug 18, 2023 at 6:04 PM L. C. Hsieh wrote: > > > +1 (binding) > > > > On Fri, Aug 18, 2023 at 5:53 AM Neal Richardson > > wrote: > > > > > > +1 > > > > > > Thanks all for the

[Format] C data interface format string for run-end encoded arrays

2023-08-15 Thread Felipe Oliveira Carvalho
Hello, I'm writing to inform you that I'm proposing "+r" as format string for run-end encoded arrays passing through the Arrow C data interface [1]. Feel free to also discuss in the linked PR with the changes to bridge.cc and reference docs. [1]

Re: [DISCUSS] Canonical alternative layout proposal

2023-08-05 Thread Felipe Oliveira Carvalho
ave > multiple physical layouts. I agree. E.g. variable size list<32>, variable > size list<64>, and REE are the physical layouts that, combined with the > logical type "string", give you "string", "large string", and "ree" > > [1

Re: [DISCUSS] Canonical alternative layout proposal

2023-08-01 Thread Felipe Oliveira Carvalho
A major difficulty in making the Arrow array types open for extension [1] is that as soon as we define an (a) universal representation* or (b) abstract interface, we close the door for vectorization. (a) prevents having new vectorization friendly formats and (b) limits the implementation of new

Re: Question about TypeHolder in arrow

2023-07-04 Thread Felipe Oliveira Carvalho
int8(), int16()… all return the same shared_ptr that gets inc-ref’d on every "creation". But any code taking type pointers shouldn't assume it comes from `static` storage. All uses of a non-owning TypeHolder should be based on something else ensuring the shared_ptr is alive while the TypeHolder

Re: Question about nested columnar validity

2023-06-29 Thread Felipe Oliveira Carvalho
Values in the `offsets` Buffer of a ListArray can’t be left undefined because the length of a valid entry before a NULL entry is the offset associated with that NULL entry minus the previous offset. The ListViewArray format I’m working on doesn’t have that restriction because all the information
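
A small Python sketch of why the offsets must stay defined even for NULL entries. It models a ListArray with plain Python lists rather than real Arrow buffers, and the helper name `list_lengths` is invented for this example.

```python
def list_lengths(offsets):
    """Length of entry i is offsets[i + 1] - offsets[i], valid or not."""
    return [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]


# Three entries: [a, b], NULL, [c, d, e]
validity = [True, False, True]
offsets = [0, 2, 2, 5]
lengths = list_lengths(offsets)

# If the offset belonging to the NULL entry (index 1) held an arbitrary value,
# the length of the valid entry before it would be computed incorrectly.
bad_offsets = [0, 99, 2, 5]
bad_lengths = list_lengths(bad_offsets)
```

Here the valid entry at index 0 gets its length from `offsets[1]`, which is the offset associated with the NULL entry.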

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-15 Thread Felipe Oliveira Carvalho
herently wrong with it, and if it ain't broke we > really shouldn't be trying to fix it. > > Kind Regards, > > Raphael Taylor-Davies > > On 14 June 2023 17:52:52 BST, Felipe Oliveira Carvalho > wrote: > > General approach to alternative formats aside,

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Felipe Oliveira Carvalho
ort > ListView aspires to, such an addition could require non trivial changes to > many / all of those implementations (and the APIs they expose). > > Andrew > > On Wed, Jun 14, 2023 at 12:53 PM Felipe Oliveira Carvalho < > felipe...@gmail.com> wrote: > > > General a

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-14 Thread Felipe Oliveira Carvalho
> > On Wed, Jun 14, 2023 at 2:07 AM Antoine Pitrou wrote: > > > > > I agree that ListView cannot be an extension type, given that it > > features a new layout, and therefore cannot reasonably be backed by an > > existing storage type (AFAICT). > > > > Also, I'm very lu

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Felipe Oliveira Carvalho
> > worried that it might undermine the perception that the Arrow > > > format > > > > is > > > > > > > stable. I think it might be worth thinking about "soft > deprecating" > > > > the > > > > > > old > > > >

Re: [VOTE][Format] Add experimental ArrowDeviceArray to C-Data API

2023-05-25 Thread Felipe Oliveira Carvalho
+1 for me. The C structs are clean and leave good room for extension. -- Felipe On Thu, May 25, 2023 at 12:04 PM David Li wrote: > +1 for me. > > (Heads up: on the PR, there was some discussion since the last email and > the meaning of 'experimental' was clarified.) > > On Tue, May 23, 2023,

Re: New datatype: Huge integers & decimals

2023-05-24 Thread Felipe Oliveira Carvalho
Have you considered using fixed-length binary values for these? Crypto algorithms might logically be defined in terms of mathematical operations on integers, but their efficient implementation tends to feature inlined operations at the machine word level instead of generic add, div, mod, mul
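
As a rough illustration of the fixed-length binary suggestion, the Python sketch below stores an unsigned 256-bit integer as exactly 32 bytes. The width, the little-endian byte order, and the helper names are assumptions made for this example and are not something the thread prescribes.

```python
WIDTH = 32  # bytes per value; an assumed width for 256-bit integers


def encode_uint(n: int) -> bytes:
    """Encode a non-negative integer as a fixed-width little-endian value."""
    # int.to_bytes raises OverflowError if n is negative or does not fit.
    return n.to_bytes(WIDTH, "little")


def decode_uint(raw: bytes) -> int:
    """Decode a fixed-width little-endian value back into an integer."""
    if len(raw) != WIDTH:
        raise ValueError("unexpected value width")
    return int.from_bytes(raw, "little")


value = 2**255 - 19
encoded = encode_uint(value)
roundtrip = decode_uint(encoded)
```

Every value occupies the same number of bytes, which is the property a fixed-size binary column relies on.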

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-21 Thread Felipe Oliveira Carvalho
ple, > operations > >> that slice these containers can be implemented in a zero-copy manner by > >> just rearranging the lengths/offsets indices, without ever touching the > >> larger internal buffers. This is a similar motivation as for StringView > >> (think

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-19 Thread Felipe Oliveira Carvalho
luding compute kernels? Or are they likely to > just > > convert this type to ListArray at import boundaries? > > > > Because if it turns out to be the latter, then we might as well ask Velox > > to export this type as ListArray and save the rest of the ecosystem some > >

Re: Freeing memory when working with static crt in windows.

2023-05-12 Thread Felipe Oliveira Carvalho
> I am actually trying to switch to arrow_static.lib. Perhaps the issue is arrow_static.lib being linked with a static crt that's not the one you are using in your project? On Fri, May 12, 2023 at 3:13 PM Arkadiy Vertleyb (BLOOMBERG/ 120 PARK) < avertl...@bloomberg.net> wrote: > This is not

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-11 Thread Felipe Oliveira Carvalho
> > >> > > >> On Tue, Apr 25, 2023 at 3:13 PM Will Jones > > wrote: > > >> > > >>> Hi Felipe, > > >>> > > >>> Thanks for the introduction. I'd be interested to hear about the > > >>> applications Velox h

Re: [ANNOUNCE] New Arrow PMC member: Matt Topol

2023-05-03 Thread Felipe Oliveira Carvalho
Congratulations, Matt! On Wed, 3 May 2023 at 14:37 Andrew Lamb wrote: > The Project Management Committee (PMC) for Apache Arrow has invited > Matt Topol (zeroshade) to become a PMC member and we are pleased to > announce > that Matt has accepted. > > Congratulations and welcome! >

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-26 Thread Felipe Oliveira Carvalho
After Weston's suggestion above, I've renamed files and classes in my WIP implementation: ArrayView -> ListView On Wed, Apr 26, 2023 at 11:08 AM Ian Cook wrote: > +1 to what Weston and Joris suggested regarding the name. "ListView" > seems like the best name to use for this layout in Arrow. >

[DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread Felipe Oliveira Carvalho
Hi folks, I would like to start a public discussion on the inclusion of a new array format to Arrow — array-view array. The name is also up for debate. This format is inspired by Velox's ArrayVector format [1]. Logically, this array represents an array of arrays. Each element is an array-view
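
A minimal Python sketch of the "array of arrays" idea, under the assumption that each element is described by an (offset, size) pair into a shared child values array. The message is cut off before it spells out the layout, so the buffer names here are illustrative, and plain Python lists stand in for Arrow buffers.

```python
def list_view_get(values, offsets, sizes, validity, i):
    """Return element i of a list-view-style array, or None if it is null."""
    if not validity[i]:
        return None
    start = offsets[i]
    return values[start:start + sizes[i]]


values = [1, 2, 3, 4, 5]
# Views may appear in any order and may overlap, because each element
# carries its own offset and size.
offsets = [3, 0, 0]
sizes = [2, 0, 3]
validity = [True, False, True]

elements = [
    list_view_get(values, offsets, sizes, validity, i)
    for i in range(len(offsets))
]
```

Because every element carries its own size, no element depends on the offset of its neighbour.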

Re: [DISCUSS] The default commit message for merge button

2023-01-31 Thread Felipe Oliveira Carvalho
+1 for "pull request title *and* description". Being able to read descriptions without leaving the editor is handy. Keeping that information tracked in the repo means we don’t depend on GitHub to reconstruct the history of the project. On Tue, 31 Jan 2023 at 06:43 Antoine Pitrou wrote: > > +1