Re: DISCUSS: Make bucket creation opt-in in C++ S3FileSystem?

2022-05-29 Thread Micah Kornfield
+1, this sounds reasonable to me. On Sun, May 29, 2022 at 9:26 AM Antoine Pitrou wrote: > +1 as well > > > Le 25/05/2022 à 01:25, Weston Pace a écrit : > > +1 > > > > I think opt-in is the right way to go here. > > > > On Tue, May 24, 2022 at 12:40 PM Will Jones > wrote: > >> > >> Hello Arrow d

Re: [Python] Converting Python Schema Object to Java Schema Object

2022-05-29 Thread Micah Kornfield
I'm not aware of them (but not an expert here), you could do this by serializing the schema as flatbuffers on one side and deserializing of the other On Sat, May 21, 2022 at 7:22 AM Srinivas Lade wrote: > Hi dev@arrow, > > The pyarrow.jvm library contains the schema() function which can convert

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Micah Kornfield
> > I'm currently working on adding Run-Length encoding to arrow. Nice > What are the intended use cases for this: > - external engines want to provide run-length encoded data to work on > using arrow? > It is more than just external engines. Many popular file formats support RLE encoding. Bei

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-03 Thread Micah Kornfield
> > Thinking about compatibility with existing software, RLE could possibly > even be made an Extension Type that follows the layout of a struct of > int32 and the encoded value type. I'm wondering whether this would be > better for compatibility. I might be misunderstanding this proposal, but I don'

Re: [C++] Can we remove cpp/src/arrow/dbi/hiveserver2?

2022-06-06 Thread Micah Kornfield
+1 On Mon, Jun 6, 2022 at 8:58 AM Antoine Pitrou wrote: > > +1 for removing it. > > > On Fri, 03 Jun 2022 08:32:35 +0900 (JST) > Sutou Kouhei wrote: > > Hi, > > > > We have Hive adapter in cpp/src/arrow/dbi/hiveserver2 but > > it's not maintained. Can we remove this? > > > > Reasons: > > > > 1.

Re: int64_t vs size_t

2022-06-08 Thread Micah Kornfield
> > Is it an oversight or a conscious design decision? If latter, what is the > reason behind it? This comes from the Google style guide [1], which the project adopted. [1] https://google.github.io/styleguide/cppguide.html#Integer_Types On Wed, Jun 8, 2022 at 10:39 AM Arkadiy Vertleyb (BLOOMBERG/ 120

Re: [FlightSQL] Structured/Serialized representation of query (like JSON) rather than SQL string possible?

2022-06-30 Thread Micah Kornfield
ouping by multiple distinct aggregates", > > >> "Supports > > >> > > self-joins on aliased tables" etc > > >> > > This is going to be unique to each implementation, but I couldn't > > >> > determine > > >>

Re: [VOTE] Accept donation of Flight SQL JDBC driver

2022-07-03 Thread Micah Kornfield
+1 Binding. On Sat, Jul 2, 2022 at 5:29 AM David Li wrote: > Done! Thanks Kou, I had thought that was auto-generated. > > On Thu, Jun 30, 2022, at 16:55, Sutou Kouhei wrote: > > +1 > > > >> [2]: > https://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/arrow-flight-sql-jdbc-

Re: [C#] Adding compression and decompression support

2022-07-05 Thread Micah Kornfield
I'm not an expert in C#, but I imagine there should be high-quality LZ4 implementations in C#. There is [1], for instance, which is referenced on the LZ4 page. The approach we took in Java is to put compressors/decompressors in a separate sub-component, with the interface contained in the main pr

Re: std::string_view?

2022-07-12 Thread Micah Kornfield
You can substitute the definition in Arrow. There might be a few spots that don't compile that use non-standard methods, but I've been trying to clean those up. On Tuesday, July 12, 2022, John Muehlhausen wrote: > error: invalid operands to binary expression > ('nonstd::sv_lite::basic_string_view

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Micah Kornfield
Are there more details on what exactly an "Arrow Intermediate Representation (AIR)" is? We've talked in the past about maybe having a memory layout specification for row-based data as well as column-based data. There was also a recent attempt, at least in C++, to try to build utilities to do these

[RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Micah Kornfield
t; > > >>> Hi Laurent, > > >>> > > >>> I agree that there is a common pattern in converting row-based > formats > > to > > >>> Arrow. > > >>> > > >>> Imho the difficult part is not to map the storage

Re: [ARROW-17255] Logical JSON type in Arrow

2022-07-29 Thread Micah Kornfield
Just to be clear, I think we are referring to a "well known"/canonical extension type [1] here? I'd also be in favor of this (Disclaimer I'm a colleague of Padeep's) [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types On Fri, Jul 29, 2022 at 3:19 PM Wes McKinney wrote: > T

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Micah Kornfield
there is a common pattern in converting row-based formats > to > >>> Arrow. > >>> > >>> Imho the difficult part is not to map the storage format to Arrow > >>> specifically - it is to map the storage format to any in-memory (row- > or > >>

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-01 Thread Micah Kornfield
> > It would be reasonable to restrict JSON to utf8, and tell people they > need to transcode in the rare cases where some obnoxious software > outputs utf16-encoded JSON. +1 I think this aligns with the latest JSON RFC [1] as well. Sounds good to me too. +1 on the canonical extension type option
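
The transcoding step suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `json_bytes_to_utf8` is hypothetical, it uses nothing beyond the standard library, and it is not part of any Arrow API.

```python
import json


def json_bytes_to_utf8(raw: bytes, source_encoding: str = "utf-16") -> bytes:
    """Hypothetical helper: decode JSON text from source_encoding, re-encode as UTF-8."""
    text = raw.decode(source_encoding)
    # Parse once so that invalid JSON fails here rather than further downstream.
    json.loads(text)
    return text.encode("utf-8")


# A producer that emits UTF-16 encoded JSON (Python prepends a byte-order mark).
utf16_payload = '{"name": "Zoë"}'.encode("utf-16")
utf8_payload = json_bytes_to_utf8(utf16_payload)
```

With this approach only the UTF-8 bytes would ever be stored in the extension array, so consumers would not need to carry an encoding parameter.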

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-01 Thread Micah Kornfield
> > > 2. What do we do about different non-utf8 encodings? There does not > appear > > to be a consensus yet on this point. One option is to only allow utf8 > > encoding and force implementers to convert non-utf8 to utf8. Second > option > > is to allow all encodings and capture the encoding in the

Re: [VOTE] Format: Rules and procedures for Canonical extension types

2022-08-24 Thread Micah Kornfield
Sorry for being late. I'm -0.5 on "org.apache.arrow." given people previously raising naming concerns about having "apache" and "arrow" coupled together. I think just "arrow" makes sense here. I also am not sure about relaxing the 2 language requirement for simple implementations, but feel le

Re: [VOTE] C++: switch to C++17

2022-08-24 Thread Micah Kornfield
+1, I assume this doesn't impact python wheels. On Wed, Aug 24, 2022 at 2:15 PM Sutou Kouhei wrote: > +1 > > In <7ab9b8ef-d5ca-313f-4b12-79647f32a...@python.org> > "[VOTE] C++: switch to C++17" on Wed, 24 Aug 2022 17:31:52 +0200, > Antoine Pitrou wrote: > > > > > Hello, > > > > I would like

Re: Usage of the name Feather?

2022-08-30 Thread Micah Kornfield
I think one source of ambiguity for Arrow files, at least for me, is whether they are just a string of messages concatenated or they are the files that contain the metadata footer. On Tue, Aug 30, 2022 at 5:11 AM Dewey Dunnington wrote: > Ian has a very good point...I would be in favour of calli

Re: [ANNOUNCE] New Arrow PMC member: L. C. Hsieh

2022-09-03 Thread Micah Kornfield
Congrats! On Sat, Sep 3, 2022 at 8:19 PM QP Hou wrote: > Congrats Liang-Chi! > > On Sat, Sep 3, 2022 at 8:25 PM Remzi Yang <1371656737...@gmail.com> wrote: > > > Congratulation Liang-Chi! > > > > On Sun, 4 Sept 2022 at 05:39, Sutou Kouhei wrote: > > > > > The Project Management Committee (PMC)

Re: Transactional semantics in Acero

2022-09-09 Thread Micah Kornfield
I would think any transaction concerns would live at the peripheries? e.g. the Datasets? Or at least that is where compatibility would have to be built first. On Fri, Sep 9, 2022 at 12:01 PM Sasha Krassovsky wrote: > Hi Jayjeet, > Transactions are currently out of scope for Acero - Acero is on

Re: Transactional semantics in Acero

2022-09-11 Thread Micah Kornfield
out what >> that would look like in great detail and, at a minimum, you'd maybe >> want some kind of Iceberg -> Substrait planner. >> >> [1] https://arrow.apache.org/docs/python/dataset.html#a-note-on- >> transactions-acid-guarantees >> >> On Fri,

Re: PRs for RLE support

2022-09-14 Thread Micah Kornfield
> > * Should we encode "run lengths" or "run ends"? I think the project has leaned towards sublinear access, so run ends make sense. The downside is that we run into similar issues with List/LargeList where the total number of elements is limited by bit-width (which can also cause space wastage
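
The trade-off between the two encodings can be illustrated with a small Python sketch. The names below (`value_at_run_lengths`, `value_at_run_ends`) are hypothetical and the data is made up; this is not the Arrow implementation, only a model of why run ends allow sublinear access.

```python
from bisect import bisect_right
from itertools import accumulate

values = ["a", "b", "c"]
run_lengths = [3, 1, 4]                     # logical array: a a a b c c c c
run_ends = list(accumulate(run_lengths))    # cumulative form: [3, 4, 8]


def value_at_run_lengths(i: int) -> str:
    """Linear scan: with run lengths, earlier runs must be summed up first."""
    if i < 0:
        raise IndexError(i)
    for length, value in zip(run_lengths, values):
        if i < length:
            return value
        i -= length
    raise IndexError(i)


def value_at_run_ends(i: int) -> str:
    """Binary search: run ends are sorted, so lookup is O(log number_of_runs)."""
    if not 0 <= i < run_ends[-1]:
        raise IndexError(i)
    return values[bisect_right(run_ends, i)]


decoded = [value_at_run_ends(i) for i in range(run_ends[-1])]
```

The downside mentioned in the message also shows up here: the last run end equals the total logical length, so if run ends were stored as, say, signed 32-bit integers (an assumption for illustration), the logical length could not exceed 2**31 - 1.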

Re: RLE array slicing

2022-09-15 Thread Micah Kornfield
I agree slicing can be tricky here. Since slicing is not part of the specification, maybe there should be two separate discussions here. I'll be honest, I forget exactly how slicing works in the C++ implementation, but is > Say you want to slice the RLE array from Logical Offset 4 (which doesn't

Re: RLE array slicing

2022-09-15 Thread Micah Kornfield
key is you need to keep the logical offset around as part of your metadata so you can do the subtraction. On Thu, Sep 15, 2022 at 1:19 AM Antoine Pitrou wrote: > > Le 15/09/2022 à 10:14, Micah Kornfield a écrit : > > I agree slicing can be tricky here. Since slicing is not part of th
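
A minimal Python sketch of slicing with a retained logical offset, assuming a run-ends layout. The class `RunEndSlice` and its methods are hypothetical and do not mirror the C++ implementation; the point is only that the shared run-ends buffer is never rewritten, and the offset is subtracted on demand.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class RunEndSlice:
    """Hypothetical zero-copy view over run-end encoded data."""
    run_ends: list   # shared between slices, never modified
    values: list     # one value per run
    offset: int      # logical offset of this view into the parent data
    length: int      # logical length of this view

    def sliced(self, start: int, length: int) -> "RunEndSlice":
        if start < 0 or length < 0 or start + length > self.length:
            raise IndexError((start, length))
        # Only the metadata changes; offsets of nested slices accumulate.
        return RunEndSlice(self.run_ends, self.values, self.offset + start, length)

    def __getitem__(self, i: int):
        if not 0 <= i < self.length:
            raise IndexError(i)
        return self.values[bisect_right(self.run_ends, self.offset + i)]

    def relative_run_ends(self) -> list:
        """Run ends as seen from inside the slice: subtract the offset, clamp to length."""
        if self.length == 0:
            return []
        first = bisect_right(self.run_ends, self.offset)
        out = []
        for end in self.run_ends[first:]:
            rel = min(end - self.offset, self.length)
            out.append(rel)
            if rel == self.length:
                break
        return out


full = RunEndSlice([3, 4, 8], ["a", "b", "c"], offset=0, length=8)
part = full.sliced(2, 4)      # logical indices 2..5 of the parent
nested = part.sliced(1, 2)    # logical indices 3..4 of the parent
```

Note that `part` starts in the middle of the first run, which is the tricky case raised in the thread: the view still points at the original run ends, and the partial run is recovered by the subtraction.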

Re: [DISCUSS][C++] C++ API as a user-facing API

2022-09-29 Thread Micah Kornfield
I think the convention we have been using is that headers included directly in the "api.h" headers were considered public, those that aren't were considered a gray area. On Thu, Sep 29, 2022 at 11:51 AM Aldrin wrote: > > many parts were written without the intention that those APIs would not > b

Re: [Java] UTF-16 support for VarCharVectors

2022-09-29 Thread Micah Kornfield
> > Was just wondering was support for UTF-16 Strings considered? As far as I > am aware VarChar vectors only support UTF-8. Are they something that may be > supported in the future? This hasn't really been discussed and is a pretty large change because it would require specification updates and other imp

Re: [Java] UTF-16 support for VarCharVectors

2022-09-29 Thread Micah Kornfield
> > I've never attempted to transport that data over the wire or export it > using the C-Data Interface, however. It seems like that's where it would > fall down. Yeah, there would be funny characters or validation failures someplace down the line when trying to transfer the data. On Thu, Sep 29,

Re: pyarrow dataset API

2022-11-01 Thread Micah Kornfield
Moving conversation to dev@, which is a more appropriate place to discuss. On Tuesday, November 1, 2022, Chang She wrote: > Hi there, > > The pyarrow dataset API is marked experimental so I'm curious if y'all > have made any decisions on it for upcoming releases. Specifically, any > thoughts on mak

Re: Struct evolution

2022-11-08 Thread Micah Kornfield
Hi Matthew, Could you give some more specifics about what language/component you are using? In general, Arrow at a specification level doesn't deal with schema evolution. Is this in regard to Datasets or a different component? Thanks, Micah On Mon, Nov 7, 2022 at 5:06 PM Matthew Scanlon < matth

Re: [RFC] Schema Evolution

2022-11-09 Thread Micah Kornfield
It doesn't look like comment access is enabled? On Wed, Nov 9, 2022 at 5:16 PM Weston Pace wrote: > I've created a document[1] that both describes the general idea of > schema evolution as well as my best guess at how it should work. This > is written from an Acero / datasets perspective but th

Re: Array::GetValue ?

2022-11-14 Thread Micah Kornfield
Hi John, There are a couple of edge cases that need to be discussed to move the function to the base array class (which IIUC is this proposal): 1. boolean 2. struct 3. lists/LargeList 4. DictionaryArray FlatArray [1] seems like a better place for this method if there is consensus on adding it.

Re: Array::GetValue ?

2022-11-16 Thread Micah Kornfield
ray_primitive.h#L50 > > >>> > > >> [2] > > >> > > > https://github.com/js8544/arrow/blob/master/cpp/src/arrow/array/array_primitive.h#L109 > > >> < > > >> > > > https://github.com/js8544/arrow/blob/master/cpp/src/arrow/array/ar

Struct evolution

2022-11-28 Thread Micah Kornfield
" of what a list of structs achieves raises a > pyarrow.lib.ArrowNotImplementedError when calling table_to_blocks(). > > On Tue, Nov 8, 2022 at 12:53 PM Micah Kornfield > wrote: > > > Hi Matthew, > > Could you give some more specifics about what language/component you a

Re: [DISCUSS] JSON Canonical Extension Type

2022-11-28 Thread Micah Kornfield
This seems like a reasonable definition to me. Since there hasn't been much feedback, I think maybe following through with an implementation + this description in a PR would be the next steps. If there isn't further feedback on this, once the PR is up we can try to vote (which might bring up some

Re: [DISCUSS] JSON Canonical Extension Type

2022-11-30 Thread Micah Kornfield
row but you are welcome to propose something if you would like. Cheers, Micah On Mon, Nov 28, 2022 at 11:55 AM Lee, David wrote: > Can a logical extension be based on another logical extension? > > HOCON support might be nice.. > > -Original Message----- > From: Micah Kornfi

Re: Array::GetValue ?

2022-11-30 Thread Micah Kornfield
uld be named though (also, we may want to be careful with multiple > inheritance?). > > Regards > > Antoine. > > > Le 17/11/2022 à 06:15, Micah Kornfield a écrit : > >> > >> std::string_view FlatArray::GetValueBytes(int64_t index) > > > > >

Re: [ANNOUNCE] New Arrow PMC chair: Andrew Lamb

2022-12-27 Thread Micah Kornfield
Congrats! On Tue, Dec 27, 2022 at 8:51 AM Krisztián Szűcs wrote: > Congrats Andrew! > > On Tue, Dec 27, 2022 at 8:54 AM Ian Joiner wrote: > > > > Congrats Andrew! > > > > Ian > > > > On Tuesday, December 27, 2022, Raúl Cumplido > wrote: > > > > > Congratulations Andrew! > > > > > > > > > El ma

Re: [Monorepo] Add labels breaking-change and critical-fix

2023-01-06 Thread Micah Kornfield
These sound good to me. We should be careful around crashes/security issues to not tag them until they are triaged and we decide if a new one-off release is necessary. On Fri, Jan 6, 2023 at 8:57 AM Will Jones wrote: > Hello Arrow devs, > > For the monorepo, I would like to propose adding two n

Re: [QUESTION][Parquet][Decimal] Why not implement the INT32/INT64 to store Decimal logical type in parquet file

2023-01-06 Thread Micah Kornfield
> > Hi Kun, > The document of arrow c++ about Reading and writing Parquet files > requires > `(2) On the write side, a FIXED_LENGTH_BYTE_ARRAY is always emitted.` I don't think this is a requirement, it is simply documenting current b

[DISCUSS] Updating what are considered reference implementations?

2023-01-06 Thread Micah Kornfield
I'm having trouble finding it, but I think we've previously agreed that new features needed implementations in 2 reference implementations before approval (I had thought the community agreed on Java and C++ as the two implementations but I can't find the vote thread on it). The recent addition

Re: [DISCUSS] Updating what are considered reference implementations?

2023-01-06 Thread Micah Kornfield
o maintain two "feature complete" implementations at all times. I worry if there is a pick 2 from N reference implementations that potentially leads to fragmentation more quickly. But maybe this is premature? Cheers, Micah On Fri, Jan 6, 2023 at 10:02 AM Antoine Pitrou wrote: > >

Re: ADLS C++ support in next release (version 11)

2023-01-06 Thread Micah Kornfield
It looks like there is an open PR: https://github.com/apache/arrow/pull/12914 for this but no recent activity. It's not clear how much remaining work there is, but it seems like timing might be getting tight. If you need this functionality, consider coordinating with the author to see if you can hel

Re: [IPC] How to plugin a custom compression type?

2023-01-19 Thread Micah Kornfield
Hi Rong, IMO, the purpose of the IPC spec is for wide interoperability to ensure data can easily be interchanged. Therefore, I don't think we should be adding pluggable codecs that can be used to write Arrow IPC files to the main repo. At the very least if we wanted to support custom codecs, we wou

Re: [C++] Parquet and Arrow overlap

2023-02-12 Thread Micah Kornfield
> > I am a committer on Arrow, > but not on Parquet right now. Does that mean I should only merge Parquet > C++ PRs for code changes in parquet/arrow? FWIW, This was the mode I was operating under. My preference here would be to continue to operate under this mode for the governance perspective.

Re: [DISCUSS] arrow/arrow2 path forward

2023-02-21 Thread Micah Kornfield
Great to see, thank you for bringing it to the attention of the wider community. On Mon, Feb 20, 2023 at 4:15 AM Andrew Lamb wrote: > There has been a significant amount of discussion in the past on this list > about the relationship between the two major Rust implementations of > Apache Arrow (

Re: row counts in footer of IPC file format

2023-04-01 Thread Micah Kornfield
IIRC Structs are immutable once defined; if you want to evolve, then Tables are necessary. On Mon, Mar 20, 2023 at 8:22 AM Weston Pace wrote: > +1, I'm generally in favor of the idea. I would prefer > `recordBatchNumRows` (or, less favorably, `recordBatchSize`). I don't > think `recordBatchLe

Re: Arrow community meeting April 12 at 16:00 UTC

2023-04-14 Thread Micah Kornfield
ad and do this; the Parquet Rust > implementation did something similar > - There are already some Parquet issues that were reported and > resolved in the Arrow monorepo in this release without ever being > opened as Parquet Jira issues [10] > - Check with Micah Kornfield, Fatemah Pa

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-26 Thread Micah Kornfield
Small bikeshed: But to keep naming consistent "ViewList"? On Wed, Apr 26, 2023 at 8:02 AM Weston Pace wrote: > > My understanding is that the primary benefit of this ListView layout > > over Arrow's existing List layouts [1] is that ListView allows for > > buffer alignment [2] without padding,

Re: [Format] Is it legal to have a struct array with a shorter length than its children?

2023-05-09 Thread Micah Kornfield
> > I think the `the same length` means the length of the struct array, this is > similar in the case of RecordBatch where the `num_rows` of a RecordBatch > can be different to the length of its fields. I didn't think this is true and I thought we had prior discussions on the mailing list sugges

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-27 Thread Micah Kornfield
This sounds reasonable to me, but my main concern is that I'm not sure there is a great mechanism to enforce that canonical layouts don't somehow become the default (or the only implementation). Even for these new layouts, I think it might be worth rethinking binding a layout into the schema versus having a dif

Re: [DISCUSS][C++] Raw pointer string views

2023-10-01 Thread Micah Kornfield
> > I would also assert that another way to reduce this risk is to add > some prose to the relevant sections of the columnar format > specification doc to clearly explain that a raw pointers variant of > the layout, while not part of the official spec, may be implemented in > some Arrow libraries.

Re: [VOTE][Format] Add ListView and LargeListView Arrays to Arrow Format

2023-10-02 Thread Micah Kornfield
Sorry to chime in late. In practice I'm not sure how much LargeList is used? Are we doing this just for symmetry purposes? Is there a known use-case for it? On Mon, Oct 2, 2023 at 11:20 AM Matt Topol wrote: > Should have expanded my messages, i forgot that i already +1'd this > > d'oh! Sorry

Re: [Java][Discuss]: consensus for JDK 8 deprecation

2023-10-06 Thread Micah Kornfield
I think given the stability of Arrow Java, dropping support probably makes sense. If a bug comes up or consumers really need new features, we can always make a patch release of an older version. On Thu, Oct 5, 2023 at 3:13 PM Dane Pitkin wrote: > I also learned today that Apache Spark has dro

Re: Are interval components unsigned?

2023-10-13 Thread Micah Kornfield
My understanding is that the intent in Arrow is that intervals are signed ([1] has the discussion on the types). IIUC, this aligns with most SQL type systems. I don't have context on why Parquet (and I think Avro) chose to make them unsigned. Also, note that because of this there is no canonical wa

Re: [DISCUSS][IPC Format] Allow dictionary replacement in file format.

2023-10-15 Thread Micah Kornfield
IIRC, the main issue that the community did not want to tackle at the time was determining which dictionary applied to which particular record batch when it comes to random access. The IPC File Footer [1] does not contain enough information to do this without using heuristics. Thanks, Micah [1]

Re: decimal64

2023-11-09 Thread Micah Kornfield
Narrower Decimal types have come up in the past as something that is desirable (for instance, Parquet supports using ints of similar precision/scale). IIRC, the main blocker has been people willing to work on two compatible implementations. FWIW, we decided for Decimal256 to be conservative
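
The relationship between integer width and usable decimal precision can be checked with a short Python sketch. The function name `max_decimal_precision` is hypothetical; the calculation assumes the unscaled decimal value is stored as a signed two's-complement integer of the given width.

```python
def max_decimal_precision(bit_width: int) -> int:
    """Largest digit count d such that every d-digit unscaled value fits in
    a signed two's-complement integer of bit_width bits."""
    max_value = 2 ** (bit_width - 1) - 1
    digits = 0
    # The largest (digits + 1)-digit number is 10**(digits + 1) - 1.
    while 10 ** (digits + 1) - 1 <= max_value:
        digits += 1
    return digits


precisions = {bits: max_decimal_precision(bits) for bits in (32, 64, 128, 256)}
# precisions == {32: 9, 64: 18, 128: 38, 256: 76}
```

Under that assumption a 64-bit decimal would cover precisions up to 18 digits, and a 32-bit one up to 9, which is the range where narrower storage could save space relative to a 128-bit representation.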

Re: decimal64

2023-11-09 Thread Micah Kornfield
I think this should be a type. [1] is the last discussion on deciding, and I think the consensus was that if it is an additional parameterization of existing types, then using the existing type makes sense instead of an extension type. [1] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmg

Re: [VOTE][FORMAT] Bulk ingestion support for Flight SQL

2023-11-15 Thread Micah Kornfield
Sorry for the late reply but left some comments which I think are potentially worth addressing, so I'm +0.0 at the moment. On Wed, Nov 15, 2023 at 10:31 AM Matt Topol wrote: > +1 > > On Wed, Nov 15, 2023, 10:44 AM Jean-Baptiste Onofré > wrote: > > > +1 (non binding) > > > > Regards > > JB > > >

Re: [parquet][Iceberg] Should hive partition keys appear as corresponding columns in the file

2023-11-29 Thread Micah Kornfield
I don't think there is a strong consensus here unfortunately and different people might want different things, and there is the issue with legacy systems. As another example, whether to include partition columns in data files is a configuration option in Hudi. If I was creating new data from scra

Re: [DISCUSS] Flight SQL as experimental

2023-12-07 Thread Micah Kornfield
This applies to mostly existing APIs (e.g. recent additions are still experimental)? Or would it apply to everything going forward? Thanks, Micah On Thu, Dec 7, 2023 at 2:25 PM David Li wrote: > Yes, we'd update the docs, the Protobuf definitions, and anything else > referring to Flight SQL as

Re: [DISCUSS] Flight SQL as experimental

2023-12-08 Thread Micah Kornfield
ent > > > Flight/Flight SQL protocol and code as it is today. Protocol extensions > > > should be still deemed experimental if still in their incubating phase? > > > > > > Laurent > > > > > > On Thu, Dec 7, 2023 at 4:54 PM Micah Kornfield >

Re: [VOTE] Accept donation of flightsql-odbc

2024-01-05 Thread Micah Kornfield
+1 (binding) On Fri, Jan 5, 2024 at 9:28 AM James Duong wrote: > +1 > > From: Santiago Mota > Date: Friday, January 5, 2024 at 7:49 AM > To: dev@arrow.apache.org > Subject: Re: [VOTE] Accept donation of flightsql-odbc > +1 > > Enviado desde Yahoo Mail con Android > > El vie, ene 5, 2024 a 16

Re: [DISCUSS] Donation of a Spark native engine based on DataFusion & Arrow

2024-01-10 Thread Micah Kornfield
Hi Chao, Very cool. I think this is something that a lot of people are interested in. I think the main questions I have are: 1. Would Spark itself not be a reasonable place for this work? 2. Do you anticipate this would move with DataFusion to its own top-level project [1] if that happens or sta

Re: [DISCUSS] Donation of a Spark native engine based on DataFusion & Arrow

2024-01-11 Thread Micah Kornfield
It sounds like there is likely enough support for this to move forward, I'd guess next steps are to work on the donation process/vote. Probably someone more involved with DataFusion should help drive this effort? On Thu, Jan 11, 2024 at 12:55 PM L. C. Hsieh wrote: > Spark as a widely used compu

Re: [DataFusion] New Blog Post -- DataFusion 34.0

2024-01-24 Thread Micah Kornfield
+1 Nice work. On Tue, Jan 23, 2024 at 12:29 PM Antoine Pitrou wrote: > > Impressive, thank you! > > > Le 23/01/2024 à 14:06, Andrew Lamb a écrit : > > If anyone is interested, here is a new blog post about the last 6 months > in > > DataFusion[1] and where we are heading this year. > > > > Andre

Re: [IPC] Delta Dictionary Flag Clarification for Multi-Batch IPC

2024-01-24 Thread Micah Kornfield
Hi Chris, My interpretations: 1) I'm not sure it is clearly defined, but my impression is the first dictionary is never a delta dictionary (option 1) 2) I don't think they are prevented from switching state (which I suppose is more complicated?) but hopefully not by much? 3) Dictionaries are reuse

Re: [VOTE] Accept donation of Comet Spark native engine

2024-01-27 Thread Micah Kornfield
+1 Binding On Sat, Jan 27, 2024 at 10:21 AM David Li wrote: > +1 (binding) > > On Sat, Jan 27, 2024, at 13:03, L. C. Hsieh wrote: > > +1 (binding) > > > > On Sat, Jan 27, 2024 at 8:10 AM Andrew Lamb > wrote: > >> > >> +1 (binding) > >> > >> This is super exciting > >> > >> On Sat, Jan 27, 2024

Re: [VOTE] Move Arrow DataFusion Subproject to new Top Level Apache Project

2024-03-01 Thread Micah Kornfield
+1 (binding) On Friday, March 1, 2024, Uwe L. Korn wrote: > +1 (binding) > > On Fri, Mar 1, 2024, at 2:37 PM, Andy Grove wrote: > > +1 (binding) > > > > On Fri, Mar 1, 2024 at 6:20 AM Weston Pace > wrote: > > > >> +1 (binding) > >> > >> On Fri, Mar 1, 2024 at 3:33 AM Andrew Lamb > wrote: > >>

Re: Unsupported/Other Type

2024-04-10 Thread Micah Kornfield
Hi Norman, Arrow has a concept of extension types [1] along with the possibility of proposing new canonical extension types [2]. This seems to cover the use-cases you mention but I might be misunderstanding? Thanks, Micah [1] https://arrow.apache.org/docs/format/Columnar.html#format-metadata-ext

Re: [VOTE][Format] JSON canonical extension type

2024-04-29 Thread Micah Kornfield
+1, I added a comment to the PR because I think we should recommend implementations specifically reject parsing Binary arrays with the annotation in case we want to support non-UTF8 encodings in the future (even though IIRC these aren't really JSON spec compliant). On Fri, Apr 19, 2024 at 1:24 PM

Re: [VOTE][Format] UUID canonical extension type

2024-04-29 Thread Micah Kornfield
Apologies for the late reply, but I think being able to specify the UUID version as metadata might make sense in some cases? On Fri, Apr 19, 2024 at 1:22 PM Rok Mihevc wrote: > Hi all, > > Following initial requests [1][2] and recent tangential ML discussion [3] I > would like to propose a vote

Re: [VOTE][Format] UUID canonical extension type

2024-04-29 Thread Micah Kornfield
generation than consumption. > > -- > Felipe > > On Mon, Apr 29, 2024 at 2:31 PM Micah Kornfield > wrote: > > > > Apologies for the late reply, but I think being able to specify the UUID > > version as metadata might make sense in some cases? > >

Re: [DISCUSS] Migration of parquet-cpp issues to GitHub

2024-05-28 Thread Micah Kornfield
I'm also +1 on the idea for both parquet-java and parquet-cpp. On Tuesday, May 28, 2024, Andrew Lamb wrote: > I think it is a great idea -- github has served arrow (and datafusion) very > well in my opinion. > > Specifically, having to sign up for a JIRA account (which can not be > created self-se

Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-05-29 Thread Micah Kornfield
+1 (non-binding for Parquet, Binding for Arrow if that makes a difference) On Wed, May 29, 2024 at 7:15 AM Rok Mihevc wrote: > # sending this to both dev@arrow and dev@parquet > > Hi all, > > Following the ML discussion [1] I would like to propose a vote for > parquet-cpp issues to be moved fr

Re: [DISCUSS] Migration of parquet-cpp issues to GitHub

2024-05-29 Thread Micah Kornfield
SGTM +1 On Wed, May 29, 2024 at 10:50 AM Rok Mihevc wrote: > On Wed, May 29, 2024 at 4:39 PM Fokko Driesprong wrote: > > > Hey Rok, > > > > Thanks for bringing this up. I'm also very much in favor of Github. Once > > we've migrated cpp, I think migrating the other repositories is a great > > id

Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?

2024-06-05 Thread Micah Kornfield
Generally I think this is a good idea that has been proposed before but I don't think we could ever make progress on design. On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei wrote: > Hi, > > Related GitHub issue: > https://github.com/apache/arrow/issues/41909 > > How about adding arrow::ArrayStatisti

Re: Unsupported/Other Type

2024-06-13 Thread Micah Kornfield
I left some comments mostly around bike shedding and organization of description. I agree this is useful. On Tue, Jun 11, 2024 at 10:22 AM Antoine Pitrou wrote: > > Sorry, I had forgotten to comment on this. I think this is generally a > good idea, but it would obviously need more eyes on it :-

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-18 Thread Micah Kornfield
As Boolean is already in the arrow type system I think it might be worth asking the question as to whether this should be an extension type or a first class type. Given what I think of the last discussion on the trade-offs [1], I think there is room for debate here, since Boolean is not currently

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-19 Thread Micah Kornfield
> > > > Regards > > > > Antoine. > > > > > > Le 19/07/2024 à 06:30, Felipe Oliveira Carvalho a écrit : > > > I think it would confuse implementors of the spec and people > implementing > > > kernels way too much. “the bool Arrow type”

Re: [VOTE][Format] Opaque canonical extension type

2024-07-24 Thread Micah Kornfield
+1 (binding) On Wed, Jul 24, 2024 at 2:19 PM Sutou Kouhei wrote: > +1 (binding) > > In <8ce8b9a4-ae7a-41eb-ab6e-a5ceb2258...@app.fastmail.com> > "[VOTE][Format] Opaque canonical extension type" on Wed, 24 Jul 2024 > 14:33:01 +0900, > "David Li" wrote: > > > Hello, > > > > I'd like to propos

Re: [DISCUSS] Variant Spec Location

2024-08-16 Thread Micah Kornfield
so need to standardize many functions > > > related to it. > > > > > > A neutral place to maintain it is a great choice. > > > > > > - As Gang Wu said, a standalone project is good, just like > RoaringBitmap > > > [1]. > > > - As Ryan

Re: [DISCUSS][Flight] Improved Arrow Flight as alternative to Iceberg for DB--engine interop

2024-09-12 Thread Micah Kornfield
Hi Roman, I tried to skim your writing at least on the read side. IIUC you are advocating for something which to some extent is a generalization of Iceberg's REST catalog planning API [1]. But instead of just Iceberg, it would encompass other formats, and also be more flexible on negotiating whic

Re: No replacement dictionaries supported in pyarrow?

2021-03-18 Thread Micah Kornfield
> wrote: > > > >> Somewhat related issue: > https://issues.apache.org/jira/browse/ARROW-10406 > >> > >> On Wed, Mar 17, 2021 at 11:22 PM Micah Kornfield > > >> wrote: > >> > >>> BTW, this nuance always felt a little strange to me

[Gandiva] Active maintainers?

2021-03-18 Thread Micah Kornfield
Is anybody actively looking at PRs for Gandiva? There seems to be a queue building (18 or so open). The committers that seemed to be active in the past don't seem to be responding to pings through GitHub. Thanks, Micah

Re: [ALL] Integration tests for dense and sparse tensor

2021-03-19 Thread Micah Kornfield
For historical context, golden files were first introduced so we could verify backwards compatibility. I think the preferred method is still to do "live" testing (i.e. having one implementation consume JSON and output a binary file, read the binary file with the second implementation and emit JSON, a

Re: [DISCUSS] Improving Contributor Guidelines

2021-03-20 Thread Micah Kornfield
ssue is "NOT ASSIGNED!", it asks if you want to assign to the > > reporter, and if the reporter is not a "contributor", add them to that > role > > first. That's not the only time it comes up (and if we start having more > > JIRAs created from GitHub issue

Re: [DISCUSS] Improving Contributor Guidelines

2021-03-21 Thread Micah Kornfield
21 at 2:31 PM Wes McKinney > wrote: > > > > > "MINOR:" might be a better marker since "trivial" carries the > > > connotation of "unimportant" to me (the dictionary says "of little > > > value or importance"). > > >

Re: [VOTE] Accept donation of Rust Ballista project

2021-03-21 Thread Micah Kornfield
+1 (binding) On Sun, Mar 21, 2021 at 9:31 AM Krisztián Szűcs wrote: > +1 (binding) > > Glad to see this happening! > > On 2021. Mar 21., Sun at 16:56, Andy Grove wrote: > > > Dear all, > > > > On behalf of the Ballista community, I would like to propose that we > donate > > Ballista to the Apac

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-22 Thread Micah Kornfield
> > I executed some of the benchmarks in the airlift/aircompressor project. I > found that aircompressor achieves on average only about 72% > throughput compared to the current version of the lz4-java JNI bindings > when compressing. When decompressing the gap is even bigger with around 56% > thro

Re: [Java] Source control of generated flatbuffers code

2021-03-23 Thread Micah Kornfield
I think checking in the Java files is fine and probably better than relying on a third-party package. We should make sure there are instructions on how to regenerate them along with the PR. On Monday, March 22, 2021, Antoine Pitrou wrote: > > Le 22/03/2021 à 20:17, bobtins a écrit : > >> TL;DR:

Re: [Java] Source control of generated flatbuffers code

2021-03-23 Thread Micah Kornfield
the old format? Or is there enough interop testing that the > problem would get caught right away? > > I'm new to the project and don't know how big of an issue this is in > practice. Thanks for any enlightenment. > > On 2021/03/23 07:39:16, Micah Kornfield wrote: > &

Re: sparse data array

2021-03-26 Thread Micah Kornfield
I made a proposal a while ago that covers a form of RLE encoding [1]. I haven't had time to work on it, since it is a substantial effort to implement. I wouldn't expect an intern to be able to complete the work necessary to get this merged over the course of a normal 3 month internship. [1] http
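As a rough illustration of what run-length encoding means in general (this is not the layout from the proposal referenced above, which is not reproduced here), the sketch below collapses consecutive equal values into (value, run length) pairs and expands them again. The function names `rle_encode` and `rle_decode` are hypothetical and are not part of any Arrow API.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    # groupby yields one group per run of adjacent equal elements.
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out


if __name__ == "__main__":
    data = [0, 0, 0, 0, 7, 7, 0, 0, 0]
    runs = rle_encode(data)
    print(runs)
    # Decoding the runs should give back the original data.
    print(rle_decode(runs) == data)
```

Data with long runs of repeated values (for example mostly-zero columns) shrinks to a handful of pairs, which is why this kind of encoding is attractive for sparse data.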

Re: [Gandiva] Active maintainers?

2021-03-28 Thread Micah Kornfield
dra and Praveen) and will > get them merged. > > Thanks > Vivek > > On Fri, Mar 19, 2021 at 9:29 AM Micah Kornfield > wrote: > >> Is anybody actively looking at PRs for Gandiva? There seems to be queue >> building 18 (or so open). The committers that seemed to b

Re: [DISCUSS] Improving Contributor Guidelines

2021-03-29 Thread Micah Kornfield
://cwiki.apache.org/confluence/display/ARROW > >> [2] https://arrow.apache.org/docs/developers/contributing.html > >> > >> On Sat, Mar 20, 2021 at 2:31 PM Wes McKinney > wrote: > >> > >>> "MINOR:" might be a better marker since "

Re: Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-03-30 Thread Micah Kornfield
esn't support Nanoseconds, only microseconds). 3. ZetaSQL implementation (Requires some bit manipulation) but supports the most reasonable ranges for Year, Month and Nanoseconds independently. Thoughts? Micah On 2021/02/18 04:30:55 Micah Kornfield wrote: > > > > I didn’t find

Re: sparse data array

2021-03-30 Thread Micah Kornfield
> > with other significant additions to Arrow over the last several years. > > > > On Sat, Mar 27, 2021 at 9:40 AM Kirill Lykov > wrote: > > > > > Thanks for the information and ideas, I need to check them out > (especially > > > one with structures). > >

Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-03-31 Thread Micah Kornfield
> You can already use the Duration type for that. > > Regards > > Antoine. > > > Le 31/03/2021 à 05:48, Micah Kornfield a écrit : > > To follow-up on this conversation I did some analysis on interval types: > > > > > https://docs.google.com/docume

Re: sparse data array

2021-04-01 Thread Micah Kornfield
ts on this list and my own experience. I'll look at > the spec again and put it on my back burner. > > On 2021/03/31 04:03:07, Micah Kornfield wrote: > > Hi Bob, > > > > > > > I can observe that in a project like Arrow, there is always a tension > > &g

Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-04-02 Thread Micah Kornfield
today there is an explicit constraint on the bounds for the millisecond component). -Micah On Wed, Mar 31, 2021 at 9:03 AM Antoine Pitrou wrote: > > Le 31/03/2021 à 17:55, Micah Kornfield a écrit : > > Thanks for the feedback. A couple of points here and some responses > below.
