Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Aldrin
For what it's worth, DuckDB accesses Arrow data via IPC in an extension, then exports to the C data interface to call into code in its core. Also, assumptions about when query optimization occurs relative to data access potentially break down in scenarios involving: views, distributed tables,

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-23 Thread Aldrin
with more information and thoughts in the meantime. [1]: https://arxiv.org/pdf/2304.05028.pdf Sent from Proton Mail for iOS On Sat, Mar 23, 2024 at 05:23, Andrei Lazăr lazarandrei...@gmail.com wrote: Hi Aldrin, thanks for taking the time to reply to my email! In my understanding, compression on Par

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-22 Thread Aldrin
Hello! I don't do much with compression, so I could be wrong, but I assume a compression algorithm spans the whole column and areas of large variance generally benefit less from the compression, but the encoding still provides benefits across separate areas (e.g. separate row groups). My

Re: [DISCUSS][C++] Help needed to refactor Skyhook

2024-03-15 Thread Aldrin
# -- # Aldrin https://github.com/drin/ https://gitlab.com/octalene https://keybase.io/octalene On Thursday, March 14th, 2024 at 09:10, Jayjeet Chakraborty wrote: > Hi Ben, I am willing to help out with the refactor too ! > > On Wed, Mar 13, 2024 at 9:25 PM Aldrin

Re: [DISCUSS][C++] Help needed to refactor Skyhook

2024-03-13 Thread Aldrin
I am interested in helping to refactor! -Aldrin On Wed, Mar 13, 2024 at 08:54, Benjamin Kietzman bengil...@gmail.com wrote: Skyhook [1] enables efficient predicate and projection pushdown from Arrow Dataset to a Ceph storage cluster. This is very cool functionality, but it's tightly coupled

Re: dev question - is it possible to store different types in a single array ?

2024-02-29 Thread Aldrin
Hello! For an Array of mixed types, you can use a DenseUnion [1] or SparseUnion type [2]. For modeling as rows instead of columns, the short answer is "no" but you could store the pivot/rotation of the table (columns represent rows) or you can use something like a StructArray [3]. The data in
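
The union layouts mentioned in this reply can be sketched in plain Python (standard library only, not pyarrow). This is a minimal illustration based on the Arrow columnar format's description of unions: a dense union keeps a type-id buffer plus an offsets buffer into compact children, while a sparse union keeps only type ids and children that are all as long as the union itself. The sample values and the helper names dense_union_get and sparse_union_get are assumptions made up for this sketch.

```python
from array import array

# Hypothetical logical values of the mixed-type array: [1, "a", 2, "b", 3]
type_ids = array("b", [0, 1, 0, 1, 0])  # int8 buffer: which child holds slot i

# Dense union: each child stores only its own values, so an offsets
# buffer is needed to locate slot i inside the selected child.
dense_offsets = array("i", [0, 0, 1, 1, 2])
dense_children = [
    [1, 2, 3],    # child 0: integers
    ["a", "b"],   # child 1: strings
]

# Sparse union: every child has the same length as the union, so slot i
# is read at index i of the selected child (unused slots are placeholders).
sparse_children = [
    [1, None, 2, None, 3],
    [None, "a", None, "b", None],
]


def dense_union_get(i):
    """Resolve slot i through the type id and then the offset."""
    return dense_children[type_ids[i]][dense_offsets[i]]


def sparse_union_get(i):
    """Resolve slot i through the type id only."""
    return sparse_children[type_ids[i]][i]


dense_values = [dense_union_get(i) for i in range(len(type_ids))]
sparse_values = [sparse_union_get(i) for i in range(len(type_ids))]
```

Both lookups recover the same logical sequence; the trade-off is that the dense form spends an extra offsets buffer to keep the children compact.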

Re: [DISCUSS] Move sqlparser-rs back into DataFusion project?

2024-02-27 Thread Aldrin
Maybe it would be valuable to more explicitly define "moving back into DataFusion project". I assumed it meant absorbing into the datafusion repo, but it occurs to me that may not be the case. Then, how would sqlparser-rs be "moved"? # ---

Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-02-27 Thread Aldrin
feedback. I glanced at the document before but I'll go through again to see if there is anything I can comment on. # -- # Aldrin https://github.com/drin/ https://gitlab.com/octalene https://keybase.io/octalene On Tuesday, February 27th, 2024 at 17:43, Paul

Re: [DISCUSS] Move sqlparser-rs back into DataFusion project?

2024-02-17 Thread Aldrin

Re: Is there a way we can read a data frame from a cpp program in Apache fusion program in Rust?

2024-02-08 Thread Aldrin
/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.read_csv [4]: https://arrow.apache.org/datafusion/library-user-guide/custom-table-providers.html # -- # Aldrin https://github.com/drin/ https://gitlab.com/octalene https://keybase.io

Re: [VOTE] Flight SQL as experimental

2023-12-08 Thread Aldrin

Re: Is there anyway to resize record batches

2023-11-22 Thread Aldrin
implementations since ChunkedArray is not part of the specification, though I am optimistic that if you pass ChunkedArray to a different implementation then the C++ implementation could consolidate it as a single Array. # -- # Aldrin https://github.com/drin

Re: Is there anyway to resize record batches

2023-11-22 Thread Aldrin
#_CPPv4N5arrow16TableBatchReaderE [8]: https://arrow.apache.org/docs/cpp/compute.html#selections # -- # Aldrin https://github.com/drin/ https://gitlab.com/octalene https://keybase.io/octalene On Wednesday, November 22nd, 2023 at 10:58, Jacek Pliszka wrote: > Hi! > > I t

Re: [Format] C Data Interface integration testing

2023-10-19 Thread Aldrin
try the unsubscribe link at [1]. [1]: https://arrow.apache.org/community/ Sent from Proton Mail for iOS On Thu, Oct 19, 2023 at 23:41, Richard Haven wrote: UNSUBSCRIBEBAJARSEANFOSGRIFIADОТПИШИHLOKOMELA On Thu, Oct 19, 2023 at 9:56 AM Antoine Pitrou wrote: >> Hello

Re: Apache Arrow file format

2023-10-19 Thread Aldrin
And the first paper's reference of arrow (in the references section) lists 2022 as the date of last access. Sent from Proton Mail for iOS On Thu, Oct 19, 2023 at 18:51, Aldrin <octalene@pm.me.INVALID> wrote: For context, that second referenced paper has Wes McKinney as a co-auth

Re: Apache Arrow file format

2023-10-19 Thread Aldrin
For context, that second referenced paper has Wes McKinney as a co-author, so they were much better positioned to say "the right things." Sent from Proton Mail for iOS On Thu, Oct 19, 2023 at 18:38, Jin Shang wrote: Honestly I don't understand why this VLDB paper [1]

Re: [DISCUSS][C++] Raw pointer string views

2023-09-27 Thread Aldrin
convert any type to a raw pointer I assume that internal representations are not problematic. But, even so, perhaps those benchmarks can be reused to do the comparison (if that helps reduce the amount of work to be done for Ben). -Aldrin Sent from Proton Mail for iOS On Wed, Sep 27, 2023 at 15:12

Re: Need help on ArrayaSpan and writing C++ udf

2023-07-17 Thread Aldrin
Oh wait, I see now that you're incrementing with a uint8_t*. That could be fine for your own use, but you might want to make sure it aligns with the type of your output (Int64Array vs Int32Array). Sent from Proton Mail for iOS On Mon, Jul 17, 2023 at 06:20, Aldrin <octalene@pm.me.INVA
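
The concern about stepping through a buffer with a byte pointer can be illustrated with a small Python sketch (standard library only, not Arrow code). It assumes a hypothetical output buffer holding three little-endian int64 values and reads it back with two different strides; the buffer contents and the helper name read_with_stride are made up for this illustration.

```python
import struct

# Hypothetical output buffer: three little-endian int64 values.
buf = struct.pack("<3q", 1, 2, 3)


def read_with_stride(data, fmt):
    """Walk the byte buffer in steps of the element width implied by fmt."""
    width = struct.calcsize(fmt)
    return [
        struct.unpack_from(fmt, data, pos)[0]
        for pos in range(0, len(data), width)
    ]


as_int64 = read_with_stride(buf, "<q")  # 8-byte stride matches the data
as_int32 = read_with_stride(buf, "<i")  # 4-byte stride splits each value
```

With the matching 8-byte stride the three values come back intact; with a 4-byte stride each value is split into a low and a high half, which is the kind of mismatch to watch for when the pointer increment and the output array type disagree.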

Re: Need help on ArrayaSpan and writing C++ udf

2023-07-17 Thread Aldrin
Hi Wenbo, An ArraySpan is like an ArrayData but does not own the data, so the ColumnarFormat doc that Jon shared is relevant for both. In the case of a binary format, the output ArraySpan must have at least 2 buffers: the offsets and the contiguous binary data (values). If the output of your UDF
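
The two buffers described in this reply can be sketched in plain Python (standard library only, not the Arrow C++ API). This is a minimal illustration, assuming 32-bit offsets and leaving out the validity bitmap that the columnar format also defines for this layout; the helper names build_binary_buffers and read_binary are hypothetical.

```python
from array import array


def build_binary_buffers(items):
    """Pack byte strings into an offsets buffer and one contiguous values buffer."""
    offsets = array("i", [0])  # n + 1 entries; entry i is where item i starts
    values = bytearray()
    for item in items:
        values += item
        offsets.append(len(values))  # end of item i, which is the start of item i + 1
    return offsets, bytes(values)


def read_binary(offsets, values, i):
    """Slice item i back out of the contiguous values buffer."""
    return values[offsets[i]:offsets[i + 1]]


offsets, values = build_binary_buffers([b"ab", b"", b"cde"])
second_item = read_binary(offsets, values, 1)
third_item = read_binary(offsets, values, 2)
```

An empty item shows up as two equal consecutive offsets, and the last offset equals the total length of the values buffer.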

Re: [DISCUSS] Canonical alternative layout proposal

2023-07-13 Thread Aldrin
without having to prove out the benefits for libraries that >use a different tech stack (e.g. rust vs C++ vs go). [1]: https://docs.google.com/presentation/d/1EiBgwtoYW6ADTxFc9iRs8KLPV0st0GZqmGy40Uz8jPk/edit?usp=sharing # ------ # Aldrin https://github.com/dri

Re: Apache Arrow | Graph Algorithms & Data Structures

2023-06-30 Thread Aldrin
djacency lists or if you're using a more normalized relational format. Thanks! # ------ # Aldrin https://github.com/drin/ https://gitlab.com/octalene https://keybase.io/octalene

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-20 Thread Aldrin
I don't feel like this representation is necessarily a detail of the query engine, but I am also not sure why this representation would have to be converted to a non-view format when serializing. Could you clarify that? My impression is that this representation could be used for persistence or

Re: [DISCUSS][C++] How to run arrow-dataset-dataset-writer-test

2023-04-07 Thread Aldrin
tself is working or if there's something in your configuration that's wrong. I can show more direct examples once I update my environment. Aldrin Montana Computer Science PhD Student UC Santa Cruz On Fri, Apr 7, 2023 at 7:34 AM Haocheng Liu wrote: > Hi, > > I'm new to arrow development and

Re: Proposal: add a bot to close PRs that haven't been updated in 30 days

2023-03-31 Thread Aldrin
a draft PR? In general I agree with the general direction of the discussion otherwise. Aldrin Montana Computer Science PhD Student UC Santa Cruz On Fri, Mar 31, 2023 at 7:49 AM Will Jones wrote: > > Also good to know: contributors apparently can't re-open PRs if it was > > closed by

Re: [ANNOUNCE] New Arrow PMC member: Will Jones

2023-03-13 Thread Aldrin
Congrats Will!! Aldrin Montana Computer Science PhD Student UC Santa Cruz On Mon, Mar 13, 2023 at 11:13 AM Dewey Dunnington wrote: > Congrats, Will! > > On Mon, Mar 13, 2023 at 3:07 PM Matt Topol wrote: > > > > Congrats Will! > > > > On Mon, Mar 13, 2023, 2

Re: [DISCUSS] Acero roadmap / philosophy

2023-03-09 Thread Aldrin
as valuable (should be prioritized) or if additional support is going to be "as-needed". Note that I have a minimal understanding of how "large" substrait is and what proportion of it is already supported by Acero. Aldrin Montana Computer Science PhD Student UC Santa Cruz On Thu, Mar

Re: Question about memory usage and type casting using pyarrow Table

2023-02-15 Thread Aldrin
]: https://arrow.apache.org/docs/python/generated/pyarrow.Field.html#pyarrow.Field.with_metadata Aldrin Montana Computer Science PhD Student UC Santa Cruz On Wed, Feb 15, 2023 at 2:52 PM Li Jin wrote: > Oh thanks that could be a workaround! I thought pa tables are supposed to > be imm

Re: [FLIGHT] Question about Flight Protocol Usage

2023-02-03 Thread Aldrin
out, your main concern should probably be protocol compatibility. If you will have control of the client side of communications, then I think there are minimal concerns other than how you design what a Ticket or FlightInfo contains. Aldrin Montana Computer Science PhD Student UC Santa Cruz On Fri

Re: [DISCUSS][C++] C++ API as a user-facing API

2022-09-29 Thread Aldrin
especially while Arrow is still growing. In addition, if I want to contribute to Arrow, I would also need to interact with the lower-level API at some point and I wouldn't necessarily want to start with trying to contribute code before using it in my own project(s). Aldrin Montana Computer Scienc

Re: [ANNOUNCE] New Arrow PMC member: Weston Pace

2022-09-06 Thread Aldrin
awesome, congrats! Aldrin Montana Computer Science PhD Student UC Santa Cruz On Tue, Sep 6, 2022 at 6:10 AM Joris Van den Bossche < jorisvandenboss...@gmail.com> wrote: > Congrats Weston! It is great to have you on the team! > > On Tue, 6 Sept 2022 at 06:10, Weston Pace wrote:

Re: Usage of the name Feather?

2022-08-31 Thread Aldrin
e "IPC" is necessary, but it does push the intent into the name (unless it's actually a misnomer). Aldrin Montana Computer Science PhD Student UC Santa Cruz On Tue, Aug 30, 2022 at 8:29 PM Micah Kornfield wrote: > I think one source of ambiguity for Arrow files, at least for me, is

Re: Using Acero in a distributed environment

2022-08-31 Thread Aldrin
/presentation/d/1Nollf087CRhMmEAWcwfudIizIhF-ttPRGgaqmuXtSBQ/edit#slide=id.g12c2952ca0d_0_67 Aldrin Montana Computer Science PhD Student UC Santa Cruz On Wed, Aug 31, 2022 at 10:29 AM Jayjeet Chakraborty < jayjeetchakrabort...@gmail.com> wrote: > Thanks a lot for your reply, Niranda a

Re: [C++] Read Flight data source into Acero

2022-08-17 Thread Aldrin
I don't have any pointers, but just wanted to mention that I am going to try and figure this out quite a bit in the next week. I can try to create some relevant cookbook recipes as I plod along. Aldrin Montana Computer Science PhD Student UC Santa Cruz On Wed, Aug 17, 2022 at 9:15 AM Li Jin

Re: [C++] Disable anonymous namespaces in debug mode

2022-08-12 Thread Aldrin
ooh, that seems like a good idea to me. I'd be happy to follow that style. Aldrin Montana Computer Science PhD Student UC Santa Cruz On Wed, Aug 10, 2022 at 4:21 PM Sasha Krassovsky wrote: > Hi everyone, > I've recently had quite a few pain points while debugging due to the use of >

Re: [Rust] IPC Format / Feather support in Datafusion

2022-07-25 Thread Aldrin
oh, perfect. I'll just link the JIRAs. Thanks Kou! Aldrin Montana Computer Science PhD Student UC Santa Cruz On Mon, Jul 25, 2022 at 1:53 PM Sutou Kouhei wrote: > Hi, > > https://issues.apache.org/jira/browse/ARROW-17092 may be > related. > > Thanks, > -- > kou >

Re: [Rust] IPC Format / Feather support in Datafusion

2022-07-25 Thread Aldrin
://arrow.apache.org/docs/format/Columnar.html#ipc-file-format [3]: https://arrow.apache.org/docs/cpp/ipc.html Aldrin Montana Computer Science PhD Student UC Santa Cruz On Fri, Jul 22, 2022 at 2:46 PM Will Jones wrote: > FYI It looks like there is active work to change the Python [1] and R

Re: [Rust] IPC Format / Feather support in Datafusion

2022-07-22 Thread Aldrin
sorry, I meant "...especially *for* the rust community if they are just using IPC directly for file formats." Aldrin Montana Computer Science PhD Student UC Santa Cruz On Fri, Jul 22, 2022 at 11:14 AM Aldrin wrote: > I always assumed IPC was when it was in memory, fea

Re: [Rust] IPC Format / Feather support in Datafusion

2022-07-22 Thread Aldrin
since V2. I'm not sure if a feather V3 would ever diverge from IPC format or if feather adds anything that's more filesystem friendly (versus other storage system interfaces) or makes filesystem performance more predictable. Aldrin Montana Computer Science PhD Student UC Santa Cruz On Fri, Jul 22

Re: arrow usage

2022-06-29 Thread Aldrin
table.html#_CPPv4N5arrow17ConcatenateTablesERKNSt6vectorINSt10shared_ptrI5Table24ConcatenateTablesOptionsP10MemoryPool Aldrin Montana Computer Science PhD Student UC Santa Cruz On Wed, Jun 29, 2022 at 9:53 AM L Ait wrote: > Hi, > > I would like to be added to the mailing list and would like it if there is > some dedicated forum to ask some questions. > > I would lik

Re: Arrow FunctionRegsitry usage in Python

2022-06-23 Thread Aldrin
done. [1]: https://arrow.apache.org/docs/cpp/compute.html#invoking-functions Aldrin Montana Computer Science PhD Student UC Santa Cruz On Wed, Jun 22, 2022 at 12:34 PM Murali S wrote: > Hi , > > I was wondering if it is possible to add a C++ Function to the Compute > Functi

Re: vectorized processing for arrow::take()

2022-06-23 Thread Aldrin
instructions? I think a little bit more context about what you know and what you're trying to do could also help others who know more about this function (and vectorization in Arrow in general) to chime in. Aldrin Montana Computer Science PhD Student UC Santa Cruz On Thu, Jun 23, 2022 at 12:41

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-20 Thread Aldrin
"C++" can be inserted ("A C++ compute...") Aldrin Montana Computer Science PhD Student UC Santa Cruz On Thu, May 19, 2022 at 6:07 PM Will Jones wrote: > > > > A relatively obscure name at least makes it easy to search for. I guess > > we'll want to w

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-09 Thread Aldrin
in that vein, I feel like you could also say that "ACE" has an "an" prefix to deflect the connotation of primacy: - An Arrow Compute Engine - An Arrow C++ Compute Engine Aldrin Montana Computer Science PhD Student UC Santa Cruz On Mon, May 9, 2022 at 2:12 PM Ian Cook

Re: what is the default batch size of the RecordBatchReader?

2022-04-25 Thread Aldrin
[1]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/parquet/properties.h#L556 [2]: https://arrow.apache.org/docs/cpp/api/dataset.html#_CPPv4N5arrow7dataset7Scanner7ToTableEv Aldrin Montana Computer Science PhD Student UC Santa Cruz On Mon, Apr 25, 2022 at 3:05 AM 1057445597 <1057445...@q

Re: storing per record batch metadata in arrow IPC file

2022-04-05 Thread Aldrin
lob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L644 [3]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L665 [4]: https://github.com/apache/arrow/blob/apache-arrow-7.0.0/cpp/src/arrow/ipc/writer.cc#L1253 Aldrin Montana Computer Science PhD Student UC Santa Cruz

Re: Recompiling pyarrow package without static libraries

2022-02-14 Thread Aldrin
Thanks for the response! I'll try that out. It didn't occur to me that archlinux might be building the static libraries yet not installing them (and/or removing them). I'll check a few things and report back here what works. Aldrin Montana Computer Science PhD Student UC Santa Cruz On Fri, Feb

Recompiling pyarrow package without static libraries

2022-02-11 Thread Aldrin
ON=ON \ -DARROW_SIMD_LEVEL=AVX2 \ -DARROW_USE_GLOG=ON \ -DARROW_WITH_BROTLI=ON \ -DPARQUET_REQUIRE_ENCRYPTION=ON make -C build Thank you for any help you can offer! Aldrin Montana Computer Science PhD Student UC Santa Cruz

Re: Jira Access

2021-12-22 Thread Aldrin
I think you just sign up: https://issues.apache.org/jira/secure/Dashboard.jspa Aldrin Montana Computer Science PhD Student UC Santa Cruz On Wed, Dec 22, 2021 at 9:08 PM Dulvin Witharane wrote: > Hi, > > I would love to have access to JIRA. Please enroll me or let me know the >

Re: [Parquet][C++][Python] Maximum Row Group Length Default

2021-11-22 Thread Aldrin
t of time parsing metadata and > much less time actually reading data. Thanks! > -- Aldrin Montana Computer Science PhD Student UC Santa Cruz

Re: [DISCUSS] Deprecate user@ in favor for github issues/discussions

2021-10-05 Thread Aldrin
> > How about trying GitHub issues and/or discussion in a > specified period without deprecating user@? e.g. between > 6.0.0 release and 7.0.0 release. Oooh, I like this idea. Aldrin Montana Computer Science PhD Student UC Santa Cruz On Mon, Oct 4, 2021 at 7:11 PM Sutou Kouhei

Re: [DISCUSS] Deprecate user@ in favor for github issues/discussions

2021-09-29 Thread Aldrin
degree, though, the ease of searching should mitigate this if people are properly cross-referencing as appropriate. But, I'm not entirely sure what this would be problematic for. Aldrin Montana Computer Science PhD Student UC Santa Cruz On Wed, Sep 29, 2021 at 11:16 AM Micah Kornfield wrote

Re: Flight SQL

2021-08-19 Thread Aldrin
tion is prohibited. If you are not the > intended recipient, please contact the sender by reply email and destroy > all copies of the original message. Thank you. > -- Aldrin Montana Computer Science PhD Student UC Santa Cruz

Re: [ANNOUNCE] New Arrow PMC member: David M Li

2021-06-22 Thread Aldrin
Congrats David! Thanks for the contributions to documentation, it's pretty awesome. :) Aldrin Montana Computer Science PhD Student UC Santa Cruz On Tue, Jun 22, 2021 at 10:55 AM Daniël Heres wrote: > Congrats to you! > > On Tue, Jun 22, 2021, 19:42 Eduardo Ponce wrote: > >

Re: Long title on github page

2021-05-18 Thread Aldrin
art of the interface for efficiency - Arrow certainly has a data format, but that format is the crux of the interface (IMO). However, it also makes using other formats easy (via filesystem API and parquet reader/writers, etc.). So, focusing on the data format seems unnecessary in such a terse d

Re: New style in documentation on the website looks great

2021-05-05 Thread Aldrin
I very much enjoy the new theme Aldrin Montana Computer Science PhD Student UC Santa Cruz On Tue, May 4, 2021 at 11:47 PM Joris Van den Bossche < jorisvandenboss...@gmail.com> wrote: > Thanks, I am happy that people like it! > It's a slightly customized version of the pydata-

Re: [Rust][DataFusion] Query Engine Design / DataFusion Implementation talk

2021-03-12 Thread Aldrin
This is great, thanks! Aldrin Montana Computer Science PhD Student UC Santa Cruz On Fri, Mar 12, 2021 at 11:39 AM Andrew Lamb wrote: > Here are links to the content, should anyone be interested: > > Query Engine Design and the Rust-Based DataFusion in Apache Arrow > reco

Re: Question about joining two tables

2021-03-11 Thread Aldrin
Great, thanks for the responses! That all makes sense :) On Thu, Mar 11, 2021 at 1:29 PM Benjamin Kietzman wrote: > Hi Aldrin, > > We don't have a unified repository for design docs that I'm aware of. > Governance-wise only JIRA and the mailing lists are canonical, but > IIU

Re: Question about joining two tables

2021-03-11 Thread Aldrin
to navigate to a google drive or a page enumerating the various documents. Thank you! Aldrin Montana Computer Science PhD Student UC Santa Cruz On Thu, Mar 11, 2021 at 10:07 AM Benjamin Kietzman wrote: > Hi, > > This is not yet implemented but it is on the roadmap for the near future:

Documenting the dataset/compute/expression APIs

2021-02-12 Thread Aldrin
" OR description ~ "expression") Specifically, I'm interested in C++ rather than python (though, I suppose pyarrow documentation can help with the C++ documentation?). I wanted to ping here in case anyone has materials to gather, and also in case anyone knows of materials I've missed. Thanks! Aldrin Montana Computer Science PhD Student UC Santa Cruz

Re: Computational Kernels: the project overview

2021-01-29 Thread Aldrin
in (or completed) consolidating or expanding documentation on the compute and dataset/expression APIs and how they interact, etc.? Thanks! Aldrin Montana Computer Science PhD Student UC Santa Cruz On Mon, Nov 30, 2020 at 7:40 AM Wes McKinney wrote: > One objective of the precompiled kernels proj

[jira] [Created] (ARROW-2683) Resource Warning (Unclosed File) when using pyarrow.parquet.read_table()

2018-06-08 Thread Aldrin (JIRA)
Aldrin created ARROW-2683: - Summary: Resource Warning (Unclosed File) when using pyarrow.parquet.read_table() Key: ARROW-2683 URL: https://issues.apache.org/jira/browse/ARROW-2683 Project: Apache Arrow