Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Alessandro Molina
I brought it up on GitHub, but I'm writing here too to avoid spawning too many threads. https://github.com/apache/arrow/issues/38837#issuecomment-2145343755 It's not something we have to address now, but it would be great if we could design a solution that can be extended in the future to add Par-Batc

Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-04-08 Thread Alessandro Molina
On Sun, Apr 7, 2024 at 3:06 PM Andrew Lamb wrote: > > We have had separate releases / votes for Arrow Rust (and Arrow DataFusion) > and it has served us quite well. The version schemes have diverged > substantially from the monorepo (we are on version 51.0.0 in arrow-rs, for > example) and it doe

Re: [DISCUSS] Conventions for transporting Arrow data over HTTP

2024-02-02 Thread Alessandro Molina
On Wed, Dec 6, 2023 at 7:45 PM Ian Cook wrote: > > I am interested to hear more perspectives on this. My perspective is > that we should recommend using HTTP conventions to keep clean > separation between the Arrow-formatted binary data payloads and the > various application-specific fields. This

Re: Proposal: add a bot to close PRs that haven't been updated in 30 days

2023-03-31 Thread Alessandro Molina
I think that marking them as drafts could be a good way to reduce the load on people who have to review PRs; drafts can easily be filtered out in GitHub searches. > I am personally not a huge fan of auto-closing PRs. Especially not > after a short period like 30 days (I think that's too short for

Re: Plasma will be removed in Arrow 12.0.0

2023-03-17 Thread Alessandro Molina
How does PyArrow cope with multiprocessing.Manager? I remember there were some inefficiencies when Pickle was used (mostly related to slicing) but that in theory it should work. That is probably an easy enough replacement for Plasma and is standard. On Wed, Mar 15, 2023 at 10:24 PM Will Jones wro

Re: [VOTE] Disable ASF Jira issue reporting

2022-11-25 Thread Alessandro Molina
+1, as long as by "now" we actually mean "as soon as the necessary scripts have been ported to GitHub". I doubt the plan is to disable Jira before we can actually ship PRs from GitHub issues, which would block development. On Wed, Nov 23, 2022, 22:37 Todd Farmer wrote: > Hello, > > I wou

Re: Parser for ExecPlans

2022-11-08 Thread Alessandro Molina
To be honest, I find this YAML-based representation a bit confusing because the parameters of functions are unclear. In your specific example you have a JOIN taking two sources as its inputs. But how do I know that the two sources are meant to be inputs to the join? And not only that the last source is

Re: [DISCUSS] Move issue tracking to

2022-10-25 Thread Alessandro Molina
On Tue, Oct 25, 2022 at 1:55 AM Joris Van den Bossche < jorisvandenboss...@gmail.com> wrote: > > I think the main thing we will miss are the Links (relation between > issues), but we can try to promote some consistent usage of adding > "Duplicate of #...", "Related to #..." in top post of an issue

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-08 Thread Alessandro Molina
RLE would probably have some benefits that it makes sense to evaluate. I would personally go in the direction of having a minimal benchmarking suite for some of the cases where we expect to see the most benefit (i.e. filtering) so we can discuss with real numbers. Also, the currently proposed format d

Re: Arrow sync call May 11 at 12:00 US/Eastern, 16:00 UTC

2022-05-13 Thread Alessandro Molina
I think Arrow should definitely consider adding a DataFrame-like API. There are multiple reasons why exposing Arrow to end users instead of restricting it to developers of frameworks would be beneficial for the Arrow project itself. A rough approximation of a DataFrame-like API has been growing duri

Re: [DISCUSS][C++][Python]Switch default mmap behaviour to off

2022-05-11 Thread Alessandro Molina
As far as I understood, the idea is not to fully remove memory mapping, just to turn the current mmap=True default arguments into mmap=False. The goal is mostly to provide consistent behaviour for end users. At the moment users might face very different performance when they read locally or on a networ

Re: [DISC] (Python) Dropping support for manylinux2010

2022-05-05 Thread Alessandro Molina
non-binding +1 On Thu, May 5, 2022 at 1:02 PM Jacob Wujciak wrote: > Hi all, > > I would like to propose that we drop support for manylinux2010. > > CentOS 6, on which the manylinux2010 image is based, has been EOL for over > two years [1]. > There is now also an official announcement by pypa t

Re: [DISC] (Java) Add Windows binaries to Maven packages

2022-05-04 Thread Alessandro Molina
The proposal seems reasonable to me, we should do our best at providing users the same experience on the various systems whenever possible. As long as we don't receive complaints about the package size, I think we can live with it. If it becomes a problem for our users, we can always make per-syst

Re: Arrow sync call March 2 at 12:00 US/Eastern, 17:00 UTC

2022-03-02 Thread Alessandro Molina
Attendees: Alessandro Molina Micah Kornfield David Li Joris Van Den Bossche Discussion: Flight SQL Optimization for Small Results - Reference to https://databricks.com/blog/2021/08/11/how-we-achieved-high-bandwidth-connectivity-with-bi-tools.html - Building directly in Flight as

Re: [Discuss] Best practice for storing key-value metadata for Extension Types

2022-02-10 Thread Alessandro Molina
Mentioned this already to Joris, but want to make sure we don't miss it. C-Data and thus ARROW:extension:metadata was mostly designed for shipping data to different processes within the same host. If we start using the spec for further uses, including saving it to files that could be read across d

Re: Release 7.0.0 Retrospective

2022-02-02 Thread Alessandro Molina
ing tomorrow. > > Ian > > > On Feb 1, 2022, at 9:23 AM, Alessandro Molina < > alessan...@ursacomputing.com> wrote: > > > > For anyone interested in the topic, I got some feedback suggesting it > > might be more effective to have a meeting dedicated to the

Re: Release 7.0.0 Retrospective

2022-02-01 Thread Alessandro Molina
have been involved in preparing release 7.0.0 itself so that it can then be discussed at the biweekly. On Tue, Feb 1, 2022 at 11:20 AM Alessandro Molina < alessan...@ursacomputing.com> wrote: > Given the unexpected number of tries we had to go through to publish > version 7 (I don't

Re: Managing usage of the @ApacheArrow Twitter handle and other social media

2022-02-01 Thread Alessandro Molina
I have never used https://github.com/gr2m/twitter-together before; in the past I used Hootsuite to set up approval workflows. Still, setting up a workflow through GitHub PRs looks like a good idea. It would be able to leverage committer/pmc membership to merge the PRs and would

Release 7.0.0 Retrospective

2022-02-01 Thread Alessandro Molina
Given the unexpected number of tries we had to go through to publish version 7 (I don't think there were past cases where RC10 was reached), it would be helpful to go through what happened, what didn't work and what we can do to prevent it from happening again in the future. I created a meeting fo

Re: Preparing for version 7.0.0 release

2022-01-13 Thread Alessandro Molina
n Tue, Jan 4, 2022 at 3:27 PM Alessandro Molina < alessan...@ursacomputing.com> wrote: > Quick note that all "Unassigned" issues that were not already started have > been moved to 8.0.0. > End of next week I'll do another pass and move all "Improvements/New >

Re: [RUST] Preparing for 7.0.0 release

2022-01-13 Thread Alessandro Molina
Hi Andrew, just wanted to update you on the fact that the skeleton for v7.0.0 blog post has been created, so you can freely make changes in that PR. https://github.com/apache/arrow-site/pull/178/files On Fri, Jan 7, 2022 at 12:20 AM Andrew Lamb wrote: > Greetings, fellow Rustaceans, and happy N

Re: Preparing for version 7.0.0 release

2022-01-04 Thread Alessandro Molina
> > On 03/01/2022 at 15:44, Alessandro Molina wrote: > > The plan seems to be to cut a release the 2nd or 3rd week of January, a > new > > confluence page was made to track progress of the release ( > > https://cwiki.apache.org/confluence/display/ARROW/Arrow+7.0.0+Releas

Preparing for version 7.0.0 release

2022-01-03 Thread Alessandro Molina
The plan seems to be to cut a release the 2nd or 3rd week of January; a new Confluence page was made to track progress of the release ( https://cwiki.apache.org/confluence/display/ARROW/Arrow+7.0.0+Release ). It would greatly help in the process of preparing for the release if you could review tic

Re: [VOTE] Release Apache Arrow 6.0.1 - RC1

2021-11-25 Thread Alessandro Molina
For anyone willing to give a final check and merge the PR ( https://github.com/apache/arrow-site/pull/165/files ), I think that the blog post is good to go and hasn't got any new change in a few days On Fri, Nov 19, 2021 at 1:35 PM Alessandro Molina < alessan...@ursacomputing.com> wr

Re: [VOTE] Release Apache Arrow 6.0.1 - RC1

2021-11-19 Thread Alessandro Molina
For anyone interested I created the skeleton for the announcement blog post at https://github.com/apache/arrow-site/pull/165/files As it's a fairly small release I'll try to capture the major changes, but feel free to add or edit the blog post as you see fit through the usual commit suggestions O

Re: Question about Arrow Mutable/Immutable Arrays choice

2021-11-04 Thread Alessandro Molina
On Wed, Nov 3, 2021 at 11:34 PM Jacques Nadeau wrote: > In a perfect world we would have done a better job in the object > hierarchy/behavior of making this explicit but we don't live in that world, > unfortunately. Makes sense, but I thought that was exactly the reason why set/setSafe are onl

Question about Arrow Mutable/Immutable Arrays choice

2021-11-03 Thread Alessandro Molina
I recently noticed that in the Java implementation we expose a set/setSafe function that allows mutating Arrow Arrays [1]. This seems to be at odds with the general design of the C++ (and by consequence Python and R) library, where Arrays are immutable and can be modified only through compute funct

Re: [VOTE] Release Apache Arrow 6.0.0 - RC3

2021-10-22 Thread Alessandro Molina
+1 (non-binding) Verified on Mac OS 10.14 x86 Checked dev/release/verify-release-candidate.sh binaries 6.0.0 3 dev/release/verify-release-candidate.sh wheels 6.0.0 3 One note: I initially got an "OSError: [Errno 24] Too many open files" error and had to raise the limit on open files. I don't kno
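
For readers who hit the same "Too many open files" error, here is a minimal sketch of raising the per-process open-file limit from Python on a Unix-like system. It is not taken from the verification script; the helper name `raise_open_file_limit` and the default of 4096 are illustrative assumptions, and the shell equivalent would normally be `ulimit -n`.

```python
import resource


def raise_open_file_limit(wanted: int = 4096) -> int:
    """Raise the soft RLIMIT_NOFILE towards `wanted` (hypothetical helper).

    The soft limit can never exceed the hard limit for an unprivileged
    process, so the requested value is clamped to the hard limit first.
    Returns the soft limit in effect after the call.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    # Only touch the limit when it is finite and actually lower than requested.
    if soft != resource.RLIM_INFINITY and soft < wanted:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("open-file soft limit is now", raise_open_file_limit())
```

The change only applies to the current process and its children, so it would have to run in the same process (or parent shell) that launches the verification.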

Re: Preparing for release 6.0.0

2021-10-14 Thread Alessandro Molina
to be updated when we actually publish the release. On Thu, Oct 14, 2021 at 10:24 AM Alessandro Molina < alessan...@ursacomputing.com> wrote: > Seems the tentative release date will probably slip to Monday/Tuesday next > week. There has been some delay generated by the release of P

Re: Preparing for release 6.0.0

2021-10-14 Thread Alessandro Molina
the owners could defer to v7.0.0 those that they don't think can close in time for Monday On Mon, Oct 4, 2021 at 1:38 PM Krisztián Szűcs wrote: > Aiming the first release candidate for Oct 14th/15th sounds good to me. > > On Mon, Oct 4, 2021 at 10:35 AM Alessandro Molina >

Re: Preparing for release 6.0.0

2021-10-04 Thread Alessandro Molina
> > > I will tentatively aim to create an arrow-rs 6.0 candidate on October 14 > > or October 15 (assuming it is approved, it would be released on or around > > October 18, 2021). > > > > Please let me know if there are any concerns with this schedule > > An

Preparing for release 6.0.0

2021-10-01 Thread Alessandro Molina
In preparation for release 6.0.0 which should probably happen within the next 2-3 weeks according to the usual release cycle the Confluence page for the release has been created ( https://cwiki.apache.org/confluence/display/ARROW/Arrow+6.0.0+Release ) Also all non Bug issues that were not started

Re: [DISCUSS][Python] Public Cython API

2021-08-25 Thread Alessandro Molina
ludes in Cython On Fri, Aug 20, 2021 at 12:24 PM Alessandro Molina < alessan...@ursacomputing.com> wrote: > While working on https://github.com/apache/arrow/pull/10162 it was raised > the concern that it's hard to change Cython code because it might break > third party librarie

[DISCUSS][Python] Public Cython API

2021-08-20 Thread Alessandro Molina
While working on https://github.com/apache/arrow/pull/10162, the concern was raised that it's hard to change Cython code because it might break third party libraries and projects relying on pyarrow through Cython. Mostly the problem comes from the fact that the documentation suggests pyarrow.lib

Re: [DISCUSS][Python] Making NumPy optional dependency?

2021-08-17 Thread Alessandro Molina
alars) and a much > > simpler one that does not. pyarrow may have to detect at runtime > > whether numpy is in sys.modules to decide whether to import and invoke > > the more complicated function. > > > > On Mon, Aug 16, 2021 at 5:59 PM Alessandro Molina > >
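
The runtime check described in the quoted text can be sketched roughly as follows. This is only an illustration of the `sys.modules` idea, not pyarrow's actual code: the function name `to_python_values` and the `modules` parameter (added so the behaviour can be exercised without NumPy installed) are assumptions. The sketch relies on NumPy scalars being instances of `numpy.generic` and exposing `.item()`.

```python
import sys


def to_python_values(values, modules=None):
    """Convert a sequence to plain Python values (hypothetical helper).

    If NumPy has already been imported by the caller, NumPy scalars are
    unwrapped with .item(); otherwise the simpler path is taken and NumPy
    is never imported. `modules` defaults to sys.modules.
    """
    modules = sys.modules if modules is None else modules
    np = modules.get("numpy")
    if np is None:
        # Simple path: the process never imported numpy, so no value
        # can be a numpy scalar.
        return list(values)
    # More complicated path: numpy is loaded, so unwrap its scalar types.
    return [v.item() if isinstance(v, np.generic) else v for v in values]
```

The design point is that the check costs one dictionary lookup and never triggers an import, so users who do not have NumPy loaded pay nothing for the NumPy-aware path.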

[DISCUSS][Python] Making NumPy optional dependency?

2021-08-16 Thread Alessandro Molina
As Arrow/PyArrow grows more compute functions and features we might move toward a world where the number of users relying on PyArrow without going through Pandas or NumPy might grow. NumPy is a compile time dependency for PyArrow as it's required to compile the C++ code needed to implement the pan

[DISCUSS][Python] Moving Python specific code into pyarrow

2021-08-16 Thread Alessandro Molina
PyArrow is currently a full Cython codebase, but in reality it relies on some classes and functions that are implemented in C++ within the src/python directory ( https://github.com/apache/arrow/tree/master/cpp/src/arrow/python ). Especially for numpy/pandas conversion code that has to interface with

Re: Apache Arrow Cookbook

2021-07-28 Thread Alessandro Molina
re the new documentation gets deployed for 5.0.0 On Tue, Jul 20, 2021 at 12:24 PM Alessandro Molina < alessan...@ursacomputing.com> wrote: > The Pull Request for the Cookbook has been created ( > https://github.com/apache/arrow-cookbook/pull/1 ) > I left as comments in the PR the step

Re: Apache Arrow Cookbook

2021-07-20 Thread Alessandro Molina
> > > On Wed, Jul 14, 2021 at 8:33 AM Alessandro Molina > > wrote: > > > > > > On Tue, Jul 13, 2021 at 2:40 PM Wes McKinney > wrote: > > > > > > > I requested its creation here > > > > > > > > https://github.com/apac

Re: Apache Arrow Cookbook

2021-07-14 Thread Alessandro Molina
On Tue, Jul 13, 2021 at 2:40 PM Wes McKinney wrote: > I requested its creation here > > https://github.com/apache/arrow-cookbook > > If you can set up a PR into this repo (not sure if I need to push an > empty "initial commit" repo, but let me know), Seems your concern was correct, you can't op

Re: [DISCUSS] Should we start marking "feather" as deprecated?

2021-07-14 Thread Alessandro Molina
I think that from the users' point of view it would be helpful to have only one clearly documented glossary and way to do things. At the moment, at least in the Python documentation, it is not very clear what the difference is between feather and ipc.new_file. Deprecating the Feather terminology would surely sol

Re: [DISCUSS] What is the Plasma status currently?

2021-07-14 Thread Alessandro Molina
I was wondering, for the benefit of lowering the entry barrier for users and especially future contributors who might find themselves confused by the number of optional pieces that you can pick when building arrow, would it be reasonable to think of shipping plasma as a separate library? Like arro

Re: Apache Arrow Cookbook

2021-07-13 Thread Alessandro Molina
kbook" repository could also be a place to collect > recipes related to DataFusion. > > Either option is plenty reasonable, though, so feel free to choose > what makes the most sense to you. > > On Thu, Jul 8, 2021 at 12:09 PM Alessandro Molina > wrote: > > > > T

5.0.0 Release and Release Manager

2021-07-08 Thread Alessandro Molina
As mentioned in the biweekly sync call, we are approaching the target date for the 5.0.0 release, which should happen at the end of next week, or worst case the week after. Apart from my usual recommendation to take a look at the TODO Backlog at https://cwiki.apache.org/confluence/display/ARROW/Ar

Re: Apache Arrow Cookbook

2021-07-08 Thread Alessandro Molina
find C++ versions of these recipes very useful. From > > our > > > experience the C++ API is much much harder to deal with and error prone > > > than the R/Python one. > > > > > > Cheers, > > > Rares > > > > > > On Wed, Jul 7, 2021

Re: Apache Arrow Cookbook

2021-07-07 Thread Alessandro Molina
bounds of the community's objectives. > > On Wed, Jul 7, 2021 at 5:52 PM Alessandro Molina > wrote: > > > > We finally have a first preview of the cookbook available for R and > Python, > > for anyone interested the two versions are visible at > >

Re: Apache Arrow Cookbook

2021-07-07 Thread Alessandro Molina
in the dedicated Google Docs ( https://docs.google.com/document/d/1v-jK_9osnLvAnAjLOM_frgzakjFhLpUi8OC0MlKpxzw/edit?ts=60c73189#heading=h.m7fas2talgy5 ) so if you have recipes to suggest feel free to leave comments on that document or suggest edits. On Mon, Jun 21, 2021 at 10:34 AM Alessandro

Re: [python] [iter_batches] Is there any value to an iterator based parquet reader in python?

2021-07-06 Thread Alessandro Molina
I guess that doing it at the Parquet reader level might allow the implementation to better leverage row groups, without the need to keep in memory the whole Table when you are iterating over data. While the current jira issue seems to suggest the implementation for Table once it's already fully ava

Re: Moving "Improvements" and "New Features" to 6.0.0 release

2021-07-05 Thread Alessandro Molina
apache.org/confluence/display/ARROW/Arrow+5.0.0+Release ) On Sat, Jul 3, 2021 at 3:59 AM Weston Pace wrote: > Can you leave the ones marked “in progress” or that have the > pull-request-available label? > > On Thu, Jul 1, 2021 at 11:06 PM Alessandro Molina < > alessan...@ursaco

Moving "Improvements" and "New Features" to 6.0.0 release

2021-07-02 Thread Alessandro Molina
Hi everybody, Given that the expected time for release 5.0.0 is approaching and there are 160+ Jira issues assigned to that release ( https://cwiki.apache.org/confluence/display/ARROW/Arrow+5.0.0+Release ) I'd like to propose to do some cleanup of the TODO by bulk moving all 5.0.0 jira issues fla

Re: [Format] Bounded numbers?

2021-06-22 Thread Alessandro Molina
On Tue, Jun 22, 2021 at 12:27 PM Antoine Pitrou wrote: > On Mon, 21 Jun 2021 23:50:29 -0400 > Ying Zhou wrote: > > Hi, > > > > In data people use there are often bounded numbers, mostly integers with > clear and fixed upper and lower bounds but also decimals and floats as well > e.g. test scores

Apache Arrow Cookbook

2021-06-21 Thread Alessandro Molina
Hi, I'd like to share with the ML an idea which Nic Crane and I have been experimenting with. It's still at an early stage, but we hope to turn it into a PR for Arrow documentation soon. The idea is to work on a Cookbook, a collection of ready-made recipes on how to use Arrow that both end use

Re: [Discuss] [Proposal] [C++] Arrow multithreaded stress test suite

2021-05-19 Thread Alessandro Molina
Another approach that could reduce the amount of heavy tests that we have to write (if the tests are written in Python) might be to drive the code to interleave in the ways we feel might introduce problems. Such an approach can be performed by introducing explicit breakpoints in the code and starti
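
One way to picture the "explicit breakpoints" idea is the sketch below, which forces two threads into a specific interleaving with `threading.Event` instead of relying on a heavy stress run. It is an illustration under assumptions, not the approach proposed in the thread: the `Counter` class, its `hook` parameter, and `run_forced_interleaving` are all hypothetical names.

```python
import threading


class Counter:
    """A deliberately racy counter with a hook between its read and write."""

    def __init__(self):
        self.value = 0

    def increment(self, hook=None):
        current = self.value          # read
        if hook is not None:
            hook()                    # explicit breakpoint for tests
        self.value = current + 1      # write


def run_forced_interleaving():
    """Drive thread B to run entirely between thread A's read and write."""
    counter = Counter()
    a_has_read = threading.Event()
    b_has_finished = threading.Event()

    def pause_a():
        a_has_read.set()                  # tell B that A is parked mid-update
        b_has_finished.wait(timeout=5)    # park A until B is done

    def thread_a():
        counter.increment(hook=pause_a)

    def thread_b():
        a_has_read.wait(timeout=5)        # start only once A has read
        counter.increment()
        b_has_finished.set()

    threads = [threading.Thread(target=thread_a),
               threading.Thread(target=thread_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With this interleaving both threads read 0 before either writes, so the counter ends at 1 rather than 2: the lost update is reproduced deterministically in a single run.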

Re: Pyarrow RecordBatchStreamWriter and dictionaries

2021-05-03 Thread Alessandro Molina
Hi Radu, I tried to reproduce the issue you described, but I was unable to. Could you provide an example of how you built the Table? I tried reproducing it with a table with the following schema pa.schema([ pa.field('nums', pa.list_(pa.int32())), pa.field('chars', pa.list_

Re: [Python] Who has been able to use PyArrow 4.0.0?

2021-04-28 Thread Alessandro Molina
Are you sure you haven't installed `libarrow` (the C++ one) manually, independently from pyarrow? Your traceback shows that the symbol was not found in "/usr/local/lib/libarrow.400.dylib", and that smells like an independently installed libarrow, as the libarrow provided by pyarrow shoul

Re: [DISCUSS] [Rust] Python-datafusion

2021-04-26 Thread Alessandro Molina
Would "incorporate" mean that the codebase is moved into the arrow repository or is the plan to keep a separate repository for datafusion-python but under the apache org? On Sun, Apr 25, 2021 at 10:40 PM Daniël Heres wrote: > Hi Jorge, > > Awesome, I think this is a super valuable addition and m