Re: [C++] Purpose of C++ bundled dependencies

2022-08-03 Thread Antoine Pitrou
I would welcome trimming down our hand-written dependency bundling and delegating most of the work to vcpkg or conan, but I don't know how usable and flexible those alternatives are. Someone more knowledgeable (probably Kou or perhaps Krisztian?) should answer. (also note that using an extern

Re: [QUESTION] How is mmap implemented for 8bit padded files?

2022-08-03 Thread Antoine Pitrou
nt at runtime? Only if you do things that are alignment-sensitive. That said, while it is formally allowed AFAIK, it probably occurs rarely so potential issues (if any) are probably not surfaced. Best regards Antoine. Best, Jorge On Tue, Aug 2, 2022 at 6:59 PM Antoine Pitrou wrote:

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-03 Thread Antoine Pitrou
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json On Mon, Aug 1, 2022 at 11:39 PM Antoine Pitrou wrote: Le 01/08/2022 à 22:53, Pradeep Gollakota a écrit : Thanks f

Re: Replace conda with mamba in docs?

2022-08-02 Thread Antoine Pitrou
I would hope conda get their act together and improve on this. I have mixed feelings about complicating the documentation with explanations of how mamba is (often? usually?) a better replacement for conda. Generally we should focus on Arrow-specific issues and avoid distracting the user with

Re: [QUESTION] How is mmap implemented for 8bit padded files?

2022-08-02 Thread Antoine Pitrou
Hi Jorge, So there are two aspects to the answer: - ideally, the C++ implementation also works on non-aligned data (though this is poorly tested, if at all) - when mmap'ing a file, you should get a page-aligned address As for int128 and int256, these usually don't exist at the hardware level

Re: [DISCUSS][Format] Starting to do some concrete work on the new "StringView" columnar data type

2022-08-02 Thread Antoine Pitrou
Le 01/08/2022 à 19:13, Wes McKinney a écrit : If we start placing restrictions on how the out-of-line string buffers are managed and externalized, it risks undermining the zero-copy interoperability benefits that we're trying to achieve with this. But embedded pointers in turn undermine zero

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-01 Thread Antoine Pitrou
Le 01/08/2022 à 22:53, Pradeep Gollakota a écrit : Thanks for all the great feedback. To proceed forward, we seem to need decisions around the following: 1. Whether to use arrow extensions or first class types. The consensus is building towards using arrow extensions. +1 2. What do we do

Re: [DISCUSS][Format] Dynamic data encodings in the IPC format and C ABI

2022-08-01 Thread Antoine Pitrou
Potentially extending the IPC format to support these additional flexibilities is the easy part. The difficult part is to shoehorn the newfound flexibility into existing APIs, also leaking into the expectations of downstream users. For example, in C++ it is expected that a RecordBatchRea

Re: [DISCUSS][Format] Starting to do some concrete work on the new "StringView" columnar data type

2022-07-31 Thread Antoine Pitrou
Hi Wes, Le 31/07/2022 à 00:02, Wes McKinney a écrit : I understand there are still some aspects of this project that cause some squeamishness (like having arbitrary memory addresses embedded within array values whose lifetime a C ABI consumer may not know about -- we already export memory add

Re: [ARROW-17255] Logical JSON type in Arrow

2022-07-30 Thread Antoine Pitrou
Le 30/07/2022 à 01:02, Wes McKinney a écrit : I think either path: * Canonical extension type * First-class type in the Type union in Flatbuffers would be OK. The canonical extension type option is the preferable path here, I think, because it allows Arrow implementations without any special

Re: Help needed with PR #13659: Fixing build/unit test issues in msvc/win32

2022-07-22 Thread Antoine Pitrou
This isn't great since library users may have policies that disallow warnings. On Fri., Jul. 22, 2022, 05:47 Antoine Pitrou, wrote: We could perhaps suppress the integer downcast warnings, but only on 32-bit Windows (not 64-bit, not other platforms). Regards Antoine. Le 22/07/2022 à 14:4

Re: Help needed with PR #13659: Fixing build/unit test issues in msvc/win32

2022-07-22 Thread Antoine Pitrou
We could perhaps suppress the integer downcast warnings, but only on 32-bit Windows (not 64-bit, not other platforms). Regards Antoine. Le 22/07/2022 à 14:42, Arkadiy Vertleyb (BLOOMBERG/ 120 PARK) a écrit : Hi James. I don't have strong feelings about whose PR is used and how exactly th

Re: [DISCUSS] Disable dependabot automated PRs

2022-07-21 Thread Antoine Pitrou
+1 for disabling. Le 21/07/2022 à 15:35, Raul Cumplido Dominguez a écrit : Hi, There was a discussion on Zulip dev about disabling dependabot alerts and updates [1] Based on this Apache INFRA wiki page we should be able to disable them [2]. There are currently several open PRs from dependa

Re: [C++] Moving from -O3 to -O2 optimization level in release builds

2022-07-21 Thread Antoine Pitrou
Le 21/07/2022 à 16:34, Wes McKinney a écrit : Based on the discussion in https://github.com/apache/arrow/pull/13661, it seems that one major issue with switching to -O2 is that auto-vectorization (which we rely on in places) and perhaps some other optimization passes would have to be manually

Re: [C++] Adding Run-Length Encoding to Arrow

2022-07-19 Thread Antoine Pitrou
Le 08/07/2022 à 15:19, Wes McKinney a écrit : * I believe that having a Type::RLE is the right approach in C++ and it makes dynamic dispatch everywhere in the library pretty straightforward. +1 on this, as it will raise a nice NotImplemented error for existing code rather than crash or corr

Re: [C++] Help with Parquet backward compatibility regression between 2.0.0 and 3.0.0

2022-07-18 Thread Antoine Pitrou
Le 18/07/2022 à 03:54, Wes McKinney a écrit : This patch caused Parquet files written with 2.0.0 to be unreadable in 3.0.0 onward https://github.com/apache/arrow/commit/ef0feb2c9c959681d8a105cbadc1ae6580789e69 This was reported on June 14 on dev@ and I git-bisected to the root cause: https:/

Re: Proposal: Unassign idle issues

2022-07-12 Thread Antoine Pitrou
On Fri, 8 Jul 2022 09:49:28 -0600 Todd Farmer wrote: > > In summary, here are the actions I propose: > > 1. Establish a threshold at which assigned, idle issues should be > unassigned and comment added. > 2. Define that threshold to be 90 days. > 3. Document the above as a project policy for iss

Re: Undefined symbol error using pyarrow

2022-07-07 Thread Antoine Pitrou
I don't think you need anything more on the PyArrow side, but you need to (re)compile Arrow C++ with ARROW_COMPUTE enabled, is that the case? Le 07/07/2022 à 22:16, Li Jin a écrit : Hello, I am trying to build Arrow/Pyarrow with our internal build system (cmake based) and encounter and e

Re: accessing Substrait protobuf Python classes from PyArrow

2022-07-07 Thread Antoine Pitrou
row/engine/substrait` module in Arrow C++ does this too. If the technical approach I just described would actually expose the classes, what would be a proper way to avoid exposing them? Perhaps the classes should be generated into a private package, e.g., under `python/_ep`? (ep stands for external pr

Re: accessing Substrait protobuf Python classes from PyArrow

2022-07-03 Thread Antoine Pitrou
I agree that giving direct access to protobuf classes is not Arrow's job. You can probably take the upstream (i.e. Substrait's) protobuf definitions and compile them yourself, using whatever settings required by your project. Regards Antoine. Le 03/07/2022 à 21:16, Jeroen van Straten a é

Re: [C++] Kernel function registry evolution

2022-06-29 Thread Antoine Pitrou
a whole chain of scalar functions that all write into preallocated memory can execute without having to touch shared_ptrs or deal with other objects with excess microperformance overhead) where such optimization can happen more easily. On Mon, Jun 6, 2022 at 4:08 AM Antoine Pitrou wrote: Le 06/06/202

Re: [Nightly builds] Crossbow nightly report page announcement + next steps

2022-06-29 Thread Antoine Pitrou
On Mon, 27 Jun 2022 12:46:40 +0200 Raul Cumplido Dominguez wrote: > Hi, > > During the last months there has been some work going on in order to > improve the visibility of our nightly builds, the failures, for how long > have they been failing, etcetera. > > We started by adding some notificati

Re: [ANNOUNCE] New Arrow committers: Dewey Dunnington, Alenka Frim, and Rok Mihevc

2022-06-22 Thread Antoine Pitrou
Welcome to our new committers! Le 22/06/2022 à 20:02, Andrew Lamb a écrit : Congratulations! On Wed, Jun 22, 2022 at 1:27 PM Dragoș Moldovan-Grünfeld < dragos.m...@gmail.com> wrote: Congratulations! Sent from my iPhone On 22 Jun 2022, at 18:13, Neal Richardson wrote: On behalf of t

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

2022-06-16 Thread Antoine Pitrou
Can we name it miniarrow or nanoarrow? We don't want to convey the message that there is a parallel C API for Arrow. Le 15/06/2022 à 05:18, Dewey Dunnington a écrit : Hi all, I drafted a second PR [1] drafting a design for storing parsed information obtained from a struct ArrowSchema (i.e.

Re: [VOTE] Mark C Stream Interface as Stable

2022-06-08 Thread Antoine Pitrou
Le 08/06/2022 à 20:55, Jorge Cardoso Leitão a écrit : 0 (binding) - imo there is some unclarity over what is expected to be passed over the C streaming interface - an Array or a StructArray. I think the spec claims the former, but the C++ implementation (which I assume is the reference here) e

Re: [VOTE] Mark C Stream Interface as Stable

2022-06-08 Thread Antoine Pitrou
+1 (binding) Le 08/06/2022 à 20:15, Will Jones a écrit : Hi, Given all feedback to discussion [1] has been positive, I would like to propose marking the C Stream Interface as stable. I have prepared PRs in apache/arrow [2] and apache/arrow-rs [3] to remove all "experimental" markers from th

Re: int8_t vs size_t

2022-06-08 Thread Antoine Pitrou
No, Arrow should definitely compile in 32 bits. Feel free to open a JIRA and/or submit a PR for it. Le 08/06/2022 à 19:48, Arkadiy Vertleyb (BLOOMBERG/ 120 PARK) a écrit : Hi Antoine, I need the 32 bit support because our project needs to support 32 bit. These are my constraints. As of

Re: int8_t vs size_t

2022-06-08 Thread Antoine Pitrou
Hi, It is a conscious decision to follow the Google C++ style guide: https://google.github.io/styleguide/cppguide.html#Integer_Types I agree that size_t (or ssize_t) would have been a better choice for in-memory lengths and sizes. Unfortunately, that ship has sailed now. 32-bit systems a

Re: [C++] Can we remove cpp/src/arrow/dbi/hiveserver2?

2022-06-06 Thread Antoine Pitrou
+1 for removing it. On Fri, 03 Jun 2022 08:32:35 +0900 (JST) Sutou Kouhei wrote: > Hi, > > We have Hive adapter in cpp/src/arrow/dbi/hiveserver2 but > it's not maintained. Can we remove this? > > Reasons: > > 1. I got build errors when I build it on master by >-DARROW_HIVESERVER2=ON. Se

Re: [C++] Kernel function registry evolution

2022-06-06 Thread Antoine Pitrou
Le 06/06/2022 à 09:34, Sasha Krassovsky a écrit : Wow that's a lot of progress! Definitely agree on the scalar outputs point. One point about the ArraySpan - why does it need to know its data type? Once a kernel has been resolved by the registry, the kernel will only know how to execute on the

Re: [C++] Kernel function registry evolution

2022-06-02 Thread Antoine Pitrou
Le 02/06/2022 à 00:02, Weston Pace a écrit : I'd like to propose we add a second kernel function registry. There doesn't need to be any user facing API change. We could probably use an approach like [2] to proxy to the old function registry when the newer registry doesn't contain the asked-f

Re: [Discuss][Java] macOS minimum requirements

2022-06-01 Thread Antoine Pitrou
Sorry, I put "C++" in the title but this really affects Java via JNI. Le 01/06/2022 à 16:22, Antoine Pitrou a écrit : Hello, The topic came up recently of bumping up our minimal macOS requirements from 10.11 to 10.13 (*). Do people have any particular concerns about this?

[Discuss][C++] macOS minimum requirements

2022-06-01 Thread Antoine Pitrou
Hello, The topic came up recently of bumping up our minimal macOS requirements from 10.11 to 10.13 (*). Do people have any particular concerns about this? (*) https://github.com/apache/arrow/pull/13157#issuecomment-1143670152 Regards Antoine.

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Antoine Pitrou
Le 31/05/2022 à 21:41, Micah Kornfield a écrit : I'm currently working on adding Run-Length encoding to arrow. Nice What are the intended use cases for this: - external engines want to provide run-length encoded data to work on using arrow? It is more than just external engines. Many p

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Antoine Pitrou
Hi, Le 31/05/2022 à 20:24, Tobias Zagorni a écrit : Hi, I'm currently working on adding Run-Length encoding to arrow. I created a function to dictionary-encode arrays here (currently only for fixed length types): https://github.com/apache/arrow/compare/master...zagto:rle?expand=1 The general
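
For readers unfamiliar with the encoding under discussion, here is a minimal pure-Python sketch of run-length encoding. It only illustrates the general idea (a list of run values paired with run lengths); the function names are hypothetical, and this is not the layout or API proposed for Arrow in the thread.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (run_values, run_lengths)."""
    run_values, run_lengths = [], []
    for value, group in groupby(values):
        run_values.append(value)
        run_lengths.append(sum(1 for _ in group))
    return run_values, run_lengths


def rle_decode(run_values, run_lengths):
    """Expand (run_values, run_lengths) back into the original sequence."""
    out = []
    for value, length in zip(run_values, run_lengths):
        out.extend([value] * length)
    return out


encoded = rle_encode([1, 1, 1, 2, 2, 3])
print(encoded)  # ([1, 2, 3], [3, 2, 1])
```

The encoding pays off only when runs are long; for data with few repeats the two output lists together are larger than the input.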

Re: Arrow C-Data and DuckDB

2022-05-31 Thread Antoine Pitrou
For the record, https://github.com/apache/arrow/pull/13115 was merged with the proposed change. Regards Antoine. On Fri, 13 May 2022 17:48:21 +0200 Antoine Pitrou wrote: > I don't think this needs a vote, there is no functional change in the > spec, it's just an addi

Re: DISCUSS: Make bucket creation opt-in in C++ S3FileSystem?

2022-05-29 Thread Antoine Pitrou
+1 as well Le 25/05/2022 à 01:25, Weston Pace a écrit : +1 I think opt-in is the right way to go here. On Tue, May 24, 2022 at 12:40 PM Will Jones wrote: Hello Arrow devs, I've written a PR for the C++ S3FileSystem that adds an option "allow_create_buckets" which when false will error if

Re: DISCUSS: Stabilize Arrow C Stream Interface?

2022-05-29 Thread Antoine Pitrou
That sounds fair to me as well. Le 27/05/2022 à 13:01, Andrew Lamb a écrit : +1 to the idea on making it a stable interface On Thu, May 26, 2022 at 6:57 PM Jonathan Keane wrote: I too am +1 (nonbinding) to marking it as stable -Jon On Thu, May 26, 2022 at 1:05 PM Neal Richardson < nea

Re: [Python] Converting Python Schema Object to Java Schema Object

2022-05-21 Thread Antoine Pitrou
Hello Srinivas, No, there is not. Also, the pyarrow.jvm is a bit limited currently, there are plans to rewrite it to make it more general (you may want to help contribute if you feel interested and skilled enough): https://issues.apache.org/jira/browse/ARROW-14319 Regards Antoine. Le 2

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-19 Thread Antoine Pitrou
gle-noded-ness). Antoine did point out the ACE name is taken by a C++ library. The "Ace" name is also used by the javascript library [2], but I think is a general enough word that no single library has much specific claim to it. Some other names I thought of: Arrow Recurve Ace A

Re: Merge a pull request with GitHub API

2022-05-18 Thread Antoine Pitrou
That sounds ok to me, we should just ensure that commits are squashed and rebased on top of the main/master branch. (also, the commit title and description should inherit the PR's corresponding fields) Le 18/05/2022 à 05:43, Sutou Kouhei a écrit : Hi, How about using GitHub API instead

Re: Arrow C-Data and DuckDB

2022-05-13 Thread Antoine Pitrou
ll/13115 to add this for the C data/stream interfaces. On Mon, May 9, 2022, at 15:42, Antoine Pitrou wrote: Le 09/05/2022 à 20:28, Tomek Drabas a écrit : I am new to this board so please, let me know if any of this doesn't make sense. I am building a FlightSQL example with DuckDB backend. D

Re: Arrow sync call May 11 at 12:00 US/Eastern, 16:00 UTC

2022-05-13 Thread Antoine Pitrou
Le 13/05/2022 à 16:30, Alessandro Molina a écrit : I think Arrow should definitely consider adding a DataFrame-like API. There are multiple reasons why exposing Arrow to end users instead of restricting it to developers of framework would be beneficial for the Arrow project itself. A rough ap

Re: Datafusion's Java binding is available in Maven Central

2022-05-11 Thread Antoine Pitrou
Hi! Can you elaborate how the binding transfers data between Datafusion and Java Arrow? If I'm reading the code correctly, it seems to be writing an IPC stream? Le 11/05/2022 à 11:20, Jiayu Liu a écrit : Hi dev@arrow, Recently I've created and published a Java binding[1] to datafusion[2

Re: [DISCUSS][C++][Python]Switch default mmap behaviour to off

2022-05-11 Thread Antoine Pitrou
Le 11/05/2022 à 10:19, Alessandro Molina a écrit : As far as I understood, the idea is not to fully remove memory mapping, just turn the current mmap=True default arguments to mmap=False The goal is mostly to provide consistent behaviour for end users. At the moment users might face very diff

Re: PyArrow builds but fails to load pyarrow._dataset

2022-05-10 Thread Antoine Pitrou
Le 10/05/2022 à 19:16, Antoine Pitrou a écrit : That said, tests which require should be skipped gracefully instead of failing. Oops... some words got swallowed: tests which require *the dataset module* should be skipped gracefully instead of failing. Le 10/05/2022 à 19:13, Weston
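
A minimal sketch of what "skipped gracefully" could look like, using only the standard library. The helper name `module_importable` and the test class are hypothetical and are not taken from PyArrow's test suite; pytest users would more likely reach for `pytest.importorskip`. The helper actually imports the module rather than merely locating it, because an optional component can be present as a Python file and still fail to load its compiled extension.

```python
import importlib
import unittest


def module_importable(name):
    """Return True if `name` imports cleanly, False on any ImportError."""
    try:
        importlib.import_module(name)
    except ImportError:  # also covers ModuleNotFoundError
        return False
    return True


@unittest.skipUnless(module_importable("pyarrow.dataset"),
                     "pyarrow.dataset is not available in this build")
class DatasetTests(unittest.TestCase):
    def test_module_loads(self):
        import pyarrow.dataset  # noqa: F401  (only reached when importable)
```

With this pattern a build without the optional component reports the tests as skipped, with the reason shown, instead of failing at import time.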

Re: PyArrow builds but fails to load pyarrow._dataset

2022-05-10 Thread Antoine Pitrou
That said, tests which require should be skipped gracefully instead of failing. Le 10/05/2022 à 19:13, Weston Pace a écrit : I think you need to add: export PYARROW_WITH_DATASET=1 On Tue, May 10, 2022 at 7:07 AM Yaron Gvili wrote: Hello, I ran into a problem with running PyArrow

Re: [DISC][Release] More control on Release Candidates commits

2022-05-10 Thread Antoine Pitrou
Le 10/05/2022 à 13:27, Raul Cumplido a écrit : I still think there is some value in standardising the "feature freeze" on new release candidates once a first release candidate has been created and only add required fixes for the follow up RCs. What I would like to avoid with that is rushing bi

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-05-10 Thread Antoine Pitrou
AQE as Adaptive Query Execution (especially Spark users). "Arrow Compute Engine" in full doesn't sound bad perhaps? With DataFusion, I made a list of words related to the project (data, query, compute, engine, etc) and then a list of completely unrelated words and then looked at the

Re: mmap only, read data later?

2022-05-10 Thread Antoine Pitrou
Le 10/05/2022 à 04:36, Andrew Piskorski a écrit : On Mon, May 09, 2022 at 07:00:47PM +0200, Antoine Pitrou wrote: Generally, the Arrow IPC file/stream formats are designed for large data. If you have many very small files you might try to rethink how you store your data on disk. Ah. Is

Re: [DISC][Release] More control on Release Candidates commits

2022-05-09 Thread Antoine Pitrou
Well, in any case, the release manager should make the final call, so a label would mostly be a sophisticated way of pinging them. Le 09/05/2022 à 20:45, Weston Pace a écrit : How should we indicate whether a JIRA is a bugfix, which should be included in the next RC, or something else that

Re: Arrow C-Data and DuckDB

2022-05-09 Thread Antoine Pitrou
Le 09/05/2022 à 20:28, Tomek Drabas a écrit : I am new to this board so please, let me know if any of this doesn't make sense. I am building a FlightSQL example with DuckDB backend. DuckDB already has an Arrow interface defined in duckdb.h that returns ArrowArray. However, the import is not gu

Re: mmap only, read data later?

2022-05-09 Thread Antoine Pitrou
Hi Andrew, If the Arrow files are small, chances are the metadata (which is always being read) is as large on disk as the actual data (which is "only" mmap'ed). Also, mmap'ing works on a page granularity (a page being typically 4 kB on x86, sometimes a bit larger on other architectures), an
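
The page-granularity point can be made concrete with a little arithmetic. The sketch below is only an illustration: the helper name `mapped_bytes` is hypothetical, and it computes how much address space a mapping of a given size covers once rounded up to whole pages, using the page size Python reports for the current platform.

```python
import mmap


def mapped_bytes(file_size, page_size=mmap.PAGESIZE):
    """Bytes covered by a mapping of `file_size` bytes, rounded up to whole pages."""
    if file_size <= 0:
        return 0
    pages = -(-file_size // page_size)  # ceiling division
    return pages * page_size


# A 100-byte file still occupies one full page of the mapping.
print(mapped_bytes(100, page_size=4096))  # 4096
```

For many very small files this per-file rounding, on top of the metadata that is always read, can dominate whatever the mapping saves.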

Re: [DISC][Release] More control on Release Candidates commits

2022-05-09 Thread Antoine Pitrou
+1 from me. I'm actually surprised that we didn't do something like that already. Adding new features from one RC to another sounds like a very bad idea. Regards Antoine. Le 09/05/2022 à 14:33, Raul Cumplido a écrit : Hi, I would like to propose a change in our release process. The rat

Re: [DISC] (Python) Dropping support for manylinux2010

2022-05-05 Thread Antoine Pitrou
That sounds ok to me. Le 05/05/2022 à 13:01, Jacob Wujciak a écrit : Hi all, I would like to propose that we drop support for manylinux2010. CentOS 6, on which the manylinux2010 image is based, has been EOL for over two years [1]. There is now also an official announcement by pypa that man

Re: [DISC] (Java) Add Windows binaries to Maven packages

2022-05-04 Thread Antoine Pitrou
Le 04/05/2022 à 17:21, Alessandro Molina a écrit : The proposal seems reasonable to me, we should do our best at providing users the same experience on the various systems whenever possible. As long as we don't receive complaints about the package size, I think we can live with it. If it becom

Re: [C++] Question about "Code style and Linting"

2022-04-28 Thread Antoine Pitrou
Le 28/04/2022 à 17:07, Li Jin a écrit : Aha thanks Antoine! After digging the log I think I found the issue: " -- clang-tidy 12 not found -- clang-format 12 not found " after installing those two it got me over that step.. A side question - does running " archery lint --cpplint --clang-form

Re: [C++] Question about "Code style and Linting"

2022-04-28 Thread Antoine Pitrou
Le 28/04/2022 à 16:54, Li Jin a écrit : Hello! I am preparing for submitting a PR and reading the "Code style and Linting" section of the development doc: https://github.com/apache/arrow/blob/master/docs/source/developers/cpp/development.rst#code-style-linting-and-ci I got to the point that

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Antoine Pitrou
t file from js to WASM and de-serialized it to arrow directly in wasm - so memory was already being allocated from within WASM sandbox, not JS. Sorry for the confusion. [1] https://github.com/WebAssembly/design/issues/1439 Best, Jorge On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou wrote:

Re: [DISC] Improving Arrow's database support

2022-04-26 Thread Antoine Pitrou
Do we want something more flexible than dlopen() and runtime symbol lookup (a mechanism which constrains the way you can organize and distribute drivers)? For example, perhaps we could expose an API struct of function pointers that could be obtained through driver-specific means. Le 26/0

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Antoine Pitrou
pr 26, 2022 at 10:22 AM Antoine Pitrou wrote: Le 26/04/2022 à 16:18, Jorge Cardoso Leitão a écrit : Would WASM be able to interact in-process with non-WASM buffers safely? AFAIK yes. My understanding from playing with it in JS is that a WASM-backed udf execution would be something like: 1. co

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Antoine Pitrou
ython function, which has security implications since the Python interpreter allows everything by default. Best, Jorge On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou wrote: Le 25/04/2022 à 23:04, David Li a écrit : The WebAssembly documentation has a rundown of the techniques used: https://weba

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-26 Thread Antoine Pitrou
rguments are of different length - we'd need something like the ColumnBag proposal, so this might be a good reason to revive that). On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote: Le 25/04/2022 à 22:19, Wes McKinney a écrit : I was going to reply to this e-mail thread on user@ but tho

Re: [Python] [Docs] Framework to override docs for pyarrow.compute functions using native reStructured Text (?)

2022-04-26 Thread Antoine Pitrou
Hi Kevin, There are a couple of concerns to keep in mind: - we don't want to increase the import time of PyArrow too much - we would like to limit the required runtime dependencies for PyArrow (an issue is open to move docstring generation at package build time: https://issues.apache.org/jira/

Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

2022-04-25 Thread Antoine Pitrou
Le 25/04/2022 à 22:19, Wes McKinney a écrit : I was going to reply to this e-mail thread on user@ but thought I would start a new thread on dev@. Executing user-defined functions in memory, especially untrusted functions, in general is unsafe. For "trusted" functions, having an in-memory API f

Re: [VOTE] Extend Arrow Flight SQL with more SQL type info in schemas

2022-04-21 Thread Antoine Pitrou
+1 from me (binding), with caveat that I'm not competent in JDBC and the proposed changes in PR [3] look reasonable to me. Best regards Antoine. Le 20/04/2022 à 22:13, David Li a écrit : Hello, Iury da Guia Salino has proposed an addition to Arrow Flight SQL, an experimental protocol fo

Re: Perf/Benchmark for temporal operations

2022-04-13 Thread Antoine Pitrou
Hello Li, The temporal rounding operations operate on localized times taking into account the timestamp's timezone, which is why they're more computationally intensive than raw floating point operations. Which operation in particular did you benchmark? Is it part of a significant workload
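
To illustrate why rounding on localized times costs more than plain arithmetic, here is a small standard-library sketch. It is not Arrow's implementation; the function names are hypothetical, and a fixed UTC+2 offset stands in for a real timezone (a real zone would additionally require a DST-aware offset lookup per value).

```python
from datetime import datetime, timedelta, timezone


def floor_to_day_utc(ts):
    """Floor to midnight UTC: arithmetic on the instant only."""
    ts_utc = ts.astimezone(timezone.utc)
    return ts_utc.replace(hour=0, minute=0, second=0, microsecond=0)


def floor_to_day_local(ts, tz):
    """Floor to midnight in `tz`: convert to wall time first, then truncate."""
    local = ts.astimezone(tz)
    return local.replace(hour=0, minute=0, second=0, microsecond=0)


utc_plus_2 = timezone(timedelta(hours=2))  # assumed fixed offset, no DST
ts = datetime(2022, 4, 13, 23, 30, tzinfo=timezone.utc)

print(floor_to_day_utc(ts))                # 2022-04-13 00:00:00+00:00
print(floor_to_day_local(ts, utc_plus_2))  # 2022-04-14 00:00:00+02:00
```

The two results are different instants (the local floor corresponds to 22:00 UTC on 13 April), which is why the timezone conversion cannot simply be skipped.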

Re: [C++] Replacing xsimd with compiler autovectorization

2022-04-04 Thread Antoine Pitrou
ht? Indeed you can have an initial stab at that. Regards Antoine. Sasha 3 апр. 2022 г., в 11:47, Antoine Pitrou написал(а): It would be a very significant contributor, as the inconsistency can manifest under the form of up to 8-fold differences in performance (or perhaps more).

Re: [C++] Replacing xsimd with compiler autovectorization

2022-04-03 Thread Antoine Pitrou
Le 01/04/2022 à 08:43, Sasha Krassovsky a écrit : I agree that a potential inconsistent experience is a problem, but I disagree that SIMD would be the root of the problem, or even be a significant contributor to it. It would be a very significant contributor, as the inconsistency can manifes

Re: [C++] Replacing xsimd with compiler autovectorization

2022-03-31 Thread Antoine Pitrou
Le 31/03/2022 à 09:19, Sasha Krassovsky a écrit : As I showed, those auto-vectorized kernels may be vectorized only in some situations, depending on the compiler version, the input datatypes... I would more than anything interpret the fact that that code was vectorized at all as an amazing

Re: [C++] Replacing xsimd with compiler autovectorization

2022-03-30 Thread Antoine Pitrou
generate different vectorized code, and clang and gcc do not auto-vectorize at the same optimization level (O2 for clang and O3 or O2 -ftree-vectorize for gcc) Regards, Johan On Wed, Mar 30, 2022 at 10:10 AM Antoine Pitrou wrote: Hi Sasha, Le 30/03/2022 à 00:14, Sasha Krassovsky a écr

Re: [C++] Replacing xsimd with compiler autovectorization

2022-03-30 Thread Antoine Pitrou
Hi Sasha, Le 30/03/2022 à 00:14, Sasha Krassovsky a écrit : I've noticed that we include xsimd as an abstraction over all of the simd architectures. I'd like to propose a different solution which would result in fewer lines of code, while being more readable. My thinking is that anything simp

Re: [DISCUSS] "Naming" the Arrow C++ execution engine subproject?

2022-03-28 Thread Antoine Pitrou
ACE is already the name of a well-known C++ library, though I'm not sure how widely used it is nowadays : http://www.dre.vanderbilt.edu/~schmidt/ACE.html I would name it "execution engine" or "Arrow C++ execution engine" in full. Regards Antoine. Le 29/03/2022 à 00:15, Wes McKinney a écri

Re: C++ Helpers for Row and Arrow conversions

2022-03-24 Thread Antoine Pitrou
Hello Will, So the added value would simply be the automatic definition of iterator-returning methods? Or am I missing something? Regards Antoine. Le 23/03/2022 à 19:36, Will Jones a écrit : Hello Arrow devs, I recently created ARROW-16006 [1] ("Helpers for converting between rows and A

Re: [VOTE] Extend Arrow Flight SQL with GetXdbcTypeInfo, SQL type info in schemas

2022-03-21 Thread Antoine Pitrou
Moral +1 from me. I've posted minor comments on the specs changes in the PRs. Le 16/03/2022 à 20:50, David Li a écrit : Hello, Jose Almeida and James Duong have proposed two additions to Arrow Flight SQL, an experimental protocol for interacting with SQL databases over Arrow Flight. The

Re: [PROPOSAL][RELEASE] Arrow 7.0.1

2022-03-10 Thread Antoine Pitrou
If it's only about missing artifacts and nothing else needs to change, I would hope so indeed. Le 10/03/2022 à 15:16, David Li a écrit : Hmm, I'm not too sure how procedures work here, but is it possible to just vote and upload the missing 7.0.0 artifacts, instead of going through the wh

Re: [C++] [csv] Why do I keep getting the error - "CVS parser got out of sync with chunker"

2022-03-09 Thread Antoine Pitrou
On Mon, 7 Mar 2022 11:52:02 -0800 HK Verma wrote: > Thanks Antoine. Yes I have newlines_in_values set to false. Other configs > also look ok. > However I do have rows with fewer columns than the > number specified in convert options in column types. I have my own > invalid_row_handler w

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-03-08 Thread Antoine Pitrou
e appropriate size for decimal types. Relaxing from {128,256} to {32,64,128,256} seems a low risk from an integration perspective, as implementations already need to read the bitwidth to select the appropriate physical representation (if they support it). Best, Jorge On Mon, Mar 7, 2022, 11:41 Ant

Re: [C++] Why do I keep getting the error - "CVS parser got out of sync with chunker"

2022-03-07 Thread Antoine Pitrou
Hi HK, On Mon, 7 Mar 2022 10:16:07 -0800 HK Verma wrote: > I am integrating Arrow with another C++ library. For this, I wrote an input > stream which feeds CSV data into the streaming reader. It fails for very > large files with error messages like - "CSV parser got out of sync with > chunk

Re: [Discuss][Format] Add 32-bit and 64-bit Decimals

2022-03-07 Thread Antoine Pitrou
this case, it might be argued we are just relaxing the constraints on an existing type. What do others think? Regards Antoine. On Thu, Mar 3, 2022 at 6:55 AM Antoine Pitrou wrote: Hello, Currently, the Arrow format specification restricts the bitwidth of decimal numbers to either 128 o

Re: [PyArrow] Arrow StructArray buffer allocation

2022-03-04 Thread Antoine Pitrou
I opened https://issues.apache.org/jira/browse/ARROW-15846 Regards Antoine. Le 04/03/2022 à 15:05, Antoine Pitrou a écrit : Le 04/03/2022 à 15:01, Hanqi Wu a écrit : Hi Antoine, I agree n_buffers should still be set to 1. But as per the below PyArrow doc, n_buffers’s value will be 0 if

Re: [PyArrow] Arrow StructArray buffer allocation

2022-03-04 Thread Antoine Pitrou
's still "present" in the metadata (for example as a null pointer, if using the C data interface). This probably deserves clarifying, though. I'll open an issue. Regards Antoine. https://arrow.apache.org/docs/format/Columnar.html#struct-layout Thanks, Hanqi On Mar

Re: [PyArrow] Arrow StructArray buffer allocation

2022-03-04 Thread Antoine Pitrou
produced when exporting such an array. Regards Antoine. However, “import_from_c” expects StructArray to always have at least 1 buffer allocated, otherwise it throws an exception. Best, Hanqi On Mar 4, 2022, at 8:47 AM, Antoine Pitrou wrote: Le 04/03/2022 à 04:17, Hanqi Wu a écrit

Re: [C++] Ways to make CMake more robust across environments?

2022-03-04 Thread Antoine Pitrou
Hello Will, Le 04/03/2022 à 01:27, Will Jones a écrit : I've come across several different environments where Arrow either fails to configure with CMake or fails to link libraries. Some recent examples I've come across: * (Just fixed [1]) Windows, RTools4 (MSYS2), Debug, dynamic libraries

Re: [PyArrow] Arrow StructArray buffer allocation

2022-03-04 Thread Antoine Pitrou
Le 04/03/2022 à 04:17, Hanqi Wu a écrit : Hello community, As per the below documentation, for an Arrow StructArray, it won’t have any physical buffers backing it if it doesn’t contain any null value: https://arrow.apache.org/docs/format/Columnar.html#struct-layout However, in PyArrow, it co

[Discuss][Format] Add 32-bit and 64-bit Decimals

2022-03-03 Thread Antoine Pitrou
Hello, Currently, the Arrow format specification restricts the bitwidth of decimal numbers to either 128 or 256 bits. However, there is interest in allowing other bitwidths, at least 32 and 64 bits for this proposal. A 64-bit (respectively 32-bit) decimal datatype would allow for precision
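The snippet above is cut off before it states the precision figures, so the following is only an arithmetic sketch, not text from the proposal. It assumes the unscaled value is stored as a signed two's-complement integer (as for the existing 128-bit and 256-bit decimals) and computes how many decimal digits always fit in a given bit width. The function name `max_decimal_precision` is made up for this illustration.

```python
def max_decimal_precision(bit_width: int) -> int:
    """Return the largest p such that every p-digit decimal integer fits
    in a signed two's-complement integer of the given bit width.

    Assumption for this sketch: one bit is reserved for the sign.
    """
    max_value = 2 ** (bit_width - 1) - 1
    precision = 0
    # 10**(p+1) - 1 is the largest integer with p+1 decimal digits.
    while 10 ** (precision + 1) - 1 <= max_value:
        precision += 1
    return precision


if __name__ == "__main__":
    for bits in (32, 64, 128, 256):
        print(bits, max_decimal_precision(bits))
```

Under that assumption the loop yields 9 digits for 32 bits and 18 digits for 64 bits, alongside the 38 and 76 digits of the existing 128-bit and 256-bit types.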

Re: Flight/FlightSQL Optimization for Small Results?

2022-03-01 Thread Antoine Pitrou
Can we just add the following field to the FlightDescriptor message: bool accept_inline_data = 4; and this one to the FlightInfo message: FlightData inline_data = 100; Then new clients can set `accept_inline_data` to true (the default being false if omitted) to signal servers that they can pu
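A rough sketch of how the opt-in flag might behave, using plain Python dataclasses rather than the real Flight protobuf messages. Only the field names `accept_inline_data` and `inline_data` come from the message above; the size threshold, the `get_flight_info` helper, the use of `bytes` in place of a FlightData message, and the endpoint URI are all assumptions made for illustration, since the snippet is truncated before it describes the server side.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical cut-off for "small" results; not part of the proposal.
INLINE_LIMIT_BYTES = 1024


@dataclass
class FlightDescriptor:
    path: str
    # Proposed field: false when omitted, so old clients are unaffected.
    accept_inline_data: bool = False


@dataclass
class FlightInfo:
    endpoints: List[str] = field(default_factory=list)
    # Proposed field; bytes stands in for a FlightData message here.
    inline_data: Optional[bytes] = None


def get_flight_info(descriptor: FlightDescriptor, payload: bytes) -> FlightInfo:
    """Toy server handler: inline the payload only if the client opted in
    and the payload is small, otherwise point at an endpoint as usual."""
    if descriptor.accept_inline_data and len(payload) <= INLINE_LIMIT_BYTES:
        return FlightInfo(inline_data=payload)
    # Placeholder endpoint URI, purely illustrative.
    return FlightInfo(endpoints=["grpc://example.invalid:1234"])
```

A client that leaves the flag unset always receives endpoints, which is the backward-compatibility property the default of false is meant to provide.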

[C++] Hiveserver2 support broken

2022-02-24 Thread Antoine Pitrou
Hello, Just a note that Hiveserver2 support is currently broken in Arrow C++, and it may have been for a long time (attempting to compile it just doesn't work): https://issues.apache.org/jira/browse/ARROW-15774 Is there any interested party to work on fixing this? Otherwise, we may want to rem

Re: [JavaScript] Appending Nulls to a Struct (Bug)

2022-02-19 Thread Antoine Pitrou
post-condition for the resulting struct array is that its length is equal to the length of all of its children arrays. Cheers, Micah On Fri, Feb 18, 2022 at 1:12 PM Phillip Cloud wrote: On Fri, Feb 18, 2022 at 3:44 PM Antoine Pitrou wrote: Le 18/02/2022 à 21:32, Phillip Cloud a écrit : I am r

Re: [JavaScript] Appending Nulls to a Struct (Bug)

2022-02-18 Thread Antoine Pitrou
Le 18/02/2022 à 21:32, Phillip Cloud a écrit : I am really struggling to see how anything I've said is inconsistent with the spec or what you are saying here. To recap what I've said: 1. Appending a null sentinel to the values buffer isn't _required_ unless the type requires it. Ex: "joemark

Re: [JavaScript] Appending Nulls to a Struct (Bug)

2022-02-18 Thread Antoine Pitrou
Le 18/02/2022 à 20:26, Phillip Cloud a écrit : On Fri, Feb 18, 2022 at 2:06 PM Antoine Pitrou wrote: Le 18/02/2022 à 20:01, Phillip Cloud a écrit : I think I'm confused by where this appended value lives. Is it only a logical value or does the value show up in memory? The logical

Re: [JavaScript] Appending Nulls to a Struct (Bug)

2022-02-18 Thread Antoine Pitrou
Le 18/02/2022 à 20:01, Phillip Cloud a écrit : I think I'm confused by where this appended value lives. Is it only a logical value or does the value show up in memory? The logical value is null. The appended value is only a physical value that shows up in memory but doesn't have any bearing
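To make the logical/physical distinction concrete, here is a toy model in plain Python. It is not the Arrow JavaScript or C++ builder API; `ToyStructBuilder` and its methods are invented for this sketch, and it assumes integer children with 0 as the placeholder. Appending a null struct clears the validity bit and still appends a placeholder to every child, so the children keep the same length as the struct.

```python
class ToyStructBuilder:
    """Toy struct builder: one validity list plus one list per child field."""

    def __init__(self, field_names, placeholder=0):
        self.validity = []
        self.children = {name: [] for name in field_names}
        self.placeholder = placeholder  # arbitrary physical filler value

    def append(self, **values):
        # A valid struct slot: set the validity bit, append each child value.
        self.validity.append(True)
        for name, column in self.children.items():
            column.append(values[name])

    def append_null(self):
        # A null struct slot: the logical value is null, but each child
        # still receives a physical placeholder to keep lengths aligned.
        self.validity.append(False)
        for column in self.children.values():
            column.append(self.placeholder)

    def to_pylist(self):
        # Logical view: placeholders under a cleared validity bit are hidden.
        return [
            {name: col[i] for name, col in self.children.items()} if valid else None
            for i, valid in enumerate(self.validity)
        ]
```

Reading the result back through `to_pylist` shows `None` for the null slot even though each child list physically holds a value at that index.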

Re: [JavaScript] Appending Nulls to a Struct (Bug)

2022-02-18 Thread Antoine Pitrou
Le 18/02/2022 à 19:29, Phillip Cloud a écrit : The description underneath the example says: While a struct does not have physical storage for each of its semantic slots (i.e. each scalar C-like struct), an entire struct slot can be set to null via the validity bitmap. To me this suggests

Re: Proposal: renaming the 'master' branch to 'main'

2022-02-14 Thread Antoine Pitrou
Le 14/02/2022 à 21:45, Neal Richardson a écrit : There was discussion of this back in 2020 [1], and the consensus at the time seemed to be to wait and see where git and GitHub would land before making what could be a disruptive change. I support reopening the discussion. It looks like quite a

Re: [C++][Python] Connection caching for File Systems

2022-02-13 Thread Antoine Pitrou
Hi Micah, Le 12/02/2022 à 19:32, Micah Kornfield a écrit : Hi Arrow Dev, For filesystems that contact remote services is there any sort of connection caching to avoid the cost of reconnecting each time? If so, at what scope does the cache live (e.g. each FileSystem object, or process global)?

Re: [Discuss] Best practice for storing key-value metadata for Extension Types

2022-02-10 Thread Antoine Pitrou
Le 10/02/2022 à 14:09, Alessandro Molina a écrit : Mentioned this already to Joris, but want to make sure we don't miss it. C-Data and thus ARROW:extension:metadata was mostly designed for shipping data to different processes within the same host. ARROW:extension:metadata is unrelated to the

Re: [Python] Parquet CMake issue

2022-02-03 Thread Antoine Pitrou
Hi Ian, Did you run "make install" as well after compiling Arrow C++? Perhaps PyArrow is picking up an old installed version? Regards Antoine. Le 03/02/2022 à 08:35, Ian Joiner a écrit : Hi, In order to prevent problematic PRs from happening again I’m cleaning up my local env. Here is

Re: [VOTE] Release Apache Arrow 7.0.0 - RC10

2022-02-02 Thread Antoine Pitrou
rror as Antoine with the wheels; adding `conda activate base` also fixed it for me. I had to disable Gandiva for source verification due to a linking error with LLVM (though my system repositories don't have an appropriate version of LLVM in the first place). -David On Mon, Jan 31, 202

Re: [VOTE] Release Apache Arrow 7.0.0 - RC10

2022-01-31 Thread Antoine Pitrou
Le 30/01/2022 à 11:15, Krisztián Szűcs a écrit : (*) Here is the end of the logs: + pushd binaries /tmp/arrow-7.0.0.C5b7S/binaries /tmp/arrow-7.0.0.C5b7S ++ uname + '[' Linux == Darwin ']' + test_linux_wheels ++ uname -m + '[' x86_64 = aarch64 ']' + local arch=x86_64 + local 'py_arches=3.7m 3

Re: [VOTE] Release Apache Arrow 7.0.0 - RC10

2022-01-29 Thread Antoine Pitrou
+1 The following checks were successful on Ubuntu 20.04, x86-64: - source release test with gcc 9.3.0, C++, Python, GLib, Ruby, CUDA - source release test with gcc 8.4.0, C++, Python, GLib, Ruby, CUDA - source release test with clang 10.0.0, C++, Python, GLib, Ruby, CUDA - source release test w
