Re: Complex Number support in Arrow

2021-06-10 Thread Micah Kornfield
> It might help this discussion and future discussions like it if we could define how it is determined whether a type should be part of the Arrow format, an extension type (and what does it mean to say there is a "canonical" extension type), or just something that a language

Re: Complex Number support in Arrow

2021-06-10 Thread Jorge Cardoso Leitão
Isn't an array of complexes represented by what Arrow already supports? In particular, I see at least two valid in-memory representations to use, depending on what we are going to do with it:
* Struct[re, im]
* FixedList[2]
In the first case, we have two buffers, [x0, x1, ...] and [y0, y1,
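The two layouts named in this message can be pictured with a small sketch. It uses only Python's standard library and is not an Arrow API; the names `re_buf`, `im_buf`, `interleaved`, `from_struct` and `from_fixed_list` are assumptions made for illustration.

```python
from array import array

values = [complex(1.0, 2.0), complex(3.0, -4.0), complex(0.5, 0.25)]

# Struct[re, im]-style layout (assumption: one contiguous buffer per field).
re_buf = array("d", (z.real for z in values))
im_buf = array("d", (z.imag for z in values))

# FixedList[2]-style layout (assumption: one buffer, re/im interleaved).
interleaved = array("d")
for z in values:
    interleaved.extend((z.real, z.imag))


def from_struct(re_buf, im_buf, i):
    """Rebuild element i from the two per-field buffers."""
    return complex(re_buf[i], im_buf[i])


def from_fixed_list(buf, i):
    """Rebuild element i from the interleaved buffer."""
    return complex(buf[2 * i], buf[2 * i + 1])
```

In this toy model the per-field layout keeps all real parts adjacent, which suits operations on one component at a time, while the interleaved layout keeps each value's two components adjacent.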

Re: Long title on github page

2021-06-10 Thread Sutou Kouhei
It seems that we can use .asf.yaml to set the description on GitHub: https://cwiki.apache.org/confluence/display/INFRA/git+-+.asf.yaml+features#Git.asf.yamlfeatures-GitHubsettings
github:
  description: "Apache Arrow is ..."
In "Re: Long title on github page" on Thu, 10 Jun 2021 17:44:57

Re: Complex Number support in Arrow

2021-06-10 Thread Neal Richardson
It might help this discussion and future discussions like it if we could define how it is determined whether a type should be part of the Arrow format, an extension type (and what does it mean to say there is a "canonical" extension type), or just something that a language implementation or

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Gosh Arzumanyan
This might help to get the size of the output buffer upfront: https://github.com/apache/arrow/blob/1830d1558be8741e7412f6af30582ff457f0f34f/cpp/src/arrow/io/memory.h#L96 Though with "standard" allocators there is a risk of running into KiPageFaults when going for buffers over 1 MB. This might be

Re: Complex Number support in Arrow

2021-06-10 Thread Micah Kornfield
> My understanding is that it means having COMPLEX as an entry in the arrow/type_fwd.h Type enum. I agree this would make implementation work in the C++ library much more straightforward. One idea I proposed would be to do that, and implement the serialization of the complex metadata

Re: C++ Segmentation Fault RecordBatchReader::ReadNext in CentOS only

2021-06-10 Thread Rares Vernica
Yes, the pre-built binaries are the official RPM packages. I recompiled 4.0.1 with the default gcc-g++ from CentOS 7 and Debug flag. The segmentation fault occurred. See below for the backtrace. Please note that the SciDB database as well as the Plug-in where the Arrow library is used are

Re: Complex Number support in Arrow

2021-06-10 Thread Wes McKinney
My understanding is that it means having COMPLEX as an entry in the arrow/type_fwd.h Type enum. I agree this would make implementation work in the C++ library much more straightforward. One idea I proposed would be to do that, and implement the serialization of the complex metadata using

Re: Complex Number support in Arrow

2021-06-10 Thread Weston Pace
> While dedicated types are not strictly required, compute functions would be much easier to add for a first-class dedicated complex datatype rather than for an extension type.
@pitrou This is perhaps a naive question (and admittedly, I'm not up to speed on my compute kernels) but why is this

Re: Long title on github page

2021-06-10 Thread Wes McKinney
I'll wait a day or two for more feedback to percolate and then ask Infra to change the description on GitHub.
On Thu, Jun 10, 2021 at 4:47 PM Adam Lippai wrote:
> +1
> On Thu, Jun 10, 2021, 23:38 Antoine Pitrou wrote:
> > Sound good enough to me.
> > On 10/06/2021 at 23:35, Wes

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Wes McKinney
From this, it seems like seeding the RecordBatchStreamWriter's output stream with a much larger preallocated buffer would improve performance (depends on the allocator used of course).
On Thu, Jun 10, 2021 at 5:40 PM Weston Pace wrote:
> Just for some reference times from my system I created

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Weston Pace
Just for some reference times from my system, I created a quick test to dump a ~1.7GB table to buffer(s).
* Going to many buffers (just collecting the buffers): ~11,000ns
* Going to one preallocated buffer: ~160,000,000ns
* Going to one dynamically allocated buffer (using a grow factor of 2x):
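To picture why a dynamically grown buffer can cost more than a preallocated one, here is a toy model in Python. It is not Arrow's allocator and does not reproduce the timings above; the function name `bytes_recopied` and its parameters are assumptions for illustration. It counts only the bytes that must be moved each time the buffer is reallocated.

```python
def bytes_recopied(chunk_sizes, initial_capacity, factor=2):
    """Count bytes moved by reallocations while appending chunks.

    Toy model (assumption): every time the buffer is too small, its capacity
    is multiplied by `factor` until the chunk fits, and the bytes already
    written are copied once into the new allocation.
    """
    capacity = initial_capacity
    used = 0
    recopied = 0
    for size in chunk_sizes:
        if used + size > capacity:
            while capacity < used + size:
                capacity *= factor
            recopied += used  # existing contents move to the new allocation
        used += size
    return recopied


chunks = [4, 4, 4, 4]
grown = bytes_recopied(chunks, initial_capacity=4)          # grows 4 -> 8 -> 16
preallocated = bytes_recopied(chunks, initial_capacity=16)  # never grows
```

In this model the preallocated case copies each chunk exactly once, while the growing case pays extra copies on top of that; real allocators may avoid some of these copies by extending an allocation in place.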

Re: [VOTE][RUST] Release Apache Arrow Rust 4.3.0

2021-06-10 Thread Wes McKinney
+1 (binding)
Verified RC using verification script on Apple aarch64
On Thu, Jun 10, 2021 at 5:05 PM Andrew Lamb wrote:
> Hi,
> I would like to propose a release of Apache Arrow Rust Implementation, version 4.3.0.
> This release candidate is based on commit:

[VOTE][RUST] Release Apache Arrow Rust 4.3.0

2021-06-10 Thread Andrew Lamb
Hi, I would like to propose a release of Apache Arrow Rust Implementation, version 4.3.0. This release candidate is based on commit: 1f7f4bc45afc5189ea0d7d4a588688ae00cceb86 [1] The proposed release tarball and signatures are hosted at [2]. The changelog is located at [3]. Please download,

Re: Long title on github page

2021-06-10 Thread Adam Lippai
+1
On Thu, Jun 10, 2021, 23:38 Antoine Pitrou wrote:
> Sound good enough to me.
> On 10/06/2021 at 23:35, Wes McKinney wrote:
> > I hate to reopen this can of worms again, but here is my effort to synthesize feedback:
> > "Apache Arrow is a multi-language toolbox for accelerated

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Wes McKinney
To be clear, we would like to help make this faster. I don't recall much effort being invested in optimizing this code path in the last couple of years, so there may be some low hanging fruit to improve the performance. Changing the in-memory data layout (the chunking) is one of the most likely

Re: Long title on github page

2021-06-10 Thread Antoine Pitrou
Sound good enough to me.
On 10/06/2021 at 23:35, Wes McKinney wrote:
> I hate to reopen this can of worms again, but here is my effort to synthesize feedback:
> "Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing."
> On Thu, Jun 10, 2021 at 12:37

Re: Long title on github page

2021-06-10 Thread Wes McKinney
I hate to reopen this can of worms again, but here is my effort to synthesize feedback: "Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing." On Thu, Jun 10, 2021 at 12:37 PM Dominik Moritz wrote: > > I thought there were some good suggestions in

Re: Complex Number support in Arrow

2021-06-10 Thread Wes McKinney
I'd be supportive of starting with this as a "canonical" extension type so that all implementations are not expected to support complex types — this would encourage us to build sufficient integration e.g. with NumPy to get things working end-to-end with the on-wire representation being an

Re: Complex Number support in Arrow

2021-06-10 Thread Micah Kornfield
> I'm convinced now that first-class types seem to be the way to go and I'm happy to take this approach.
I agree that it is simpler in terms of implementation effort, but I'm still not convinced that we should be adding this as a first-class type. As noted in the survey below it appears Complex

Re: [Discuss] Handling timezones in (C++) compute kernels for timestamp data

2021-06-10 Thread Wes McKinney
I agree that we need to implement the equivalent of pandas's "tz_localize" method which performs UTC normalization on tz-naive data and sets the timezone field. Here's a demo of this functionality (I originally implemented this years ago by porting pytz's logic to run against NumPy arrays in
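The demo referred to in this message is not reproduced in the snippet. As a separate, minimal sketch of what "UTC normalization on tz-naive data" means, the following uses only Python's standard library; the helper name `localize_to_utc` is hypothetical, and a fixed UTC-5 offset stands in for a real time zone, so daylight-saving transitions are deliberately ignored.

```python
from datetime import datetime, timedelta, timezone


def localize_to_utc(naive, tz):
    """Interpret a tz-naive wall-clock time in `tz` and return it in UTC."""
    if naive.tzinfo is not None:
        raise ValueError("expected a tz-naive datetime")
    # Attach the zone without shifting the wall-clock fields, then convert.
    return naive.replace(tzinfo=tz).astimezone(timezone.utc)


# Assumption: a fixed offset instead of a named zone with DST rules.
fixed_minus_5 = timezone(timedelta(hours=-5))
wall = datetime(2021, 6, 10, 12, 0, 0)
utc = localize_to_utc(wall, fixed_minus_5)
```

The stored value moves to 17:00 UTC while the attached zone records how to present it as 12:00 local time, which mirrors the idea of normalizing the values and setting the timezone field.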

Re: [Discuss] Handling timezones in (C++) compute kernels for timestamp data

2021-06-10 Thread Joris Van den Bossche
On Thu, 10 Jun 2021 at 18:06, Antoine Pitrou wrote:
> On Thu, 10 Jun 2021 17:33:23 +0200 Joris Van den Bossche wrote:
> > We just merged a PR to add some kernels to extract fields from timestamps (year, month, day, hour, etc -> ARROW-11759

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Gosh Arzumanyan
Hi Jayjeet, I wonder if you really need to serialize the whole table into a single buffer as you will end up with twice the memory while you could be sending chunks as they are generated by the RecordBatchStreamWriter. Also is the buffer resized beforehand? I'd suspect there might be relocations

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Wes McKinney
hi Jayjeet — have you run prof to see where those 1000ms are being spent? How many arrays (the sum of the number of chunks across all columns) in total are there? I would guess that the problem is all the little Buffer memcopies. I don't think that the C Interface is going to help you. - Wes On

Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Jayjeet Chakraborty
Hello Arrow Community, I am a student working on a project where I need to serialize an in-memory Arrow Table of size around 700MB to a uint8_t* buffer. I am currently using the arrow::ipc::RecordBatchStreamWriter API to serialize the table to an arrow::Buffer, but it is taking nearly 1000ms to

Re: Javascript object => Arrow Table

2021-06-10 Thread Dominik Moritz
Hi Lana, We don’t right now but it’s certainly something we would like to add. I recently added support for typed arrays in Table.new ( https://github.com/apache/arrow/pull/10151) and filed a Jira to support constructing Tables from arrays of objects in

Re: Long title on github page

2021-06-10 Thread Dominik Moritz
I thought there were some good suggestions in this thread. @Wes, did you find a description you liked?
On May 18, 2021 at 06:24:47, Adam Hooper wrote:
> Poll question: why did you choose Arrow?
> Personally: I researched Arrow because it's a spec for IPC. (My requirement was: "wrap

Javascript object => Arrow Table

2021-06-10 Thread Lana Ramjit
Hi all, As the subject line indicates, I'm wondering if there's a library-standard way of converting a JavaScript object to an Arrow table. Specifically, I'm trying to get interoperability between knex <=> Arrow, since knex returns an array of objects. I found this issue from a couple years ago

Re: [Discuss] Handling timezones in (C++) compute kernels for timestamp data

2021-06-10 Thread Antoine Pitrou
On Thu, 10 Jun 2021 17:33:23 +0200 Joris Van den Bossche wrote:
> We just merged a PR to add some kernels to extract fields from timestamps (year, month, day, hour, etc -> ARROW-11759). But once you start with kernels for timestamp data, you

[Discuss] Handling timezones in (C++) compute kernels for timestamp data

2021-06-10 Thread Joris Van den Bossche
Hi all, There was recently a discussion on the interpretation of the spec about the "timezone" field of timestamp type (and different timestamp-related types that Arrow should have). See

Re: post-release tasks (4.0.1)

2021-06-10 Thread Krisztián Szűcs
On Thu, Jun 10, 2021 at 6:57 AM Jorge Cardoso Leitão wrote:
> I have been unable to generate the docs from either of my two machines (my MacBook and a VM on Azure), and I do not think we should delay this further. Could someone kindly create a PR with the generated docs to the website?
Hi!

Re: Complex Number support in Arrow

2021-06-10 Thread Simon Perkins
On Wed, Jun 9, 2021 at 7:56 PM Antoine Pitrou wrote:
> On 09/06/2021 at 17:52, Micah Kornfield wrote:
> > Adding a new first-class type in Arrow requires working integration tests between C++ and Java libraries (once the idea is informally agreed upon) and then a final vote for

Re: Complex Number support in Arrow

2021-06-10 Thread Simon Perkins
On Wed, Jun 9, 2021 at 11:25 PM Wes McKinney wrote:
> I think that having a top-level type for complex numbers would be nicer than extension types
Agreed. As Micah mentioned, adding these types doesn't seem to interfere with any existing protocol, so I'd like to take this approach going forward.

Re: Complex Number support in Arrow

2021-06-10 Thread Antoine Pitrou
On 10/06/2021 at 09:20, Simon Perkins wrote:
> Ah so Arrow Structs are represented as a Struct of Arrays (SoA) vs an Array of Structs (AoS)?
If you are not familiar with the Arrow format, I would suggest you start by reading https://arrow.apache.org/docs/format/Columnar.html (see "Struct

Re: Complex Number support in Arrow

2021-06-10 Thread Simon Perkins
Hi Micah
> Please see a recent discussion on adding new types [1]
Thanks, this is useful.
> My understanding is that feather.fbs is for V1 feather files and probably shouldn't be touched. Only updating schema.fbs should be required and the type should be doable in a backwards/forwards

Re: Delta Lake support for DataFusion

2021-06-10 Thread Jorge Cardoso Leitão
Hi, I agree with all of you. ^_^ I created https://github.com/apache/arrow-datafusion/issues/533 to track this. I tried to encapsulate the three main use-cases for the SQL extension. Feel free to edit at will. Best, Jorge On Thu, Jun 10, 2021 at 8:37 AM QP Hou wrote: > Thanks Daniël for

Re: Complex Number support in Arrow

2021-06-10 Thread Antoine Pitrou
On Wed, 9 Jun 2021 15:34:41 -0700 Micah Kornfield wrote:
> Hi Antoine,
> In regards to conceptual simplicity, I might have misinterpreted when you wrote:
> > Since complex numbers are quite common in some domains, and since they are conceptually simple,
> It seemed like a

Re: Delta Lake support for DataFusion

2021-06-10 Thread QP Hou
Thanks Daniël for starting the discussion! Looks like we are on the same page to take this as an opportunity to make datafusion more extensible :) I think Neville and Daniël nailed the biggest missing piece at the moment: being able to extend SQL parser and planner with new syntaxes and map them