Re: Language silos and transpilers

2021-05-19 Thread Jorge Cardoso Leitão
There are two examples: one in DataFusion [1] and one in Python [2]. In DataFusion, the performance is the same because the UDF is compiled as Rust. It can even be compiled with SIMD intrinsics. In Python, it depends on what is used inside the UDF: * If only pyarrow.compute

Re: C++ RecordBatch Debugging Segmentation Fault

2021-05-19 Thread Yibo Cai
On 5/20/21 4:15 AM, Rares Vernica wrote: Hello, I'm using Arrow for accessing data outside the SciDB database engine. It generally works fine but we are running into Segmentation Faults in a corner multi-threaded case. I identified two threads that work on the same Record Batch. I wonder if

Re: C++ RecordBatch Debugging Segmentation Fault

2021-05-19 Thread Rares Vernica
Is there a better (safer) way of accessing a specific Int64 cell in a RecordBatch? Currently I'm doing something like this: std::static_pointer_cast<arrow::Int64Array>(batch->column(i))->raw_values()[j] On Wed, May 19, 2021 at 3:09 PM Rares Vernica wrote: > > /opt/rh/devtoolset-3/root/usr/bin/g++ -v > Using

Re: C++ RecordBatch Debugging Segmentation Fault

2021-05-19 Thread Rares Vernica
> /opt/rh/devtoolset-3/root/usr/bin/g++ -v Using built-in specs. COLLECT_GCC=/opt/rh/devtoolset-3/root/usr/bin/g++ COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-3/root/usr/libexec/gcc/x86_64-redhat-linux/4.9.2/lto-wrapper Target: x86_64-redhat-linux Configured with: ../configure

Re: C++ RecordBatch Debugging Segmentation Fault

2021-05-19 Thread Weston Pace
What compiler / glibc version are you using? arrow::SimpleRecordBatch::column does some non-trivial caching that uses std::atomic_load [1], which is not implemented properly on gcc < 5, so our behavior differs depending on the compiler version. [1]
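
The caching pattern described above can be sketched with the standard library alone. This is only an illustration of lazy caching through the std::atomic_load / std::atomic_store overloads for std::shared_ptr, not Arrow's actual implementation: Column, LazyBatch, and count_consistent_reads are hypothetical names invented for this example. Note that C++20 deprecates these free-function overloads in favor of std::atomic<std::shared_ptr<T>>.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for an object that is expensive to build.
struct Column {
  explicit Column(int index) : index(index) {}
  int index;
};

// Hypothetical holder that builds each column on first access and caches it.
// This is NOT arrow::SimpleRecordBatch, only a sketch of the same idea.
class LazyBatch {
 public:
  explicit LazyBatch(int num_columns) : cache_(num_columns) {}

  std::shared_ptr<Column> column(int i) const {
    // Atomically read the cached pointer; it is null before the first build.
    std::shared_ptr<Column> cached = std::atomic_load(&cache_[i]);
    if (cached) return cached;
    // Several threads may build concurrently; each stores a valid object,
    // so readers always see either null or a fully constructed column.
    std::shared_ptr<Column> built = std::make_shared<Column>(i);
    std::atomic_store(&cache_[i], built);
    return built;
  }

 private:
  mutable std::vector<std::shared_ptr<Column>> cache_;
};

// Reads column 0 from several threads and counts how many saw index 0.
int count_consistent_reads(int num_threads) {
  LazyBatch batch(1);
  std::vector<int> seen(num_threads, -1);
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back(
        [&batch, &seen, t] { seen[t] = batch.column(0)->index; });
  }
  for (auto& th : threads) th.join();
  int ok = 0;
  for (int v : seen) {
    if (v == 0) ++ok;
  }
  return ok;
}
```

If the atomic overloads are missing or broken in a given standard library, a plain read of cache_[i] racing with a write is a data race, which is one way a crash like the one reported in this thread could arise.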

C++ RecordBatch Debugging Segmentation Fault

2021-05-19 Thread Rares Vernica
Hello, I'm using Arrow for accessing data outside the SciDB database engine. It generally works fine but we are running into Segmentation Faults in a corner multi-threaded case. I identified two threads that work on the same Record Batch. I wonder if there is something internal about RecordBatch

Re: [Discuss] [Proposal] [C++] Arrow multithreaded stress test suite

2021-05-19 Thread Weston Pace
> I would recommend writing such tests in Python, such as is already done > for the CSV reader. Agreed, that is my current thinking as well. > I'm not sure what you have in mind. You're intending to run this test > 40k minutes per day? 40k minutes per month. 24 hours * 60 minutes * 30 days ~

Re: Notes from Rust / Arrow sync from May 19, 2021

2021-05-19 Thread Andy Grove
Apologies for missing the call. I looked into Google Meet settings and it does not seem possible with the free version to have more than one organizer, so there is no way to let people join if the organizer is not there. Only people that are on the invite list can join. Perhaps we should find a

Re: [Discuss] Storing metadata about the "sortedness" of data

2021-05-19 Thread Micah Kornfield
Hi Hendrik, If you want to drive this, I think the next step would be to propose a design and gather consensus on it. -Micah On Wed, May 12, 2021 at 11:01 AM Hendrik Makait wrote: > Having a way to encode sorting (and distribution) information is something > I'd also be very interested in. If

Notes from Rust / Arrow sync from May 19, 2021

2021-05-19 Thread Andrew Lamb
Attendees: --- Andrew Lamb Jorge Leitao Fernando Herrera Ruan Pearce-Authers Jorn Horstmann Ben Blodgett Paddy Horan Tyler Reid Discussions: --- Update on the Arrow release process Discussed some upcoming datafusion proposals such as sorted stream operator and Row group pruning

Re: Language silos and transpilers

2021-05-19 Thread Arun Sharma
On Tue, May 18, 2021 at 11:58 PM Antoine Pitrou wrote: > > > Le 19/05/2021 à 03:28, Arun Sharma a écrit : > > > Say we're talking arrow + datafusion (which is written in Rust). It > > sounded like your goal is to ensure that users of different language > > ecosystems get the same performance

Re: [DataFusion] [Discuss] Output Schema for queries with multiple relations

2021-05-19 Thread Andrew Lamb
I read the invariants doc and field output doc again and I think they all make sense to me. Thanks QP On Wed, May 19, 2021 at 3:09 AM QP Hou wrote: > Hi all, > > Following up on this. > > We have updated the output schema doc [1] and updated invariant doc > [2] for the final round of review. >

[NIGHTLY] Arrow Build Report for Job nightly-2021-05-19-0

2021-05-19 Thread Crossbow
Arrow Build Report for Job nightly-2021-05-19-0 All tasks: https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-19-0 Failed Tasks: - conda-osx-clang-py38: URL:

Re: [Discuss] [Proposal] [C++] Arrow multithreaded stress test suite

2021-05-19 Thread Alessandro Molina
Another approach that could reduce the amount of heavy tests that we have to write (if the tests are written in Python) might be to drive the code to interleave in the ways we feel might introduce problems. Such an approach can be performed by introducing explicit breakpoints in the code and
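
One way to picture forcing a chosen interleaving is a small gate that threads wait on at explicit points. The sketch below is an assumption-laden illustration in C++ rather than the Python tests discussed in this thread, and StepGate and run_forced_interleaving are hypothetical names, not part of Arrow or of any proposal here.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical helper: a thread blocks at a numbered step until the gate
// has been advanced that far, which pins down the execution order.
class StepGate {
 public:
  // Block until the gate has reached at least `step`.
  void wait_for(int step) {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [&] { return current_ >= step; });
  }

  // Move to the next step and wake every waiting thread.
  void advance() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      ++current_;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  int current_ = 0;
};

// Forces the order: first half of the writer, the reader, second half of
// the writer. The trace is only touched by the thread whose turn it is,
// and the gate's mutex orders those accesses.
std::string run_forced_interleaving() {
  StepGate gate;
  std::string trace;
  std::thread writer([&] {
    trace += "W1 ";
    gate.advance();    // step 1: hand control to the reader
    gate.wait_for(2);  // explicit pause point inside the "write"
    trace += "W2";
  });
  std::thread reader([&] {
    gate.wait_for(1);  // do not start until the writer is half done
    trace += "R ";
    gate.advance();    // step 2: let the writer finish
  });
  writer.join();
  reader.join();
  return trace;
}
```

A test built this way exercises one specific interleaving on every run, instead of relying on long stress runs to hit it by chance.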

Re: [DataFusion] [Discuss] Output Schema for queries with multiple relations

2021-05-19 Thread QP Hou
Hi all, Following up on this. We have updated the output schema doc [1] and updated invariant doc [2] for the final round of review. In the updated invariant doc, the main change we introduced compared to the previous version is as follows: We now enforce strict schema equality in all plan

Re: [Discuss] [Proposal] [C++] Arrow multithreaded stress test suite

2021-05-19 Thread Antoine Pitrou
Le 19/05/2021 à 07:37, Weston Pace a écrit : I spoke a while ago about working on a multithreaded stress test suite. I have put together some very early details[1]. I would appreciate any feedback. I would recommend writing such tests in Python, such as is already done for the CSV reader.

Re: Language silos and transpilers

2021-05-19 Thread Antoine Pitrou
Le 19/05/2021 à 03:28, Arun Sharma a écrit : On Tue, May 18, 2021 at 5:37 PM Wes McKinney wrote: You just sent this same e-mail 24 hours ago. I think the problems we are solving are different. We are addressing language siloing at the data level and the shared-computing-libraries level. I