There are two examples: one in DataFusion [1] and one in Python [2].
In DataFusion, the performance is the same because the UDF is compiled as
Rust. It can even be compiled with SIMD intrinsics.
In Python, it depends on what is used inside the UDF:
* If only pyarrow.compute
On 5/20/21 4:15 AM, Rares Vernica wrote:
Hello,
I'm using Arrow for accessing data outside the SciDB database engine. It
generally works fine but we are running into Segmentation Faults in a
corner multi-threaded case. I identified two threads that work on the same
Record Batch. I wonder if
Is there a better (safer) way of accessing a specific Int64 cell in a
RecordBatch? Currently I'm doing something like this:
std::static_pointer_cast<arrow::Int64Array>(batch->column(i))->raw_values()[j]
On Wed, May 19, 2021 at 3:09 PM Rares Vernica wrote:
> > /opt/rh/devtoolset-3/root/usr/bin/g++ -v
> Using
> /opt/rh/devtoolset-3/root/usr/bin/g++ -v
Using built-in specs.
COLLECT_GCC=/opt/rh/devtoolset-3/root/usr/bin/g++
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-3/root/usr/libexec/gcc/x86_64-redhat-linux/4.9.2/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure
What compiler / glibc version are you using?
arrow::SimpleRecordBatch::column does some non-trivial caching which
uses std::atomic_load[1] which is not implemented properly on gcc < 5
so our behavior is different depending on the compiler version.
[1]
Hello,
I'm using Arrow for accessing data outside the SciDB database engine. It
generally works fine but we are running into Segmentation Faults in a
corner multi-threaded case. I identified two threads that work on the same
Record Batch. I wonder if there is something internal about RecordBatch
> I would recommend writing such tests in Python, such as is already done
> for the CSV reader.
Agreed, that is my current thinking as well.
> I'm not sure what you have in mind. You're intending to run this test
> 40k minutes per day?
40k minutes per month. 24 hours * 60 minutes * 30 days ~
Apologies for missing the call. I looked into Google Meet settings and it
does not seem possible with the free version to have more than one
organizer, so there is no way to let people join if the organizer is not
there. Only people that are on the invite list can join.
Perhaps we should find a
Hi Hendrik,
If you want to drive this, I think the next step would be to propose a
design and gather consensus on it.
-Micah
On Wed, May 12, 2021 at 11:01 AM Hendrik Makait wrote:
> Having a way to encode sorting (and distribution) information is something
> I'd also be very interested in. If
Attendees:
---
Andrew Lamb
Jorge Leitao
Fernando Herrera
Ruan Pearce-Authers
Jorn Horstmann
Ben Blodgett
Paddy Horan
Tyler Reid
Discussions:
---
Update on the Arrow release process
Discussed some upcoming DataFusion proposals, such as a sorted stream
operator and row group pruning
On Tue, May 18, 2021 at 11:58 PM Antoine Pitrou wrote:
>
>
> On 19/05/2021 at 03:28, Arun Sharma wrote:
>
> > Say we're talking arrow + datafusion (which is written in Rust). It
> > sounded like your goal is to ensure that users of different language
> > ecosystems get the same performance
I read the invariants doc and the field output doc again, and they all
make sense to me. Thanks, QP.
On Wed, May 19, 2021 at 3:09 AM QP Hou wrote:
> Hi all,
>
> Following up on this.
>
> We have updated the output schema doc [1] and updated invariant doc
> [2] for the final round of review.
>
Arrow Build Report for Job nightly-2021-05-19-0
All tasks:
https://github.com/ursacomputing/crossbow/branches/all?query=nightly-2021-05-19-0
Failed Tasks:
- conda-osx-clang-py38:
URL:
Another approach that could reduce the number of heavy tests that we have
to write (if the tests are written in Python) might be to drive the code to
interleave in the ways we feel might introduce problems. Such an approach
can be performed by introducing explicit breakpoints in the code and
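
The idea of steering threads into one specific interleaving with explicit
breakpoints can be sketched as follows. This is only an illustration in
plain Python, not code from Arrow: the Counter class, the after_read hook,
and run_forced_interleaving are all hypothetical names made up for this
example. Two threads perform a non-atomic read-modify-write, and
threading.Event objects act as the breakpoints that pause thread A between
its read and its write until thread B has finished.

```python
import threading


class Counter:
    """Hypothetical counter with a deliberately racy read-modify-write."""

    def __init__(self):
        self.value = 0

    def increment(self, after_read=None):
        current = self.value        # step 1: read the shared value
        if after_read is not None:
            after_read()            # explicit breakpoint hook (assumption)
        self.value = current + 1    # step 2: write back a possibly stale value


def run_forced_interleaving():
    """Force the order: A reads, B reads and writes, A writes."""
    counter = Counter()
    a_has_read = threading.Event()
    b_is_done = threading.Event()

    def breakpoint_in_a():
        a_has_read.set()            # tell B that A now holds a stale read
        b_is_done.wait(timeout=5)   # pause A until B has finished

    def thread_a():
        counter.increment(after_read=breakpoint_in_a)

    def thread_b():
        a_has_read.wait(timeout=5)  # do not start until A has read
        counter.increment()
        b_is_done.set()

    threads = [threading.Thread(target=thread_a),
               threading.Thread(target=thread_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


# Two increments ran, but the forced interleaving loses one update.
lost_update_result = run_forced_interleaving()
print(lost_update_result)  # prints 1, not 2
```

Because the events fix the order of the steps, the lost update shows up on
every run instead of only occasionally under load, which is what makes
this style of test cheaper than a long-running stress test.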
Hi all,
Following up on this.
We have updated the output schema doc [1] and updated invariant doc
[2] for the final round of review.
In the updated invariant doc, the main change we introduced compared
to the previous version is as follows:
We now enforce strict schema equality in all plan
On 19/05/2021 at 07:37, Weston Pace wrote:
> I spoke a while ago about working on a multithreaded stress test
> suite. I have put together some very early details[1]. I would
> appreciate any feedback.
I would recommend writing such tests in Python, such as is already done
for the CSV reader.
On 19/05/2021 at 03:28, Arun Sharma wrote:
On Tue, May 18, 2021 at 5:37 PM Wes McKinney wrote:
You just sent this same e-mail 24 hours ago. I think the problems we
are solving are different. We are addressing language siloing at the
data level and the shared-computing-libraries level. I