Re: Out of band pickling in Python (pickle5)

2022-09-19 Thread Brian Hulette via dev
I got to thinking about this again and ran some benchmarks. The result is documented in the GitHub issue [1]. tl;dr: we can't realize a huge benefit since we don't actually have an out-of-band path for exchanging the buffers. However, pickle 5 can yield improved in-band performance as well, and I
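For readers unfamiliar with the distinction drawn above, here is a minimal sketch of in-band versus out-of-band pickling with protocol 5, using only the standard library. It is not taken from the benchmarks in the linked issue; the payload size and variable names are illustrative assumptions.

```python
import pickle

# Illustrative payload (an assumption, not data from the benchmarks).
data = bytearray(b"x" * 1_000_000)

# In-band: the buffer contents are embedded in the pickle stream itself.
in_band = pickle.dumps(pickle.PickleBuffer(data), protocol=5)

# Out-of-band: buffer_callback receives each PickleBuffer. Because
# list.append returns None (a false value), the buffer is left out of the
# stream and must be shipped to the receiver by some other channel.
buffers = []
out_of_band = pickle.dumps(
    pickle.PickleBuffer(data), protocol=5, buffer_callback=buffers.append
)

# The receiver has to supply the same buffers, in the same order.
restored = pickle.loads(out_of_band, buffers=buffers)

print(len(in_band), len(out_of_band), len(buffers))
```

The out-of-band stream stays tiny only because the buffers travel separately, which is exactly the path the message says is missing.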

beam.Create(range(N)) without building a sequence in memory

2022-09-19 Thread Stephan Hoyer via dev
Many of my Beam pipelines start with partitioning over some large, statically known number of inputs that could be created from a list of sequential integers. In Python, these sequential integers can be efficiently represented with a range() object, which stores the start/stop and interval.
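A quick standard-library illustration of why a range() object is cheap to hold: it records only its start, stop, and step, and computes elements on demand. The value of N below is an arbitrary assumption, and the snippet does not show how beam.Create itself treats its argument.

```python
import sys

N = 10**9  # arbitrary large count, chosen for illustration
r = range(N)

# The object stores three integers, regardless of N.
print(r.start, r.stop, r.step)

# Its size is constant, whereas list(r) would need memory for every element.
size_in_bytes = sys.getsizeof(r)

# Length, indexing, and slicing work without materializing the sequence.
last = r[N - 1]
every_thousandth = r[::1000]  # still a range object
```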

Re: Cartesian product of PCollections

2022-09-19 Thread Robert Bradshaw via dev
On Mon, Sep 19, 2022 at 1:53 PM Stephan Hoyer wrote: > > > My team has an internal implementation of a CartesianProduct transform, based on using hashing to split a pcollection into a finite number of groups and CoGroupByKey. > > Could this be contributed to Beam?

Re: Cartesian product of PCollections

2022-09-19 Thread Stephan Hoyer via dev
> > My team has an internal implementation of a CartesianProduct transform, based on using hashing to split a pcollection into a finite number of groups and CoGroupByKey. > Could this be contributed to Beam? If it would be of broader interest, I would be happy to work on this for
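The internal transform is not shown in this thread, so the following is only a guess at the general idea, simulated in plain Python rather than Beam. Each element is hashed into one of a fixed number of groups and replicated across the other side's groups, so every (left group, right group) key sees exactly the elements it needs; the dictionary stands in for what CoGroupByKey would do. The function names and group counts are hypothetical.

```python
import itertools
import zlib
from collections import defaultdict


def bucket(value, num_buckets):
    # Stable hash; Python's built-in hash() is salted per process for str.
    return zlib.crc32(repr(value).encode("utf-8")) % num_buckets


def cartesian_product_by_groups(left, right, left_groups=3, right_groups=2):
    # Hypothetical helper: key (i, j) collects left elements hashed to
    # group i and right elements hashed to group j.
    grouped = defaultdict(lambda: ([], []))
    for x in left:
        i = bucket(x, left_groups)
        for j in range(right_groups):  # replicate across right groups
            grouped[(i, j)][0].append(x)
    for y in right:
        j = bucket(y, right_groups)
        for i in range(left_groups):  # replicate across left groups
            grouped[(i, j)][1].append(y)
    # Each key is independent, so a runner could process keys in parallel.
    for xs, ys in grouped.values():
        yield from itertools.product(xs, ys)


pairs = sorted(cartesian_product_by_groups(range(5), "abc"))
```

Every pair is produced exactly once, because a given x and y meet only under the key formed by their two group indices.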

Re: Cartesian product of PCollections

2022-09-19 Thread Robert Bradshaw via dev
If one of your inputs fits into memory, using side inputs is definitely the way to go. If neither side fits into memory, the cross product may be prohibitively large to compute even on a distributed computing platform (a billion times a billion is big, though I suppose one may hit memory limits

Re: Cartesian product of PCollections

2022-09-19 Thread Brian Hulette via dev
In SQL we just don't support cross joins currently [1]. I'm not aware of an existing implementation of a cross join/cartesian product. > My team has an internal implementation of a CartesianProduct transform, based on using hashing to split a pcollection into a finite number of groups and

Cartesian product of PCollections

2022-09-19 Thread Stephan Hoyer via dev
I'm wondering if it would make sense to have a built-in Beam transformation for calculating the Cartesian product of PCollections. Just this past week, I've encountered two separate cases where calculating a Cartesian product was a bottleneck. The in-memory option of using something like Python's

Beam High Priority Issue Report (74)

2022-09-19 Thread beamactions
This is your daily summary of Beam's current high priority issues that may need attention. See https://beam.apache.org/contribute/issue-priorities for the meaning and expectations around issue priorities. Unassigned P1 Issues: https://github.com/apache/beam/issues/23179 [Bug]: Parquet