Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-06-27 Thread Neville Li
Ping again. Any chance someone takes a look to get this thing going? It's just a design doc and basic metadata/IO impl. We're not talking about actual source/sink code yet (already done but saved for future PRs). On Fri, Jun 21, 2019 at 1:38 PM Ahmet Altay wrote: > Thank you Claire, this looks p

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-06-27 Thread Neville Li
cer side, by pre-grouping/sorting data and writing to bucket/shard output files, the consumer can sort/merge matching ones without a CoGBK. Essentially we're paying the shuffle cost upfront to avoid paying it repeatedly in each consumer pipeline that wants to join data. > Thanks, > Cham >
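The idea in this snippet can be sketched in a few lines of plain Python. This is only an illustration of the general sort-merge-bucket technique, not the Beam implementation from the design doc: the names `NUM_BUCKETS`, `bucket_id`, `write_buckets`, `merge_join_bucket` and `smb_join` are hypothetical, in-memory lists stand in for bucket files, and CRC32 is an assumed hash function chosen only because it is stable across runs.

```python
import zlib
from itertools import groupby
from operator import itemgetter

NUM_BUCKETS = 4  # hypothetical; both datasets must share bucket count and hash


def bucket_id(key: str) -> int:
    # Stable hash, so every producer assigns a given key to the same bucket.
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def write_buckets(records):
    """Producer side: pay the shuffle once by bucketing and sorting by key."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, value in records:
        buckets[bucket_id(key)].append((key, value))
    return [sorted(b, key=itemgetter(0)) for b in buckets]


def key_groups(bucket):
    """Yield (key, [values]) for each run of equal keys in a sorted bucket."""
    for key, group in groupby(bucket, key=itemgetter(0)):
        yield key, [value for _, value in group]


def merge_join_bucket(left, right):
    """Consumer side: inner-join two key-sorted buckets in one linear pass."""
    lgroups, rgroups = key_groups(left), key_groups(right)
    l, r = next(lgroups, None), next(rgroups, None)
    out = []
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(lgroups, None)
        elif l[0] > r[0]:
            r = next(rgroups, None)
        else:
            out.append((l[0], l[1], r[1]))
            l, r = next(lgroups, None), next(rgroups, None)
    return out


def smb_join(left_buckets, right_buckets):
    """Join bucket i of one dataset with bucket i of the other; no regrouping."""
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        out.extend(merge_join_bucket(lb, rb))
    return out
```

Because matching keys always land in the same bucket index and each bucket is already sorted, the consumer only needs a linear merge per bucket pair, which is the part that replaces the grouping step in each downstream join.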

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-01 Thread Neville Li
a way to move forward. On Thu, Jun 27, 2019 at 4:39 PM Neville Li wrote: > Thanks. I responded to comments in the doc. More inline. > > On Thu, Jun 27, 2019 at 2:44 PM Chamikara Jayalath > wrote: > >> Thanks added few comments. >> >> If I understood correctly, y

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-13 Thread Neville Li
such a major piece of work I don't > want it to sit with everyone thinking they are waiting on someone else, or > any such thing. (not saying this is happening, just pinging to be sure) > > Kenn > > On Mon, Jul 1, 2019 at 1:09 PM Neville Li wrote: > >> Updated t

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-15 Thread Neville Li
is >> promised to be in key order)) or support a single SMB aka >> "PreGrouping" source/sink pair that's always used together (and whose >> underlying format is not necessarily public). >> >> On Sat, Jul 13, 2019 at 3:19 PM Neville Li wrote: >> >

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-15 Thread Neville Li
ror the existing IO ones (from an API > perspective--how much implementation it makes sense to share is an > orthogonal issue that I'm sure can be worked out.) > > On Mon, Jul 15, 2019 at 4:18 PM Neville Li wrote: > > > > Hi Robert, > > > > I agree, it'd

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-16 Thread Neville Li
a lot of classes that are > nested (non-static) or non-public. I can understand why they were made > non-public, it's a hard abstraction to design well and keep compatibility. > As Neville mentioned, decoupling readers and writers would not only benefit > for this propo

Sort Merge Bucket - Action Items

2019-07-19 Thread Neville Li
e of the > logic (for example compression, temp file handling) that is already > implemented in Beam FileIO/WriteFiles transforms in your SMB sink transform. > >> >>> > >> >>> For reader, you are right that there's no FileIO.Read. What we have > are

Re: Sort Merge Bucket - Action Items

2019-07-22 Thread Neville Li
ts across files within a bucket and TBH I'm not even sure where to start. I'll file separate PRs for core changes needed for discussion. WDYT? On Mon, Jul 22, 2019 at 4:20 AM Robert Bradshaw wrote: > On Fri, Jul 19, 2019 at 5:16 PM Neville Li wrote: > > > > For

Re: Sort Merge Bucket - Action Items

2019-07-23 Thread Neville Li
Kirpichov >> wrote: >> > >> > On Mon, Jul 22, 2019 at 7:49 AM Robert Bradshaw >> wrote: >> >> >> >> On Mon, Jul 22, 2019 at 4:04 PM Neville Li >> wrote: >> >> > >> >> > Thanks Robert. Agree with the FileIO p

Re: Sort Merge Bucket - Action Items

2019-07-24 Thread Neville Li
3, 2019 at 6:36 PM Neville Li wrote: > So I spent one afternoon trying some ideas for reusing the last few > transforms WriteFiles. > > WriteShardsIntoTempFilesFn extends DoFn*, > Iterable>, *FileResult*> > => GatherResults extends PTransform, > PCollection>> >

Re: [Proposal] Sharing Neville's post and upcoming meetups in the Twitter handle

2017-10-23 Thread Neville Li
Hi all, Part 2 is out: https://labs.spotify.com/2017/10/23/big-data-processing-at-spotify-the-road-to-scio-part-2/ We also have a meetup in Stockholm later today: https://www.meetup.com/stockholm-hug/events/244112281/ On Mon, Oct 23, 2017 at 3:03 PM Ismaël Mejía wrote: > Has anybody thought ab

Scio 0.5.0-alpha2 released

2018-01-29 Thread Neville Li
Hi all, We just released Scio 0.5.0-alpha2. This is mostly a bug fix release. We'll probably have one or 2 beta releases with the upcoming Beam 2.3.0. Stay tuned! Cheers, Neville https://github.com/spotify/scio/releases/tag/v0.5.0-alpha2 Breaking changes - BigQueryIO in JobTest#output now r

Re: [VOTE] Release 2.3.0, release candidate #3

2018-02-12 Thread Neville Li
I don't see a beam-sdks-java-io-hadoop-input-format artifact in the staging repo, but the Maven module still exists: https://github.com/apache/beam/tree/v2.3.0-RC3/sdks/java/io/hadoop-input-format Was it not published by mistake? We still have code that depends on this. On Mon, Feb 12, 2018 at 3: