Shading issues have been fixed. I believe it should be all good to use now.
On Wed, Jun 21, 2017 at 2:24 PM Eugene Kirpichov <[email protected]> wrote: > Yes, there's a pending PR https://github.com/apache/beam/pull/3360 > > Note that there are some shading issues with the Dataflow streaming runner > support. It passes ValidatesRunner tests, but will likely fail in a real > job :-| I'm working on resolving this with +Kenn Knowles <[email protected]> > right now. > > On Wed, Jun 21, 2017 at 2:18 PM peay <[email protected]> wrote: > >> Thanks, great news! Are there plans to get ProcessContinuation from the >> original proposal into the API? >> >> -------- Original Message -------- >> Subject: Re: Using watermarks with bounded sources >> Local Time: June 20, 2017 7:52 PM >> UTC Time: June 20, 2017 11:52 PM >> From: [email protected] >> >> To: peay <[email protected]>, [email protected] <[email protected]> >> [email protected] <[email protected]> >> >> Hi! >> >> The PR just got submitted. You can play with SDF in Dataflow streaming >> runner now :) Hope it doesn't get rolled back (fingers crossed)... >> >> On Mon, Jun 19, 2017 at 6:06 PM Eugene Kirpichov <[email protected]> >> wrote: >> >>> Hi, >>> The PR is ready and I'm just struggling with setup of tests - Dataflow >>> ValidatesRunner tests currently don't have a streaming execution. >>> I think +Kenn Knowles <[email protected]> was doing something about that, >>> or I might find a workaround. >>> >>> But basically if you want to experiment - if you patch in the PR, you >>> can experiment with SDF in Dataflow in streaming mode. It passes tests >>> against the current production Dataflow Service. >>> >>> >>> On Thu, Jun 15, 2017 at 8:54 AM peay <[email protected]> wrote: >>> >>>> Eugene, would you have an ETA on when splittable DoFn would be >>>> available in Dataflow in batch/streaming mode? I see that >>>> https://github.com/apache/beam/pull/1898 is still active >>>> >>>> I've started to experiment with those using the DirectRunner and this >>>> is a great API. >>>> >>>> Thanks! >>>> >>>> >>>> -------- Original Message -------- >>>> Subject: Re: Using watermarks with bounded sources >>>> >>>> Local Time: April 23, 2017 10:18 AM >>>> UTC Time: April 23, 2017 2:18 PM >>>> From: [email protected] >>>> To: Eugene Kirpichov <[email protected]> >>>> [email protected] <[email protected]> >>>> >>>> Ah, I didn't know about that. This is *really* great -- from a quick >>>> look, the API looks both very natural and very powerful. Thanks a lot for >>>> getting this into Beam! >>>> >>>> I see Flink support seems to have been merged already. Any idea on when >>>> https://github.com/apache/beam/pull/1898 will get merged? >>>> >>>> I see updateWatermark in the API but not in the proposal's examples >>>> which only uses resume/withFutureOutputWatermark. Any reason >>>> why updateWatermark is not called after each output in the examples from >>>> the proposal? I guess that would be "too fined-grained" to update it for >>>> each individual record of a mini-batch? >>>> >>>> In my case with existing hourly files, would `outputElement(01:00 >>>> file), updateWatermark(01:00), outputElement(02:00), >>>> updateWatermark(02:00), ...` be the proper way to output per-hour elements >>>> while gradually moving the watermark forward while going through an >>>> existing list? Or would you instead suggest to still use resume >>>> (potentially with were small timeouts)? >>>> >>>> Thanks, >>>> >>>> -------- Original Message -------- >>>> Subject: Re: Using watermarks with bounded sources >>>> Local Time: 22 April 2017 3:59 PM >>>> UTC Time: 22 April 2017 19:59 >>>> From: [email protected] >>>> To: peay <[email protected]>, [email protected] < >>>> [email protected]> >>>> >>>> Hi! This is an excellent question; don't have time to reply in much >>>> detail right now, but please take a look at >>>> http://s.apache.org/splittable-do-fn - it unifies the concepts of >>>> bounded and unbounded sources, and the use case you mentioned is one of the >>>> motivating examples. >>>> >>>> Also, see recent discussions on pipeline termination semantics: >>>> technically nothing should prevent an unbounded source from saying it's >>>> done "for real" (no new data will appear), just the current UnboundedSource >>>> API does not expose such a method. (but Splittable DoFn does) >>>> >>>> On Sat, Apr 22, 2017 at 11:15 AM peay <[email protected]> wrote: >>>> >>>>> Hello, >>>>> >>>>> A use case I find myself running into frequently is the following: I >>>>> have daily or hourly files, and a Beam pipeline with a small to moderate >>>>> size windows. (Actually, I've just seen that support for per-window files >>>>> support in file based sinks was recently checked in, which is one way to >>>>> get there). >>>>> >>>>> Now, Beam has no clue about the fact that each file corresponds to a >>>>> given time interval. My understanding is that when running the pipeline in >>>>> batch mode with a bounded source, there is no notion watermark and we have >>>>> to load everything because we just don't know. This is pretty wasteful, >>>>> especially as you have to keep a lot of data in memory, while you could in >>>>> principle operate close to what you'd do in streaming mode: first read the >>>>> oldest files, then newest files, moving the watermark forward as you go >>>>> through the input list of files. >>>>> >>>>> I see one way around this. Let's say that I have hourly files and >>>>> let's not assume anything about the order of records within the file to >>>>> keep it simple: I don't want a very precise record-level watermark, but >>>>> more a rough watermark at the granularity of hours. Say we can easily get >>>>> the corresponding time interval from the filename. One can make an >>>>> unbounded source that essentially acts as a "List of bounded file-based >>>>> sources". If there are K splits, split k can read every file that has >>>>> `index % K == k` in the time-ordered list of files. `advance` can advance >>>>> the current file, and move on to the next one if no records were read. >>>>> >>>>> However, as far as I understand, this pipeline will never terminate >>>>> since this is an unbounded source and having the `advance` method of our >>>>> wrapping source return `false` won't make the pipeline terminate. Can >>>>> someone confirm if this is correct? If yes, what would be ways to work >>>>> around that? There's always the option to throw to make the pipeline fail, >>>>> but this is far from ideal. >>>>> >>>>> Thanks, >>>>> >>>> >>>>
