Hi Thomas, thanks for the update.
Stupid question: right now, most of the IO Sink use a simple DoFn (to write the PCollection elements).
Does it mean that, when possible, we should use Sink abstract class, with a Writer that can write bundles (using open(), write(), close() methods) ?
In that case, will the Window create the bundle ? Regards JB On 09/19/2016 10:46 PM, Thomas Groh wrote:
The model doesn't support dynamically creating PCollections, so the proposed transform producing a Bounded PCollection from a window of an Unbounded PCollection means that you end up with a Bounded Pipeline - in which case it is preferable to use a Bounded Read instead. As you're reading from an unbounded input, it may be necessary to write the windowed values to a sink that supports continuous updates, from which you can execute a Bounded Pipeline over the appropriate set of input instead of on some arbitrary chunk of records (e.g. the window). The current Write implementation works effectively only for sinks that can assume they can do something exactly once and finish. This is not an assumption that can be made within arbitrary pipelines. Sinks that currently only work for Bounded inputs are generally written with assumptions that mean they do not work in the presence of multiple trigger firings, late data, or across multiple windows, and handling these facets of the model requires different behavior or configuration from different sinks. On Mon, Sep 19, 2016 at 9:53 AM, Emanuele Cesena <emanu...@shopkick.com <mailto:emanu...@shopkick.com>> wrote: Oh yes, even better I’d say. Best, > On Sep 19, 2016, at 9:48 AM, Jean-Baptiste Onofré <j...@nanthrax.net <mailto:j...@nanthrax.net>> wrote: > > Hi Emanuele, > > +1 to support Unbounded sink, but also, a very convenient function would be a Window to create a bounded collection as a subset of a unbounded collection. > > Regards > JB > > On 09/19/2016 05:59 PM, Emanuele Cesena wrote: >> Hi, >> >> This is a great insight. Is there any plan to support unbounded sink in Beam? >> >> On the temp kafka->kafka solution, this is exactly what we’re doing (and I wish to change). We have production stream pipelines that are kafka->kafka. Then we have 2 main use cases: kafka connect to dump into hive and go batch from there, and druid for real time reporting. >> >> However this makes prototyping really slow, and I wanted to introduce Beam to short cut from kafka to anywhere. >> >> Best, >> >> >>> On Sep 18, 2016, at 10:38 PM, Aljoscha Krettek <aljos...@apache.org <mailto:aljos...@apache.org>> wrote: >>> >>> Hi, >>> right now, writing to a Beam "Sink" is only supported for bounded streams, as you discovered. An unbounded stream cannot be transformed to a bounded stream using a window, this will just "chunk" the stream differently but it will still be unbounded. >>> >>> The options you have right now for writing are to write to your external datastore using a DoFn, using KafkaIO to write to a Kafka topic or to use UnboundedFlinkSink to wrap a Flink Sink for use in a Beam pipeline. The latter would allow you to use, for example, BucketingSink or RollingSink from Flink. I'm only mentioning UnboundedFlinkSink for completeness, I would not recommend using it since your program will only work on the Flink runner. The way to go, IMHO, would be to write to Kafka and then take the data from there and ship it to some final location such as HDFS. >>> >>> Cheers, >>> Aljoscha >>> >>> On Sun, 18 Sep 2016 at 23:17 Emanuele Cesena <emanu...@shopkick.com <mailto:emanu...@shopkick.com>> wrote: >>> Thanks I’ll look into it, even if it’s not really the feature I need (exactly because it will stop execution). >>> >>> >>>> On Sep 18, 2016, at 2:11 PM, Chawla,Sumit <sumitkcha...@gmail.com <mailto:sumitkcha...@gmail.com>> wrote: >>>> >>>> Hi Emanuele >>>> >>>> KafkaIO supports withMaxNumRecords(X) support which will create a bounded source from Kafka. However, the pipeline will finish once X number of records are read. >>>> >>>> Regards >>>> Sumit Chawla >>>> >>>> >>>> On Sun, Sep 18, 2016 at 2:00 PM, Emanuele Cesena <emanu...@shopkick.com <mailto:emanu...@shopkick.com>> wrote: >>>> Hi, >>>> >>>> Thanks for the hint - I’ll debug better but I thought I did that: >>>> https://github.com/ecesena/beam-starter/blob/master/src/main/java/com/dataradiant/beam/examples/StreamWordCount.java#L140 <https://github.com/ecesena/beam-starter/blob/master/src/main/java/com/dataradiant/beam/examples/StreamWordCount.java#L140> >>>> >>>> Best, >>>> >>>> >>>>> On Sep 18, 2016, at 1:54 PM, Jean-Baptiste Onofré <j...@nanthrax.net <mailto:j...@nanthrax.net>> wrote: >>>>> >>>>> Hi Emanuele >>>>> >>>>> You have to use a window to create a bounded collection from an unbounded source. >>>>> >>>>> Regards >>>>> JB >>>>> >>>>> On Sep 18, 2016, at 21:04, Emanuele Cesena <emanu...@shopkick.com <mailto:emanu...@shopkick.com>> wrote: >>>>> Hi, >>>>> >>>>> I wrote a while ago about a simple example I was building to test KafkaIO: >>>>> https://github.com/ecesena/beam-starter/blob/master/src/main/java/com/dataradiant/beam/examples/StreamWordCount.java <https://github.com/ecesena/beam-starter/blob/master/src/main/java/com/dataradiant/beam/examples/StreamWordCount.java> >>>>> >>>>> Issues with Flink should be fixed now, and I’m try to run the example on master and Flink 1.1.2. >>>>> I’m currently getting: >>>>> Caused by: java.lang.IllegalArgumentException: Write can only be applied to a Bounded PCollection >>>>> >>>>> What is the recommended way to go here? >>>>> - is there a way to create a bounded collection from an unbounded one? >>>>> - is there a plat to let TextIO write unbounded collections? >>>>> - is there another recommended “simple sink” to use? >>>>> >>>>> Thank you much! >>>>> >>>>> Best, >>>> >>>> -- >>>> Emanuele Cesena, Data Eng. >>>> http://www.shopkick.com >>>> >>>> Il corpo non ha ideali >>>> >>>> >>>> >>>> >>>> >>> >>> -- >>> Emanuele Cesena, Data Eng. >>> http://www.shopkick.com >>> >>> Il corpo non ha ideali >>> >>> >>> >>> >> > > -- > Jean-Baptiste Onofré > jbono...@apache.org <mailto:jbono...@apache.org> > http://blog.nanthrax.net > Talend - http://www.talend.com -- Emanuele Cesena, Data Eng. http://www.shopkick.com Il corpo non ha ideali
-- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com