Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Raghu Angadi
On Wed, Oct 19, 2016 at 11:00 AM, Kenneth Knowles wrote: > I wanted to attempt to explicitly answer Raghu's question by saying that I > think Dan's starting points imply that the recommended behavior for start() > and advance() is to be "non-blocking" in the sense that they return quickly > if in

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Kenneth Knowles
I wanted to pull out the sub-thread that isn't about testing, parapharased: Amit: "Dan laid out these points: readers should return ASAP, runners may poll as they see fit [including quickly if they think the reader is in start-up time], runners need to be OK with startup delay" Raghu: "What is the

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Jean-Baptiste Onofré
Hi FYI when working on IO I already setup a docker image that I'm using for integration test. The IO unit tests embed and bootstrap the IO resources when possible. For instance JmsIO unit tests start a embedded ActiveMQ broker. However I also have a ActiveMQ docker image that I use for integra

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Thomas Weise
Hadoop FS has the local file system implementation that can be used for testing ("file" URL, no service needed). Thanks On Wed, Oct 19, 2016 at 10:43 AM, Amit Sela wrote: > Oh cool, that didn't exist in 0.8 I think, but anything that is Kafka > native is best. > I'm pretty sure there's an embed

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Raghu Angadi
It will be very useful for existing KafkaIOTest as well. MockConsumer we use is too primitive. ~ 50% of KafkaIOTest deals with MockConsumer. On Wed, Oct 19, 2016 at 10:43 AM, Amit Sela wrote: > Oh cool, that didn't exist in 0.8 I think, but anything that is Kafka > native is best. > I'm pretty s

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Amit Sela
Oh cool, that didn't exist in 0.8 I think, but anything that is Kafka native is best. I'm pretty sure there's an embedded HDFS for testing as well. While embedded Kafka/HDFS won't reflect "real-life" distributed environment, it could be a good place to start and provide some basic functional testi

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Satish Duggana
https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/integration/utils/EmbeddedKafkaCluster.java This is currently used in one of our repos and it comes as part of one of kafka libs. On Wed, Oct 19, 2016 at 10:49 PM, Amit Sela wrote: > The SparkRunner actual

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Amit Sela
The SparkRunner actually has an embedded Kafka for its unit tests. On Wed, Oct 19, 2016, 20:16 Thomas Weise wrote: > Kafka can be embedded for the integration testing, which should > significantly simplify the setup. > > Here is an example I found: > > https://gist.github.com/fjavieralba/7930018

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Thomas Weise
Kafka can be embedded for the integration testing, which should significantly simplify the setup. Here is an example I found: https://gist.github.com/fjavieralba/7930018 Thanks, Thomas On Wed, Oct 19, 2016 at 9:44 AM, Dan Halperin wrote: > My thoughts: > > * It's worth reading the Beam tes

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Dan Halperin
My thoughts: * It's worth reading the Beam testing document that Jason Kuster wrote! * Beam already has support for "End-to-end" integration tests, of examples (e.g., WordCountIT

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Thomas Weise
+1 those are probably the most used sources. Hadoop FS has a number of different implementations, HDFS is one of them. On Wed, Oct 19, 2016 at 2:55 AM, Amit Sela wrote: > I agree with Aljoscha about Kafka. > > How about having one integration test for BoundedSource and one for > UnboundedSource

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Amit Sela
I agree with Aljoscha about Kafka. How about having one integration test for BoundedSource and one for UnboundedSource ? from apache perspective it makes sense to test this end-to-end on HDFS and Kafka (respectively). On Wed, Oct 19, 2016 at 11:34 AM Aljoscha Krettek wrote: > +Jason, looping hi

Re: [DISCUSS] Sources and Runners

2016-10-19 Thread Aljoscha Krettek
+Jason, looping him in directly because he might have an opinion on what I'm going to say. Should we maybe add integration tests that verify that all runners can correctly read from and write to an external system in a complete Pipeline. At least for Kafka, which seems to be the most used option i

Re: [DISCUSS] Sources and Runners

2016-10-18 Thread Robert Bradshaw
Eventually we'll be able to communicate intent with the runner much more directly via the ProcessContinuation object: https://github.com/apache/incubator-beam/blob/a0f649eaca8d8bd47d22db0ba7150fea1bf07975/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L658 On Tue, Oct 18, 2

Re: [DISCUSS] Sources and Runners

2016-10-18 Thread Jean-Baptiste Onofré
Thanks for the update and summary. Regards JB ⁣​ On Oct 18, 2016, 20:47, at 20:47, Amit Sela wrote: >I wanted to summarize here a couple of important points raised in some >PRs >I was involved with, and while those PRs were about KafkaIO and related >to >the Spark/Direct runners, some important

Re: [DISCUSS] Sources and Runners

2016-10-18 Thread Raghu Angadi
One way I would rephrase the concern from unbounded source developer point of view : - What is the recommended blocking behavior for start() and advance()? E.g. on one extreme should they wait till there is a record? Mostly this will be bad. I am glad at pull/1125