The problem of kafkaIO sdk for data latency

2018-02-28 Thread linrick
Dear all, I am using the kafkaIO sdk in my project (Beam 2.0.0 with Direct runner). With using this sdk, there are a situation about data latency, and the description of situation is in the following. The data come from kafak with a fixed speed: 100 data size/ 1 sec. I create a fixed window

Re: BigQueryIO streaming inserts - poor performance with multiple tables

2018-02-28 Thread Chamikara Jayalath
osition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER) > .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)); > > where ToDestination is a: > > SerializableFunction<ValueInSingleWindow, TableDestination> > > which returns a: > > new TableDestination(tableName, "")

Re: Running code before pipeline starts

2018-02-28 Thread Lukasz Cwik
You should use a side input and not an empty PCollection that you flatten. Since ReadA --> Flatten --> ParDo ReadB -/ can be equivalently executed as: ReadA --> ParDo ReadB --> ParDo Make sure you access the side input in case a runner evaluates the side input lazily. So your pipeline would

Running code before pipeline starts

2018-02-28 Thread Andrew Jones
Hi, What is the best way to run code before the pipeline starts? Anything in the `main` function doesn't get called when the pipeline is ran on Dataflow via a template - only the pipeline. If you're familiar with Spark, then I'm thinking of code that might be ran in the driver. Alternatively,