date:20180228

The problem of kafkaIO sdk for data latency

2018-02-28 Thread linrick

Dear all, I am using the kafkaIO sdk in my project (Beam 2.0.0 with Direct runner). With using this sdk, there are a situation about data latency, and the description of situation is in the following. The data come from kafak with a fixed speed: 100 data size/ 1 sec. I create a fixed window wi

Re: BigQueryIO streaming inserts - poor performance with multiple tables

2018-02-28 Thread Chamikara Jayalath

E_NEVER) > .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)); > > where ToDestination is a: > > SerializableFunction, TableDestination> > > which returns a: > > new TableDestination(tableName, "") > > where tableName looks like "m

Re: Running code before pipeline starts

2018-02-28 Thread Lukasz Cwik

You should use a side input and not an empty PCollection that you flatten. Since ReadA --> Flatten --> ParDo ReadB -/ can be equivalently executed as: ReadA --> ParDo ReadB --> ParDo Make sure you access the side input in case a runner evaluates the side input lazily. So your pipeline would look

Running code before pipeline starts

2018-02-28 Thread Andrew Jones

Hi, What is the best way to run code before the pipeline starts? Anything in the `main` function doesn't get called when the pipeline is ran on Dataflow via a template - only the pipeline. If you're familiar with Spark, then I'm thinking of code that might be ran in the driver. Alternatively,