Re: How to decrease latency when using PubsubIO.Read?

2017-06-02 Thread Josh
Hi guys, I just wanted to give a quick update on what happened with this - after looking into it some more, it appears the extra latency was entirely caused by the publisher of the messages rather than the Beam job/PubsubIO. After tweaking the publishers to use smaller batches and adhere to a 'max
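For anyone finding this thread later: if the publishers use the google-cloud-pubsub Java client (an assumption, the thread never says which client is in use), the batch size and maximum publish delay can be tuned via `BatchingSettings`. A minimal sketch; topic and project names are placeholders:

```java
import com.google.api.gax.batching.BatchingSettings;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.pubsub.v1.TopicName;
import org.threeten.bp.Duration;

public class LowLatencyPublisher {
    public static Publisher create() throws Exception {
        // Flush a batch as soon as it holds 10 messages, 50 KB of data,
        // or 10 ms have elapsed, whichever comes first. Small thresholds
        // trade publish throughput for lower per-message latency.
        BatchingSettings lowLatency = BatchingSettings.newBuilder()
            .setElementCountThreshold(10L)
            .setRequestByteThreshold(50_000L)
            .setDelayThreshold(Duration.ofMillis(10))
            .build();

        // "my-project" / "my-topic" are placeholders, not from the thread.
        return Publisher.newBuilder(TopicName.of("my-project", "my-topic"))
            .setBatchingSettings(lowLatency)
            .build();
    }
}
```

The delay threshold is the 'max latency' knob: it bounds how long a message can sit in a partially filled batch before being sent.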

Re: How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread jofo90
Hi Raghu, Thanks for the suggestions and for having a look at the metrics! I will try the reshuffle and tweaking the publisher batching over the next couple of days, along with Ankur's suggestions, and will report back if any of these make a significant difference. Best regards, Josh > On 24 M

Re: How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread Raghu Angadi
Josh, Reshuffle might also be worth trying. To clarify, ~1s end-to-end is not always simple given the number of systems and services involved between the publisher and the eventual sink. Raghu. On Wed, May 24, 2017 at 12:32 PM, Raghu Angadi wrote: > Thanks. Looked at some job system-level metrics. I see mi
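For readers trying the same thing, a Reshuffle goes between the Pub/Sub read and the rest of the pipeline. A hedged sketch against the Beam Java SDK; the downstream transform names are hypothetical, and `Reshuffle.viaRandomKey()` may not exist in every SDK version (in older SDKs you key the collection yourself and apply `Reshuffle.of()`):

```java
PCollection<PubsubMessage> messages = pipeline.apply("ReadFromPubsub",
    PubsubIO.readMessages().fromTopic("projects/my-project/topics/my-topic"));

// Reshuffle breaks fusion, so the read step is decoupled from the
// (potentially slower) downstream processing and sink writes.
messages
    .apply("BreakFusion", Reshuffle.viaRandomKey())
    .apply("ToBigtableMutations", ParDo.of(new ToMutationsFn())) // hypothetical DoFn
    .apply("WriteToBigtable", bigtableWrite); // a BigtableIO.Write configured elsewhere
```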

Re: How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread Raghu Angadi
Thanks. Looked at some job system-level metrics. I see minimal latency within the Dataflow pipeline itself, so you might not see much improvement from Reshuffle() (maybe ~0.25 seconds). Do you have control over the publisher? Publishers might be batching hundreds of messages, which adds latency. You can try to
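To make the batching cost concrete, a rough back-of-the-envelope model: a batching publisher holds each message until either the batch fills or a flush timer fires, so the worst-case added wait is the smaller of the two. A self-contained sketch (the thresholds and rates are made-up illustrative numbers, not measurements from this job):

```java
public class BatchWait {
    /**
     * Worst-case milliseconds a message waits in the publisher's buffer,
     * given batches of `count` messages, a flush timer of `delayMs`, and
     * an observed publish rate of `msgsPerSec`.
     */
    static double worstCaseWaitMs(long count, double delayMs, double msgsPerSec) {
        double fillTimeMs = (count - 1) * 1000.0 / msgsPerSec; // time to fill the batch
        return Math.min(fillTimeMs, delayMs);                  // first trigger wins
    }

    public static void main(String[] args) {
        // Batches of 100 with a 5 s flush timer at 50 msg/s: ~2 s of extra latency.
        System.out.println(worstCaseWaitMs(100, 5000.0, 50.0)); // 1980.0
        // Batches of 10 with a 10 ms flush timer: the timer caps the wait at 10 ms.
        System.out.println(worstCaseWaitMs(10, 10.0, 50.0));    // 10.0
    }
}
```

With large batches at a modest publish rate, the batch fill time alone can account for most of the 3-6 s Josh reported.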

Re: How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread Raghu Angadi
On Wed, May 24, 2017 at 10:14 AM, Josh wrote: > Thanks Ankur, that's super helpful! I will give these optimisations a go. > > About the "No operations completed" message - there are a few of these in > the logs (but very few, like 1 an hour or something) - so probably no need > to scale up Bigtab

Re: How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread Josh
Hi Raghu, My job ID is 2017-05-24_02_46_42-11524480684503077480 - thanks for taking a look! Yes, I'm using BigtableIO for the sink and I am measuring the end-to-end latency. It typically takes 3-6 seconds, and I would like to get it down to ~1s. Thanks, Josh On Wed, May 24, 2017 at 6:50 PM,

Re: How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread Raghu Angadi
Josh, Can you share your job_id? I could take a look. Are you measuring latency end-to-end (publisher to when it appears on BT)? Are you using BigtableIO for the sink? There is no easy way to use more workers when auto-scaling is enabled. It thinks your backlog and CPU are low enough and does not need

Re: How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread Josh
Thanks Ankur, that's super helpful! I will give these optimisations a go. About the "No operations completed" message - there are a few of these in the logs (but very few, like 1 an hour or something) - so probably no need to scale up Bigtable. I did however see a lot of INFO messages "Wrote 0 rec

Re: How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread Ankur Chauhan
There are two main things to see here: * In the logs, are there any messages like "No operations completed within the last 61 seconds. There are still 1 simple operations and 1 complex operations in progress." This means you are underscaled on the Bigtable side and would benefit from increasin
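If that log message does show up frequently, one way to add Bigtable nodes is via gcloud. The instance and cluster names below are placeholders, and the flag names are from memory, so check `gcloud bigtable clusters update --help` before running:

```shell
# Scale the cluster up; Bigtable rebalances tablets automatically afterwards.
gcloud bigtable clusters update my-cluster \
    --instance=my-instance \
    --num-nodes=6
```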

Re: How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread Josh
Ah ok - I am using the Dataflow runner. I didn't realise the custom implementation was provided at runtime... Any ideas on how to tweak my job to either lower the latency consuming from PubSub or lower the latency writing to Bigtable? On Wed, May 24, 2017 at 4:14 PM, Lukasz Cwik w

Re: How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread Lukasz Cwik
What runner are you using (Flink, Spark, Google Cloud Dataflow, Apex, ...)? On Wed, May 24, 2017 at 8:09 AM, Ankur Chauhan wrote: > Sorry that was an autocorrect error. I meant to ask - what dataflow runner > are you using? If you are using google cloud dataflow then the PubsubIO > class is not

Re: How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread Ankur Chauhan
Sorry that was an autocorrect error. I meant to ask - what dataflow runner are you using? If you are using Google Cloud Dataflow then the PubsubIO class is not the one doing the reading from the pubsub topic. They provide a custom implementation at runtime. Ankur Chauhan Sent from my iPhone

Re: How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread Josh
Hi Ankur, What do you mean by runner address? Would you be able to link me to the comment you're referring to? I am using the PubsubIO.Read class from Beam 2.0.0 as found here: https://github.com/apache/beam/blob/release-2.0.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/i

Re: How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread Ankur Chauhan
What runner address you using. Google Cloud Dataflow uses a closed-source version of the pubsub reader, as noted in a comment on the Read class. Ankur Chauhan Sent from my iPhone > On May 24, 2017, at 04:05, Josh wrote: > > Hi all, > > I'm using PubsubIO.Read to consume a Pubsub stream, and my jo

How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread Josh
Hi all, I'm using PubsubIO.Read to consume a Pubsub stream, and my job then writes the data out to Bigtable. I'm currently seeing a latency of 3-5 seconds between the messages being published and being written to Bigtable. I want to try and decrease the latency to <1s if possible - does anyone ha