Hi guys,
I just wanted to give a quick update on what happened with this -
after looking into it some more, it appears the extra latency was entirely
caused by the publisher of the messages rather than the Beam job/PubsubIO.
After tweaking the publishers to use smaller batches and adhere to a
'max
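For anyone finding this thread later, a rough sketch of that kind of publisher tuning with the Cloud Pub/Sub Java client - the topic name and thresholds below are illustrative placeholders, not the values actually used:

import com.google.api.gax.batching.BatchingSettings;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.pubsub.v1.TopicName;
import org.threeten.bp.Duration;

public class LowLatencyPublisher {
  public static Publisher create() throws Exception {
    // Flush small batches quickly instead of letting messages sit in the
    // client-side buffer while a large batch fills up.
    BatchingSettings batching = BatchingSettings.newBuilder()
        .setElementCountThreshold(10L)             // send once 10 messages are buffered...
        .setDelayThreshold(Duration.ofMillis(50))  // ...or after 50 ms, whichever comes first
        .build();
    return Publisher.newBuilder(TopicName.of("my-project", "my-topic"))  // placeholder names
        .setBatchingSettings(batching)
        .build();
  }
}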
Hi Raghu,
Thanks for the suggestions and for having a look at the metrics!
I will try the reshuffle and tweaking the publisher batching over the next
couple of days, along with Ankur's suggestions, and will report back if any
of these makes a significant difference.
Best regards,
Josh
> On 24 May 2017, Raghu Angadi wrote:
Josh,
Reshuffle might also be worth trying. To clarify, ~1s end-to-end latency is
not always simple to achieve, given the number of systems and services
involved between the publisher and the eventual sink.
Raghu.
On Wed, May 24, 2017 at 12:32 PM, Raghu Angadi wrote:
> Thanks. I looked at some job system-level metrics. I see minimal latency
> within the Dataflow pipeline itself.
Thanks. I looked at some job system-level metrics. I see minimal latency
within the Dataflow pipeline itself, so you might not see much improvement
from Reshuffle() (maybe ~0.25 seconds).
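For reference, a minimal sketch of where a Reshuffle would slot into such a
pipeline. Reshuffle.viaRandomKey() is from newer Beam SDKs (on 2.0.0 you would
key the elements yourself and use Reshuffle.of()), and the topic name is a
placeholder:

import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

// Given an existing Pipeline p:
PCollection<String> messages =
    p.apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"));

// Redistributes elements across workers and breaks fusion between the read
// and the sink, at the cost of an extra shuffle step (hence the small gain).
PCollection<String> reshuffled = messages.apply(Reshuffle.viaRandomKey());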
Do you have control over the publisher? Publishers might be batching hundreds
of messages, which adds latency. You can try to reduce the batch size.
On Wed, May 24, 2017 at 10:14 AM, Josh wrote:
> Thanks Ankur, that's super helpful! I will give these optimisations a go.
>
> About the "No operations completed" message - there are a few of these in
> the logs (but very few, like 1 an hour or something) - so probably no need
> to scale up Bigtable.
Hi Raghu,
My job ID is 2017-05-24_02_46_42-11524480684503077480 - thanks for taking a
look!
Yes, I'm using BigtableIO for the sink, and I am measuring the end-to-end
latency. It typically takes 3-6 seconds; I would like to get it down to ~1s.
Thanks,
Josh
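For what it's worth, one hypothetical way to instrument this kind of
measurement: have the publisher stamp each message with a send-time attribute
(the "publishTimeMs" name below is made up) and log the difference in a DoFn
just before the BigtableIO sink. Note this measures publish-to-sink latency
rather than the full publish-to-visible-in-BT latency:

import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class LogLatencyFn extends DoFn<PubsubMessage, PubsubMessage> {
  private static final Logger LOG = LoggerFactory.getLogger(LogLatencyFn.class);

  @ProcessElement
  public void process(ProcessContext c) {
    // Assumes the publisher set "publishTimeMs" to System.currentTimeMillis().
    String sentMs = c.element().getAttribute("publishTimeMs");
    if (sentMs != null) {
      LOG.info("Latency before sink: {} ms",
          System.currentTimeMillis() - Long.parseLong(sentMs));
    }
    c.output(c.element());
  }
}

The read would need PubsubIO.readMessagesWithAttributes() for the attribute to
come through to the DoFn.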
On Wed, May 24, 2017 at 6:50 PM,
Josh,
Can you share your job_id? I could take a look. Are you measuring latency
end-to-end (from publish to when it appears in BT)? Are you using BigtableIO
for the sink?
There is no easy way to use more workers when auto-scaling is enabled. It
thinks your backlog and CPU are low enough and does not need more workers.
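(One workaround, sketched below under the assumption that a fixed pool is
acceptable, is to disable autoscaling and pin the worker count via the
Dataflow pipeline options; the count of 5 is an arbitrary placeholder.)

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.NONE);  // no autoscaling
options.setNumWorkers(5);                                        // fixed pool size (placeholder)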
Thanks Ankur, that's super helpful! I will give these optimisations a go.
About the "No operations completed" message - there are a few of these in
the logs (but very few, like 1 an hour or something) - so probably no need
to scale up Bigtable.
I did, however, see a lot of INFO messages saying "Wrote 0 records".
There are two main things to see here:
* In the logs, are there any messages like "No operations completed within the
last 61 seconds. There are still 1 simple operations and 1 complex operations
in progress."? This means you are underscaled on the Bigtable side and would
benefit from increasing the number of Bigtable nodes.
Ah ok - I am using the Dataflow runner. I didn't realise a custom
implementation was provided at runtime...
Any ideas on how to tweak my job to either lower the latency of consuming
from PubSub, or lower the latency of writing to Bigtable?
On Wed, May 24, 2017 at 4:14 PM, Lukasz Cwik wrote:
What runner are you using (Flink, Spark, Google Cloud Dataflow, Apex, ...)?
On Wed, May 24, 2017 at 8:09 AM, Ankur Chauhan wrote:
> Sorry that was an autocorrect error. I meant to ask - what dataflow runner
> are you using? If you are using google cloud dataflow then the PubsubIO
> class is not the one doing the reading from the pubsub topic.
Sorry that was an autocorrect error. I meant to ask - what dataflow runner are
you using? If you are using google cloud dataflow then the PubsubIO class is
not the one doing the reading from the pubsub topic. They provide a custom
implementation at run time.
Ankur Chauhan
Sent from my iPhone
Hi Ankur,
What do you mean by runner address?
Would you be able to link me to the comment you're referring to?
I am using the PubsubIO.Read class from Beam 2.0.0 as found here:
https://github.com/apache/beam/blob/release-2.0.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/i
What runner address you using. Google cloud dataflow uses a closed source
version of the pubsub reader, as noted in a comment on the Read class.
Ankur Chauhan
Sent from my iPhone
> On May 24, 2017, at 04:05, Josh wrote:
>
> Hi all,
>
> I'm using PubsubIO.Read to consume a Pubsub stream, and my job then writes
> the data out to Bigtable.
Hi all,
I'm using PubsubIO.Read to consume a Pubsub stream, and my job then writes
the data out to Bigtable. I'm currently seeing a latency of 3-5 seconds
between the messages being published and being written to Bigtable.
I want to try and decrease the latency to <1s if possible - does anyone have
any suggestions on how to achieve this?
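For context, a minimal sketch of a pipeline shaped like the one described:
PubsubIO.readStrings() into a DoFn producing Bigtable mutations, then
BigtableIO.write(). All names (project, topic, table, column family, row key
scheme) are placeholders, and withProjectId/withInstanceId reflect the
newer-SDK configuration style rather than the exact 2.0.0 API:

import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class PubsubToBigtable {
  static class ToMutationFn extends DoFn<String, KV<ByteString, Iterable<Mutation>>> {
    @ProcessElement
    public void process(ProcessContext c) {
      // Wrap each message payload in a single SetCell mutation.
      Mutation mutation = Mutation.newBuilder()
          .setSetCell(Mutation.SetCell.newBuilder()
              .setFamilyName("cf")                                   // placeholder column family
              .setColumnQualifier(ByteString.copyFromUtf8("payload"))
              .setTimestampMicros(-1)                                // server-assigned timestamp
              .setValue(ByteString.copyFromUtf8(c.element())))
          .build();
      Iterable<Mutation> mutations = Collections.singletonList(mutation);
      c.output(KV.of(ByteString.copyFromUtf8("row-" + c.element().hashCode()), mutations));
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    p.apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"))
        .apply(ParDo.of(new ToMutationFn()))
        .apply(BigtableIO.write()
            .withProjectId("my-project")
            .withInstanceId("my-instance")
            .withTableId("my-table"));
    p.run();
  }
}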