What runner are you using (Flink, Spark, Google Cloud Dataflow, Apex, ...)?

On Wed, May 24, 2017 at 8:09 AM, Ankur Chauhan <[email protected]> wrote:

> Sorry that was an autocorrect error. I meant to ask - what Dataflow runner
> are you using? If you are using Google Cloud Dataflow, then the PubsubIO
> class is not the one doing the reading from the Pub/Sub topic. They provide
> a custom implementation at run time.
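>
> If I remember right, there is an experiments flag that makes the Dataflow
> runner keep the Beam-provided PubsubUnboundedSource instead of swapping in
> its native read - worth double-checking against DataflowRunner for your
> version, but it should be something like:
>
>   --experiments=enable_custom_pubsub_source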
>
> Ankur Chauhan
> Sent from my iPhone
>
> On May 24, 2017, at 07:52, Josh <[email protected]> wrote:
>
> Hi Ankur,
>
> What do you mean by runner address?
> Would you be able to link me to the comment you're referring to?
>
> I am using the PubsubIO.Read class from Beam 2.0.0 as found here:
> https://github.com/apache/beam/blob/release-2.0.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java
>
> Thanks,
> Josh
>
> On Wed, May 24, 2017 at 3:36 PM, Ankur Chauhan <[email protected]> wrote:
>
>> What runner address you using. Google Cloud Dataflow uses a closed-source
>> version of the Pub/Sub reader, as noted in a comment on the Read class.
>>
>> Ankur Chauhan
>> Sent from my iPhone
>>
>> On May 24, 2017, at 04:05, Josh <[email protected]> wrote:
>>
>> Hi all,
>>
>> I'm using PubsubIO.Read to consume a Pubsub stream, and my job then
>> writes the data out to Bigtable. I'm currently seeing a latency of 3-5
>> seconds between the messages being published and being written to Bigtable.
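>>
>> For reference, the job is roughly the following (a simplified sketch of
>> what I'm running - the topic/table names and BigtableOptions values are
>> placeholders, and ParseToMutationsFn is a stand-in for my real
>> DoFn<String, KV<ByteString, Iterable<Mutation>>>):
>>
>>   Pipeline p = Pipeline.create(options);
>>   p.apply("ReadPubsub", PubsubIO.readStrings()
>>           .fromTopic("projects/my-project/topics/my-topic"))
>>    .apply("ToMutations", ParDo.of(new ParseToMutationsFn()))
>>    .apply("WriteBigtable", BigtableIO.write()
>>        .withBigtableOptions(new BigtableOptions.Builder()
>>            .setProjectId("my-project")
>>            .setInstanceId("my-instance")
>>            .build())
>>        .withTableId("my-table"));
>>   p.run();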
>>
>> I want to try and decrease the latency to <1s if possible - does anyone
>> have any tips for doing this?
>>
>> I noticed that there is a PubsubGrpcClient:
>> https://github.com/apache/beam/blob/release-2.0.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubGrpcClient.java
>> However, the PubsubUnboundedSource is initialised with a PubsubJsonClient,
>> so the gRPC client doesn't appear to be used. Is there a way to switch to
>> the gRPC client? Perhaps that would give better performance.
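>>
>> The only way I can see to even try it would be to construct the source
>> directly with PubsubGrpcClient.FACTORY rather than going through
>> PubsubIO.readStrings() - something like the untested snippet below, where
>> I've guessed the constructor arguments from reading the source, so they
>> may well not match 2.0.0 exactly:
>>
>>   p.apply(new PubsubUnboundedSource(
>>       PubsubGrpcClient.FACTORY,
>>       null /* project */,
>>       StaticValueProvider.of(
>>           PubsubClient.topicPathFromName("my-project", "my-topic")),
>>       null /* subscription */,
>>       null /* idAttribute */,
>>       null /* timestampAttribute */,
>>       false /* needsAttributes */));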
>>
>> Also, I am running my job on Dataflow with autoscaling, which has
>> allocated only one n1-standard-4 instance to the job, running at ~50% CPU.
>> Could forcing a higher number of workers help improve latency?
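>>
>> If it turns out more workers would help, I guess I could pin the worker
>> count by disabling autoscaling with the standard Dataflow options (values
>> here just as an example):
>>
>>   --autoscalingAlgorithm=NONE \
>>   --numWorkers=4 \
>>   --workerMachineType=n1-standard-4
>>
>> but I'm not sure that's worthwhile while the single worker is only at
>> ~50% CPU.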
>>
>> Thanks for any advice,
>> Josh
>>
>>
>
