Re: Extremely Slow DirectRunner

2021-05-14 Thread Evan Galpin
Any further thoughts here? Or tips on profiling Beam DirectRunner?

Thanks,
Evan

On Wed, May 12, 2021 at 6:22 PM Evan Galpin  wrote:

> Ok gotcha. In my tests, all SDK versions 2.25.0 and higher exhibit slow
> behaviour regardless of use_deprecated_read. Not sure if that points to
> something different, then.
>
> Thanks,
> Evan
>
> On Wed, May 12, 2021 at 18:16 Steve Niemitz  wrote:
>
>> I think it was only broken in 2.29.
>>
>> On Wed, May 12, 2021 at 5:53 PM Evan Galpin 
>> wrote:
>>
>>> Ah ok, thanks for that. Do you mean use_deprecated_read is broken
>>> specifically in 2.29.0 (regression) or broken in all versions up to and
>>> including 2.29.0 (i.e. never worked)?
>>>
>>> Thanks,
>>> Evan
>>>
>>> On Wed, May 12, 2021 at 17:12 Steve Niemitz  wrote:
>>>
 Yeah, sorry my email was confusing.  use_deprecated_read is broken on
 the DirectRunner in 2.29.

 The behavior you describe is exactly the behavior I ran into as well
 when reading from pubsub with the new read method.  I believe that soon the
 default is being reverted back to the old read method, not using SDFs,
 which will fix your performance issue.

 On Wed, May 12, 2021 at 4:40 PM Boyuan Zhang 
 wrote:

> Hi Evan,
>
> It seems like the slow step is not the read that use_deprecated_read
> targets. Would you like to share your pipeline code if possible?
>
> On Wed, May 12, 2021 at 1:35 PM Evan Galpin 
> wrote:
>
>> I just tried with v2.29.0 and use_deprecated_read but unfortunately I
>> observed slow behavior again. Is it possible that use_deprecated_read is
>> broken in 2.29.0 as well?
>>
>> Thanks,
>> Evan
>>
>> On Wed, May 12, 2021 at 3:21 PM Steve Niemitz 
>> wrote:
>>
>>> oops sorry I was off by 10...I meant 2.29 not 2.19.
>>>
>>> On Wed, May 12, 2021 at 2:55 PM Evan Galpin 
>>> wrote:
>>>
 Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the
 "faster" behavior, as did v2.23.0. But that "fast" behavior stopped at
 v2.25.0 (for my use case at least) regardless of use_deprecated_read
 setting.

 Thanks,
 Evan


 On Wed, May 12, 2021 at 2:47 PM Steve Niemitz 
 wrote:

> use_deprecated_read was broken in 2.19 on the direct runner and
> didn't do anything. [1]  I don't think the fix is in 2.20 either, but 
> will
> be in 2.21.
>
> [1] https://github.com/apache/beam/pull/14469
>
> On Wed, May 12, 2021 at 1:41 PM Evan Galpin 
> wrote:
>
>> I forgot to also mention that in all tests I was setting
>> --experiments=use_deprecated_read
>>
>> Thanks,
>> Evan
>>
>> On Wed, May 12, 2021 at 1:39 PM Evan Galpin <
>> evan.gal...@gmail.com> wrote:
>>
>>> Hmm, I think I spoke too soon. I'm still seeing an issue of
>>> overall DirectRunner slowness, not just pubsub. I have a pipeline 
>>> like so:
>>>
>>> Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()
>>> |  FileIO.readMatches()  |  Read file contents  |  etc
>>>
>>> I have temporarily set up a transform between each step to log
>>> what's going on and illustrate timing issues.  I ran a series of 
>>> tests
>>> changing only the SDK version each time since I hadn't noticed this
>>> performance issue with 2.19.0 (effectively git-bisect). Before each 
>>> test, I
>>> seeded the pubsub subscription with the exact same contents.
>>>
>>> SDK version 2.25.0 (I had a build issue with 2.24.0 that I
>>> couldn't seem to resolve) and onward show a significant slowdown.
>>>
>>> Here is a snippet of logging from v2.25.0:
>>>
>>> *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>>> processElement
>>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>>> May 12, 2021 11:16:59 A.M.
>>> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
>>> INFO: Matched 2 files for pattern
>>> gs://my-bucket/my-dir/5004728247517184/**
>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>> processElement
>>> INFO: Got ReadableFile: my-file1.json
>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>> processElement
>>> INFO: Got ReadableFile: my-file2.json
>>> May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4
>>> processElement
>>> INFO: Got file contents for document_id my-file1.json
>>> *May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
>>> processElement
>>> INFO: Got file contents for document_id my-file2.json
>>>
>>> Note 

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
Ok gotcha. In my tests, all SDK versions 2.25.0 and higher exhibit slow
behaviour regardless of use_deprecated_read. Not sure if that points to
something different, then.

Thanks,
Evan

On Wed, May 12, 2021 at 18:16 Steve Niemitz  wrote:

> I think it was only broken in 2.29.
>
> On Wed, May 12, 2021 at 5:53 PM Evan Galpin  wrote:
>
>> Ah ok, thanks for that. Do you mean use_deprecated_read is broken
>> specifically in 2.29.0 (regression) or broken in all versions up to and
>> including 2.29.0 (i.e. never worked)?
>>
>> Thanks,
>> Evan
>>
>> On Wed, May 12, 2021 at 17:12 Steve Niemitz  wrote:
>>
>>> Yeah, sorry my email was confusing.  use_deprecated_read is broken on
>>> the DirectRunner in 2.29.
>>>
>>> The behavior you describe is exactly the behavior I ran into as well
>>> when reading from pubsub with the new read method.  I believe that soon the
>>> default is being reverted back to the old read method, not using SDFs,
>>> which will fix your performance issue.
>>>
>>> On Wed, May 12, 2021 at 4:40 PM Boyuan Zhang  wrote:
>>>
 Hi Evan,

 It seems like the slow step is not the read that use_deprecated_read
 targets. Would you like to share your pipeline code if possible?

 On Wed, May 12, 2021 at 1:35 PM Evan Galpin 
 wrote:

> I just tried with v2.29.0 and use_deprecated_read but unfortunately I
> observed slow behavior again. Is it possible that use_deprecated_read is
> broken in 2.29.0 as well?
>
> Thanks,
> Evan
>
> On Wed, May 12, 2021 at 3:21 PM Steve Niemitz 
> wrote:
>
>> oops sorry I was off by 10...I meant 2.29 not 2.19.
>>
>> On Wed, May 12, 2021 at 2:55 PM Evan Galpin 
>> wrote:
>>
>>> Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the
>>> "faster" behavior, as did v2.23.0. But that "fast" behavior stopped at
>>> v2.25.0 (for my use case at least) regardless of use_deprecated_read
>>> setting.
>>>
>>> Thanks,
>>> Evan
>>>
>>>
>>> On Wed, May 12, 2021 at 2:47 PM Steve Niemitz 
>>> wrote:
>>>
 use_deprecated_read was broken in 2.19 on the direct runner and
 didn't do anything. [1]  I don't think the fix is in 2.20 either, but 
 will
 be in 2.21.

 [1] https://github.com/apache/beam/pull/14469

 On Wed, May 12, 2021 at 1:41 PM Evan Galpin 
 wrote:

> I forgot to also mention that in all tests I was setting
> --experiments=use_deprecated_read
>
> Thanks,
> Evan
>
> On Wed, May 12, 2021 at 1:39 PM Evan Galpin 
> wrote:
>
>> Hmm, I think I spoke too soon. I'm still seeing an issue of
>> overall DirectRunner slowness, not just pubsub. I have a pipeline 
>> like so:
>>
>> Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()
>> |  FileIO.readMatches()  |  Read file contents  |  etc
>>
>> I have temporarily set up a transform between each step to log
>> what's going on and illustrate timing issues.  I ran a series of 
>> tests
>> changing only the SDK version each time since I hadn't noticed this
>> performance issue with 2.19.0 (effectively git-bisect). Before each 
>> test, I
>> seeded the pubsub subscription with the exact same contents.
>>
>> SDK version 2.25.0 (I had a build issue with 2.24.0 that I
>> couldn't seem to resolve) and onward show a significant slowdown.
>>
>> Here is a snippet of logging from v2.25.0:
>>
>> *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>> processElement
>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>> May 12, 2021 11:16:59 A.M.
>> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
>> INFO: Matched 2 files for pattern
>> gs://my-bucket/my-dir/5004728247517184/**
>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>> processElement
>> INFO: Got ReadableFile: my-file1.json
>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>> processElement
>> INFO: Got ReadableFile: my-file2.json
>> May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4
>> processElement
>> INFO: Got file contents for document_id my-file1.json
>> *May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
>> processElement
>> INFO: Got file contents for document_id my-file2.json
>>
>> Note that end-to-end, these steps took about *13 minutes*. With
>> SDK 2.23.0 and identical user code, the same section of the pipeline 
>> took *2
>> seconds*:
>>
>> *May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>> processElement

Re: Extremely Slow DirectRunner

2021-05-12 Thread Steve Niemitz
I think it was only broken in 2.29.

On Wed, May 12, 2021 at 5:53 PM Evan Galpin  wrote:

> Ah ok, thanks for that. Do you mean use_deprecated_read is broken
> specifically in 2.29.0 (regression) or broken in all versions up to and
> including 2.29.0 (i.e. never worked)?
>
> Thanks,
> Evan
>
> On Wed, May 12, 2021 at 17:12 Steve Niemitz  wrote:
>
>> Yeah, sorry my email was confusing.  use_deprecated_read is broken on
>> the DirectRunner in 2.29.
>>
>> The behavior you describe is exactly the behavior I ran into as well when
>> reading from pubsub with the new read method.  I believe that soon the
>> default is being reverted back to the old read method, not using SDFs,
>> which will fix your performance issue.
>>
>> On Wed, May 12, 2021 at 4:40 PM Boyuan Zhang  wrote:
>>
>>> Hi Evan,
>>>
>>> It seems like the slow step is not the read that use_deprecated_read
>>> targets. Would you like to share your pipeline code if possible?
>>>
>>> On Wed, May 12, 2021 at 1:35 PM Evan Galpin 
>>> wrote:
>>>
 I just tried with v2.29.0 and use_deprecated_read but unfortunately I
 observed slow behavior again. Is it possible that use_deprecated_read is
 broken in 2.29.0 as well?

 Thanks,
 Evan

 On Wed, May 12, 2021 at 3:21 PM Steve Niemitz 
 wrote:

> oops sorry I was off by 10...I meant 2.29 not 2.19.
>
> On Wed, May 12, 2021 at 2:55 PM Evan Galpin 
> wrote:
>
>> Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the
>> "faster" behavior, as did v2.23.0. But that "fast" behavior stopped at
>> v2.25.0 (for my use case at least) regardless of use_deprecated_read
>> setting.
>>
>> Thanks,
>> Evan
>>
>>
>> On Wed, May 12, 2021 at 2:47 PM Steve Niemitz 
>> wrote:
>>
>>> use_deprecated_read was broken in 2.19 on the direct runner and
>>> didn't do anything. [1]  I don't think the fix is in 2.20 either, but 
>>> will
>>> be in 2.21.
>>>
>>> [1] https://github.com/apache/beam/pull/14469
>>>
>>> On Wed, May 12, 2021 at 1:41 PM Evan Galpin 
>>> wrote:
>>>
 I forgot to also mention that in all tests I was setting
 --experiments=use_deprecated_read

 Thanks,
 Evan

 On Wed, May 12, 2021 at 1:39 PM Evan Galpin 
 wrote:

> Hmm, I think I spoke too soon. I'm still seeing an issue of
> overall DirectRunner slowness, not just pubsub. I have a pipeline 
> like so:
>
> Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()
> |  FileIO.readMatches()  |  Read file contents  |  etc
>
> I have temporarily set up a transform between each step to log
> what's going on and illustrate timing issues.  I ran a series of tests
> changing only the SDK version each time since I hadn't noticed this
> performance issue with 2.19.0 (effectively git-bisect). Before each 
> test, I
> seeded the pubsub subscription with the exact same contents.
>
> SDK version 2.25.0 (I had a build issue with 2.24.0 that I
> couldn't seem to resolve) and onward show a significant slowdown.
>
> Here is a snippet of logging from v2.25.0:
>
> *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
> processElement
> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
> May 12, 2021 11:16:59 A.M.
> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
> INFO: Matched 2 files for pattern
> gs://my-bucket/my-dir/5004728247517184/**
> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
> processElement
> INFO: Got ReadableFile: my-file1.json
> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
> processElement
> INFO: Got ReadableFile: my-file2.json
> May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4
> processElement
> INFO: Got file contents for document_id my-file1.json
> *May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
> processElement
> INFO: Got file contents for document_id my-file2.json
>
> Note that end-to-end, these steps took about *13 minutes*. With
> SDK 2.23.0 and identical user code, the same section of the pipeline 
> took *2
> seconds*:
>
> *May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1
> processElement
> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
> May 12, 2021 11:03:40 A.M.
> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
> INFO: Matched 2 files for pattern
> gs://my-bucket/my-dir/5004728247517184/**
> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
> processElement
> 

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
Ah ok, thanks for that. Do you mean use_deprecated_read is broken
specifically in 2.29.0 (regression) or broken in all versions up to and
including 2.29.0 (i.e. never worked)?

Thanks,
Evan

On Wed, May 12, 2021 at 17:12 Steve Niemitz  wrote:

> Yeah, sorry my email was confusing.  use_deprecated_read is broken on the
> DirectRunner in 2.29.
>
> The behavior you describe is exactly the behavior I ran into as well when
> reading from pubsub with the new read method.  I believe that soon the
> default is being reverted back to the old read method, not using SDFs,
> which will fix your performance issue.
>
> On Wed, May 12, 2021 at 4:40 PM Boyuan Zhang  wrote:
>
>> Hi Evan,
>>
>> It seems like the slow step is not the read that use_deprecated_read
>> targets. Would you like to share your pipeline code if possible?
>>
>> On Wed, May 12, 2021 at 1:35 PM Evan Galpin 
>> wrote:
>>
>>> I just tried with v2.29.0 and use_deprecated_read but unfortunately I
>>> observed slow behavior again. Is it possible that use_deprecated_read is
>>> broken in 2.29.0 as well?
>>>
>>> Thanks,
>>> Evan
>>>
>>> On Wed, May 12, 2021 at 3:21 PM Steve Niemitz 
>>> wrote:
>>>
 oops sorry I was off by 10...I meant 2.29 not 2.19.

 On Wed, May 12, 2021 at 2:55 PM Evan Galpin 
 wrote:

> Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the "faster"
> behavior, as did v2.23.0. But that "fast" behavior stopped at v2.25.0 (for
> my use case at least) regardless of use_deprecated_read setting.
>
> Thanks,
> Evan
>
>
> On Wed, May 12, 2021 at 2:47 PM Steve Niemitz 
> wrote:
>
>> use_deprecated_read was broken in 2.19 on the direct runner and
>> didn't do anything. [1]  I don't think the fix is in 2.20 either, but 
>> will
>> be in 2.21.
>>
>> [1] https://github.com/apache/beam/pull/14469
>>
>> On Wed, May 12, 2021 at 1:41 PM Evan Galpin 
>> wrote:
>>
>>> I forgot to also mention that in all tests I was setting
>>> --experiments=use_deprecated_read
>>>
>>> Thanks,
>>> Evan
>>>
>>> On Wed, May 12, 2021 at 1:39 PM Evan Galpin 
>>> wrote:
>>>
 Hmm, I think I spoke too soon. I'm still seeing an issue of overall
 DirectRunner slowness, not just pubsub. I have a pipeline like so:

 Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()  |
 FileIO.readMatches()  |  Read file contents  |  etc

 I have temporarily set up a transform between each step to log
 what's going on and illustrate timing issues.  I ran a series of tests
 changing only the SDK version each time since I hadn't noticed this
 performance issue with 2.19.0 (effectively git-bisect). Before each 
 test, I
 seeded the pubsub subscription with the exact same contents.

 SDK version 2.25.0 (I had a build issue with 2.24.0 that I couldn't
 seem to resolve) and onward show a significant slowdown.

 Here is a snippet of logging from v2.25.0:

 *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
 processElement
 INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
 May 12, 2021 11:16:59 A.M.
 org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
 INFO: Matched 2 files for pattern
 gs://my-bucket/my-dir/5004728247517184/**
 May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
 processElement
 INFO: Got ReadableFile: my-file1.json
 May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
 processElement
 INFO: Got ReadableFile: my-file2.json
 May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4
 processElement
 INFO: Got file contents for document_id my-file1.json
 *May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
 processElement
 INFO: Got file contents for document_id my-file2.json

 Note that end-to-end, these steps took about *13 minutes*. With
 SDK 2.23.0 and identical user code, the same section of the pipeline 
 took *2
 seconds*:

 *May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1
 processElement
 INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
 May 12, 2021 11:03:40 A.M.
 org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
 INFO: Matched 2 files for pattern
 gs://my-bucket/my-dir/5004728247517184/**
 May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
 processElement
 INFO: Got ReadableFile: my-file1.json
 May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
 processElement
 INFO: Got ReadableFile: my-file2.json
 May 12, 2021 11:03:41 A.M. 

Re: Extremely Slow DirectRunner

2021-05-12 Thread Steve Niemitz
Yeah, sorry my email was confusing.  use_deprecated_read is broken on the
DirectRunner in 2.29.

The behavior you describe is exactly the behavior I ran into as well when
reading from pubsub with the new read method.  I believe that soon the
default is being reverted back to the old read method, not using SDFs,
which will fix your performance issue.

On Wed, May 12, 2021 at 4:40 PM Boyuan Zhang  wrote:

> Hi Evan,
>
> It seems like the slow step is not the read that use_deprecated_read
> targets. Would you like to share your pipeline code if possible?
>
> On Wed, May 12, 2021 at 1:35 PM Evan Galpin  wrote:
>
>> I just tried with v2.29.0 and use_deprecated_read but unfortunately I
>> observed slow behavior again. Is it possible that use_deprecated_read is
>> broken in 2.29.0 as well?
>>
>> Thanks,
>> Evan
>>
>> On Wed, May 12, 2021 at 3:21 PM Steve Niemitz 
>> wrote:
>>
>>> oops sorry I was off by 10...I meant 2.29 not 2.19.
>>>
>>> On Wed, May 12, 2021 at 2:55 PM Evan Galpin 
>>> wrote:
>>>
 Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the "faster"
 behavior, as did v2.23.0. But that "fast" behavior stopped at v2.25.0 (for
 my use case at least) regardless of use_deprecated_read setting.

 Thanks,
 Evan


 On Wed, May 12, 2021 at 2:47 PM Steve Niemitz 
 wrote:

> use_deprecated_read was broken in 2.19 on the direct runner and didn't
> do anything. [1]  I don't think the fix is in 2.20 either, but will be in
> 2.21.
>
> [1] https://github.com/apache/beam/pull/14469
>
> On Wed, May 12, 2021 at 1:41 PM Evan Galpin 
> wrote:
>
>> I forgot to also mention that in all tests I was setting
>> --experiments=use_deprecated_read
>>
>> Thanks,
>> Evan
>>
>> On Wed, May 12, 2021 at 1:39 PM Evan Galpin 
>> wrote:
>>
>>> Hmm, I think I spoke too soon. I'm still seeing an issue of overall
>>> DirectRunner slowness, not just pubsub. I have a pipeline like so:
>>>
>>> Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()  |
>>> FileIO.readMatches()  |  Read file contents  |  etc
>>>
>>> I have temporarily set up a transform between each step to log
>>> what's going on and illustrate timing issues.  I ran a series of tests
>>> changing only the SDK version each time since I hadn't noticed this
>>> performance issue with 2.19.0 (effectively git-bisect). Before each 
>>> test, I
>>> seeded the pubsub subscription with the exact same contents.
>>>
>>> SDK version 2.25.0 (I had a build issue with 2.24.0 that I couldn't
>>> seem to resolve) and onward show a significant slowdown.
>>>
>>> Here is a snippet of logging from v2.25.0:
>>>
>>> *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>>> processElement
>>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>>> May 12, 2021 11:16:59 A.M.
>>> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
>>> INFO: Matched 2 files for pattern
>>> gs://my-bucket/my-dir/5004728247517184/**
>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>> processElement
>>> INFO: Got ReadableFile: my-file1.json
>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>> processElement
>>> INFO: Got ReadableFile: my-file2.json
>>> May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4
>>> processElement
>>> INFO: Got file contents for document_id my-file1.json
>>> *May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
>>> processElement
>>> INFO: Got file contents for document_id my-file2.json
>>>
>>> Note that end-to-end, these steps took about *13 minutes*. With SDK
>>> 2.23.0 and identical user code, the same section of the pipeline took *2
>>> seconds*:
>>>
>>> *May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>>> processElement
>>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>>> May 12, 2021 11:03:40 A.M.
>>> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
>>> INFO: Matched 2 files for pattern
>>> gs://my-bucket/my-dir/5004728247517184/**
>>> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>> processElement
>>> INFO: Got ReadableFile: my-file1.json
>>> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>> processElement
>>> INFO: Got ReadableFile: my-file2.json
>>> May 12, 2021 11:03:41 A.M. com.myOrg.myPipeline.PipelineLeg$4
>>> processElement
>>> INFO: Got file contents for document_id my-file1.json
>>> *May 12, 2021 11:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4
>>> processElement
>>> INFO: Got file contents for document_id my-file2.json
>>>
>>> Any thoughts on what could be causing this?
>>>
>>> Thanks,
>>> Evan
>>>
>>> On Wed, May 12, 2021 at 

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
I'd be happy to share what I can. The applicable portion is this `expand`
method of a PTransform (which does nothing more complex than group these
other transforms together for re-use). The input to this PTransform is
pubsub message bodies as strings. I'll paste it as plain-text.

@Override
public PCollection<KV<String, String>> expand(PCollection<String> input) {
  return input
      // Turn each pubsub message body (an OCN) into a GCS wildcard path.
      .apply(
          "Convert OCNs to wildcard paths",
          ParDo.of(new BlobPathToRootPath()))
      .apply(FileIO.matchAll().withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW))
      .apply(FileIO.readMatches().withCompression(Compression.UNCOMPRESSED))
      // Key each file's contents by a document id derived from its GCS path.
      .apply(
          MapElements
              .into(kvs(strings(), strings()))
              .via(
                  (ReadableFile f) -> {
                    try {
                      String[] blobUri =
                          f.getMetadata().resourceId().toString().split("/");
                      String doc_id = blobUri[4];
                      return KV.of(doc_id, f.readFullyAsUTF8String());
                    } catch (IOException ex) {
                      throw new RuntimeException("Failed to read the file", ex);
                    }
                  }));
}
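
For context, that composite is applied right after the Pub/Sub read; a rough
sketch of the wiring (the transform class name, subscription path, and
downstream DoFn below are placeholders, not the real names):

Pipeline p = Pipeline.create(options);
p.apply("Read pubsub", PubsubIO.readStrings()
        .fromSubscription("projects/my-project/subscriptions/my-sub"))
 .apply("Match and read files", new MatchAndReadFiles())        // expand() shown above
 .apply("Process contents", ParDo.of(new ProcessContentsFn())); // placeholder DoFn
p.run();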

Thanks,
Evan

On Wed, May 12, 2021 at 4:40 PM Boyuan Zhang  wrote:

> Hi Evan,
>
> It seems like the slow step is not the read that use_deprecated_read
> targets. Would you like to share your pipeline code if possible?
>
> On Wed, May 12, 2021 at 1:35 PM Evan Galpin  wrote:
>
>> I just tried with v2.29.0 and use_deprecated_read but unfortunately I
>> observed slow behavior again. Is it possible that use_deprecated_read is
>> broken in 2.29.0 as well?
>>
>> Thanks,
>> Evan
>>
>> On Wed, May 12, 2021 at 3:21 PM Steve Niemitz 
>> wrote:
>>
>>> oops sorry I was off by 10...I meant 2.29 not 2.19.
>>>
>>> On Wed, May 12, 2021 at 2:55 PM Evan Galpin 
>>> wrote:
>>>
 Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the "faster"
 behavior, as did v2.23.0. But that "fast" behavior stopped at v2.25.0 (for
 my use case at least) regardless of use_deprecated_read setting.

 Thanks,
 Evan


 On Wed, May 12, 2021 at 2:47 PM Steve Niemitz 
 wrote:

> use_deprecated_read was broken in 2.19 on the direct runner and didn't
> do anything. [1]  I don't think the fix is in 2.20 either, but will be in
> 2.21.
>
> [1] https://github.com/apache/beam/pull/14469
>
> On Wed, May 12, 2021 at 1:41 PM Evan Galpin 
> wrote:
>
>> I forgot to also mention that in all tests I was setting
>> --experiments=use_deprecated_read
>>
>> Thanks,
>> Evan
>>
>> On Wed, May 12, 2021 at 1:39 PM Evan Galpin 
>> wrote:
>>
>>> Hmm, I think I spoke too soon. I'm still seeing an issue of overall
>>> DirectRunner slowness, not just pubsub. I have a pipeline like so:
>>>
>>> Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()  |
>>> FileIO.readMatches()  |  Read file contents  |  etc
>>>
>>> I have temporarily set up a transform between each step to log
>>> what's going on and illustrate timing issues.  I ran a series of tests
>>> changing only the SDK version each time since I hadn't noticed this
>>> performance issue with 2.19.0 (effectively git-bisect). Before each 
>>> test, I
>>> seeded the pubsub subscription with the exact same contents.
>>>
>>> SDK version 2.25.0 (I had a build issue with 2.24.0 that I couldn't
>>> seem to resolve) and onward show a significant slowdown.
>>>
>>> Here is a snippet of logging from v2.25.0:
>>>
>>> *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>>> processElement
>>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>>> May 12, 2021 11:16:59 A.M.
>>> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
>>> INFO: Matched 2 files for pattern
>>> gs://my-bucket/my-dir/5004728247517184/**
>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>> processElement
>>> INFO: Got ReadableFile: my-file1.json
>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>> processElement
>>> INFO: Got ReadableFile: my-file2.json
>>> May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4
>>> processElement
>>> INFO: Got file contents for document_id my-file1.json
>>> *May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
>>> processElement
>>> INFO: Got file contents for document_id my-file2.json
>>>
>>> Note that end-to-end, these steps took about *13 minutes*. With SDK
>>> 2.23.0 and identical user code, the same section of the pipeline took *2
>>> seconds*:
>>>
>>> *May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>>> 

Re: Extremely Slow DirectRunner

2021-05-12 Thread Boyuan Zhang
Hi Evan,

It seems like the slow step is not the read that use_deprecated_read
targets. Would you like to share your pipeline code if possible?

On Wed, May 12, 2021 at 1:35 PM Evan Galpin  wrote:

> I just tried with v2.29.0 and use_deprecated_read but unfortunately I
> observed slow behavior again. Is it possible that use_deprecated_read is
> broken in 2.29.0 as well?
>
> Thanks,
> Evan
>
> On Wed, May 12, 2021 at 3:21 PM Steve Niemitz  wrote:
>
>> oops sorry I was off by 10...I meant 2.29 not 2.19.
>>
>> On Wed, May 12, 2021 at 2:55 PM Evan Galpin 
>> wrote:
>>
>>> Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the "faster"
>>> behavior, as did v2.23.0. But that "fast" behavior stopped at v2.25.0 (for
>>> my use case at least) regardless of use_deprecated_read setting.
>>>
>>> Thanks,
>>> Evan
>>>
>>>
>>> On Wed, May 12, 2021 at 2:47 PM Steve Niemitz 
>>> wrote:
>>>
 use_deprecated_read was broken in 2.19 on the direct runner and didn't
 do anything. [1]  I don't think the fix is in 2.20 either, but will be in
 2.21.

 [1] https://github.com/apache/beam/pull/14469

 On Wed, May 12, 2021 at 1:41 PM Evan Galpin 
 wrote:

> I forgot to also mention that in all tests I was setting
> --experiments=use_deprecated_read
>
> Thanks,
> Evan
>
> On Wed, May 12, 2021 at 1:39 PM Evan Galpin 
> wrote:
>
>> Hmm, I think I spoke too soon. I'm still seeing an issue of overall
>> DirectRunner slowness, not just pubsub. I have a pipeline like so:
>>
>> Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()  |
>> FileIO.readMatches()  |  Read file contents  |  etc
>>
>> I have temporarily set up a transform between each step to log what's
>> going on and illustrate timing issues.  I ran a series of tests changing
>> only the SDK version each time since I hadn't noticed this performance
>> issue with 2.19.0 (effectively git-bisect). Before each test, I seeded 
>> the
>> pubsub subscription with the exact same contents.
>>
>> SDK version 2.25.0 (I had a build issue with 2.24.0 that I couldn't
>> seem to resolve) and onward show a significant slowdown.
>>
>> Here is a snippet of logging from v2.25.0:
>>
>> *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>> processElement
>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>> May 12, 2021 11:16:59 A.M.
>> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
>> INFO: Matched 2 files for pattern
>> gs://my-bucket/my-dir/5004728247517184/**
>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>> processElement
>> INFO: Got ReadableFile: my-file1.json
>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>> processElement
>> INFO: Got ReadableFile: my-file2.json
>> May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4
>> processElement
>> INFO: Got file contents for document_id my-file1.json
>> *May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
>> processElement
>> INFO: Got file contents for document_id my-file2.json
>>
>> Note that end-to-end, these steps took about *13 minutes*. With SDK
>> 2.23.0 and identical user code, the same section of the pipeline took *2
>> seconds*:
>>
>> *May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>> processElement
>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>> May 12, 2021 11:03:40 A.M.
>> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
>> INFO: Matched 2 files for pattern
>> gs://my-bucket/my-dir/5004728247517184/**
>> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
>> processElement
>> INFO: Got ReadableFile: my-file1.json
>> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
>> processElement
>> INFO: Got ReadableFile: my-file2.json
>> May 12, 2021 11:03:41 A.M. com.myOrg.myPipeline.PipelineLeg$4
>> processElement
>> INFO: Got file contents for document_id my-file1.json
>> *May 12, 2021 11:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4
>> processElement
>> INFO: Got file contents for document_id my-file2.json
>>
>> Any thoughts on what could be causing this?
>>
>> Thanks,
>> Evan
>>
>> On Wed, May 12, 2021 at 9:53 AM Evan Galpin 
>> wrote:
>>
>>>
>>>
>>> On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang 
>>> wrote:
>>>
 Hi Evan,

 What do you mean by startup delay? Is it the time from when you start
 the pipeline to the time that you notice the first output record from
 PubSub?

>>>
>>> Yes that's what I meant, the seemingly idle system waiting for
>>> pubsub output despite data being in the subscription at pipeline start
>>> time.
>>>
>>> On 

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
I just tried with v2.29.0 and use_deprecated_read but unfortunately I
observed slow behavior again. Is it possible that use_deprecated_read is
broken in 2.29.0 as well?

Thanks,
Evan

On Wed, May 12, 2021 at 3:21 PM Steve Niemitz  wrote:

> oops sorry I was off by 10...I meant 2.29 not 2.19.
>
> On Wed, May 12, 2021 at 2:55 PM Evan Galpin  wrote:
>
>> Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the "faster"
>> behavior, as did v2.23.0. But that "fast" behavior stopped at v2.25.0 (for
>> my use case at least) regardless of use_deprecated_read setting.
>>
>> Thanks,
>> Evan
>>
>>
>> On Wed, May 12, 2021 at 2:47 PM Steve Niemitz 
>> wrote:
>>
>>> use_deprecated_read was broken in 2.19 on the direct runner and didn't
>>> do anything. [1]  I don't think the fix is in 2.20 either, but will be in
>>> 2.21.
>>>
>>> [1] https://github.com/apache/beam/pull/14469
>>>
>>> On Wed, May 12, 2021 at 1:41 PM Evan Galpin 
>>> wrote:
>>>
 I forgot to also mention that in all tests I was setting
 --experiments=use_deprecated_read

 Thanks,
 Evan

 On Wed, May 12, 2021 at 1:39 PM Evan Galpin 
 wrote:

> Hmm, I think I spoke too soon. I'm still seeing an issue of overall
> DirectRunner slowness, not just pubsub. I have a pipeline like so:
>
> Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()  |
> FileIO.readMatches()  |  Read file contents  |  etc
>
> I have temporarily set up a transform between each step to log what's
> going on and illustrate timing issues.  I ran a series of tests changing
> only the SDK version each time since I hadn't noticed this performance
> issue with 2.19.0 (effectively git-bisect). Before each test, I seeded the
> pubsub subscription with the exact same contents.
>
> SDK version 2.25.0 (I had a build issue with 2.24.0 that I couldn't
> seem to resolve) and onward show a significant slowdown.
>
> Here is a snippet of logging from v2.25.0:
>
> *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
> processElement
> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
> May 12, 2021 11:16:59 A.M.
> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
> INFO: Matched 2 files for pattern
> gs://my-bucket/my-dir/5004728247517184/**
> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
> processElement
> INFO: Got ReadableFile: my-file1.json
> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
> processElement
> INFO: Got ReadableFile: my-file2.json
> May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4
> processElement
> INFO: Got file contents for document_id my-file1.json
> *May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
> processElement
> INFO: Got file contents for document_id my-file2.json
>
> Note that end-to-end, these steps took about *13 minutes*. With SDK
> 2.23.0 and identical user code, the same section of the pipeline took *2
> seconds*:
>
> *May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1
> processElement
> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
> May 12, 2021 11:03:40 A.M.
> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
> INFO: Matched 2 files for pattern
> gs://my-bucket/my-dir/5004728247517184/**
> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
> processElement
> INFO: Got ReadableFile: my-file1.json
> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
> processElement
> INFO: Got ReadableFile: my-file2.json
> May 12, 2021 11:03:41 A.M. com.myOrg.myPipeline.PipelineLeg$4
> processElement
> INFO: Got file contents for document_id my-file1.json
> *May 12, 2021 11:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4
> processElement
> INFO: Got file contents for document_id my-file2.json
>
> Any thoughts on what could be causing this?
>
> Thanks,
> Evan
>
> On Wed, May 12, 2021 at 9:53 AM Evan Galpin 
> wrote:
>
>>
>>
>> On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang 
>> wrote:
>>
>>> Hi Evan,
>>>
>>> What do you mean by startup delay? Is it the time from when you start
>>> the pipeline to the time that you notice the first output record from
>>> PubSub?
>>>
>>
>> Yes that's what I meant, the seemingly idle system waiting for pubsub
>> output despite data being in the subscription at pipeline start time.
>>
>> On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía 
>>> wrote:
>>>
 Can you try running direct runner with the option
 `--experiments=use_deprecated_read`

>>>
>> This seems to work for me, thanks for this! 
>>
>>
 Seems like an instance of
 

Re: Extremely Slow DirectRunner

2021-05-12 Thread Steve Niemitz
oops sorry I was off by 10...I meant 2.29 not 2.19.

On Wed, May 12, 2021 at 2:55 PM Evan Galpin  wrote:

> Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the "faster"
> behavior, as did v2.23.0. But that "fast" behavior stopped at v2.25.0 (for
> my use case at least) regardless of use_deprecated_read setting.
>
> Thanks,
> Evan
>
>
> On Wed, May 12, 2021 at 2:47 PM Steve Niemitz  wrote:
>
>> use_deprecated_read was broken in 2.19 on the direct runner and didn't do
>> anything. [1]  I don't think the fix is in 2.20 either, but will be in 2.21.
>>
>> [1] https://github.com/apache/beam/pull/14469
>>
>> On Wed, May 12, 2021 at 1:41 PM Evan Galpin 
>> wrote:
>>
>>> I forgot to also mention that in all tests I was setting
>>> --experiments=use_deprecated_read
>>>
>>> Thanks,
>>> Evan
>>>
>>> On Wed, May 12, 2021 at 1:39 PM Evan Galpin 
>>> wrote:
>>>
 Hmm, I think I spoke too soon. I'm still seeing an issue of overall
 DirectRunner slowness, not just pubsub. I have a pipeline like so:

 Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()  |
 FileIO.readMatches()  |  Read file contents  |  etc

 I have temporarily set up a transform between each step to log what's
 going on and illustrate timing issues.  I ran a series of tests changing
 only the SDK version each time since I hadn't noticed this performance
 issue with 2.19.0 (effectively git-bisect). Before each test, I seeded the
 pubsub subscription with the exact same contents.

 SDK version 2.25.0 (I had a build issue with 2.24.0 that I couldn't
 seem to resolve) and onward show a significant slowdown.

 Here is a snippet of logging from v2.25.0:

 *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
 processElement
 INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
 May 12, 2021 11:16:59 A.M.
 org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
 INFO: Matched 2 files for pattern
 gs://my-bucket/my-dir/5004728247517184/**
 May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
 processElement
 INFO: Got ReadableFile: my-file1.json
 May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
 processElement
 INFO: Got ReadableFile: my-file2.json
 May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4
 processElement
 INFO: Got file contents for document_id my-file1.json
 *May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
 processElement
 INFO: Got file contents for document_id my-file2.json

 Note that end-to-end, these steps took about *13 minutes*. With SDK
 2.23.0 and identical user code, the same section of the pipeline took *2
 seconds*:

 *May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1
 processElement
 INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
 May 12, 2021 11:03:40 A.M.
 org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
 INFO: Matched 2 files for pattern
 gs://my-bucket/my-dir/5004728247517184/**
 May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
 processElement
 INFO: Got ReadableFile: my-file1.json
 May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
 processElement
 INFO: Got ReadableFile: my-file2.json
 May 12, 2021 11:03:41 A.M. com.myOrg.myPipeline.PipelineLeg$4
 processElement
 INFO: Got file contents for document_id my-file1.json
 *May 12, 2021 11:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4
 processElement
 INFO: Got file contents for document_id my-file2.json

 Any thoughts on what could be causing this?

 Thanks,
 Evan

 On Wed, May 12, 2021 at 9:53 AM Evan Galpin 
 wrote:

>
>
> On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang 
> wrote:
>
>> Hi Evan,
>>
>> What do you mean by startup delay? Is it the time from when you start
>> the pipeline to the time that you notice the first output record from
>> PubSub?
>>
>
> Yes that's what I meant, the seemingly idle system waiting for pubsub
> output despite data being in the subscription at pipeline start time.
>
> On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía 
>> wrote:
>>
>>> Can you try running direct runner with the option
>>> `--experiments=use_deprecated_read`
>>>
>>
> This seems to work for me, thanks for this! 
>
>
>>> Seems like an instance of
>>> https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
>>> also reported in
>>> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
>>>
>>> We should rollback using the SDF wrapper by default because of the
>>> usability and performance 

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the "faster"
behavior, as did v2.23.0. But that "fast" behavior stopped at v2.25.0 (for
my use case at least) regardless of use_deprecated_read setting.

Thanks,
Evan


On Wed, May 12, 2021 at 2:47 PM Steve Niemitz  wrote:

> use_deprecated_read was broken in 2.19 on the direct runner and didn't do
> anything. [1]  I don't think the fix is in 2.20 either, but will be in 2.21.
>
> [1] https://github.com/apache/beam/pull/14469
>
> On Wed, May 12, 2021 at 1:41 PM Evan Galpin  wrote:
>
>> I forgot to also mention that in all tests I was setting
>> --experiments=use_deprecated_read
>>
>> Thanks,
>> Evan
>>
>> On Wed, May 12, 2021 at 1:39 PM Evan Galpin 
>> wrote:
>>
>>> Hmm, I think I spoke too soon. I'm still seeing an issue of overall
>>> DirectRunner slowness, not just pubsub. I have a pipeline like so:
>>>
>>> Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()  |
>>> FileIO.readMatches()  |  Read file contents  |  etc
>>>
>>> I have temporarily set up a transform between each step to log what's
>>> going on and illustrate timing issues.  I ran a series of tests changing
>>> only the SDK version each time since I hadn't noticed this performance
>>> issue with 2.19.0 (effectively git-bisect). Before each test, I seeded the
>>> pubsub subscription with the exact same contents.
>>>
>>> SDK version 2.25.0 (I had a build issue with 2.24.0 that I couldn't seem
>>> to resolve) and onward show a significant slowdown.
>>>
>>> Here is a snippet of logging from v2.25.0:
>>>
>>> *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>>> processElement
>>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>>> May 12, 2021 11:16:59 A.M.
>>> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
>>> INFO: Matched 2 files for pattern
>>> gs://my-bucket/my-dir/5004728247517184/**
>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>> processElement
>>> INFO: Got ReadableFile: my-file1.json
>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>> processElement
>>> INFO: Got ReadableFile: my-file2.json
>>> May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4
>>> processElement
>>> INFO: Got file contents for document_id my-file1.json
>>> *May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
>>> processElement
>>> INFO: Got file contents for document_id my-file2.json
>>>
>>> Note that end-to-end, these steps took about *13 minutes*. With SDK
>>> 2.23.0 and identical user code, the same section of the pipeline took *2
>>> seconds*:
>>>
>>> *May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>>> processElement
>>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>>> May 12, 2021 11:03:40 A.M.
>>> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
>>> INFO: Matched 2 files for pattern
>>> gs://my-bucket/my-dir/5004728247517184/**
>>> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>> processElement
>>> INFO: Got ReadableFile: my-file1.json
>>> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>> processElement
>>> INFO: Got ReadableFile: my-file2.json
>>> May 12, 2021 11:03:41 A.M. com.myOrg.myPipeline.PipelineLeg$4
>>> processElement
>>> INFO: Got file contents for document_id my-file1.json
>>> *May 12, 2021 11:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4
>>> processElement
>>> INFO: Got file contents for document_id my-file2.json
>>>
>>> Any thoughts on what could be causing this?
>>>
>>> Thanks,
>>> Evan
>>>
>>> On Wed, May 12, 2021 at 9:53 AM Evan Galpin 
>>> wrote:
>>>


 On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang 
 wrote:

> Hi Evan,
>
> What do you mean by startup delay? Is it the time from when you start the
> pipeline to the time that you notice the first output record from PubSub?
>

 Yes that's what I meant, the seemingly idle system waiting for pubsub
 output despite data being in the subscription at pipeline start time.

 On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía  wrote:
>
>> Can you try running direct runner with the option
>> `--experiments=use_deprecated_read`
>>
>
 This seems to work for me, thanks for this! 


>> Seems like an instance of
>> https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
>> also reported in
>> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
>>
>> We should rollback using the SDF wrapper by default because of the
>> usability and performance issues reported.
>>
>>
>> On Sat, May 8, 2021 at 12:57 AM Evan Galpin 
>> wrote:
>>
>>> Hi all,
>>>
>>> I’m experiencing very slow performance and startup delay when
>>> testing a pipeline locally. I’m reading data 

Re: Extremely Slow DirectRunner

2021-05-12 Thread Steve Niemitz
use_deprecated_read was broken in 2.19 on the direct runner and didn't do
anything. [1]  I don't think the fix is in 2.20 either, but will be in 2.21.

[1] https://github.com/apache/beam/pull/14469

On Wed, May 12, 2021 at 1:41 PM Evan Galpin  wrote:

> I forgot to also mention that in all tests I was setting
> --experiments=use_deprecated_read
>
> Thanks,
> Evan
>
> On Wed, May 12, 2021 at 1:39 PM Evan Galpin  wrote:
>
>> Hmm, I think I spoke too soon. I'm still seeing an issue of overall
>> DirectRunner slowness, not just pubsub. I have a pipeline like so:
>>
>> Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()  |
>> FileIO.readMatches()  |  Read file contents  |  etc
>>
>> I have temporarily set up a transform between each step to log what's
>> going on and illustrate timing issues.  I ran a series of tests changing
>> only the SDK version each time since I hadn't noticed this performance
>> issue with 2.19.0 (effectively git-bisect). Before each test, I seeded the
>> pubsub subscription with the exact same contents.
>>
>> SDK version 2.25.0 (I had a build issue with 2.24.0 that I couldn't seem
>> to resolve) and onward show a significant slowdown.
>>
>> Here is a snippet of logging from v2.25.0:
>>
>> *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>> processElement
>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>> May 12, 2021 11:16:59 A.M. org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn
>> process
>> INFO: Matched 2 files for pattern
>> gs://my-bucket/my-dir/5004728247517184/**
>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>> processElement
>> INFO: Got ReadableFile: my-file1.json
>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>> processElement
>> INFO: Got ReadableFile: my-file2.json
>> May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4
>> processElement
>> INFO: Got file contents for document_id my-file1.json
>> *May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
>> processElement
>> INFO: Got file contents for document_id my-file2.json
>>
>> Note that end-to-end, these steps took about *13 minutes*. With SDK
>> 2.23.0 and identical user code, the same section of the pipeline took *2
>> seconds*:
>>
>> *May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>> processElement
>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>> May 12, 2021 11:03:40 A.M. org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn
>> process
>> INFO: Matched 2 files for pattern
>> gs://my-bucket/my-dir/5004728247517184/**
>> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
>> processElement
>> INFO: Got ReadableFile: my-file1.json
>> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
>> processElement
>> INFO: Got ReadableFile: my-file2.json
>> May 12, 2021 11:03:41 A.M. com.myOrg.myPipeline.PipelineLeg$4
>> processElement
>> INFO: Got file contents for document_id my-file1.json
>> *May 12, 2021 11:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4
>> processElement
>> INFO: Got file contents for document_id my-file2.json
>>
>> Any thoughts on what could be causing this?
>>
>> Thanks,
>> Evan
>>
>> On Wed, May 12, 2021 at 9:53 AM Evan Galpin 
>> wrote:
>>
>>>
>>>
>>> On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang  wrote:
>>>
 Hi Evan,

 What do you mean by startup delay? Is it the time from when you start the
 pipeline to the time that you notice the first output record from PubSub?

>>>
>>> Yes that's what I meant, the seemingly idle system waiting for pubsub
>>> output despite data being in the subscription at pipeline start time.
>>>
>>> On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía  wrote:

> Can you try running direct runner with the option
> `--experiments=use_deprecated_read`
>

>>> This seems to work for me, thanks for this! 
>>>
>>>
> Seems like an instance of
> https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
> also reported in
> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
>
> We should rollback using the SDF wrapper by default because of the
> usability and performance issues reported.
>
>
> On Sat, May 8, 2021 at 12:57 AM Evan Galpin 
> wrote:
>
>> Hi all,
>>
>> I’m experiencing very slow performance and startup delay when testing
>> a pipeline locally. I’m reading data from a Google PubSub subscription as
>> the data source, and before each pipeline execution I ensure that data is
>> present in the subscription (readable from GCP console).
>>
>> I’m seeing startup delay on the order of minutes with DirectRunner
>> (5-10 min). Is that expected? I did find a Jira ticket[1] that at first
>> seemed related, but I think it has more to do with BQ than 

Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
I forgot to also mention that in all tests I was setting
--experiments=use_deprecated_read
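
For reference, the flag is passed as a regular pipeline option; a minimal
sketch of how it gets picked up, assuming the usual args-based option parsing:

// Run with: --runner=DirectRunner --experiments=use_deprecated_read
PipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);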

Thanks,
Evan

On Wed, May 12, 2021 at 1:39 PM Evan Galpin  wrote:

> Hmm, I think I spoke too soon. I'm still seeing an issue of overall
> DirectRunner slowness, not just pubsub. I have a pipeline like so:
>
> Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()  |
> FileIO.readMatches()  |  Read file contents  |  etc
>
> I have temporarily set up a transform between each step to log what's
> going on and illustrate timing issues.  I ran a series of tests changing
> only the SDK version each time since I hadn't noticed this performance
> issue with 2.19.0 (effectively git-bisect). Before each test, I seeded the
> pubsub subscription with the exact same contents.
>
> SDK version 2.25.0 (I had a build issue with 2.24.0 that I couldn't seem
> to resolve) and onward show a significant slowdown.
>
> Here is a snippet of logging from v2.25.0:
>
> *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
> processElement
> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
> May 12, 2021 11:16:59 A.M. org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn
> process
> INFO: Matched 2 files for pattern gs://my-bucket/my-dir/5004728247517184/**
> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
> processElement
> INFO: Got ReadableFile: my-file1.json
> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
> processElement
> INFO: Got ReadableFile: my-file2.json
> May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4
> processElement
> INFO: Got file contents for document_id my-file1.json
> *May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
> processElement
> INFO: Got file contents for document_id my-file2.json
>
> Note that end-to-end, these steps took about *13 minutes*. With SDK
> 2.23.0 and identical user code, the same section of the pipeline took *2
> seconds*:
>
> *May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1
> processElement
> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
> May 12, 2021 11:03:40 A.M. org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn
> process
> INFO: Matched 2 files for pattern gs://my-bucket/my-dir/5004728247517184/**
> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
> processElement
> INFO: Got ReadableFile: my-file1.json
> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
> processElement
> INFO: Got ReadableFile: my-file2.json
> May 12, 2021 11:03:41 A.M. com.myOrg.myPipeline.PipelineLeg$4
> processElement
> INFO: Got file contents for document_id my-file1.json
> *May 12, 2021 11:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4
> processElement
> INFO: Got file contents for document_id my-file2.json
>
> Any thoughts on what could be causing this?
>
> Thanks,
> Evan
>
> On Wed, May 12, 2021 at 9:53 AM Evan Galpin  wrote:
>
>>
>>
>> On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang  wrote:
>>
>>> Hi Evan,
>>>
>>> What do you mean by startup delay? Is it the time from when you start the
>>> pipeline to the time that you notice the first output record from PubSub?
>>>
>>
>> Yes that's what I meant, the seemingly idle system waiting for pubsub
>> output despite data being in the subscription at pipeline start time.
>>
>> On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía  wrote:
>>>
 Can you try running direct runner with the option
 `--experiments=use_deprecated_read`

>>>
>> This seems to work for me, thanks for this! 
>>
>>
 Seems like an instance of
 https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
 also reported in
 https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E

 We should rollback using the SDF wrapper by default because of the
 usability and performance issues reported.


 On Sat, May 8, 2021 at 12:57 AM Evan Galpin 
 wrote:

> Hi all,
>
> I’m experiencing very slow performance and startup delay when testing
> a pipeline locally. I’m reading data from a Google PubSub subscription as
> the data source, and before each pipeline execution I ensure that data is
> present in the subscription (readable from GCP console).
>
> I’m seeing startup delay on the order of minutes with DirectRunner
> (5-10 min). Is that expected? I did find a Jira ticket[1] that at first
> seemed related, but I think it has more to do with BQ than DirectRunner.
>
> I’ve run the pipeline with a debugger connected and confirmed that
> it’s minutes before the first DoFn in my pipeline receives any data. Is
> there a way I can profile the direct runner to see what it’s churning on?
>
> Thanks,
> Evan
>
> [1]
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548
>



Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
Hmm, I think I spoke too soon. I'm still seeing an issue of overall
DirectRunner slowness, not just pubsub. I have a pipeline like so:

Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()  |
FileIO.readMatches()  |  Read file contents  |  etc

I have temporarily set up a transform between each step to log what's going
on and illustrate timing issues.  I ran a series of tests changing only the
SDK version each time since I hadn't noticed this performance issue with
2.19.0 (effectively git-bisect). Before each test, I seeded the pubsub
subscription with the exact same contents.
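
The logging transforms are just pass-through DoFns along these lines (a
simplified sketch; LOG is a plain java.util.logging Logger, and the message
varies per step):

.apply("Log elements", ParDo.of(new DoFn<String, String>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // Log and re-emit the element unchanged so per-step timing shows in the logs.
        LOG.info("Got file pattern: " + c.element());
        c.output(c.element());
    }
}))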

SDK version 2.25.0 (I had a build issue with 2.24.0 that I couldn't seem to
resolve) and onward show a significant slowdown.

Here is a snippet of logging from v2.25.0:

*May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
processElement
INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:16:59 A.M. org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn
process
INFO: Matched 2 files for pattern gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file1.json
May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file2.json
May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4 processElement
INFO: Got file contents for document_id my-file1.json
*May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
processElement
INFO: Got file contents for document_id my-file2.json

Note that end-to-end, these steps took about *13 minutes*. With SDK 2.23.0
and identical user code, the same section of the pipeline took *2 seconds*:

*May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1
processElement
INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:03:40 A.M. org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn
process
INFO: Matched 2 files for pattern gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file1.json
May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file2.json
May 12, 2021 11:03:41 A.M. com.myOrg.myPipeline.PipelineLeg$4 processElement
INFO: Got file contents for document_id my-file1.json
*May 12, 2021 11:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4
processElement
INFO: Got file contents for document_id my-file2.json

Any thoughts on what could be causing this?

Thanks,
Evan

On Wed, May 12, 2021 at 9:53 AM Evan Galpin  wrote:

>
>
> On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang  wrote:
>
>> Hi Evan,
>>
>> What do you mean by startup delay? Is it the time from when you start the
>> pipeline to the time that you notice the first output record from PubSub?
>>
>
> Yes that's what I meant, the seemingly idle system waiting for pubsub
> output despite data being in the subscription at pipeline start time.
>
> On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía  wrote:
>>
>>> Can you try running the direct runner with the option
>>> `--experiments=use_deprecated_read`?
>>>
>>
> This seems to work for me, thanks for this! 
>
>
>>> Seems like an instance of
>>> https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
>>> also reported in
>>> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
>>>
>>> We should roll back the use of the SDF wrapper by default because of the
>>> usability and performance issues reported.
>>>
>>>
>>> On Sat, May 8, 2021 at 12:57 AM Evan Galpin 
>>> wrote:
>>>
 Hi all,

 I’m experiencing very slow performance and startup delay when testing a
 pipeline locally. I’m reading data from a Google PubSub subscription as the
 data source, and before each pipeline execution I ensure that data is
 present in the subscription (readable from GCP console).

 I’m seeing startup delay on the order of minutes with DirectRunner
 (5-10 min). Is that expected? I did find a Jira ticket[1] that at first
 seemed related, but I think it has more to do with BQ than DirectRunner.

 I’ve run the pipeline with a debugger connected and confirmed that it’s
 minutes before the first DoFn in my pipeline receives any data. Is there a
 way I can profile the direct runner to see what it’s churning on?

 Thanks,
 Evan

 [1]
 https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548

>>>


Re: Extremely Slow DirectRunner

2021-05-12 Thread Evan Galpin
On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang  wrote:

> Hi Evan,
>
> What do you mean by startup delay? Is it the time from when you start the
> pipeline to the time that you notice the first output record from PubSub?
>

Yes that's what I meant, the seemingly idle system waiting for pubsub
output despite data being in the subscription at pipeline start time.

On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía  wrote:
>
>> Can you try running the direct runner with the option
>> `--experiments=use_deprecated_read`?
>>
>
This seems to work for me, thanks for this! 


>> Seems like an instance of
>> https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
>> also reported in
>> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
>>
>> We should roll back the use of the SDF wrapper by default because of the
>> usability and performance issues reported.
>>
>>
>> On Sat, May 8, 2021 at 12:57 AM Evan Galpin 
>> wrote:
>>
>>> Hi all,
>>>
>>> I’m experiencing very slow performance and startup delay when testing a
>>> pipeline locally. I’m reading data from a Google PubSub subscription as the
>>> data source, and before each pipeline execution I ensure that data is
>>> present in the subscription (readable from GCP console).
>>>
>>> I’m seeing startup delay on the order of minutes with DirectRunner (5-10
>>> min). Is that expected? I did find a Jira ticket[1] that at first seemed
>>> related, but I think it has more to do with BQ than DirectRunner.
>>>
>>> I’ve run the pipeline with a debugger connected and confirmed that it’s
>>> minutes before the first DoFn in my pipeline receives any data. Is there a
>>> way I can profile the direct runner to see what it’s churning on?
>>>
>>> Thanks,
>>> Evan
>>>
>>> [1]
>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548
>>>
>>


Re: Extremely Slow DirectRunner

2021-05-10 Thread Boyuan Zhang
Hi Evan,

What do you mean by startup delay? Is it the time from when you start the
pipeline to the time that you notice the first output record from PubSub?

On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía  wrote:

> Can you try running the direct runner with the option
> `--experiments=use_deprecated_read`?
>
> Seems like an instance of
> https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
> also reported in
> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
>
> We should roll back the use of the SDF wrapper by default because of the
> usability and performance issues reported.
>
>
> On Sat, May 8, 2021 at 12:57 AM Evan Galpin  wrote:
>
>> Hi all,
>>
>> I’m experiencing very slow performance and startup delay when testing a
>> pipeline locally. I’m reading data from a Google PubSub subscription as the
>> data source, and before each pipeline execution I ensure that data is
>> present in the subscription (readable from GCP console).
>>
>> I’m seeing startup delay on the order of minutes with DirectRunner (5-10
>> min). Is that expected? I did find a Jira ticket[1] that at first seemed
>> related, but I think it has more to do with BQ than DirectRunner.
>>
>> I’ve run the pipeline with a debugger connected and confirmed that it’s
>> minutes before the first DoFn in my pipeline receives any data. Is there a
>> way I can profile the direct runner to see what it’s churning on?
>>
>> Thanks,
>> Evan
>>
>> [1]
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548
>>
>


Re: Extremely Slow DirectRunner

2021-05-08 Thread Ismaël Mejía
Can you try running the direct runner with the option
`--experiments=use_deprecated_read`?
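
For example (an illustrative sketch only; the class name is a placeholder), the
flag can be passed as a normal pipeline argument, or the experiment can be added
programmatically before building the pipeline:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.ExperimentalOptions;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class Launcher {
  public static void main(String[] args) {
    // Either pass --experiments=use_deprecated_read as a program argument,
    // or add the experiment in code as shown below.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    ExperimentalOptions.addExperiment(options.as(ExperimentalOptions.class), "use_deprecated_read");
    Pipeline p = Pipeline.create(options);
    // ... build the rest of the pipeline here ...
    p.run().waitUntilFinish();
  }
}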

Seems like an instance of
https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
also reported in
https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E

We should roll back the use of the SDF wrapper by default because of the
usability and performance issues reported.


On Sat, May 8, 2021 at 12:57 AM Evan Galpin  wrote:

> Hi all,
>
> I’m experiencing very slow performance and startup delay when testing a
> pipeline locally. I’m reading data from a Google PubSub subscription as the
> data source, and before each pipeline execution I ensure that data is
> present in the subscription (readable from GCP console).
>
> I’m seeing startup delay on the order of minutes with DirectRunner (5-10
> min). Is that expected? I did find a Jira ticket[1] that at first seemed
> related, but I think it has more to do with BQ than DirectRunner.
>
> I’ve run the pipeline with a debugger connected and confirmed that it’s
> minutes before the first DoFn in my pipeline receives any data. Is there a
> way I can profile the direct runner to see what it’s churning on?
>
> Thanks,
> Evan
>
> [1]
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548
>
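
Regarding profiling the direct runner: one low-effort option (assuming a plain
`java` launch on JDK 11+; adjust for Maven/Gradle runs, and the jar name below is
just a placeholder) is to record a Java Flight Recorder profile and open it in
JDK Mission Control, or to grab a few thread dumps while the pipeline looks idle:

java -XX:StartFlightRecording=duration=300s,filename=directrunner.jfr -jar my-pipeline.jar --runner=DirectRunner
jstack <pid-of-the-pipeline-jvm> > thread-dump.txt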