Beam tracks the amount of time spent in each transform in profiling counters.
There is ongoing work to expose these in a uniform way for all runners
(e.g. in Dataflow they're displayed on the UI), but for the direct runner
you can see an example at
https://github.com/apache/beam/blob/release-2.14.0/sdks/python/apache_beam/runners/portability/fn_api_runner_test.py#L1046
For a raw dump you could do something like:

import pprint

p = beam.Pipeline(...)
p | beam.Read...  # construct your pipeline as usual
results = p.run()
results.wait_until_finish()
# Note: _metrics_by_stage is a private attribute of the direct runner's
# pipeline result, so it may change between releases.
pprint.pprint(results._metrics_by_stage)
On Wed, Jul 24, 2019 at 4:07 PM Yu Watanabe <[email protected]> wrote:
> Hello.
>
> I have a pipeline built on Apache Beam 2.13.0 using Python 3.7.3.
> It takes about 5 hours to ingest 2 sets of approximately 70000 JSON
> objects using the Direct Runner.
>
> I want to diagnose which transforms are taking the most time and improve
> the code for better performance. I found the profiling module below, but
> it does not seem to report the speed of each transform.
>
>
> https://beam.apache.org/releases/pydoc/2.13.0/apache_beam.utils.profiler.html
>
> Is there a module that can be used to monitor the speed of each
> transform? If not, I would appreciate some guidance on how to monitor
> the speed of each transform.
>
> Best Regards,
> Yu Watanabe
>
> --
> Yu Watanabe
> Weekend Freelancer who loves to challenge building data platform
> [email protected]
> LinkedIn: <https://www.linkedin.com/in/yuwatanabe1>
> Twitter: <https://twitter.com/yuwtennis>
>