Thanks for bringing up this question, Holden. I have also been thinking about this for a while, and my impression is that Beam needs to expose more 'system' metrics to users; so far we have mostly focused on filling the user-defined metrics space. However, once anyone starts using Beam, they naturally need metrics to monitor the progress of their pipelines in production.
We have discussed possible metrics for the IOs in the past without much progress. See this proposal for example: https://lists.apache.org/thread.html/18dd491f704e7bbcf1b6ce895c82e7c3b35981b0300dbc4142a32105@%3Cdev.beam.apache.org%3E I think this was/is an excellent idea; we should probably bring it back, add some other metrics provided by the runners, and expose all of those in a unified way (with the same Beam API).

For inspiration on possible runner metrics, maybe we can look at what each system offers. A few weeks ago I saw the talk on monitoring for Dataflow from Google Next, and there are some interesting ones there: Monitoring and improving your big data applications (Google Cloud Next '17) https://www.youtube.com/watch?v=hEteVlEHa60

I know that each data processing system (e.g. Spark, Flink, Dataflow, etc.) has its own metrics sub-system, and we can argue whether this should be a task for Beam, which so far has been just a 'translation' layer. But if we really want to bring more users to Beam, we need to offer at least some convenient, unified methods for these kinds of tasks. Also, with the 'portability' effort we will probably need a basic set of metrics to know what is going on inside the SDK harnesses.
