Just realized I had been replying only to Takeshi. Thanks for the tip, it got me on the right track. I am running into an issue with private[spark] methods, though. It looks like the input metrics start out as None and are not initialized (verified by throwing a new Exception in the pattern match cases for when it is None and when it is not). It looks like NewHadoopRDD calls getInputMetricsForReadMethod, which sets _inputMetrics if it is None, but unfortunately that method is private[spark]. Is there a way for external RDDs to access this method or otherwise initialize _inputMetrics in 1.6.X (it looks like 2.0 makes more of this API public)?
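To make the behavior described above concrete, here is a minimal sketch of the initialize-on-first-use pattern. `TaskMetricsModel` and `InputMetricsModel` are hypothetical stand-ins written for illustration, not Spark's classes, and the logic is simplified to the two cases mentioned above (None versus already set). In Spark 1.6 the corresponding method is private[spark], which is exactly what an external RDD cannot call directly.

```scala
// Hypothetical stand-ins for illustration only; these are NOT Spark's classes.
final class InputMetricsModel(val readMethod: String) {
  var bytesRead: Long = 0L
}

final class TaskMetricsModel {
  // Starts out empty, like the _inputMetrics field described above.
  private var _inputMetrics: Option[InputMetricsModel] = None

  def inputMetrics: Option[InputMetricsModel] = _inputMetrics

  // Simplified model of the initialize-on-first-use step.
  def getInputMetricsForReadMethod(readMethod: String): InputMetricsModel =
    synchronized {
      _inputMetrics match {
        case Some(existing) => existing
        case None =>
          val created = new InputMetricsModel(readMethod)
          _inputMetrics = Some(created)
          created
      }
    }
}

object LazyMetricsDemo {
  def main(args: Array[String]): Unit = {
    val tm = new TaskMetricsModel
    println(tm.inputMetrics.isDefined)        // false: nothing initialized yet

    val metrics = tm.getInputMetricsForReadMethod("Hadoop")
    metrics.bytesRead += 1024L
    println(tm.inputMetrics.map(_.bytesRead)) // Some(1024)
  }
}
```

Until something calls the initializer, any code that pattern matches on `inputMetrics` only ever sees None, which matches what I observed.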
Using reflection, I was able to implement it by mimicking the NewHadoopRDD code, but I would like to avoid reflection if possible. The source code for the approach that works is linked below.

RDD code: https://github.com/EntilZha/spark-s3/blob/9e632f2a71fba2858df748ed43f0dbb5dae52a83/src/main/scala/io/entilzha/spark/s3/S3RDD.scala#L100-L105
Reflection code: https://github.com/EntilZha/spark-s3/blob/9e632f2a71fba2858df748ed43f0dbb5dae52a83/src/main/scala/io/entilzha/spark/s3/PrivateMethodUtil.scala

Thanks,
--
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni
pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 3, 2016 at 10:31:30 PM, Takeshi Yamamuro (linguin....@gmail.com) wrote:

How about using `SparkListener`? You can collect IO statistics through TaskMetrics#inputMetrics by yourself.

// maropu

On Mon, Jul 4, 2016 at 11:46 AM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:

Hi All,

I noticed that for some Spark jobs the UI shows the input/output read size. I am implementing a custom RDD which reads files and would like to report these metrics to Spark, since they are available to me. I looked through the RDD source code and a couple of different implementations, and the best I could find were some Hadoop metrics. Is there a way to simply report the number of bytes a partition read so Spark can put it on the UI?

Thanks,
Pedro Rodriguez

--
Takeshi Yamamuro
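For reference, the `SparkListener` suggestion quoted above could look roughly like the sketch below. It assumes Spark 1.6.x, where `TaskMetrics#inputMetrics` is an `Option[InputMetrics]` (in 2.0 it is no longer an Option, so the code would need adjusting). The class name `InputBytesListener` is made up for illustration. Note that this only reads the metrics that tasks already report; it does not initialize `_inputMetrics` for a custom RDD.

```scala
import java.util.concurrent.atomic.AtomicLong

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical listener name; sums the input bytes reported by finished tasks.
// Assumes Spark 1.6.x, where inputMetrics is an Option[InputMetrics].
class InputBytesListener extends SparkListener {
  val totalBytesRead = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    for {
      metrics <- Option(taskEnd.taskMetrics) // may be null for failed tasks
      input   <- metrics.inputMetrics        // None if the task reported no input
    } totalBytesRead.addAndGet(input.bytesRead)
  }
}

// Usage, assuming `sc` is an existing SparkContext:
//   val listener = new InputBytesListener
//   sc.addSparkListener(listener)
//   ... run jobs ...
//   println(listener.totalBytesRead.get)
```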