Re: Custom RDD: Report Size of Partition in Bytes to Spark

Pedro Rodriguez Mon, 04 Jul 2016 07:34:00 -0700

I just realized I had been replying only to Takeshi.

Thanks for the tip, as it got me on the right track. I am running into an issue
with private[spark] methods, though. It looks like the input metrics start out
as None and are not initialized (verified by throwing a new Exception in the
pattern match cases, both when it is None and when it is not). It looks like
NewHadoopRDD calls getInputMetricsForReadMethod, which sets _inputMetrics if it
is None, but unfortunately that method is private[spark]. Is there a way for
external RDDs to access this method or otherwise initialize _inputMetrics in
1.6.X (it looks like 2.0 makes more of this API public)?


Using reflection, I was able to implement it by mimicking the NewHadoopRDD code,
but I would like to avoid reflection if possible. The source code for the
approach that works is linked below.

RDD code: 
https://github.com/EntilZha/spark-s3/blob/9e632f2a71fba2858df748ed43f0dbb5dae52a83/src/main/scala/io/entilzha/spark/s3/S3RDD.scala#L100-L105
Reflection code: 
https://github.com/EntilZha/spark-s3/blob/9e632f2a71fba2858df748ed43f0dbb5dae52a83/src/main/scala/io/entilzha/spark/s3/PrivateMethodUtil.scala
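
For readers who do not follow the links, here is a minimal sketch of the
general technique: calling a restricted method through Java reflection from
Scala. It is not the code in the repository above and it does not touch Spark.
HypotheticalMetrics, addBytes, PrivateCall and Demo are names made up for
illustration, standing in for a class whose initializer cannot be called
directly.

```scala
import java.lang.reflect.Method

// Hypothetical stand-in for a library class with a restricted method.
// This is NOT Spark's TaskMetrics; it only imitates the "starts as None" shape.
class HypotheticalMetrics {
  private var bytesRead: Option[Long] = None

  // Restricted method: initializes the counter on first use, then adds to it.
  private def addBytes(n: Long): Unit = {
    bytesRead = Some(bytesRead.getOrElse(0L) + n)
  }

  def current: Option[Long] = bytesRead
}

object PrivateCall {
  // Looks up a method declared on the target's class by name and parameter
  // types, makes it accessible, and invokes it with the given arguments.
  def invoke(
      target: AnyRef,
      name: String,
      paramTypes: Seq[Class[_]],
      args: Seq[AnyRef]): AnyRef = {
    val method: Method = target.getClass.getDeclaredMethod(name, paramTypes: _*)
    method.setAccessible(true)
    method.invoke(target, args: _*)
  }
}

object Demo {
  // Reports 1024 bytes through the restricted method and returns the counter.
  def run(): Option[Long] = {
    val metrics = new HypotheticalMetrics
    // classOf[Long] is the primitive long type; the argument itself is boxed.
    PrivateCall.invoke(metrics, "addBytes", Seq(classOf[Long]), Seq(Long.box(1024L)))
    metrics.current
  }
}
```

The lookup is by string name, so it is only checked at runtime and can break
if the target method is renamed or its signature changes between releases,
which is the main reason to prefer a public API where one exists.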

Thanks,
—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni

pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 3, 2016 at 10:31:30 PM, Takeshi Yamamuro (linguin....@gmail.com) wrote:

How about using `SparkListener`?
You can collect IO statistics yourself through TaskMetrics#inputMetrics.

// maropu

On Mon, Jul 4, 2016 at 11:46 AM, Pedro Rodriguez <ski.rodrig...@gmail.com> 
wrote:
Hi All,

I noticed that for some Spark jobs the UI shows the input/output read size. I am
implementing a custom RDD which reads files and would like to report these
metrics to Spark, since they are available to me.

I looked through the RDD source code and a couple of different implementations,
and the best I could find were some Hadoop metrics. Is there a way to simply
report the number of bytes a partition read so Spark can put it on the UI?

Thanks,
—
Pedro Rodriguez



--
---
Takeshi Yamamuro
