Hi,

To get DataFrame-level write metrics, you can take a look at the following
trait:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteStatsTracker.scala
and a basic implementation example:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
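To give an idea of the shape, here is a minimal row-counting sketch of those
two traits. Caveats: they live in org.apache.spark.sql.execution.datasources,
which is an internal package, the method signatures shown here match the
Spark 2.4 branch and differ on master (there getFinalStats takes a commit
time, for example), and the RowCount* names are mine:

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.execution.datasources.{
      WriteJobStatsTracker, WriteTaskStats, WriteTaskStatsTracker}

    // Hypothetical stats payload: just a row count per task.
    case class RowCountTaskStats(numRows: Long) extends WriteTaskStats

    // Per-task tracker: counts every row handed to the writer on this task.
    class RowCountTaskStatsTracker extends WriteTaskStatsTracker {
      private var numRows: Long = 0L
      override def newPartition(partitionValues: InternalRow): Unit = ()
      override def newBucket(bucketId: Int): Unit = ()
      override def newFile(filePath: String): Unit = ()
      override def newRow(row: InternalRow): Unit = { numRows += 1 }
      override def getFinalStats(): WriteTaskStats = RowCountTaskStats(numRows)
    }

    // Job-level tracker: aggregates the per-task counts on the driver.
    class RowCountJobStatsTracker extends WriteJobStatsTracker {
      override def newTaskInstance(): WriteTaskStatsTracker =
        new RowCountTaskStatsTracker
      override def processStats(stats: Seq[WriteTaskStats]): Unit = {
        val total = stats.map(_.asInstanceOf[RowCountTaskStats].numRows).sum
        // e.g. log it, or compare it against the expected input count
        println(s"total rows written: $total")
      }
    }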


And here is an example of how it is used in FileStreamSink:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L178
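Note that these trackers are only wired in by Spark's own write paths, as in
that FileStreamSink line; there is no public hook on DataFrameWriter to attach
your own. If all you need is a count of written records, a public-API
alternative (my suggestion, not from the links above) is a SparkListener that
reads each task's output metrics:

    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Sums the records written by all tasks, as reported in task output metrics.
    class RecordsWrittenListener extends SparkListener {
      val recordsWritten = new AtomicLong(0L)
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        // taskMetrics can be null for failed tasks
        Option(taskEnd.taskMetrics).foreach { m =>
          recordsWritten.addAndGet(m.outputMetrics.recordsWritten)
        }
      }
    }

    // Usage (path and variable names are illustrative):
    val listener = new RecordsWrittenListener
    spark.sparkContext.addSparkListener(listener)
    df.write.parquet("hdfs:///tmp/out")
    // Listener events arrive asynchronously, so the total can lag the write.
    println(s"records written so far: ${listener.recordsWritten.get}")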

- About whether it is good practice: it depends on your use case, but
generally speaking I would not do it, at least not for checking your own
logic or checking that Spark is writing correctly.
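
If your use case does call for an explicit check anyway, the bluntest
public-API version is to count before the write and count again after reading
the output back. A minimal sketch (the path is illustrative, and note it
costs an extra full scan of the output):

    // Cache first if recomputing df could change its contents between steps.
    val expected = df.count()                      // rows we intend to write
    df.write.mode("overwrite").parquet("hdfs:///tmp/out")
    val actual = spark.read.parquet("hdfs:///tmp/out").count()  // rows on disk
    require(expected == actual,
      s"row count mismatch: expected $expected, found $actual")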

On Sun, Mar 1, 2020 at 14:32 Manjunath Shetty H <manjunathshe...@live.com> wrote:

> Hi all,
>
> Basically my use case is to validate the DataFrame row count before and
> after writing to HDFS. Is this even a good practice? Or should I rely on
> Spark for guaranteed writes?
>
> If it is a good practice to follow, then how do I get the DataFrame-level
> write metrics?
>
> Any pointers would be helpful.
>
>
> Thanks and Regards
> Manjunath
>