Hi all, I am using Spark on EMR to process data. Basically, I read data from AWS S3, apply transformations, and after the transformations I write the data back to S3.
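For context, the job shape is roughly the following (a minimal PySpark sketch; the bucket paths and the transformation itself are placeholders, not the real job):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3-etl").getOrCreate()

# Read the raw input from S3 (placeholder path).
df = spark.read.parquet("s3://my-input-bucket/raw/")

# Placeholder transformation; the real job is a chain of similar
# DataFrame operations.
result = df.filter(F.col("status") == "active").withColumn(
    "processed_at", F.current_timestamp()
)

# Write the transformed output back to S3 (placeholder path).
result.write.mode("overwrite").parquet("s3://my-output-bucket/processed/")
```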
Recently we have found that HDFS utilization (/mnt/hdfs) is getting too high, so I disabled log aggregation by setting `yarn.log-aggregation-enable` to `false`. I am not writing any data to HDFS (/mnt/hdfs) myself, yet Spark is still creating blocks and writing data into it, even though we do all the operations in memory. Is there any specific operation that writes data to the DataNode (HDFS)? Here are the HDFS directories that were created:

```
15.4G /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized/subdir1
129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current/finalized
129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812/current
129G  /mnt/hdfs/current/BP-6706123673-10.xx.xx.xxx-1588026945812
129G  /mnt/hdfs/current
129G  /mnt/hdfs
```
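In case it helps with the diagnosis, this is roughly how I am mapping the DataNode block usage back to HDFS paths (a sketch using the Hadoop FileSystem API through PySpark's JVM gateway; `spark._jvm` and `spark._jsc` are internal accessors, so treat this as illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-usage-check").getOrCreate()

# Get a handle on the cluster's default FileSystem (HDFS on EMR).
jvm = spark._jvm
hadoop_conf = spark._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

# Print the total size under each top-level HDFS directory, to see
# which paths account for the blocks under /mnt/hdfs.
for status in fs.listStatus(jvm.org.apache.hadoop.fs.Path("/")):
    summary = fs.getContentSummary(status.getPath())
    print(status.getPath().toString(), summary.getLength())
```

(The equivalent from the shell would be `hdfs dfs -du -h /`.)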