Hi Attila,
I will check why INVALID is getting appended to the mailing address.

*What is your use case here?*
[Ranju]: The client driver application does not use *collect*; instead it invokes a Python script that reads the part-file records [comma-separated strings] on each cluster node separately and copies them into a final CSV file, thereby merging all part-file data into a single CSV. This script runs on every node, and the per-node files are later combined into a single file.

*On the other hand, is your data really just a collection of strings without any repetitions?*
[Ranju]: Yes, it is comma-separated strings.

I also checked the 2nd argument of *saveAsTextFile*, and I believe reads and writes on disk will be faster after use of compression. I will try this.

So I think there is no special requirement on the type of disk for the execution of *saveAsTextFile*, as these are local I/O operations.

Regards
Ranju

------------

Hi!

I would like to reflect only on the first part of your mail:

"I have a large RDD dataset of around 60-70 GB which I cannot send to the driver using *collect*, so I first write it to disk using *saveAsTextFile*; the data then gets saved in the form of multiple part files on each node of the cluster, and after that the driver reads the data from that storage."

What is your use case here? As you mention *collect()*, I assume you have to process the data outside of Spark, maybe with a 3rd-party tool, isn't it? If you have 60-70 GB of data and you write it to a text file and then read it back within the same application, you still cannot call *collect()* on it, as it is still 60-70 GB of data, right?

On the other hand, is your data really just a collection of strings without any repetitions? I ask this because of the file format you are using: text file.
Even for a text file, you can at least pass a compression codec as the 2nd argument of *saveAsTextFile()* <https://spark.apache.org/docs/3.1.1/api/scala/org/apache/spark/rdd/RDD.html#saveAsTextFile(path:String,codec:Class[_%3C:org.apache.hadoop.io.compress.CompressionCodec]):Unit> (when you use this link you might need to scroll up a little bit; at least my Chrome displays the *saveAsTextFile* method without the 2nd codec argument). As I/O is slow, compressed data can be read back quicker, since there is less data on the disk. Check the Snappy <https://en.wikipedia.org/wiki/Snappy_(compression)> codec, for example.

But if your data has structure and you plan to process it further within Spark, then please consider something way better: a columnar storage format, namely ORC or Parquet.

Best Regards,
Attila

*From:* Ranju Jain <ranju.j...@ericsson.com.INVALID>
*Sent:* Sunday, March 21, 2021 8:10 AM
*To:* user@spark.apache.org
*Subject:* Spark saveAsTextFile Disk Recommendation

Hi All,

I have a large RDD dataset of around 60-70 GB which I cannot send to the driver using *collect*, so I first write it to disk using *saveAsTextFile*; the data then gets saved in the form of multiple part files on each node of the cluster, and after that the driver reads the data from that storage.

I have a question: *spark.local.dir* is the directory used as scratch space, where map output files and RDDs might need to be written by Spark for shuffle operations etc., and there it is strongly recommended to use a *local and fast disk* to avoid any failure or performance impact.

*Do we have any such recommendation for storing the multiple part files of a large dataset [or big RDD] on a fast disk?*

This will help me configure the right type of disk for the resulting part files.

Regards
Ranju
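[Editor's note] The per-node merge step Ranju describes earlier in the thread (a plain Python script that concatenates each node's part files into one final CSV) might look roughly like the sketch below. The function name and paths are illustrative assumptions, not taken from the thread:

```python
# Hypothetical sketch of the driver-/node-side merge step: each executor has
# written comma-separated records into part-* files via saveAsTextFile, and a
# plain-Python script concatenates them into a single CSV file.
import glob
import shutil

def merge_part_files(input_dir: str, output_csv: str) -> int:
    """Concatenate all part-* files under input_dir into output_csv.

    Returns the number of part files merged.
    """
    part_files = sorted(glob.glob(f"{input_dir}/part-*"))
    with open(output_csv, "wb") as out:
        for part in part_files:
            with open(part, "rb") as src:
                # Stream copy in chunks; avoids loading a whole part file
                # (potentially GBs) into memory at once.
                shutil.copyfileobj(src, out)
    return len(part_files)
```

Sorting the part-file names keeps the merged output in the same order Spark wrote the partitions (part-00000, part-00001, ...).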
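[Editor's note] Attila's compression suggestion carries over to the read side: when *saveAsTextFile* is given a codec such as GzipCodec, each part file is written compressed (e.g. part-00000.gz), and the merge script only needs a gzip-aware reader. A minimal stand-alone sketch in plain Python (no Spark involved; file names and helpers are illustrative):

```python
# Illustrative sketch, not Spark itself: write records one per line,
# gzip-compressed, the way a GzipCodec-compressed part file would look,
# then read them back transparently.
import gzip

def write_compressed_part(path: str, records: list) -> None:
    """Write records one per line into a gzip-compressed part file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(rec + "\n")

def read_compressed_part(path: str) -> list:
    """Read a gzip-compressed part file back into a list of lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]
```

For repetitive comma-separated records (the case discussed in the thread), the compressed file is substantially smaller than the raw text, which is exactly why the read-back is faster: less data has to come off the disk.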