Hi Attila,

What is your use case here?
The client driver application does not use collect(); instead it internally calls a 
Python script which reads the part-file records (comma-separated strings) from each 
node separately and copies them into a final CSV file, i.e. it merges all the 
part-file data into a single CSV file. This script runs on every node, and the 
outputs are later combined into a single file.

On the other hand is your data really just a collection of strings without any 
repetitions?
[Ranju]:
Yes, it is a comma-separated string.
I also just checked the 2nd argument of saveAsTextFile, and I believe reads and 
writes on disk will be faster after using compression. I will try this.

So I think there is no special requirement on the type of disk for saveAsTextFile, 
since these are local I/O operations.

Regards
Ranju

------------
Hi!

I would like to respond only to the first part of your mail:


I have a large RDD dataset of around 60-70 GB which I cannot send to the driver 
using collect, so I first write it to disk using saveAsTextFile. The data then 
gets saved in the form of multiple part files on each node of the cluster, and 
after that the driver reads the data from that storage.

What is your use case here?

As you mention collect(), I assume you have to process the data outside of 
Spark, maybe with a 3rd-party tool, is that right?

If you have 60-70 GB of data and you write it to a text file, then read it back 
within the same application, you still cannot call collect() on it, as it is 
still 60-70 GB of data, right?

On the other hand, is your data really just a collection of strings without any 
repetitions? I ask this because of the file format you are using: text file. 
Even for a text file you can at least pass a compression codec as the 2nd 
argument of 
saveAsTextFile()<https://spark.apache.org/docs/3.1.1/api/scala/org/apache/spark/rdd/RDD.html#saveAsTextFile(path:String,codec:Class[_%3C:org.apache.hadoop.io.compress.CompressionCodec]):Unit>
 (when you use this link you might need to scroll up a little bit.. at least my 
Chrome displays the saveAsTextFile method without the 2nd codec argument). As I/O 
is slow, compressed data can be read back more quickly, since there is less 
data on the disk. Check the 
Snappy<https://en.wikipedia.org/wiki/Snappy_(compression)> codec for example.
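
For example, a minimal sketch in Scala, assuming an existing RDD[String] named rdd 
and a placeholder output path (GzipCodec is used here only because Snappy needs the 
native Hadoop libraries installed on every node):

    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.spark.rdd.RDD

    // Each partition is written as a compressed part file (part-00000.gz, ...);
    // the codec class is the optional 2nd argument of saveAsTextFile.
    def writeCompressed(rdd: RDD[String], outputPath: String): Unit =
      rdd.saveAsTextFile(outputPath, classOf[GzipCodec])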

But if your data has a structure and you plan to process it further within 
Spark, then please consider something way better: a columnar storage format, 
namely ORC or Parquet.
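
A rough sketch of what that could look like, assuming the comma-separated strings 
in an RDD[String] named rdd can be split into named columns (the column names, 
SparkSession and output path below are only illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("rdd-to-parquet").getOrCreate()
    import spark.implicits._

    // Split each comma-separated line into columns and build a DataFrame.
    val df = rdd
      .map(_.split(","))
      .map(a => (a(0), a(1), a(2)))
      .toDF("id", "name", "value")

    // Parquet is columnar and compressed (snappy by default), so it is
    // usually much smaller and faster to read back than plain text.
    df.write.parquet("/path/to/output")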

Best Regards,
Attila


From: Ranju Jain <ranju.j...@ericsson.com.INVALID>
Sent: Sunday, March 21, 2021 8:10 AM
To: user@spark.apache.org
Subject: Spark saveAsTextFile Disk Recommendation

Hi All,

I have a large RDD dataset of around 60-70 GB which I cannot send to the driver 
using collect, so I first write it to disk using saveAsTextFile. The data then 
gets saved in the form of multiple part files on each node of the cluster, and 
after that the driver reads the data from that storage.

I have a question: spark.local.dir is the directory used as scratch space, where 
map output files and RDDs that Spark needs to write to disk end up, e.g. for 
shuffle operations.
For that directory it is strongly recommended to use a local and fast disk to 
avoid any failure or performance impact.
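
For reference, that setting can be pointed at such a disk, e.g. (the mount point 
below is just a placeholder; in cluster mode the cluster manager may override it):

    import org.apache.spark.SparkConf

    // Put Spark's scratch space (shuffle files, spilled RDD blocks) on a fast
    // local disk; can also be set in spark-defaults.conf or via --conf.
    val conf = new SparkConf()
      .set("spark.local.dir", "/mnt/fast-local-ssd/spark-scratch")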

Do we have any such recommendation for storing the multiple part files of a large 
dataset [or big RDD] on a fast disk?
This will help me configure the right type of disk for the resulting part files.

Regards
Ranju
