Hi Attila,


I will check why INVALID is getting appended to the mailing address.



What is your use case here?

The client driver application does not use collect; instead it internally
calls a Python script that reads the part-file records [comma-separated
strings] of each cluster node separately and copies the records into a
final CSV file, merging all part-file data into a single CSV. This script
runs on every node, and the per-node results are later combined into a
single file.
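The merge step described above can be sketched in plain Python (a minimal
sketch, not the actual script; the directory layout is an assumption, though
the `part-*` file names match what *saveAsTextFile* produces):

```python
import csv
import glob
import os

def merge_part_files(parts_dir, output_path):
    """Merge Spark part files (comma-separated lines) into one CSV file."""
    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out)
        # saveAsTextFile names its output part-00000, part-00001, ...
        for part in sorted(glob.glob(os.path.join(parts_dir, "part-*"))):
            with open(part) as f:
                for line in f:
                    writer.writerow(line.rstrip("\n").split(","))
```

Sorting the part files keeps the merged output in a deterministic order
across runs.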



*On the other hand is your data really just a collection of strings without
any repetitions*

[Ranju]:

Yes, it is a comma-separated string.

And I just checked the 2nd argument of saveAsTextFile, and I believe reads
and writes on disk will be faster after using compression. I will try this.



So I think there is no special requirement on the type of disk for executing
saveAsTextFile, as these are local I/O operations.



Regards

Ranju



------------

Hi!

I would like to reflect only to the first part of your mail:

I have a large RDD dataset of around 60-70 GB which I cannot send to the driver
using *collect*, so I first write it to disk using *saveAsTextFile*; the
data is then saved as multiple part files on each node of the cluster,
after which the driver reads the data from that storage.


What is your use case here?

As you mention *collect()*, I assume you have to process the data
outside of Spark, maybe with a 3rd-party tool, is that right?

If you have 60-70 GB of data and you write it to a text file, then read it
back within the same application, you still cannot call *collect()* on
it, as it is still 60-70 GB of data, right?

On the other hand, is your data really just a collection of strings without
any repetitions? I ask this because of the file format you are using: text
file. Even for a text file, you can at least pass a compression codec as the
2nd argument of *saveAsTextFile()*
<https://spark.apache.org/docs/3.1.1/api/scala/org/apache/spark/rdd/RDD.html#saveAsTextFile(path:String,codec:Class[_%3C:org.apache.hadoop.io.compress.CompressionCodec]):Unit>
(when you use this link you might need to scroll up a little bit; at least
my Chrome displays the *saveAsTextFile* method without the 2nd codec arg).
As I/O is slow, compressed data can be read back quicker, since there will
be less data on the disk. Check the Snappy
<https://en.wikipedia.org/wiki/Snappy_(compression)> codec for example.
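To illustrate why compression pays off here, a small stdlib-only sketch
(gzip instead of Snappy, purely for illustration; the PySpark call in the
trailing comment uses an illustrative output path):

```python
import gzip

# Repetitive text, like many similar CSV rows, compresses very well.
data = ("value1,value2,value3\n" * 10_000).encode("utf-8")
compressed = gzip.compress(data)

print(f"raw: {len(data)} bytes, gzipped: {len(compressed)} bytes")
# Reading fewer bytes from disk is what makes the later read-back faster.

# The PySpark equivalent would be along these lines (path is illustrative):
# rdd.saveAsTextFile("/data/out",
#     compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
```

The more repetition in the rows, the bigger the saving, which is why the
question about repetitions matters.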

But if your data has structure and you plan to process it further within
Spark, then please consider something far better: a columnar storage
format, namely ORC or Parquet.

Best Regards,

Attila





*From:* Ranju Jain <ranju.j...@ericsson.com.INVALID>
*Sent:* Sunday, March 21, 2021 8:10 AM
*To:* user@spark.apache.org
*Subject:* Spark saveAsTextFile Disk Recommendation



Hi All,



I have a large RDD dataset of around 60-70 GB which I cannot send to the driver
using *collect*, so I first write it to disk using *saveAsTextFile*; the
data is then saved as multiple part files on each node of the cluster,
after which the driver reads the data from that storage.



I have a question: *spark.local.dir* is the directory used as scratch
space, where map output files and RDDs might need to be written by Spark
for shuffle operations etc.

And there it is strongly recommended to use a *local and fast disk* to avoid
any failure or performance impact.
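For reference, that recommendation is typically applied via
spark-defaults.conf along these lines (the mount point is an illustrative
assumption):

```
# spark-defaults.conf -- point Spark's scratch space at a fast local disk
spark.local.dir    /mnt/fast-ssd/spark-scratch
```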



*Do we have any such recommendation for storing the multiple part files of
a large dataset [or big RDD] on a fast disk?*

This will help me to configure the right type of disk for the resulting
part files.



Regards

Ranju
