You are mixing things together here, and it does not make sense. Writing data to HDFS does not require that all the data be transferred back to the driver and THEN saved to HDFS.

That would be a disaster and it would never scale. I suggest checking the documentation more carefully, because I believe you are a bit confused.
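
To illustrate (a minimal sketch only; the session setup, data and output path are placeholders, not anything from your job): a DataFrameWriter save is planned on the driver but executed by the executors, each task writing its own part-file directly to HDFS.

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: each task writes its own part-file under the output
    // directory; the full dataset is never collected to the driver.
    val spark = SparkSession.builder().appName("write-sketch").getOrCreate()
    val df = spark.range(0L, 1000000L).toDF("id")   // placeholder data

    df.write
      .mode("overwrite")
      .parquet("hdfs:///tmp/write-sketch")          // placeholder path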

regards,

Apostolos



On 07/09/2018 05:39 μμ, James Starks wrote:
Is df.write.mode(...).parquet("hdfs://..") also an action? Checking the doc shows
that my Spark job doesn't use any of the action functions listed there. But the
saveXXXX functions look similar to df.write.mode(overwrite).parquet("hdfs://path/to/parquet-file"),
which my Spark job does use. Therefore I am thinking that maybe this is the
reason why my job's driver consumes such an amount of memory.

https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#actions

My Spark job's driver program consumes too much memory, so I want to prevent
that by writing the data to HDFS on the executor side, instead of waiting for
the data to be sent back to the driver program (and then written to HDFS).
This is because our worker servers have more memory than the machine that runs
the driver program. If I can write the data to HDFS from the executors, then
the driver memory for my Spark job can be reduced.

Alternatively, does Spark support streaming reads from a database (i.e. Spark
Streaming + Spark SQL)?
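
For reference, this is the kind of partitioned JDBC read I have been looking at
(a sketch only; the URL, table, credentials and bounds are placeholders, not our
real setup). With partitionColumn/lowerBound/upperBound/numPartitions each
executor fetches its own range of rows instead of everything going through a
single connection:

    import org.apache.spark.sql.SparkSession

    // Sketch only: all connection details and bounds below are placeholders.
    val spark = SparkSession.builder().appName("jdbc-read-sketch").getOrCreate()

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")
      .option("dbtable", "public.events")
      .option("user", "reader")
      .option("password", "secret")
      .option("partitionColumn", "id")   // numeric (or date/timestamp) column
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")     // number of parallel range queries
      .load()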

Thanks for your reply.



‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On 7 September 2018 4:15 PM, Apostolos N. Papadopoulos <papad...@csd.auth.gr> 
wrote:

Dear James,

-   Check the Spark documentation to see which actions return a lot of
     data back to the driver. One of these actions is collect(); take(x)
     and reduce() are actions as well.

     Before executing collect(), find out what the size of your RDD/DF is
     (see the sketch after this list).

-   I cannot understand the phrase "hdfs directly from the executor". You
     can specify an HDFS file as your input, and you can also use HDFS to
     store your output.
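
     A minimal sketch of that check (`df` here stands for whatever
     DataFrame/Dataset you are about to act on; the names are placeholders):

         // count() is an action, but it returns only a single number.
         val n = df.count()
         println(s"rows: $n")

         // take(10) pulls just 10 rows to the driver.
         val sample = df.take(10)

         // collect() pulls EVERY row to the driver -- avoid it on large data.
         // val everything = df.collect()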

     regards,

     Apostolos

     On 07/09/2018 05:04 μμ, James Starks wrote:


I have a Spark job that reads data from a database. By increasing the submit
parameter '--driver-memory 25g' the job works without a problem locally, but
not in the prod env, because the prod master does not have enough capacity.
So I have a few questions:
-  Which functions, such as collect(), would cause the data to be sent
back to the driver program?
   My job so far merely uses `as`, `filter`, `map`, and `filter`.

-   Is it possible to write data (in parquet format, for instance) to
     HDFS directly from the executors? If so, how can I do it (any code
     snippet, doc for reference, or keyword to search for; I can't find
     anything with e.g. `spark direct executor hdfs write`)? A sketch of
     what I mean is below.
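
     For example, a read-transform-write pipeline like the one below never
     calls collect(); the executors read, filter, map and write their own
     partitions (a sketch only, not my actual job; the case class, JDBC
     options and output path are made-up placeholders):

         import org.apache.spark.sql.SparkSession

         case class Event(id: Long, kind: String)   // assumed schema, placeholder

         val spark = SparkSession.builder().appName("pipeline-sketch").getOrCreate()
         import spark.implicits._

         spark.read
           .format("jdbc")
           .option("url", "jdbc:postgresql://db-host:5432/mydb")  // placeholder
           .option("dbtable", "public.events")                    // placeholder
           .load()
           .as[Event]                                    // typed view, still lazy
           .filter(_.kind == "click")                    // runs on the executors
           .map(e => e.copy(kind = e.kind.toUpperCase))  // runs on the executors
           .write
           .mode("overwrite")
           .parquet("hdfs:///tmp/events-clicks")         // written by the executors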


Thanks




--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol


