With Spark, processing is performed lazily. This means nothing much actually 
happens until you call an "action" - collect() is one example. Another option 
is to write the output in a distributed manner - see write.df() in SparkR.
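
For illustration, a minimal SparkR sketch of the distinction (the input path 
and the "value" column are hypothetical):

    library(SparkR)
    sparkR.session()

    # Transformations are lazy: neither line below reads or filters anything yet
    df <- read.df("input.csv", source = "csv", header = "true", inferSchema = "true")
    filtered <- filter(df, df$value > 0)

    # collect() is an action: it triggers the whole computation and pulls
    # the result into the driver's R session as a local data.frame
    local_result <- collect(filtered)

    # write.df() is also an action, but the executors write the files in
    # parallel instead of funnelling everything through the driver
    write.df(filtered, path = "output", source = "csv", mode = "overwrite")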

With SparkR's dapply(), shipping the data from Spark to R so that your UDF can 
process it can carry significant overhead; see the sketch below. Could you 
provide more information about your case?
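
For reference, a sketch of the dapply() pattern (reusing the hypothetical 
`filtered` from the sketch above; the schema and UDF are placeholders). 
dapply() needs the output schema up front, and each partition is serialized 
from the JVM to an R worker process and back, which is where the overhead 
comes from:

    # dapply() cannot infer the output schema; it must be declared
    schema <- structType(structField("value", "double"),
                         structField("doubled", "double"))

    result <- dapply(filtered,
                     function(part) {
                       # `part` arrives as an ordinary R data.frame
                       data.frame(value = part$value, doubled = part$value * 2)
                     },
                     schema)

    head(result)  # an action: only now is the UDF actually executed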


_____________________________
From: Xiao Liu1 <liux...@us.ibm.com>
Sent: Wednesday, January 18, 2017 11:30 AM
Subject: what does dapply actually do?
To: <user@spark.apache.org>



Hi,
I'm really new and trying to learn SparkR. I have defined a relatively 
complicated user-defined function and used dapply() to apply it to a 
SparkDataFrame. It was very fast, but I am not sure what dapply() has actually 
done, because when I used collect() to see the output, which is very simple, 
it took a long time to get the result. I suppose I don't need to use 
collect(), but without it, how can I output the final results to, say, a .csv 
file?
Thank you very much for the help.

Best Regards,
Xiao



From: Ninad Shringarpure <ni...@cloudera.com>
To: user <user@spark.apache.org>
Date: 01/18/2017 02:24 PM
Subject: Creating UUID using Spark SQL

________________________________



Hi Team,

Is there a standard way of generating a unique id for each row from Spark SQL? 
I am looking for functionality similar to UUID generation in Hive.
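
One possible approach, sketched in SparkR rather than offered as the standard 
way: Spark SQL exposes the Hive-style reflect() function, so the common Hive 
idiom of calling java.util.UUID.randomUUID() per row should carry over (the 
`df` and "my_table" names below are hypothetical):

    # Register a SparkDataFrame as a SQL view, then generate one UUID per
    # row with the built-in Hive-style reflect() function
    createOrReplaceTempView(df, "my_table")
    with_ids <- sql("SELECT reflect('java.util.UUID', 'randomUUID') AS uuid, *
                     FROM my_table")

If the ids only need to be unique rather than actual UUIDs, the built-in 
monotonically_increasing_id() SQL function is a cheaper alternative.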

Let me know if you need any additional information.

Thanks,
Ninad



