With Spark, processing is performed lazily: nothing much actually happens until you call an "action", such as collect(). Another way to trigger execution is to write the output in a distributed manner; see write.df() in R.
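To illustrate, here is a minimal SparkR sketch (the faithful demo data, the schema, and the output path are placeholders, not from this thread):

library(SparkR)
sparkR.session()

df <- as.DataFrame(faithful)

# dapply() only records the transformation in the query plan;
# the UDF has not actually run yet.
schema <- structType(structField("eruptions", "double"),
                     structField("waiting", "double"),
                     structField("ratio", "double"))
result <- dapply(df, function(x) {
  x$ratio <- x$eruptions / x$waiting
  x
}, schema)

# Either action below triggers the real computation. collect()
# pulls every row back to the driver as a local data.frame:
local <- collect(result)

# write.df() instead writes the result out in a distributed
# manner, e.g. as .csv, without funneling it through the driver:
write.df(result, path = "/tmp/result_csv", source = "csv", mode = "overwrite")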
With SparkR dapply(), passing the data from Spark to R for processing by your UDF can have significant overhead. Could you provide more information on your case?

_____________________________
From: Xiao Liu1 <liux...@us.ibm.com>
Sent: Wednesday, January 18, 2017 11:30 AM
To: <user@spark.apache.org>
Subject: what does dapply actually do?

Hi,

I'm really new and trying to learn SparkR. I have defined a relatively complicated user-defined function and used dapply() to apply it to a SparkDataFrame. It was very fast. But I am not sure what dapply() has actually done, because when I used collect() to see the output, which is very simple, it took a long time to get the result. I suppose maybe I don't need to use collect(), but without it, how can I output the final results, say, to a .csv file?

Thank you very much for the help.

Best Regards,
Xiao

From: Ninad Shringarpure <ni...@cloudera.com>
To: user <user@spark.apache.org>
Date: 01/18/2017 02:24 PM
Subject: Creating UUID using Spark SQL
________________________________

Hi Team,

Is there a standard way of generating a unique id for each row in Spark SQL? I am looking for functionality similar to UUID generation in Hive. Let me know if you need any additional information.

Thanks,
Ninad
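For the UUID question, a hedged sketch (not from the thread; the people table is a made-up placeholder): Spark SQL 2.3+ ships a built-in uuid() expression, and on earlier versions the built-in reflect() function can call java.util.UUID, mirroring the usual Hive idiom. In SparkR:

library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(name = c("a", "b", "c")))
createOrReplaceTempView(df, "people")

# Spark SQL 2.3+ has a built-in uuid() expression:
withUuid <- sql("SELECT uuid() AS id, * FROM people")

# On older versions, reflect() can reach java.util.UUID,
# the same trick the Hive UDF relies on:
withUuid2 <- sql("SELECT reflect('java.util.UUID', 'randomUUID') AS id, * FROM people")

# If any unique id (not necessarily a UUID) is enough,
# monotonically_increasing_id() is another built-in option:
withId <- sql("SELECT monotonically_increasing_id() AS id, * FROM people")

head(withUuid)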