I am not 100% sure, as I haven't tried this out, but there is a huge difference between the two. Both foreach and collect are actions, regardless of whether or not the DataFrame is empty.
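
A rough sketch of that difference, in plain Python rather than Spark code, so treat it as an illustration only. The assumptions: `simulate_foreach` and `simulate_collect` are made-up helper names, a nested list stands in for a DataFrame split across two workers, and a `pickle` round trip stands in for Spark serializing the closure and shipping it to each worker. `hrecords` is the driver-side list from the original question.

```python
import pickle


def simulate_foreach(partitions, captured):
    """Mimic foreach: each 'worker' appends to its own deserialized copy."""
    for partition in partitions:
        # Stand-in for shipping the closure (and the list it captured) to a worker.
        worker_copy = pickle.loads(pickle.dumps(captured))
        for row in partition:
            worker_copy.append(row)  # mutates only the worker's copy
        # worker_copy is dropped here; nothing is sent back to the driver.


def simulate_collect(partitions):
    """Mimic collect: each 'worker' sends its rows back to the driver."""
    return [row for partition in partitions for row in partition]


partitions = [["a", "b"], ["c"]]  # hypothetical data spread over two workers
hrecords = []                     # driver-side list

simulate_foreach(partitions, hrecords)
print(hrecords)  # [] : the driver's list was never touched

hrecords = simulate_collect(partitions)
print(hrecords)  # ['a', 'b', 'c']
```

The point of the sketch is only that the appends happen on copies the driver never sees, whereas collect explicitly returns the rows to the driver (at the cost of holding all of them in driver memory).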
Doing a collect brings all the results back to the driver, which can cause it to run out of memory. Foreach applies your function to each element of the DataFrame, but does so across the cluster. This is useful when you need to do something custom for each element (perhaps saving to a db for which there is no driver, or making an HTTP request per element; be careful with the latter because of the per-element overhead).

In your example, I am going to assume that hrecords is something like a list buffer. The reason it ends up empty is that each worker gets sent its own empty copy of the list (it's captured in the closure for foreach) and appends to that copy. The instance of the list at the driver doesn't know about what happened at the workers, so it's empty.

I don't know why Chanh's comment applies here, since I am guessing the df is not empty.

On Tue, Jul 26, 2016 at 1:53 AM, kevin <kiss.kevin...@gmail.com> wrote:

> thank you Chanh
>
> 2016-07-26 15:34 GMT+08:00 Chanh Le <giaosu...@gmail.com>:
>
>> Hi Ken,
>>
>> *blacklistDF -> just DataFrame *
>> Spark is lazy until you call something like* collect, take, write* it
>> will execute the hold process *like you do map or filter before you
>> collect*.
>> That mean until you call collect spark* do nothing* so you df would not
>> have any data -> can’t call foreach.
>> Call collect execute the process -> get data -> foreach is ok.
>>
>>
>> On Jul 26, 2016, at 2:30 PM, kevin <kiss.kevin...@gmail.com> wrote:
>>
>> blacklistDF.collect()
>>
>>
>>
>

--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience