Hello there, On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin <van...@cloudera.com> wrote: > workbook = xlsxwriter.Workbook('output_excel.xlsx') > worksheet = workbook.add_worksheet() > > data = sc.textFile("xyz.txt") > # xyz.txt is a file whose each line contains string delimited by <SPACE> > > row=0 > > def mapperFunc(x): > for i in range(0,4): > worksheet.write(row, i , x.split(" ")[i]) > row++ > return len(x.split()) > > data2 = data.map(mapperFunc)
> Is using row in 'mapperFunc' like this is a correct way? Will it > increment row each time? No. "mapperFunc" will be executed somewhere else, not in the same process running this script. I'm not familiar with how serializing closures works in Spark/Python, but you'll most certainly be updating the local copy of "row" in the executor, and your driver's copy will remain at "0". In general, in a distributed execution environment like Spark you want to avoid as much as possible using state. "row" in your code is state, so to do what you want you'd have to use other means (like Spark's accumulators). But those are generally expensive in a distributed system, and to be avoided if possible. > Is writing in the excel file using worksheet.write() in side the > mapper function a correct way? No, for the same reasons. Your executor will have a copy of your "workbook" variable. So the write() will happen locally to the executor, and after the mapperFunc() returns, that will be discarded - so your driver won't see anything. As a rule of thumb, your closures should try to use only their arguments as input, or at most use local variables as read-only, and only produce output in the form of return values. There are cases where you might want to break these rules, of course, but in general that's the mindset you should be in. Also note that you're not actually executing anything here. "data.map()" is a transformation, so you're just building the execution graph for the computation. You need to execute an action (like collect() or take()) if you want the computation to actually occur. -- Marcelo