Hi Rahul,

Marcelo's explanation is correct. Here's a possible approach to your program, in pseudo-Python:
# connect to Spark cluster
sc = SparkContext(...)

# load input data
input_data = load_xls(file("input.xls"))
input_rows = input_data['Sheet1'].rows

# create RDD on cluster
input_rdd = sc.parallelize(input_rows)

# munge RDD
result_rdd = input_rdd.map(munge_row)

# collect result RDD to local process
result_rows = result_rdd.collect()

# write output file
write_xls(file("output.xls", "w"), result_rows)

Hope that helps,
-Jey

On Fri, May 30, 2014 at 9:44 AM, Marcelo Vanzin <van...@cloudera.com> wrote:
> Hello there,
>
> On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin <van...@cloudera.com> wrote:
>> workbook = xlsxwriter.Workbook('output_excel.xlsx')
>> worksheet = workbook.add_worksheet()
>>
>> data = sc.textFile("xyz.txt")
>> # xyz.txt is a file where each line contains strings delimited by <SPACE>
>>
>> row = 0
>>
>> def mapperFunc(x):
>>     for i in range(0, 4):
>>         worksheet.write(row, i, x.split(" ")[i])
>>     row += 1
>>     return len(x.split())
>>
>> data2 = data.map(mapperFunc)
>
>> Is using row in 'mapperFunc' like this a correct way? Will it
>> increment row each time?
>
> No. "mapperFunc" will be executed somewhere else, not in the same
> process running this script. I'm not familiar with how serializing
> closures works in Spark/Python, but you'll most certainly be updating
> the local copy of "row" in the executor, and your driver's copy will
> remain at "0".
>
> In general, in a distributed execution environment like Spark, you want
> to avoid using state as much as possible. "row" in your code is state,
> so to do what you want you'd have to use other means (like Spark's
> accumulators). But those are generally expensive in a distributed
> system, and best avoided if possible.
>
>> Is writing to the Excel file using worksheet.write() inside the
>> mapper function a correct way?
>
> No, for the same reasons. Your executor will have a copy of your
> "workbook" variable. So the write() will happen locally on the
> executor, and after mapperFunc() returns, that will be discarded -
> so your driver won't see anything.
>
> As a rule of thumb, your closures should use only their arguments as
> input, or at most use local variables as read-only, and should produce
> output only in the form of return values. There are cases where you
> might want to break these rules, of course, but in general that's the
> mindset you should be in.
>
> Also note that you're not actually executing anything here.
> "data.map()" is a transformation, so you're just building the
> execution graph for the computation. You need to execute an action
> (like collect() or take()) if you want the computation to actually
> occur.
>
> --
> Marcelo
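
P.S. To make the pseudo-Python sketch above concrete, here is one way it
could look with real libraries. This assumes openpyxl for the spreadsheet
I/O (note it handles .xlsx, not the legacy .xls format), and munge_row is
just a placeholder for whatever per-row transformation you actually need:

from pyspark import SparkContext
from openpyxl import Workbook, load_workbook

def munge_row(row):
    # placeholder transformation: upper-case every text cell
    return [c.upper() if isinstance(c, str) else c for c in row]

sc = SparkContext(appName="xlsx-munge")

# read the input workbook locally on the driver
wb_in = load_workbook("input.xlsx", read_only=True)
input_rows = [list(r) for r in wb_in["Sheet1"].iter_rows(values_only=True)]

# distribute the rows, transform them on the cluster, collect the results
result_rows = sc.parallelize(input_rows).map(munge_row).collect()

# write the output workbook locally on the driver
wb_out = Workbook()
ws = wb_out.active
for row in result_rows:
    ws.append(row)
wb_out.save("output.xlsx")

All the spreadsheet work happens on the driver; only the row-by-row
munging is distributed, which is the point of the pattern.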
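
And on Marcelo's note about accumulators: if all you need back from the
mapper is a running count (rather than writing cells from inside it), an
accumulator is the supported way to get that state to the driver. A
minimal sketch, with the file name and per-line logic as placeholders:

from pyspark import SparkContext

sc = SparkContext(appName="accumulator-demo")
rows_seen = sc.accumulator(0)

def count_tokens(line):
    # tasks may only add to an accumulator; its value is read on the driver.
    # Note: updates made inside a transformation can be re-applied if a
    # task is retried, so treat the count as approximate here.
    rows_seen.add(1)
    return len(line.split())

token_counts = sc.textFile("xyz.txt").map(count_tokens)
token_counts.collect()   # map() is lazy; an action forces the computation
print(rows_seen.value)   # number of lines processed, read on the driver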