Hello there,

On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin <van...@cloudera.com> wrote:
> workbook = xlsxwriter.Workbook('output_excel.xlsx')
> worksheet = workbook.add_worksheet()
>
> data = sc.textFile("xyz.txt")
> # xyz.txt is a file whose each line contains string delimited by <SPACE>
>
> row=0
>
> def mapperFunc(x):
>     for i in range(0,4):
>         worksheet.write(row, i , x.split(" ")[i])
>     row++
>     return len(x.split())
>
> data2 = data.map(mapperFunc)

> Is using row in 'mapperFunc' like this a correct way? Will it
> increment row each time?

No. "mapperFunc" will be executed somewhere else, not in the same
process running this script. I'm not familiar with how serializing
closures works in Spark/Python, but you'll most certainly be updating
the local copy of "row" in the executor, and your driver's copy will
remain at "0".
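
A rough way to see this without a cluster is sketched below. It assumes,
as a simplification, that the state your function captures is pickled and
shipped to the worker; "run_remotely" and the dict holding "row" are
hypothetical names made up for the illustration, and Spark's real closure
serialization is more involved than this.

```python
import pickle


def run_remotely(captured_state, lines):
    """Mimic an executor: work on a deserialized copy of the driver's state."""
    # Round-tripping through pickle stands in for shipping the closure.
    state = pickle.loads(pickle.dumps(captured_state))
    for _ in lines:
        state["row"] += 1  # only the executor's copy is updated
    return state


driver_state = {"row": 0}
executor_state = run_remotely(driver_state, ["a b c d", "e f g h"])
# The copy counted two lines; the driver's own "row" was never touched.
```

The counter on the "executor" side advances, while the driver's copy keeps
its initial value, which is the behavior described above.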

In general, in a distributed execution environment like Spark you want
to avoid using state as much as possible. "row" in your code is state,
so to do what you want you'd have to use other means (like Spark's
accumulators). But those are generally expensive in a distributed
system, and are best avoided if possible.

> Is writing to the Excel file using worksheet.write() inside the
> mapper function a correct way?

No, for the same reasons. Your executor will have a copy of your
"workbook" variable. So the write() will happen locally on the
executor, and after mapperFunc() returns, that copy will be discarded,
so your driver won't see anything.

As a rule of thumb, your closures should try to use only their
arguments as input, or at most use local variables as read-only, and
only produce output in the form of return values. There are cases
where you might want to break these rules, of course, but in general
that's the mindset you should be in.
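
One way this mindset could be applied to the script is sketched below. The
helper names ("parse_line", "write_rows", "export") are made up for the
illustration, "sc" is assumed to be an existing SparkContext, and the
sketch assumes the parsed data is small enough to collect() into the
driver's memory. The mapper only turns its argument into a return value;
the workbook is created and written on the driver.

```python
def parse_line(line, n_cols=4):
    """Pure mapper: return the first n_cols space-delimited fields."""
    return line.split(" ")[:n_cols]


def write_rows(rows, path):
    """Driver-side only: write already-collected rows to an .xlsx file."""
    import xlsxwriter  # imported here so the mapper has no dependency on it

    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()
    # enumerate() supplies the row index, so no shared counter is needed.
    for row, fields in enumerate(rows):
        for col, value in enumerate(fields):
            worksheet.write(row, col, value)
    workbook.close()


def export(sc, in_path, out_path):
    """Parse on the executors, then write the spreadsheet on the driver."""
    rows = sc.textFile(in_path).map(parse_line).collect()
    write_rows(rows, out_path)
```

With the names above, the call would look like
export(sc, "xyz.txt", "output_excel.xlsx"). If the data is too large to
collect, a spreadsheet written from a single process is probably not the
right output format anyway.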

Also note that your original script never actually executes anything.
"data.map()" is a transformation, so you're just building the
execution graph for the computation. You need to execute an action
(like collect() or take()) if you want the computation to actually
occur.
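
The difference between a transformation and an action can be seen in
miniature with Python 3's built-in map(), which is also lazy. This is only
a local analogy, not Spark code: map() stands in for the RDD
transformation and list() for an action such as collect(). The "calls"
list is there purely to observe when the function runs.

```python
calls = []  # records each invocation, only to observe when work happens


def count_fields(line):
    calls.append(line)
    return len(line.split())


# "Transformation": builds a lazy iterator, count_fields has not run yet.
lazy = map(count_fields, ["a b", "c d e"])
calls_before_action = list(calls)

# "Action": consuming the iterator finally triggers the computation.
result = list(lazy)
```

Nothing is recorded in "calls" until list() consumes the iterator; in
Spark, similarly, mapperFunc would not run until an action is called on
data2.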

-- 
Marcelo
