I am learning more about Spark (and in this case Spark Streaming) and am
getting that a function like dstream.map() takes a function and applies it
to each element of the RDD, which in turn returns a new RDD based on the
original.

That's cool for the simple map functions in the examples, where a lambda is
used to take x and do x * x, but what happens in Python (specifically)
with more complex functions? Especially those that use modules (that ARE
available on all nodes).
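
(For the simple case I mean something like this; squared is just an
illustrative name, not something in my actual job:)

squared = dstream.map(lambda x: x * x)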

For example, instead of a simple map, I want to take a line of data and
regex-parse it into fields. It's still a map (not a flatMap) in that
it's a one-to-one return. (One record of the RDD, a line of text, would
return one parsed record as a Python dict; see the example at the bottom.)

In my Spark Streaming job, I have import re in the "main" part of the file,
and this all seems to work, but I want to ensure I am not "by default"
forcing computations onto the driver rather than distributing them.

This is "working" in that it's returning the expected data, but I want to
ensure I am not doing something weird by having a transform function use
a module that's imported only at the driver.  (Should I be calling import
re IN the function?)
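
In other words, should the parse function instead look roughly like this?
(Just a sketch of the alternative I'm asking about, not what I'm running
now.)

def parseLine(line):
    # Alternative: import re here, inside the function,
    # rather than in the "main" part of the file
    import re
    m = re.search(r"^(\w\w\w  ?\d\d? \d\d:\d\d:\d\d) ([^ ]+) ", line[1])
    if m:
        return {'field1': m.group(1), 'field2': m.group(2)}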

If there are any good docs on this, I'd love to understand it more.

Thanks!

John



Example

import re  # currently imported in the "main" part of the file
from pyspark.streaming.kafka import KafkaUtils


def parseLine(line):
    # Raw string so the backslashes in the pattern survive as-is
    restr = r"^(\w\w\w  ?\d\d? \d\d:\d\d:\d\d) ([^ ]+) "
    logre = re.compile(restr)
    # Why does every record of the RDD have a None value in the
    # first position of the tuple?
    m = logre.search(line[1])
    rec = {}
    if m:
        rec['field1'] = m.group(1)
        rec['field2'] = m.group(2)
        return rec


fwlog_dstream = KafkaUtils.createStream(ssc, zkQuorum,
                                        "sparkstreaming-fwlog_parsed",
                                        {kafka_src_topic: 1})

recs = fwlog_dstream.map(parseLine)
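
(Just to sanity-check the output, something like this is all I need; a
minimal sketch, where ssc is the StreamingContext created earlier:)

recs.pprint()
ssc.start()
ssc.awaitTermination()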
