Hi Mich,

>> val lines = dstream.map(_._2)
This maps the record into components? Is that the correct understanding of it <<

Not sure what you are referring to when you say record into components. The dstream itself holds the (key, value) tuples that you would have inserted into Kafka; the map above is basically pulling the value portion out of each tuple.

Say my Kafka producer puts data in as
1 => "abc"
2 => "def"

Then the dstream would hold the tuples
(1,"abc")
(2,"def")

and the map above would give you just the values "abc" and "def".
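To make this concrete, a tiny sketch (assuming dstream is the DStream[(String, String)] that createDirectStream returns, so both keys and values come back as Strings):

val pairs = dstream                // the full (key, value) tuples, e.g. ("1", "abc")
val keys  = dstream.map(_._1)      // just the Kafka keys,   e.g. "1", "2"
val lines = dstream.map(_._2)      // just the Kafka values, e.g. "abc", "def"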
>> The following splits the line into comma separated fields.
val words = lines.map(_.split(',').view(2)) <<

Right, basically the value portion of your kafka data is being handled here.

>> val words = lines.map(_.split(',').view(2))
I am interested in column three So view(2) returns the value. I have also seen other ways like
val words = lines.map(_.split(',').map(line => (line(0), (line(1),line(2) ... <<

The split operation [on the immutable StringLike wrapper around String] is returning back an Array of String. Calling the view method on it is creating an IndexedSeqView over that array, so the element at index 2 is picked out of a lazy view, whereas in the second way you are iterating through it and accessing the elements directly via the index position [line(0), line(1)]. You would have to decide what is best for your use case based on whether evaluation should be lazy or immediate [see references below]. A fuller end-to-end sketch of both approaches is at the very end of this mail, below your quoted message.

References:
http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqLike
http://www.scala-lang.org/files/archive/api/2.10.6/index.html#scala.collection.mutable.IndexedSeqView

Thanking You

---------------------------------------------------------------------------------
Praveen Devarao
Spark Technology Centre
IBM India Software Labs
---------------------------------------------------------------------------------
"Courage doesn't always roar. Sometimes courage is the quiet voice at the end of the day saying I will try again"

From: Mich Talebzadeh <mich.talebza...@gmail.com>
To: "user @spark" <user@spark.apache.org>
Date: 26/04/2016 12:58 pm
Subject: Splitting spark dstream into separate fields

Hi,

Is there any optimum way of splitting a dstream into components?

I am doing Spark streaming and this is the dstream I get

val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

Now that dstream consists of 10,00 price lines per second like below

ID, TIMESTAMP, PRICE
31,20160426-080924,93.53608929178084896656

The columns are separated by commas. Now a couple of questions:

val lines = dstream.map(_._2)

This maps the record into components? Is that the correct understanding of it?

The following splits the line into comma separated fields.

val words = lines.map(_.split(',').view(2))

I am interested in column three, so view(2) returns the value. I have also seen other ways like

val words = lines.map(_.split(',').map(line => (line(0), (line(1),line(2) ...

line(0), line(1) refer to the position of the fields?

Which one is the adopted one or the correct one?

Thanks

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com
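PS: here is the minimal end-to-end sketch putting the two approaches side by side, in case it helps. It assumes Spark 1.x streaming with the direct Kafka API from your snippet; the app name, the broker address "localhost:9092", the topic "prices" and the 2-second batch interval are placeholders for whatever you actually use.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setAppName("PriceStream")
val ssc = new StreamingContext(sparkConf, Seconds(2))

// placeholder Kafka details -- substitute your own broker list and topic
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = Set("prices")

val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

// keep only the Kafka value, e.g. "31,20160426-080924,93.53608929178084896656"
val lines = dstream.map(_._2)

// way 1: index straight into the split array to keep only column three (the price)
val prices = lines.map(_.split(',')(2))

// way 2: keep all three fields, keyed by the ID, as in your second snippet
val keyed = lines.map(_.split(',')).map(f => (f(0), (f(1), f(2))))

prices.print()
keyed.print()

ssc.start()
ssc.awaitTermination()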