Dear friends, I have set up Kafka 0.9.0.0, Spark 1.6.1, and Scala 2.10. My source periodically sends a JSON string to a Kafka topic. I am able to consume this topic using Spark Streaming and print it. The schema of the source JSON is as follows:
{ "id": 121156, "ht": 42, "rotor_rpm": 180, "temp": 14.2, "time": 146593512 }
{ "id": 121157, "ht": 18, "rotor_rpm": 110, "temp": 12.2, "time": 146593512 }
{ "id": 121156, "ht": 36, "rotor_rpm": 160, "temp": 14.4, "time": 146593513 }
{ "id": 121157, "ht": 19, "rotor_rpm": 120, "temp": 12.0, "time": 146593513 }

and so on. In Spark Streaming, I want to find the average of "ht" (height), "rotor_rpm", and "temp" for each "id". I also want to find the max and min of the same fields within a time window (300 seconds in this case).

Q1. Can this be done using plain RDD and streaming functions, or does it require DataFrames/SQL? Note that more fields may be added to the JSON at a later stage, and there will be many more "id"s later as well.

Q2. If it can be done either way, which one would be more efficient and faster? As of now, the entire setup runs on a single laptop.

Thanks in advance.

Regards,
Siva
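To make Q1 concrete, here is a rough, Spark-free sketch of the per-id aggregation I have in mind, assuming each JSON record has already been parsed into a case class. The `Reading`/`Agg` names and the fold over a plain `Seq` are purely illustrative; on the actual DStream I would expect something like `stream.map(r => (r.id, Agg.from(r))).reduceByKeyAndWindow(..., Seconds(300))` with a function that merges two `Agg`s.

```scala
// Illustrative record type; field names mirror the JSON keys above.
case class Reading(id: Long, ht: Double, rotorRpm: Double, temp: Double, time: Long)

// Running aggregate for one id: count plus sum/min/max per field.
// Averages are recovered at the end as sum / n.
case class Agg(n: Long,
               htSum: Double, htMin: Double, htMax: Double,
               rpmSum: Double, rpmMin: Double, rpmMax: Double,
               tempSum: Double, tempMin: Double, tempMax: Double) {
  def merge(r: Reading): Agg = Agg(
    n + 1,
    htSum + r.ht, math.min(htMin, r.ht), math.max(htMax, r.ht),
    rpmSum + r.rotorRpm, math.min(rpmMin, r.rotorRpm), math.max(rpmMax, r.rotorRpm),
    tempSum + r.temp, math.min(tempMin, r.temp), math.max(tempMax, r.temp))
}

object Agg {
  def from(r: Reading): Agg =
    Agg(1,
        r.ht, r.ht, r.ht,
        r.rotorRpm, r.rotorRpm, r.rotorRpm,
        r.temp, r.temp, r.temp)
}

// Group by id and fold each group into one Agg; with a DStream the same
// combine step would run inside reduceByKeyAndWindow instead of a local fold.
def aggregate(rs: Seq[Reading]): Map[Long, Agg] =
  rs.groupBy(_.id)
    .map { case (id, g) =>
      id -> g.tail.foldLeft(Agg.from(g.head))((a, r) => a.merge(r))
    }
```

For example, feeding in the two `121156` records above should give an average ht of (42 + 36) / 2 = 39 for that id.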