Dear friends, I have set up Kafka 0.9.0.0, Spark 1.6.1, and Scala 2.10. My source periodically sends a JSON string to a Kafka topic. I am able to consume this topic using Spark Streaming and print it. The schema of the source JSON is as follows:
{ "id": 121156, "ht": 42, "rotor_rpm": 180, "temp": 14.2, "time": 146593512 }
{ "id": 121157, "ht": 18, "rotor_rpm": 110, "temp": 12.2, "time": 146593512 }
{ "id": 121156, "ht": 36, "rotor_rpm": 160, "temp": 14.4, "time": 146593513 }
{ "id": 121157, "ht": 19, "rotor_rpm": 120, "temp": 12.0, "time": 146593513 }

and so on. In Spark Streaming, I want to find the average of "ht" (height), "rotor_rpm", and "temp" for each "id". I also want to find the max and min of the same fields within a time window (300 seconds in this case).

Q1. Can this be done using plain RDD and streaming functions, or does it require DataFrames/SQL? Note that more fields may be added to the JSON at a later stage, and there will be many more "id"s later as well.

Q2. If it can be done either way, which one would be more efficient and faster? As of now, the entire setup runs on a single laptop.

Thanks in advance.

Regards,
Siva
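To make Q1 concrete, here is a rough, Spark-free sketch of the per-id aggregation I have in mind, assuming each JSON record has already been parsed into a case class. The `Reading`/`Agg` names and the fold over a plain `Seq` are purely illustrative; on the actual DStream I would expect something like `stream.map(r => (r.id, Agg.from(r))).reduceByKeyAndWindow(..., Seconds(300))` with a function that merges two `Agg`s.

```scala
// Illustrative record type; field names mirror the JSON keys above.
case class Reading(id: Long, ht: Double, rotorRpm: Double, temp: Double, time: Long)

// Running aggregate for one id: count plus sum/min/max per field.
// Averages are recovered at the end as sum / n.
case class Agg(n: Long,
               htSum: Double, htMin: Double, htMax: Double,
               rpmSum: Double, rpmMin: Double, rpmMax: Double,
               tempSum: Double, tempMin: Double, tempMax: Double) {
  def merge(r: Reading): Agg = Agg(
    n + 1,
    htSum + r.ht, math.min(htMin, r.ht), math.max(htMax, r.ht),
    rpmSum + r.rotorRpm, math.min(rpmMin, r.rotorRpm), math.max(rpmMax, r.rotorRpm),
    tempSum + r.temp, math.min(tempMin, r.temp), math.max(tempMax, r.temp))
}

object Agg {
  def from(r: Reading): Agg =
    Agg(1,
        r.ht, r.ht, r.ht,
        r.rotorRpm, r.rotorRpm, r.rotorRpm,
        r.temp, r.temp, r.temp)
}

// Group by id and fold each group into one Agg; with a DStream the same
// combine step would run inside reduceByKeyAndWindow instead of a local fold.
def aggregate(rs: Seq[Reading]): Map[Long, Agg] =
  rs.groupBy(_.id)
    .map { case (id, g) =>
      id -> g.tail.foldLeft(Agg.from(g.head))((a, r) => a.merge(r))
    }
```

For example, feeding in the two `121156` records above should give an average ht of (42 + 36) / 2 = 39 for that id.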