I wanted to post this for validation, to understand whether there is a more efficient way to achieve my goal. I'm currently running this flow as two distinct calculations executing in parallel: 1) a simple witnessed count per key/value pair (append a 1 via mapToPair() and then groupByKey()), and 2) a sum of the actual values in my key/value pairs, transforming the data so it groups properly under groupByKey().
Data source: RDDStream_in (a DStream of raw records)

Workflow 1 (witnessed count):
1) Generate a DStream with flatMap() from the input RDDStream_in, which splits each record into <StringKey1, StringKey2, Value1_to_be_inspected>.
2) Apply filter() to keep only the values I want to see witnessed, which produces a smaller DStream of <StringKey1, StringKey2, Value1_inspected>.
3) Generate a PairDStream with mapToPair() from the previous step, appending a summable value of 1: <<StringKey1, StringKey2, Value1_inspected>, 1>.
4) Apply groupByKey() to the PairDStream to get <<StringKey1, StringKey2, Value1_inspected>, count summed per key>.

Workflow 2 (sum of values):
1) Generate a DStream with flatMap() from the input RDDStream_in, which splits each record into <StringKey1, StringKey2, Value1_to_be_summed>.
2) Apply mapToPair() to the previous DStream, moving Value1 out of the key so it can be summed: <<StringKey1, StringKey2>, Value1_to_be_summed>.
3) Apply groupByKey() to get <<StringKey1, StringKey2>, Value1 summed per key>.

Are there more efficient approaches I should be considering, such as method chaining or another technique, to improve the efficiency of this workflow? A rough Java sketch of both workflows is at the end of this message for reference.

Thanks in advance for your feedback.

DH
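P.S. For concreteness, here is roughly what the two workflows look like in code. This is only a simplified sketch against the Spark 2.x Java DStream API, not my actual job: the socket source, the comma-separated record format, and the parse()/isInteresting() helpers are placeholders I'm using for illustration.

import java.util.Arrays;

import scala.Tuple2;
import scala.Tuple3;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class WorkflowSketch {

  // Placeholder parser: "key1,key2,value" -> one (StringKey1, StringKey2, Value1) record.
  static Tuple3<String, String, String> parse(String line) {
    String[] f = line.split(",");
    return new Tuple3<>(f[0], f[1], f[2]);
  }

  // Placeholder predicate for the values I actually want to witness.
  static boolean isInteresting(String value) {
    return value.startsWith("X");
  }

  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("workflow-sketch").setMaster("local[2]");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    // Stand-in for RDDStream_in.
    JavaDStream<String> input = jssc.socketTextStream("localhost", 9999);

    // Split the raw stream into (StringKey1, StringKey2, Value1) records.
    // (flatMap returns an Iterator in the Spark 2.x Java API; in 1.x it returns an Iterable.)
    JavaDStream<Tuple3<String, String, String>> records =
        input.flatMap(line -> Arrays.asList(parse(line)).iterator());

    // Workflow 1: witnessed count per (StringKey1, StringKey2, Value1_inspected).
    JavaPairDStream<Tuple3<String, String, String>, Integer> witnessedCounts =
        records.filter(r -> isInteresting(r._3()))     // keep only the values I want to witness
               .mapToPair(r -> new Tuple2<>(r, 1))     // append a summable 1
               .groupByKey()                           // group by (key1, key2, value)
               .mapValues(ones -> {                    // sum the 1s per key
                 int sum = 0;
                 for (int one : ones) sum += one;
                 return sum;
               });

    // Workflow 2: sum the actual values per (StringKey1, StringKey2).
    JavaPairDStream<Tuple2<String, String>, Long> valueSums =
        records.mapToPair(r -> new Tuple2<>(new Tuple2<String, String>(r._1(), r._2()),
                                            Long.parseLong(r._3())))
               .groupByKey()
               .mapValues(vals -> {
                 long sum = 0;
                 for (long v : vals) sum += v;
                 return sum;
               });

    witnessedCounts.print();
    valueSums.print();

    jssc.start();
    jssc.awaitTermination();
  }
}

The groupByKey() + mapValues() pair at the end of each workflow is how I'm currently producing the per-key sums described above.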