I wanted to post this for validation, to understand whether there is a more efficient way to achieve my goal. I'm currently running this flow as two distinct calculations executing in parallel: 1) a simple witnessed count per key/value pair (append a 1 via mapToPair() and then groupByKey()), and 2) a sum of the actual values in my key/value pairs, transforming the data so it groups properly under groupByKey().
Data source: RDDStream_in (a DStream of raw records)

Workflow 1 (witnessed count):
1) Generate a DStream with flatMap() from the input RDDStream_in, which splits each record into <StringKey1, StringKey2, Value1_to_be_inspected>.
2) Apply filter() to keep only the values I want to see witnessed, which produces a smaller DStream of <StringKey1, StringKey2, Value1_inspected>.
3) Generate a PairDStream with mapToPair() from the previous step, appending a summable value of 1: <<StringKey1, StringKey2, Value1_inspected>, 1>.
4) Apply groupByKey() to the PairDStream to get <<StringKey1, StringKey2, Value1_inspected>, count summed per key>.

Workflow 2 (sum of values):
1) Generate a DStream with flatMap() from the input RDDStream_in, which splits each record into <StringKey1, StringKey2, Value1_to_be_summed>.
2) Apply mapToPair() to the previous DStream, moving Value1 out of the key so it can be summed: <<StringKey1, StringKey2>, Value1_to_be_summed>.
3) Apply groupByKey() to get <<StringKey1, StringKey2>, Value1 summed per key>.

Are there more efficient approaches I should be considering, such as method chaining or another technique, to improve the efficiency of this workflow? A rough Java sketch of both workflows is at the end of this message for reference.

Thanks in advance for your feedback.

DH
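P.S. For concreteness, here is roughly what the two workflows look like in code. This is only a simplified sketch against the Spark 2.x Java DStream API, not my actual job: the socket source, the comma-separated record format, and the parse()/isInteresting() helpers are placeholders I'm using for illustration.

import java.util.Arrays;

import scala.Tuple2;
import scala.Tuple3;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class WorkflowSketch {

  // Placeholder parser: "key1,key2,value" -> one (StringKey1, StringKey2, Value1) record.
  static Tuple3<String, String, String> parse(String line) {
    String[] f = line.split(",");
    return new Tuple3<>(f[0], f[1], f[2]);
  }

  // Placeholder predicate for the values I actually want to witness.
  static boolean isInteresting(String value) {
    return value.startsWith("X");
  }

  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("workflow-sketch").setMaster("local[2]");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    // Stand-in for RDDStream_in.
    JavaDStream<String> input = jssc.socketTextStream("localhost", 9999);

    // Split the raw stream into (StringKey1, StringKey2, Value1) records.
    // (flatMap returns an Iterator in the Spark 2.x Java API; in 1.x it returns an Iterable.)
    JavaDStream<Tuple3<String, String, String>> records =
        input.flatMap(line -> Arrays.asList(parse(line)).iterator());

    // Workflow 1: witnessed count per (StringKey1, StringKey2, Value1_inspected).
    JavaPairDStream<Tuple3<String, String, String>, Integer> witnessedCounts =
        records.filter(r -> isInteresting(r._3()))     // keep only the values I want to witness
               .mapToPair(r -> new Tuple2<>(r, 1))     // append a summable 1
               .groupByKey()                           // group by (key1, key2, value)
               .mapValues(ones -> {                    // sum the 1s per key
                 int sum = 0;
                 for (int one : ones) sum += one;
                 return sum;
               });

    // Workflow 2: sum the actual values per (StringKey1, StringKey2).
    JavaPairDStream<Tuple2<String, String>, Long> valueSums =
        records.mapToPair(r -> new Tuple2<>(new Tuple2<String, String>(r._1(), r._2()),
                                            Long.parseLong(r._3())))
               .groupByKey()
               .mapValues(vals -> {
                 long sum = 0;
                 for (long v : vals) sum += v;
                 return sum;
               });

    witnessedCounts.print();
    valueSums.print();

    jssc.start();
    jssc.awaitTermination();
  }
}

The groupByKey() + mapValues() pair at the end of each workflow is how I'm currently producing the per-key sums described above.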