We have a Spark Streaming application that runs approximately 12 jobs per
batch, with 4-second batches. Each job has several transformations and one
action (output to Cassandra), which triggers execution of the job's DAG.

For example, the first job:
job 1:
*receive stream A --> map --> filter --> (union with another stream B)
--> map -->* groupByKey --> transform --> reduceByKey --> map
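
In code, job 1 looks roughly like the following. This is a minimal sketch:
the socket sources, the logic inside the map/filter/transform steps, and the
final foreachRDD are placeholders standing in for our real inputs, business
logic, and Cassandra write.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.dstream.DStream

  object Job1Sketch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("streaming-app")
      val ssc  = new StreamingContext(conf, Seconds(4))      // 4-second batches

      // placeholder receivers standing in for our real input streams
      val streamA: DStream[String] = ssc.socketTextStream("hostA", 9999)
      val streamB: DStream[(String, Int)] =
        ssc.socketTextStream("hostB", 9999).map(line => (line, 1))

      // the starred portion of job 1: the DStream that is reused later
      val intermediate: DStream[(String, Int)] =
        streamA
          .map(line => (line.split(",")(0), 1))              // map (placeholder parsing)
          .filter { case (key, _) => key.nonEmpty }          // filter
          .union(streamB)                                    // union with stream B
          .map(identity)                                     // second map (placeholder)

      // the rest of job 1, ending in the job's single action
      intermediate
        .groupByKey()                                        // DStream[(String, Iterable[Int])]
        .transform(rdd => rdd)                               // per-batch RDD logic (placeholder)
        .reduceByKey(_ ++ _)                                 // reduceByKey
        .map { case (key, values) => (key, values.size) }    // final map
        .foreachRDD(rdd => rdd.foreachPartition(_ => ()))    // stand-in for the Cassandra save

      ssc.start()
      ssc.awaitTermination()
    }
  }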

Likewise, we go through a few more transformations and save to the database
(job 2, job 3, ...).

Recently we added a new transformation further downstream, in which we union
the output DStream from job 1 (the starred portion above) with the output of
a new transformation (job 5). It appears the entire upstream execution is
repeated, which is redundant (I can see this both in the execution graph and
in the batch processing times).

That is, with this additional transformation (a union with a stream already
processed upstream), each batch runs up to 2.5 times slower than runs without
the union. If I cache the DStream from job 1 (the starred portion),
performance improves substantially, but we hit out-of-memory errors within a
few hours.
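
The caching experiment amounts to this (for DStreams, cache() defaults to the
serialized in-memory storage level; the commented-out persist() line is a
variation we have not validated, which would let cached blocks spill to disk
instead of exhausting the heap):

  // cache the reused DStream from job 1
  // (DStream.cache() == persist(StorageLevel.MEMORY_ONLY_SER))
  intermediate.cache()

  // variation (instead of cache()): spill to disk under memory pressure
  // import org.apache.spark.storage.StorageLevel
  // intermediate.persist(StorageLevel.MEMORY_AND_DISK_SER)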

What is the recommended way to cache/unpersist in such a scenario? There is
no DStream-level unpersist(). Setting "spark.streaming.unpersist" to true and
calling streamingContext.remember(duration) did not help.
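
For reference, this is how we set those two cleanup knobs (the 8-second
remember window is an illustrative value, roughly two batches; our actual
duration differed, and neither setting helped):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf()
    .setAppName("streaming-app")
    .set("spark.streaming.unpersist", "true")  // let Spark clean up old RDDs automatically

  val ssc = new StreamingContext(conf, Seconds(4))
  ssc.remember(Seconds(8))                     // keep each batch's RDDs for ~2 batches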
