Sent from the Apache Spark User List mailing list archive at Nabble.com.
1. Since the RDD of the previous batch is used to create the RDD of the
next batch, the lineage of dependencies in the RDDs grows without bound.
That's not good, because it increases fault-recovery times, task sizes,
etc. Checkpointing saves the data of an RDD to HDFS and truncates the
lineage.
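A minimal sketch of what such periodic checkpointing could look like (this is an assumption-laden illustration, not code from the thread: `ssc`, `dstreamOfIntegers`, the hypothetical checkpoint path, and the every-10-batches interval are all invented names for the sake of the example):

```scala
import org.apache.spark.rdd.RDD

// Assumed: `ssc` is the StreamingContext, and a checkpoint directory has
// been set, e.g. ssc.checkpoint("hdfs:///tmp/checkpoints") (hypothetical path).
var uniqueValuesRDD: RDD[Int] = ssc.sparkContext.emptyRDD[Int]
var batchIndex = 0L

dstreamOfIntegers.foreachRDD { newDataRDD =>
  uniqueValuesRDD = uniqueValuesRDD.union(newDataRDD).distinct()
  uniqueValuesRDD.cache()  // checkpointing recomputes the RDD unless it is cached
  batchIndex += 1
  // Every 10 batches, write the RDD's data to HDFS and cut the lineage,
  // so recovery time and task size stay bounded.
  if (batchIndex % 10 == 0) uniqueValuesRDD.checkpoint()
}
```

Checkpointing every batch also works but pays the HDFS write cost each interval; checkpointing every N batches trades a slightly longer lineage for less I/O.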
Hi Tathagata,
Thanks for the solution. Actually, I will use the number of unique integers
in each batch instead of the cumulative number of unique integers.
I do have two questions about your code:
1. Why do we need uniqueValuesRDD? Why do we need to call
uniqueValuesRDD.checkpoint()?
2. Where is
Do you want to continuously maintain the set of unique integers seen since
the beginning of stream?
var uniqueValuesRDD: RDD[Int] = ...

dstreamOfIntegers.transform(newDataRDD => {
  val newUniqueValuesRDD = newDataRDD.union(uniqueValuesRDD).distinct()
  uniqueValuesRDD = newUniqueValuesRDD
  //
Hi all,
I am working on a pipeline that needs to join two Spark streams. The input
is a stream of integers, and the output is each integer's number of
appearances divided by the total number of unique integers. Suppose the
input is:
1
2
3
1
2
2
There are 3 unique integers, and 1 appears twice, so the expected output for 1 is 2/3.
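One possible shape for this per-batch join of the two derived streams, sketched under assumed names (`dstreamOfIntegers` and the intermediate variables are not from the thread):

```scala
import org.apache.spark.rdd.RDD

// Stream 1: per-batch count of each integer's appearances.
val counts = dstreamOfIntegers.map(x => (x, 1L)).reduceByKey(_ + _)

// Stream 2: per-batch number of unique integers (a single-element RDD per batch).
val uniqueCount = dstreamOfIntegers.transform(_.distinct()).count()

// Combine the two streams batch-by-batch: pair every (value, count) with the
// batch's unique-integer total, then divide to get the requested ratio.
val ratios = counts.transformWith(uniqueCount,
  (c: RDD[(Int, Long)], u: RDD[Long]) => {
    val total = u.first()  // runs on the driver inside the transform closure
    c.mapValues(_.toDouble / total)
  })
```

For the example batch 1, 2, 3, 1, 2, 2 this would yield (1, 2/3), (2, 3/3), (3, 1/3). `transformWith` is just one way to align the two streams; a broadcast of the unique count would be another.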