Hello,

I have two RDDs, and my goal is to calculate Pearson's correlation between them using a sliding window. I want each window to hold 200 samples from rdd1 and 200 from rdd2, calculate the correlation between them, then slide the window forward by 120 samples and calculate the correlation for the next 200 samples, and so on.

I know sliding windows work for DStreams, but I have to use RDDs instead of DStreams. When I call the window function on an RDD, I get an error saying that RDD doesn't have a window attribute.

I need a window operation here for two reasons:
1) rdd1 and rdd2 are infinite streams, and I need to partition them into smaller chunks such as windows.
2) The built-in Pearson's correlation function in PySpark only works on partitions of equal size, so in my case I chose 200 samples per window and 120 samples for the sliding interval.

I'd appreciate any ideas on how to solve this.
My code is here:

    import sys

    from pyspark import SparkContext
    from pyspark.mllib.stat import Statistics

    if __name__ == "__main__":
        sc = SparkContext(appName="CorrelationsExample")
        input_path1 = sys.argv[1]
        input_path2 = sys.argv[2]
        num_of_partitions = 1

        # Each input file holds one numeric sample per line.
        rdd1 = sc.textFile(input_path1, num_of_partitions).flatMap(lambda line1: line1.strip().split("\n")).map(lambda strelem1: float(strelem1))
        rdd2 = sc.textFile(input_path2, num_of_partitions).flatMap(lambda line2: line2.strip().split("\n")).map(lambda strelem2: float(strelem2))

        # Pull both series to the driver, then redistribute them.
        l1 = rdd1.collect()
        l2 = rdd2.collect()
        seriesX = sc.parallelize(l1)
        seriesY = sc.parallelize(l2)

        # Correlation over the whole series (no windowing yet).
        print("Correlation is: " + str(Statistics.corr(seriesX, seriesY, method="pearson")))
        sc.stop()
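To make the windowing I have in mind concrete, here is a rough sketch in plain Python that works on the collected lists (l1 and l2 above) rather than on RDDs. The names pearson and sliding_correlations are my own placeholders, not PySpark APIs, and the sketch assumes both series fit in driver memory, which would not hold for truly unbounded input. Trailing samples that do not fill a complete window are simply dropped.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (plain Python)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one window is constant.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def sliding_correlations(xs, ys, window=200, step=120):
    """Return (start_index, correlation) for every full window.

    Windows start at 0, step, 2*step, ... and each covers `window` samples.
    """
    n = min(len(xs), len(ys))
    results = []
    for start in range(0, n - window + 1, step):
        end = start + window
        results.append((start, pearson(xs[start:end], ys[start:end])))
    return results


if __name__ == "__main__":
    # Synthetic stand-ins for l1 and l2: 440 samples, perfectly linear relation.
    l1 = [float(i) for i in range(440)]
    l2 = [2.0 * v + 1.0 for v in l1]
    for start, corr in sliding_correlations(l1, l2):
        print("window starting at", start, "correlation", corr)
```

With 440 samples, this yields windows starting at indices 0, 120 and 240. Each pair of slices could instead be passed to sc.parallelize and Statistics.corr as in my code above, but I am not sure that is worthwhile for only 200 samples per window.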