I'm new to PySpark and spark streaming.

According to the documentation on countByValue
<https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams>
and countByValueAndWindow
<https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations>:

*countByValue*: When called on a DStream of elements of type K, return a new
DStream of (K, Long) pairs where the value of each key is its frequency in
each RDD of the source DStream.

*countByValueAndWindow*: When called on a DStream of (K, V) pairs, returns a
new DStream of (K, Long) pairs where the value of each key is its frequency
within a sliding window. Like in reduceByKeyAndWindow, the number of reduce
tasks is configurable through an optional argument.
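To make sure I'm reading the documentation right, here is a small plain-Python sketch of what I understand countByValueAndWindow to compute (no Spark involved; lists of elements stand in for RDD batches, and the function name and signature are my own): for each window position, the frequency of each value across all batches in that window, returned as (value, count) pairs.

```python
from collections import Counter

def count_by_value_and_window(batches, window_length, slide_interval):
    """Plain-Python model of my reading of countByValueAndWindow.

    batches: list of per-interval element lists (stand-ins for RDDs).
    Yields, for each window position, sorted (value, count) pairs --
    the (K, Long) shape the documentation describes.
    """
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        counts = Counter(x for batch in window for x in batch)
        yield sorted(counts.items())

# Three batches, a window of 2 batches, sliding by 1 batch.
batches = [["a"], ["a", "b"], ["b"]]
windows = list(count_by_value_and_window(batches,
                                         window_length=2,
                                         slide_interval=1))
print(windows[0])   # [('a', 2), ('b', 1)]  -- window over batches 0-1
print(windows[1])   # [('a', 1), ('b', 2)]  -- window over batches 1-2
```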

So basically the return value of both functions should be *a list of
(K, Long) pairs*, right?

However, when I run some experiments, the return value is *a list of
integers* instead of pairs.

What's more, in the official PySpark test code on GitHub, here
<https://github.com/apache/spark/blob/master/python/pyspark/streaming/tests.py#L282>
and here
<https://github.com/apache/spark/blob/master/python/pyspark/streaming/tests.py#L634>:

You can see the "expected results" are *a list of integers*!

I thought I had misunderstood the documentation somehow, until I saw the
corresponding Scala test code on GitHub, here
<https://github.com/apache/spark/blob/d83c2f9f0b08d6d5d369d9fae04cdb15448e7f0d/streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala#L159>
and here
<https://github.com/apache/spark/blob/d83c2f9f0b08d6d5d369d9fae04cdb15448e7f0d/streaming/src/test/scala/org/apache/spark/streaming/WindowOperationsSuite.scala#L260>.

Similar test cases, but this time the expected results are *lists of pairs*!

So in summary, the documentation and the Scala test cases say the results
are pairs, while the Python test cases and my own experiments show that the
results are plain integers.
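In plain-Python terms, the difference between the two shapes looks like this (Counter here is just an analogy for the counting, not what Spark does internally):

```python
from collections import Counter

# One batch of input elements, as in the tests linked above.
elements = ["a", "a", "b"]
counts = Counter(elements)

# The shape the documentation and the Scala tests describe:
# (K, Long) pairs mapping each value to its frequency.
pairs = sorted(counts.items())
print(pairs)        # [('a', 2), ('b', 1)]

# The shape the Python tests and my experiments actually produce:
# just the frequencies, with the keys dropped.
just_counts = sorted(counts.values())
print(just_counts)  # [1, 2]
```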

Could someone help explain this discrepancy?

Any help will be much appreciated!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-in-python-questions-about-countByValue-and-countByValueAndWindow-tp25584.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
