Jeremy Freeman created SPARK-2012:
-------------------------------------

             Summary: PySpark StatCounter with numpy arrays
                 Key: SPARK-2012
                 URL: https://issues.apache.org/jira/browse/SPARK-2012
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 1.0.0
            Reporter: Jeremy Freeman
            Priority: Minor


In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy 
arrays just as with an RDD of scalars, which was very useful (e.g. for 
computing stats on a set of vectors in ML analyses). In 1.0.0 this broke 
because the newly added minimum and maximum computation, as implemented, 
doesn't work on arrays.

I have a PR ready that restores this functionality by having StatCounter use 
the numpy element-wise functions "maximum" and "minimum", which work on both 
numpy arrays and scalars, and I've added new tests covering this capability.
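To illustrate the idea (this is a minimal sketch, not the actual PySpark patch; the ArrayStatCounter class and its field names here are hypothetical): numpy's element-wise np.minimum/np.maximum accept both scalars and arrays, unlike the builtin min/max comparisons, so the same merge code handles an RDD of vectors and an RDD of scalars.

```python
import numpy as np

class ArrayStatCounter:
    """Illustrative StatCounter-style accumulator that tracks count,
    mean, min, and max for scalars or numpy arrays alike."""

    def __init__(self):
        self.n = 0
        self.mu = 0.0
        self.minValue = np.inf   # broadcasts against arrays on first merge
        self.maxValue = -np.inf

    def merge(self, value):
        self.n += 1
        # Running mean via Welford-style update; works element-wise on arrays.
        delta = value - self.mu
        self.mu = self.mu + delta / self.n
        # np.minimum/np.maximum are element-wise, so they work for both
        # scalars and numpy arrays (a plain `if value < self.minValue`
        # comparison would fail on arrays).
        self.minValue = np.minimum(self.minValue, value)
        self.maxValue = np.maximum(self.maxValue, value)
        return self

counter = ArrayStatCounter()
counter.merge(np.array([1.0, 5.0]))
counter.merge(np.array([3.0, 2.0]))
# minValue and maxValue are now element-wise: [1., 2.] and [3., 5.]
```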

However, I realize this adds a dependency on NumPy outside of MLlib. If that's 
not OK, maybe it'd be worth adding this functionality as a util within PySpark 
MLlib?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
