Re: Is this a bug in MLlib.stat.test ? About the mapPartitions API used in Chi-Squared test

Joseph Bradley Thu, 12 Mar 2015 18:24:24 -0700

The checks against maxCategories are not for statistical purposes; they are
there to make sure communication does not blow up. There are currently no
checks to make sure that there are enough entries for statistically
significant results. That is up to the user.
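
To make the distinction concrete, here is a toy model in plain Python (not
the MLlib code; the function names and the tiny cap are made up for
illustration). Each partition fails fast once it has seen more than `cap`
distinct values, which bounds what a single partition can send back, but
nothing in this sketch caps the union across partitions:

```python
from typing import Iterable, List, Set

MAX_CATEGORIES = 10000  # same figure as the cap discussed in this thread


def distinct_in_partition(values: Iterable[float], cap: int = MAX_CATEGORIES) -> Set[float]:
    """Collect the distinct values of one partition, failing fast past `cap`."""
    seen: Set[float] = set()
    for v in values:
        seen.add(v)
        if len(seen) > cap:
            raise ValueError(f"more than {cap} distinct values in one partition")
    return seen


def distinct_overall(partitions: List[List[float]], cap: int = MAX_CATEGORIES) -> Set[float]:
    """Union of the per-partition sets; no cap is applied to the union itself."""
    merged: Set[float] = set()
    for part in partitions:
        merged |= distinct_in_partition(part, cap)
    return merged


# Three partitions with 3 distinct values each, and a toy cap of 3.
partitions = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
merged = distinct_overall(partitions, cap=3)
print(len(merged))  # 9: every partition passes, yet the union exceeds the cap
```

So a per-partition cap limits the size of each partial result, not the
number of categories in the final contingency matrix.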


I do like the idea of adding a warning.  A reasonable fix for now might be
to print a logWarning message and add a note to the documentation.  On the
JIRA, we could also discuss whether the result should be set to some value
to indicate a meaningless test (e.g., a very bad fixed pValue).
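
As a rough sketch of what such a warning could be based on, here is a small
plain-Python example (not Spark or MLlib code). It assumes the common rule of
thumb mentioned in the quoted message, i.e. expected counts of at least 5 in
at least 80% of the cells; the function names, logger name, and thresholds
are hypothetical and only meant to seed the discussion:

```python
import logging
from typing import List

logger = logging.getLogger("chi_sq_sketch")  # hypothetical logger name


def expected_counts(observed: List[List[float]]) -> List[List[float]]:
    """Expected cell counts under independence: row_total * col_total / total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    if total == 0:
        raise ValueError("contingency matrix has no observations")
    return [[r * c / total for c in col_totals] for r in row_totals]


def has_enough_data(observed: List[List[float]],
                    min_expected: float = 5.0,
                    min_fraction: float = 0.8) -> bool:
    """Return False (and log a warning) when too many expected counts are small."""
    cells = [e for row in expected_counts(observed) for e in row]
    large_enough = sum(1 for e in cells if e >= min_expected)
    enough = large_enough >= min_fraction * len(cells)
    if not enough:
        logger.warning(
            "Only %d of %d cells have an expected count >= %s; "
            "the chi-squared approximation may be unreliable.",
            large_enough, len(cells), min_expected)
    return enough


print(has_enough_data([[20, 30], [30, 20]]))  # True: every expected count is 25
print(has_enough_data([[1, 2], [2, 1]]))      # False: every expected count is 1.5
```

Whether the test should also return a sentinel result in the second case is
the open question for the JIRA.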

I made a JIRA to track this issue: SPARK-6312

Joseph

On Thu, Mar 12, 2015 at 12:13 AM, Chunnan Yao <yaochun...@gmail.com> wrote:

> Hi everyone!
> I am currently digging into MLlib in Spark 1.2.1. While reading the code of
> MLlib.stat.test, in the file ChiSqTest.scala under
> /spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test, I was confused
> by the use of the mapPartitions API in the function
> def chiSquaredFeatures(data: RDD[LabeledPoint],
>       methodName: String = PEARSON.name): Array[ChiSqTestResult]
>
> As far as I understand, the Chi-Squared test requires sufficiently large
> counts (5 or more in at least 80% of the cells) in its contingency matrix
> for the approximation to be good
> (http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test). Thus the number
> of feature and label categories cannot be too large; otherwise there would
> be too few items in each category, which violates the conditions for using
> the Chi-Squared test.
>
> I do see that, in the function above, Spark throws an exception when
> distinctLabels.size or distinctFeatures.size exceeds maxCategories (defined
> as 10000), but the two HashSets distinctLabels and distinctFeatures are
> initialized inside mapPartitions, which means Spark is only sensitive to
> the number of feature and label categories within one partition. As a
> result, the reduced contingency matrix can still have more categories than
> the limit, and therefore small entries, which makes the Chi-Squared test
> inaccurate. I wrote a unit test for this function that demonstrates the
> problem.
>
> Maybe I am simply misunderstanding something. Could anyone please
> give me a hint on this issue?
