Barry Becker created SPARK-21986:
------------------------------------

             Summary: QuantileDiscretizer picks wrong split point for data with lots of 0's
                 Key: SPARK-21986
                 URL: https://issues.apache.org/jira/browse/SPARK-21986
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.1.1
            Reporter: Barry Becker


I have some simple test cases to help illustrate (see below).
I discovered this with data that had 96,000 rows, but can reproduce with much 
smaller data that has roughly the same distribution of values.

If I have data like
  Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)

and ask for 3 buckets, then it does the right thing and yields splits of 
Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity)

However, if I add just one more zero, such that I have data like
  Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
then it does the wrong thing and yields splits of 
  Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)

I'm not bothered that it gave fewer buckets than I asked for (that is to be 
expected), but I am bothered that it picked 0.0 instead of 40.0 as the single 
split point.
With a split at 0.0, one bucket holds all of the data and the other holds 
none of it.
Am I interpreting something wrong?
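
To make the complaint concrete, here is a minimal sketch in plain Scala (no 
Spark) that counts how many values land in each bucket for a given list of 
splits. The object and helper names are hypothetical and exist only for 
illustration. The sketch assumes the bucket semantics documented for 
Bucketizer: each bucket is the half-open range [lo, hi), except the last one, 
which also includes its upper bound.

```scala
// Illustration only: BucketCounts and countPerBucket are hypothetical names,
// not part of Spark. Assumes at least two split points.
object BucketCounts {
  // Count the values in each bucket defined by consecutive split points.
  // Buckets are [lo, hi); the last bucket also includes hi.
  def countPerBucket(data: Seq[Double], splits: Seq[Double]): Seq[Int] = {
    val buckets = splits.sliding(2).toSeq
    buckets.zipWithIndex.map { case (Seq(lo, hi), i) =>
      val isLast = i == buckets.length - 1
      data.count(x => x >= lo && (x < hi || (isLast && x == hi)))
    }
  }

  // The 12-value data set from this report.
  val data: Seq[Double] =
    Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0).map(_.toDouble)

  private val inf = Double.PositiveInfinity

  // Splits actually returned: everything falls into the second bucket.
  val reported: Seq[Int] = countPerBucket(data, Seq(-inf, 0.0, inf))  // Seq(0, 12)

  // Splits with 40.0 as the single split point: zeros vs. non-zeros.
  val preferred: Seq[Int] = countPerBucket(data, Seq(-inf, 40.0, inf)) // Seq(9, 3)
}
```

With the split at 0.0 the first bucket is empty because no value is below 
0.0, whereas a split at 40.0 would separate the nine zeros from the three 
non-zero values.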
Here are my two test cases in Scala:
{code}
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.scalatest.FunSuite

// SPARK_SESSION is a SparkSession provided by my test fixtures (not shown).
class QuantileDiscretizerSuite extends FunSuite {

  test("Quantile discretizer on data with lots of 0") {
    verify(Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
      Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
  }

  test("Quantile discretizer on data with one less 0") {
    verify(Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
      Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity))
  }

  def verify(data: Seq[Int], expectedSplits: Seq[Double]): Unit = {
    // Pair each value with a dummy second column so createDataFrame gets tuples.
    val theData: Seq[(Int, Double)] = data.map(x => (x, 0.0))

    val df = SPARK_SESSION.sqlContext.createDataFrame(theData)
      .toDF("rawCol", "unused")

    // fit() returns a Bucketizer whose splits are the computed quantiles.
    val qb = new QuantileDiscretizer()
      .setInputCol("rawCol")
      .setOutputCol("binnedColumn")
      .setRelativeError(0.0) // request exact quantiles
      .setNumBuckets(3)
      .fit(df)

    // getSplits returns an Array[Double]; compare it as a Seq.
    assertResult(expectedSplits) { qb.getSplits.toSeq }
  }
}
{code}
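
The test sets relativeError to 0.0, which requests exact quantiles. For 
comparison, the sketch below computes the 1/3 and 2/3 quantiles of the 
12-value data with a plain nearest-rank rule (the value at rank ceil(p * n) 
in sorted order). This rule is an assumption chosen for illustration and is 
not Spark's approxQuantile implementation; the object and method names are 
hypothetical. Because 9 of the 12 values are 0, both quantiles come out as 
0.0, and removing duplicates leaves 0.0 as the only interior split.

```scala
// Illustration only: NearestRank is a hypothetical name, and nearest-rank is
// a stand-in rule, not the algorithm Spark uses internally.
object NearestRank {
  // Nearest-rank quantile: value at 1-based rank ceil(p * n) of the sorted data.
  def quantile(sorted: Seq[Double], p: Double): Double = {
    val rank = math.max(1, math.ceil(p * sorted.length).toInt)
    sorted(rank - 1)
  }

  val data: Seq[Double] =
    Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0).map(_.toDouble)
  val sorted: Seq[Double] = data.sorted
  val numBuckets = 3

  // Interior quantiles at 1/3 and 2/3: both are 0.0 for this data.
  val inner: Seq[Double] =
    (1 until numBuckets).map(i => quantile(sorted, i.toDouble / numBuckets))

  // Drop duplicate split points and add the infinite outer bounds.
  val splits: Seq[Double] =
    (Double.NegativeInfinity +: inner.distinct) :+ Double.PositiveInfinity
  // Seq(-Infinity, 0.0, Infinity)
}
```

Under this rule the result matches the splits reported for the 12-value 
case, which suggests that 0.0 follows from splitting at fixed quantile 
positions on heavily repeated data and not from a grouping of distinct 
values.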


