[ 
https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandini malempati updated SPARK-30882:
--------------------------------------
    Summary: Inaccurate ApproximatePercentile results   (was: 
ApproximatePercentile results )

> Inaccurate ApproximatePercentile results 
> -----------------------------------------
>
>                 Key: SPARK-30882
>                 URL: https://issues.apache.org/jira/browse/SPARK-30882
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.4.4
>            Reporter: Nandini malempati
>            Priority: Major
>
> Results of ApproximatePercentile should become more accurate as the accuracy 
> parameter is increased, per the documentation here: 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
> But I'm seeing nondeterministic behavior: on a data set of 4500000 rows, an 
> accuracy of 1000000 returns better results than 5000000, while an accuracy of 
> 7000000 gives exact results. This behavior is not consistent.
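> For a quick cross-check that does not depend on the hand-built wrapper in the 
> test below, the same aggregation can be written against the SQL-registered 
> percentile_approx function (a minimal sketch, assuming that SQL function is 
> available in 2.4.x; df is the data frame built in the test):
> {code:scala}
> import org.apache.spark.sql.functions.expr
> 
> // Same percentiles and accuracy (5000000) as the questionable run in the test below.
> val sqlCheck = df.groupBy("MetricName")
>   .agg(expr("percentile_approx(DataPoint, array(0.1, 0.25, 0.75, 0.9), 5000000)"))
> sqlCheck.show(1, false)
> {code}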
> {code:scala}
> package com.microsoft.teams.test.utils
> 
> import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
> import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
> import org.apache.spark.sql.functions.{lit, _}
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
> import org.scalatest.{Assertions, FunSpec}
> 
> import scala.collection.mutable.ListBuffer
> 
> class PercentilesTest extends FunSpec with Assertions {
> 
>   it("check percentiles with different precision") {
>     val schema = List(StructField("MetricName", StringType), StructField("DataPoint", IntegerType))
>     val data = new ListBuffer[Row]
>     for (i <- 1 to 4500000) { data += Row("metric", i) }
> 
>     import spark.implicits._
>     val df = createDF(schema, data.toSeq)
> 
>     val accuracy1000 = df.groupBy("MetricName").agg(
>       percentile_approx($"DataPoint", typedLit(Seq(0.1, 0.25, 0.75, 0.9)), ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
>     val accuracy1M = df.groupBy("MetricName").agg(
>       percentile_approx($"DataPoint", typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 1000000))
>     val accuracy5M = df.groupBy("MetricName").agg(
>       percentile_approx($"DataPoint", typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 5000000))
>     val accuracy7M = df.groupBy("MetricName").agg(
>       percentile_approx($"DataPoint", typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 7000000))
>     val accuracy10M = df.groupBy("MetricName").agg(
>       percentile_approx($"DataPoint", typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 10000000))
> 
>     accuracy1000.show(1, false)
>     accuracy1M.show(1, false)
>     accuracy5M.show(1, false)
>     accuracy7M.show(1, false)
>     accuracy10M.show(1, false)
>   }
> 
>   def percentile_approx(col: Column, percentage: Column, accuracy: Column): Column = {
>     val expr = new ApproximatePercentile(col.expr, percentage.expr, accuracy.expr).toAggregateExpression
>     new Column(expr)
>   }
> 
>   def percentile_approx(col: Column, percentage: Column, accuracy: Int): Column =
>     percentile_approx(col, percentage, lit(accuracy))
> 
>   lazy val spark: SparkSession =
>     SparkSession.builder().master("local").appName("spark tests").getOrCreate()
> 
>   def createDF(schema: List[StructField], data: Seq[Row]): DataFrame =
>     spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))
> }
> {code}
> The test above reproduces the error. In this run with accuracy 5000000, P25 
> and P90 look fine, but P75 comes back as the maximum of the column (4500000 
> instead of roughly 3375000):
> +----------+----------------------------------------------------------+
> |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 5000000)|
> +----------+----------------------------------------------------------+
> |metric    |[450000, 1125000, 4500000, 4050000]                       |
> +----------+----------------------------------------------------------+
> This is breaking our reports, as there is no clear guidance on how to choose 
> the accuracy parameter; we have data sets of more than 27000000 rows. After 
> studying the pattern, we found that an inaccurate percentile always comes back 
> as the "max" of the column. P50 and P99 may be right in some cases while P75 
> is wrong.
> Is there a way to determine the correct accuracy for a given dataset size?
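> For context on the sizing question: the class documentation linked above 
> describes 1.0/accuracy as the relative error of the approximation. If that 
> bound holds, a rough back-of-the-envelope rule would look like the sketch 
> below (the helper names are illustrative, not a Spark API):
> {code:scala}
> // Approximate rank error implied by the documented bound: rowCount * (1.0 / accuracy).
> def approxRankError(rowCount: Long, accuracy: Int): Double =
>   rowCount.toDouble / accuracy
> 
> // Accuracy needed to keep the rank error within maxRankError rows.
> def accuracyFor(rowCount: Long, maxRankError: Long): Long =
>   math.ceil(rowCount.toDouble / maxRankError).toLong
> 
> // 4500000 rows at the default accuracy (10000) allows ranks to be off by ~450.
> val err = approxRankError(4500000L, 10000)
> 
> // To stay within ~100 ranks on 27000000 rows, accuracy would need to be ~270000.
> val acc = accuracyFor(27000000L, 100)
> {code}
> Even so, the result above (P75 jumping to the column max with accuracy 5000000) 
> is far outside any such bound, which is why this looks like a bug rather than 
> expected approximation error.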



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
