Reynold Xin created SPARK-18853:
-----------------------------------

             Summary: Project is way too aggressive in estimating statistics
                 Key: SPARK-18853
                 URL: https://issues.apache.org/jira/browse/SPARK-18853
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Reynold Xin
We currently define statistics in UnaryNode:

{code}
override def statistics: Statistics = {
  // There should be some overhead in a Row object; the size should not be zero when there
  // are no columns. This helps to prevent divide-by-zero errors.
  val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
  val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
  // Assume there will be the same number of rows as the child has.
  var sizeInBytes = (child.statistics.sizeInBytes * outputRowSize) / childRowSize
  if (sizeInBytes == 0) {
    // sizeInBytes can't be zero, or the sizeInBytes of BinaryNode will also be zero
    // (product of children).
    sizeInBytes = 1
  }
  child.statistics.copy(sizeInBytes = sizeInBytes)
}
{code}

This has a few issues:

1. It can aggressively underestimate the size for Project. We assume each array/map has 100 elements, which is an overestimate; if the user projects a single field out of a deeply nested field, the inflated child row size in the denominator leads to huge underestimation (see the worked example below). A saner default is probably 2 elements.

2. Propagating statistics this way is not a property of UnaryNode; it should be a property of Project (a sketch of that change follows the example).
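To make issue 1 concrete, here is a minimal sketch of the arithmetic, assuming the defaultSize heuristic described above (100 assumed elements per array level); the column names and schema are made up for illustration:

{code}
import org.apache.spark.sql.types._

// Hypothetical child schema: a small int column next to a deeply nested array column.
val childSchema = StructType(Seq(
  StructField("a", IntegerType),
  StructField("nested", ArrayType(ArrayType(ArrayType(LongType))))
))

// With 100 assumed elements per array level, the nested column's
// defaultSize is 100 * 100 * 100 * 8 = 8,000,000 bytes.
val childRowSize = childSchema.map(_.dataType.defaultSize).sum + 8  // 8000012
val outputRowSize = IntegerType.defaultSize + 8                     // 12
{code}

Projecting only {{a}} scales sizeInBytes by 12 / 8000012, shrinking the estimate by roughly six orders of magnitude even when the real arrays hold only a couple of elements each, which can, for example, trick the planner into broadcasting a relation that is actually large.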
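For issue 2, one possible shape of the fix (a minimal sketch under the same assumptions, not an actual patch): leave UnaryNode's statistics as a plain pass-through of the child's statistics, and move the row-size scaling into Project, where comparing output width to child width is actually meaningful:

{code}
case class Project(projectList: Seq[NamedExpression], child: LogicalPlan)
  extends UnaryNode {

  override def output: Seq[Attribute] = projectList.map(_.toAttribute)

  override def statistics: Statistics = {
    // Same row-size heuristic as before, now scoped to Project only.
    val childRowSize = child.output.map(_.dataType.defaultSize).sum + 8
    val outputRowSize = output.map(_.dataType.defaultSize).sum + 8
    val sizeInBytes =
      (child.statistics.sizeInBytes * outputRowSize) / childRowSize
    // Keep the estimate strictly positive so that sizeInBytes of BinaryNode
    // (product of children) does not collapse to zero.
    child.statistics.copy(sizeInBytes = sizeInBytes.max(1))
  }
}
{code}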