[jira] [Created] (SPARK-26289) cleanup enablePerfMetrics parameter from BytesToBytesMap
caoxuewen created SPARK-26289: - Summary: cleanup enablePerfMetrics parameter from BytesToBytesMap Key: SPARK-26289 URL: https://issues.apache.org/jira/browse/SPARK-26289 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 2.5.0 Reporter: caoxuewen

enablePerfMetrics was originally added to BytesToBytesMap to guard getNumHashCollisions, getTimeSpentResizingNs and getAverageProbesPerLookup. As Spark has evolved, the parameter now only guards getAverageProbesPerLookup, and every user of BytesToBytesMap passes true. Relying on the flag to decide whether getAverageProbesPerLookup may be called, and throwing an IllegalStateException otherwise, is also error-prone. So this PR removes the enablePerfMetrics parameter from BytesToBytesMap. Thanks.
[jira] [Created] (SPARK-26271) remove unused object SparkPlan
caoxuewen created SPARK-26271: - Summary: remove unused object SparkPlan Key: SPARK-26271 URL: https://issues.apache.org/jira/browse/SPARK-26271 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.1 Reporter: caoxuewen

This code was introduced in PR [#11190|https://github.com/apache/spark/pull/11190], but it has been unused since PR [#14548|https://github.com/apache/spark/pull/14548]. Let's remove it. Thanks.
[jira] [Created] (SPARK-26180) Add a withCreateTempDir function to SparkCore test cases
caoxuewen created SPARK-26180: - Summary: Add a withCreateTempDir function to SparkCore test cases Key: SPARK-26180 URL: https://issues.apache.org/jira/browse/SPARK-26180 Project: Spark Issue Type: Improvement Components: Spark Core, Tests Affects Versions: 2.5.0 Reporter: caoxuewen

Currently, Spark SQL test cases use the common withTempDir function to handle val dir = Utils.createTempDir() and Utils.deleteRecursively(dir). Unfortunately, that withTempDir function cannot be used in SparkCore test cases. This PR adds a common withCreateTempDir function to clean up SparkCore test cases. Thanks.
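A minimal sketch of what such a helper could look like, assuming the Utils.createTempDir and Utils.deleteRecursively calls mentioned above; the TempDirTestHelper trait name is a placeholder, not the actual Spark code:

```scala
import java.io.File

import org.apache.spark.util.Utils

trait TempDirTestHelper {
  // Create a temp dir, hand it to the test body, and always clean it up afterwards.
  protected def withCreateTempDir(f: File => Unit): Unit = {
    val dir = Utils.createTempDir()
    try f(dir) finally Utils.deleteRecursively(dir)
  }
}
```

A suite mixing in such a trait would then write withCreateTempDir { dir => ... } instead of repeating the create/delete pair.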
[jira] [Created] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catching exceptions
caoxuewen created SPARK-26117: - Summary: use SparkOutOfMemoryError instead of OutOfMemoryError when catching exceptions Key: SPARK-26117 URL: https://issues.apache.org/jira/browse/SPARK-26117 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 2.5.0 Reporter: caoxuewen

PR #20014 introduced SparkOutOfMemoryError so that the entire executor is not killed when an OutOfMemoryError is thrown. So where memory is requested via MemoryConsumer.allocatePage and the failure is caught, we should catch SparkOutOfMemoryError instead of OutOfMemoryError.
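A rough sketch of the intended catch-site change; allocate and spill below stand in for the real MemoryConsumer calls and are not actual Spark APIs:

```scala
import org.apache.spark.memory.SparkOutOfMemoryError

// Catch the Spark-specific error thrown by a failed page allocation instead of
// java.lang.OutOfMemoryError, so the failure is handled per task rather than
// being treated as a JVM-wide OOM.
def acquirePageOrSpill(allocate: () => Long, spill: () => Unit): Long = {
  try {
    allocate() // e.g. a call that wraps MemoryConsumer.allocatePage(...)
  } catch {
    case _: SparkOutOfMemoryError =>
      spill()    // release memory held by this consumer
      allocate() // retry once after spilling
  }
}
```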
[jira] [Created] (SPARK-26073) remove invalid comment as we don't use it anymore
caoxuewen created SPARK-26073: - Summary: remove invalid comment as we don't use it anymore Key: SPARK-26073 URL: https://issues.apache.org/jira/browse/SPARK-26073 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.5.0 Reporter: caoxuewen

Remove an invalid comment, as we don't use it anymore. More details: [https://github.com/apache/spark/pull/22976] (comment)
[jira] [Created] (SPARK-26072) remove invalid comment as we don't use it anymore
caoxuewen created SPARK-26072: - Summary: remove invalid comment as we don't use it anymore Key: SPARK-26072 URL: https://issues.apache.org/jira/browse/SPARK-26072 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.5.0 Reporter: caoxuewen

Remove an invalid comment, as we don't use it anymore. More details: [https://github.com/apache/spark/pull/22976] (comment)
[jira] [Created] (SPARK-26001) Reduce memory copy when writing decimal
caoxuewen created SPARK-26001: - Summary: Reduce memory copy when writing decimal Key: SPARK-26001 URL: https://issues.apache.org/jira/browse/SPARK-26001 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.5.0 Reporter: caoxuewen

This PR fixes two things:
- When writing non-null decimals, we do not need to zero out all 16 allocated bytes. If the number of bytes needed for the decimal is greater than 8, there is no need to zero out bytes 0 through 7, because those first 8 bytes are overwritten when the decimal itself is written.
- When writing null decimals, we do not need to zero out all 16 allocated bytes. BitSetMethods.set marks the field as null and the length of the decimal is set to 0, so the 16-byte value is never read back when the decimal is fetched; skipping the zero-out is therefore safe.
[jira] [Created] (SPARK-25974) Optimize generated bytecode for ordering based on the given order
caoxuewen created SPARK-25974: - Summary: Optimizes Generates bytecode for ordering based on the given order Key: SPARK-25974 URL: https://issues.apache.org/jira/browse/SPARK-25974 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.1 Reporter: caoxuewen Currently, when generates the code for ordering based on the given order, too many variables and assignment statements will be generated, which is not necessary. This PR will eliminate redundant variables. Optimizes Generates bytecode for ordering based on the given order. The generated code looks like: spark.range(1).selectExpr( "id as key", "(id & 1023) as value1", "cast(id & 1023 as double) as value2", "cast(id & 1023 as int) as value3" ).select("value1", "value2", "value3").orderBy("value1", "value2").collect() before PR(codegen size: 178) Generated Ordering by input[0, bigint, false] ASC NULLS FIRST,input[1, double, false] ASC NULLS FIRST: /* 001 */ public SpecificOrdering generate(Object[] references) { /* 002 */ return new SpecificOrdering(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { /* 006 */ /* 007 */ private Object[] references; /* 008 */ /* 009 */ /* 010 */ public SpecificOrdering(Object[] references) { /* 011 */ this.references = references; /* 012 */ /* 013 */ } /* 014 */ /* 015 */ public int compare(InternalRow a, InternalRow b) { /* 016 */ /* 017 */ InternalRow i = null; /* 018 */ /* 019 */ i = a; /* 020 */ boolean isNullA_0; /* 021 */ long primitiveA_0; /* 022 */ { /* 023 */ long value_0 = i.getLong(0); /* 024 */ isNullA_0 = false; /* 025 */ primitiveA_0 = value_0; /* 026 */ } /* 027 */ i = b; /* 028 */ boolean isNullB_0; /* 029 */ long primitiveB_0; /* 030 */ { /* 031 */ long value_0 = i.getLong(0); /* 032 */ isNullB_0 = false; /* 033 */ primitiveB_0 = value_0; /* 034 */ } /* 035 */ if (isNullA_0 && isNullB_0) { /* 036 */ // Nothing /* 037 */ } else if (isNullA_0) { /* 038 */ return -1; /* 039 */ } else if (isNullB_0) { /* 040 */ return 1; /* 041 */ } else { /* 042 */ int comp = (primitiveA_0 > primitiveB_0 ? 1 : primitiveA_0 < primitiveB_0 ? 
-1 : 0); /* 043 */ if (comp != 0) { /* 044 */ return comp; /* 045 */ } /* 046 */ } /* 047 */ /* 048 */ i = a; /* 049 */ boolean isNullA_1; /* 050 */ double primitiveA_1; /* 051 */ { /* 052 */ double value_1 = i.getDouble(1); /* 053 */ isNullA_1 = false; /* 054 */ primitiveA_1 = value_1; /* 055 */ } /* 056 */ i = b; /* 057 */ boolean isNullB_1; /* 058 */ double primitiveB_1; /* 059 */ { /* 060 */ double value_1 = i.getDouble(1); /* 061 */ isNullB_1 = false; /* 062 */ primitiveB_1 = value_1; /* 063 */ } /* 064 */ if (isNullA_1 && isNullB_1) { /* 065 */ // Nothing /* 066 */ } else if (isNullA_1) { /* 067 */ return -1; /* 068 */ } else if (isNullB_1) { /* 069 */ return 1; /* 070 */ } else { /* 071 */ int comp = org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA_1, primitiveB_1); /* 072 */ if (comp != 0) { /* 073 */ return comp; /* 074 */ } /* 075 */ } /* 076 */ /* 077 */ /* 078 */ return 0; /* 079 */ } /* 080 */ /* 081 */ /* 082 */ } After PR(codegen size: 89) Generated Ordering by input[0, bigint, false] ASC NULLS FIRST,input[1, double, false] ASC NULLS FIRST: /* 001 */ public SpecificOrdering generate(Object[] references) { /* 002 */ return new SpecificOrdering(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { /* 006 */ /* 007 */ private Object[] references; /* 008 */ /* 009 */ /* 010 */ public SpecificOrdering(Object[] references) { /* 011 */ this.references = references; /* 012 */ /* 013 */ } /* 014 */ /* 015 */ public int compare(InternalRow a, InternalRow b) { /* 016 */ /* 017 */ /* 018 */ long value_0 = a.getLong(0); /* 019 */ long value_2 = b.getLong(0); /* 020 */ if (false && false) { /* 021 */ // Nothing /* 022 */ } else if (false) { /* 023 */ return -1; /* 024 */ } else if (false) { /* 025 */ return 1; /* 026 */ } else { /* 027 */ int comp = (value_0 > value_2 ? 1 : value_0 < value_2 ? -1 : 0); /* 028 */ if (comp != 0) { /* 029 */ return comp; /* 030 */ } /* 031 */ } /* 032 */ /* 033 */ double value_1 = a.getDouble(1); /* 034 */ double value_3 = b.getDouble(1); /* 035 */ if (false && false) { /* 036 */ // Nothing /* 037 */ } else if (false) { /* 038 */ return -1; /* 039 */ } else if (false) { /* 040 */
[jira] [Updated] (SPARK-24066) Add new optimization rule to eliminate unnecessary sort by exchanged adjacent Window expressions
[ https://issues.apache.org/jira/browse/SPARK-24066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-24066: -- Description: Currently, when two adjacent window functions have the same partition and the same intersection of order, There will be two sorted after shuffling, which is not necessary. This PR adds a new optimization rule to eliminate unnecessary sort by exchanged adjacent Window expressions. For example: val df = Seq( ("a", "p1", 10.0, 20.0, 30.0), ("a", "p2", 20.0, 10.0, 40.0)).toDF("key", "value", "value1", "value2", "value3").select( $"key", sum("value1").over(Window.partitionBy("key").orderBy("value")), max("value2").over(Window.partitionBy("key").orderBy("value", "value1")), avg("value3").over(Window.partitionBy("key").orderBy("value", "value1", "value2")) ).queryExecution.executedPlan Before this PR: *(5) Project [key#16, sum(value1) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST unspecifiedframe$())#29, max(value2) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC NULLS FIRST unspecifiedframe$())#30, avg(value3) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC NULLS FIRST, value2 ASC NULLS FIRST unspecifiedframe$())#31] +- Window [max(value2#19) windowspecdefinition(key#16, value#17 ASC NULLS FIRST, value1#18 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS max(value2) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC NULLS FIRST unspecifiedframe$())#30], [key#16], [value#17 ASC NULLS FIRST, value1#18 ASC NULLS FIRST] +- *(4) Project [key#16, value1#18, value#17, value2#19, sum(value1) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST unspecifiedframe$())#29, avg(value3) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC NULLS FIRST, value2 ASC NULLS FIRST unspecifiedframe$())#31] +- Window [avg(value3#20) windowspecdefinition(key#16, value#17 ASC NULLS FIRST, value1#18 ASC NULLS FIRST, value2#19 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS avg(value3) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC NULLS FIRST, value2 ASC NULLS FIRST unspecifiedframe$())#31], [key#16], [value#17 ASC NULLS FIRST, value1#18 ASC NULLS FIRST, value2#19 ASC NULLS FIRST] +- *(3) Sort [key#16 ASC NULLS FIRST, value#17 ASC NULLS FIRST, value1#18 ASC NULLS FIRST, value2#19 ASC NULLS FIRST], false, 0 +- Window [sum(value1#18) windowspecdefinition(key#16, value#17 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS sum(value1) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST unspecifiedframe$())#29], [key#16], [value#17 ASC NULLS FIRST] +- *(2) Sort [key#16 ASC NULLS FIRST, value#17 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(key#16, 5) +- *(1) Project [_1#5 AS key#16, _3#7 AS value1#18, _2#6 AS value#17, _4#8 AS value2#19, _5#9 AS value3#20] +- LocalTableScan [_1#5, _2#6, _3#7, _4#8, _5#9] After this PR: *(5) Project [key#16, sum(value1) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST unspecifiedframe$())#29, max(value2) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC NULLS FIRST unspecifiedframe$())#30, avg(value3) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC NULLS FIRST, value2 ASC NULLS FIRST unspecifiedframe$())#31] +- Window [sum(value1#18) windowspecdefinition(key#16, value#17 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS 
sum(value1) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST unspecifiedframe$())#29], [key#16], [value#17 ASC NULLS FIRST] +- *(4) Project [key#16, value1#18, value#17, avg(value3) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC NULLS FIRST, value2 ASC NULLS FIRST unspecifiedframe$())#31, max(value2) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC NULLS FIRST unspecifiedframe$())#30] +- Window [max(value2#19) windowspecdefinition(key#16, value#17 ASC NULLS FIRST, value1#18 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS max(value2) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC NULLS FIRST unspecifiedframe$())#30], [key#16], [value#17 ASC NULLS FIRST, value1#18 ASC NULLS FIRST] +- *(3) Project [key#16, value1#18, value#17, value2#19, avg(value3) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC NULLS FIRST, value2 ASC NULLS FIRST unspecifiedframe$())#31] +- Window [avg(value3#20) windowspecdefinition(key#16, value#17 ASC NULLS FIRST, value1#18 ASC NULLS FIRST, value2#19 ASC NULLS FIRST,
[jira] [Updated] (SPARK-24066) Add new optimization rule to eliminate unnecessary sort by exchanged adjacent Window expressions
[ https://issues.apache.org/jira/browse/SPARK-24066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-24066: -- Summary: Add new optimization rule to eliminate unnecessary sort by exchanged adjacent Window expressions (was: Add a window exchange rule to eliminate redundant physical plan SortExec) > Add new optimization rule to eliminate unnecessary sort by exchanged adjacent > Window expressions > > > Key: SPARK-24066 > URL: https://issues.apache.org/jira/browse/SPARK-24066 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: caoxuewen >Priority: Major > > Currently, when the order field of window function has a subset relationship, > SparkSQL will randomly generate different physical plan. > Similar like: > case class DistinctAgg(a: Int, b: Float, c: Double, d: Int, e: String) > val df = spark.sparkContext.parallelize( > DistinctAgg(8, 2, 3, 4, "a") :: > DistinctAgg(9, 3, 4, 5, "b") :: > DistinctAgg(3, 4, 5, 6, "c") :: > DistinctAgg(3, 4, 5, 7, "c") :: > DistinctAgg(3, 4, 5, 8, "c") :: > DistinctAgg(3, 6, 6, 9, "d") :: > DistinctAgg(30, 40, 50, 60, "e") :: > DistinctAgg(41, 51, 61, 71, null) :: > DistinctAgg(42, 52, 62, 72, null) :: > DistinctAgg(43, 53, 63, 73, "k") ::Nil).toDF() > df.createOrReplaceTempView("distinctAgg") > select a, b, c, > avg(b) over(partition by a order by b) as sumIb, > sum(d) over(partition by a order by b, c) as sumId, d > from distinctAgg > The physics plan will produce different results randomly. > One: there is only one sort of physical plan > == Physical Plan == > *(3) Project [a#181, b#182, c#183, sumId#210L, sumIb#209L, d#184] > +- Window [sum(cast(b#182 as bigint)) windowspecdefinition(a#181, b#182 ASC > NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), > currentrow$())) AS sumIb#209L], [a#181], [b#182 ASC NULLS FIRST] > +- Window [sum(cast(d#184 as bigint)) windowspecdefinition(a#181, b#182 > ASC NULLS FIRST, c#183 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, > unboundedpreceding$(), currentrow$())) AS sumId#210L], [a#181], [b#182 ASC > NULLS FIRST, c#183 ASC NULLS FIRST] > +- *(2) Sort [a#181 ASC NULLS FIRST, b#182 ASC NULLS FIRST, c#183 ASC > NULLS FIRST], false, 0 > +- Exchange hashpartitioning(a#181, 5) > +- *(1) Project [a#181, b#182, c#183, d#184] > +- *(1) SerializeFromObject [assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).a AS a#181, > assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, > true]).b AS b#182, assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).c AS c#183, > assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, > true]).d AS d#184, staticinvoke(class > org.apache.spark.unsafe.types.UTF8String, StringType, fromString, > assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, > true]).e, true, false) AS e#185] > +- Scan ExternalRDDScan[obj#180] > Another one: there is two sort of physical plans > == Physical Plan == > *(4) Project [a#181, b#182, c#183, sumId#210L, sumIb#209L, d#184] > +- Window [sum(cast(d#184 as bigint)) windowspecdefinition(a#181, b#182 ASC > NULLS FIRST, c#183 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, > unboundedpreceding$(), currentrow$())) AS sumId#210L], [a#181], [b#182 ASC > NULLS FIRST, c#183 ASC NULLS FIRST] > +- *(3) Sort [a#181 ASC NULLS FIRST, b#182 ASC NULLS FIRST, c#183 ASC > NULLS FIRST], false, 0 > +- Window [sum(cast(b#182 as bigint)) windowspecdefinition(a#181, 
b#182 > ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), > currentrow$())) AS sumIb#209L], [a#181], [b#182 ASC NULLS FIRST] > +- *(2) Sort [a#181 ASC NULLS FIRST, b#182 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(a#181, 5) > +- *(1) Project [a#181, b#182, c#183, d#184] > +- *(1) SerializeFromObject [assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).a AS a#181, > assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, > true]).b AS b#182, assertnotnull(input[0, > org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).c AS c#183, > assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, > true]).d AS d#184, staticinvoke(class > org.apache.spark.unsafe.types.UTF8String, StringType, fromString, > assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, > true]).e, true, false) AS e#185] > +- Scan ExternalRDDScan[obj#180] > this
[jira] [Created] (SPARK-25848) Refactor CSVBenchmarks to use main method
caoxuewen created SPARK-25848: - Summary: Refactor CSVBenchmarks to use main method Key: SPARK-25848 URL: https://issues.apache.org/jira/browse/SPARK-25848 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 2.4.1 Reporter: caoxuewen

use spark-submit:
bin/spark-submit --class org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar

Generate benchmark result:
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks"
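For reference, the "main method" style these benchmark refactors move toward looks roughly like the sketch below; it is modeled on org.apache.spark.benchmark.BenchmarkBase (whose exact signature may differ by version), and the body is a placeholder rather than the actual CSVBenchmarks code. The same pattern applies to the JSON and ExternalAppendOnlyUnsafeRowArray benchmarks below.

```scala
import org.apache.spark.benchmark.BenchmarkBase

object CSVBenchmarks extends BenchmarkBase {
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    runBenchmark("CSV parsing") {
      // Build the input and call spark.read.csv(...) here; when
      // SPARK_GENERATE_BENCHMARK_FILES=1 is set, results go to a benchmark file.
    }
  }
}
```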
[jira] [Created] (SPARK-25847) Refactor JSONBenchmarks to use main method
caoxuewen created SPARK-25847: - Summary: Refactor JSONBenchmarks to use main method Key: SPARK-25847 URL: https://issues.apache.org/jira/browse/SPARK-25847 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 2.4.1 Reporter: caoxuewen

use spark-submit:
bin/spark-submit --class org.apache.spark.sql.execution.datasources.json.JSONBenchmarks --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar

Generate benchmark result:
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.json.JSONBenchmarks"
[jira] [Created] (SPARK-25846) Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method
caoxuewen created SPARK-25846: - Summary: Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method Key: SPARK-25846 URL: https://issues.apache.org/jira/browse/SPARK-25846 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 2.4.1 Reporter: caoxuewen

use spark-submit:
bin/spark-submit --class org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar

Generate benchmark result:
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark"
[jira] [Created] (SPARK-25755) Supplement non-codegen unit tests for BroadcastHashJoinExec
caoxuewen created SPARK-25755: - Summary: Supplement non-codegen unit tests for BroadcastHashJoinExec Key: SPARK-25755 URL: https://issues.apache.org/jira/browse/SPARK-25755 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 2.4.1 Reporter: caoxuewen

Currently, the BroadcastHashJoinExec physical plan supports both codegen and non-codegen execution, but only the codegen path is covered by the unit tests in InnerJoinSuite, OuterJoinSuite and ExistenceJoinSuite; the non-codegen path is not tested. This PR supplements that part of the tests.
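One way the non-codegen path can be exercised is to run the existing assertions with whole-stage code generation turned off; the sketch below assumes a suite mixing in SQLTestUtils, and df1, df2 and expectedAnswer are placeholders:

```scala
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.internal.SQLConf

withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "false") {
  // The same BroadcastHashJoinExec checks the codegen variant already runs,
  // now executed through the non-codegen (interpreted) path.
  checkAnswer(df1.join(broadcast(df2), df1("k1") === df2("k2")), expectedAnswer)
}
```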
[jira] [Created] (SPARK-24999) Reduce unnecessary 'new' memory operations
caoxuewen created SPARK-24999: - Summary: Reduce unnecessary 'new' memory operations Key: SPARK-24999 URL: https://issues.apache.org/jira/browse/SPARK-24999 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: caoxuewen

This PR improves the code generated for the fast hash map: there is no need to allocate a new block of memory for every new entry, because the UnsafeRow's memory can be reused.
[jira] [Created] (SPARK-24978) Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.
caoxuewen created SPARK-24978: - Summary: Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation. Key: SPARK-24978 URL: https://issues.apache.org/jira/browse/SPARK-24978 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: caoxuewen

This PR adds a configuration parameter to configure the capacity of fast hash aggregation.

Performance comparison:
Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
Aggregate w multiple keys:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
fasthash = default                 5612 / 5882          3.7          267.6        1.0X
fasthash = config                  3586 / 3595          5.8          171.0        1.6X
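Hypothetical usage of the proposed knob, assuming it is wired up under the name in the issue title (it is not a released Spark configuration), with spark being an active SparkSession:

```scala
// Raise the fast hash map capacity before running a multi-key aggregation.
spark.conf.set("spark.sql.fast.hash.aggregate.row.max.capacity", (1 << 20).toString)
spark.range(1 << 22)
  .selectExpr("id % 1024 as k1", "id % 512 as k2", "id as v")
  .groupBy("k1", "k2")
  .sum("v")
  .count()
```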
[jira] [Created] (SPARK-24901) Merge the codegen of RegularHashMap and fastHashMap to reduce compiler maxCodesize when VectorizedHashMap is false
caoxuewen created SPARK-24901: - Summary: Merge the codegen of RegularHashMap and fastHashMap to reduce compiler maxCodesize when VectorizedHashMap is false Key: SPARK-24901 URL: https://issues.apache.org/jira/browse/SPARK-24901 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: caoxuewen Currently, Generate code of update UnsafeRow in hash aggregation. FastHashMap and RegularHashMap are two separate codes,These two separate codes need only when VectorizedHashMap is true. but other cases, we can merge together to reduce compiler maxCodesize. thanks. case class DistinctAgg(a: Int, b: Float, c: Double, d: Int, e: String) spark.sparkContext.parallelize( DistinctAgg(8, 2, 3, 4, "a") :: DistinctAgg(9, 3, 4, 5, "b") ::Nil).toDF()createOrReplaceTempView("distinctAgg") val df = sql("select a,b,e, min(d) as mind, min(case when a > 10 then a else null end) as mincasea, min(a) as mina from distinctAgg group by a, b, e") println(org.apache.spark.sql.execution.debug.codegenString(df.queryExecution.executedPlan)) df.show() Generate code like: *Before modified:* Generated code: /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage1(references); /* 003 */ } /* 004 */ ... /* 354 */ /* 355 */ if (agg_fastAggBuffer_0 != null) { /* 356 */ // common sub-expressions /* 357 */ /* 358 */ // evaluate aggregate function /* 359 */ agg_agg_isNull_31_0 = true; /* 360 */ int agg_value_34 = -1; /* 361 */ /* 362 */ boolean agg_isNull_32 = agg_fastAggBuffer_0.isNullAt(0); /* 363 */ int agg_value_35 = agg_isNull_32 ? /* 364 */ -1 : (agg_fastAggBuffer_0.getInt(0)); /* 365 */ /* 366 */ if (!agg_isNull_32 && (agg_agg_isNull_31_0 || /* 367 */ agg_value_34 > agg_value_35)) { /* 368 */ agg_agg_isNull_31_0 = false; /* 369 */ agg_value_34 = agg_value_35; /* 370 */ } /* 371 */ /* 372 */ if (!false && (agg_agg_isNull_31_0 || /* 373 */ agg_value_34 > agg_expr_2_0)) { /* 374 */ agg_agg_isNull_31_0 = false; /* 375 */ agg_value_34 = agg_expr_2_0; /* 376 */ } /* 377 */ agg_agg_isNull_34_0 = true; /* 378 */ int agg_value_37 = -1; /* 379 */ /* 380 */ boolean agg_isNull_35 = agg_fastAggBuffer_0.isNullAt(1); /* 381 */ int agg_value_38 = agg_isNull_35 ? /* 382 */ -1 : (agg_fastAggBuffer_0.getInt(1)); /* 383 */ /* 384 */ if (!agg_isNull_35 && (agg_agg_isNull_34_0 || /* 385 */ agg_value_37 > agg_value_38)) { /* 386 */ agg_agg_isNull_34_0 = false; /* 387 */ agg_value_37 = agg_value_38; /* 388 */ } /* 389 */ /* 390 */ byte agg_caseWhenResultState_1 = -1; /* 391 */ do { /* 392 */ boolean agg_value_40 = false; /* 393 */ agg_value_40 = agg_expr_0_0 > 10; /* 394 */ if (!false && agg_value_40) { /* 395 */ agg_caseWhenResultState_1 = (byte)(false ? 1 : 0); /* 396 */ agg_agg_value_39_0 = agg_expr_0_0; /* 397 */ continue; /* 398 */ } /* 399 */ /* 400 */ agg_caseWhenResultState_1 = (byte)(true ? 1 : 0); /* 401 */ agg_agg_value_39_0 = -1; /* 402 */ /* 403 */ } while (false); /* 404 */ // TRUE if any condition is met and the result is null, or no any condition is met. /* 405 */ final boolean agg_isNull_36 = (agg_caseWhenResultState_1 != 0); /* 406 */ /* 407 */ if (!agg_isNull_36 && (agg_agg_isNull_34_0 || /* 408 */ agg_value_37 > agg_agg_value_39_0)) { /* 409 */ agg_agg_isNull_34_0 = false; /* 410 */ agg_value_37 = agg_agg_value_39_0; /* 411 */ } /* 412 */ agg_agg_isNull_42_0 = true; /* 413 */ int agg_value_45 = -1; /* 414 */ /* 415 */ boolean agg_isNull_43 = agg_fastAggBuffer_0.isNullAt(2); /* 416 */ int agg_value_46 = agg_isNull_43 ? 
/* 417 */ -1 : (agg_fastAggBuffer_0.getInt(2)); /* 418 */ /* 419 */ if (!agg_isNull_43 && (agg_agg_isNull_42_0 || /* 420 */ agg_value_45 > agg_value_46)) { /* 421 */ agg_agg_isNull_42_0 = false; /* 422 */ agg_value_45 = agg_value_46; /* 423 */ } /* 424 */ /* 425 */ if (!false && (agg_agg_isNull_42_0 || /* 426 */ agg_value_45 > agg_expr_0_0)) { /* 427 */ agg_agg_isNull_42_0 = false; /* 428 */ agg_value_45 = agg_expr_0_0; /* 429 */ } /* 430 */ // update fast row /* 431 */ agg_fastAggBuffer_0.setInt(0, agg_value_34); /* 432 */ /* 433 */ if (!agg_agg_isNull_34_0) { /* 434 */ agg_fastAggBuffer_0.setInt(1, agg_value_37); /* 435 */ } else { /* 436 */ agg_fastAggBuffer_0.setNullAt(1); /* 437 */ } /* 438 */ /* 439 */ agg_fastAggBuffer_0.setInt(2, agg_value_45); /* 440 */ } else { /* 441 */ // common
[jira] [Created] (SPARK-24066) Add a window exchange rule to eliminate redundant physical plan SortExec
caoxuewen created SPARK-24066: - Summary: Add a window exchange rule to eliminate redundant physical plan SortExec Key: SPARK-24066 URL: https://issues.apache.org/jira/browse/SPARK-24066 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: caoxuewen Currently, when the order field of window function has a subset relationship, SparkSQL will randomly generate different physical plan. Similar like: case class DistinctAgg(a: Int, b: Float, c: Double, d: Int, e: String) val df = spark.sparkContext.parallelize( DistinctAgg(8, 2, 3, 4, "a") :: DistinctAgg(9, 3, 4, 5, "b") :: DistinctAgg(3, 4, 5, 6, "c") :: DistinctAgg(3, 4, 5, 7, "c") :: DistinctAgg(3, 4, 5, 8, "c") :: DistinctAgg(3, 6, 6, 9, "d") :: DistinctAgg(30, 40, 50, 60, "e") :: DistinctAgg(41, 51, 61, 71, null) :: DistinctAgg(42, 52, 62, 72, null) :: DistinctAgg(43, 53, 63, 73, "k") ::Nil).toDF() df.createOrReplaceTempView("distinctAgg") select a, b, c, avg(b) over(partition by a order by b) as sumIb, sum(d) over(partition by a order by b, c) as sumId, d from distinctAgg The physics plan will produce different results randomly. One: there is only one sort of physical plan == Physical Plan == *(3) Project [a#181, b#182, c#183, sumId#210L, sumIb#209L, d#184] +- Window [sum(cast(b#182 as bigint)) windowspecdefinition(a#181, b#182 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS sumIb#209L], [a#181], [b#182 ASC NULLS FIRST] +- Window [sum(cast(d#184 as bigint)) windowspecdefinition(a#181, b#182 ASC NULLS FIRST, c#183 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS sumId#210L], [a#181], [b#182 ASC NULLS FIRST, c#183 ASC NULLS FIRST] +- *(2) Sort [a#181 ASC NULLS FIRST, b#182 ASC NULLS FIRST, c#183 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(a#181, 5) +- *(1) Project [a#181, b#182, c#183, d#184] +- *(1) SerializeFromObject [assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).a AS a#181, assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).b AS b#182, assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).c AS c#183, assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).d AS d#184, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).e, true, false) AS e#185] +- Scan ExternalRDDScan[obj#180] Another one: there is two sort of physical plans == Physical Plan == *(4) Project [a#181, b#182, c#183, sumId#210L, sumIb#209L, d#184] +- Window [sum(cast(d#184 as bigint)) windowspecdefinition(a#181, b#182 ASC NULLS FIRST, c#183 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS sumId#210L], [a#181], [b#182 ASC NULLS FIRST, c#183 ASC NULLS FIRST] +- *(3) Sort [a#181 ASC NULLS FIRST, b#182 ASC NULLS FIRST, c#183 ASC NULLS FIRST], false, 0 +- Window [sum(cast(b#182 as bigint)) windowspecdefinition(a#181, b#182 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS sumIb#209L], [a#181], [b#182 ASC NULLS FIRST] +- *(2) Sort [a#181 ASC NULLS FIRST, b#182 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(a#181, 5) +- *(1) Project [a#181, b#182, c#183, d#184] +- *(1) SerializeFromObject [assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).a AS a#181, assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).b AS b#182, assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).c AS c#183, assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).d AS d#184, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).e, true, false) AS e#185] +- Scan ExternalRDDScan[obj#180] this PR add an exchange rule to ensure that no redundant physical plan SortExec is generated. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23676) Support left join codegen in SortMergeJoinExec
caoxuewen created SPARK-23676: - Summary: Support left join codegen in SortMergeJoinExec Key: SPARK-23676 URL: https://issues.apache.org/jira/browse/SPARK-23676 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: caoxuewen This PR generates java code to directly complete the function of LeftOuter in `SortMergeJoinExec` without using an iterator. This PR improves runtime performance by this generates java code. joinBenchmark result: **1.3x** ``` Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1 Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz left sort merge join: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative -- left merge join wholestage=off 2439 / 2575 0.9 1163.0 1.0X left merge join wholestage=on 1890 / 1904 1.1 901.1 1.3X ``` Benchmark program ``` val N = 2 << 20 runBenchmark("left sort merge join", N) { val df1 = sparkSession.range(N) .selectExpr(s"(id * 15485863) % ${N*10} as k1") val df2 = sparkSession.range(N) .selectExpr(s"(id * 15485867) % ${N*10} as k2") val df = df1.join(df2, col("k1") === col("k2"), "left") assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined) df.count() ``` code example ``` val df1 = spark.range(2 << 20).selectExpr("id as k1", "id * 2 as v1") val df2 = spark.range(2 << 20).selectExpr("id as k2", "id * 3 as v2") df1.join(df2, col("k1") === col("k2") && col("v1") < col("v2"), "left").collect ``` Generated code ``` /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage5(references); /* 003 */ } /* 004 */ /* 005 */ // codegenStageId=5 /* 006 */ final class GeneratedIteratorForCodegenStage5 extends org.apache.spark.sql.execution.BufferedRowIterator { /* 007 */ private Object[] references; /* 008 */ private scala.collection.Iterator[] inputs; /* 009 */ private scala.collection.Iterator smj_leftInput; /* 010 */ private scala.collection.Iterator smj_rightInput; /* 011 */ private InternalRow smj_leftRow; /* 012 */ private InternalRow smj_rightRow; /* 013 */ private long smj_value2; /* 014 */ private org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray smj_matches; /* 015 */ private long smj_value3; /* 016 */ private long smj_value4; /* 017 */ private long smj_value5; /* 018 */ private long smj_value6; /* 019 */ private boolean smj_isNull2; /* 020 */ private long smj_value7; /* 021 */ private boolean smj_isNull3; /* 022 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder[] smj_mutableStateArray1 = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder[1]; /* 023 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] smj_mutableStateArray2 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1]; /* 024 */ private UnsafeRow[] smj_mutableStateArray = new UnsafeRow[1]; /* 025 */ /* 026 */ public GeneratedIteratorForCodegenStage5(Object[] references) { /* 027 */ this.references = references; /* 028 */ } /* 029 */ /* 030 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 031 */ partitionIndex = index; /* 032 */ this.inputs = inputs; /* 033 */ smj_leftInput = inputs[0]; /* 034 */ smj_rightInput = inputs[1]; /* 035 */ /* 036 */ smj_matches = new org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(2147483647, 2147483647); /* 037 */ smj_mutableStateArray[0] = new UnsafeRow(4); /* 038 */ smj_mutableStateArray1[0] = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(smj_mutableStateArray[0], 0); /* 039 */ 
smj_mutableStateArray2[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(smj_mutableStateArray1[0], 4); /* 040 */ /* 041 */ } /* 042 */ /* 043 */ private void writeJoinRows() throws java.io.IOException { /* 044 */ smj_mutableStateArray2[0].zeroOutNullBytes(); /* 045 */ /* 046 */ smj_mutableStateArray2[0].write(0, smj_value4); /* 047 */ /* 048 */ smj_mutableStateArray2[0].write(1, smj_value5); /* 049 */ /* 050 */ if (smj_isNull2) { /* 051 */ smj_mutableStateArray2[0].setNullAt(2); /* 052 */ } else { /* 053 */ smj_mutableStateArray2[0].write(2, smj_value6); /* 054 */ } /* 055 */ /* 056 */ if (smj_isNull3) { /* 057 */ smj_mutableStateArray2[0].setNullAt(3); /* 058 */ } else { /* 059 */ smj_mutableStateArray2[0].write(3, smj_value7); /* 060 */ } /* 061 */ append(smj_mutableStateArray[0].copy()); /* 062 */ /* 063 */ } /* 064 */ /* 065 */ private boolean findNextJoinRows( /* 066 */ scala.collection.Iterator leftIter, /* 067 */ scala.collection.Iterator rightIter) { /* 068 */ smj_leftRow = null; /* 069 */ int comp = 0; /* 070 */ while (smj_leftRow == null) { /* 071 */ if (!leftIter.hasNext()) return false; /* 072 */ smj_leftRow = (InternalRow) leftIter.next(); /* 073 */ /*
[jira] [Updated] (SPARK-23609) Test EnsureRequirements's test cases to eliminate ShuffleExchange while is not expected
[ https://issues.apache.org/jira/browse/SPARK-23609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-23609: -- Summary: Test EnsureRequirements's test cases to eliminate ShuffleExchange while is not expected (was: Test code does not conform to the test title) > Test EnsureRequirements's test cases to eliminate ShuffleExchange while is > not expected > --- > > Key: SPARK-23609 > URL: https://issues.apache.org/jira/browse/SPARK-23609 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: caoxuewen >Priority: Minor > > Currently, In testing EnsureRequirements's test cases to eliminate > ShuffleExchange, The test code is not in conformity with the purpose of the > test.These test cases are as follows: > 1、test("EnsureRequirements eliminates Exchange if child has same > partitioning") > The checking condition is that there is no ShuffleExchange in the physical > plan. = = 2 It's not accurate here. > 2、test("EnsureRequirements does not eliminate Exchange with different > partitioning") > The purpose of the test is to not eliminate ShuffleExchange, but its test > code is the same as test("EnsureRequirements eliminates Exchange if child has > same partitioning") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23609) Test code does not conform to the test title
caoxuewen created SPARK-23609: - Summary: Test code does not conform to the test title Key: SPARK-23609 URL: https://issues.apache.org/jira/browse/SPARK-23609 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 2.4.0 Reporter: caoxuewen

Currently, in EnsureRequirements's test cases for eliminating ShuffleExchange, the test code does not match the stated purpose of the test. The affected test cases are:
1. test("EnsureRequirements eliminates Exchange if child has same partitioning"): the check only asserts that there is no extra ShuffleExchange in the physical plan (count == 2), which is not an accurate check here.
2. test("EnsureRequirements does not eliminate Exchange with different partitioning"): the purpose of this test is that the ShuffleExchange is not eliminated, but its test code is the same as test("EnsureRequirements eliminates Exchange if child has same partitioning").
[jira] [Updated] (SPARK-23544) Remove redundant ShuffleExchange in the planner
[ https://issues.apache.org/jira/browse/SPARK-23544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-23544: -- Description: Currently, we execute the SQL statement: select MDTTemp.* from (select * from distinctAgg where a > 2 distribute by a, b,c) MDTTemp left join (select * from testData5 where f > 1) ProjData on MDTTemp.b = ProjData.g and MDTTemp.c = ProjData.h and MDTTemp.d < (ProjData.j - 3) and MDTTemp.d >= (ProjData.j + 3) the physical plan the explain looks like: == Physical Plan == *Project [a#203, b#204, c#205, d#206, e#207] +- SortMergeJoin [b#204, c#205], [g#222, h#223], LeftOuter, ((d#206 < (j#224 - 3)) && (d#206 >= (j#224 + 3))) :- *Sort [b#204 ASC, c#205 ASC], false, 0 : +- Exchange hashpartitioning(b#204, c#205, 5) : +- Exchange hashpartitioning(a#203, b#204, c#205, 5) : +- *Filter (a#203 > 2) : +- Scan ExistingRDD[a#203,b#204,c#205,d#206,e#207] +- *Sort [g#222 ASC, h#223 ASC], false, 0 +- Exchange hashpartitioning(g#222, h#223, 5) +- *Project [g#222, h#223, j#224] +- *Filter (f#221 > 1) +- Scan ExistingRDD[f#221,g#222,h#223,j#224,k#225] There is a redundancy ShuffleExchange that is not necessary. This PR will provide a rule to remove redundancy ShuffleExchange in the planner. now the explain looks like: == Physical Plan == *Project [a#203, b#204, c#205, d#206, e#207] +- SortMergeJoin [b#204, c#205], [g#222, h#223], LeftOuter, ((d#206 < (j#224 - 3)) && (d#206 >= (j#224 + 3))) :- *Sort [b#204 ASC, c#205 ASC], false, 0 : +- Exchange hashpartitioning(b#204, c#205, 5) : +- *Filter (a#203 > 2) : +- Scan ExistingRDD[a#203,b#204,c#205,d#206,e#207] +- *Sort [g#222 ASC, h#223 ASC], false, 0 +- Exchange hashpartitioning(g#222, h#223, 5) +- *Project [g#222, h#223, j#224] +- *Filter (f#221 > 1) +- Scan ExistingRDD[f#221,g#222,h#223,j#224,k#225] and I have add a test case: val N = 2 << 20 runJoinBenchmark("sort merge join", N) { val df1 = sparkSession.range(N) .selectExpr(s"(id * 15485863) % ${N*10} as k1") val df2 = sparkSession.range(N) .selectExpr(s"(id * 15485867) % ${N*10} as k2") val df = df1.join(df2.repartition(20), col("k1") === col("k2")) assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined) df.count() } To test the performance of the following: Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1 Intel64 Family 6 Model 94 Stepping 3, GenuineIntel sort merge join: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative sort merge join Repartition off 3520 / 4364 0.6 1678.5 1.0X sort merge join Repartition on 1946 / 2203 1.1 927.9 1.8X was: Currently, when the children of the join are Repartition or RepartitionByExpression, Repartition operation is not necessary, I think that we can remove the Repartition operation in the Optimizer, and it is safe for the join operation. now the explain looks like: === Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseRepartition === Input LogicalPlan: Join Inner :- Repartition 10, false : +- LocalRelation , [a#0, b#1] +- Repartition 10, false +- LocalRelation , [c#2, d#3] Output LogicalPlan: Join Inner :- LocalRelation , [a#0, b#1] +- LocalRelation , [c#2, d#3] h3. 
and I have add a test case: val N = 2 << 20 runJoinBenchmark("sort merge join", N) { val df1 = sparkSession.range(N) .selectExpr(s"(id * 15485863) % ${N*10} as k1") val df2 = sparkSession.range(N) .selectExpr(s"(id * 15485867) % ${N*10} as k2") val df = df1.join(df2.repartition(20), col("k1") === col("k2")) assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined) df.count() } To test the performance of the following: Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1 Intel64 Family 6 Model 94 Stepping 3, GenuineIntel sort merge join: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative sort merge join Repartition off 3520 / 4364 0.6 1678.5 1.0X sort merge join Repartition on 1946 / 2203 1.1 927.9 1.8X > Remove redundancy ShuffleExchange in the planner > > > Key: SPARK-23544 > URL: https://issues.apache.org/jira/browse/SPARK-23544 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: caoxuewen >Priority: Major > > Currently, we execute the SQL statement: > select MDTTemp.* from (select * from distinctAgg where a > 2 distribute by a, > b,c) MDTTemp > left join (select * from testData5 where f > 1) ProjData > on MDTTemp.b = ProjData.g and > MDTTemp.c = ProjData.h and > MDTTemp.d < (ProjData.j - 3) and > MDTTemp.d >= (ProjData.j + 3) > the physical plan the explain
[jira] [Updated] (SPARK-23544) Remove redundant ShuffleExchange in the planner
[ https://issues.apache.org/jira/browse/SPARK-23544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-23544: -- Summary: Remove redundancy ShuffleExchange in the planner (was: Remove repartition operation from join in the optimizer) > Remove redundancy ShuffleExchange in the planner > > > Key: SPARK-23544 > URL: https://issues.apache.org/jira/browse/SPARK-23544 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: caoxuewen >Priority: Major > > Currently, when the children of the join are Repartition or > RepartitionByExpression, Repartition operation is not necessary, I think that > we can remove the Repartition operation in the Optimizer, and it is safe for > the join operation. now the explain looks like: > === Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseRepartition > === > > Input LogicalPlan: > Join Inner > :- Repartition 10, false > : +- LocalRelation , [a#0, b#1] > +- Repartition 10, false > +- LocalRelation , [c#2, d#3] > Output LogicalPlan: > Join Inner > :- LocalRelation , [a#0, b#1] > +- LocalRelation , [c#2, d#3] > > h3. and I have add a test case: > val N = 2 << 20 > runJoinBenchmark("sort merge join", N) { > val df1 = sparkSession.range(N) > .selectExpr(s"(id * 15485863) % ${N*10} as k1") > val df2 = sparkSession.range(N) > .selectExpr(s"(id * 15485867) % ${N*10} as k2") > val df = df1.join(df2.repartition(20), col("k1") === col("k2")) > > assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined) > df.count() > } > > To test the performance of the following: > Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1 > Intel64 Family 6 Model 94 Stepping 3, GenuineIntel > sort merge join: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative > > sort merge join Repartition off 3520 / 4364 0.6 1678.5 1.0X > sort merge join Repartition on 1946 / 2203 1.1 927.9 1.8X -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23544) Remove repartition operation from join in the optimizer
caoxuewen created SPARK-23544: - Summary: Remove repartition operation from join in the optimizer Key: SPARK-23544 URL: https://issues.apache.org/jira/browse/SPARK-23544 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: caoxuewen

Currently, when the children of a join are Repartition or RepartitionByExpression, the Repartition operation is not necessary. I think we can remove the Repartition operation in the Optimizer, and it is safe for the join operation. Now the explain looks like:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseRepartition ===

Input LogicalPlan:
Join Inner
:- Repartition 10, false
:  +- LocalRelation , [a#0, b#1]
+- Repartition 10, false
   +- LocalRelation , [c#2, d#3]

Output LogicalPlan:
Join Inner
:- LocalRelation , [a#0, b#1]
+- LocalRelation , [c#2, d#3]

And I have added a test case:

val N = 2 << 20
runJoinBenchmark("sort merge join", N) {
  val df1 = sparkSession.range(N)
    .selectExpr(s"(id * 15485863) % ${N*10} as k1")
  val df2 = sparkSession.range(N)
    .selectExpr(s"(id * 15485867) % ${N*10} as k2")
  val df = df1.join(df2.repartition(20), col("k1") === col("k2"))
  assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
  df.count()
}

To test the performance of the change:

Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
sort merge join:                   Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
sort merge join Repartition off          3520 / 4364          0.6        1678.5        1.0X
sort merge join Repartition on           1946 / 2203          1.1         927.9        1.8X
[jira] [Created] (SPARK-23356) Push Project to both sides of Union when expression is non-deterministic
caoxuewen created SPARK-23356: - Summary: Push Project to both sides of Union when expression is non-deterministic Key: SPARK-23356 URL: https://issues.apache.org/jira/browse/SPARK-23356 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: caoxuewen

Currently, the PushProjectionThroughUnion optimizer rule only pushes a Project operator down to both sides of a Union when the expressions are deterministic. In fact, just as with filter pushdown, we can also push a Project operator down to both sides of a Union when an expression is non-deterministic. This PR fixes that. Now the explain looks like:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion ===

Input LogicalPlan:
Project [a#0, rand(10) AS rnd#9]
+- Union
   :- LocalRelation , [a#0, b#1, c#2]
   :- LocalRelation , [d#3, e#4, f#5]
   +- LocalRelation , [g#6, h#7, i#8]

Output LogicalPlan:
Project [a#0, rand(10) AS rnd#9]
+- Union
   :- Project [a#0]
   :  +- LocalRelation , [a#0, b#1, c#2]
   :- Project [d#3]
   :  +- LocalRelation , [d#3, e#4, f#5]
   +- Project [g#6]
      +- LocalRelation , [g#6, h#7, i#8]
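A small snippet that reproduces the shape of the plan above; the data and column names are made up to mirror the example, and it assumes an active SparkSession with spark.implicits._ imported:

```scala
import org.apache.spark.sql.functions.rand

val df1 = Seq((1, 2, 3)).toDF("a", "b", "c")
val df2 = Seq((4, 5, 6)).toDF("d", "e", "f")
val df3 = Seq((7, 8, 9)).toDF("g", "h", "i")

// Union is positional, so the result keeps df1's column names (a, b, c).
val projected = df1.union(df2).union(df3).select($"a", rand(10).as("rnd"))
projected.explain(true) // shows whether the Project was pushed below the Union
```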
[jira] [Created] (SPARK-23343) Add exception tests for port binding
caoxuewen created SPARK-23343: - Summary: Add exception tests for port binding Key: SPARK-23343 URL: https://issues.apache.org/jira/browse/SPARK-23343 Project: Spark Issue Type: Test Components: Spark Core Affects Versions: 2.4.0 Reporter: caoxuewen

This PR adds new test cases:
1. a boundary-value test for port 65535,
2. a test for binding to a privileged port,
3. a rebinding test with `spark.port.maxRetries` set to 1,
4. a test that `Utils.userPort` wraps around when generating ports.
In addition, in the existing test case, if `spark.testing` is not set to true, the default value of `spark.port.maxRetries` is 16 rather than 100, so using (expectedPort + 100) is a small mistake.
[jira] [Created] (SPARK-23311) Add FilterFunction test case for CombineTypedFilters
caoxuewen created SPARK-23311: - Summary: Add FilterFunction test case for CombineTypedFilters Key: SPARK-23311 URL: https://issues.apache.org/jira/browse/SPARK-23311 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.4.0 Reporter: caoxuewen

The current test cases for CombineTypedFilters lack a FilterFunction test, so let's add one. In addition, let's extract a common LocalRelation from TypedFilterOptimizationSuite's existing test cases.
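For illustration, a typed-filter chain built with the Java FilterFunction API, which is the case the issue says is missing from the CombineTypedFilters tests; spark is assumed to be an active SparkSession:

```scala
import org.apache.spark.api.java.function.FilterFunction

val ds = spark.range(100) // Dataset[java.lang.Long]
val positive = new FilterFunction[java.lang.Long] {
  def call(v: java.lang.Long): Boolean = v > 0
}
val small = new FilterFunction[java.lang.Long] {
  def call(v: java.lang.Long): Boolean = v < 50
}
// Two adjacent typed filters; the CombineTypedFilters rule should merge them into one.
val combined = ds.filter(positive).filter(small)
```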
[jira] [Created] (SPARK-23247) Combine Unsafe operations and statistics operations in data source scan
caoxuewen created SPARK-23247: - Summary: Combine Unsafe operations and statistics operations in data source scan Key: SPARK-23247 URL: https://issues.apache.org/jira/browse/SPARK-23247 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: caoxuewen

Currently, when scanning a data source in the execution plan, we first apply the unsafe projection to each row and then traverse the data again to count the rows. Performance-wise, this is not necessary. This PR combines the two operations and counts the rows while performing the unsafe projection.

*Before modified:*

val unsafeRow = rdd.mapPartitionsWithIndexInternal { (index, iter) =>
  val proj = UnsafeProjection.create(schema)
  proj.initialize(index)
  iter.map(proj)
}
val numOutputRows = longMetric("numOutputRows")
unsafeRow.map { r =>
  numOutputRows += 1
  r
}

*After modified:*

val numOutputRows = longMetric("numOutputRows")
rdd.mapPartitionsWithIndexInternal { (index, iter) =>
  val proj = UnsafeProjection.create(schema)
  proj.initialize(index)
  iter.map { r =>
    numOutputRows += 1
    proj(r)
  }
}
[jira] [Created] (SPARK-23199) Improve removal of repetition from group expressions in Aggregate
caoxuewen created SPARK-23199: - Summary: Improve removal of repetition from group expressions in Aggregate Key: SPARK-23199 URL: https://issues.apache.org/jira/browse/SPARK-23199 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: caoxuewen

Currently, all Aggregate operations go through RemoveRepetitionFromGroupExpressions, but when there are no group expressions, or no duplicate group expressions among them, we do not need to copy the logical plan.
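The shape of the proposed guard might look like the sketch below, a hand-written approximation rather than the actual rule code: only rebuild the Aggregate when deduplication actually removes a group expression, and leave the plan node untouched otherwise.

```scala
import org.apache.spark.sql.catalyst.expressions.ExpressionSet
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object RemoveRepetitionFromGroupExpressionsSketch extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case a @ Aggregate(grouping, aggs, child) if grouping.nonEmpty =>
      val deduped = ExpressionSet(grouping).toSeq
      // Copy the node only when deduplication changed the grouping keys.
      if (deduped.size == grouping.size) a else Aggregate(deduped, aggs, child)
  }
}
```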
[jira] [Updated] (SPARK-22035) the value of statistical logicalPlan.stats.sizeInBytes which is not expected
[ https://issues.apache.org/jira/browse/SPARK-22035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-22035: -- Summary: the value of statistical logicalPlan.stats.sizeInBytes which is not expected (was: Improving the value of statistical logicalPlan.stats.sizeInBytes which is not expected) > the value of statistical logicalPlan.stats.sizeInBytes which is not expected > > > Key: SPARK-22035 > URL: https://issues.apache.org/jira/browse/SPARK-22035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: caoxuewen > > Currently, assume there will be the same number of rows as child has. > statistics `logicalPlan.stats.SizeInBytes` is calculated based on the > percentage of the size of the child's data type and the size of the current > data type size. But there is a problem. Statistics are not very accurate. > for example: > ``` > val N = 1 << 3 > val df = spark.range(N).selectExpr("id as k1", > s"cast(id % 3 as string) as idString1", > s"cast((id + 1) % 5 as string) as idString3") > val sizeInBytes = df.logicalPlan.stats.sizeInBytes > println("sizeInBytes : " + sizeInBytes) > ``` > before modify: > sizeInBytes is 224(8 * 8 * ( (8 + 20 + 20 +8) / (8 + 8))). > debug information in ` SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode ` > ``` > p.child.dataType: LongType defaultSize: 8 > p.dataType: LongType defaultSize: 8 > p.dataType: StringType defaultSize: 20 > p.dataType: StringType defaultSize: 20 > childRowSize: 16 outputRowSize: 56 > p.child.stats.sizeInBytes : 64 > p.stats.sizeInBytes : 224 > sizeInBytes: 224 > ``` > but sizeInBytes must be 384( 8 * (8 + 20 + 20) ). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22035) Improving the value of statistical logicalPlan.stats.sizeInBytes which is not expected
[ https://issues.apache.org/jira/browse/SPARK-22035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-22035: -- Description: Currently, assume there will be the same number of rows as child has. statistics `logicalPlan.stats.SizeInBytes` is calculated based on the percentage of the size of the child's data type and the size of the current data type size. But there is a problem. Statistics are not very accurate. for example: ``` val N = 1 << 3 val df = spark.range(N).selectExpr("id as k1", s"cast(id % 3 as string) as idString1", s"cast((id + 1) % 5 as string) as idString3") val sizeInBytes = df.logicalPlan.stats.sizeInBytes println("sizeInBytes : " + sizeInBytes) ``` before modify: sizeInBytes is 224(8 * 8 * ( (8 + 20 + 20 +8) / (8 + 8))). debug information in ` SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode ` ``` p.child.dataType: LongType defaultSize: 8 p.dataType: LongType defaultSize: 8 p.dataType: StringType defaultSize: 20 p.dataType: StringType defaultSize: 20 childRowSize: 16 outputRowSize: 56 p.child.stats.sizeInBytes : 64 p.stats.sizeInBytes : 224 sizeInBytes: 224 ``` but sizeInBytes must be 384( 8 * (8 + 20 + 20) ). was: Currently, assume there will be the same number of rows as child has. statistics `logicalPlan.stats.SizeInBytes` is calculated based on the percentage of the size of the child's data type and the size of the current data type size. But there is a problem. Statistics are not very accurate. for example: ``` val N = 1 << 3 val df = spark.range(N).selectExpr("id as k1", s"cast(id % 3 as string) as idString1", s"cast((id + 1) % 5 as string) as idString3") val sizeInBytes = df.logicalPlan.stats.sizeInBytes println("sizeInBytes : " + sizeInBytes) ``` before modify: sizeInBytes is 224(8 * 8 * ( (8 + 20 + 20 +8) / (8 + 8))). debug information in ` SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode ` ``` p.child.dataType: LongType defaultSize: 8 p.dataType: LongType defaultSize: 8 p.dataType: StringType defaultSize: 20 p.dataType: StringType defaultSize: 20 childRowSize: 16 outputRowSize: 56 p.child.stats.sizeInBytes : 64 p.stats.sizeInBytes : 224 sizeInBytes: 224 ``` after modify: sizeInBytes is 384( 8 * 8 ((8 + 20 + 20) / 8) ). debug information in ` SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode ` ``` p.child.dataType: LongType defaultSize: 8 p.dataType: LongType defaultSize: 8 p.dataType: StringType defaultSize: 20 p.dataType: StringType defaultSize: 20 childRowSize: 8 outputRowSize: 48 p.child.stats.sizeInBytes : 64 p.stats.sizeInBytes : 384 sizeInBytes: 384 ``` > Improving the value of statistical logicalPlan.stats.sizeInBytes which is not > expected > -- > > Key: SPARK-22035 > URL: https://issues.apache.org/jira/browse/SPARK-22035 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: caoxuewen > > Currently, assume there will be the same number of rows as child has. > statistics `logicalPlan.stats.SizeInBytes` is calculated based on the > percentage of the size of the child's data type and the size of the current > data type size. But there is a problem. Statistics are not very accurate. > for example: > ``` > val N = 1 << 3 > val df = spark.range(N).selectExpr("id as k1", > s"cast(id % 3 as string) as idString1", > s"cast((id + 1) % 5 as string) as idString3") > val sizeInBytes = df.logicalPlan.stats.sizeInBytes > println("sizeInBytes : " + sizeInBytes) > ``` > before modify: > sizeInBytes is 224(8 * 8 * ( (8 + 20 + 20 +8) / (8 + 8))). 
> debug information in ` SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode ` > ``` > p.child.dataType: LongType defaultSize: 8 > p.dataType: LongType defaultSize: 8 > p.dataType: StringType defaultSize: 20 > p.dataType: StringType defaultSize: 20 > childRowSize: 16 outputRowSize: 56 > p.child.stats.sizeInBytes : 64 > p.stats.sizeInBytes : 224 > sizeInBytes: 224 > ``` > but sizeInBytes must be 384( 8 * (8 + 20 + 20) ). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22035) Improving the value of statistical logicalPlan.stats.sizeInBytes which is not expected
caoxuewen created SPARK-22035: - Summary: Improving the value of statistical logicalPlan.stats.sizeInBytes which is not expected Key: SPARK-22035 URL: https://issues.apache.org/jira/browse/SPARK-22035 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: caoxuewen Currently, assume there will be the same number of rows as child has. statistics `logicalPlan.stats.SizeInBytes` is calculated based on the percentage of the size of the child's data type and the size of the current data type size. But there is a problem. Statistics are not very accurate. for example: ``` val N = 1 << 3 val df = spark.range(N).selectExpr("id as k1", s"cast(id % 3 as string) as idString1", s"cast((id + 1) % 5 as string) as idString3") val sizeInBytes = df.logicalPlan.stats.sizeInBytes println("sizeInBytes : " + sizeInBytes) ``` before modify: sizeInBytes is 224(8 * 8 * ( (8 + 20 + 20 +8) / (8 + 8))). debug information in ` SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode ` ``` p.child.dataType: LongType defaultSize: 8 p.dataType: LongType defaultSize: 8 p.dataType: StringType defaultSize: 20 p.dataType: StringType defaultSize: 20 childRowSize: 16 outputRowSize: 56 p.child.stats.sizeInBytes : 64 p.stats.sizeInBytes : 224 sizeInBytes: 224 ``` after modify: sizeInBytes is 384( 8 * 8 ((8 + 20 + 20) / 8) ). debug information in ` SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode ` ``` p.child.dataType: LongType defaultSize: 8 p.dataType: LongType defaultSize: 8 p.dataType: StringType defaultSize: 20 p.dataType: StringType defaultSize: 20 childRowSize: 8 outputRowSize: 48 p.child.stats.sizeInBytes : 64 p.stats.sizeInBytes : 384 sizeInBytes: 384 ``` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
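A small worked example of the arithmetic above, in plain Scala. The 8-byte per-row overhead is inferred from the childRowSize/outputRowSize values quoted in the description and is stated here as an assumption, not read from the Spark source.
```
// Values from the description: one LongType child column, an output row of
// Long + String + String, and 8 rows (N = 1 << 3).
val rowOverhead      = 8                          // assumed per-row overhead in the estimator
val childRowSize     = 8 + rowOverhead            // LongType + overhead               = 16
val outputRowSize    = 8 + 20 + 20 + rowOverhead  // Long + two Strings + overhead     = 56
val childSizeInBytes = 8 * 8                      // 8 rows * 8 bytes                  = 64

val estimated = childSizeInBytes * outputRowSize / childRowSize  // 64 * 56 / 16 = 224
val expected  = 8 * (8 + 20 + 20)                                // 8 rows * 48  = 384
```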
[jira] [Created] (SPARK-21963) create temp file should be delete after use
caoxuewen created SPARK-21963: - Summary: create temp file should be delete after use Key: SPARK-21963 URL: https://issues.apache.org/jira/browse/SPARK-21963 Project: Spark Issue Type: Bug Components: Spark Core, Tests Affects Versions: 2.3.0 Reporter: caoxuewen After a test creates a temporary file it needs to delete it; otherwise it leaves behind a file with a name like ‘SPARK194465907929586320484966temp’ -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
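A minimal sketch of the cleanup pattern the issue asks for, assuming a test that creates its own scratch file; the prefix/suffix here are illustrative, not taken from the patch.
```
import java.io.File
import java.nio.file.Files

// Create a scratch file for the test and make sure it is removed even when
// the test body throws, so no SPARK...temp files are left behind.
val tempFile: File = Files.createTempFile("SPARK", "temp").toFile
try {
  // ... exercise the code under test with tempFile ...
} finally {
  tempFile.delete()
}
```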
[jira] [Closed] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark
[ https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen closed SPARK-21932. - Resolution: Resolved > Add test CartesianProduct join case and BroadcastNestedLoop join case in > JoinBenchmark > -- > > Key: SPARK-21932 > URL: https://issues.apache.org/jira/browse/SPARK-21932 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: caoxuewen >Priority: Minor > > this issue to fix two problems: > 1. add new two test case. test CartesianProduct join and > BroadcastNestedLoop join. We understand the effect of CodeGen on them. and > through these two test cases, we can know Per Row (NS), is much more delayed > than our previous hash join. > 2.similar 'logical.Join', is not right, to be consistent in > SparkStrategies.scala. fix to remove package name. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark
[ https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-21932: -- Priority: Minor (was: Major) > Add test CartesianProduct join case and BroadcastNestedLoop join case in > JoinBenchmark > -- > > Key: SPARK-21932 > URL: https://issues.apache.org/jira/browse/SPARK-21932 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: caoxuewen >Priority: Minor > > this issue to fix two problems: > 1. add new two test case. test CartesianProduct join and > BroadcastNestedLoop join. We understand the effect of CodeGen on them. and > through these two test cases, we can know Per Row (NS), is much more delayed > than our previous hash join. > 2.similar 'logical.Join', is not right, to be consistent in > SparkStrategies.scala. fix to remove package name. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark
[ https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-21932: -- Description: this issue to fix two problems: 1. add new two test case. test CartesianProduct join and BroadcastNestedLoop join. We understand the effect of CodeGen on them. and through these two test cases, we can know Per Row (NS), is much more delayed than our previous hash join. 2.similar 'logical.Join', is not right, to be consistent in SparkStrategies.scala. fix to remove package name. was: this issue to fix two problems: 1. similar 'logical.Join', is not right, to be consistent in SparkStrategies.scala. fix to remove package name. 2. add two test case. test CartesianProduct join and BroadcastNestedLoop join. We understand the effect of CodeGen on them. > Add test CartesianProduct join case and BroadcastNestedLoop join case in > JoinBenchmark > -- > > Key: SPARK-21932 > URL: https://issues.apache.org/jira/browse/SPARK-21932 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: caoxuewen > > this issue to fix two problems: > 1. add new two test case. test CartesianProduct join and > BroadcastNestedLoop join. We understand the effect of CodeGen on them. and > through these two test cases, we can know Per Row (NS), is much more delayed > than our previous hash join. > 2.similar 'logical.Join', is not right, to be consistent in > SparkStrategies.scala. fix to remove package name. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark
[ https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen reopened SPARK-21932: --- > Add test CartesianProduct join case and BroadcastNestedLoop join case in > JoinBenchmark > -- > > Key: SPARK-21932 > URL: https://issues.apache.org/jira/browse/SPARK-21932 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: caoxuewen > > this issue to fix two problems: > 1. add new two test case. test CartesianProduct join and > BroadcastNestedLoop join. We understand the effect of CodeGen on them. and > through these two test cases, we can know Per Row (NS), is much more delayed > than our previous hash join. > 2.similar 'logical.Join', is not right, to be consistent in > SparkStrategies.scala. fix to remove package name. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark
[ https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-21932: -- Summary: Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark (was: remove package name similar 'logical.Join' to 'Join' in JoinSelection) > Add test CartesianProduct join case and BroadcastNestedLoop join case in > JoinBenchmark > -- > > Key: SPARK-21932 > URL: https://issues.apache.org/jira/browse/SPARK-21932 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: caoxuewen > > this issue to fix two problems: > 1. similar 'logical.Join', is not right, to be consistent in > SparkStrategies.scala. fix to remove package name. > 2. add two test case. test CartesianProduct join and BroadcastNestedLoop > join. We understand the effect of CodeGen on them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21932) remove package name similar 'logical.Join' to 'Join' in JoinSelection
caoxuewen created SPARK-21932: - Summary: remove package name similar 'logical.Join' to 'Join' in JoinSelection Key: SPARK-21932 URL: https://issues.apache.org/jira/browse/SPARK-21932 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 2.3.0 Reporter: caoxuewen This issue fixes two problems: 1. the package-qualified 'logical.Join' is inconsistent with the rest of SparkStrategies.scala, so remove the package name and use plain 'Join'. 2. add two test cases, a CartesianProduct join and a BroadcastNestedLoop join, so the effect of codegen on them can be measured (a sketch of such a case appears below). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
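A hedged sketch of what such a benchmark case might look like, assuming the runBenchmark(name, cardinality) helper and the sparkSession available to JoinBenchmark in Spark's test sources; the cardinality and column names are illustrative, not the merged test.
```
// Hypothetical JoinBenchmark-style case: force a cartesian product by joining
// without a condition, so codegen's effect on CartesianProductExec shows up
// in the per-row timings.
val N = 2 << 10
runBenchmark("cartesian product join", N.toLong * N) {
  val df1 = sparkSession.range(N).selectExpr("id as k1")
  val df2 = sparkSession.range(N).selectExpr("id as k2")
  df1.crossJoin(df2).count()
}
```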
[jira] [Closed] (SPARK-21746) nondeterministic expressions incorrectly for filter predicates
[ https://issues.apache.org/jira/browse/SPARK-21746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen closed SPARK-21746. - Resolution: Not A Problem > nondeterministic expressions incorrectly for filter predicates > -- > > Key: SPARK-21746 > URL: https://issues.apache.org/jira/browse/SPARK-21746 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: caoxuewen > > Currently, We do interpretedpredicate optimization, but not very well, > because when our filter contained an indeterminate expression, it would have > an exception. This PR describes solving this problem by adding the initialize > method in InterpretedPredicate. > java.lang.IllegalArgumentException: > java.lang.IllegalArgumentException: requirement failed: Nondeterministic > expression org.apache.spark.sql.catalyst.expressions.Rand should be > initialized before eval. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:291) > at > org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:415) > at > org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:38) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$$anonfun$prunePartitionsByFilter$1.apply(ExternalCatalogUtils.scala:158) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$$anonfun$prunePartitionsByFilter$1.apply(ExternalCatalogUtils.scala:157) > at scala.collection.immutable.Stream.filter(Stream.scala:519) > at scala.collection.immutable.Stream.filter(Stream.scala:202) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.prunePartitionsByFilter(ExternalCatalogUtils.scala:157) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1129) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1119) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1119) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:925) > at > org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:60) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21746) nondeterministic expressions incorrectly for filter predicates
[ https://issues.apache.org/jira/browse/SPARK-21746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-21746: -- Summary: nondeterministic expressions incorrectly for filter predicates (was: nondeterministic expressions correctly for filter predicates) > nondeterministic expressions incorrectly for filter predicates > -- > > Key: SPARK-21746 > URL: https://issues.apache.org/jira/browse/SPARK-21746 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: caoxuewen > > Currently, We do interpretedpredicate optimization, but not very well, > because when our filter contained an indeterminate expression, it would have > an exception. This PR describes solving this problem by adding the initialize > method in InterpretedPredicate. > java.lang.IllegalArgumentException: > java.lang.IllegalArgumentException: requirement failed: Nondeterministic > expression org.apache.spark.sql.catalyst.expressions.Rand should be > initialized before eval. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:291) > at > org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:415) > at > org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:38) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$$anonfun$prunePartitionsByFilter$1.apply(ExternalCatalogUtils.scala:158) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$$anonfun$prunePartitionsByFilter$1.apply(ExternalCatalogUtils.scala:157) > at scala.collection.immutable.Stream.filter(Stream.scala:519) > at scala.collection.immutable.Stream.filter(Stream.scala:202) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.prunePartitionsByFilter(ExternalCatalogUtils.scala:157) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1129) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1119) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1119) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:925) > at > org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:60) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26) -- This message was sent by Atlassian JIRA 
(v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21746) nondeterministic expressions correctly for filter predicates
caoxuewen created SPARK-21746: - Summary: nondeterministic expressions correctly for filter predicates Key: SPARK-21746 URL: https://issues.apache.org/jira/browse/SPARK-21746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: caoxuewen Currently, We do interpretedpredicate optimization, but not very well, because when our filter contained an indeterminate expression, it would have an exception. This PR describes solving this problem by adding the initialize method in InterpretedPredicate. java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: requirement failed: Nondeterministic expression org.apache.spark.sql.catalyst.expressions.Rand should be initialized before eval. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:291) at org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:415) at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:38) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$$anonfun$prunePartitionsByFilter$1.apply(ExternalCatalogUtils.scala:158) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$$anonfun$prunePartitionsByFilter$1.apply(ExternalCatalogUtils.scala:157) at scala.collection.immutable.Stream.filter(Stream.scala:519) at scala.collection.immutable.Stream.filter(Stream.scala:202) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.prunePartitionsByFilter(ExternalCatalogUtils.scala:157) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1129) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1119) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1119) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:925) at org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73) at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:60) at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27) at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
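A rough sketch of the fix direction described above, i.e. giving the interpreted predicate an initialize step that prepares nondeterministic expressions (such as rand) before eval is called; the class shape is modeled on Catalyst's InterpretedPredicate and is illustrative, not the actual patch.
```
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, Nondeterministic}

case class InterpretedPredicate(expression: Expression) extends (InternalRow => Boolean) {
  // Walk the expression tree and initialize every nondeterministic node with
  // the partition index before any row is evaluated.
  def initialize(partitionIndex: Int): Unit = expression.foreach {
    case n: Nondeterministic => n.initialize(partitionIndex)
    case _ =>
  }

  override def apply(row: InternalRow): Boolean =
    expression.eval(row).asInstanceOf[Boolean]
}
```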
[jira] [Created] (SPARK-21707) Improvement a special case for non-deterministic filters in optimizer
caoxuewen created SPARK-21707: - Summary: Improvement a special case for non-deterministic filters in optimizer Key: SPARK-21707 URL: https://issues.apache.org/jira/browse/SPARK-21707 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: caoxuewen
Currently the optimizer does a lot of special handling for non-deterministic projects and filters, but it is not good enough. This patch adds a new special case for non-deterministic filters, so that only the fields the user actually needs are read when the filter condition is nondeterministic. For example, when the filter condition contains a nondeterministic function (the rand function), the optimizer generates the following HiveTableScans plans:
```
HiveTableScans plan:
Aggregate [k#2L], [k#2L, k#2L, sum(cast(id#1 as bigint)) AS sum(id)#395L]
+- Project [d004#205 AS id#1, CEIL(c010#214) AS k#2L]
   +- Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
      +- MetastoreRelation XXX_database, XXX_table

HiveTableScans plan:
Project [d004#205 AS id#1, CEIL(c010#214) AS k#2L]
+- Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
   +- MetastoreRelation XXX_database, XXX_table

HiveTableScans plan:
Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
+- MetastoreRelation XXX_database, XXX_table

HiveTableScans plan:
MetastoreRelation XXX_database, XXX_table
```
So HiveTableScan reads all the fields of the table, although only ‘d004’ and ‘c010’ are needed; this hurts task performance.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
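For reference, a query shape that produces a plan like the one above (spark is assumed to be an active SparkSession, and XXX_table stands in for the real Hive table); only d004 and c010 are consumed, yet the scan reads every column.
```
// Nondeterministic filter (rand) above a Hive table scan: the filter blocks
// column pruning, so the scan reads all columns even though the query only
// uses d004 and c010.
val df = spark.table("XXX_database.XXX_table")
  .where("d004 IS NOT NULL AND rand() <= 0.5")
  .selectExpr("d004 AS id", "CEIL(c010) AS k")
  .groupBy("k")
  .sum("id")
df.explain()
```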
[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects in optimizer
[ https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-21520: -- Description: Currently, Did a lot of special handling for non-deterministic projects and filters in optimizer. but not good enough. this patch add a new special case for non-deterministic projects. Deal with that we only need to read user needs fields for non-deterministic projects in optimizer. For example, the fields of project contains nondeterministic function(rand function), after a executedPlan optimizer generated: *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L]) +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L] +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table HiveTableScan will read all the fields from table. but we only need to ‘d004’ . it will affect the performance of task. was: Currently, Did a lot of special handling for non-deterministic projects and filters in optimizer. but not good enough. this patch add a new special case for non-deterministic projects and filters. Deal with that we only need to read user needs fields for non-deterministic projects and filters in optimizer. For example, the fields of project contains nondeterministic function(rand function), after a executedPlan optimizer generated: *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L]) +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L] +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table HiveTableScan will read all the fields from table. but we only need to ‘d004’ . it will affect the performance of task. > Improvement a special case for non-deterministic projects in optimizer > -- > > Key: SPARK-21520 > URL: https://issues.apache.org/jira/browse/SPARK-21520 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: caoxuewen > > Currently, Did a lot of special handling for non-deterministic projects and > filters in optimizer. but not good enough. this patch add a new special case > for non-deterministic projects. Deal with that we only need to read user > needs fields for non-deterministic projects in optimizer. > For example, the fields of project contains nondeterministic function(rand > function), after a executedPlan optimizer generated: > *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as > bigint))], output=[k#403L, sum#800L]) > +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) > AS k#403L] >+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, > d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, > d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, > c023#625, c024#626, c025#627, c026#628, c027#629, ... 
169 more fields], > MetastoreRelation XXX_database, XXX_table > HiveTableScan will read all the fields from table. but we only need to ‘d004’ > . it will affect the performance of task. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects in optimizer
[ https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-21520: -- Summary: Improvement a special case for non-deterministic projects in optimizer (was: Improvement a special case for non-deterministic projects and filters in optimizer) > Improvement a special case for non-deterministic projects in optimizer > -- > > Key: SPARK-21520 > URL: https://issues.apache.org/jira/browse/SPARK-21520 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: caoxuewen > > Currently, Did a lot of special handling for non-deterministic projects and > filters in optimizer. but not good enough. this patch add a new special case > for non-deterministic projects and filters. Deal with that we only need to > read user needs fields for non-deterministic projects and filters in > optimizer. > For example, the fields of project contains nondeterministic function(rand > function), after a executedPlan optimizer generated: > *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as > bigint))], output=[k#403L, sum#800L]) > +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) > AS k#403L] >+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, > d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, > d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, > c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], > MetastoreRelation XXX_database, XXX_table > HiveTableScan will read all the fields from table. but we only need to ‘d004’ > . it will affect the performance of task. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21705) Add spark.internal.config parameter description
caoxuewen created SPARK-21705: - Summary: Add spark.internal.config parameter description Key: SPARK-21705 URL: https://issues.apache.org/jira/browse/SPARK-21705 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: caoxuewen Currently, some of the configuration parameters defined in spark.internal.config have no description, which is not right. Based on the Property Name entries at http://spark.apache.org/docs/latest/configuration.html, this PR supplements the spark.internal.config parameter descriptions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
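A small sketch of the kind of change described, assuming the internal ConfigBuilder API; the entry chosen and the wording of the description are illustrative, taken from the public configuration page rather than from the actual patch.
```
import org.apache.spark.internal.config.ConfigBuilder
import org.apache.spark.network.util.ByteUnit

// Attach a .doc(...) description, copied from the public configuration page,
// to an internal config entry that previously had none.
private[spark] val DRIVER_MEMORY = ConfigBuilder("spark.driver.memory")
  .doc("Amount of memory to use for the driver process, in MiB unless otherwise specified.")
  .bytesConf(ByteUnit.MiB)
  .createWithDefaultString("1g")
```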
[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer
[ https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-21520: -- Description: Currently, Did a lot of special handling for non-deterministic projects and filters in optimizer. but not good enough. this patch add a new special case for non-deterministic projects and filters. Deal with that we only need to read user needs fields for non-deterministic projects and filters in optimizer. For example, the fields of project contains nondeterministic function(rand function), after a executedPlan optimizer generated: *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L]) +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L] +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table HiveTableScan will read all the fields from table. but we only need to ‘d004’ . it will affect the performance of task. was: Currently, Did a lot of special handling for non-deterministic projects and filters in optimizer. but not good enough. this patch add a new special case for non-deterministic projects and filters. Deal with that we only need to read user needs fields for non-deterministic projects and filters in optimizer. > Improvement a special case for non-deterministic projects and filters in > optimizer > -- > > Key: SPARK-21520 > URL: https://issues.apache.org/jira/browse/SPARK-21520 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: caoxuewen > > Currently, Did a lot of special handling for non-deterministic projects and > filters in optimizer. but not good enough. this patch add a new special case > for non-deterministic projects and filters. Deal with that we only need to > read user needs fields for non-deterministic projects and filters in > optimizer. > For example, the fields of project contains nondeterministic function(rand > function), after a executedPlan optimizer generated: > *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as > bigint))], output=[k#403L, sum#800L]) > +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) > AS k#403L] >+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, > d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, > d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, > c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], > MetastoreRelation XXX_database, XXX_table > HiveTableScan will read all the fields from table. but we only need to ‘d004’ > . it will affect the performance of task. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer
[ https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119617#comment-16119617 ] caoxuewen commented on SPARK-21520: --- User 'heary-cao' has created a pull request for this issue: https://github.com/apache/spark/pull/18892 > Improvement a special case for non-deterministic projects and filters in > optimizer > -- > > Key: SPARK-21520 > URL: https://issues.apache.org/jira/browse/SPARK-21520 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: caoxuewen > > Currently, Did a lot of special handling for non-deterministic projects and > filters in optimizer. but not good enough. this patch add a new special case > for non-deterministic projects and filters. Deal with that we only need to > read user needs fields for non-deterministic projects and filters in > optimizer. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer
[ https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-21520: -- Description: Currently, Did a lot of special handling for non-deterministic projects and filters in optimizer. but not good enough. this patch add a new special case for non-deterministic projects and filters. Deal with that we only need to read user needs fields for non-deterministic projects and filters in optimizer. was: Currently, when the rand function is present in the SQL statement, hivetable searches all columns in the table. e.g: select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, ceil(c010) as cceila from XXX_table) a group by k,k; generate WholeStageCodegen subtrees: == Subtree 1 / 2 == *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L]) +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L] +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table == Subtree 2 / 2 == *HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], output=[k#403L, k#403L, sum(id)#797L]) +- Exchange hashpartitioning(k#403L, 200) +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L]) +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L] +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table All columns will be searched in HiveTableScans , Consequently, All column data is read to a ORC table. e.g: INFO ReaderImpl: Reading ORC rows from hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 with {include: [true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true], offset: 0, length: 9223372036854775807} so, The execution of the SQL statement will become very slow. > Improvement a special case for non-deterministic projects and filters in > optimizer > -- > > Key: SPARK-21520 > URL: https://issues.apache.org/jira/browse/SPARK-21520 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: caoxuewen > > Currently, Did a lot of special handling for non-deterministic projects and > filters in optimizer. but not good enough. this patch add a new special case > for non-deterministic projects and filters. 
Deal with that we only need to > read user needs fields for non-deterministic projects and filters in > optimizer. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer
[ https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-21520: -- Summary: Improvement a special case for non-deterministic projects and filters in optimizer (was: Hivetable scan for all the columns the SQL statement contains the 'rand') > Improvement a special case for non-deterministic projects and filters in > optimizer > -- > > Key: SPARK-21520 > URL: https://issues.apache.org/jira/browse/SPARK-21520 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: caoxuewen > > Currently, when the rand function is present in the SQL statement, hivetable > searches all columns in the table. > e.g: > select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, > ceil(c010) as cceila from XXX_table) a > group by k,k; > generate WholeStageCodegen subtrees: > == Subtree 1 / 2 == > *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as > bigint))], output=[k#403L, sum#800L]) > +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) > AS k#403L] >+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, > d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, > d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, > c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], > MetastoreRelation XXX_database, XXX_table > == Subtree 2 / 2 == > *HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], > output=[k#403L, k#403L, sum(id)#797L]) > +- Exchange hashpartitioning(k#403L, 200) >+- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as > bigint))], output=[k#403L, sum#800L]) > +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * > 1.0)) AS k#403L] > +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, > d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, > d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, > c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], > MetastoreRelation XXX_database, XXX_table > > All columns will be searched in HiveTableScans , Consequently, All column > data is read to a ORC table. > e.g: > INFO ReaderImpl: Reading ORC rows from > hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 > with {include: [true, true, true, true, true, true, true, true, true, true, > true, true, true, true, true, true, true, true, true, true, true, true, true, > true, true, true, true, true, true, true, true, true, true, true, true, true, > true, true, true, true, true, true, true, true, true, true, true, true, true, > true, true, true, true, true, true, true, true, true, true, true, true, true, > true, true, true, true, true, true, true, true, true, true, true, true, true, > true, true, true, true, true, true, true, true, true, true, true, true, true, > true, true, true, true, true, true, true, true, true, true, true, true, true, > true, true, true, true, true, true, true, true, true, true, true, true, true, > true, true, true, true, true, true, true], offset: 0, length: > 9223372036854775807} > so, The execution of the SQL statement will become very slow. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21520) Hivetable scan for all the columns the SQL statement contains the 'rand'
[ https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-21520: -- Description: Currently, when the rand function is present in the SQL statement, hivetable searches all columns in the table. e.g: select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, ceil(c010) as cceila from XXX_table) a group by k,k; generate WholeStageCodegen subtrees: == Subtree 1 / 2 == *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L]) +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L] +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table == Subtree 2 / 2 == *HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], output=[k#403L, k#403L, sum(id)#797L]) +- Exchange hashpartitioning(k#403L, 200) +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L]) +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L] +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table All columns will be searched in HiveTableScans , Consequently, All column data is read to a ORC table. e.g: INFO ReaderImpl: Reading ORC rows from hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 with {include: [true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true], offset: 0, length: 9223372036854775807} so, The execution of the SQL statement will become very slow. was: Currently, when the rand function is present in the SQL statement, hivetable searches all columns in the table. e.g: select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, ceil(c010) as cceila from XXX_table) a group by k,k; generate WholeStageCodegen subtrees: == Subtree 1 / 2 == *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L]) +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L] +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 
169 more fields], MetastoreRelation XXX_database, XXX_table == Subtree 2 / 2 == *HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], output=[k#403L, k#403L, sum(id)#797L]) +- Exchange hashpartitioning(k#403L, 200) +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L]) +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L] +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table All columns will be searched in HiveTableScans , Consequently, All column data is read to a ORC table. e.g: INFO ReaderImpl: Reading ORC rows from hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 with {include: [true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true,
[jira] [Created] (SPARK-21520) Hivetable scan for all the columns the SQL statement contains the 'rand'
caoxuewen created SPARK-21520: - Summary: Hivetable scan for all the columns the SQL statement contains the 'rand' Key: SPARK-21520 URL: https://issues.apache.org/jira/browse/SPARK-21520 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: caoxuewen Currently, when the rand function is present in the SQL statement, hivetable searches all columns in the table. e.g: select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, ceil(c010) as cceila from XXX_table) a group by k,k; generate WholeStageCodegen subtrees: == Subtree 1 / 2 == *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L]) +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L] +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table == Subtree 2 / 2 == *HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], output=[k#403L, k#403L, sum(id)#797L]) +- Exchange hashpartitioning(k#403L, 200) +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], output=[k#403L, sum#800L]) +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS k#403L] +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation XXX_database, XXX_table All columns will be searched in HiveTableScans , Consequently, All column data is read to a ORC table. e.g: INFO ReaderImpl: Reading ORC rows from hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 with {include: [true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true, true], offset: 0, length: 9223372036854775807} so, The execution of the SQL statement will become very slow. solution: Set the property of the rand expression, deterministic = true -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
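The query from the description, written out through SparkSession for reference (spark is assumed to be an active SparkSession); because FLOOR(rand() * 1) is nondeterministic, the Project is not pruned and the HiveTableScan ends up reading all 190-plus columns instead of just d004 and c010.
```
// Same SQL as in the description; explain() shows the HiveTableScan listing
// every column of XXX_table, even though only d004 and c010 are used.
val a = spark.sql(
  """SELECT k, k, SUM(id)
    |FROM (SELECT d004 AS id, FLOOR(rand() * 1) AS k, CEIL(c010) AS cceila
    |      FROM XXX_table) a
    |GROUP BY k, k""".stripMargin)
a.explain()
```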
[jira] [Created] (SPARK-21353) add checkValue in spark.internal.config about how to correctly set configurations
caoxuewen created SPARK-21353: - Summary: add checkValue in spark.internal.config about how to correctly set configurations Key: SPARK-21353 URL: https://issues.apache.org/jira/browse/SPARK-21353 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: caoxuewen Priority: Minor add checkValue for configurations: spark.driver.memory spark.executor.memory spark.blockManager.port spark.task.cpus spark.dynamicAllocation.minExecutors spark.task.maxFailures spark.blacklist.task.maxTaskAttemptsPerExecutor spark.blacklist.task.maxTaskAttemptsPerNode spark.blacklist.application.maxFailedTasksPerExecutor spark.blacklist.stage.maxFailedTasksPerExecutor spark.blacklist.application.maxFailedExecutorsPerNode spark.blacklist.stage.maxFailedExecutorsPerNode spark.scheduler.listenerbus.metrics.maxListenerClassesTimed spark.ui.retainedTasks spark.blockManager.port spark.driver.blockManager.port spark.files.maxPartitionBytes spark.files.openCostInBytes spark.shuffle.accurateBlockThreshold spark.shuffle.registration.timeout spark.shuffle.registration.maxAttempts spark.reducer.maxReqSizeShuffleToMem and we copy the document from http://spark.apache.org/docs/latest/configuration.html spark.driver.userClassPathFirst spark.driver.memory spark.executor.userClassPathFirst spark.executor.memory spark.yarn.isPython spark.task.cpus spark.dynamicAllocation.minExecutors spark.dynamicAllocation.initialExecutors spark.dynamicAllocation.maxExecutors spark.shuffle.service.enabled spark.submit.pyFiles spark.task.maxFailures spark.blacklist.enabled spark.blacklist.task.maxTaskAttemptsPerExecutor spark.blacklist.task.maxTaskAttemptsPerNode spark.blacklist.stage.maxFailedTasksPerExecutor spark.blacklist.stage.maxFailedExecutorsPerNode spark.pyspark.driver.python spark.pyspark.python spark.ui.retainedTasks spark.io.encryption.enabled spark.io.encryption.keygen.algorithm spark.io.encryption.keySizeBits spark.authenticate.enableSaslEncryption spark.authenticate -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
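A brief sketch of the kind of validation being added, assuming the internal ConfigBuilder API; the specific entry (spark.task.cpus) is one of those listed above, while the error message and default are illustrative.
```
import org.apache.spark.internal.config.ConfigBuilder

// Reject invalid values when the configuration is read, instead of failing
// later in scheduling: spark.task.cpus must be a positive integer.
private[spark] val CPUS_PER_TASK = ConfigBuilder("spark.task.cpus")
  .doc("Number of cores to allocate for each task.")
  .intConf
  .checkValue(_ > 0, "spark.task.cpus must be a positive number")
  .createWithDefault(1)
```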
[jira] [Updated] (SPARK-20950) add a new config to diskWriteBufferSize which is hard coded before
[ https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20950: -- Description: This PR Improvement in two: 1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of ShuffleExternalSorter. when change the size of the diskWriteBufferSize to test forceSorterToSpill The average performance of running 10 times is as follows:(their unit is MS). {quote}diskWriteBufferSize: 1M512K256K128K64K32K 16K8K4K --- RecordSize = 2.5M 742 722 694 686 667668671 669 683 RecordSize = 1M294 293 292 287 283285281 279 285{quote} 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in mergeSpillsWithFileStream function was: This PR Improvement in two: 1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of ShuffleExternalSorter. when change the size of the diskWriteBufferSize to test forceSorterToSpill The average performance of running 10 times is as follows:(their unit is MS). bq. bq. diskWriteBufferSize: 1M512K256K128K64K32K16K 8K4K bq. --- bq. RecordSize = 2.5M 742 722 694 686 667668671 669 683 bq. RecordSize = 1M294 293 292 287 283285281 279 285 bq. 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in mergeSpillsWithFileStream function > add a new config to diskWriteBufferSize which is hard coded before > -- > > Key: SPARK-20950 > URL: https://issues.apache.org/jira/browse/SPARK-20950 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: caoxuewen >Priority: Trivial > > This PR Improvement in two: > 1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize > of ShuffleExternalSorter. > when change the size of the diskWriteBufferSize to test forceSorterToSpill > The average performance of running 10 times is as follows:(their unit is MS). > > {quote}diskWriteBufferSize: 1M512K256K128K64K32K > 16K8K4K > --- > RecordSize = 2.5M 742 722 694 686 667668671 > 669 683 > RecordSize = 1M294 293 292 287 283285281 > 279 285{quote} > > 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in > mergeSpillsWithFileStream function -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20950) add a new config to diskWriteBufferSize which is hard coded before
[ https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20950:
--
Description:
This PR improves two things:
1. Make diskWriteBufferSize of ShuffleExternalSorter configurable with spark.shuffle.spill.diskWriteBufferSize.
When changing diskWriteBufferSize to test forceSorterToSpill, the average of 10 runs is as follows (unit: ms):
bq. diskWriteBufferSize:   1M     512K   256K   128K   64K    32K    16K    8K     4K
bq. ----------------------------------------------------------------------------------
bq. RecordSize = 2.5M      742    722    694    686    667    668    671    669    683
bq. RecordSize = 1M        294    293    292    287    283    285    281    279    285
2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them locally in the mergeSpillsWithFileStream function.

was:
This PR improves two things:
1. Make diskWriteBufferSize of ShuffleExternalSorter configurable with spark.shuffle.spill.diskWriteBufferSize.
When changing diskWriteBufferSize to test forceSorterToSpill, the average of 10 runs is as follows (unit: ms):
{quote}
diskWriteBufferSize:   1M     512K   256K   128K   64K    32K    16K    8K     4K
----------------------------------------------------------------------------------
RecordSize = 2.5M      742    722    694    686    667    668    671    669    683
RecordSize = 1M        294    293    292    287    283    285    281    279    285
{quote}
2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them locally in the mergeSpillsWithFileStream function.

> add a new config to diskWriteBufferSize which is hard coded before
> -------------------------------------------------------------------
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: caoxuewen
> Priority: Trivial
>
> This PR improves two things:
> 1. Make diskWriteBufferSize of ShuffleExternalSorter configurable with spark.shuffle.spill.diskWriteBufferSize.
> When changing diskWriteBufferSize to test forceSorterToSpill, the average of 10 runs is as follows (unit: ms):
> bq. diskWriteBufferSize:   1M     512K   256K   128K   64K    32K    16K    8K     4K
> bq. ----------------------------------------------------------------------------------
> bq. RecordSize = 2.5M      742    722    694    686    667    668    671    669    683
> bq. RecordSize = 1M        294    293    292    287    283    285    281    279    285
> 2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them locally in the mergeSpillsWithFileStream function.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20950) add a new config to diskWriteBufferSize which is hard coded before
[ https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20950:
--
Description:
This PR improves two things:
1. Make diskWriteBufferSize of ShuffleExternalSorter configurable with spark.shuffle.spill.diskWriteBufferSize.
When changing diskWriteBufferSize to test forceSorterToSpill, the average of 10 runs is as follows (unit: ms):
~diskWriteBufferSize:   1M     512K   256K   128K   64K    32K    16K    8K     4K
----------------------------------------------------------------------------------
RecordSize = 2.5M      742    722    694    686    667    668    671    669    683
RecordSize = 1M        294    293    292    287    283    285    281    279    285~
2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them locally in the mergeSpillsWithFileStream function.

was:
This PR improves two things:
1. Make diskWriteBufferSize of ShuffleExternalSorter configurable with spark.shuffle.spill.diskWriteBufferSize.
When changing diskWriteBufferSize to test forceSorterToSpill, the average of 10 runs is as follows (unit: ms):
diskWriteBufferSize:   1M     512K   256K   128K   64K    32K    16K    8K     4K
----------------------------------------------------------------------------------
RecordSize = 2.5M      742    722    694    686    667    668    671    669    683
RecordSize = 1M        294    293    292    287    283    285    281    279    285
2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them locally in the mergeSpillsWithFileStream function.

> add a new config to diskWriteBufferSize which is hard coded before
> -------------------------------------------------------------------
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: caoxuewen
> Priority: Trivial
>
> This PR improves two things:
> 1. Make diskWriteBufferSize of ShuffleExternalSorter configurable with spark.shuffle.spill.diskWriteBufferSize.
> When changing diskWriteBufferSize to test forceSorterToSpill, the average of 10 runs is as follows (unit: ms):
> ~diskWriteBufferSize:   1M     512K   256K   128K   64K    32K    16K    8K     4K
> ----------------------------------------------------------------------------------
> RecordSize = 2.5M      742    722    694    686    667    668    671    669    683
> RecordSize = 1M        294    293    292    287    283    285    281    279    285~
> 2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them locally in the mergeSpillsWithFileStream function.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20950) add a new config to diskWriteBufferSize which is hard coded before
[ https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20950:
--
Description:
This PR improves two things:
1. Make diskWriteBufferSize of ShuffleExternalSorter configurable with spark.shuffle.spill.diskWriteBufferSize.
When changing diskWriteBufferSize to test forceSorterToSpill, the average of 10 runs is as follows (unit: ms):
{quote}
diskWriteBufferSize:   1M     512K   256K   128K   64K    32K    16K    8K     4K
----------------------------------------------------------------------------------
RecordSize = 2.5M      742    722    694    686    667    668    671    669    683
RecordSize = 1M        294    293    292    287    283    285    281    279    285
{quote}
2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them locally in the mergeSpillsWithFileStream function.

was:
This PR improves two things:
1. Make diskWriteBufferSize of ShuffleExternalSorter configurable with spark.shuffle.spill.diskWriteBufferSize.
When changing diskWriteBufferSize to test forceSorterToSpill, the average of 10 runs is as follows (unit: ms):
~diskWriteBufferSize:   1M     512K   256K   128K   64K    32K    16K    8K     4K
----------------------------------------------------------------------------------
RecordSize = 2.5M      742    722    694    686    667    668    671    669    683
RecordSize = 1M        294    293    292    287    283    285    281    279    285~
2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them locally in the mergeSpillsWithFileStream function.

> add a new config to diskWriteBufferSize which is hard coded before
> -------------------------------------------------------------------
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: caoxuewen
> Priority: Trivial
>
> This PR improves two things:
> 1. Make diskWriteBufferSize of ShuffleExternalSorter configurable with spark.shuffle.spill.diskWriteBufferSize.
> When changing diskWriteBufferSize to test forceSorterToSpill, the average of 10 runs is as follows (unit: ms):
> {quote}
> diskWriteBufferSize:   1M     512K   256K   128K   64K    32K    16K    8K     4K
> ----------------------------------------------------------------------------------
> RecordSize = 2.5M      742    722    694    686    667    668    671    669    683
> RecordSize = 1M        294    293    292    287    283    285    281    279    285
> {quote}
> 2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them locally in the mergeSpillsWithFileStream function.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20950) add a new config to diskWriteBufferSize which is hard coded before
[ https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20950:
--
Description:
This PR improves two things:
1. Make diskWriteBufferSize of ShuffleExternalSorter configurable with spark.shuffle.spill.diskWriteBufferSize.
When changing diskWriteBufferSize to test forceSorterToSpill, the average of 10 runs is as follows (unit: ms):
diskWriteBufferSize:   1M     512K   256K   128K   64K    32K    16K    8K     4K
----------------------------------------------------------------------------------
RecordSize = 2.5M      742    722    694    686    667    668    671    669    683
RecordSize = 1M        294    293    292    287    283    285    281    279    285
2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them locally in the mergeSpillsWithFileStream function.

was:
1. Make diskWriteBufferSize of ShuffleExternalSorter configurable with spark.shuffle.spill.diskWriteBufferSize.
2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them locally in the mergeSpillsWithFileStream function.

Summary: add a new config to diskWriteBufferSize which is hard coded before (was: Improve diskWriteBufferSize configurable)

> add a new config to diskWriteBufferSize which is hard coded before
> -------------------------------------------------------------------
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: caoxuewen
> Priority: Trivial
>
> This PR improves two things:
> 1. Make diskWriteBufferSize of ShuffleExternalSorter configurable with spark.shuffle.spill.diskWriteBufferSize.
> When changing diskWriteBufferSize to test forceSorterToSpill, the average of 10 runs is as follows (unit: ms):
> diskWriteBufferSize:   1M     512K   256K   128K   64K    32K    16K    8K     4K
> ----------------------------------------------------------------------------------
> RecordSize = 2.5M      742    722    694    686    667    668    671    669    683
> RecordSize = 1M        294    293    292    287    283    285    281    279    285
> 2. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them locally in the mergeSpillsWithFileStream function.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.
[ https://issues.apache.org/jira/browse/SPARK-20999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen closed SPARK-20999. - Not A Problem > No failure Stages, no log 'DAGScheduler: failed: Set()' output. > --- > > Key: SPARK-20999 > URL: https://issues.apache.org/jira/browse/SPARK-20999 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.2.1 >Reporter: caoxuewen >Priority: Trivial > > In the output of the spark log information: > INFO DAGScheduler: looking for newly runnable stages > INFO DAGScheduler: running: Set(ShuffleMapStage 14) > INFO DAGScheduler: waiting: Set(ResultStage 15) > INFO DAGScheduler: failed: Set() > If there is no failure stage, "INFO DAGScheduler: failed: Set()" is no need > to output in the log information. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.
caoxuewen created SPARK-20999: - Summary: No failure Stages, no log 'DAGScheduler: failed: Set()' output. Key: SPARK-20999 URL: https://issues.apache.org/jira/browse/SPARK-20999 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 2.2.1 Reporter: caoxuewen Priority: Trivial
In the Spark log output:
INFO DAGScheduler: looking for newly runnable stages
INFO DAGScheduler: running: Set(ShuffleMapStage 14)
INFO DAGScheduler: waiting: Set(ResultStage 15)
INFO DAGScheduler: failed: Set()
If there are no failed stages, the line "INFO DAGScheduler: failed: Set()" does not need to be written to the log.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
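A minimal sketch of the suggested change inside DAGScheduler, assuming the sets behind these log lines are the scheduler's runningStages, waitingStages, and failedStages collections; this is an illustration, not the submitted patch:

{code:scala}
// Illustrative only: emit the "failed:" line only when there is something to report.
logInfo("running: " + runningStages)
logInfo("waiting: " + waitingStages)
if (failedStages.nonEmpty) {
  logInfo("failed: " + failedStages)
}
{code}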
[jira] [Updated] (SPARK-20950) Improve diskWriteBufferSize configurable
[ https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20950: -- Description: 1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of ShuffleExternalSorter. 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in mergeSpillsWithFileStream function. was: 1.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize of UnsafeShuffleWriter. 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in mergeSpillsWithFileStream function. > Improve diskWriteBufferSize configurable > > > Key: SPARK-20950 > URL: https://issues.apache.org/jira/browse/SPARK-20950 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: caoxuewen >Priority: Trivial > > 1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize > of ShuffleExternalSorter. > 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in > mergeSpillsWithFileStream function. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20950) Improve diskWriteBufferSize configurable
[ https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20950: -- Summary: Improve diskWriteBufferSize configurable (was: Improve Serializerbuffersize configurable) > Improve diskWriteBufferSize configurable > > > Key: SPARK-20950 > URL: https://issues.apache.org/jira/browse/SPARK-20950 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: caoxuewen >Priority: Trivial > > 1.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize > of UnsafeShuffleWriter. > 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in > mergeSpillsWithFileStream function. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20959) Add a parameter to UnsafeExternalSorter to configure filebuffersize
[ https://issues.apache.org/jira/browse/SPARK-20959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034466#comment-16034466 ] caoxuewen commented on SPARK-20959: --- thanks for modify Priority > Add a parameter to UnsafeExternalSorter to configure filebuffersize > --- > > Key: SPARK-20959 > URL: https://issues.apache.org/jira/browse/SPARK-20959 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.2.0 >Reporter: caoxuewen >Priority: Trivial > > Improvement with spark.shuffle.file.buffer configure fileBufferSizeBytes in > UnsafeExternalSorter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20950) Improve Serializerbuffersize configurable
[ https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20950: -- Component/s: (was: SQL) > Improve Serializerbuffersize configurable > - > > Key: SPARK-20950 > URL: https://issues.apache.org/jira/browse/SPARK-20950 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: caoxuewen > > 1.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize > of UnsafeShuffleWriter. > 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in > mergeSpillsWithFileStream function. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20959) Add a parameter to UnsafeExternalSorter to configure filebuffersize
caoxuewen created SPARK-20959: - Summary: Add a parameter to UnsafeExternalSorter to configure filebuffersize Key: SPARK-20959 URL: https://issues.apache.org/jira/browse/SPARK-20959 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 2.2.0 Reporter: caoxuewen Improvement: configure fileBufferSizeBytes in UnsafeExternalSorter with spark.shuffle.file.buffer. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
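A hedged sketch of reading that setting instead of a hard-coded constant. spark.shuffle.file.buffer is a KiB-based setting; the default and variable name here are illustrative assumptions:

{code:scala}
import org.apache.spark.SparkConf

// Illustrative only: convert the KiB-based setting to bytes before sizing the buffer.
val conf = new SparkConf()
val fileBufferSizeBytes: Int =
  (conf.getSizeAsKb("spark.shuffle.file.buffer", "32k") * 1024).toInt
{code}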
[jira] [Updated] (SPARK-20950) Improve Serializerbuffersize configurable
[ https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20950: -- Description: 1.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize of UnsafeShuffleWriter. 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in mergeSpillsWithFileStream function. was: 1.With spark.shuffle.file.buffer configure fileBufferSizeBytes of UnsafeExternalSorter . 2.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize of UnsafeShuffleWriter. 3.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in mergeSpillsWithFileStream function. > Improve Serializerbuffersize configurable > - > > Key: SPARK-20950 > URL: https://issues.apache.org/jira/browse/SPARK-20950 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.2.0 >Reporter: caoxuewen > > 1.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize > of UnsafeShuffleWriter. > 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in > mergeSpillsWithFileStream function. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20950) Improve Serializerbuffersize configurable
[ https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20950: -- Summary: Improve Serializerbuffersize configurable (was: Improve Serializerbuffersize and filebuffersize configurable) > Improve Serializerbuffersize configurable > - > > Key: SPARK-20950 > URL: https://issues.apache.org/jira/browse/SPARK-20950 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.2.0 >Reporter: caoxuewen > > 1.With spark.shuffle.file.buffer configure fileBufferSizeBytes of > UnsafeExternalSorter . > 2.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize > of UnsafeShuffleWriter. > 3.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in > mergeSpillsWithFileStream function. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20950) Improve Serializerbuffersize and filebuffersize configurable
caoxuewen created SPARK-20950: - Summary: Improve Serializerbuffersize and filebuffersize configurable Key: SPARK-20950 URL: https://issues.apache.org/jira/browse/SPARK-20950 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 2.2.0 Reporter: caoxuewen
1. Configure fileBufferSizeBytes of UnsafeExternalSorter with spark.shuffle.file.buffer.
2. Configure SerializerBufferSize of UnsafeShuffleWriter with spark.shuffle.sort.initialSerBufferSize.
3. Remove outputBufferSizeInBytes and inputBufferSizeInBytes and initialize them locally in the mergeSpillsWithFileStream function.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20850) Improve division and multiplication mixing process the data
[ https://issues.apache.org/jira/browse/SPARK-20850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen resolved SPARK-20850. --- Resolution: Fixed > Improve division and multiplication mixing process the data > --- > > Key: SPARK-20850 > URL: https://issues.apache.org/jira/browse/SPARK-20850 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: caoxuewen > > spark-sql> select (1234567890123456789012 / 12345678901234567890120) * > 12345678901234567890120; > NULL > spark-sql> select (12345678901234567890 / 123) * 123; > NULL > when the length of the getText is greater than 19, The result is not what we > expected. > but mysql handle the value is ok. > mysql> select (1234567890123456789012 / 12345678901234567890120) * > 12345678901234567890120; > +--+ > | (1234567890123456789012 / 12345678901234567890120) * > 12345678901234567890120 | > +--+ > | > 1234567890123456789012. | > +--+ > 1 row in set (0.00 sec) > mysql> select (12345678901234567890 / 123) * 123; > ++ > | (12345678901234567890 / 123) * 123 | > ++ > | 12345678901234567890. | > ++ > 1 row in set (0.00 sec) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20850) Improve division and multiplication mixing process the data
caoxuewen created SPARK-20850: - Summary: Improve division and multiplication mixing process the data Key: SPARK-20850 URL: https://issues.apache.org/jira/browse/SPARK-20850 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: caoxuewen
spark-sql> select (1234567890123456789012 / 12345678901234567890120) * 12345678901234567890120;
NULL
spark-sql> select (12345678901234567890 / 123) * 123;
NULL
When the length of the parsed literal (getText) is greater than 19, the result is not what we expect, but MySQL handles the same values correctly:
mysql> select (1234567890123456789012 / 12345678901234567890120) * 12345678901234567890120;
+-------------------------------------------------------------------------------+
| (1234567890123456789012 / 12345678901234567890120) * 12345678901234567890120 |
+-------------------------------------------------------------------------------+
| 1234567890123456789012. |
+-------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> select (12345678901234567890 / 123) * 123;
+------------------------------------+
| (12345678901234567890 / 123) * 123 |
+------------------------------------+
| 12345678901234567890. |
+------------------------------------+
1 row in set (0.00 sec)
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
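For intuition, a small standalone Scala illustration (not Spark code) of why keeping more intermediate precision avoids the NULL: with BigDecimal arithmetic at DECIMAL128 precision the round trip stays close to the original operand; the exact trailing digits depend on how the intermediate quotient is rounded:

{code:scala}
import java.math.{BigDecimal => JBigDecimal, MathContext}

// Illustrative only: divide and multiply back with 34 significant digits (DECIMAL128).
val a = new JBigDecimal("12345678901234567890")
val b = new JBigDecimal("123")
val roundTrip = a.divide(b, MathContext.DECIMAL128).multiply(b)
println(roundTrip) // approximately 12345678901234567890, up to rounding in the last digits
{code}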
[jira] [Updated] (SPARK-20786) Improve ceil and floor handle the value which is not expected
[ https://issues.apache.org/jira/browse/SPARK-20786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20786: -- Summary: Improve ceil and floor handle the value which is not expected (was: Improve ceil handle the value which is not expected) > Improve ceil and floor handle the value which is not expected > - > > Key: SPARK-20786 > URL: https://issues.apache.org/jira/browse/SPARK-20786 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: caoxuewen > > spark-sql>SELECT ceil(1234567890123456); > 1234567890123456 > spark-sql>SELECT ceil(12345678901234567); > 12345678901234568 > spark-sql>SELECT ceil(123456789012345678); > 123456789012345680 > when the length of the getText is greater than 16. long to double will be > precision loss. > but mysql handle the value is ok. > mysql> SELECT ceil(1234567890123456); > ++ > | ceil(1234567890123456) | > ++ > | 1234567890123456 | > ++ > 1 row in set (0.00 sec) > mysql> SELECT ceil(12345678901234567); > +-+ > | ceil(12345678901234567) | > +-+ > | 12345678901234567 | > +-+ > 1 row in set (0.00 sec) > mysql> SELECT ceil(123456789012345678); > +--+ > | ceil(123456789012345678) | > +--+ > | 123456789012345678 | > +--+ > 1 row in set (0.00 sec) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20786) Improve ceil handle the value which is not expected
caoxuewen created SPARK-20786: - Summary: Improve ceil handle the value which is not expected Key: SPARK-20786 URL: https://issues.apache.org/jira/browse/SPARK-20786 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: caoxuewen
spark-sql> SELECT ceil(1234567890123456);
1234567890123456
spark-sql> SELECT ceil(12345678901234567);
12345678901234568
spark-sql> SELECT ceil(123456789012345678);
123456789012345680
When the length of the parsed literal (getText) is greater than 16, the long-to-double conversion loses precision, but MySQL handles the same values correctly:
mysql> SELECT ceil(1234567890123456);
+------------------------+
| ceil(1234567890123456) |
+------------------------+
| 1234567890123456 |
+------------------------+
1 row in set (0.00 sec)
mysql> SELECT ceil(12345678901234567);
+-------------------------+
| ceil(12345678901234567) |
+-------------------------+
| 12345678901234567 |
+-------------------------+
1 row in set (0.00 sec)
mysql> SELECT ceil(123456789012345678);
+--------------------------+
| ceil(123456789012345678) |
+--------------------------+
| 123456789012345678 |
+--------------------------+
1 row in set (0.00 sec)
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
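The root cause is the long-to-double round trip; a small Scala illustration of the precision loss described above:

{code:scala}
// Doubles carry a 53-bit mantissa (roughly 16 decimal digits), so longer literals
// cannot survive a round trip through Double.
val x = 12345678901234567L
val viaDouble = math.ceil(x.toDouble).toLong
println(x)         // 12345678901234567
println(viaDouble) // 12345678901234568 -- precision lost in the Double conversion
{code}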
[jira] [Updated] (SPARK-20607) Add new unit tests to ShuffleSuite
[ https://issues.apache.org/jira/browse/SPARK-20607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20607: -- Priority: Minor (was: Trivial) > Add new unit tests to ShuffleSuite > -- > > Key: SPARK-20607 > URL: https://issues.apache.org/jira/browse/SPARK-20607 > Project: Spark > Issue Type: Test > Components: Shuffle, Tests >Affects Versions: 2.1.2 >Reporter: caoxuewen >Priority: Minor > > 1.adds the new unit tests. > testing would be performed when there is no shuffle stage, > shuffle will not generate the data file and the index files. > 2.Modify the '[SPARK-4085] rerun map stage if reduce stage cannot find its > local shuffle file' unit test, > parallelize is 1 but not is 2, Check the index file and delete. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20607) Add new unit tests to ShuffleSuite
[ https://issues.apache.org/jira/browse/SPARK-20607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20607: -- Priority: Trivial (was: Major) > Add new unit tests to ShuffleSuite > -- > > Key: SPARK-20607 > URL: https://issues.apache.org/jira/browse/SPARK-20607 > Project: Spark > Issue Type: Test > Components: Shuffle, Tests >Affects Versions: 2.1.2 >Reporter: caoxuewen >Priority: Trivial > > 1.adds the new unit tests. > testing would be performed when there is no shuffle stage, > shuffle will not generate the data file and the index files. > 2.Modify the '[SPARK-4085] rerun map stage if reduce stage cannot find its > local shuffle file' unit test, > parallelize is 1 but not is 2, Check the index file and delete. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20607) Add new unit tests to ShuffleSuite
[ https://issues.apache.org/jira/browse/SPARK-20607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20607: -- Issue Type: Test (was: Bug) > Add new unit tests to ShuffleSuite > -- > > Key: SPARK-20607 > URL: https://issues.apache.org/jira/browse/SPARK-20607 > Project: Spark > Issue Type: Test > Components: Shuffle, Tests >Affects Versions: 2.1.2 >Reporter: caoxuewen > > 1.adds the new unit tests. > testing would be performed when there is no shuffle stage, > shuffle will not generate the data file and the index files. > 2.Modify the '[SPARK-4085] rerun map stage if reduce stage cannot find its > local shuffle file' unit test, > parallelize is 1 but not is 2, Check the index file and delete. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20609) Run the SortShuffleSuite unit tests have residual spark_* system directory
[ https://issues.apache.org/jira/browse/SPARK-20609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20609: -- Priority: Minor (was: Major) > Run the SortShuffleSuite unit tests have residual spark_* system directory > -- > > Key: SPARK-20609 > URL: https://issues.apache.org/jira/browse/SPARK-20609 > Project: Spark > Issue Type: Bug > Components: Shuffle, Tests >Affects Versions: 2.1.2 >Reporter: caoxuewen >Priority: Minor > > This PR solution to run the SortShuffleSuite unit tests have residual spark_* > system directory > For example: > OS:Windows 7 > After the running SortShuffleSuite unit tests, > the system of TMP directory have > '..\AppData\Local\Temp\spark-f64121f9-11b4-4ffd-a4f0-cfca66643503' not > deleted. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20739) Supplement the new unit tests to SparkContextSchedulerCreationSuite
[ https://issues.apache.org/jira/browse/SPARK-20739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20739: -- Issue Type: Test (was: Improvement) > Supplement the new unit tests to SparkContextSchedulerCreationSuite > --- > > Key: SPARK-20739 > URL: https://issues.apache.org/jira/browse/SPARK-20739 > Project: Spark > Issue Type: Test > Components: Scheduler, Tests >Affects Versions: 2.2.0 >Reporter: caoxuewen >Priority: Minor > > This PR adds the new unit tests to support test 'SPARK_REGEX(sparkUrl)' > branch in createTaskScheduler methods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20739) Supplement the new unit tests to SparkContextSchedulerCreationSuite
[ https://issues.apache.org/jira/browse/SPARK-20739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20739: -- Issue Type: Improvement (was: Test) > Supplement the new unit tests to SparkContextSchedulerCreationSuite > --- > > Key: SPARK-20739 > URL: https://issues.apache.org/jira/browse/SPARK-20739 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Tests >Affects Versions: 2.2.0 >Reporter: caoxuewen >Priority: Minor > > This PR adds the new unit tests to support test 'SPARK_REGEX(sparkUrl)' > branch in createTaskScheduler methods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20739) Supplement the new unit tests to SparkContextSchedulerCreationSuite
[ https://issues.apache.org/jira/browse/SPARK-20739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20739: -- Priority: Minor (was: Trivial) > Supplement the new unit tests to SparkContextSchedulerCreationSuite > --- > > Key: SPARK-20739 > URL: https://issues.apache.org/jira/browse/SPARK-20739 > Project: Spark > Issue Type: Test > Components: Scheduler, Tests >Affects Versions: 2.2.0 >Reporter: caoxuewen >Priority: Minor > > This PR adds the new unit tests to support test 'SPARK_REGEX(sparkUrl)' > branch in createTaskScheduler methods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20739) Supplement the new unit tests to SparkContextSchedulerCreationSuite
[ https://issues.apache.org/jira/browse/SPARK-20739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20739: -- Issue Type: Test (was: Improvement) > Supplement the new unit tests to SparkContextSchedulerCreationSuite > --- > > Key: SPARK-20739 > URL: https://issues.apache.org/jira/browse/SPARK-20739 > Project: Spark > Issue Type: Test > Components: Scheduler, Tests >Affects Versions: 2.2.0 >Reporter: caoxuewen >Priority: Trivial > > This PR adds the new unit tests to support test 'SPARK_REGEX(sparkUrl)' > branch in createTaskScheduler methods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20739) Supplement the new unit tests to SparkContextSchedulerCreationSuite
[ https://issues.apache.org/jira/browse/SPARK-20739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010147#comment-16010147 ] caoxuewen edited comment on SPARK-20739 at 5/15/17 8:26 AM: ok thanks, was (Author: heary-cao): ok thansk, > Supplement the new unit tests to SparkContextSchedulerCreationSuite > --- > > Key: SPARK-20739 > URL: https://issues.apache.org/jira/browse/SPARK-20739 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Tests >Affects Versions: 2.2.0 >Reporter: caoxuewen >Priority: Trivial > > This PR adds the new unit tests to support test 'SPARK_REGEX(sparkUrl)' > branch in createTaskScheduler methods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20739) Supplement the new unit tests to SparkContextSchedulerCreationSuite
[ https://issues.apache.org/jira/browse/SPARK-20739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010147#comment-16010147 ] caoxuewen commented on SPARK-20739: --- ok thansk, > Supplement the new unit tests to SparkContextSchedulerCreationSuite > --- > > Key: SPARK-20739 > URL: https://issues.apache.org/jira/browse/SPARK-20739 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Tests >Affects Versions: 2.2.0 >Reporter: caoxuewen >Priority: Trivial > > This PR adds the new unit tests to support test 'SPARK_REGEX(sparkUrl)' > branch in createTaskScheduler methods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20739) Supplement the new unit tests to SparkContextSchedulerCreationSuite
[ https://issues.apache.org/jira/browse/SPARK-20739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] caoxuewen updated SPARK-20739: -- Summary: Supplement the new unit tests to SparkContextSchedulerCreationSuite (was: Unit testing support for 'spark://' format) > Supplement the new unit tests to SparkContextSchedulerCreationSuite > --- > > Key: SPARK-20739 > URL: https://issues.apache.org/jira/browse/SPARK-20739 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Tests >Affects Versions: 2.2.0 >Reporter: caoxuewen >Priority: Trivial > > This PR adds the new unit tests to support test 'SPARK_REGEX(sparkUrl)' > branch in createTaskScheduler methods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20739) Unit testing support for 'spark://' format
caoxuewen created SPARK-20739: - Summary: Unit testing support for 'spark://' format Key: SPARK-20739 URL: https://issues.apache.org/jira/browse/SPARK-20739 Project: Spark Issue Type: Bug Components: Scheduler, Tests Affects Versions: 2.2.1 Reporter: caoxuewen This PR adds the new unit tests to support test 'SPARK_REGEX(sparkUrl)' branch in createTaskScheduler methods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
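A rough sketch of the kind of test this adds, following the pattern of the existing SparkContextSchedulerCreationSuite cases. The createTaskScheduler helper and the exact assertion are assumptions for illustration, not the merged test:

{code:scala}
// Hypothetical sketch: exercise the SPARK_REGEX branch of SparkContext.createTaskScheduler
// with a spark://host:port master URL, using the suite's assumed createTaskScheduler helper.
test("spark://host:port master URL selects the standalone scheduler backend") {
  val sched = createTaskScheduler("spark://localhost:7077")
  sched.backend match {
    case _: StandaloneSchedulerBackend => // the SPARK_REGEX branch was taken as expected
    case other => fail(s"unexpected scheduler backend: $other")
  }
}
{code}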
[jira] [Created] (SPARK-20609) Run the SortShuffleSuite unit tests have residual spark_* system directory
caoxuewen created SPARK-20609: - Summary: Run the SortShuffleSuite unit tests have residual spark_* system directory Key: SPARK-20609 URL: https://issues.apache.org/jira/browse/SPARK-20609 Project: Spark Issue Type: Bug Components: Shuffle, Tests Affects Versions: 2.1.2 Reporter: caoxuewen
This PR fixes the residual spark_* system directories left behind by running the SortShuffleSuite unit tests.
For example, on Windows 7: after running the SortShuffleSuite unit tests, the system TMP directory still contains '..\AppData\Local\Temp\spark-f64121f9-11b4-4ffd-a4f0-cfca66643503', which was never deleted.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
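A minimal sketch of the usual remedy for leftover temp directories in a test suite: create the directory through Spark's Utils and always delete it when the test body finishes. This assumes code living under the org.apache.spark package (Utils is private[spark]); it is a sketch, not the actual patch:

{code:scala}
import java.io.File
import org.apache.spark.util.Utils

// Illustrative only: run a test body against a fresh temp dir and always clean it up,
// so no spark-* directories are left behind in the system TMP folder.
def withTempDir(body: File => Unit): Unit = {
  val dir = Utils.createTempDir()
  try body(dir) finally Utils.deleteRecursively(dir)
}

// usage inside a test:
// withTempDir { dir => /* exercise shuffle code that writes under dir */ }
{code}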
[jira] [Created] (SPARK-20607) Add new unit tests to ShuffleSuite
caoxuewen created SPARK-20607: - Summary: Add new unit tests to ShuffleSuite Key: SPARK-20607 URL: https://issues.apache.org/jira/browse/SPARK-20607 Project: Spark Issue Type: Bug Components: Shuffle, Tests Affects Versions: 2.1.2 Reporter: caoxuewen
1. Add new unit tests: when there is no shuffle stage, shuffle should not generate the data file and the index files.
2. Modify the '[SPARK-4085] rerun map stage if reduce stage cannot find its local shuffle file' unit test: use a parallelism of 1 instead of 2, and check for and delete the index file.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20518) Supplement the new blockidsuite unit tests
caoxuewen created SPARK-20518: - Summary: Supplement the new blockidsuite unit tests Key: SPARK-20518 URL: https://issues.apache.org/jira/browse/SPARK-20518 Project: Spark Issue Type: Test Components: Tests Affects Versions: 2.1.0 Reporter: caoxuewen Adds new unit tests covering ShuffleDataBlockId, ShuffleIndexBlockId, TempShuffleBlockId, and TempLocalBlockId. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
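A hedged sketch of the round-trip style such tests usually follow (the exact assertions in the merged suite may differ):

{code:scala}
import org.apache.spark.storage.{BlockId, ShuffleDataBlockId}

// Illustrative only: a ShuffleDataBlockId built from (shuffleId, mapId, reduceId)
// should stringify to shuffle_<shuffleId>_<mapId>_<reduceId>.data and parse back
// to an equal value via BlockId.apply.
val id = ShuffleDataBlockId(1, 2, 3)
assert(id.name == "shuffle_1_2_3.data")
assert(BlockId(id.name) == id)
{code}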
[jira] [Created] (SPARK-20471) Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled but vectorized hashmap is enabled.
caoxuewen created SPARK-20471: - Summary: Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled but vectorized hashmap is enabled. Key: SPARK-20471 URL: https://issues.apache.org/jira/browse/SPARK-20471 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.1.0 Reporter: caoxuewen
Remove warnings from the AggregateBenchmark test suite, such as:
'14:26:33.220 WARN org.apache.spark.sql.execution.aggregate.HashAggregateExec: Two level hashmap is disabled but vectorized hashmap is enabled.'
Unit tests: AggregateBenchmark. Also modify the 'ignore' functions into 'test' functions.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
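The warning is logged when the vectorized hash map is enabled while the two-level hash map is disabled; a hedged sketch of keeping the two settings consistent in a benchmark session. The config key spellings below are recalled from that era of Spark and should be treated as assumptions:

{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative only: enable both flags together so HashAggregateExec does not log
// the "Two level hashmap is disabled but vectorized hashmap is enabled" warning.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("aggregate-benchmark-sketch")
  .config("spark.sql.codegen.aggregate.map.twolevel.enable", "true")
  .config("spark.sql.codegen.aggregate.map.vectorized.enable", "true")
  .getOrCreate()
{code}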