[jira] [Created] (SPARK-26289) cleanup enablePerfMetrics parameter from BytesToBytesMap

2018-12-06 Thread caoxuewen (JIRA)
caoxuewen created SPARK-26289:
-

 Summary: cleanup enablePerfMetrics parameter from BytesToBytesMap
 Key: SPARK-26289
 URL: https://issues.apache.org/jira/browse/SPARK-26289
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.5.0
Reporter: caoxuewen


enablePerfMetrics was originally introduced in BytesToBytesMap to gate 
getNumHashCollisions, getTimeSpentResizingNs, and getAverageProbesPerLookup. 
As Spark has evolved, however, the parameter is now only used for 
getAverageProbesPerLookup and is always set to true wherever BytesToBytesMap 
is constructed. It is also risky to use it to decide whether 
getAverageProbesPerLookup is enabled and to throw an IllegalStateException 
otherwise. This PR therefore removes the enablePerfMetrics parameter from 
BytesToBytesMap. Thanks.
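To make the guard concrete, here is a minimal illustrative sketch of the pattern being 
removed (the class and fields below are stand-ins, not the real BytesToBytesMap):

```
class ProbeMetrics(enablePerfMetrics: Boolean, numProbes: Long, numKeyLookups: Long) {
  def getAverageProbesPerLookup: Double = {
    if (!enablePerfMetrics) {
      // the guard this PR removes: every current caller passes true anyway
      throw new IllegalStateException("Perf metrics were not enabled")
    }
    numProbes.toDouble / math.max(numKeyLookups, 1)
  }
}
```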






[jira] [Created] (SPARK-26271) remove unuse object SparkPlan

2018-12-05 Thread caoxuewen (JIRA)
caoxuewen created SPARK-26271:
-

 Summary: remove unuse object SparkPlan
 Key: SPARK-26271
 URL: https://issues.apache.org/jira/browse/SPARK-26271
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.1
Reporter: caoxuewen


This code came from PR [#11190|https://github.com/apache/spark/pull/11190], 
but it has never been used since PR [#14548|https://github.com/apache/spark/pull/14548]. 
Let's continue and fix it. Thanks.






[jira] [Created] (SPARK-26180) Add a withCreateTempDir function to the SparkCore test case

2018-11-26 Thread caoxuewen (JIRA)
caoxuewen created SPARK-26180:
-

 Summary: Add a withCreateTempDir function to the SparkCore test 
case
 Key: SPARK-26180
 URL: https://issues.apache.org/jira/browse/SPARK-26180
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Tests
Affects Versions: 2.5.0
Reporter: caoxuewen


Currently, the common withTempDir function is used in the Spark SQL test cases to 
wrap the val dir = Utils.createTempDir() / Utils.deleteRecursively(dir) pattern. 
Unfortunately, withTempDir cannot be used in the Spark Core test cases. This PR 
adds a common withCreateTempDir function to clean up the Spark Core test cases. 
Thanks.
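A minimal sketch of what the helper could look like, assuming it mirrors the SQL-side 
withTempDir (the trait and body below are illustrative, not the exact code in the PR):

```
import java.io.File
import org.apache.spark.util.Utils

trait TempDirHelper {
  // Run the test body against a fresh temp dir and always clean it up afterwards.
  protected def withCreateTempDir(f: File => Unit): Unit = {
    val dir = Utils.createTempDir()
    try f(dir) finally Utils.deleteRecursively(dir)
  }
}
```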






[jira] [Created] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception

2018-11-19 Thread caoxuewen (JIRA)
caoxuewen created SPARK-26117:
-

 Summary: use SparkOutOfMemoryError instead of OutOfMemoryError 
when catch exception
 Key: SPARK-26117
 URL: https://issues.apache.org/jira/browse/SPARK-26117
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.5.0
Reporter: caoxuewen


PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire executor 
when an OutOfMemoryError is thrown. So, where we allocate memory through 
MemoryConsumer.allocatePage and catch the resulting exception, we should catch 
SparkOutOfMemoryError instead of OutOfMemoryError.
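As a hedged sketch of the intended pattern (the allocate/spill hooks below are 
hypothetical placeholders, not Spark APIs):

```
import org.apache.spark.memory.SparkOutOfMemoryError

// Code that requests pages via MemoryConsumer.allocatePage should match on the narrower
// SparkOutOfMemoryError from PR #20014, so a genuine JVM-level OutOfMemoryError still
// propagates instead of being swallowed here.
def allocateOrSpill[T](allocate: => T)(spill: => T): T =
  try allocate catch {
    case _: SparkOutOfMemoryError => spill
  }
```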






[jira] [Created] (SPARK-26073) remove invalid comment as we don't use it anymore

2018-11-15 Thread caoxuewen (JIRA)
caoxuewen created SPARK-26073:
-

 Summary: remove invalid comment as we don't use it anymore
 Key: SPARK-26073
 URL: https://issues.apache.org/jira/browse/SPARK-26073
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.5.0
Reporter: caoxuewen


Remove an invalid comment, as it is no longer used.
More details: [https://github.com/apache/spark/pull/22976] (comment)






[jira] [Created] (SPARK-26072) remove invalid comment as we don't use it anymore

2018-11-15 Thread caoxuewen (JIRA)
caoxuewen created SPARK-26072:
-

 Summary: remove invalid comment as we don't use it anymore
 Key: SPARK-26072
 URL: https://issues.apache.org/jira/browse/SPARK-26072
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.5.0
Reporter: caoxuewen


Remove an invalid comment, as it is no longer used.
More details: [https://github.com/apache/spark/pull/22976] (comment)






[jira] [Created] (SPARK-26001) Reduce memory copy when writing decimal

2018-11-09 Thread caoxuewen (JIRA)
caoxuewen created SPARK-26001:
-

 Summary: Reduce memory copy when writing decimal
 Key: SPARK-26001
 URL: https://issues.apache.org/jira/browse/SPARK-26001
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.5.0
Reporter: caoxuewen


This PR fixes two things here:
- When writing non-null decimals, we do not need to zero out all 16 allocated bytes. 
If the number of bytes needed for the decimal is greater than 8, there is no need to 
zero out bytes 0 through 8: the first 8 bytes will be overwritten when the decimal is 
written.
- When writing null decimals, we do not need to zero out all 16 allocated bytes either. 
BitSetMethods.set marks the field as null and the length of the decimal is set to 0; 
when the decimal is read back, the 16-byte memory region is never accessed, so this is 
safe.






[jira] [Created] (SPARK-25974) Optimizes Generates bytecode for ordering based on the given order

2018-11-08 Thread caoxuewen (JIRA)
caoxuewen created SPARK-25974:
-

 Summary: Optimizes Generates bytecode for ordering based on the 
given order
 Key: SPARK-25974
 URL: https://issues.apache.org/jira/browse/SPARK-25974
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.1
Reporter: caoxuewen


Currently, when generating the code for ordering based on the given order, too many 
variables and assignment statements are produced, which is unnecessary. This PR 
eliminates the redundant variables and optimizes the generated bytecode for ordering 
based on the given order.
The generated code looks like:

spark.range(1).selectExpr(
 "id as key",
 "(id & 1023) as value1",
"cast(id & 1023 as double) as value2",
"cast(id & 1023 as int) as value3"
).select("value1", "value2", "value3").orderBy("value1", "value2").collect()

Before this PR (codegen size: 178):

Generated Ordering by input[0, bigint, false] ASC NULLS FIRST,input[1, double, 
false] ASC NULLS FIRST:
/* 001 */ public SpecificOrdering generate(Object[] references) {
/* 002 */   return new SpecificOrdering(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificOrdering extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */
/* 009 */
/* 010 */   public SpecificOrdering(Object[] references) {
/* 011 */ this.references = references;
/* 012 */
/* 013 */   }
/* 014 */
/* 015 */   public int compare(InternalRow a, InternalRow b) {
/* 016 */
/* 017 */ InternalRow i = null;
/* 018 */
/* 019 */ i = a;
/* 020 */ boolean isNullA_0;
/* 021 */ long primitiveA_0;
/* 022 */ {
/* 023 */   long value_0 = i.getLong(0);
/* 024 */   isNullA_0 = false;
/* 025 */   primitiveA_0 = value_0;
/* 026 */ }
/* 027 */ i = b;
/* 028 */ boolean isNullB_0;
/* 029 */ long primitiveB_0;
/* 030 */ {
/* 031 */   long value_0 = i.getLong(0);
/* 032 */   isNullB_0 = false;
/* 033 */   primitiveB_0 = value_0;
/* 034 */ }
/* 035 */ if (isNullA_0 && isNullB_0) {
/* 036 */   // Nothing
/* 037 */ } else if (isNullA_0) {
/* 038 */   return -1;
/* 039 */ } else if (isNullB_0) {
/* 040 */   return 1;
/* 041 */ } else {
/* 042 */   int comp = (primitiveA_0 > primitiveB_0 ? 1 : primitiveA_0 < 
primitiveB_0 ? -1 : 0);
/* 043 */   if (comp != 0) {
/* 044 */ return comp;
/* 045 */   }
/* 046 */ }
/* 047 */
/* 048 */ i = a;
/* 049 */ boolean isNullA_1;
/* 050 */ double primitiveA_1;
/* 051 */ {
/* 052 */   double value_1 = i.getDouble(1);
/* 053 */   isNullA_1 = false;
/* 054 */   primitiveA_1 = value_1;
/* 055 */ }
/* 056 */ i = b;
/* 057 */ boolean isNullB_1;
/* 058 */ double primitiveB_1;
/* 059 */ {
/* 060 */   double value_1 = i.getDouble(1);
/* 061 */   isNullB_1 = false;
/* 062 */   primitiveB_1 = value_1;
/* 063 */ }
/* 064 */ if (isNullA_1 && isNullB_1) {
/* 065 */   // Nothing
/* 066 */ } else if (isNullA_1) {
/* 067 */   return -1;
/* 068 */ } else if (isNullB_1) {
/* 069 */   return 1;
/* 070 */ } else {
/* 071 */   int comp = 
org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA_1, primitiveB_1);
/* 072 */   if (comp != 0) {
/* 073 */ return comp;
/* 074 */   }
/* 075 */ }
/* 076 */
/* 077 */
/* 078 */ return 0;
/* 079 */   }
/* 080 */
/* 081 */
/* 082 */ }

After this PR (codegen size: 89):
Generated Ordering by input[0, bigint, false] ASC NULLS FIRST,input[1, double, 
false] ASC NULLS FIRST:
/* 001 */ public SpecificOrdering generate(Object[] references) {
/* 002 */   return new SpecificOrdering(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificOrdering extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */
/* 009 */
/* 010 */   public SpecificOrdering(Object[] references) {
/* 011 */ this.references = references;
/* 012 */
/* 013 */   }
/* 014 */
/* 015 */   public int compare(InternalRow a, InternalRow b) {
/* 016 */
/* 017 */
/* 018 */ long value_0 = a.getLong(0);
/* 019 */ long value_2 = b.getLong(0);
/* 020 */ if (false && false) {
/* 021 */   // Nothing
/* 022 */ } else if (false) {
/* 023 */   return -1;
/* 024 */ } else if (false) {
/* 025 */   return 1;
/* 026 */ } else {
/* 027 */   int comp = (value_0 > value_2 ? 1 : value_0 < value_2 ? -1 : 0);
/* 028 */   if (comp != 0) {
/* 029 */ return comp;
/* 030 */   }
/* 031 */ }
/* 032 */
/* 033 */ double value_1 = a.getDouble(1);
/* 034 */ double value_3 = b.getDouble(1);
/* 035 */ if (false && false) {
/* 036 */   // Nothing
/* 037 */ } else if (false) {
/* 038 */   return -1;
/* 039 */ } else if (false) {
/* 040 */   

[jira] [Updated] (SPARK-24066) Add new optimization rule to eliminate unnecessary sort by exchanged adjacent Window expressions

2018-11-05 Thread caoxuewen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-24066:
--
Description: 
Currently, when two adjacent window functions have the same partitioning and their 
ordering keys intersect (one ordering is a subset of the other), two sorts are 
performed after shuffling, which is not necessary. This PR adds a new optimization 
rule to eliminate the unnecessary sort by exchanging the adjacent Window expressions.

For example:

val df = Seq(
  ("a", "p1", 10.0, 20.0, 30.0),
  ("a", "p2", 20.0, 10.0, 40.0)).toDF("key", "value", "value1", "value2", 
"value3").select(
    $"key",
    sum("value1").over(Window.partitionBy("key").orderBy("value")),
    max("value2").over(Window.partitionBy("key").orderBy("value", 
"value1")),
    avg("value3").over(Window.partitionBy("key").orderBy("value", "value1", 
"value2"))
  ).queryExecution.executedPlan

 

Before  this PR:

*(5) Project [key#16, sum(value1) OVER (PARTITION BY key ORDER BY value ASC 
NULLS FIRST unspecifiedframe$())#29, max(value2) OVER (PARTITION BY key ORDER 
BY value ASC NULLS FIRST, value1 ASC NULLS FIRST unspecifiedframe$())#30, 
avg(value3) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC 
NULLS FIRST, value2 ASC NULLS FIRST unspecifiedframe$())#31]
+- Window [max(value2#19) windowspecdefinition(key#16, value#17 ASC NULLS 
FIRST, value1#18 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, 
unboundedpreceding$(), currentrow$())) AS max(value2) OVER (PARTITION BY key 
ORDER BY value ASC NULLS FIRST, value1 ASC NULLS FIRST 
unspecifiedframe$())#30], [key#16], [value#17 ASC NULLS FIRST, value1#18 ASC 
NULLS FIRST]
   +- *(4) Project [key#16, value1#18, value#17, value2#19, sum(value1) OVER 
(PARTITION BY key ORDER BY value ASC NULLS FIRST unspecifiedframe$())#29, 
avg(value3) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC 
NULLS FIRST, value2 ASC NULLS FIRST unspecifiedframe$())#31]
  +- Window [avg(value3#20) windowspecdefinition(key#16, value#17 ASC NULLS 
FIRST, value1#18 ASC NULLS FIRST, value2#19 ASC NULLS FIRST, 
specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS 
avg(value3) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC 
NULLS FIRST, value2 ASC NULLS FIRST unspecifiedframe$())#31], [key#16], 
[value#17 ASC NULLS FIRST, value1#18 ASC NULLS FIRST, value2#19 ASC NULLS FIRST]
 +- *(3) Sort [key#16 ASC NULLS FIRST, value#17 ASC NULLS FIRST, 
value1#18 ASC NULLS FIRST, value2#19 ASC NULLS FIRST], false, 0
    +- Window [sum(value1#18) windowspecdefinition(key#16, value#17 ASC 
NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), 
currentrow$())) AS sum(value1) OVER (PARTITION BY key ORDER BY value ASC NULLS 
FIRST unspecifiedframe$())#29], [key#16], [value#17 ASC NULLS FIRST]
   +- *(2) Sort [key#16 ASC NULLS FIRST, value#17 ASC NULLS FIRST], 
false, 0
  +- Exchange hashpartitioning(key#16, 5)
 +- *(1) Project [_1#5 AS key#16, _3#7 AS value1#18, _2#6 
AS value#17, _4#8 AS value2#19, _5#9 AS value3#20]
    +- LocalTableScan [_1#5, _2#6, _3#7, _4#8, _5#9]

 

After  this PR:

*(5) Project [key#16, sum(value1) OVER (PARTITION BY key ORDER BY value ASC 
NULLS FIRST unspecifiedframe$())#29, max(value2) OVER (PARTITION BY key ORDER 
BY value ASC NULLS FIRST, value1 ASC NULLS FIRST unspecifiedframe$())#30, 
avg(value3) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC 
NULLS FIRST, value2 ASC NULLS FIRST unspecifiedframe$())#31]
+- Window [sum(value1#18) windowspecdefinition(key#16, value#17 ASC NULLS 
FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) 
AS sum(value1) OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST 
unspecifiedframe$())#29], [key#16], [value#17 ASC NULLS FIRST]
   +- *(4) Project [key#16, value1#18, value#17, avg(value3) OVER (PARTITION BY 
key ORDER BY value ASC NULLS FIRST, value1 ASC NULLS FIRST, value2 ASC NULLS 
FIRST unspecifiedframe$())#31, max(value2) OVER (PARTITION BY key ORDER BY 
value ASC NULLS FIRST, value1 ASC NULLS FIRST unspecifiedframe$())#30]
  +- Window [max(value2#19) windowspecdefinition(key#16, value#17 ASC NULLS 
FIRST, value1#18 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, 
unboundedpreceding$(), currentrow$())) AS max(value2) OVER (PARTITION BY key 
ORDER BY value ASC NULLS FIRST, value1 ASC NULLS FIRST 
unspecifiedframe$())#30], [key#16], [value#17 ASC NULLS FIRST, value1#18 ASC 
NULLS FIRST]
 +- *(3) Project [key#16, value1#18, value#17, value2#19, avg(value3) 
OVER (PARTITION BY key ORDER BY value ASC NULLS FIRST, value1 ASC NULLS FIRST, 
value2 ASC NULLS FIRST unspecifiedframe$())#31]
    +- Window [avg(value3#20) windowspecdefinition(key#16, value#17 ASC 
NULLS FIRST, value1#18 ASC NULLS FIRST, value2#19 ASC NULLS FIRST, 

[jira] [Updated] (SPARK-24066) Add new optimization rule to eliminate unnecessary sort by exchanged adjacent Window expressions

2018-11-05 Thread caoxuewen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-24066:
--
Summary: Add new optimization rule to eliminate unnecessary sort by 
exchanged adjacent Window expressions  (was: Add a window exchange rule to 
eliminate redundant physical plan SortExec)

> Add new optimization rule to eliminate unnecessary sort by exchanged adjacent 
> Window expressions
> 
>
> Key: SPARK-24066
> URL: https://issues.apache.org/jira/browse/SPARK-24066
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: caoxuewen
>Priority: Major
>
> Currently, when the ORDER BY fields of the window functions have a subset 
> relationship, Spark SQL will randomly generate different physical plans.
> For example:
> case class DistinctAgg(a: Int, b: Float, c: Double, d: Int, e: String)
> val df = spark.sparkContext.parallelize(
>   DistinctAgg(8, 2, 3, 4, "a") ::
>   DistinctAgg(9, 3, 4, 5, "b") ::
>   DistinctAgg(3, 4, 5, 6, "c") ::
>   DistinctAgg(3, 4, 5, 7, "c") ::
>   DistinctAgg(3, 4, 5, 8, "c") ::
>   DistinctAgg(3, 6, 6, 9, "d") ::
>   DistinctAgg(30, 40, 50, 60, "e") ::
>   DistinctAgg(41, 51, 61, 71, null) ::
>   DistinctAgg(42, 52, 62, 72, null) ::
>   DistinctAgg(43, 53, 63, 73, "k") ::Nil).toDF()
> df.createOrReplaceTempView("distinctAgg")
> select a, b, c, 
> avg(b) over(partition by a order by b) as sumIb, 
> sum(d) over(partition by a order by b, c) as sumId, d 
> from distinctAgg 
> The physical plan produced varies randomly.
> One variant: the physical plan contains only one Sort.
> == Physical Plan ==
> *(3) Project [a#181, b#182, c#183, sumId#210L, sumIb#209L, d#184]
> +- Window [sum(cast(b#182 as bigint)) windowspecdefinition(a#181, b#182 ASC 
> NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), 
> currentrow$())) AS sumIb#209L], [a#181], [b#182 ASC NULLS FIRST]
>    +- Window [sum(cast(d#184 as bigint)) windowspecdefinition(a#181, b#182 
> ASC NULLS FIRST, c#183 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, 
> unboundedpreceding$(), currentrow$())) AS sumId#210L], [a#181], [b#182 ASC 
> NULLS FIRST, c#183 ASC NULLS FIRST]
>   +- *(2) Sort [a#181 ASC NULLS FIRST, b#182 ASC NULLS FIRST, c#183 ASC 
> NULLS FIRST], false, 0
>  +- Exchange hashpartitioning(a#181, 5)
>     +- *(1) Project [a#181, b#182, c#183, d#184]
>    +- *(1) SerializeFromObject [assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).a AS a#181, 
> assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, 
> true]).b AS b#182, assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).c AS c#183, 
> assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, 
> true]).d AS d#184, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, 
> true]).e, true, false) AS e#185]
>   +- Scan ExternalRDDScan[obj#180]
> Another variant: the physical plan contains two Sorts.
> == Physical Plan ==
> *(4) Project [a#181, b#182, c#183, sumId#210L, sumIb#209L, d#184]
> +- Window [sum(cast(d#184 as bigint)) windowspecdefinition(a#181, b#182 ASC 
> NULLS FIRST, c#183 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, 
> unboundedpreceding$(), currentrow$())) AS sumId#210L], [a#181], [b#182 ASC 
> NULLS FIRST, c#183 ASC NULLS FIRST]
>    +- *(3) Sort [a#181 ASC NULLS FIRST, b#182 ASC NULLS FIRST, c#183 ASC 
> NULLS FIRST], false, 0
>   +- Window [sum(cast(b#182 as bigint)) windowspecdefinition(a#181, b#182 
> ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), 
> currentrow$())) AS sumIb#209L], [a#181], [b#182 ASC NULLS FIRST]
>  +- *(2) Sort [a#181 ASC NULLS FIRST, b#182 ASC NULLS FIRST], false, 0
>     +- Exchange hashpartitioning(a#181, 5)
>    +- *(1) Project [a#181, b#182, c#183, d#184]
>   +- *(1) SerializeFromObject [assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).a AS a#181, 
> assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, 
> true]).b AS b#182, assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).c AS c#183, 
> assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, 
> true]).d AS d#184, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, 
> true]).e, true, false) AS e#185]
>  +- Scan ExternalRDDScan[obj#180]
> this 

[jira] [Created] (SPARK-25848) Refactor CSVBenchmarks to use main method

2018-10-25 Thread caoxuewen (JIRA)
caoxuewen created SPARK-25848:
-

 Summary: Refactor CSVBenchmarks to use main method
 Key: SPARK-25848
 URL: https://issues.apache.org/jira/browse/SPARK-25848
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 2.4.1
Reporter: caoxuewen


use spark-submit:
bin/spark-submit --class  
org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks --jars 
./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
Generate benchmark result:
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks"
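For context, a rough sketch of the main-method pattern that this refactoring (and the 
similar JSON and ExternalAppendOnlyUnsafeRowArray ones below) moves to, assuming the 
shared BenchmarkBase helper; names and signatures are approximate, not the exact PR code:

```
import org.apache.spark.benchmark.BenchmarkBase

object CSVBenchmarksSketch extends BenchmarkBase {
  // main() is provided by BenchmarkBase; it calls runBenchmarkSuite and, when
  // SPARK_GENERATE_BENCHMARK_FILES=1, writes the results to a benchmarks/ file.
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    runBenchmark("Parsing quoted values") {
      // build the input DataFrame and time the CSV read path here
    }
  }
}
```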
 






[jira] [Created] (SPARK-25847) Refactor JSONBenchmarks to use main method

2018-10-25 Thread caoxuewen (JIRA)
caoxuewen created SPARK-25847:
-

 Summary: Refactor JSONBenchmarks to use main method
 Key: SPARK-25847
 URL: https://issues.apache.org/jira/browse/SPARK-25847
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 2.4.1
Reporter: caoxuewen


use spark-submit:
bin/spark-submit --class  
org.apache.spark.sql.execution.datasources.json.JSONBenchmarks --jars 
./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
Generate benchmark result:
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
org.apache.spark.sql.execution.datasources.json.JSONBenchmarks"
 






[jira] [Created] (SPARK-25846) Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method

2018-10-25 Thread caoxuewen (JIRA)
caoxuewen created SPARK-25846:
-

 Summary: Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use 
main method
 Key: SPARK-25846
 URL: https://issues.apache.org/jira/browse/SPARK-25846
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 2.4.1
Reporter: caoxuewen


use spark-submit:
bin/spark-submit --class  
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark --jars 
./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
Generate benchmark result:
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark"
 






[jira] [Created] (SPARK-25755) Supplementation of non-CodeGen unit tested for BroadcastHashJoinExec

2018-10-17 Thread caoxuewen (JIRA)
caoxuewen created SPARK-25755:
-

 Summary: Supplementation of non-CodeGen unit tested for 
BroadcastHashJoinExec
 Key: SPARK-25755
 URL: https://issues.apache.org/jira/browse/SPARK-25755
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 2.4.1
Reporter: caoxuewen


Currently, the BroadcastHashJoinExec physical plan supports both codegen and 
non-codegen execution, but only the codegen path is exercised by the unit tests in 
InnerJoinSuite, OuterJoinSuite, and ExistenceJoinSuite; the non-codegen path is not 
tested. This PR supplements that part of the tests.
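A self-contained sketch of how the non-codegen path can be exercised (illustrative only, 
not the suite change itself; the config key is the standard whole-stage codegen switch):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object NonCodegenBroadcastJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("bhj-non-codegen").getOrCreate()
    // With whole-stage codegen disabled, BroadcastHashJoinExec runs its non-codegen path.
    spark.conf.set("spark.sql.codegen.wholeStage", "false")
    import spark.implicits._

    val left = Seq((1, "a"), (2, "b")).toDF("k", "v1")
    val right = Seq((1, "x")).toDF("k", "v2")
    left.join(broadcast(right), "k").show()
    spark.stop()
  }
}
```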






[jira] [Created] (SPARK-24999) Reduce unnecessary 'new' memory operations

2018-08-02 Thread caoxuewen (JIRA)
caoxuewen created SPARK-24999:
-

 Summary: Reduce unnecessary 'new' memory operations
 Key: SPARK-24999
 URL: https://issues.apache.org/jira/browse/SPARK-24999
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: caoxuewen


This PR improves the codegen code generated for the fast hash map: there is no need 
to allocate a new block of memory for every new entry, because the UnsafeRow's memory 
can be reused.






[jira] [Created] (SPARK-24978) Add spark.sql.fast.hash.aggregate.row.max.capacity to configure the capacity of fast aggregation.

2018-07-31 Thread caoxuewen (JIRA)
caoxuewen created SPARK-24978:
-

 Summary: Add spark.sql.fast.hash.aggregate.row.max.capacity to 
configure the capacity of fast aggregation.
 Key: SPARK-24978
 URL: https://issues.apache.org/jira/browse/SPARK-24978
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 3.0.0
Reporter: caoxuewen


This PR adds a configuration parameter to configure the capacity of the fast hash 
aggregation map.
Performance comparison:
    /*
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
    Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
    Aggregate w multiple keys:     Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
    -------------------------------------------------------------------------------------------
    fasthash = default                   5612 / 5882          3.7          267.6        1.0X
    fasthash = config                    3586 / 3595          5.8          171.0        1.6X
    */
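If the proposal is adopted, usage would presumably look like setting the key before 
running the aggregation (hypothetical usage; the key comes from this issue's title and 
is not a released Spark configuration):

```
import org.apache.spark.sql.SparkSession

// Hypothetical: raise the fast hash aggregate map capacity, then run a multi-key aggregation.
val spark = SparkSession.builder().master("local[*]").appName("fast-hash-capacity").getOrCreate()
spark.conf.set("spark.sql.fast.hash.aggregate.row.max.capacity", (1 << 20).toString)
spark.range(1 << 22).selectExpr("id % 1024 as k", "id as v").groupBy("k").count().collect()
```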






[jira] [Created] (SPARK-24901) Merge the codegen of RegularHashMap and fastHashMap to reduce compiler maxCodesize when VectorizedHashMap is false

2018-07-24 Thread caoxuewen (JIRA)
caoxuewen created SPARK-24901:
-

 Summary: Merge the codegen of RegularHashMap and fastHashMap to 
reduce compiler maxCodesize when VectorizedHashMap is false
 Key: SPARK-24901
 URL: https://issues.apache.org/jira/browse/SPARK-24901
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: caoxuewen


Currently, the generated code that updates the UnsafeRow in hash aggregation keeps 
FastHashMap and RegularHashMap as two separate code paths. The two separate paths are 
only needed when VectorizedHashMap is true; in the other cases we can merge them to 
reduce the compiled method size (maxCodesize). Thanks.
case class DistinctAgg(a: Int, b: Float, c: Double, d: Int, e: String)
spark.sparkContext.parallelize(
  DistinctAgg(8, 2, 3, 4, "a") ::
  DistinctAgg(9, 3, 4, 5, "b") ::
  Nil).toDF().createOrReplaceTempView("distinctAgg")
val df = sql("select a, b, e, min(d) as mind, min(case when a > 10 then a else null end) as mincasea, min(a) as mina from distinctAgg group by a, b, e")
println(org.apache.spark.sql.execution.debug.codegenString(df.queryExecution.executedPlan))
df.show()

Generate code like:
 *Before modified:*
Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
...
/* 354 */
/* 355 */ if (agg_fastAggBuffer_0 != null) {
/* 356 */   // common sub-expressions
/* 357 */
/* 358 */   // evaluate aggregate function
/* 359 */   agg_agg_isNull_31_0 = true;
/* 360 */   int agg_value_34 = -1;
/* 361 */
/* 362 */   boolean agg_isNull_32 = agg_fastAggBuffer_0.isNullAt(0);
/* 363 */   int agg_value_35 = agg_isNull_32 ?
/* 364 */   -1 : (agg_fastAggBuffer_0.getInt(0));
/* 365 */
/* 366 */   if (!agg_isNull_32 && (agg_agg_isNull_31_0 ||
/* 367 */   agg_value_34 > agg_value_35)) {
/* 368 */ agg_agg_isNull_31_0 = false;
/* 369 */ agg_value_34 = agg_value_35;
/* 370 */   }
/* 371 */
/* 372 */   if (!false && (agg_agg_isNull_31_0 ||
/* 373 */   agg_value_34 > agg_expr_2_0)) {
/* 374 */ agg_agg_isNull_31_0 = false;
/* 375 */ agg_value_34 = agg_expr_2_0;
/* 376 */   }
/* 377 */   agg_agg_isNull_34_0 = true;
/* 378 */   int agg_value_37 = -1;
/* 379 */
/* 380 */   boolean agg_isNull_35 = agg_fastAggBuffer_0.isNullAt(1);
/* 381 */   int agg_value_38 = agg_isNull_35 ?
/* 382 */   -1 : (agg_fastAggBuffer_0.getInt(1));
/* 383 */
/* 384 */   if (!agg_isNull_35 && (agg_agg_isNull_34_0 ||
/* 385 */   agg_value_37 > agg_value_38)) {
/* 386 */ agg_agg_isNull_34_0 = false;
/* 387 */ agg_value_37 = agg_value_38;
/* 388 */   }
/* 389 */
/* 390 */   byte agg_caseWhenResultState_1 = -1;
/* 391 */   do {
/* 392 */ boolean agg_value_40 = false;
/* 393 */ agg_value_40 = agg_expr_0_0 > 10;
/* 394 */ if (!false && agg_value_40) {
/* 395 */   agg_caseWhenResultState_1 = (byte)(false ? 1 : 0);
/* 396 */   agg_agg_value_39_0 = agg_expr_0_0;
/* 397 */   continue;
/* 398 */ }
/* 399 */
/* 400 */ agg_caseWhenResultState_1 = (byte)(true ? 1 : 0);
/* 401 */ agg_agg_value_39_0 = -1;
/* 402 */
/* 403 */   } while (false);
/* 404 */   // TRUE if any condition is met and the result is null, or no 
any condition is met.
/* 405 */   final boolean agg_isNull_36 = (agg_caseWhenResultState_1 != 0);
/* 406 */
/* 407 */   if (!agg_isNull_36 && (agg_agg_isNull_34_0 ||
/* 408 */   agg_value_37 > agg_agg_value_39_0)) {
/* 409 */ agg_agg_isNull_34_0 = false;
/* 410 */ agg_value_37 = agg_agg_value_39_0;
/* 411 */   }
/* 412 */   agg_agg_isNull_42_0 = true;
/* 413 */   int agg_value_45 = -1;
/* 414 */
/* 415 */   boolean agg_isNull_43 = agg_fastAggBuffer_0.isNullAt(2);
/* 416 */   int agg_value_46 = agg_isNull_43 ?
/* 417 */   -1 : (agg_fastAggBuffer_0.getInt(2));
/* 418 */
/* 419 */   if (!agg_isNull_43 && (agg_agg_isNull_42_0 ||
/* 420 */   agg_value_45 > agg_value_46)) {
/* 421 */ agg_agg_isNull_42_0 = false;
/* 422 */ agg_value_45 = agg_value_46;
/* 423 */   }
/* 424 */
/* 425 */   if (!false && (agg_agg_isNull_42_0 ||
/* 426 */   agg_value_45 > agg_expr_0_0)) {
/* 427 */ agg_agg_isNull_42_0 = false;
/* 428 */ agg_value_45 = agg_expr_0_0;
/* 429 */   }
/* 430 */   // update fast row
/* 431 */   agg_fastAggBuffer_0.setInt(0, agg_value_34);
/* 432 */
/* 433 */   if (!agg_agg_isNull_34_0) {
/* 434 */ agg_fastAggBuffer_0.setInt(1, agg_value_37);
/* 435 */   } else {
/* 436 */ agg_fastAggBuffer_0.setNullAt(1);
/* 437 */   }
/* 438 */
/* 439 */   agg_fastAggBuffer_0.setInt(2, agg_value_45);
/* 440 */ } else {
/* 441 */   // common 

[jira] [Created] (SPARK-24066) Add a window exchange rule to eliminate redundant physical plan SortExec

2018-04-24 Thread caoxuewen (JIRA)
caoxuewen created SPARK-24066:
-

 Summary: Add a window exchange rule to eliminate redundant 
physical plan SortExec
 Key: SPARK-24066
 URL: https://issues.apache.org/jira/browse/SPARK-24066
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: caoxuewen


Currently, when the ORDER BY fields of the window functions have a subset relationship, 
Spark SQL will randomly generate different physical plans.
For example:

case class DistinctAgg(a: Int, b: Float, c: Double, d: Int, e: String)
val df = spark.sparkContext.parallelize(
  DistinctAgg(8, 2, 3, 4, "a") ::
  DistinctAgg(9, 3, 4, 5, "b") ::
  DistinctAgg(3, 4, 5, 6, "c") ::
  DistinctAgg(3, 4, 5, 7, "c") ::
  DistinctAgg(3, 4, 5, 8, "c") ::
  DistinctAgg(3, 6, 6, 9, "d") ::
  DistinctAgg(30, 40, 50, 60, "e") ::
  DistinctAgg(41, 51, 61, 71, null) ::
  DistinctAgg(42, 52, 62, 72, null) ::
  DistinctAgg(43, 53, 63, 73, "k") ::Nil).toDF()
df.createOrReplaceTempView("distinctAgg")

select a, b, c, 
avg(b) over(partition by a order by b) as sumIb, 
sum(d) over(partition by a order by b, c) as sumId, d 
from distinctAgg 

The physical plan produced varies randomly.
One variant: the physical plan contains only one Sort.
== Physical Plan ==
*(3) Project [a#181, b#182, c#183, sumId#210L, sumIb#209L, d#184]
+- Window [sum(cast(b#182 as bigint)) windowspecdefinition(a#181, b#182 ASC 
NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), 
currentrow$())) AS sumIb#209L], [a#181], [b#182 ASC NULLS FIRST]
   +- Window [sum(cast(d#184 as bigint)) windowspecdefinition(a#181, b#182 ASC 
NULLS FIRST, c#183 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, 
unboundedpreceding$(), currentrow$())) AS sumId#210L], [a#181], [b#182 ASC 
NULLS FIRST, c#183 ASC NULLS FIRST]
  +- *(2) Sort [a#181 ASC NULLS FIRST, b#182 ASC NULLS FIRST, c#183 ASC 
NULLS FIRST], false, 0
 +- Exchange hashpartitioning(a#181, 5)
    +- *(1) Project [a#181, b#182, c#183, d#184]
   +- *(1) SerializeFromObject [assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).a AS a#181, 
assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, 
true]).b AS b#182, assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).c AS c#183, 
assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, 
true]).d AS d#184, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
StringType, fromString, assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).e, true, false) AS 
e#185]
  +- Scan ExternalRDDScan[obj#180]

Another variant: the physical plan contains two Sorts.
== Physical Plan ==
*(4) Project [a#181, b#182, c#183, sumId#210L, sumIb#209L, d#184]
+- Window [sum(cast(d#184 as bigint)) windowspecdefinition(a#181, b#182 ASC 
NULLS FIRST, c#183 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, 
unboundedpreceding$(), currentrow$())) AS sumId#210L], [a#181], [b#182 ASC 
NULLS FIRST, c#183 ASC NULLS FIRST]
   +- *(3) Sort [a#181 ASC NULLS FIRST, b#182 ASC NULLS FIRST, c#183 ASC NULLS 
FIRST], false, 0
  +- Window [sum(cast(b#182 as bigint)) windowspecdefinition(a#181, b#182 
ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), 
currentrow$())) AS sumIb#209L], [a#181], [b#182 ASC NULLS FIRST]
 +- *(2) Sort [a#181 ASC NULLS FIRST, b#182 ASC NULLS FIRST], false, 0
    +- Exchange hashpartitioning(a#181, 5)
   +- *(1) Project [a#181, b#182, c#183, d#184]
  +- *(1) SerializeFromObject [assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).a AS a#181, 
assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, 
true]).b AS b#182, assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).c AS c#183, 
assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$DistinctAgg, 
true]).d AS d#184, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
StringType, fromString, assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$DistinctAgg, true]).e, true, false) AS 
e#185]
 +- Scan ExternalRDDScan[obj#180]

This PR adds an exchange rule to ensure that no redundant SortExec physical plan is 
generated.






[jira] [Created] (SPARK-23676) Support left join codegen in SortMergeJoinExec

2018-03-14 Thread caoxuewen (JIRA)
caoxuewen created SPARK-23676:
-

 Summary: Support left join codegen in SortMergeJoinExec
 Key: SPARK-23676
 URL: https://issues.apache.org/jira/browse/SPARK-23676
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: caoxuewen


This PR generates Java code that directly implements the LeftOuter join in 
`SortMergeJoinExec` without using an iterator. The generated Java code improves 
runtime performance.

joinBenchmark result: **1.3x**
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
left sort merge join: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
--
left merge join wholestage=off 2439 / 2575 0.9 1163.0 1.0X
left merge join wholestage=on 1890 / 1904 1.1 901.1 1.3X
```
Benchmark program
```
 val N = 2 << 20
 runBenchmark("left sort merge join", N) {
   val df1 = sparkSession.range(N)
     .selectExpr(s"(id * 15485863) % ${N*10} as k1")
   val df2 = sparkSession.range(N)
     .selectExpr(s"(id * 15485867) % ${N*10} as k2")
   val df = df1.join(df2, col("k1") === col("k2"), "left")
   assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
   df.count()
 }
```
code example
```
val df1 = spark.range(2 << 20).selectExpr("id as k1", "id * 2 as v1")
val df2 = spark.range(2 << 20).selectExpr("id as k2", "id * 3 as v2")
df1.join(df2, col("k1") === col("k2") && col("v1") < col("v2"), "left").collect
```
Generated code
```
/* 001 */ public Object generate(Object[] references) {
/* 002 */ return new GeneratedIteratorForCodegenStage5(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=5
/* 006 */ final class GeneratedIteratorForCodegenStage5 extends 
org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */ private Object[] references;
/* 008 */ private scala.collection.Iterator[] inputs;
/* 009 */ private scala.collection.Iterator smj_leftInput;
/* 010 */ private scala.collection.Iterator smj_rightInput;
/* 011 */ private InternalRow smj_leftRow;
/* 012 */ private InternalRow smj_rightRow;
/* 013 */ private long smj_value2;
/* 014 */ private 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray smj_matches;
/* 015 */ private long smj_value3;
/* 016 */ private long smj_value4;
/* 017 */ private long smj_value5;
/* 018 */ private long smj_value6;
/* 019 */ private boolean smj_isNull2;
/* 020 */ private long smj_value7;
/* 021 */ private boolean smj_isNull3;
/* 022 */ private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder[] 
smj_mutableStateArray1 = new 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder[1];
/* 023 */ private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] 
smj_mutableStateArray2 = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];
/* 024 */ private UnsafeRow[] smj_mutableStateArray = new UnsafeRow[1];
/* 025 */
/* 026 */ public GeneratedIteratorForCodegenStage5(Object[] references) {
/* 027 */ this.references = references;
/* 028 */ }
/* 029 */
/* 030 */ public void init(int index, scala.collection.Iterator[] inputs) {
/* 031 */ partitionIndex = index;
/* 032 */ this.inputs = inputs;
/* 033 */ smj_leftInput = inputs[0];
/* 034 */ smj_rightInput = inputs[1];
/* 035 */
/* 036 */ smj_matches = new 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(2147483647, 
2147483647);
/* 037 */ smj_mutableStateArray[0] = new UnsafeRow(4);
/* 038 */ smj_mutableStateArray1[0] = new 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(smj_mutableStateArray[0],
 0);
/* 039 */ smj_mutableStateArray2[0] = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(smj_mutableStateArray1[0],
 4);
/* 040 */
/* 041 */ }
/* 042 */
/* 043 */ private void writeJoinRows() throws java.io.IOException {
/* 044 */ smj_mutableStateArray2[0].zeroOutNullBytes();
/* 045 */
/* 046 */ smj_mutableStateArray2[0].write(0, smj_value4);
/* 047 */
/* 048 */ smj_mutableStateArray2[0].write(1, smj_value5);
/* 049 */
/* 050 */ if (smj_isNull2) {
/* 051 */ smj_mutableStateArray2[0].setNullAt(2);
/* 052 */ } else {
/* 053 */ smj_mutableStateArray2[0].write(2, smj_value6);
/* 054 */ }
/* 055 */
/* 056 */ if (smj_isNull3) {
/* 057 */ smj_mutableStateArray2[0].setNullAt(3);
/* 058 */ } else {
/* 059 */ smj_mutableStateArray2[0].write(3, smj_value7);
/* 060 */ }
/* 061 */ append(smj_mutableStateArray[0].copy());
/* 062 */
/* 063 */ }
/* 064 */
/* 065 */ private boolean findNextJoinRows(
/* 066 */ scala.collection.Iterator leftIter,
/* 067 */ scala.collection.Iterator rightIter) {
/* 068 */ smj_leftRow = null;
/* 069 */ int comp = 0;
/* 070 */ while (smj_leftRow == null) {
/* 071 */ if (!leftIter.hasNext()) return false;
/* 072 */ smj_leftRow = (InternalRow) leftIter.next();
/* 073 */
/* 

[jira] [Updated] (SPARK-23609) Test EnsureRequirements's test cases to eliminate ShuffleExchange while is not expected

2018-03-06 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-23609:
--
Summary: Test EnsureRequirements's test cases to eliminate ShuffleExchange 
while is not expected  (was: Test code does not conform to the test title)

> Test EnsureRequirements's test cases to eliminate ShuffleExchange while is 
> not expected
> ---
>
> Key: SPARK-23609
> URL: https://issues.apache.org/jira/browse/SPARK-23609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: caoxuewen
>Priority: Minor
>
> Currently, in the EnsureRequirements test cases that are meant to eliminate
> ShuffleExchange, the test code does not match the stated purpose of the test.
> These test cases are as follows:
> 1. test("EnsureRequirements eliminates Exchange if child has same partitioning")
>    The check is only on the number of ShuffleExchange nodes in the physical
> plan (== 2), which is not accurate here.
> 2. test("EnsureRequirements does not eliminate Exchange with different partitioning")
>    The purpose of the test is to not eliminate the ShuffleExchange, but its test
> code is the same as test("EnsureRequirements eliminates Exchange if child has
> same partitioning").






[jira] [Created] (SPARK-23609) Test code does not conform to the test title

2018-03-06 Thread caoxuewen (JIRA)
caoxuewen created SPARK-23609:
-

 Summary: Test code does not conform to the test title
 Key: SPARK-23609
 URL: https://issues.apache.org/jira/browse/SPARK-23609
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 2.4.0
Reporter: caoxuewen


Currently, in the EnsureRequirements test cases that are meant to eliminate 
ShuffleExchange, the test code does not match the stated purpose of the test. These 
test cases are as follows:
1. test("EnsureRequirements eliminates Exchange if child has same partitioning")
   The check is only on the number of ShuffleExchange nodes in the physical plan 
(== 2), which is not accurate here.
2. test("EnsureRequirements does not eliminate Exchange with different partitioning")
   The purpose of the test is to not eliminate the ShuffleExchange, but its test 
code is the same as test("EnsureRequirements eliminates Exchange if child has 
same partitioning").






[jira] [Updated] (SPARK-23544) Remove redundancy ShuffleExchange in the planner

2018-03-04 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-23544:
--
Description: 
Currently, we execute the SQL statement:
select MDTTemp.* from (select * from distinctAgg where a > 2 distribute by a, 
b,c) MDTTemp 
left join (select * from testData5 where f > 1) ProjData 
on MDTTemp.b = ProjData.g and 
 MDTTemp.c = ProjData.h and 
 MDTTemp.d < (ProjData.j - 3) and 
 MDTTemp.d >= (ProjData.j + 3)

the explain output of the physical plan looks like:
== Physical Plan ==
*Project [a#203, b#204, c#205, d#206, e#207]
+- SortMergeJoin [b#204, c#205], [g#222, h#223], LeftOuter, ((d#206 < (j#224 - 
3)) && (d#206 >= (j#224 + 3)))
 :- *Sort [b#204 ASC, c#205 ASC], false, 0
 : +- Exchange hashpartitioning(b#204, c#205, 5)
 : +- Exchange hashpartitioning(a#203, b#204, c#205, 5)
 : +- *Filter (a#203 > 2)
 : +- Scan ExistingRDD[a#203,b#204,c#205,d#206,e#207]
 +- *Sort [g#222 ASC, h#223 ASC], false, 0
 +- Exchange hashpartitioning(g#222, h#223, 5)
 +- *Project [g#222, h#223, j#224]
 +- *Filter (f#221 > 1)
 +- Scan ExistingRDD[f#221,g#222,h#223,j#224,k#225]

There is a redundant ShuffleExchange that is not necessary. This PR provides a rule 
to remove the redundant ShuffleExchange in the planner. Now the explain looks like:

== Physical Plan ==
*Project [a#203, b#204, c#205, d#206, e#207]
+- SortMergeJoin [b#204, c#205], [g#222, h#223], LeftOuter, ((d#206 < (j#224 - 
3)) && (d#206 >= (j#224 + 3)))
 :- *Sort [b#204 ASC, c#205 ASC], false, 0
 : +- Exchange hashpartitioning(b#204, c#205, 5)
 : +- *Filter (a#203 > 2)
 : +- Scan ExistingRDD[a#203,b#204,c#205,d#206,e#207]
 +- *Sort [g#222 ASC, h#223 ASC], false, 0
 +- Exchange hashpartitioning(g#222, h#223, 5)
 +- *Project [g#222, h#223, j#224]
 +- *Filter (f#221 > 1)
 +- Scan ExistingRDD[f#221,g#222,h#223,j#224,k#225]

 

and I have added a test case:

val N = 2 << 20
 runJoinBenchmark("sort merge join", N) {
 val df1 = sparkSession.range(N)
 .selectExpr(s"(id * 15485863) % ${N*10} as k1")
 val df2 = sparkSession.range(N)
 .selectExpr(s"(id * 15485867) % ${N*10} as k2")
 val df = df1.join(df2.repartition(20), col("k1") === col("k2"))
 
assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
 df.count()
 }

 

The performance results are as follows:

Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
 Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
 sort merge join: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
 

 sort merge join Repartition off 3520 / 4364 0.6 1678.5 1.0X
 sort merge join Repartition on 1946 / 2203 1.1 927.9 1.8X

  was:
Currently, when the children of a join are Repartition or RepartitionByExpression, 
the Repartition operation is not necessary. I think we can remove the Repartition 
operation in the Optimizer, and it is safe for the join operation. Now the explain 
looks like:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseRepartition 
===
 
Input LogicalPlan:
Join Inner
:- Repartition 10, false
: +- LocalRelation , [a#0, b#1]
+- Repartition 10, false
 +- LocalRelation , [c#2, d#3]

Output LogicalPlan:
Join Inner
:- LocalRelation , [a#0, b#1]
+- LocalRelation , [c#2, d#3]

 
h3. and I have added a test case:


val N = 2 << 20
runJoinBenchmark("sort merge join", N) {
 val df1 = sparkSession.range(N)
 .selectExpr(s"(id * 15485863) % ${N*10} as k1")
 val df2 = sparkSession.range(N)
 .selectExpr(s"(id * 15485867) % ${N*10} as k2")
 val df = df1.join(df2.repartition(20), col("k1") === col("k2"))
 
assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
 df.count()
}

 

The performance results are as follows:

Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
sort merge join: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative

sort merge join Repartition off 3520 / 4364 0.6 1678.5 1.0X
sort merge join Repartition on 1946 / 2203 1.1 927.9 1.8X


> Remove redundancy ShuffleExchange in the planner
> 
>
> Key: SPARK-23544
> URL: https://issues.apache.org/jira/browse/SPARK-23544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: caoxuewen
>Priority: Major
>
> Currently, we execute the SQL statement:
> select MDTTemp.* from (select * from distinctAgg where a > 2 distribute by a, 
> b,c) MDTTemp 
> left join (select * from testData5 where f > 1) ProjData 
> on MDTTemp.b = ProjData.g and 
>  MDTTemp.c = ProjData.h and 
>  MDTTemp.d < (ProjData.j - 3) and 
>  MDTTemp.d >= (ProjData.j + 3)
> the physical plan the explain 

[jira] [Updated] (SPARK-23544) Remove redundancy ShuffleExchange in the planner

2018-03-04 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-23544:
--
Summary: Remove redundancy ShuffleExchange in the planner  (was: Remove 
repartition operation from join in the optimizer)

> Remove redundancy ShuffleExchange in the planner
> 
>
> Key: SPARK-23544
> URL: https://issues.apache.org/jira/browse/SPARK-23544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: caoxuewen
>Priority: Major
>
> Currently, when the children of a join are Repartition or RepartitionByExpression,
> the Repartition operation is not necessary. I think we can remove the Repartition
> operation in the Optimizer, and it is safe for the join operation. Now the explain
> looks like:
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseRepartition 
> ===
>  
> Input LogicalPlan:
> Join Inner
> :- Repartition 10, false
> : +- LocalRelation , [a#0, b#1]
> +- Repartition 10, false
>  +- LocalRelation , [c#2, d#3]
> Output LogicalPlan:
> Join Inner
> :- LocalRelation , [a#0, b#1]
> +- LocalRelation , [c#2, d#3]
>  
> h3. and I have added a test case:
> val N = 2 << 20
> runJoinBenchmark("sort merge join", N) {
>  val df1 = sparkSession.range(N)
>  .selectExpr(s"(id * 15485863) % ${N*10} as k1")
>  val df2 = sparkSession.range(N)
>  .selectExpr(s"(id * 15485867) % ${N*10} as k2")
>  val df = df1.join(df2.repartition(20), col("k1") === col("k2"))
>  
> assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
>  df.count()
> }
>  
> The performance results are as follows:
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
> Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
> sort merge join: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
> 
> sort merge join Repartition off 3520 / 4364 0.6 1678.5 1.0X
> sort merge join Repartition on 1946 / 2203 1.1 927.9 1.8X






[jira] [Created] (SPARK-23544) Remove repartition operation from join in the optimizer

2018-02-28 Thread caoxuewen (JIRA)
caoxuewen created SPARK-23544:
-

 Summary: Remove repartition operation from join in the optimizer
 Key: SPARK-23544
 URL: https://issues.apache.org/jira/browse/SPARK-23544
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: caoxuewen


Currently, when the children of a join are Repartition or RepartitionByExpression, 
the Repartition operation is not necessary. I think we can remove the Repartition 
operation in the Optimizer, and it is safe for the join operation. Now the explain 
looks like:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseRepartition 
===
 
Input LogicalPlan:
Join Inner
:- Repartition 10, false
: +- LocalRelation , [a#0, b#1]
+- Repartition 10, false
 +- LocalRelation , [c#2, d#3]

Output LogicalPlan:
Join Inner
:- LocalRelation , [a#0, b#1]
+- LocalRelation , [c#2, d#3]

 
h3. and I have added a test case:


val N = 2 << 20
runJoinBenchmark("sort merge join", N) {
 val df1 = sparkSession.range(N)
 .selectExpr(s"(id * 15485863) % ${N*10} as k1")
 val df2 = sparkSession.range(N)
 .selectExpr(s"(id * 15485867) % ${N*10} as k2")
 val df = df1.join(df2.repartition(20), col("k1") === col("k2"))
 
assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[SortMergeJoinExec]).isDefined)
 df.count()
}

 

The performance results are as follows:

Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Windows 7 6.1
Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
sort merge join: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative

sort merge join Repartition off 3520 / 4364 0.6 1678.5 1.0X
sort merge join Repartition on 1946 / 2203 1.1 927.9 1.8X






[jira] [Created] (SPARK-23356) Pushes Project to both sides of Union when expression is non-deterministic

2018-02-07 Thread caoxuewen (JIRA)
caoxuewen created SPARK-23356:
-

 Summary: Pushes Project to both sides of Union when expression is 
non-deterministic
 Key: SPARK-23356
 URL: https://issues.apache.org/jira/browse/SPARK-23356
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: caoxuewen


Currently, the PushProjectionThroughUnion optimizer rule only supports pushing the 
Project operator down to both sides of a Union operator when its expressions are 
deterministic. In fact, just like filter pushdown, we can also push the Project 
operator down to both sides of the Union when an expression is non-deterministic; 
this PR fixes that. Now the explain looks like:

=== Applying Rule 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion ===
Input LogicalPlan:
Project [a#0, rand(10) AS rnd#9]
+- Union
 :- LocalRelation , [a#0, b#1, c#2]
 :- LocalRelation , [d#3, e#4, f#5]
 +- LocalRelation , [g#6, h#7, i#8]

Output LogicalPlan:
Project [a#0, rand(10) AS rnd#9]
+- Union
 :- Project [a#0]
 : +- LocalRelation , [a#0, b#1, c#2]
 :- Project [d#3]
 : +- LocalRelation , [d#3, e#4, f#5]
 +- Project [g#6]
 +- LocalRelation , [g#6, h#7, i#8]
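For reference, a small self-contained query with the same shape as the plans above (a 
non-deterministic rand projection over a Union); the data is arbitrary and explain(true) 
shows whether the Project is pushed into each branch:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

object PushProjectThroughUnionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("push-project-union").getOrCreate()
    import spark.implicits._
    val unioned = Seq((1, 2, 3)).toDF("a", "b", "c")
      .union(Seq((4, 5, 6)).toDF("d", "e", "f"))
      .union(Seq((7, 8, 9)).toDF("g", "h", "i"))
    // A non-deterministic projection on top of the Union, matching the plans above.
    unioned.select($"a", rand(10).as("rnd")).explain(true)
    spark.stop()
  }
}
```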






[jira] [Created] (SPARK-23343) Increase the exception test for the bind port

2018-02-06 Thread caoxuewen (JIRA)
caoxuewen created SPARK-23343:
-

 Summary: Increase the exception test for the bind port
 Key: SPARK-23343
 URL: https://issues.apache.org/jira/browse/SPARK-23343
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: caoxuewen


This PR adds new test cases:
1. a boundary-value test for port 65535,
2. a test for privileged ports,
3. a rebinding test when `spark.port.maxRetries` is set to 1,
4. a test that `Utils.userPort` wraps around when generating ports.

In addition, in the existing test case, if `spark.testing` is not set to true, the 
default value of `spark.port.maxRetries` is 16 rather than 100, so (expectedPort + 100) 
is a small mistake.






[jira] [Created] (SPARK-23311) add FilterFunction test case for test CombineTypedFilters

2018-02-01 Thread caoxuewen (JIRA)
caoxuewen created SPARK-23311:
-

 Summary: add FilterFunction test case for test CombineTypedFilters
 Key: SPARK-23311
 URL: https://issues.apache.org/jira/browse/SPARK-23311
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.0
Reporter: caoxuewen


The current test cases for CombineTypedFilters lack a test using FilterFunction, so 
let's add one. In addition, let's extract a common LocalRelation from 
TypedFilterOptimizationSuite's existing test cases.
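A hedged sketch of the kind of input the missing test needs: two adjacent typed filters, 
one of them written as a FilterFunction, which CombineTypedFilters should collapse into 
a single TypedFilter (illustrative, not the test added by the PR):

```
import org.apache.spark.api.java.function.FilterFunction
import org.apache.spark.sql.SparkSession

object CombineTypedFiltersExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("combine-typed-filters").getOrCreate()
    val ds = spark.range(10)
      // a Java-style FilterFunction followed by a Scala closure filter
      .filter(new FilterFunction[java.lang.Long] { def call(v: java.lang.Long): Boolean = v > 2 })
      .filter((v: java.lang.Long) => v < 8)
    ds.explain(true) // the optimized plan should show the adjacent typed filters combined
    spark.stop()
  }
}
```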






[jira] [Created] (SPARK-23247) combines Unsafe operations and statistics operations in Scan Data Source

2018-01-26 Thread caoxuewen (JIRA)
caoxuewen created SPARK-23247:
-

 Summary: combines Unsafe operations and statistics operations in 
Scan Data Source
 Key: SPARK-23247
 URL: https://issues.apache.org/jira/browse/SPARK-23247
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: caoxuewen


Currently, in the data source scan execution plan, we first apply the unsafe 
projection to each row and then traverse the data again to count the rows. In terms 
of performance, this second pass is not necessary. This PR combines the two operations 
and counts the rows while performing the unsafe projection.

*Before modified:*

val unsafeRow = rdd.mapPartitionsWithIndexInternal { (index, iter) =>
  val proj = UnsafeProjection.create(schema)
  proj.initialize(index)
  iter.map(proj)
}

val numOutputRows = longMetric("numOutputRows")
unsafeRow.map { r =>
  numOutputRows += 1
  r
}

*After modified:*

val numOutputRows = longMetric("numOutputRows")

rdd.mapPartitionsWithIndexInternal { (index, iter) =>
  val proj = UnsafeProjection.create(schema)
  proj.initialize(index)
  iter.map { r =>
    numOutputRows += 1
    proj(r)
  }
}

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23199) improved Removes repetition from group expressions in Aggregate

2018-01-23 Thread caoxuewen (JIRA)
caoxuewen created SPARK-23199:
-

 Summary: improved Removes repetition from group expressions in 
Aggregate
 Key: SPARK-23199
 URL: https://issues.apache.org/jira/browse/SPARK-23199
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: caoxuewen


Currently, all Aggregate operations go through RemoveRepetitionFromGroupExpressions, but 
when there are no grouping expressions, or no duplicate grouping expressions, we do not 
need to copy the logical plan (see the toy sketch below).
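
A standalone toy sketch of that guard, with plain strings standing in for the grouping 
expressions (this is not the actual catalyst rule):

```
// Toy sketch, not the real RemoveRepetitionFromGroupExpressions rule: skip the copy of
// the plan node when removing duplicates from the grouping expressions changes nothing.
object RemoveRepetitionSketch {
  // Stand-in for a logical Aggregate; grouping expressions are just strings here.
  case class Aggregate(groupingExpressions: Seq[String], aggregateExpressions: Seq[String])

  def removeRepetition(a: Aggregate): Aggregate = {
    val newGrouping = a.groupingExpressions.distinct
    if (newGrouping.length == a.groupingExpressions.length) {
      a                                            // no duplicates: keep the original node
    } else {
      a.copy(groupingExpressions = newGrouping)    // copy only when something was removed
    }
  }

  def main(args: Array[String]): Unit = {
    println(removeRepetition(Aggregate(Seq("k", "k"), Seq("sum(id)"))))  // deduplicated copy
    println(removeRepetition(Aggregate(Seq("k"), Seq("sum(id)"))))       // returned unchanged
  }
}
```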



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22035) the value of statistical logicalPlan.stats.sizeInBytes which is not expected

2017-09-16 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-22035:
--
Summary: the value of statistical logicalPlan.stats.sizeInBytes which is 
not expected  (was: Improving the value of statistical 
logicalPlan.stats.sizeInBytes which is not expected)

> the value of statistical logicalPlan.stats.sizeInBytes which is not expected
> 
>
> Key: SPARK-22035
> URL: https://issues.apache.org/jira/browse/SPARK-22035
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> Currently, assume there will be the same number of rows as child has.  
> statistics `logicalPlan.stats.SizeInBytes` is calculated based on the 
> percentage of the size of the child's data type and the size of the current 
> data type size.   But there is a problem. Statistics are not very accurate. 
> for example:
>  ```
> val N = 1 << 3
> val df = spark.range(N).selectExpr("id as k1",
>   s"cast(id  % 3 as string) as idString1",
>   s"cast((id + 1) % 5 as string) as idString3")
> val sizeInBytes = df.logicalPlan.stats.sizeInBytes
> println("sizeInBytes : " + sizeInBytes)
> ```
> before modify:
> sizeInBytes is 224(8 * 8 * ( (8 + 20 + 20 +8) / (8 + 8))).  
> debug information in ` SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode `
> ```
> p.child.dataType: LongType defaultSize: 8
> p.dataType: LongType defaultSize: 8
> p.dataType: StringType defaultSize: 20
> p.dataType: StringType defaultSize: 20
> childRowSize: 16 outputRowSize: 56
> p.child.stats.sizeInBytes : 64
> p.stats.sizeInBytes : 224
> sizeInBytes: 224
> ```
> but sizeInBytes  must be  384( 8 * (8 + 20 + 20) ).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22035) Improving the value of statistical logicalPlan.stats.sizeInBytes which is not expected

2017-09-16 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-22035:
--
Description: 
Currently, we assume a node produces the same number of rows as its child. The statistic 
`logicalPlan.stats.sizeInBytes` is then scaled by the ratio of the current node's row size 
to the child's row size, both derived from the default sizes of the output data types. 
The problem is that this statistic is not very accurate. For example:
 ```
val N = 1 << 3
val df = spark.range(N).selectExpr("id as k1",
  s"cast(id  % 3 as string) as idString1",
  s"cast((id + 1) % 5 as string) as idString3")
val sizeInBytes = df.logicalPlan.stats.sizeInBytes
println("sizeInBytes : " + sizeInBytes)
```
Before the change:
sizeInBytes is 224 (8 * 8 * ((8 + 20 + 20 + 8) / (8 + 8))).
Debug information in `SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode`:
```
p.child.dataType: LongType defaultSize: 8
p.dataType: LongType defaultSize: 8
p.dataType: StringType defaultSize: 20
p.dataType: StringType defaultSize: 20
childRowSize: 16 outputRowSize: 56
p.child.stats.sizeInBytes : 64
p.stats.sizeInBytes : 224
sizeInBytes: 224
```

But sizeInBytes must be 384 (8 * (8 + 20 + 20)).


  was:
Currently, assume there will be the same number of rows as child has.  
statistics `logicalPlan.stats.SizeInBytes` is calculated based on the 
percentage of the size of the child's data type and the size of the current 
data type size.   But there is a problem. Statistics are not very accurate. for 
example:
 ```
val N = 1 << 3
val df = spark.range(N).selectExpr("id as k1",
  s"cast(id  % 3 as string) as idString1",
  s"cast((id + 1) % 5 as string) as idString3")
val sizeInBytes = df.logicalPlan.stats.sizeInBytes
println("sizeInBytes : " + sizeInBytes)
```
before modify:
sizeInBytes is 224(8 * 8 * ( (8 + 20 + 20 +8) / (8 + 8))).  
debug information in ` SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode `
```
p.child.dataType: LongType defaultSize: 8
p.dataType: LongType defaultSize: 8
p.dataType: StringType defaultSize: 20
p.dataType: StringType defaultSize: 20
childRowSize: 16 outputRowSize: 56
p.child.stats.sizeInBytes : 64
p.stats.sizeInBytes : 224
sizeInBytes: 224
```

after modify:
sizeInBytes  is 384( 8 * 8 ((8 + 20 + 20) / 8) ).
debug information in ` SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode `
```
p.child.dataType: LongType defaultSize: 8
p.dataType: LongType defaultSize: 8
p.dataType: StringType defaultSize: 20
p.dataType: StringType defaultSize: 20
childRowSize: 8 outputRowSize: 48
p.child.stats.sizeInBytes : 64
p.stats.sizeInBytes : 384
sizeInBytes: 384
```


> Improving the value of statistical logicalPlan.stats.sizeInBytes which is not 
> expected
> --
>
> Key: SPARK-22035
> URL: https://issues.apache.org/jira/browse/SPARK-22035
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> Currently, assume there will be the same number of rows as child has.  
> statistics `logicalPlan.stats.SizeInBytes` is calculated based on the 
> percentage of the size of the child's data type and the size of the current 
> data type size.   But there is a problem. Statistics are not very accurate. 
> for example:
>  ```
> val N = 1 << 3
> val df = spark.range(N).selectExpr("id as k1",
>   s"cast(id  % 3 as string) as idString1",
>   s"cast((id + 1) % 5 as string) as idString3")
> val sizeInBytes = df.logicalPlan.stats.sizeInBytes
> println("sizeInBytes : " + sizeInBytes)
> ```
> before modify:
> sizeInBytes is 224(8 * 8 * ( (8 + 20 + 20 +8) / (8 + 8))).  
> debug information in ` SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode `
> ```
> p.child.dataType: LongType defaultSize: 8
> p.dataType: LongType defaultSize: 8
> p.dataType: StringType defaultSize: 20
> p.dataType: StringType defaultSize: 20
> childRowSize: 16 outputRowSize: 56
> p.child.stats.sizeInBytes : 64
> p.stats.sizeInBytes : 224
> sizeInBytes: 224
> ```
> but sizeInBytes  must be  384( 8 * (8 + 20 + 20) ).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22035) Improving the value of statistical logicalPlan.stats.sizeInBytes which is not expected

2017-09-16 Thread caoxuewen (JIRA)
caoxuewen created SPARK-22035:
-

 Summary: Improving the value of statistical 
logicalPlan.stats.sizeInBytes which is not expected
 Key: SPARK-22035
 URL: https://issues.apache.org/jira/browse/SPARK-22035
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: caoxuewen


Currently, we assume a node produces the same number of rows as its child. The statistic 
`logicalPlan.stats.sizeInBytes` is then scaled by the ratio of the current node's row size 
to the child's row size, both derived from the default sizes of the output data types. 
The problem is that this statistic is not very accurate. For example:
 ```
val N = 1 << 3
val df = spark.range(N).selectExpr("id as k1",
  s"cast(id  % 3 as string) as idString1",
  s"cast((id + 1) % 5 as string) as idString3")
val sizeInBytes = df.logicalPlan.stats.sizeInBytes
println("sizeInBytes : " + sizeInBytes)
```
Before the change:
sizeInBytes is 224 (8 * 8 * ((8 + 20 + 20 + 8) / (8 + 8))).
Debug information in `SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode`:
```
p.child.dataType: LongType defaultSize: 8
p.dataType: LongType defaultSize: 8
p.dataType: StringType defaultSize: 20
p.dataType: StringType defaultSize: 20
childRowSize: 16 outputRowSize: 56
p.child.stats.sizeInBytes : 64
p.stats.sizeInBytes : 224
sizeInBytes: 224
```

After the change:
sizeInBytes is 384 (8 * 8 * ((8 + 20 + 20) / 8)).
Debug information in `SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode`:
```
p.child.dataType: LongType defaultSize: 8
p.dataType: LongType defaultSize: 8
p.dataType: StringType defaultSize: 20
p.dataType: StringType defaultSize: 20
childRowSize: 8 outputRowSize: 48
p.child.stats.sizeInBytes : 64
p.stats.sizeInBytes : 384
sizeInBytes: 384
```
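
The arithmetic behind the two numbers, as a standalone sketch (plain Scala, not the 
visitor code):

```
// Standalone arithmetic sketch mirroring the debug output above, not the visitor code:
// 8 rows whose output is (long, string, string), with default sizes 8 and 20 bytes.
object SizeInBytesEstimateSketch {
  def main(args: Array[String]): Unit = {
    val rows = 8
    val longSize = 8                        // LongType defaultSize
    val stringSize = 20                     // StringType defaultSize
    val childSizeInBytes = rows * longSize  // 64: the child only outputs `id`

    // Current estimate: both row sizes carry an extra 8 bytes, giving 64 * 56 / 16 = 224.
    val childRowSize = longSize + 8
    val outputRowSize = longSize + stringSize + stringSize + 8
    println(childSizeInBytes * outputRowSize / childRowSize)   // 224

    // Value argued for in this issue: 8 rows * (8 + 20 + 20) bytes = 384.
    println(rows * (longSize + stringSize + stringSize))       // 384
  }
}
```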



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21963) create temp file should be delete after use

2017-09-09 Thread caoxuewen (JIRA)
caoxuewen created SPARK-21963:
-

 Summary: create temp file should be delete after use
 Key: SPARK-21963
 URL: https://issues.apache.org/jira/browse/SPARK-21963
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 2.3.0
Reporter: caoxuewen


After you create a temporary file, you need to delete it after use; otherwise it leaves 
behind a file with a name like ‘SPARK194465907929586320484966temp’. A minimal cleanup 
sketch follows.
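
A minimal, self-contained cleanup sketch (plain JDK calls, not the Spark test utilities):

```
import java.io.File
import java.nio.file.Files

// Minimal sketch using plain JDK calls, not the Spark test utilities: always delete the
// temp file after use so nothing like 'SPARK...temp' is left on disk, even on failure.
object TempFileCleanupSketch {
  def withTempFile[T](prefix: String)(body: File => T): T = {
    val file = Files.createTempFile(prefix, "temp").toFile
    try {
      body(file)
    } finally {
      file.delete()                          // cleanup runs even if body throws
    }
  }

  def main(args: Array[String]): Unit = {
    val length = withTempFile("SPARK") { file =>
      Files.write(file.toPath, "some test data".getBytes("UTF-8"))
      file.length()
    }
    println(s"wrote $length bytes; the temp file has already been removed")
  }
}
```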




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark

2017-09-07 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen closed SPARK-21932.
-
Resolution: Resolved

> Add test CartesianProduct join case and BroadcastNestedLoop join case in 
> JoinBenchmark
> --
>
> Key: SPARK-21932
> URL: https://issues.apache.org/jira/browse/SPARK-21932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>Priority: Minor
>
> this issue to fix two problems:
>   1. add new two test case. test CartesianProduct join and 
> BroadcastNestedLoop join. We understand the effect of CodeGen on them. and 
> through these two test cases, we can know Per Row (NS), is much more delayed 
> than our previous hash join.
>   2.similar 'logical.Join', is not right, to be consistent in 
> SparkStrategies.scala. fix to remove package name.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark

2017-09-06 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21932:
--
Priority: Minor  (was: Major)

> Add test CartesianProduct join case and BroadcastNestedLoop join case in 
> JoinBenchmark
> --
>
> Key: SPARK-21932
> URL: https://issues.apache.org/jira/browse/SPARK-21932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>Priority: Minor
>
> this issue to fix two problems:
>   1. add new two test case. test CartesianProduct join and 
> BroadcastNestedLoop join. We understand the effect of CodeGen on them. and 
> through these two test cases, we can know Per Row (NS), is much more delayed 
> than our previous hash join.
>   2.similar 'logical.Join', is not right, to be consistent in 
> SparkStrategies.scala. fix to remove package name.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark

2017-09-06 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21932:
--
Description: 
This issue fixes two problems:
  1. Add two new test cases that exercise CartesianProduct join and BroadcastNestedLoop 
join, so we understand the effect of codegen on them. Through these two test cases we can 
also see that their Per Row (NS) is much higher than that of the previous hash join cases.
  2. Writing 'logical.Join' is inconsistent with the rest of SparkStrategies.scala; fix it 
by removing the package prefix.

  was:
this issue to fix two problems:
  1. similar 'logical.Join', is not right, to be consistent in 
SparkStrategies.scala. fix to remove package name.
  2. add two test case. test CartesianProduct join and BroadcastNestedLoop 
join. We understand the effect of CodeGen on them.


> Add test CartesianProduct join case and BroadcastNestedLoop join case in 
> JoinBenchmark
> --
>
> Key: SPARK-21932
> URL: https://issues.apache.org/jira/browse/SPARK-21932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> this issue to fix two problems:
>   1. add new two test case. test CartesianProduct join and 
> BroadcastNestedLoop join. We understand the effect of CodeGen on them. and 
> through these two test cases, we can know Per Row (NS), is much more delayed 
> than our previous hash join.
>   2.similar 'logical.Join', is not right, to be consistent in 
> SparkStrategies.scala. fix to remove package name.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark

2017-09-06 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen reopened SPARK-21932:
---

> Add test CartesianProduct join case and BroadcastNestedLoop join case in 
> JoinBenchmark
> --
>
> Key: SPARK-21932
> URL: https://issues.apache.org/jira/browse/SPARK-21932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> this issue to fix two problems:
>   1. add new two test case. test CartesianProduct join and 
> BroadcastNestedLoop join. We understand the effect of CodeGen on them. and 
> through these two test cases, we can know Per Row (NS), is much more delayed 
> than our previous hash join.
>   2.similar 'logical.Join', is not right, to be consistent in 
> SparkStrategies.scala. fix to remove package name.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21932) Add test CartesianProduct join case and BroadcastNestedLoop join case in JoinBenchmark

2017-09-06 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21932:
--
Summary: Add test CartesianProduct join case and BroadcastNestedLoop join 
case in JoinBenchmark  (was: remove package name similar 'logical.Join' to 
'Join' in JoinSelection)

> Add test CartesianProduct join case and BroadcastNestedLoop join case in 
> JoinBenchmark
> --
>
> Key: SPARK-21932
> URL: https://issues.apache.org/jira/browse/SPARK-21932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> this issue to fix two problems:
>   1. similar 'logical.Join', is not right, to be consistent in 
> SparkStrategies.scala. fix to remove package name.
>   2. add two test case. test CartesianProduct join and BroadcastNestedLoop 
> join. We understand the effect of CodeGen on them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21932) remove package name similar 'logical.Join' to 'Join' in JoinSelection

2017-09-06 Thread caoxuewen (JIRA)
caoxuewen created SPARK-21932:
-

 Summary: remove package name similar 'logical.Join' to 'Join' in 
JoinSelection
 Key: SPARK-21932
 URL: https://issues.apache.org/jira/browse/SPARK-21932
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 2.3.0
Reporter: caoxuewen


This issue fixes two problems:
  1. Writing 'logical.Join' is inconsistent with the rest of SparkStrategies.scala; fix it 
by removing the package prefix.
  2. Add two test cases that exercise CartesianProduct join and BroadcastNestedLoop join, 
so we understand the effect of codegen on them. A minimal sketch of the two join shapes 
follows.
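
A minimal sketch (illustrative data and session setup, not the JoinBenchmark code) of the 
two join shapes the new cases exercise:

```
import org.apache.spark.sql.SparkSession

// Minimal sketch with illustrative data, not the JoinBenchmark code: the two join shapes
// the new benchmark cases exercise.
object JoinShapesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("sketch")
      .config("spark.sql.crossJoin.enabled", "true")   // allow the cartesian product
      .getOrCreate()
    import spark.implicits._

    val left = spark.range(1000).toDF("a")
    val right = spark.range(1000).toDF("b")

    // CartesianProduct: a cross join with no join condition at all.
    val cartesian = left.crossJoin(right)

    // BroadcastNestedLoopJoin: a non-equi condition with a broadcastable side.
    val bnl = left.join(right, $"a" < $"b")

    println(cartesian.count())
    println(bnl.count())
    spark.stop()
  }
}
```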



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21746) nondeterministic expressions incorrectly for filter predicates

2017-09-03 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen closed SPARK-21746.
-
Resolution: Not A Problem

> nondeterministic expressions incorrectly for filter predicates
> --
>
> Key: SPARK-21746
> URL: https://issues.apache.org/jira/browse/SPARK-21746
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> Currently, We do interpretedpredicate optimization, but not very well, 
> because when our filter contained an indeterminate expression, it would have 
> an exception. This PR describes solving this problem by adding the initialize 
> method in InterpretedPredicate.
> java.lang.IllegalArgumentException:
> java.lang.IllegalArgumentException: requirement failed: Nondeterministic 
> expression org.apache.spark.sql.catalyst.expressions.Rand should be 
> initialized before eval.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:291)
>   at 
> org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:415)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:38)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$$anonfun$prunePartitionsByFilter$1.apply(ExternalCatalogUtils.scala:158)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$$anonfun$prunePartitionsByFilter$1.apply(ExternalCatalogUtils.scala:157)
>   at scala.collection.immutable.Stream.filter(Stream.scala:519)
>   at scala.collection.immutable.Stream.filter(Stream.scala:202)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.prunePartitionsByFilter(ExternalCatalogUtils.scala:157)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1129)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1119)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1119)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:925)
>   at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21746) nondeterministic expressions incorrectly for filter predicates

2017-08-16 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21746:
--
Summary: nondeterministic expressions incorrectly for filter predicates  
(was: nondeterministic expressions correctly for filter predicates)

> nondeterministic expressions incorrectly for filter predicates
> --
>
> Key: SPARK-21746
> URL: https://issues.apache.org/jira/browse/SPARK-21746
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> Currently, We do interpretedpredicate optimization, but not very well, 
> because when our filter contained an indeterminate expression, it would have 
> an exception. This PR describes solving this problem by adding the initialize 
> method in InterpretedPredicate.
> java.lang.IllegalArgumentException:
> java.lang.IllegalArgumentException: requirement failed: Nondeterministic 
> expression org.apache.spark.sql.catalyst.expressions.Rand should be 
> initialized before eval.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:291)
>   at 
> org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:415)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:38)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$$anonfun$prunePartitionsByFilter$1.apply(ExternalCatalogUtils.scala:158)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$$anonfun$prunePartitionsByFilter$1.apply(ExternalCatalogUtils.scala:157)
>   at scala.collection.immutable.Stream.filter(Stream.scala:519)
>   at scala.collection.immutable.Stream.filter(Stream.scala:202)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.prunePartitionsByFilter(ExternalCatalogUtils.scala:157)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1129)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1119)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1119)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:925)
>   at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21746) nondeterministic expressions correctly for filter predicates

2017-08-16 Thread caoxuewen (JIRA)
caoxuewen created SPARK-21746:
-

 Summary: nondeterministic expressions correctly for filter 
predicates
 Key: SPARK-21746
 URL: https://issues.apache.org/jira/browse/SPARK-21746
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: caoxuewen


Currently, we do InterpretedPredicate optimization, but not very well: when the filter 
contains a nondeterministic expression, evaluation throws an exception. This PR solves the 
problem by adding an initialize method to InterpretedPredicate (see the sketch after the 
stack trace below).

java.lang.IllegalArgumentException:
java.lang.IllegalArgumentException: requirement failed: Nondeterministic 
expression org.apache.spark.sql.catalyst.expressions.Rand should be initialized 
before eval.
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:291)
at 
org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:415)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:38)
at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$$anonfun$prunePartitionsByFilter$1.apply(ExternalCatalogUtils.scala:158)
at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$$anonfun$prunePartitionsByFilter$1.apply(ExternalCatalogUtils.scala:157)
at scala.collection.immutable.Stream.filter(Stream.scala:519)
at scala.collection.immutable.Stream.filter(Stream.scala:202)
at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.prunePartitionsByFilter(ExternalCatalogUtils.scala:157)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1129)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1119)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1119)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:925)
at 
org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:60)
at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27)
at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
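
The change described above amounts to something like the following toy sketch (standalone 
classes, not the catalyst Expression/InterpretedPredicate): walk the predicate's expression 
tree and initialize every nondeterministic node before the first eval.

```
// Toy sketch with standalone classes, not the catalyst Expression/InterpretedPredicate:
// the point is that nondeterministic nodes must be initialized before the first eval.
object InitializeBeforeEvalSketch {
  trait Expr { def children: Seq[Expr] = Nil; def eval(): Any }

  // Stand-in for a nondeterministic expression such as Rand.
  final class Rand(seed: Long) extends Expr {
    private var rng: java.util.Random = _
    def initialize(partitionIndex: Int): Unit = { rng = new java.util.Random(seed + partitionIndex) }
    override def eval(): Any = {
      require(rng != null, "Nondeterministic expression should be initialized before eval.")
      rng.nextDouble()
    }
  }

  final class LessThan(child: Expr, threshold: Double) extends Expr {
    override def children: Seq[Expr] = Seq(child)
    override def eval(): Any = child.eval().asInstanceOf[Double] < threshold
  }

  // The missing piece: an initialize that visits the whole tree before any eval.
  def initialize(e: Expr, partitionIndex: Int): Unit = {
    e match {
      case r: Rand => r.initialize(partitionIndex)
      case _ =>
    }
    e.children.foreach(initialize(_, partitionIndex))
  }

  def main(args: Array[String]): Unit = {
    val predicate = new LessThan(new Rand(42L), 0.5)
    initialize(predicate, partitionIndex = 0)   // skipping this call makes eval() throw
    println(predicate.eval())                   // true or false, but no exception
  }
}
```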






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21707) Improvement a special case for non-deterministic filters in optimizer

2017-08-11 Thread caoxuewen (JIRA)
caoxuewen created SPARK-21707:
-

 Summary: Improvement a special case for non-deterministic filters 
in optimizer
 Key: SPARK-21707
 URL: https://issues.apache.org/jira/browse/SPARK-21707
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: caoxuewen


Currently, the optimizer does a lot of special handling for non-deterministic projects and 
filters, but it is not good enough. This patch adds a new special case for 
non-deterministic filters, so that in the optimizer we only read the fields the user 
actually needs under a non-deterministic filter.
For example, when the filter condition is nondeterministic, e.g. it contains a 
nondeterministic function (the rand function), the HiveTableScans optimizer generates:

```
HiveTableScans plan:Aggregate [k#2L], [k#2L, k#2L, sum(cast(id#1 as bigint)) AS 
sum(id)#395L]
+- Project [d004#205 AS id#1, CEIL(c010#214) AS k#2L]
   +- Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && 
NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
  +- MetastoreRelation XXX_database, XXX_table

HiveTableScans plan:Project [d004#205 AS id#1, CEIL(c010#214) AS k#2L]
+- Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) <= 0.5)) && NOT 
(cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
   +- MetastoreRelation XXX_database, XXX_table

HiveTableScans plan:Filter ((isnotnull(d004#205) && (rand(-4530215890880734772) 
<= 0.5)) && NOT (cast(cast(d004#205 as decimal(10,0)) as decimal(11,1)) = 0.0))
+- MetastoreRelation XXX_database, XXX_table

HiveTableScans plan:MetastoreRelation XXX_database, XXX_table

```
So HiveTableScan reads all the fields of the table, but we only need ‘d004’ and 'c010'. 
This hurts the performance of the task. A minimal local reproduction follows.
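
A minimal local reproduction of this plan shape (hypothetical column names and a temp view 
instead of the Hive table):

```
import org.apache.spark.sql.SparkSession

// Minimal sketch with hypothetical column names and a temp view instead of the Hive
// table: a nondeterministic rand() filter over a wide relation of which the query
// only needs two columns.
object NondeterministicFilterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()

    spark.range(100)
      .selectExpr("id AS d004", "id + 1 AS c010", "id + 2 AS unused1", "id + 3 AS unused2")
      .createOrReplaceTempView("xxx_table")

    val df = spark.sql(
      "SELECT CEIL(c010) AS k, SUM(d004) AS sum_id FROM xxx_table " +
        "WHERE d004 IS NOT NULL AND rand() <= 0.5 GROUP BY CEIL(c010)")

    // Inspect whether the scan still lists the unused columns.
    df.explain(true)
    spark.stop()
  }
}
```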





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects in optimizer

2017-08-11 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21520:
--
Description: 
Currently, the optimizer does a lot of special handling for non-deterministic projects and 
filters, but it is not good enough. This patch adds a new special case for 
non-deterministic projects, so that in the optimizer we only read the fields the user 
actually needs under a non-deterministic project.
For example, when the fields of the project contain a nondeterministic function (the rand 
function), the optimizer generates the following executedPlan:

*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], 
output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS 
k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table

HiveTableScan reads all the fields of the table, but we only need ‘d004’. This hurts the 
performance of the task.


  was:
Currently, Did a lot of special handling for non-deterministic projects and 
filters in optimizer. but not good enough. this patch add a new special case 
for non-deterministic projects and filters. Deal with that we only need to read 
user needs fields for non-deterministic projects and filters in optimizer.
For example, the fields of project contains nondeterministic function(rand 
function), after a executedPlan optimizer generated:

*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], 
output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS 
k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table

HiveTableScan will read all the fields from table. but we only need to ‘d004’ . 
it will affect the performance of task.



> Improvement a special case for non-deterministic projects in optimizer
> --
>
> Key: SPARK-21520
> URL: https://issues.apache.org/jira/browse/SPARK-21520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> Currently, Did a lot of special handling for non-deterministic projects and 
> filters in optimizer. but not good enough. this patch add a new special case 
> for non-deterministic projects. Deal with that we only need to read user 
> needs fields for non-deterministic projects in optimizer.
> For example, the fields of project contains nondeterministic function(rand 
> function), after a executedPlan optimizer generated:
> *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
> bigint))], output=[k#403L, sum#800L])
> +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) 
> AS k#403L]
>+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
> d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
> d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, 
> c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], 
> MetastoreRelation XXX_database, XXX_table
> HiveTableScan will read all the fields from table. but we only need to ‘d004’ 
> . it will affect the performance of task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects in optimizer

2017-08-11 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21520:
--
Summary: Improvement a special case for non-deterministic projects in 
optimizer  (was: Improvement a special case for non-deterministic projects and 
filters in optimizer)

> Improvement a special case for non-deterministic projects in optimizer
> --
>
> Key: SPARK-21520
> URL: https://issues.apache.org/jira/browse/SPARK-21520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> Currently, Did a lot of special handling for non-deterministic projects and 
> filters in optimizer. but not good enough. this patch add a new special case 
> for non-deterministic projects and filters. Deal with that we only need to 
> read user needs fields for non-deterministic projects and filters in 
> optimizer.
> For example, the fields of project contains nondeterministic function(rand 
> function), after a executedPlan optimizer generated:
> *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
> bigint))], output=[k#403L, sum#800L])
> +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) 
> AS k#403L]
>+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
> d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
> d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, 
> c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], 
> MetastoreRelation XXX_database, XXX_table
> HiveTableScan will read all the fields from table. but we only need to ‘d004’ 
> . it will affect the performance of task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21705) Add spark.internal.config parameter description

2017-08-11 Thread caoxuewen (JIRA)
caoxuewen created SPARK-21705:
-

 Summary: Add spark.internal.config parameter description 
 Key: SPARK-21705
 URL: https://issues.apache.org/jira/browse/SPARK-21705
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: caoxuewen


Currently, some of the configuration parameters in spark.internal.config are defined 
without a description, which is incorrect. This PR supplements the spark.internal.config 
parameter descriptions, based on the Property Name entries at 
http://spark.apache.org/docs/latest/configuration.html (an illustrative entry is sketched 
below).
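
For context, a purely illustrative entry (not one of the parameters this PR touches) 
showing where the description lives in the internal config DSL; note that ConfigBuilder is 
private[spark], so this only compiles inside the org.apache.spark package hierarchy:

```
package org.apache.spark.internal.config   // ConfigBuilder is private[spark]

import java.util.concurrent.TimeUnit

// Illustrative entry only -- not one of the parameters this PR adds text for. The point
// is simply that .doc(...) is where a spark.internal.config description is attached.
private[spark] object ExampleConfigSketch {
  val EXAMPLE_WAIT_TIMEOUT = ConfigBuilder("spark.example.waitTimeout")
    .doc("How long to wait for the example service, copied from the configuration page.")
    .timeConf(TimeUnit.SECONDS)
    .createWithDefault(120L)
}
```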



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer

2017-08-10 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21520:
--
Description: 
Currently, Did a lot of special handling for non-deterministic projects and 
filters in optimizer. but not good enough. this patch add a new special case 
for non-deterministic projects and filters. Deal with that we only need to read 
user needs fields for non-deterministic projects and filters in optimizer.
For example, the fields of project contains nondeterministic function(rand 
function), after a executedPlan optimizer generated:

*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], 
output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS 
k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table

HiveTableScan will read all the fields from table. but we only need to ‘d004’ . 
it will affect the performance of task.


  was:
Currently, Did a lot of special handling for non-deterministic projects and 
filters in optimizer. but not good enough. this patch add a new special case 
for non-deterministic projects and filters. Deal with that we only need to read 
user needs fields for non-deterministic projects and filters in optimizer.



> Improvement a special case for non-deterministic projects and filters in 
> optimizer
> --
>
> Key: SPARK-21520
> URL: https://issues.apache.org/jira/browse/SPARK-21520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> Currently, Did a lot of special handling for non-deterministic projects and 
> filters in optimizer. but not good enough. this patch add a new special case 
> for non-deterministic projects and filters. Deal with that we only need to 
> read user needs fields for non-deterministic projects and filters in 
> optimizer.
> For example, the fields of project contains nondeterministic function(rand 
> function), after a executedPlan optimizer generated:
> *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
> bigint))], output=[k#403L, sum#800L])
> +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) 
> AS k#403L]
>+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
> d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
> d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, 
> c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], 
> MetastoreRelation XXX_database, XXX_table
> HiveTableScan will read all the fields from table. but we only need to ‘d004’ 
> . it will affect the performance of task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer

2017-08-09 Thread caoxuewen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119617#comment-16119617
 ] 

caoxuewen commented on SPARK-21520:
---

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/18892

> Improvement a special case for non-deterministic projects and filters in 
> optimizer
> --
>
> Key: SPARK-21520
> URL: https://issues.apache.org/jira/browse/SPARK-21520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> Currently, Did a lot of special handling for non-deterministic projects and 
> filters in optimizer. but not good enough. this patch add a new special case 
> for non-deterministic projects and filters. Deal with that we only need to 
> read user needs fields for non-deterministic projects and filters in 
> optimizer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer

2017-08-09 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21520:
--
Description: 
Currently, Did a lot of special handling for non-deterministic projects and 
filters in optimizer. but not good enough. this patch add a new special case 
for non-deterministic projects and filters. Deal with that we only need to read 
user needs fields for non-deterministic projects and filters in optimizer.


  was:
Currently, when the rand function is present in the SQL statement, hivetable 
searches all columns in the table.
e.g:
select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, 
ceil(c010) as cceila from XXX_table) a
group by k,k;

generate WholeStageCodegen subtrees:
== Subtree 1 / 2 ==
*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], 
output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS 
k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table
== Subtree 2 / 2 ==
*HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], 
output=[k#403L, k#403L, sum(id)#797L])
+- Exchange hashpartitioning(k#403L, 200)
   +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
bigint))], output=[k#403L, sum#800L])
  +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 
1.0)) AS k#403L]
 +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table
 
All columns will be searched in HiveTableScans , Consequently, All column data 
is read to a ORC table.
e.g:
INFO ReaderImpl: Reading ORC rows from 
hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 with 
{include: [true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true], offset: 0, length: 9223372036854775807}

so, The execution of the SQL statement will become very slow.



> Improvement a special case for non-deterministic projects and filters in 
> optimizer
> --
>
> Key: SPARK-21520
> URL: https://issues.apache.org/jira/browse/SPARK-21520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> Currently, Did a lot of special handling for non-deterministic projects and 
> filters in optimizer. but not good enough. this patch add a new special case 
> for non-deterministic projects and filters. Deal with that we only need to 
> read user needs fields for non-deterministic projects and filters in 
> optimizer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21520) Improvement a special case for non-deterministic projects and filters in optimizer

2017-08-09 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21520:
--
Summary: Improvement a special case for non-deterministic projects and 
filters in optimizer  (was: Hivetable scan for all the columns the SQL 
statement contains the 'rand')

> Improvement a special case for non-deterministic projects and filters in 
> optimizer
> --
>
> Key: SPARK-21520
> URL: https://issues.apache.org/jira/browse/SPARK-21520
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>
> Currently, when the rand function is present in the SQL statement, hivetable 
> searches all columns in the table.
> e.g:
> select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, 
> ceil(c010) as cceila from XXX_table) a
> group by k,k;
> generate WholeStageCodegen subtrees:
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
> bigint))], output=[k#403L, sum#800L])
> +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) 
> AS k#403L]
>+- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
> d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
> d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, 
> c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], 
> MetastoreRelation XXX_database, XXX_table
> == Subtree 2 / 2 ==
> *HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], 
> output=[k#403L, k#403L, sum(id)#797L])
> +- Exchange hashpartitioning(k#403L, 200)
>+- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
> bigint))], output=[k#403L, sum#800L])
>   +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 
> 1.0)) AS k#403L]
>  +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
> d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
> d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, 
> c023#625, c024#626, c025#627, c026#628, c027#629, ... 169 more fields], 
> MetastoreRelation XXX_database, XXX_table
>
> All columns will be searched in HiveTableScans , Consequently, All column 
> data is read to a ORC table.
> e.g:
> INFO ReaderImpl: Reading ORC rows from 
> hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 
> with {include: [true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true, true, true, true, true, true, true, 
> true, true, true, true, true, true, true], offset: 0, length: 
> 9223372036854775807}
> so, The execution of the SQL statement will become very slow.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21520) Hivetable scan for all the columns the SQL statement contains the 'rand'

2017-07-24 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-21520:
--
Description: 
Currently, when the rand function is present in the SQL statement, hivetable 
searches all columns in the table.
e.g:
select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, 
ceil(c010) as cceila from XXX_table) a
group by k,k;

generate WholeStageCodegen subtrees:
== Subtree 1 / 2 ==
*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], 
output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS 
k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table
== Subtree 2 / 2 ==
*HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], 
output=[k#403L, k#403L, sum(id)#797L])
+- Exchange hashpartitioning(k#403L, 200)
   +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
bigint))], output=[k#403L, sum#800L])
  +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 
1.0)) AS k#403L]
 +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table
 
All columns will be searched in HiveTableScans , Consequently, All column data 
is read to a ORC table.
e.g:
INFO ReaderImpl: Reading ORC rows from 
hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 with 
{include: [true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true], offset: 0, length: 9223372036854775807}

so, The execution of the SQL statement will become very slow.


  was:
Currently, when the rand function is present in the SQL statement, hivetable 
searches all columns in the table.
e.g:
select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, 
ceil(c010) as cceila from XXX_table) a
group by k,k;

generate WholeStageCodegen subtrees:
== Subtree 1 / 2 ==
*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], 
output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS 
k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table
== Subtree 2 / 2 ==
*HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], 
output=[k#403L, k#403L, sum(id)#797L])
+- Exchange hashpartitioning(k#403L, 200)
   +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
bigint))], output=[k#403L, sum#800L])
  +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 
1.0)) AS k#403L]
 +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table
 
All columns will be searched in HiveTableScans , Consequently, All column data 
is read to a ORC table.
e.g:
INFO ReaderImpl: Reading ORC rows from 
hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 with 
{include: [true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, 

[jira] [Created] (SPARK-21520) Hivetable scan for all the columns the SQL statement contains the 'rand'

2017-07-24 Thread caoxuewen (JIRA)
caoxuewen created SPARK-21520:
-

 Summary: Hivetable scan for all the columns the SQL statement 
contains the 'rand'
 Key: SPARK-21520
 URL: https://issues.apache.org/jira/browse/SPARK-21520
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: caoxuewen


Currently, when the rand function is present in the SQL statement, hivetable 
searches all columns in the table.
e.g:
select k,k,sum(id) from (select d004 as id, floor(rand() * 1) as k, 
ceil(c010) as cceila from XXX_table) a
group by k,k;

generate WholeStageCodegen subtrees:
== Subtree 1 / 2 ==
*HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as bigint))], 
output=[k#403L, sum#800L])
+- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 1.0)) AS 
k#403L]
   +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table
== Subtree 2 / 2 ==
*HashAggregate(keys=[k#403L], functions=[sum(cast(id#402 as bigint))], 
output=[k#403L, k#403L, sum(id)#797L])
+- Exchange hashpartitioning(k#403L, 200)
   +- *HashAggregate(keys=[k#403L], functions=[partial_sum(cast(id#402 as 
bigint))], output=[k#403L, sum#800L])
  +- Project [d004#607 AS id#402, FLOOR((rand(8828525941469309371) * 
1.0)) AS k#403L]
 +- HiveTableScan [c030#606L, d004#607, d005#608, d025#609, c002#610, 
d023#611, d024#612, c005#613L, c008#614, c009#615, c010#616, d021#617, 
d022#618, c017#619, c018#620, c019#621, c020#622, c021#623, c022#624, c023#625, 
c024#626, c025#627, c026#628, c027#629, ... 169 more fields], MetastoreRelation 
XXX_database, XXX_table
 
All columns are scanned in the HiveTableScan; consequently, all column data 
is read from the ORC table.
e.g:
INFO ReaderImpl: Reading ORC rows from 
hdfs://opena:8020/.../XXX_table/.../p_date=2017-05-25/p_hour=10/part-9 with 
{include: [true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true, true, true, true, true, true, true, true, 
true, true, true, true, true, true], offset: 0, length: 9223372036854775807}

So the execution of the SQL statement becomes very slow.

Solution:
set the deterministic property of the rand expression to true.
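
A minimal sketch for observing the problem, assuming a Hive-enabled session and the 
XXX_table from the example above (table and column names are placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rand-column-pruning-check")
  .enableHiveSupport()
  .getOrCreate()

// Only d004 and c010 are actually referenced, yet the printed plan's HiveTableScan
// currently lists every column of the table because rand() is non-deterministic.
spark.sql(
  """SELECT k, k, sum(id)
    |FROM (SELECT d004 AS id, floor(rand() * 1) AS k, ceil(c010) AS cceila
    |      FROM XXX_table) a
    |GROUP BY k, k""".stripMargin).explain()
{code}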



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21353) add checkValue in spark.internal.config about how to correctly set configurations

2017-07-09 Thread caoxuewen (JIRA)
caoxuewen created SPARK-21353:
-

 Summary: add checkValue in spark.internal.config about how to 
correctly set configurations
 Key: SPARK-21353
 URL: https://issues.apache.org/jira/browse/SPARK-21353
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: caoxuewen
Priority: Minor


add checkValue for the following configurations (a sketch follows the lists below):

spark.driver.memory
spark.executor.memory
spark.blockManager.port
spark.task.cpus
spark.dynamicAllocation.minExecutors
spark.task.maxFailures
spark.blacklist.task.maxTaskAttemptsPerExecutor
spark.blacklist.task.maxTaskAttemptsPerNode
spark.blacklist.application.maxFailedTasksPerExecutor
spark.blacklist.stage.maxFailedTasksPerExecutor
spark.blacklist.application.maxFailedExecutorsPerNode
spark.blacklist.stage.maxFailedExecutorsPerNode
spark.scheduler.listenerbus.metrics.maxListenerClassesTimed
spark.ui.retainedTasks
spark.blockManager.port
spark.driver.blockManager.port
spark.files.maxPartitionBytes
spark.files.openCostInBytes
spark.shuffle.accurateBlockThreshold
spark.shuffle.registration.timeout
spark.shuffle.registration.maxAttempts
spark.reducer.maxReqSizeShuffleToMem
and copy the documentation from 
http://spark.apache.org/docs/latest/configuration.html for the following:

spark.driver.userClassPathFirst
spark.driver.memory
spark.executor.userClassPathFirst
spark.executor.memory
spark.yarn.isPython
spark.task.cpus
spark.dynamicAllocation.minExecutors
spark.dynamicAllocation.initialExecutors
spark.dynamicAllocation.maxExecutors
spark.shuffle.service.enabled
spark.submit.pyFiles
spark.task.maxFailures
spark.blacklist.enabled
spark.blacklist.task.maxTaskAttemptsPerExecutor
spark.blacklist.task.maxTaskAttemptsPerNode
spark.blacklist.stage.maxFailedTasksPerExecutor
spark.blacklist.stage.maxFailedExecutorsPerNode
spark.pyspark.driver.python
spark.pyspark.python
spark.ui.retainedTasks
spark.io.encryption.enabled
spark.io.encryption.keygen.algorithm
spark.io.encryption.keySizeBits
spark.authenticate.enableSaslEncryption
spark.authenticate
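
As an illustration of the first part, a hedged sketch of the kind of validation this 
proposes inside Spark's internal config definitions (the exact message and default 
here are illustrative, not the final implementation):

{code:scala}
// inside org.apache.spark.internal.config, where ConfigBuilder is in scope
private[spark] val CPUS_PER_TASK = ConfigBuilder("spark.task.cpus")
  .doc("Number of cores to allocate for each task.")
  .intConf
  .checkValue(_ > 0, "spark.task.cpus must be a positive integer")
  .createWithDefault(1)
{code}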



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20950) add a new config to diskWriteBufferSize which is hard coded before

2017-06-26 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20950:
--
Description: 
This PR Improvement in two:
1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of 
ShuffleExternalSorter.
when change the size of the diskWriteBufferSize to test forceSorterToSpill
The average performance of running 10 times is as follows:(their unit is MS).
 
{quote}diskWriteBufferSize:   1M512K256K128K64K32K
16K8K4K
---
RecordSize = 2.5M  742   722 694 686 667668671
669   683
RecordSize = 1M294   293 292 287 283285281
279   285{quote}
 
2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function

  was:
This PR Improvement in two:
1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of 
ShuffleExternalSorter.
when change the size of the diskWriteBufferSize to test forceSorterToSpill
The average performance of running 10 times is as follows:(their unit is MS).
bq. 
bq. diskWriteBufferSize:   1M512K256K128K64K32K16K  
  8K4K
bq. 
---
bq. RecordSize = 2.5M  742   722 694 686 667668671  
  669   683
bq. RecordSize = 1M294   293 292 287 283285281  
  279   285
bq. 
2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function


> add a new config to diskWriteBufferSize which is hard coded before
> --
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> This PR Improvement in two:
> 1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize 
> of ShuffleExternalSorter.
> when change the size of the diskWriteBufferSize to test forceSorterToSpill
> The average performance of running 10 times is as follows:(their unit is MS).
>  
> {quote}diskWriteBufferSize:   1M512K256K128K64K32K
> 16K8K4K
> ---
> RecordSize = 2.5M  742   722 694 686 667668671
> 669   683
> RecordSize = 1M294   293 292 287 283285281
> 279   285{quote}
>  
> 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
> mergeSpillsWithFileStream function
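
A hedged usage sketch of the proposed knob, assuming the 
spark.shuffle.spill.diskWriteBufferSize key and a size string such as 128k are 
accepted (this describes the proposal, not released behaviour):

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("disk-write-buffer-demo")
  .set("spark.shuffle.spill.diskWriteBufferSize", "128k")  // proposed key from this issue

val spark = SparkSession.builder().config(conf).getOrCreate()
// A shuffle-heavy job then spills through ShuffleExternalSorter with the configured buffer.
spark.range(0, 10000000L).repartition(200).count()
spark.stop()
{code}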



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20950) add a new config to diskWriteBufferSize which is hard coded before

2017-06-26 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20950:
--
Description: 
This PR Improvement in two:
1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of 
ShuffleExternalSorter.
when change the size of the diskWriteBufferSize to test forceSorterToSpill
The average performance of running 10 times is as follows:(their unit is MS).
bq. 
bq. diskWriteBufferSize:   1M512K256K128K64K32K16K  
  8K4K
bq. 
---
bq. RecordSize = 2.5M  742   722 694 686 667668671  
  669   683
bq. RecordSize = 1M294   293 292 287 283285281  
  279   285
bq. 
2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function

  was:
This PR Improvement in two:
1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of 
ShuffleExternalSorter.
when change the size of the diskWriteBufferSize to test forceSorterToSpill
The average performance of running 10 times is as follows:(their unit is MS).

{quote}diskWriteBufferSize:   1M512K256K128K64K32K
16K8K4K
---
RecordSize = 2.5M  742   722 694 686 667668671
669   683
RecordSize = 1M294   293 292 287 283285281
279   285{quote}

2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function


> add a new config to diskWriteBufferSize which is hard coded before
> --
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> This PR Improvement in two:
> 1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize 
> of ShuffleExternalSorter.
> when change the size of the diskWriteBufferSize to test forceSorterToSpill
> The average performance of running 10 times is as follows:(their unit is MS).
> bq. 
> bq. diskWriteBufferSize:   1M512K256K128K64K32K
> 16K8K4K
> bq. 
> ---
> bq. RecordSize = 2.5M  742   722 694 686 667668
> 671669   683
> bq. RecordSize = 1M294   293 292 287 283285
> 281279   285
> bq. 
> 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
> mergeSpillsWithFileStream function



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20950) add a new config to diskWriteBufferSize which is hard coded before

2017-06-26 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20950:
--
Description: 
This PR Improvement in two:
1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of 
ShuffleExternalSorter.
when change the size of the diskWriteBufferSize to test forceSorterToSpill
The average performance of running 10 times is as follows:(their unit is MS).

~diskWriteBufferSize:   1M512K256K128K64K32K16K
8K4K
---
RecordSize = 2.5M  742   722 694 686 667668671
669   683
RecordSize = 1M294   293 292 287 283285281
279   285~

2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function

  was:
This PR Improvement in two:
1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of 
ShuffleExternalSorter.
when change the size of the diskWriteBufferSize to test forceSorterToSpill
The average performance of running 10 times is as follows:(their unit is MS).

diskWriteBufferSize:   1M512K256K128K64K32K16K
8K4K
---
RecordSize = 2.5M  742   722 694 686 667668671
669   683
RecordSize = 1M294   293 292 287 283285281
279   285

2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function


> add a new config to diskWriteBufferSize which is hard coded before
> --
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> This PR Improvement in two:
> 1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize 
> of ShuffleExternalSorter.
> when change the size of the diskWriteBufferSize to test forceSorterToSpill
> The average performance of running 10 times is as follows:(their unit is MS).
> ~diskWriteBufferSize:   1M512K256K128K64K32K16K   
>  8K4K
> ---
> RecordSize = 2.5M  742   722 694 686 667668671
> 669   683
> RecordSize = 1M294   293 292 287 283285281
> 279   285~
> 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
> mergeSpillsWithFileStream function



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20950) add a new config to diskWriteBufferSize which is hard coded before

2017-06-26 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20950:
--
Description: 
This PR Improvement in two:
1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of 
ShuffleExternalSorter.
when change the size of the diskWriteBufferSize to test forceSorterToSpill
The average performance of running 10 times is as follows:(their unit is MS).

{quote}diskWriteBufferSize:   1M512K256K128K64K32K
16K8K4K
---
RecordSize = 2.5M  742   722 694 686 667668671
669   683
RecordSize = 1M294   293 292 287 283285281
279   285{quote}

2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function

  was:
This PR Improvement in two:
1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of 
ShuffleExternalSorter.
when change the size of the diskWriteBufferSize to test forceSorterToSpill
The average performance of running 10 times is as follows:(their unit is MS).

~diskWriteBufferSize:   1M512K256K128K64K32K16K
8K4K
---
RecordSize = 2.5M  742   722 694 686 667668671
669   683
RecordSize = 1M294   293 292 287 283285281
279   285~

2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function


> add a new config to diskWriteBufferSize which is hard coded before
> --
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> This PR Improvement in two:
> 1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize 
> of ShuffleExternalSorter.
> when change the size of the diskWriteBufferSize to test forceSorterToSpill
> The average performance of running 10 times is as follows:(their unit is MS).
> {quote}diskWriteBufferSize:   1M512K256K128K64K32K
> 16K8K4K
> ---
> RecordSize = 2.5M  742   722 694 686 667668671
> 669   683
> RecordSize = 1M294   293 292 287 283285281
> 279   285{quote}
> 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
> mergeSpillsWithFileStream function



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20950) add a new config to diskWriteBufferSize which is hard coded before

2017-06-26 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20950:
--
Description: 
This PR Improvement in two:
1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of 
ShuffleExternalSorter.
when change the size of the diskWriteBufferSize to test forceSorterToSpill
The average performance of running 10 times is as follows:(their unit is MS).

diskWriteBufferSize:   1M512K256K128K64K32K16K
8K4K
---
RecordSize = 2.5M  742   722 694 686 667668671
669   683
RecordSize = 1M294   293 292 287 283285281
279   285

2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function

  was:
1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of 
ShuffleExternalSorter.
2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function.

Summary: add a new config to diskWriteBufferSize which is hard coded 
before  (was: Improve diskWriteBufferSize configurable)

> add a new config to diskWriteBufferSize which is hard coded before
> --
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> This PR Improvement in two:
> 1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize 
> of ShuffleExternalSorter.
> when change the size of the diskWriteBufferSize to test forceSorterToSpill
> The average performance of running 10 times is as follows:(their unit is MS).
> diskWriteBufferSize:   1M512K256K128K64K32K16K
> 8K4K
> ---
> RecordSize = 2.5M  742   722 694 686 667668671
> 669   683
> RecordSize = 1M294   293 292 287 283285281
> 279   285
> 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
> mergeSpillsWithFileStream function



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.

2017-06-06 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen closed SPARK-20999.
-

Not A Problem

> No failure Stages, no log 'DAGScheduler: failed: Set()' output.
> ---
>
> Key: SPARK-20999
> URL: https://issues.apache.org/jira/browse/SPARK-20999
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.2.1
>Reporter: caoxuewen
>Priority: Trivial
>
> In the output of the spark log information:
> INFO DAGScheduler: looking for newly runnable stages
> INFO DAGScheduler: running: Set(ShuffleMapStage 14)
> INFO DAGScheduler: waiting: Set(ResultStage 15)
> INFO DAGScheduler: failed: Set()
> If there is no failure stage, "INFO DAGScheduler: failed: Set()" is no need 
> to output in the log information.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20999) No failure Stages, no log 'DAGScheduler: failed: Set()' output.

2017-06-06 Thread caoxuewen (JIRA)
caoxuewen created SPARK-20999:
-

 Summary: No failure Stages, no log 'DAGScheduler: failed: Set()' 
output.
 Key: SPARK-20999
 URL: https://issues.apache.org/jira/browse/SPARK-20999
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.2.1
Reporter: caoxuewen
Priority: Trivial


In the output of the spark log information:

INFO DAGScheduler: looking for newly runnable stages
INFO DAGScheduler: running: Set(ShuffleMapStage 14)
INFO DAGScheduler: waiting: Set(ResultStage 15)
INFO DAGScheduler: failed: Set()

If there is no failed stage, there is no need to output "INFO DAGScheduler: failed: 
Set()" in the log.
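
A standalone illustration of the proposed behaviour (plain Scala, not the actual 
DAGScheduler code; the set names are assumed from the log text):

{code:scala}
import scala.collection.mutable

def logStageSets(running: mutable.Set[Int], waiting: mutable.Set[Int], failed: mutable.Set[Int]): Unit = {
  println(s"running: $running")
  println(s"waiting: $waiting")
  if (failed.nonEmpty) {   // only print the failed line when there is something to report
    println(s"failed: $failed")
  }
}

logStageSets(mutable.Set(14), mutable.Set(15), mutable.Set.empty[Int])
// prints only the running/waiting lines because no stage has failed
{code}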



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20950) Improve diskWriteBufferSize configurable

2017-06-03 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20950:
--
Description: 
1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize of 
ShuffleExternalSorter.
2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function.

  was:
1.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize 
of UnsafeShuffleWriter.
2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function.


> Improve diskWriteBufferSize configurable
> 
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> 1.With spark.shuffle.spill.diskWriteBufferSize configure diskWriteBufferSize 
> of ShuffleExternalSorter.
> 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
> mergeSpillsWithFileStream function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20950) Improve diskWriteBufferSize configurable

2017-06-03 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20950:
--
Summary: Improve diskWriteBufferSize configurable  (was: Improve 
Serializerbuffersize configurable)

> Improve diskWriteBufferSize configurable
> 
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> 1.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize 
> of UnsafeShuffleWriter.
> 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
> mergeSpillsWithFileStream function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20959) Add a parameter to UnsafeExternalSorter to configure filebuffersize

2017-06-02 Thread caoxuewen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16034466#comment-16034466
 ] 

caoxuewen commented on SPARK-20959:
---

thanks for modify Priority

> Add a parameter to UnsafeExternalSorter to configure filebuffersize
> ---
>
> Key: SPARK-20959
> URL: https://issues.apache.org/jira/browse/SPARK-20959
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> Improvement with spark.shuffle.file.buffer configure fileBufferSizeBytes in 
> UnsafeExternalSorter. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20950) Improve Serializerbuffersize configurable

2017-06-01 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20950:
--
Component/s: (was: SQL)

> Improve Serializerbuffersize configurable
> -
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>
> 1.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize 
> of UnsafeShuffleWriter.
> 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
> mergeSpillsWithFileStream function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20959) Add a parameter to UnsafeExternalSorter to configure filebuffersize

2017-06-01 Thread caoxuewen (JIRA)
caoxuewen created SPARK-20959:
-

 Summary: Add a parameter to UnsafeExternalSorter to configure 
filebuffersize
 Key: SPARK-20959
 URL: https://issues.apache.org/jira/browse/SPARK-20959
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.2.0
Reporter: caoxuewen


Improvement: use spark.shuffle.file.buffer to configure fileBufferSizeBytes in 
UnsafeExternalSorter.
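
To make the intent concrete, a small stand-alone illustration of what the buffer size 
controls (plain JDK I/O, not Spark's writer; the file path is a placeholder):

{code:scala}
import java.io.{BufferedOutputStream, FileOutputStream}

// The value of spark.shuffle.file.buffer plays the same role as the buffer size here:
// a larger buffer means fewer flushes/syscalls while the sorter spills records to disk.
val out = new BufferedOutputStream(new FileOutputStream("/tmp/spill-demo.bin"), 64 * 1024)
try {
  (0 until 1000000).foreach(i => out.write(i & 0xff))
} finally {
  out.close()
}
{code}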



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20950) Improve Serializerbuffersize configurable

2017-06-01 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20950:
--
Description: 
1.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize 
of UnsafeShuffleWriter.
2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function.

  was:
1.With spark.shuffle.file.buffer configure fileBufferSizeBytes of 
UnsafeExternalSorter .  
2.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize 
of UnsafeShuffleWriter.
3.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function.


> Improve Serializerbuffersize configurable
> -
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>
> 1.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize 
> of UnsafeShuffleWriter.
> 2.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
> mergeSpillsWithFileStream function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20950) Improve Serializerbuffersize configurable

2017-06-01 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20950:
--
Summary: Improve Serializerbuffersize configurable  (was: Improve 
Serializerbuffersize and filebuffersize configurable)

> Improve Serializerbuffersize configurable
> -
>
> Key: SPARK-20950
> URL: https://issues.apache.org/jira/browse/SPARK-20950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>
> 1.With spark.shuffle.file.buffer configure fileBufferSizeBytes of 
> UnsafeExternalSorter .  
> 2.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize 
> of UnsafeShuffleWriter.
> 3.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
> mergeSpillsWithFileStream function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20950) Improve Serializerbuffersize and filebuffersize configurable

2017-06-01 Thread caoxuewen (JIRA)
caoxuewen created SPARK-20950:
-

 Summary: Improve Serializerbuffersize and filebuffersize 
configurable
 Key: SPARK-20950
 URL: https://issues.apache.org/jira/browse/SPARK-20950
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.2.0
Reporter: caoxuewen


1.With spark.shuffle.file.buffer configure fileBufferSizeBytes of 
UnsafeExternalSorter .  
2.With spark.shuffle.sort.initialSerBufferSize configure SerializerBufferSize 
of UnsafeShuffleWriter.
3.Remove outputBufferSizeInBytes and inputBufferSizeInBytes to initialize in 
mergeSpillsWithFileStream function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20850) Improve division and multiplication mixing process the data

2017-05-23 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen resolved SPARK-20850.
---
Resolution: Fixed

> Improve division and multiplication mixing process the data
> ---
>
> Key: SPARK-20850
> URL: https://issues.apache.org/jira/browse/SPARK-20850
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>
> spark-sql> select  (1234567890123456789012 / 12345678901234567890120) * 
> 12345678901234567890120;
> NULL
> spark-sql> select  (12345678901234567890 / 123) * 123;
> NULL
> when the length of the getText is greater than 19, The result is not what we 
> expected.
> but mysql handle the value is ok.
> mysql> select (1234567890123456789012 / 12345678901234567890120) * 
> 12345678901234567890120;
> +--+
> | (1234567890123456789012 / 12345678901234567890120) * 
> 12345678901234567890120 |
> +--+
> |  
> 1234567890123456789012. |
> +--+
> 1 row in set (0.00 sec)
> mysql> select (12345678901234567890 / 123) * 123;
> ++
> | (12345678901234567890 / 123) * 123 |
> ++
> |  12345678901234567890. |
> ++
> 1 row in set (0.00 sec)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20850) Improve division and multiplication mixing process the data

2017-05-23 Thread caoxuewen (JIRA)
caoxuewen created SPARK-20850:
-

 Summary: Improve division and multiplication mixing process the 
data
 Key: SPARK-20850
 URL: https://issues.apache.org/jira/browse/SPARK-20850
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: caoxuewen



spark-sql> select  (1234567890123456789012 / 12345678901234567890120) * 
12345678901234567890120;
NULL
spark-sql> select  (12345678901234567890 / 123) * 123;
NULL

When the length of the literal text (getText) is greater than 19, the result is not 
what we expect, but MySQL handles the same value correctly.

mysql> select (1234567890123456789012 / 12345678901234567890120) * 
12345678901234567890120;
+--+
| (1234567890123456789012 / 12345678901234567890120) * 12345678901234567890120 |
+--+
|  1234567890123456789012. |
+--+
1 row in set (0.00 sec)

mysql> select (12345678901234567890 / 123) * 123;
++
| (12345678901234567890 / 123) * 123 |
++
|  12345678901234567890. |
++
1 row in set (0.00 sec)
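
For reference, plain BigDecimal arithmetic reproduces the value MySQL returns; the 
NULL in Spark most likely comes from the intermediate division overflowing Spark's 
bounded decimal precision (a hedged reading, not a confirmed root cause):

{code:scala}
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

val a = new JBigDecimal("1234567890123456789012")
val b = new JBigDecimal("12345678901234567890120")
// Keep enough scale on the quotient so the later multiplication recovers the original value.
val quotient = a.divide(b, 40, RoundingMode.HALF_UP)
println(quotient.multiply(b))   // 1234567890123456789012.000...
{code}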



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20786) Improve ceil and floor handle the value which is not expected

2017-05-17 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20786:
--
Summary: Improve ceil and floor handle the value which is not expected  
(was: Improve ceil handle the value which is not expected)

> Improve ceil and floor handle the value which is not expected
> -
>
> Key: SPARK-20786
> URL: https://issues.apache.org/jira/browse/SPARK-20786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>
> spark-sql>SELECT ceil(1234567890123456);
> 1234567890123456
> spark-sql>SELECT ceil(12345678901234567);
> 12345678901234568
> spark-sql>SELECT ceil(123456789012345678);
> 123456789012345680
> when the length of the getText is greater than 16. long to double will be 
> precision loss.
> but mysql handle the value is ok.
> mysql> SELECT ceil(1234567890123456);
> ++
> | ceil(1234567890123456) |
> ++
> |   1234567890123456 |
> ++
> 1 row in set (0.00 sec)
> mysql> SELECT ceil(12345678901234567);
> +-+
> | ceil(12345678901234567) |
> +-+
> |   12345678901234567 |
> +-+
> 1 row in set (0.00 sec)
> mysql> SELECT ceil(123456789012345678);
> +--+
> | ceil(123456789012345678) |
> +--+
> |   123456789012345678 |
> +--+
> 1 row in set (0.00 sec)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20786) Improve ceil handle the value which is not expected

2017-05-17 Thread caoxuewen (JIRA)
caoxuewen created SPARK-20786:
-

 Summary: Improve ceil handle the value which is not expected
 Key: SPARK-20786
 URL: https://issues.apache.org/jira/browse/SPARK-20786
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: caoxuewen


spark-sql>SELECT ceil(1234567890123456);
1234567890123456

spark-sql>SELECT ceil(12345678901234567);
12345678901234568

spark-sql>SELECT ceil(123456789012345678);
123456789012345680

When the length of the literal text (getText) is greater than 16, the long-to-double 
conversion loses precision.

MySQL, however, handles the same value correctly.

mysql> SELECT ceil(1234567890123456);
++
| ceil(1234567890123456) |
++
|   1234567890123456 |
++
1 row in set (0.00 sec)

mysql> SELECT ceil(12345678901234567);
+-+
| ceil(12345678901234567) |
+-+
|   12345678901234567 |
+-+
1 row in set (0.00 sec)

mysql> SELECT ceil(123456789012345678);
+--+
| ceil(123456789012345678) |
+--+
|   123456789012345678 |
+--+
1 row in set (0.00 sec)
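
A small demonstration of the precision loss described above (plain Scala; the 
BigDecimal line shows one way to keep the value exact):

{code:scala}
val v = 123456789012345678L            // 18 digits, above 2^53
println(math.ceil(v.toDouble).toLong)  // 123456789012345680 -- the Double cannot hold the exact value
println(BigDecimal(v).setScale(0, BigDecimal.RoundingMode.CEILING))  // 123456789012345678, exact
{code}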



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20607) Add new unit tests to ShuffleSuite

2017-05-15 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20607:
--
Priority: Minor  (was: Trivial)

> Add new unit tests to ShuffleSuite
> --
>
> Key: SPARK-20607
> URL: https://issues.apache.org/jira/browse/SPARK-20607
> Project: Spark
>  Issue Type: Test
>  Components: Shuffle, Tests
>Affects Versions: 2.1.2
>Reporter: caoxuewen
>Priority: Minor
>
> 1.adds the new unit tests.
>   testing would be performed when there is no shuffle stage, 
>   shuffle will not generate the data file and the index files.
> 2.Modify the '[SPARK-4085] rerun map stage if reduce stage cannot find its 
> local shuffle file' unit test, 
>   parallelize is 1 but not is 2, Check the index file and delete.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20607) Add new unit tests to ShuffleSuite

2017-05-15 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20607:
--
Priority: Trivial  (was: Major)

> Add new unit tests to ShuffleSuite
> --
>
> Key: SPARK-20607
> URL: https://issues.apache.org/jira/browse/SPARK-20607
> Project: Spark
>  Issue Type: Test
>  Components: Shuffle, Tests
>Affects Versions: 2.1.2
>Reporter: caoxuewen
>Priority: Trivial
>
> 1.adds the new unit tests.
>   testing would be performed when there is no shuffle stage, 
>   shuffle will not generate the data file and the index files.
> 2.Modify the '[SPARK-4085] rerun map stage if reduce stage cannot find its 
> local shuffle file' unit test, 
>   parallelize is 1 but not is 2, Check the index file and delete.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20607) Add new unit tests to ShuffleSuite

2017-05-15 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20607:
--
Issue Type: Test  (was: Bug)

> Add new unit tests to ShuffleSuite
> --
>
> Key: SPARK-20607
> URL: https://issues.apache.org/jira/browse/SPARK-20607
> Project: Spark
>  Issue Type: Test
>  Components: Shuffle, Tests
>Affects Versions: 2.1.2
>Reporter: caoxuewen
>
> 1.adds the new unit tests.
>   testing would be performed when there is no shuffle stage, 
>   shuffle will not generate the data file and the index files.
> 2.Modify the '[SPARK-4085] rerun map stage if reduce stage cannot find its 
> local shuffle file' unit test, 
>   parallelize is 1 but not is 2, Check the index file and delete.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20609) Run the SortShuffleSuite unit tests have residual spark_* system directory

2017-05-15 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20609:
--
Priority: Minor  (was: Major)

> Run the SortShuffleSuite unit tests have residual spark_* system directory
> --
>
> Key: SPARK-20609
> URL: https://issues.apache.org/jira/browse/SPARK-20609
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Tests
>Affects Versions: 2.1.2
>Reporter: caoxuewen
>Priority: Minor
>
> This PR solution to run the SortShuffleSuite unit tests have residual spark_* 
> system directory
> For example:
> OS:Windows 7
> After the running SortShuffleSuite unit tests, 
> the system of TMP directory have 
> '..\AppData\Local\Temp\spark-f64121f9-11b4-4ffd-a4f0-cfca66643503' not 
> deleted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20739) Supplement the new unit tests to SparkContextSchedulerCreationSuite

2017-05-15 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20739:
--
Issue Type: Test  (was: Improvement)

> Supplement the new unit tests to SparkContextSchedulerCreationSuite
> ---
>
> Key: SPARK-20739
> URL: https://issues.apache.org/jira/browse/SPARK-20739
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler, Tests
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Minor
>
> This PR adds the new unit tests to support test 'SPARK_REGEX(sparkUrl)' 
> branch in createTaskScheduler methods.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20739) Supplement the new unit tests to SparkContextSchedulerCreationSuite

2017-05-15 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20739:
--
Issue Type: Improvement  (was: Test)

> Supplement the new unit tests to SparkContextSchedulerCreationSuite
> ---
>
> Key: SPARK-20739
> URL: https://issues.apache.org/jira/browse/SPARK-20739
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Tests
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Minor
>
> This PR adds the new unit tests to support test 'SPARK_REGEX(sparkUrl)' 
> branch in createTaskScheduler methods.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20739) Supplement the new unit tests to SparkContextSchedulerCreationSuite

2017-05-15 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20739:
--
Priority: Minor  (was: Trivial)

> Supplement the new unit tests to SparkContextSchedulerCreationSuite
> ---
>
> Key: SPARK-20739
> URL: https://issues.apache.org/jira/browse/SPARK-20739
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler, Tests
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Minor
>
> This PR adds the new unit tests to support test 'SPARK_REGEX(sparkUrl)' 
> branch in createTaskScheduler methods.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20739) Supplement the new unit tests to SparkContextSchedulerCreationSuite

2017-05-15 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20739:
--
Issue Type: Test  (was: Improvement)

> Supplement the new unit tests to SparkContextSchedulerCreationSuite
> ---
>
> Key: SPARK-20739
> URL: https://issues.apache.org/jira/browse/SPARK-20739
> Project: Spark
>  Issue Type: Test
>  Components: Scheduler, Tests
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> This PR adds the new unit tests to support test 'SPARK_REGEX(sparkUrl)' 
> branch in createTaskScheduler methods.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20739) Supplement the new unit tests to SparkContextSchedulerCreationSuite

2017-05-15 Thread caoxuewen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010147#comment-16010147
 ] 

caoxuewen edited comment on SPARK-20739 at 5/15/17 8:26 AM:


ok
thanks,



was (Author: heary-cao):
ok
thansk,


> Supplement the new unit tests to SparkContextSchedulerCreationSuite
> ---
>
> Key: SPARK-20739
> URL: https://issues.apache.org/jira/browse/SPARK-20739
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Tests
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> This PR adds the new unit tests to support test 'SPARK_REGEX(sparkUrl)' 
> branch in createTaskScheduler methods.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20739) Supplement the new unit tests to SparkContextSchedulerCreationSuite

2017-05-15 Thread caoxuewen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16010147#comment-16010147
 ] 

caoxuewen commented on SPARK-20739:
---

ok
thansk,


> Supplement the new unit tests to SparkContextSchedulerCreationSuite
> ---
>
> Key: SPARK-20739
> URL: https://issues.apache.org/jira/browse/SPARK-20739
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Tests
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> This PR adds the new unit tests to support test 'SPARK_REGEX(sparkUrl)' 
> branch in createTaskScheduler methods.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20739) Supplement the new unit tests to SparkContextSchedulerCreationSuite

2017-05-15 Thread caoxuewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caoxuewen updated SPARK-20739:
--
Summary: Supplement the new unit tests to 
SparkContextSchedulerCreationSuite  (was: Unit testing support for 'spark://' 
format)

> Supplement the new unit tests to SparkContextSchedulerCreationSuite
> ---
>
> Key: SPARK-20739
> URL: https://issues.apache.org/jira/browse/SPARK-20739
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Tests
>Affects Versions: 2.2.0
>Reporter: caoxuewen
>Priority: Trivial
>
> This PR adds the new unit tests to support test 'SPARK_REGEX(sparkUrl)' 
> branch in createTaskScheduler methods.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20739) Unit testing support for 'spark://' format

2017-05-15 Thread caoxuewen (JIRA)
caoxuewen created SPARK-20739:
-

 Summary: Unit testing support for 'spark://' format
 Key: SPARK-20739
 URL: https://issues.apache.org/jira/browse/SPARK-20739
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Tests
Affects Versions: 2.2.1
Reporter: caoxuewen


This PR adds new unit tests covering the 'SPARK_REGEX(sparkUrl)' branch 
in the createTaskScheduler method.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20609) Run the SortShuffleSuite unit tests have residual spark_* system directory

2017-05-05 Thread caoxuewen (JIRA)
caoxuewen created SPARK-20609:
-

 Summary: Run the SortShuffleSuite unit tests have residual spark_* 
system directory
 Key: SPARK-20609
 URL: https://issues.apache.org/jira/browse/SPARK-20609
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Tests
Affects Versions: 2.1.2
Reporter: caoxuewen


This PR fixes the residual spark_* system directories left behind by the 
SortShuffleSuite unit tests.
For example:
OS: Windows 7
After running the SortShuffleSuite unit tests, 
the system TMP directory still contains 
'..\AppData\Local\Temp\spark-f64121f9-11b4-4ffd-a4f0-cfca66643503', which is never deleted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20607) Add new unit tests to ShuffleSuite

2017-05-04 Thread caoxuewen (JIRA)
caoxuewen created SPARK-20607:
-

 Summary: Add new unit tests to ShuffleSuite
 Key: SPARK-20607
 URL: https://issues.apache.org/jira/browse/SPARK-20607
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Tests
Affects Versions: 2.1.2
Reporter: caoxuewen


1. Adds new unit tests verifying that when there is no shuffle stage, 
  shuffle does not generate the data file or the index files.
2. Modifies the '[SPARK-4085] rerun map stage if reduce stage cannot find its 
local shuffle file' unit test: 
  use a parallelism of 1 instead of 2, and check and delete the index file.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20518) Supplement the new blockidsuite unit tests

2017-04-27 Thread caoxuewen (JIRA)
caoxuewen created SPARK-20518:
-

 Summary: Supplement the new blockidsuite unit tests
 Key: SPARK-20518
 URL: https://issues.apache.org/jira/browse/SPARK-20518
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.1.0
Reporter: caoxuewen


Adds new unit tests covering ShuffleDataBlockId, ShuffleIndexBlockId, 
TempShuffleBlockId, and TempLocalBlockId.
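
A hedged sketch of the kind of assertion such tests add (the name formats follow the 
usual BlockId naming convention and may differ in detail):

{code:scala}
import org.apache.spark.storage.{ShuffleDataBlockId, ShuffleIndexBlockId}

val data = ShuffleDataBlockId(shuffleId = 1, mapId = 2, reduceId = 0)
val index = ShuffleIndexBlockId(shuffleId = 1, mapId = 2, reduceId = 0)
assert(data.name == "shuffle_1_2_0.data")
assert(index.name == "shuffle_1_2_0.index")
{code}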



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20471) Remove AggregateBenchmark testsuite warning: Two level hashmap is disabled but vectorized hashmap is enabled.

2017-04-26 Thread caoxuewen (JIRA)
caoxuewen created SPARK-20471:
-

 Summary: Remove AggregateBenchmark testsuite warning: Two level 
hashmap is disabled but vectorized hashmap is enabled.
 Key: SPARK-20471
 URL: https://issues.apache.org/jira/browse/SPARK-20471
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.1.0
Reporter: caoxuewen


Remove the AggregateBenchmark test suite warning, 
such as '14:26:33.220 WARN 
org.apache.spark.sql.execution.aggregate.HashAggregateExec: Two level hashmap 
is disabled but vectorized hashmap is enabled.'

Unit tests: AggregateBenchmark 
Change the 'ignore' functions to 'test' functions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org