[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-01-19 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/16578
  
@mallman Thanks for letting me know. I'll try your patch and check whether it takes over #14957 or not.
I also think we need to get feedback from @liancheng; from our last discussion, liancheng may do some work based on the old PR.





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-09-05 Thread xuanyuanking
GitHub user xuanyuanking opened a pull request:

https://github.com/apache/spark/pull/14957

[SPARK-4502][SQL]Support parquet nested struct pruning and add releva…

## What changes were proposed in this pull request?

As described in [SPARK-4502](https://issues.apache.org/jira/browse/SPARK-4502), we have the same problem at Baidu: our users' Parquet files have complex nested structs (400+ fields, 4 levels of nesting), so this problem causes unnecessary data reads and wasted time. This PR fixes the problem; the main ideas are as follows:

1.  Add a booleanConf `spark.sql.parquet.nestColumnPruning`; when it is disabled, the logic is the same as before.

2.  In `FileSourceStrategy`, traverse `projects[NamedExpression]` and generate a map whose key is the attribute name and whose value is a Seq[String] of the corresponding nested fields (see the sketch after this list).
For example:  [“people” -> (“people.age”, “people.addr.city”)]

3.  Replace the attribute names in the original requiredColumns.
For example:  the original requiredColumns is [“people”, “consume”]; replace it with [“people.age”, “people.addr.city”, “consume”]

4.  Merge the StructTypes of fields that belong to the same StructType, and merge in the filter attributes.
For example:  the JSON form of the struct types [“people.addr.city”, “people.addr.zip_code”] merges into [“people.addr.[city,zip_code]”]

5.  `StructType.apply` reads columns containing “.” recursively.
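
To make steps 2 and 3 concrete, here is a minimal, illustrative sketch in plain Scala; the `people`/`consume` names come from the examples above and are not part of Spark's API:

```
// Steps 2 and 3: map each top-level attribute to the nested leaves the query
// actually references, then substitute it into the required columns.
val nestedFields: Map[String, Seq[String]] =
  Map("people" -> Seq("people.age", "people.addr.city"))

val requiredColumns = Seq("people", "consume")
val replacedColumns = requiredColumns.flatMap(c => nestedFields.getOrElse(c, Seq(c)))
// replacedColumns == Seq("people.age", "people.addr.city", "consume")

// Step 4 then folds the per-leaf single-field StructTypes back into one pruned
// schema, so "people.addr.city" and "people.addr.zip_code" end up under a single
// "people.addr" struct (the FileSourceStrategy diff later in this thread does the
// merging via StructType).
```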


## How was this patch tested?

Added a new test in `ParquetQuerySuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xuanyuanking/spark SPARK-4502

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14957.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14957


commit 7eaa4287a5c112a192cca388863324ffd855203e
Author: liyuanjian <liyuanj...@baidu.com>
Date:   2016-09-05T08:27:02Z

[SPARK-4502][SQL]Support parquet nested struct pruning and add relevant 
tests







[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-09-06 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r77753006
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -259,8 +259,23 @@ case class StructType(fields: Array[StructField]) 
extends DataType with Seq[Stru
* @throws IllegalArgumentException if a field with the given name does 
not exist
*/
   def apply(name: String): StructField = {
-nameToField.getOrElse(name,
-  throw new IllegalArgumentException(s"""Field "$name" does not 
exist."""))
+if (name.contains('.')) {
--- End diff --

@HyukjinKwon Thanks for your review. Mixing the recursive lookup into the default apply has this problem; I fixed it in the next patch and now use ',' as the separator, since it is an invalid character in a Parquet schema.
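
For reference, a minimal sketch of what a recursive lookup kept separate from the default `apply` could look like, assuming ',' as the path separator mentioned above (the helper name is illustrative, not Spark API):

```
import org.apache.spark.sql.types.{StructField, StructType}

def getFieldByPath(schema: StructType, path: String): StructField =
  path.split(',').toList match {
    case head :: Nil => schema(head)              // plain, top-level lookup
    case head :: rest =>
      schema(head).dataType match {
        case st: StructType => getFieldByPath(st, rest.mkString(","))
        case other =>
          throw new IllegalArgumentException(
            s"""Field "$head" has type $other, which is not a struct.""")
      }
    case Nil =>
      throw new IllegalArgumentException("Empty field path.")
  }
```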





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-09-06 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r77760264
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
 ---
@@ -571,6 +571,44 @@ class ParquetQuerySuite extends QueryTest with 
ParquetTest with SharedSQLContext
 }
   }
 
+  test("SPARK-4502 parquet nested fields pruning") {
+// Schema of "test-data/nested-array-struct.parquet":
+//root
+//|-- primitive: integer (nullable = true)
+//|-- myComplex: array (nullable = true)
+//||-- element: struct (containsNull = true)
+//|||-- id: integer (nullable = true)
+//|||-- repeatedMessage: array (nullable = true)
+//||||-- element: struct (containsNull = true)
+//|||||-- someId: integer (nullable = true)
+val df = 
readResourceParquetFile("test-data/nested-array-struct.parquet")
--- End diff --


https://github.com/apache/spark/blob/master/sql/core/src/test/resources/test-data/nested-array-struct.parquet
I reuse this file to test nested structs in Parquet; it lives under sql/core/src/test/resources/test-data/.





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-09-06 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r77760397
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -280,6 +280,29 @@ case class StructType(fields: Array[StructField]) 
extends DataType with Seq[Stru
   }
 
   /**
+   * Extracts the [[StructField]] with the given name recursively.
+   *
+   * @throws IllegalArgumentException if the parent field's type is not 
StructType
+   */
+  def getFieldRecursively(name: String): StructField = {
--- End diff --

The marker for nested fields is a kind of temporary data; it is eventually converted into a pruned StructType and passed to `org.apache.spark.sql.parquet.row.requested_schema`.
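
As a rough sketch of where that pruned StructType ends up (the property name comes from the comment above; the exact wiring inside the Parquet read path may differ):

```
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical pruned schema: only people.addr.city survives the pruning.
val prunedSchema = StructType(Seq(
  StructField("people", StructType(Seq(
    StructField("addr", StructType(Seq(
      StructField("city", StringType)))))))))

val hadoopConf = new Configuration()
// The Parquet reader picks the requested Catalyst schema up from this property
// and only materializes the listed leaves.
hadoopConf.set("org.apache.spark.sql.parquet.row.requested_schema", prunedSchema.json)
```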





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-09-07 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r77844084
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -97,7 +98,16 @@ object FileSourceStrategy extends Strategy with Logging {
 dataColumns
   .filter(requiredAttributes.contains)
   .filterNot(partitionColumns.contains)
-  val outputSchema = readDataColumns.toStructType
+  val outputSchema = if 
(fsRelation.sqlContext.conf.isParquetNestColumnPruning) {
--- End diff --

I ran all datasource tests with
`./build/sbt "test-only org.apache.spark.sql.execution.datasources.*"`.
Three tests failed, but when I run them separately they all pass.





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-09-07 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r77845789
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -97,7 +98,16 @@ object FileSourceStrategy extends Strategy with Logging {
 dataColumns
   .filter(requiredAttributes.contains)
   .filterNot(partitionColumns.contains)
-  val outputSchema = readDataColumns.toStructType
+  val outputSchema = if 
(fsRelation.sqlContext.conf.isParquetNestColumnPruning) {
--- End diff --

All three tests failed because of a datetime check error: the expected answer is something like '2015-05-23' but the Spark answer is '2015-05-22'. I don't think this error is caused by my patch.
Has anybody seen this problem before? It really confuses me, and when I run the test suites separately they pass.





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-09-07 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r77934557
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -97,7 +98,16 @@ object FileSourceStrategy extends Strategy with Logging {
 dataColumns
   .filter(requiredAttributes.contains)
   .filterNot(partitionColumns.contains)
-  val outputSchema = readDataColumns.toStructType
+  val outputSchema = if 
(fsRelation.sqlContext.conf.isParquetNestColumnPruning) {
--- End diff --

It's my mistake here. I can only make sure this patch works for Parquet, so I should check the fileFormat here; also, as the config namespace suggests (`spark.sql.parquet.nestColumnPruning`), it should only take effect for Parquet. I added a patch to fix this.





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-09-07 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r77934764
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -97,7 +98,16 @@ object FileSourceStrategy extends Strategy with Logging {
 dataColumns
   .filter(requiredAttributes.contains)
   .filterNot(partitionColumns.contains)
-  val outputSchema = readDataColumns.toStructType
+  val outputSchema = if 
(fsRelation.sqlContext.conf.isParquetNestColumnPruning) {
--- End diff --

I ran the command 
```
./build/sbt "test-only org.apache.spark.sql.execution.datasources.*"
```
locally and three test suites failed
```
[error] Failed tests:
[error] org.apache.spark.sql.execution.datasources.csv.CSVSuite
[error] org.apache.spark.sql.execution.datasources.json.JsonSuite
[error] 
org.apache.spark.sql.execution.datasources.parquet.ParquetPartitionDiscoverySuite
```
But when I run them separately, all three pass. Also, when I run
```
./build/sbt "test-only org.apache.spark.sql.execution.csv.*"
./build/sbt "test-only org.apache.spark.sql.execution.json.*"
```
all tests pass.





[GitHub] spark issue #14957: [SPARK-4502][SQL]Support parquet nested struct pruning a...

2016-09-07 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/14957
  
Thank you for your comments; they helped me a lot. :)





[GitHub] spark issue #14957: [SPARK-4502][SQL]Support parquet nested struct pruning a...

2016-09-09 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/14957
  
@liancheng @rxin





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-09-06 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r77762907
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -280,6 +280,29 @@ case class StructType(fields: Array[StructField]) 
extends DataType with Seq[Stru
   }
 
   /**
+   * Extracts the [[StructField]] with the given name recursively.
+   *
+   * @throws IllegalArgumentException if the parent field's type is not 
StructType
+   */
+  def getFieldRecursively(name: String): StructField = {
--- End diff --

I think there's another way to solve this problem; generating the final StructType in FileSourceStrategy may be better. I'll try it and open another PR later.





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-10-30 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r85656729
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -126,4 +136,52 @@ object FileSourceStrategy extends Strategy with 
Logging {
 
 case _ => Nil
   }
+
+  private def generateStructFieldsContainsNesting(projects: 
Seq[Expression],
+  totalSchema: StructType) : 
Seq[StructField] = {
+def generateStructField(curField: List[String],
+ node: Expression) : Seq[StructField] = {
+  node match {
+case ai: GetArrayItem =>
+  // Here we drop the previous for simplify array and map support.
+  // Same strategy in GetArrayStructFields and GetMapValue
+  generateStructField(List.empty[String], ai.child)
+case asf: GetArrayStructFields =>
+  generateStructField(List.empty[String], asf.child)
+case mv: GetMapValue =>
+  generateStructField(List.empty[String], mv.child)
+case attr: AttributeReference =>
+  Seq(getFieldRecursively(totalSchema, attr.name :: curField))
+case sf: GetStructField =>
+  generateStructField(sf.name.get :: curField, sf.child)
--- End diff --

Fix done.
A little question here: all projects parsed from SQL must have a name, while projects from the DataFrame API may not, right?





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-10-30 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r85656841
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala
 ---
@@ -442,6 +443,79 @@ class FileSourceStrategySuite extends QueryTest with 
SharedSQLContext with Predi
 }
   }
 
+  test("[SPARK-4502] pruning nested schema by projects correctly") {
+val testFunc = 
PrivateMethod[Seq[StructField]]('generateStructFieldsContainsNesting)
+// Construct fullSchema like below:
+//root
+//|-- col: struct (nullable = true)
+//||-- s1: struct (nullable = true)
+//|||-- s1_1: long (nullable = true)
+//|||-- s1_2: long (nullable = true)
+//||-- str: string (nullable = true)
+//||-- info_list: array (nullable = true)
+//|||-- element: struct (containsNull = true)
+//||||-- s1: struct (nullable = true)
+//|||||-- s1_1: long (nullable = true)
+//|||||-- s1_2: long (nullable = true)
+//|-- num: long (nullable = true)
+//|-- str: string (nullable = true)
+val nested_s1 = StructField("s1",
+  StructType(
+Seq(
+  StructField("s1_1", LongType, true),
+  StructField("s1_2", LongType, true)
+)
+  ), true)
+val flat_str = StructField("str", StringType, true)
+val nested_arr = StructField("info_list", 
ArrayType(StructType(Seq(nested_s1))), true)
+
+val fullSchema = StructType(
+  Seq(
+StructField("col", StructType(Seq(nested_s1, flat_str, 
nested_arr)), true),
+StructField("num", LongType, true),
+flat_str
+  ))
+
+// Attr of struct col
+val colAttr = AttributeReference("col", StructType(
+  Seq(nested_s1, flat_str, nested_arr)), true)()
+// Child expression of col.s1.s1_1
+val childExp = GetStructField(
+  GetStructField(colAttr, 0, Some("s1")), 0, Some("s1_1"))
+// Child expression of col.info_list[0].s1.s1_1
+val arrayChildExp = GetStructField(
+  GetStructField(
+GetArrayItem(
+  GetStructField(colAttr, 0, Some("info_list")),
+  Literal(0)
+), 0, Some("s1")
+  ), 0, Some("s1_1")
+)
+// Project list of "select num, col.s1.s1_1 as s1_1, 
col.info_list[0].s1.s1_1 as complex_get"
+val projects = Seq(
+  AttributeReference("num", LongType, true)(),
+  Alias(childExp, "s1_1")(),
+  Alias(arrayChildExp, "complex_get")()
+  )
+val expextResult =
+  Seq(
+StructField("num", LongType, true),
+StructField("col", StructType(
+  Seq(
+StructField(
+  "s1",
+  StructType(Seq(StructField("s1_1", LongType, true))),
+  true)
+  )
+), true),
+StructField("col", StructType(Seq(nested_arr)))
+  )
+// Call the function generateStructFieldsContainsNesting
+val result = 
FileSourceStrategy.invokePrivate[Seq[StructField]](testFunc(projects,
+  fullSchema))
+assert(result == expextResult)
+  }
--- End diff --

Fix done. Thanks for liancheng's reminder.
Here I handled CreateStruct(Unsafe) and CreateNamedStruct(Unsafe); the other expressions in complexTypeCreator (CreateArray, CreateMap) are simply ignored.




[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-10-23 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r84592821
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -126,4 +136,52 @@ object FileSourceStrategy extends Strategy with 
Logging {
 
 case _ => Nil
   }
+
+  private def generateStructFieldsContainsNesting(projects: 
Seq[Expression],
+  totalSchema: StructType) : 
Seq[StructField] = {
+def generateStructField(curField: List[String],
+ node: Expression) : Seq[StructField] = {
+  node match {
+case ai: GetArrayItem =>
+  // Here we drop the previous for simplify array and map support.
+  // Same strategy in GetArrayStructFields and GetMapValue
+  generateStructField(List.empty[String], ai.child)
+case asf: GetArrayStructFields =>
+  generateStructField(List.empty[String], asf.child)
+case mv: GetMapValue =>
+  generateStructField(List.empty[String], mv.child)
+case attr: AttributeReference =>
+  Seq(getFieldRecursively(totalSchema, attr.name :: curField))
+case sf: GetStructField =>
+  generateStructField(sf.name.get :: curField, sf.child)
+case _ =>
+  if (node.children.nonEmpty) {
+node.children.flatMap(child => generateStructField(curField, 
child))
+  } else {
+Seq.empty[StructField]
+  }
+  }
+}
+
+def getFieldRecursively(totalSchema: StructType,
+name: List[String]): StructField = {
--- End diff --

fix done





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-10-23 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r84592818
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -126,4 +136,52 @@ object FileSourceStrategy extends Strategy with 
Logging {
 
 case _ => Nil
   }
+
+  private def generateStructFieldsContainsNesting(projects: 
Seq[Expression],
+  totalSchema: StructType) : 
Seq[StructField] = {
+def generateStructField(curField: List[String],
+ node: Expression) : Seq[StructField] = {
--- End diff --

fix done





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-10-23 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r84592816
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -126,4 +136,52 @@ object FileSourceStrategy extends Strategy with 
Logging {
 
 case _ => Nil
   }
+
+  private def generateStructFieldsContainsNesting(projects: 
Seq[Expression],
+  totalSchema: StructType) : 
Seq[StructField] = {
--- End diff --

Code style fix done.
No problem, I'll add tests for the private function generateStructFieldsContainsNesting in the next patch; this patch fixes all the code style and naming problems.





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-10-23 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r84592883
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -661,6 +666,8 @@ private[sql] class SQLConf extends Serializable with 
CatalystConf with Logging {
 
   def isParquetINT96AsTimestamp: Boolean = 
getConf(PARQUET_INT96_AS_TIMESTAMP)
 
+  def isParquetNestColumnPruning: Boolean = 
getConf(PARQUET_NEST_COLUMN_PRUNING)
--- End diff --

rename done





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-10-23 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r84592888
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
 ---
@@ -571,6 +571,37 @@ class ParquetQuerySuite extends QueryTest with 
ParquetTest with SharedSQLContext
 }
   }
 
+  test("SPARK-4502 parquet nested fields pruning") {
+// Schema of "test-data/nested-array-struct.parquet":
+//root
+//|-- col: struct (nullable = true)
+//||-- s1: struct (nullable = true)
+//|||-- s1_1: long (nullable = true)
+//|||-- s1_2: long (nullable = true)
+//||-- str: string (nullable = true)
+//|-- num: long (nullable = true)
+//|-- str: string (nullable = true)
+val df = 
readResourceParquetFile("test-data/nested-struct.snappy.parquet")
+df.createOrReplaceTempView("tmp_table")
--- End diff --

fix done





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-10-23 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r84592876
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -212,6 +212,11 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val PARQUET_NEST_COLUMN_PRUNING = 
SQLConfigBuilder("spark.sql.parquet.nestColumnPruning")
--- End diff --

rename done





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-10-23 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r84592865
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -126,4 +136,52 @@ object FileSourceStrategy extends Strategy with 
Logging {
 
 case _ => Nil
   }
+
+  private def generateStructFieldsContainsNesting(projects: 
Seq[Expression],
+  totalSchema: StructType) : 
Seq[StructField] = {
+def generateStructField(curField: List[String],
+ node: Expression) : Seq[StructField] = {
+  node match {
+case ai: GetArrayItem =>
+  // Here we drop the previous for simplify array and map support.
+  // Same strategy in GetArrayStructFields and GetMapValue
+  generateStructField(List.empty[String], ai.child)
+case asf: GetArrayStructFields =>
+  generateStructField(List.empty[String], asf.child)
+case mv: GetMapValue =>
+  generateStructField(List.empty[String], mv.child)
+case attr: AttributeReference =>
+  Seq(getFieldRecursively(totalSchema, attr.name :: curField))
+case sf: GetStructField =>
+  generateStructField(sf.name.get :: curField, sf.child)
+case _ =>
+  if (node.children.nonEmpty) {
+node.children.flatMap(child => generateStructField(curField, 
child))
+  } else {
+Seq.empty[StructField]
+  }
+  }
+}
+
+def getFieldRecursively(totalSchema: StructType,
+name: List[String]): StructField = {
+  if (name.length > 1) {
+val curField = name.head
+val curFieldType = totalSchema(curField)
+curFieldType.dataType match {
+  case st: StructType =>
+val newField = getFieldRecursively(StructType(st.fields), 
name.drop(1))
+StructField(curFieldType.name, StructType(Seq(newField)),
+  curFieldType.nullable, curFieldType.metadata)
+  case _ =>
+throw new IllegalArgumentException(s"""Field "$curField" is 
not struct field.""")
+}
+  } else {
+totalSchema(name.head)
+  }
+}
--- End diff --

The function getFieldRecursively here needs to return a StructField that preserves the whole nesting chain of the path. For example, the fullSchema is:
```
root
 |-- col: struct (nullable = true)
 ||-- s1: struct (nullable = true)
 |||-- s1_1: long (nullable = true)
 |||-- s1_2: long (nullable = true)
 ||-- str: string (nullable = true)
 |-- num: long (nullable = true)
 |-- str: string (nullable = true)
```
the func should return:
```

StructField(col,StructType(StructField(s1,StructType(StructField(s1_1,LongType,true)),true)),true)
```
So maybe I can't use the simplified getNestedField function, because it returns only the leaf StructField:
```
StructField(s1_1,LongType,true)
```
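
For illustration, a call against the fullSchema above (assuming a standalone copy of the getFieldRecursively helper from the diff) looks like this:

```
// Keeps the whole nesting chain, not just the leaf:
val pruned = getFieldRecursively(fullSchema, List("col", "s1", "s1_1"))
// pruned == StructField(col, StructType(Seq(
//   StructField(s1, StructType(Seq(StructField(s1_1, LongType, true))), true))), true)
```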







[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-10-23 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r84592806
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -97,7 +99,15 @@ object FileSourceStrategy extends Strategy with Logging {
 dataColumns
   .filter(requiredAttributes.contains)
   .filterNot(partitionColumns.contains)
-  val outputSchema = readDataColumns.toStructType
+  val outputSchema = if 
(fsRelation.sqlContext.conf.isParquetNestColumnPruning
+&& fsRelation.fileFormat.isInstanceOf[ParquetFileFormat]) {
+val totalSchema = readDataColumns.toStructType
+val prunedSchema = StructType(
+  generateStructFieldsContainsNesting(projects, totalSchema))
+// Merge schema in same StructType and merge with filterAttributes
+prunedSchema.fields.map(f => StructType(Array(f))).reduceLeft(_ 
merge _)
+  .merge(filterAttributes.toSeq.toStructType)
+  } else readDataColumns.toStructType
--- End diff --

fix done





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2016-10-23 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/14957#discussion_r84592805
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala
 ---
@@ -97,7 +99,15 @@ object FileSourceStrategy extends Strategy with Logging {
 dataColumns
   .filter(requiredAttributes.contains)
   .filterNot(partitionColumns.contains)
-  val outputSchema = readDataColumns.toStructType
+  val outputSchema = if 
(fsRelation.sqlContext.conf.isParquetNestColumnPruning
+&& fsRelation.fileFormat.isInstanceOf[ParquetFileFormat]) {
+val totalSchema = readDataColumns.toStructType
--- End diff --

fix done





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-10 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91849543
  
--- Diff: 
core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala ---
@@ -97,6 +97,12 @@ object HiveCatalogMetrics extends Source {
 MetricRegistry.name("parallelListingJobCount"))
 
   /**
+   * Tracks the total number of cachedDataSourceTables hits.
+   */
+  val METRIC_DATASOUCE_TABLE_CACHE_HITS = metricRegistry.counter(
+MetricRegistry.name("dataSourceTableCacheHits"))
--- End diff --

Maybe we can't; only the cache-hit count lets us check the number.
I did the test below: I added a `Thread.sleep(1000)` before `cachedDataSourceTables.put(tableIdentifier, created)` at HiveMetastoreCatalog.scala +265 to make building the table relation slow, and printed all the metrics with and without the lock:
```
println(HiveCatalogMetrics.METRIC_DATASOUCE_TABLE_CACHE_HITS.getCount())
println(HiveCatalogMetrics.METRIC_FILE_CACHE_HITS.getCount())
println(HiveCatalogMetrics.METRIC_FILES_DISCOVERED.getCount())
println(HiveCatalogMetrics.METRIC_HIVE_CLIENT_CALLS.getCount())
println(HiveCatalogMetrics.METRIC_PARALLEL_LISTING_JOB_COUNT.getCount())
println(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount())
```
The result without the lock:
```
0
0
5
70
0
0
```
and the result with the lock:
```
9
0
5
70
0
0
```





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-10 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91849598
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
 ---
@@ -352,4 +353,28 @@ class PartitionedTablePerfStatsSuite
   }
 }
   }
+
+  test("SPARK-18700: add lock for each table's realation in cache") {
+withTable("test") {
+  withTempDir { dir =>
+HiveCatalogMetrics.reset()
+setupPartitionedHiveTable("test", dir)
+// select the table in multi-threads
+val executorPool = Executors.newFixedThreadPool(10)
+(1 to 10).map(threadId => {
+  val runnable = new Runnable {
+override def run(): Unit = {
+  spark.sql("select * from test where partCol1 = 999").count()
+}
+  }
+  executorPool.execute(runnable)
+  None
+})
+executorPool.shutdown()
+executorPool.awaitTermination(30, TimeUnit.SECONDS)
+// check the cache hit, the cache only load once
+
assert(HiveCatalogMetrics.METRIC_DATASOUCE_TABLE_CACHE_HITS.getCount() == 9)
--- End diff --

It may fail only sometimes without the lock, but when I add a `Thread.sleep(1000)` before `cachedDataSourceTables.put(tableIdentifier, created)` at HiveMetastoreCatalog.scala +265 to make building the table relation slow, it fails every time. How can I add such a hook in a UT? Or how can I make the cache build operation slow without actually creating a big table? :)





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-10 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91849563
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -53,6 +56,18 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
   tableIdent.table.toLowerCase)
   }
 
+  /** ReadWriteLock for each tables, protect the read and write cached */
+  private val tableLockStripes = Striped.lazyWeakLock(10)
+
+  /** Acquires a lock on the table cache for the duration of `f`. */
+  private def cacheLock[A](tableName: QualifiedTableName, f: => A): A = {
--- End diff --

update done
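
For context, a minimal sketch of what such a per-table lock helper can look like, assuming Guava's `Striped` and the `QualifiedTableName` key type from the surrounding file (illustrative, not the exact committed code):

```
import com.google.common.util.concurrent.Striped

// One lock per table; lazyWeakLock keeps the number of lock objects bounded.
val tableCreationLocks = Striped.lazyWeakLock(100)

// Runs `f` while holding the lock associated with `tableName`.
def withTableCreationLock[A](tableName: QualifiedTableName, f: => A): A = {
  val lock = tableCreationLocks.get(tableName)
  lock.lock()
  try f finally lock.unlock()
}
```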





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-10 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91849557
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -53,6 +56,18 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
   tableIdent.table.toLowerCase)
   }
 
+  /** ReadWriteLock for each tables, protect the read and write cached */
--- End diff --

update done, and more comments at HiveMetastoreCatalog.scala+226





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-10 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91849565
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
 ---
@@ -352,4 +353,28 @@ class PartitionedTablePerfStatsSuite
   }
 }
   }
+
+  test("SPARK-18700: add lock for each table's realation in cache") {
--- End diff --

update done





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-10 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91849561
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -53,6 +56,18 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
   tableIdent.table.toLowerCase)
   }
 
+  /** ReadWriteLock for each tables, protect the read and write cached */
+  private val tableLockStripes = Striped.lazyWeakLock(10)
--- End diff --

fix done





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-10 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91849544
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -33,6 +35,7 @@ import org.apache.spark.sql.hive.orc.OrcFileFormat
 import org.apache.spark.sql.types._
 
 
+
--- End diff --

fix done





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-11 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91879183
  
--- Diff: 
core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala ---
@@ -105,6 +111,7 @@ object HiveCatalogMetrics extends Source {
 METRIC_FILE_CACHE_HITS.dec(METRIC_FILE_CACHE_HITS.getCount())
 METRIC_HIVE_CLIENT_CALLS.dec(METRIC_HIVE_CLIENT_CALLS.getCount())
 
METRIC_PARALLEL_LISTING_JOB_COUNT.dec(METRIC_PARALLEL_LISTING_JOB_COUNT.getCount())
+
METRIC_DATASOUCE_TABLE_CACHE_HITS.dec(METRIC_DATASOUCE_TABLE_CACHE_HITS.getCount())
--- End diff --

Eh... sorry, this newly added metric will be deleted in the next patch, as per the earlier comment.





[GitHub] spark issue #16135: [SPARK-18700][SQL] Add StripedLock for each table's rela...

2016-12-12 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/16135
  
Thanks for ericl's review!





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-12 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91904868
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -209,72 +221,79 @@ private[hive] class 
HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
 }
   }
 
-  val cached = getCached(
-tableIdentifier,
-rootPaths,
-metastoreRelation,
-metastoreSchema,
-fileFormatClass,
-bucketSpec,
-Some(partitionSchema))
-
-  val logicalRelation = cached.getOrElse {
-val sizeInBytes = metastoreRelation.statistics.sizeInBytes.toLong
-val fileCatalog = {
-  val catalog = new CatalogFileIndex(
-sparkSession, metastoreRelation.catalogTable, sizeInBytes)
-  if (lazyPruningEnabled) {
-catalog
-  } else {
-catalog.filterPartitions(Nil)  // materialize all the 
partitions in memory
+  // Here we should protect all relation get and create operation with 
lock while big
+  // table's CatalogFileIndex will take some time, only lock 
cachedDataSourceTables.put
+  // will still cause driver memory waste. More detail see SPARK-18700.
--- End diff --

Done, deleted it.





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-12 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91904915
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
 ---
@@ -352,4 +353,34 @@ class PartitionedTablePerfStatsSuite
   }
 }
   }
+
+  test("SPARK-18700: table loaded only once even when resolved 
concurrently") {
+withSQLConf(SQLConf.HIVE_MANAGE_FILESOURCE_PARTITIONS.key -> "false") {
+  withTable("test") {
+withTempDir { dir =>
+  HiveCatalogMetrics.reset()
+  setupPartitionedHiveTable("test", dir, 50)
+  // select the table in multi-threads
+  val executorPool = Executors.newFixedThreadPool(10)
+  (1 to 10).map(threadId => {
+val runnable = new Runnable {
+  override def run(): Unit = {
+spark.sql("select * from test where partCol1 = 
999").count()
+  }
+}
+executorPool.execute(runnable)
+None
+  })
+  executorPool.shutdown()
+  executorPool.awaitTermination(30, TimeUnit.SECONDS)
+  // check the cache hit, we use the metric of 
METRIC_FILES_DISCOVERED and
+  // METRIC_PARALLEL_LISTING_JOB_COUNT to check this, while the 
lock take effect,
+  // only one thread can really do the build, so the listing job 
count is 2, the other
+  // one is cahce.load func. Also METRIC_FILES_DISCOVERED is 
$partition_num * 2
--- End diff --

……sorry, fix done





[GitHub] spark pull request #16135: [SPARK-18700][SQL] Add StripedLock for each table...

2016-12-12 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/16135#discussion_r91904834
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -53,6 +53,18 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
   tableIdent.table.toLowerCase)
   }
 
+  /** Locks for preventing driver mem waste when concurrent table 
instantiation */
+  private val tableCreationLocks = Striped.lazyWeakLock(100)
--- End diff --

fix done





[GitHub] spark issue #16135: [SPARK-18700][SQL] Add StripedLock for each table's rela...

2016-12-15 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/16135
  
cc @rxin, thanks for checking.





[GitHub] spark issue #16135: [SPARK-18700][SQL] Add ReadWriteLock for each table's re...

2016-12-06 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/16135
  
@rxin @liancheng 





[GitHub] spark issue #16135: [SPARK-18700][SQL] Add StripedLock for each table's rela...

2016-12-08 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/16135
  
Hi @ericl,
this commit does the 3 things below, thanks for checking:
1. Deletes the unnecessary lock usage and simplifies the lock operation
2. Adds a UT in `PartitionedTablePerfStatsSuite`
3. Adds a cache-hit metric in `HiveCatalogMetrics`
Also changes the description of this PR.





[GitHub] spark pull request #16135: SPARK-18700: add ReadWriteLock for each table's r...

2016-12-03 Thread xuanyuanking
GitHub user xuanyuanking opened a pull request:

https://github.com/apache/spark/pull/16135

SPARK-18700: add ReadWriteLock for each table's relation in cache

## What changes were proposed in this pull request?

As described in the scenario in [SPARK-18700](https://issues.apache.org/jira/browse/SPARK-18700), when cachedDataSourceTables is invalidated, the next few queries will each fetch all FileStatuses in the listLeafFiles function. When a table has many partitions, these jobs occupy a lot of driver memory and may eventually cause a driver OOM.

In this patch, a ReadWriteLock is added for each table's relation in the cache, rather than for the whole cachedDataSourceTables; a hashMap is kept whose key is the TableIdentifier and whose value is the corresponding ReadWriteLock.
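
A minimal sketch of the per-table lock map described above (illustrative only; a String key stands in for the TableIdentifier, and the review later moved to a Guava Striped lock, as the SPARK-18700 diffs above show):

```
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.locks.{ReadWriteLock, ReentrantReadWriteLock}

class TableRelationLocks {
  private val locks = new ConcurrentHashMap[String, ReadWriteLock]()

  private def lockFor(table: String): ReadWriteLock = {
    val fresh = new ReentrantReadWriteLock()
    val existing = locks.putIfAbsent(table, fresh)
    if (existing != null) existing else fresh
  }

  /** Reading a cached relation takes the read lock. */
  def withReadLock[A](table: String)(f: => A): A = {
    val l = lockFor(table).readLock(); l.lock()
    try f finally l.unlock()
  }

  /** Building and caching a relation takes the write lock. */
  def withWriteLock[A](table: String)(f: => A): A = {
    val l = lockFor(table).writeLock(); l.lock()
    try f finally l.unlock()
  }
}
```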

## How was this patch tested?

The existing tests.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xuanyuanking/spark SPARK-18700

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16135.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16135


commit 8718ec31d1234c04755467c872c5a9ab0991fa95
Author: xuanyuanking <xyliyuanj...@gmail.com>
Date:   2016-12-04T06:41:14Z

SPARK-18700: add ReadWriteLock for each table's relation in cache







[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...

2016-12-20 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/16350
  
Deleted the UT and the metric. :)





[GitHub] spark pull request #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock ...

2016-12-20 Thread xuanyuanking
GitHub user xuanyuanking opened a pull request:

https://github.com/apache/spark/pull/16350

[SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for each table's relation 
in cache

## What changes were proposed in this pull request?

Backport of #16135 to branch-2.0

## How was this patch tested?

Because of the differences between branch-2.0 and master/2.1, this backport adds a 
multi-threaded table-access test in `HiveMetadataCacheSuite` and uses the metrics in 
`HiveCatalogMetrics` to check that the relation is loaded only once.
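
To sketch the shape of that check (this is not the actual `HiveMetadataCacheSuite` code; the `loadCount` counter stands in for the catalog metrics):

```
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.{AtomicInteger, AtomicReference}

object MultiThreadedCacheCheck {
  def main(args: Array[String]): Unit = {
    val loadCount = new AtomicInteger(0)           // stand-in for the catalog metric
    val cached = new AtomicReference[AnyRef](null)
    val lock = new Object

    // Stand-in for "look up the cached relation, loading it on a miss".
    def readTable(): AnyRef = {
      val c = cached.get()
      if (c != null) c
      else lock.synchronized {
        if (cached.get() == null) {
          loadCount.incrementAndGet()              // the expensive listing happens here
          cached.set(new Object)
        }
        cached.get()
      }
    }

    val pool = Executors.newFixedThreadPool(16)
    val start = new CountDownLatch(1)
    (1 to 16).foreach { _ =>
      pool.submit(new Runnable { def run(): Unit = { start.await(); readTable() } })
    }
    start.countDown()                              // release all threads at once
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    assert(loadCount.get() == 1, s"expected a single load, got ${loadCount.get()}")
  }
}
```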


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xuanyuanking/spark SPARK-18700-2.0

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16350.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16350


commit 132d12ee1457c41a0bec56516ab5a41d36d8ac1f
Author: xuanyuanking <xyliyuanj...@gmail.com>
Date:   2016-12-20T10:50:03Z

SPARK-18700: Add StripedLock for each table's relation in cache







[GitHub] spark issue #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock for eac...

2016-12-21 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/16350
  
Thanks!





[GitHub] spark pull request #16350: [SPARK-18700][SQL][BACKPORT-2.0] Add StripedLock ...

2016-12-21 Thread xuanyuanking
Github user xuanyuanking closed the pull request at:

https://github.com/apache/spark/pull/16350





[GitHub] spark issue #16135: [SPARK-18700][SQL] Add ReadWriteLock for each table's re...

2016-12-07 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/16135
  
@ericl Thanks for your review.

> Is it sufficient to lock around the catalog.filterPartitions(Nil)?

Yes, this patch was ported from 1.6.2 and I missed the diff here. Fixed in the next patch.

> Why do we need reader locks?

Writing to or invalidating the table cache happens far less often than reading it. With 
reader locks, a reader only waits while the same table's cache is being written.





[GitHub] spark issue #16135: [SPARK-18700][SQL] Add StripedLock for each table's rela...

2016-12-20 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/16135
  
@hvanhovell Sure, I opened a new BACKPORT-2.0 PR.
There's a small diff in branch-2.0: the unit test of this patch relies on 
`HiveCatalogMetrics`, which is not available in 2.0, so I also added the metric the patch needs. 
Thanks for checking.





[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-04-20 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/17702
  
@marmbrus Can you take a look of this? Thanks :)





[GitHub] spark pull request #17702: [SPARK-20408][SQL] Get the glob path in parallel ...

2017-04-20 Thread xuanyuanking
GitHub user xuanyuanking opened a pull request:

https://github.com/apache/spark/pull/17702

[SPARK-20408][SQL] Get the glob path in parallel to reduce resolve relation 
time

## What changes were proposed in this pull request?
This PR resolves glob paths in parallel, which makes complex wildcard paths resolve 
much more quickly. The main changes in detail (a rough sketch follows the list):
1. Add a config named `spark.sql.globPathInParallel`, default false.
2. Add a new function `getGlobbedPaths` in DataSource that returns all paths 
represented by the wildcard; whether it runs in parallel is controlled by the config.
3. Add a new function `expandGlobPath` in SparkHadoopUtil to expand the first 
directory level represented by the wildcard.
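
A rough standalone sketch of the approach, under the assumption that the first wildcard component is expanded on the driver and the remainder is resolved as a small Spark job; this is not the PR's actual `getGlobbedPaths`/`expandGlobPath` code, and the config check is omitted:

```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.SparkSession

object ParallelGlobSketch {
  // Expand only the first wildcard component, on the driver.
  def expandFirstWildcard(pattern: Path, conf: Configuration): Seq[String] = {
    val parts = pattern.toString.split("/")
    val firstGlob = parts.indexWhere(p => p.exists(c => "*?[]{}".contains(c)))
    if (firstGlob < 0) return Seq(pattern.toString)
    val prefix = new Path(parts.take(firstGlob + 1).mkString("/"))
    val suffix = parts.drop(firstGlob + 1).mkString("/")
    val fs = prefix.getFileSystem(conf)
    Option(fs.globStatus(prefix)).getOrElse(Array.empty[FileStatus])
      .map(s => if (suffix.isEmpty) s.getPath.toString else new Path(s.getPath, suffix).toString)
      .toSeq
  }

  // Resolve the remaining wildcard components of each expanded prefix in parallel.
  def getGlobbedPaths(spark: SparkSession, pattern: Path): Seq[Path] = {
    val prefixes = expandFirstWildcard(pattern, spark.sparkContext.hadoopConfiguration)
    spark.sparkContext
      .parallelize(prefixes, math.max(prefixes.length, 1))
      .flatMap { p =>
        val path = new Path(p)
        // A fresh Configuration on the executor; the real code would ship a proper conf.
        val fs = path.getFileSystem(new Configuration())
        Option(fs.globStatus(path)).getOrElse(Array.empty[FileStatus]).map(_.getPath.toString)
      }
      .collect()
      .map(new Path(_))
      .toSeq
  }
}
```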

## How was this patch tested?
Existing UT.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xuanyuanking/spark SPARK-20408

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17702.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17702


commit b27ef4f9e696e2b2c2fc2e0df504baea88937234
Author: xuanyuanking <xyliyuanj...@gmail.com>
Date:   2017-04-20T11:07:47Z

[SPARK-20408][SQL]Get the glob path in parallel to reduce resolve relation 
time







[GitHub] spark pull request #18760: [SPARK-21560][Core] Add hold mode for the LiveLis...

2017-07-28 Thread xuanyuanking
GitHub user xuanyuanking opened a pull request:

https://github.com/apache/spark/pull/18760

[SPARK-21560][Core] Add hold mode for the LiveListenerBus

## What changes were proposed in this pull request?
1. Add configs for the hold strategy and the idle capacity (a minimal sketch follows 
this list).
2. Block (hold) the post method while the event queue is full.
3. Notify all holding threads once the event queue's empty rate is larger than the 
configured threshold.
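
For illustration, a minimal standalone sketch of the hold idea; the real patch works against `LiveListenerBus`, and the class name and `resumeRatio` parameter here are simplified assumptions:

```
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.locks.ReentrantLock

// Posters are parked ("hold mode") while the queue is full and released once the
// queue has drained below resumeRatio of its capacity.
class HoldingEventQueue[E](capacity: Int, resumeRatio: Double = 0.5) {
  private val queue = new LinkedBlockingQueue[E](capacity)
  private val lock = new ReentrantLock()
  private val notFullEnough = lock.newCondition()

  /** Called by event producers; blocks instead of dropping events. */
  def post(event: E): Unit = {
    lock.lock()
    try {
      while (!queue.offer(event)) {   // queue full: hold the poster
        notFullEnough.await()
      }
    } finally lock.unlock()
  }

  /** Called by the single dispatcher thread. */
  def take(): E = {
    val event = queue.take()
    lock.lock()
    try {
      // Wake held posters only once enough room has been freed.
      if (queue.size() <= (capacity * resumeRatio).toInt) {
        notFullEnough.signalAll()
      }
    } finally lock.unlock()
    event
  }
}
```

With a plain blocking `put` the posting threads would simply hang one by one; the explicit condition lets the queue drain down to the configured threshold before any of them are released, which matches point 3 above.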

## How was this patch tested?

Add a new test in SparkListenerSuite

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xuanyuanking/spark SPARK-21560

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18760.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18760


commit 5e9a4b4feec38e132f2e9c4053be22ca6926e932
Author: Yuanjian Li <xyliyuanj...@gmail.com>
Date:   2017-07-28T13:25:31Z

SPARK-21560: Add hold mode for the LiveListenerBus







[GitHub] spark issue #18760: [SPARK-21560][Core] Add hold mode for the LiveListenerBu...

2017-08-14 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/18760
  
@jiangxb1987 Hi Xingbo, could you give me some advice on this?





[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...

2017-07-17 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/18654
  
@HyukjinKwon Thanks for your comment. As you mentioned in #18650 and 
#17395, the empty-result case for Parquet can be fixed by keeping the first partition; 
what about the ORC format? Should the ORC error for empty results also be handled 
in this patch?





[GitHub] spark issue #18650: [SPARK-21435][SQL] Empty files should be skipped while w...

2017-07-17 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/18650
  
Yep, I'll just close this and open #18654.





[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...

2017-07-17 Thread xuanyuanking
GitHub user xuanyuanking opened a pull request:

https://github.com/apache/spark/pull/18654

[SPARK-21435][SQL] Empty files should be skipped while write to file

## What changes were proposed in this pull request?

Add an `EmptyDirectoryWriteTask` for empty tasks while writing files; a simplified sketch is shown below.
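
As a rough sketch of the idea only (the real `FileFormatWriter` types are simplified away here):

```
// A write task abstraction: returns the number of data files it produced.
trait WriteTask[T] {
  def execute(rows: Iterator[T]): Int
}

// Used for partitions that turn out to be empty: no data file is created at all;
// the job-level commit still writes the directory's metadata.
class EmptyDirectoryWriteTask[T] extends WriteTask[T] {
  override def execute(rows: Iterator[T]): Int = {
    assert(rows.isEmpty, "EmptyDirectoryWriteTask must only see empty partitions")
    0
  }
}

object WriteTask {
  // Pick the task per partition: empty iterators skip file creation entirely.
  def forPartition[T](rows: Iterator[T], realTask: () => WriteTask[T]): WriteTask[T] =
    if (rows.hasNext) realTask() else new EmptyDirectoryWriteTask[T]
}
```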

## How was this patch tested?

Add a new test in `FileFormatWriterSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xuanyuanking/spark SPARK-21435

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18654.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18654


commit ff92ba3ae3abaae0d8ce8bb8c08465f43c14906f
Author: xuanyuanking <xyliyuanj...@gmail.com>
Date:   2017-07-17T07:38:56Z

empty files should be skipped while write to file

commit e08fb1939d5267a38f3318af7506b6ed8628ebbf
Author: xuanyuanking <xyliyuanj...@gmail.com>
Date:   2017-07-17T10:20:34Z

Handle the empty result of parquet

commit 6153001bc42deee197030ad91fbb4f72bd1aa5d3
Author: xuanyuanking <xyliyuanj...@gmail.com>
Date:   2017-07-17T12:09:08Z

Modify UT and add more notes







[GitHub] spark pull request #18650: [SPARK-21435][SQL] Empty files should be skipped ...

2017-07-17 Thread xuanyuanking
Github user xuanyuanking closed the pull request at:

https://github.com/apache/spark/pull/18650





[GitHub] spark pull request #18650: [SPARK-21435][SQL] Empty files should be skipped ...

2017-07-17 Thread xuanyuanking
GitHub user xuanyuanking opened a pull request:

https://github.com/apache/spark/pull/18650

[SPARK-21435][SQL] Empty files should be skipped while write to file

## What changes were proposed in this pull request?

Add an `EmptyDirectoryWriteTask` for empty tasks while writing files.

## How was this patch tested?

Add a new test in `FileFormatWriterSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xuanyuanking/spark SPARK-21435

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18650.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18650


commit ff92ba3ae3abaae0d8ce8bb8c08465f43c14906f
Author: xuanyuanking <xyliyuanj...@gmail.com>
Date:   2017-07-17T07:38:56Z

empty files should be skipped while write to file







[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...

2017-07-18 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/18654
  
retest this please





[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...

2017-07-17 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/18654#discussion_r127856498
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.{File, FilenameFilter}
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.test.SharedSQLContext
+
+class FileFormatWriterSuite extends QueryTest with SharedSQLContext {
+
+  test("empty file should be skipped while write to file") {
+withTempDir { dir =>
+  dir.delete()
+  spark.range(1).repartition(10).write.parquet(dir.toString)
+  val df = spark.read.parquet(dir.toString)
+  val allFiles = dir.listFiles(new FilenameFilter {
--- End diff --

Both are OK I think; I just copied this from `HadoopFsRelationSuite`.





[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...

2017-07-17 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/18654#discussion_r127872988
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.{File, FilenameFilter}
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.test.SharedSQLContext
+
+class FileFormatWriterSuite extends QueryTest with SharedSQLContext {
+
+  test("empty file should be skipped while write to file") {
+withTempDir { dir =>
+  dir.delete()
+  spark.range(1).repartition(10).write.parquet(dir.toString)
+  val df = spark.read.parquet(dir.toString)
+  val allFiles = dir.listFiles(new FilenameFilter {
+override def accept(dir: File, name: String): Boolean = {
+  !name.startsWith(".") && !name.startsWith("_")
+}
+  })
+  assert(allFiles.length == 10)
+
+  withTempDir { dst_dir =>
+dst_dir.delete()
+df.where("id = 50").write.parquet(dst_dir.toString)
+val allFiles = dst_dir.listFiles(new FilenameFilter {
+  override def accept(dir: File, name: String): Boolean = {
+!name.startsWith(".") && !name.startsWith("_")
+  }
+})
+// First partition file and the data file
--- End diff --

Couldn't agree more. I first tried to implement it like this, but 
`FileFormatWriter.write` can only see each task's own iterator.





[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...

2017-07-18 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/18654#discussion_r127888746
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala
 ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.test.SharedSQLContext
+
+class FileFormatWriterSuite extends QueryTest with SharedSQLContext {
+
+  test("empty file should be skipped while write to file") {
+withTempPath { dir =>
--- End diff --

Much clearer :) No need to actually create the source files.





[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...

2017-07-17 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/18654
  
retest this please





[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...

2017-07-17 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/18654
  
Yep, an empty result directory needs this metadata; otherwise reading it back throws the exception:
```
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:188)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:188)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:187)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:381)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:571)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:555)
  ... 48 elided
```
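
For context, a spark-shell style repro of that failure, assuming `spark` is the active `SparkSession` and the output path is illustrative:

```
val df = spark.range(100).toDF("id")
// The filter matches nothing, so every write task sees an empty iterator.
df.where("id = 500").write.parquet("/tmp/empty_result")
// Without the metadata written for the empty output, this read fails with
// "Unable to infer schema for Parquet. It must be specified manually."
val back = spark.read.parquet("/tmp/empty_result")
```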





[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...

2017-07-17 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/18654#discussion_r127865091
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.{File, FilenameFilter}
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.test.SharedSQLContext
+
+class FileFormatWriterSuite extends QueryTest with SharedSQLContext {
+
+  test("empty file should be skipped while write to file") {
+withTempDir { dir =>
+  dir.delete()
+  spark.range(1).repartition(10).write.parquet(dir.toString)
+  val df = spark.read.parquet(dir.toString)
+  val allFiles = dir.listFiles(new FilenameFilter {
+override def accept(dir: File, name: String): Boolean = {
+  !name.startsWith(".") && !name.startsWith("_")
+}
+  })
+  assert(allFiles.length == 10)
--- End diff --

OK, I'll remove this assert and leave a note.





[GitHub] spark pull request #18654: [SPARK-21435][SQL] Empty files should be skipped ...

2017-07-17 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/18654#discussion_r127856720
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.{File, FilenameFilter}
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.test.SharedSQLContext
+
+class FileFormatWriterSuite extends QueryTest with SharedSQLContext {
+
+  test("empty file should be skipped while write to file") {
+withTempDir { dir =>
+  dir.delete()
+  spark.range(1).repartition(10).write.parquet(dir.toString)
+  val df = spark.read.parquet(dir.toString)
+  val allFiles = dir.listFiles(new FilenameFilter {
+override def accept(dir: File, name: String): Boolean = {
+  !name.startsWith(".") && !name.startsWith("_")
+}
+  })
+  assert(allFiles.length == 10)
--- End diff --

It just makes sure the source dir has many files while the output dir only has 2 files.
If this confuses people, shall I leave a note and delete the assert?





[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...

2017-07-18 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/18654
  
retest this please





[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...

2017-07-18 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/18654
  
retest this please...





[GitHub] spark issue #18654: [SPARK-21435][SQL] Empty files should be skipped while w...

2017-07-18 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/18654
  
ping @cloud-fan @HyukjinKwon 





[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-04-25 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/17702
  
cc @zsxwing @tdas, could you review this? I found related code of yours 
earlier. Thanks :)





[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-04-27 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/17702
  
@HyukjinKwon Could you help me find an appropriate reviewer for this?





[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-04-30 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/17702
  
ping @cloud-fan and @gatorsmile, could you have a look at this? 
Thanks :)





[GitHub] spark issue #18760: [SPARK-21560][Core] Add hold mode for the LiveListenerBu...

2017-07-31 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/18760
  
ping @gatorsmile @cloud-fan, could you review this? Thanks :)





[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-05-12 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/17702
  
ping @cloud-fan 





[GitHub] spark pull request #14957: [SPARK-4502][SQL]Support parquet nested struct pr...

2017-06-26 Thread xuanyuanking
Github user xuanyuanking closed the pull request at:

https://github.com/apache/spark/pull/14957





[GitHub] spark issue #14957: [SPARK-4502][SQL]Support parquet nested struct pruning a...

2017-06-26 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/14957
  
OK, I'll close this and just use it in our internal environment. Thanks everyone for the 
suggestions and review work. Next we may try more complex scenarios based on this.





[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-06-05 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/17702
  
@cloud-fan Thanks for your reply!
It's possible to consolidate them, but maybe it's not really necessary? I could 
consolidate them by replacing the logic in `getGlobbedPaths` (listed below) with 
`InMemoryFileIndex.bulkListLeafFiles`:
```
+      val expanded = sparkSession.sparkContext
+        .parallelize(paths, paths.length)
+        .map { pathString =>
+          SparkHadoopUtil.get.globPathIfNecessary(new Path(pathString)).map(_.toString)
+        }.collect()
+      expanded.flatMap(paths => paths.map(new Path(_))).toSeq
```
It would also require changing the interface of `bulkListLeafFiles` by adding a default 
parameter to pass in the `globPathIfNecessary` function, because the glob path can 
be nested across many levels.
Maybe the current logic of adding a separate parallelized job is better than 
consolidating them; what do you think?





[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-06-16 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/17702
  
The test failure may be caused by the environment? `process was terminated by signal 9` in 
the Jenkins log.
retest this please





[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-06-16 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/17702
  
retest this please





[GitHub] spark pull request #17702: [SPARK-20408][SQL] Get the glob path in parallel ...

2017-06-15 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/17702#discussion_r122364493
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -389,6 +389,23 @@ case class DataSource(
   }
 
   /**
+   * Return all paths represented by the wildcard string.
+   */
+  private def getGlobbedPaths(qualified: Path): Seq[Path] = {
--- End diff --

You are right.
I'll fix this and also limit the max parallelism in the next patch, reusing 
the config in `InMemoryFileIndex.bulkListLeafFiles`.





[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-05-07 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/17702
  
@gatorsmile @cloud-fan, do we need any other performance tests?





[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-05-02 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/17702
  
Thanks for your review. @gatorsmile @cloud-fan 

`Can you show us the performance difference?`

No problem, I reproduced our online case offline as below.

## Without parallel resolve:

![image](https://cloud.githubusercontent.com/assets/4833765/25610958/d4b23510-2f57-11e7-9a87-969e215b16c6.png)

## With parallel resolve:

![image](https://cloud.githubusercontent.com/assets/4833765/25611104/6d061908-2f58-11e7-9006-08368d9a6610.png)

## Test env:
item | detail
--|---
Spark version | current master 2.2.0-SNAPSHOT
Hadoop version | 2.7.2
HDFS | 8 servers(128G, 20 core + 20 hyper threading) 
Test case | `spark.read.text("/app/dc/test_for_ls/*/*/*/*").count()` — the first level 
below `test_for_ls` contains 96 directories, each with 1,000 directories at the next 
level; the third and fourth levels have only one directory and one file each.

## Discussion:
1. More complex scenarios and deeper directory levels benefit even more from this 
optimization; in this local test it already runs about 100% faster.
2. The `spark.sql.globPathInParallel` config only parallelizes the glob path 
resolution.
3. If the driver and the cluster are in different geographical regions, this improvement 
gives at least a 5x speedup in our scenario, because the resolution work becomes a 
parallel job on the cluster.
 





[GitHub] spark pull request #17702: [SPARK-20408][SQL] Get the glob path in parallel ...

2017-05-02 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/17702#discussion_r114275475
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -146,6 +146,11 @@ object SQLConf {
 .longConf
 .createWithDefault(Long.MaxValue)
 
+  val GLOB_PATH_IN_PARALLEL = buildConf("spark.sql.globPathInParallel")
+.doc("When true, resolve the glob path in parallel, the strategy same 
with ")
--- End diff --

Sorry for the cut-off; I added a patch to fix this.





[GitHub] spark pull request #19287: [SPARK-22074][Core] Task killed by other attempt ...

2017-09-19 Thread xuanyuanking
GitHub user xuanyuanking opened a pull request:

https://github.com/apache/spark/pull/19287

[SPARK-22074][Core] Task killed by other attempt task should not be 
resubmitted

## What changes were proposed in this pull request?

As the detailed scenario described in 
[SPARK-22074](https://issues.apache.org/jira/browse/SPARK-22074) shows, unnecessary 
resubmission can cause a stage to hang in currently released versions. This patch 
adds a new var in TaskInfo to mark whether a task was killed by another attempt.
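
A simplified sketch of the bookkeeping idea (not the actual TaskSetManager code; the class and method names are illustrative):

```
import scala.collection.mutable

class SpeculationBookkeeping {
  // Task indexes whose running attempt was killed only because a sibling
  // (speculative) attempt had already succeeded on a different executor.
  private val killedByOtherAttempt = mutable.Set[Int]()

  // Called when a successful attempt causes the scheduler to kill its siblings.
  def onSiblingAttemptsKilled(taskIndex: Int): Unit =
    killedByOtherAttempt += taskIndex

  // Called for each finished task that ran on a lost executor: resubmit it only
  // if its output really lived there, i.e. it was not one of the attempts that
  // was killed in favour of a successful sibling.
  def shouldResubmitOnExecutorLost(taskIndex: Int): Boolean =
    !killedByOtherAttempt.contains(taskIndex)
}
```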

## How was this patch tested?

Added a new UT `[SPARK-22074] Task killed by other attempt task should not be 
resubmitted` in TaskSetManagerSuite. This UT recreates the scenario in the JIRA 
description; it fails without the changes in this PR and passes with them.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xuanyuanking/spark SPARK-22074

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19287.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19287


commit a28daa2c3283ad31659f840e6d401ab48a42ad88
Author: Yuanjian Li <xyliyuanj...@gmail.com>
Date:   2017-09-20T05:35:35Z

[SPARK-22074][Core] Task killed by other attempt task should not be 
resubmitted







[GitHub] spark issue #19287: [SPARK-22074][Core] Task killed by other attempt task sh...

2017-09-20 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/19287
  
`signal 9`
retest this please





[GitHub] spark pull request #19287: [SPARK-22074][Core] Task killed by other attempt ...

2017-09-20 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/19287#discussion_r140143293
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskInfo.scala ---
@@ -66,6 +66,12 @@ class TaskInfo(
*/
   var finishTime: Long = 0
 
+  /**
+   * Set this tag when this task killed by other attempt. This kind of 
task should not resubmit
+   * while executor lost.
+   */
--- End diff --

Thanks for reviewing; I've rewritten this comment.





[GitHub] spark issue #18760: [SPARK-21560][Core] Add hold mode for the LiveListenerBu...

2017-09-26 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/18760
  
The hold mode is still valid. I resolved the conflict and added the logic 
into `AsyncEventQueue`; it can be confirmed by the test case added in this 
[patch](https://github.com/apache/spark/pull/18760/files#diff-6ddec7f06d0cf5392943ecdb80fcea24R515).
`Now LiveListenerBus have multiple queue`
Yep, glad to see SPARK-18838 finally merged; it may resolve the event 
drop problem. But, as you say, it is `very unlikely` rather than impossible, so maybe a 
hold mode is still safer?






[GitHub] spark issue #19287: [SPARK-22074][Core] Task killed by other attempt task sh...

2017-09-29 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/19287
  
retest this please





[GitHub] spark issue #19287: [SPARK-22074][Core] Task killed by other attempt task sh...

2017-09-28 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/19287
  
@squito Hi Rashid, thanks for your review and advice. In the last commit I 
moved `killedByOtherAttempt` into `TaskSetManager` as you suggested and added more 
asserts in the UT.





[GitHub] spark issue #18760: [SPARK-21560][Core] Add hold mode for the LiveListenerBu...

2017-09-28 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/18760
  
@vanzin Hi Vanzin, thanks a lot for your comments.
First, to answer your question `Why isn't hold mode just calling 
queue.put (blocking) instead of queue.offer (non-blocking)?`:
In the general scenario where the queue is full, we need a period during which all 
event posts are held back from the queue, so I use `offer` together with the configured 
`empty rate` to decide when to release them. If we used `put` (blocking) here, it would 
not relieve the pressure on the queue; it would just hang each post individually.

Actually this patch is an internal fix for the event-dropping problem 
in Baidu's internal environment, as I described in the 
[jira](https://issues.apache.org/jira/browse/SPARK-21560); glad to see 
SPARK-18838 has been merged.

Here I had another thought: how about I port SPARK-18838 to our production 
environment, where the event dropping is currently solved by this patch, and if it works 
well I just close this. What do you two think? @cloud-fan @vanzin :)





[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-09-30 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/17702
  
retest this please





[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-10-01 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/17702
  
retest this please





[GitHub] spark pull request #19287: [SPARK-22074][Core] Task killed by other attempt ...

2017-09-28 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/19287#discussion_r141784747
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
@@ -744,6 +744,100 @@ class TaskSetManagerSuite extends SparkFunSuite with 
LocalSparkContext with Logg
 assert(resubmittedTasks === 0)
   }
 
+
+  test("[SPARK-22074] Task killed by other attempt task should not be 
resubmitted") {
+val conf = new SparkConf().set("spark.speculation", "true")
+sc = new SparkContext("local", "test", conf)
+// Set the speculation multiplier to be 0 so speculative tasks are 
launched immediately
+sc.conf.set("spark.speculation.multiplier", "0.0")
+sc.conf.set("spark.speculation.quantile", "0.5")
+sc.conf.set("spark.speculation", "true")
+
+val sched = new FakeTaskScheduler(sc, ("exec1", "host1"),
+  ("exec2", "host2"), ("exec3", "host3"))
+sched.initialize(new FakeSchedulerBackend() {
+  override def killTask(
+   taskId: Long,
+   executorId: String,
+   interruptThread: Boolean,
+   reason: String): Unit = {}
+})
+
+// Keep track of the number of tasks that are resubmitted,
+// so that the test can check that no tasks were resubmitted.
+var resubmittedTasks = 0
+val dagScheduler = new FakeDAGScheduler(sc, sched) {
+  override def taskEnded(
+  task: Task[_],
+  reason: TaskEndReason,
+  result: Any,
+  accumUpdates: Seq[AccumulatorV2[_, _]],
+  taskInfo: TaskInfo): Unit = {
+super.taskEnded(task, reason, result, accumUpdates, taskInfo)
+reason match {
+  case Resubmitted => resubmittedTasks += 1
+  case _ =>
+}
+  }
+}
+sched.setDAGScheduler(dagScheduler)
+
+val taskSet = FakeTask.createShuffleMapTaskSet(4, 0, 0,
+  Seq(TaskLocation("host1", "exec1")),
+  Seq(TaskLocation("host1", "exec1")),
+  Seq(TaskLocation("host3", "exec3")),
+  Seq(TaskLocation("host2", "exec2")))
+
+val clock = new ManualClock()
+val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES, 
clock = clock)
+val accumUpdatesByTask: Array[Seq[AccumulatorV2[_, _]]] = 
taskSet.tasks.map { task =>
+  task.metrics.internalAccums
+}
+// Offer resources for 4 tasks to start
+for ((k, v) <- List(
+  "exec1" -> "host1",
+  "exec1" -> "host1",
+  "exec3" -> "host3",
+  "exec2" -> "host2")) {
+  val taskOption = manager.resourceOffer(k, v, NO_PREF)
+  assert(taskOption.isDefined)
+  val task = taskOption.get
+  assert(task.executorId === k)
+}
+assert(sched.startedTasks.toSet === Set(0, 1, 2, 3))
+clock.advance(1)
+// Complete the 2 tasks and leave 2 task in running
+for (id <- Set(0, 1)) {
+  manager.handleSuccessfulTask(id, createTaskResult(id, 
accumUpdatesByTask(id)))
+  assert(sched.endedTasks(id) === Success)
+}
+
+// checkSpeculatableTasks checks that the task runtime is greater than 
the threshold for
+// speculating. Since we use a threshold of 0 for speculation, tasks 
need to be running for
+// > 0ms, so advance the clock by 1ms here.
+clock.advance(1)
+assert(manager.checkSpeculatableTasks(0))
+assert(sched.speculativeTasks.toSet === Set(2, 3))
+
+// Offer resource to start the speculative attempt for the running 
task 2.0
+val taskOption = manager.resourceOffer("exec2", "host2", ANY)
+assert(taskOption.isDefined)
+val task4 = taskOption.get
+assert(task4.index === 2)
+assert(task4.taskId === 4)
+assert(task4.executorId === "exec2")
+assert(task4.attemptNumber === 1)
+sched.backend = mock(classOf[SchedulerBackend])
--- End diff --

Yep, in this case there's only one `killTask` check; fixed in the next 
patch.





[GitHub] spark pull request #19287: [SPARK-22074][Core] Task killed by other attempt ...

2017-09-28 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/19287#discussion_r141784812
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
@@ -744,6 +744,100 @@ class TaskSetManagerSuite extends SparkFunSuite with 
LocalSparkContext with Logg
 assert(resubmittedTasks === 0)
   }
 
+
+  test("[SPARK-22074] Task killed by other attempt task should not be 
resubmitted") {
+val conf = new SparkConf().set("spark.speculation", "true")
+sc = new SparkContext("local", "test", conf)
+// Set the speculation multiplier to be 0 so speculative tasks are 
launched immediately
+sc.conf.set("spark.speculation.multiplier", "0.0")
+sc.conf.set("spark.speculation.quantile", "0.5")
+sc.conf.set("spark.speculation", "true")
+
+val sched = new FakeTaskScheduler(sc, ("exec1", "host1"),
+  ("exec2", "host2"), ("exec3", "host3"))
+sched.initialize(new FakeSchedulerBackend() {
+  override def killTask(
+   taskId: Long,
+   executorId: String,
+   interruptThread: Boolean,
+   reason: String): Unit = {}
+})
+
+// Keep track of the number of tasks that are resubmitted,
+// so that the test can check that no tasks were resubmitted.
+var resubmittedTasks = 0
+val dagScheduler = new FakeDAGScheduler(sc, sched) {
+  override def taskEnded(
+  task: Task[_],
+  reason: TaskEndReason,
+  result: Any,
+  accumUpdates: Seq[AccumulatorV2[_, _]],
+  taskInfo: TaskInfo): Unit = {
+super.taskEnded(task, reason, result, accumUpdates, taskInfo)
+reason match {
+  case Resubmitted => resubmittedTasks += 1
+  case _ =>
+}
+  }
+}
+sched.setDAGScheduler(dagScheduler)
+
+val taskSet = FakeTask.createShuffleMapTaskSet(4, 0, 0,
+  Seq(TaskLocation("host1", "exec1")),
+  Seq(TaskLocation("host1", "exec1")),
+  Seq(TaskLocation("host3", "exec3")),
+  Seq(TaskLocation("host2", "exec2")))
+
+val clock = new ManualClock()
+val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES, clock = clock)
+val accumUpdatesByTask: Array[Seq[AccumulatorV2[_, _]]] = taskSet.tasks.map { task =>
+  task.metrics.internalAccums
+}
+// Offer resources for 4 tasks to start
+for ((k, v) <- List(
+  "exec1" -> "host1",
+  "exec1" -> "host1",
+  "exec3" -> "host3",
+  "exec2" -> "host2")) {
+  val taskOption = manager.resourceOffer(k, v, NO_PREF)
+  assert(taskOption.isDefined)
+  val task = taskOption.get
+  assert(task.executorId === k)
+}
+assert(sched.startedTasks.toSet === Set(0, 1, 2, 3))
--- End diff --

OK, I added the assert while getting the `taskOption`.


---




[GitHub] spark pull request #19287: [SPARK-22074][Core] Task killed by other attempt ...

2017-09-28 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/19287#discussion_r141784872
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskInfo.scala ---
@@ -66,6 +66,13 @@ class TaskInfo(
*/
   var finishTime: Long = 0
 
+  /**
+   * Set this var when the current task killed by other attempt tasks, this happened while we
+   * set the `spark.speculation` to true. The task killed by others should not resubmit
+   * while executor lost.
+   */
+  var killedByOtherAttempt = false
--- End diff --

Thanks for pointing out that `TaskInfo` is a DeveloperApi. I addressed this following your 
second comment and made the flag completely private to `TaskSetManager`.
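
For readers following along, a rough sketch of keeping that state out of the public `TaskInfo`; the helper class below is purely illustrative, and in the PR the set would live directly inside `TaskSetManager`:

```scala
import scala.collection.mutable.HashSet

// Illustrative stand-in: track "killed because another attempt succeeded"
// inside the scheduler instead of on the DeveloperApi TaskInfo.
class SpeculationKillTracker {
  // Task IDs whose running attempt was killed after a speculative copy succeeded.
  private val killedByOtherAttempt = new HashSet[Long]

  def recordKill(taskId: Long): Unit = killedByOtherAttempt += taskId

  // On executor loss, attempts killed by a sibling attempt should not be resubmitted.
  def shouldResubmit(taskId: Long): Boolean = !killedByOtherAttempt.contains(taskId)
}
```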


---




[GitHub] spark pull request #19287: [SPARK-22074][Core] Task killed by other attempt ...

2017-09-27 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/19287#discussion_r141513379
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskInfo.scala ---
@@ -74,6 +81,10 @@ class TaskInfo(
 gettingResultTime = time
   }
 
+  private[spark] def markKilledAttempt: Unit = {
--- End diff --

Sorry for missing that; I'll change it right now.
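
If the missing piece was the empty parameter list that Scala convention gives to side-effecting methods, the change would look roughly as below; this is only an assumption about the review comment, and the enclosing class is a stand-in rather than the real `TaskInfo`:

```scala
package org.apache.spark.scheduler

// Stand-in class for illustration only.
private[spark] class TaskAttemptState {
  private var killedByOtherAttempt = false

  // Empty parens signal the side effect; a parameterless `def ...: Unit`
  // reads like a pure accessor.
  private[spark] def markKilledAttempt(): Unit = {
    killedByOtherAttempt = true
  }

  private[spark] def wasKilledByOtherAttempt: Boolean = killedByOtherAttempt
}
```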


---




[GitHub] spark issue #19287: [SPARK-22074][Core] Task killed by other attempt task sh...

2017-09-25 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/19287
  
ping @cloud-fan @gatorsmile 


---




[GitHub] spark issue #19287: [SPARK-22074][Core] Task killed by other attempt task sh...

2017-09-28 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/19287
  
@jerryshao Thanks for your review.


---




[GitHub] spark issue #19287: [SPARK-22074][Core] Task killed by other attempt task sh...

2017-10-09 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/19287
  
Thanks all reviewers!


---




[GitHub] spark pull request #17702: [SPARK-20408][SQL] Get the glob path in parallel ...

2017-11-13 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/17702#discussion_r150484715
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -432,6 +433,32 @@ case class DataSource(
   }
 
   /**
+   * Return all paths represented by the wildcard string.
+   * Follow [[InMemoryFileIndex]].bulkListLeafFile and reuse the conf.
+   */
+  private def getGlobbedPaths(
--- End diff --

Done in the next commit.
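
As background for the snippet above, here is a rough, self-contained sketch of the parallel-glob idea; the object name, the threshold, and the RDD-based approach are illustrative assumptions, not the PR's exact implementation:

```scala
package org.apache.spark.sql.execution.datasources

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.util.SerializableConfiguration

// Resolve many glob patterns on executors instead of serially on the driver.
object GlobbedPathsSketch {
  def getGlobbedPaths(
      sc: SparkContext,
      hadoopConf: Configuration,
      patterns: Seq[Path]): Seq[Path] = {
    // Glob a single pattern against its own FileSystem.
    def glob(conf: Configuration, pattern: Path): Seq[Path] = {
      val fs = pattern.getFileSystem(conf)
      Option(fs.globStatus(pattern)).map(_.map(_.getPath).toSeq).getOrElse(Seq.empty)
    }

    val parallelismThreshold = 32 // assumed cut-over point
    if (patterns.length < parallelismThreshold) {
      // Few patterns: keep globbing on the driver, as before.
      patterns.flatMap(glob(hadoopConf, _))
    } else {
      // Many patterns: ship a serializable copy of the conf and glob on executors.
      val serializableConf = new SerializableConfiguration(hadoopConf)
      sc.parallelize(patterns.map(_.toString), parallelismThreshold)
        .flatMap(p => glob(serializableConf.value, new Path(p)).map(_.toString))
        .collect()
        .map(new Path(_))
        .toSeq
    }
  }
}
```

Wrapping the Hadoop `Configuration` in `SerializableConfiguration` is the conf reuse mentioned in the doc comment: the configuration is not serializable on its own, so it has to be wrapped before the globbing closure can run on executors.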


---




[GitHub] spark pull request #17702: [SPARK-20408][SQL] Get the glob path in parallel ...

2017-11-13 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/17702#discussion_r150484788
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala ---
@@ -246,6 +246,18 @@ class SparkHadoopUtil extends Logging {
 if (isGlobPath(pattern)) globPath(fs, pattern) else Seq(pattern)
   }
 
+  def expandGlobPath(fs: FileSystem, pattern: Path): Seq[String] = {
--- End diff --

Added a UT in SparkHadoopUtilSuite.scala.
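
For illustration, a hypothetical helper capturing the expansion idea such a unit test would exercise; this is not the PR's actual `expandGlobPath`, and its contract may differ:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

object ExpandGlobPathSketch {
  // Resolve the glob in the parent path and keep the final segment, so each
  // resulting prefix can then be globbed independently (and in parallel).
  def expandGlobPath(fs: FileSystem, pattern: Path): Seq[String] = {
    val parentMatches: Array[FileStatus] =
      Option(fs.globStatus(pattern.getParent)).getOrElse(Array.empty[FileStatus])
    if (parentMatches.isEmpty) {
      Seq(pattern.toString)
    } else {
      parentMatches.map(s => new Path(s.getPath, pattern.getName).toString).toSeq
    }
  }
}

// Example (local filesystem):
//   val fs = new Path("/tmp").getFileSystem(new Configuration())
//   ExpandGlobPathSketch.expandGlobPath(fs, new Path("/tmp/logs/2017-*/part-*"))
//   // => one "<matched-dir>/part-*" entry per directory matching /tmp/logs/2017-*
```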


---




[GitHub] spark pull request #19773: [SPARK-22546][SQL] Supporting for changing column...

2017-11-24 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/19773#discussion_r152957444
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -318,16 +318,26 @@ case class AlterTableChangeColumnCommand(
   s"'${newColumn.name}' with type '${newColumn.dataType}'")
 }
 
-val newSchema = table.schema.fields.map { field =>
+val typeChanged = originColumn.dataType != newColumn.dataType
+val newDataSchema = table.dataSchema.fields.map { field =>
   if (field.name == originColumn.name) {
-// Create a new column from the origin column with the new comment.
-addComment(field, newColumn.getComment)
+// Add the comment to a column, if comment is empty, return the original column.
+val newField = newColumn.getComment.map(field.withComment(_)).getOrElse(field)
+if (typeChanged) {
+  newField.copy(dataType = newColumn.dataType)
+} else {
+  newField
+}
   } else {
 field
   }
 }
-val newTable = table.copy(schema = StructType(newSchema))
-catalog.alterTable(newTable)
+val newTable = table.copy(schema = StructType(newDataSchema ++ table.partitionSchema))
+if (typeChanged) {
+  catalog.alterTableDataSchema(tableName, StructType(newDataSchema))
--- End diff --

I added the checking logic in the next commit and fixed a bug in changing the comment of a 
partition column.
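
A minimal sketch of the kind of guard being described; the message text and the case-insensitive name comparison are illustrative assumptions, not necessarily what the final patch does:

```scala
package org.apache.spark.sql.execution.command

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.types.StructField

// Reject type changes on partition columns while still allowing
// comment-only changes on them.
object PartitionColumnCheckSketch {
  def checkPartitionColumnChange(
      table: CatalogTable,
      originColumn: StructField,
      newColumn: StructField): Unit = {
    val isPartitionColumn =
      table.partitionColumnNames.exists(_.equalsIgnoreCase(originColumn.name))
    if (isPartitionColumn && originColumn.dataType != newColumn.dataType) {
      throw new AnalysisException(
        s"ALTER TABLE CHANGE COLUMN is not supported for changing the data type " +
          s"of partition column '${originColumn.name}'")
    }
  }
}
```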


---




[GitHub] spark pull request #19773: [SPARK-22546][SQL] Supporting for changing column...

2017-11-23 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/19773#discussion_r152753785
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -318,16 +318,26 @@ case class AlterTableChangeColumnCommand(
   s"'${newColumn.name}' with type '${newColumn.dataType}'")
 }
 
-val newSchema = table.schema.fields.map { field =>
+val typeChanged = originColumn.dataType != newColumn.dataType
+val newDataSchema = table.dataSchema.fields.map { field =>
   if (field.name == originColumn.name) {
-// Create a new column from the origin column with the new comment.
-addComment(field, newColumn.getComment)
+// Add the comment to a column, if comment is empty, return the original column.
+val newField = newColumn.getComment.map(field.withComment(_)).getOrElse(field)
+if (typeChanged) {
+  newField.copy(dataType = newColumn.dataType)
+} else {
+  newField
+}
   } else {
 field
   }
 }
-val newTable = table.copy(schema = StructType(newSchema))
-catalog.alterTable(newTable)
+val newTable = table.copy(schema = StructType(newDataSchema ++ table.partitionSchema))
+if (typeChanged) {
+  catalog.alterTableDataSchema(tableName, StructType(newDataSchema))
--- End diff --

[HIVE-3672](https://issues.apache.org/jira/browse/HIVE-3672): Hive supports this by adding a 
new command, `ALTER TABLE  PARTITION COLUMN ( )`.
So maybe I should throw an AnalysisException when the user changes the type of a partition 
column?


---




[GitHub] spark issue #19773: [SPARK-22546][SQL] Supporting for changing column dataTy...

2017-12-04 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/19773
  
gentle ping @gatorsmile


---



