[GitHub] spark pull request #23208: [SPARK-25530][SQL] data source v2 API refactor (b...

2018-12-09 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23208#discussion_r240101369
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ---
@@ -17,52 +17,49 @@
 
 package org.apache.spark.sql.execution.datasources.v2
 
-import java.util.UUID
-
-import scala.collection.JavaConverters._
+import java.util.{Optional, UUID}
 
 import org.apache.spark.sql.{AnalysisException, SaveMode}
 import org.apache.spark.sql.catalyst.analysis.{MultiInstanceRelation, NamedRelation}
 import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression}
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics}
 import org.apache.spark.sql.catalyst.util.truncatedString
-import org.apache.spark.sql.sources.DataSourceRegister
 import org.apache.spark.sql.sources.v2._
 import org.apache.spark.sql.sources.v2.reader._
 import org.apache.spark.sql.sources.v2.writer.BatchWriteSupport
 import org.apache.spark.sql.types.StructType
 
 /**
- * A logical plan representing a data source v2 scan.
+ * A logical plan representing a data source v2 table.
  *
- * @param source An instance of a [[DataSourceV2]] implementation.
- * @param options The options for this scan. Used to create fresh [[BatchWriteSupport]].
- * @param userSpecifiedSchema The user-specified schema for this scan.
+ * @param table The table that this relation represents.
+ * @param options The options for this table operation. It's used to create fresh [[ScanBuilder]]
+ *                and [[BatchWriteSupport]].
  */
 case class DataSourceV2Relation(
-    // TODO: remove `source` when we finish API refactor for write.
-    source: TableProvider,
-    table: SupportsBatchRead,
+    table: Table,
     output: Seq[AttributeReference],
-    options: Map[String, String],
-    userSpecifiedSchema: Option[StructType] = None)
+    // TODO: use a simple case insensitive map instead.
+    options: DataSourceOptions)
   extends LeafNode with MultiInstanceRelation with NamedRelation {
 
-  import DataSourceV2Relation._
-
   override def name: String = table.name()
 
   override def simpleString: String = {
     s"RelationV2${truncatedString(output, "[", ", ", "]")} $name"
   }
 
-  def newWriteSupport(): BatchWriteSupport = source.createWriteSupport(options, schema)
-
-  def newScanBuilder(): ScanBuilder = {
-    val dsOptions = new DataSourceOptions(options.asJava)
-    table.newScanBuilder(dsOptions)
+  def newWriteSupport(inputSchema: StructType, mode: SaveMode): Optional[BatchWriteSupport] = {
--- End diff --

Nit: add a comment for the method, especially about when it will return `None`, even though this is explained in `SupportsBatchWrite.createBatchWriteSupport`. A possible shape is sketched below.
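
For illustration, the suggested doc comment might look something like this (hypothetical wording; the signature is the one from the diff above):

```scala
/**
 * Creates a [[BatchWriteSupport]] to write data with `inputSchema` under the given [[SaveMode]].
 *
 * Returns an empty `Optional` when the source cannot create a writer for the given mode;
 * the exact contract is documented in `SupportsBatchWrite.createBatchWriteSupport`.
 */
def newWriteSupport(inputSchema: StructType, mode: SaveMode): Optional[BatchWriteSupport]
```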


---




[GitHub] spark issue #23215: [SPARK-26263][SQL] Validate partition values with user p...

2018-12-06 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23215
  
Thank you @cloud-fan @viirya @HyukjinKwon .


---




[GitHub] spark issue #23215: [SPARK-26263][SQL] Validate partition values with user p...

2018-12-06 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23215
  
retest this please.


---




[GitHub] spark pull request #23208: [SPARK-25530][SQL] data source v2 API refactor (b...

2018-12-06 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23208#discussion_r239454594
  
--- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/SupportsBatchWrite.java ---
@@ -25,14 +25,14 @@
 import org.apache.spark.sql.types.StructType;
 
 /**
- * A mix-in interface for {@link DataSourceV2}. Data sources can implement this interface to
+ * A mix-in interface for {@link Table}. Data sources can implement this interface to
  * provide data writing ability for batch processing.
  *
  * This interface is used to create {@link BatchWriteSupport} instances when end users run
  * {@code Dataset.write.format(...).option(...).save()}.
  */
 @Evolving
-public interface BatchWriteSupportProvider extends DataSourceV2 {
+public interface SupportsBatchWrite extends Table {
--- End diff --

Having both `BatchWriteSupport` and `SupportsBatchWrite` is quite confusing to me.


---




[GitHub] spark pull request #23215: [SPARK-26263][SQL] Validate partition values with...

2018-12-06 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23215#discussion_r239423651
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -272,9 +279,13 @@ object PartitioningUtils {
       val literal = if (userSpecifiedDataTypes.contains(columnName)) {
         // SPARK-26188: if user provides corresponding column schema, get the column value without
         //              inference, and then cast it as user specified data type.
-        val columnValue = inferPartitionColumnValue(rawColumnValue, false, timeZone)
-        val castedValue =
-          Cast(columnValue, userSpecifiedDataTypes(columnName), Option(timeZone.getID)).eval()
+        val dataType = userSpecifiedDataTypes(columnName)
+        val columnValueLiteral = inferPartitionColumnValue(rawColumnValue, false, timeZone)
+        val columnValue = columnValueLiteral.eval()
+        val castedValue = Cast(columnValueLiteral, dataType, Option(timeZone.getID)).eval()
+        if (validatePartitionColumns && columnValue != null && castedValue == null) {
+          throw new RuntimeException(s"Failed to cast partition value `$columnValue` to $dataType")
--- End diff --

Good suggestion. Thanks. 


---




[GitHub] spark issue #23215: [SPARK-26263][SQL] Validate partition values with user p...

2018-12-06 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23215
  
retest this please.


---




[GitHub] spark issue #23240: [SPARK-26281][WebUI] Duration column of task table shoul...

2018-12-05 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23240
  
Oh, I see. Closing this one now.
Please change the title in #23160.


---




[GitHub] spark pull request #23240: [SPARK-26281][WebUI] Duration column of task tabl...

2018-12-05 Thread gengliangwang
Github user gengliangwang closed the pull request at:

https://github.com/apache/spark/pull/23240


---




[GitHub] spark issue #23240: [SPARK-26281][WebUI] Duration column of task table shoul...

2018-12-05 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23240
  
@shahidki31 @pgandhi999 @srowen 


---




[GitHub] spark pull request #23240: [SPARK-26281][WebUI] Duration column of task tabl...

2018-12-05 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/23240

[SPARK-26281][WebUI] Duration column of task table should be executor run 
time instead of real duration

## What changes were proposed in this pull request?

In PR https://github.com/apache/spark/pull/23081/, the duration column was changed to executor run time. The behavior is consistent with the summary metrics table and previous Spark versions.

However, after PR https://github.com/apache/spark/pull/21688, the issue can be reproduced again.

## How was this patch tested?

Before the change, we can see:

1. The minimum duration in the aggregation table doesn't match the task table below.
2. The sorting order is wrong.

![image](https://user-images.githubusercontent.com/1097932/49533048-f7eecb80-f8f8-11e8-9256-2eb524e81be0.png)

After the change, the issues are fixed:

![image](https://user-images.githubusercontent.com/1097932/49533069-06d57e00-f8f9-11e8-872b-402e3014f557.png)



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark fixDuration

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23240.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23240


commit 612c4c7242f6289d3a1e424a69951be25cd126af
Author: Gengliang Wang 
Date:   2018-12-05T17:44:55Z

fix




---




[GitHub] spark pull request #23215: [SPARK-26263][SQL] Throw exception when Partition...

2018-12-04 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/23215

[SPARK-26263][SQL] Throw exception when Partition column value can't be 
converted to user specified type

## What changes were proposed in this pull request?

Currently, if the user provides a data schema, partition column values are converted according to it. But if the conversion fails, e.g. converting a string to an int, the column value silently becomes null. We should throw an exception in such cases.

For the following directory:
```
/tmp/testDir
├── p=bar
└── p=foo
```
If we run:
```
val schema = StructType(Seq(StructField("p", IntegerType, false)))
spark.read.schema(schema).csv("/tmp/testDir/").show()
```
We will get:
```
+----+
|   p|
+----+
|null|
|null|
+----+
```
This PR proposes to throw an exception in such cases, instead of silently converting to null (see the sketch after this list):
1. These null partition column values don't make sense to users in most cases. It is better to surface the conversion failure, so that users can adjust the schema or their ETL jobs to fix it.
2. Such conversion failures already throw exceptions for non-partition data columns. Partition columns should have the same behavior.
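
A runnable illustration of the failure mode and the proposed guard; the guard mirrors the `PartitioningUtils` change under review, and `validatePartitionColumns` is the flag introduced there:

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.IntegerType

// Catalyst's Cast silently yields null when the string can't be parsed as an int...
val columnValue = "foo"
val castedValue = Cast(Literal(columnValue), IntegerType).eval()  // null

// ...so the proposed guard fails fast instead of producing a null partition value:
val validatePartitionColumns = true
if (validatePartitionColumns && columnValue != null && castedValue == null) {
  throw new RuntimeException(s"Failed to cast partition value `$columnValue` to IntegerType")
}
```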
## How was this patch tested?

Unit test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark SPARK-26263

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23215.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23215


commit 7060e127de339de42be12ed382ef0a4363ae325d
Author: Gengliang Wang 
Date:   2018-12-04T09:43:03Z

SPARK-26263: Throw exception when partition value can't be converted to 
specific type




---




[GitHub] spark issue #23215: [SPARK-26263][SQL] Throw exception when Partition column...

2018-12-04 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23215
  
@cloud-fan 


---




[GitHub] spark issue #23189: [SPARK-26235][Core] Change log level for ClassNotFoundEx...

2018-12-01 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23189
  
retest this please.


---




[GitHub] spark pull request #23189: [SPARK-26235][Core] Change log level for ClassNot...

2018-11-30 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23189#discussion_r237972852
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -813,14 +813,14 @@ private[spark] class SparkSubmit extends Logging {
   mainClass = Utils.classForName(childMainClass)
 } catch {
   case e: ClassNotFoundException =>
-logWarning(s"Failed to load $childMainClass.", e)
+logError(s"Failed to load class $childMainClass.")
--- End diff --

Here I tried appending `e.getMessage` to the output, but it turns out the error message is just the class name `childMainClass`, e.g.
```
Error: Failed to load class foo: foo.
```
So I think we can omit the exception message from the output.
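
This is easy to confirm on a plain JVM, no Spark required: `Class.forName` raises a `ClassNotFoundException` whose message is just the missing class name.

```scala
try {
  Class.forName("foo")
} catch {
  case e: ClassNotFoundException =>
    // The message is simply the missing class name, so appending it to
    // "Failed to load class foo" would just print "foo" twice.
    println(e.getMessage)  // prints: foo
}
```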


---




[GitHub] spark issue #23189: [SPARK-26235][Core] Change log level for ClassNotFoundEx...

2018-11-30 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23189
  
@vanzin Ah, I see. Thanks for pointing it out!

But I am now thinking of overriding `logError` to call `printMessage("Error...")`. What do you think?
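
A sketch of that idea, assuming the `msg: => String` signature of the `Logging` trait and the existing `printMessage` helper in `SparkSubmit`:

```scala
// Route ERROR-level messages to printMessage (i.e. System.err) so they show up
// regardless of the log4j root category, like logInfo/logWarning already do here.
override protected def logError(msg: => String): Unit = printMessage(s"Error: $msg")
```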


---




[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...

2018-11-30 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23186#discussion_r237963856
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -345,15 +346,18 @@ object PartitioningUtils {
    */
   def resolvePartitions(
       pathsWithPartitionValues: Seq[(Path, PartitionValues)],
+      caseSensitive: Boolean,
       timeZone: TimeZone): Seq[PartitionValues] = {
     if (pathsWithPartitionValues.isEmpty) {
       Seq.empty
     } else {
-      // TODO: Selective case sensitivity.
-      val distinctPartColNames =
-        pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())).distinct
+      val distinctPartColNames = if (caseSensitive) {
+        pathsWithPartitionValues.map(_._2.columnNames)
+      } else {
+        pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase()))
+      }
       assert(
-        distinctPartColNames.size == 1,
+        distinctPartColNames.distinct.size == 1,
         listConflictingPartitionColumns(pathsWithPartitionValues))
--- End diff --

The method `listConflictingPartitionColumns` also shows the suspicious paths.
In the case-sensitive mode, the method works fine.
In the case-insensitive mode, it will list all column names without any transformation, e.g.
```
Partition column name list #0: a
Partition column name list #1: A
Partition column name list #2: B
```
I can fix `listConflictingPartitionColumns`, but it seems a bit trivial: we would just have to display the original column names instead of transforming them all to lower case.


---




[GitHub] spark issue #23189: [SPARK-26235][Core] Change log level for ClassNotFoundEx...

2018-11-30 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23189
  
@vanzin The `logWarning` call in the other handler below is also not overridden:
```
  case e: NoClassDefFoundError =>
    logWarning(s"Failed to load $childMainClass: ${e.getMessage()}")
```

My basic point is that the exception/error message should be in the `ERROR` category, even though Spark previously printed it as a warning to `System.err`.



---




[GitHub] spark issue #23189: [SPARK-26235][Core] Change log level for ClassNotFoundEx...

2018-11-30 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23189
  
cc @vanzin 

This is really trivial, but can be helpful to users in certain cases.



---




[GitHub] spark pull request #23189: [SPARK-26235][Core] Change log level for ClassNot...

2018-11-30 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/23189

[SPARK-26235][Core] Change log level for 
ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error

## What changes were proposed in this pull request?

In my local setup, I set the log4j root category to ERROR (https://stackoverflow.com/questions/27781187/how-to-stop-info-messages-displaying-on-spark-console , the first item that shows up if we google "set spark log level"). When I run a command such as
```
spark-submit --class foo bar.jar
```
nothing shows up, and the script exits.

After a quick investigation, I think the log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit should be ERROR instead of WARN, since the whole process exits because of the exception/error.

Before https://github.com/apache/spark/pull/20925, this message was not controlled by `log4j.rootCategory`.

## How was this patch tested?

Manual check.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark changeLogLevel

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23189.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23189


commit 158d421757989740345be4e52fbc333b08356b5b
Author: Gengliang Wang 
Date:   2018-11-30T14:23:44Z

change log level




---




[GitHub] spark issue #23160: [SPARK-26196]Total tasks title in the stage page is inco...

2018-11-30 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23160
  
cc @tgravescs @pgandhi999


---




[GitHub] spark issue #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, validate...

2018-11-30 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23186
  
@cloud-fan 


---




[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...

2018-11-30 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/23186

[SPARK-26230][SQL]FileIndex: if case sensitive, validate partitions with 
original column names

## What changes were proposed in this pull request?

Partition column names are required to be unique under the same directory. The following paths form an invalid partitioned directory:
```
hdfs://host:9000/path/a=1
hdfs://host:9000/path/b=2
```

In the case-sensitive mode, the following paths should be invalid too:
```
hdfs://host:9000/path/a=1
hdfs://host:9000/path/A=2
```
Since columns 'a' and 'A' are different, it is wrong to use either one as the column name in the partition schema.

Also, there is a `TODO` comment about this in the code.

This PR resolves the problem; a minimal sketch of the new validation follows below.
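
A minimal, self-contained sketch of the validation; `checkPartitionColumns` is a hypothetical helper that mirrors what the patched `resolvePartitions` does with the per-path column name lists:

```scala
// Each path contributes its list of partition column names; after optional
// lower-casing, all paths must agree on exactly one list.
def checkPartitionColumns(columnNamesPerPath: Seq[Seq[String]], caseSensitive: Boolean): Unit = {
  val normalized =
    if (caseSensitive) columnNamesPerPath                    // keep original names
    else columnNamesPerPath.map(_.map(_.toLowerCase))        // fold case before comparing
  assert(normalized.distinct.size == 1, s"Conflicting partition column names: $columnNamesPerPath")
}

checkPartitionColumns(Seq(Seq("a"), Seq("a")), caseSensitive = true)   // ok
checkPartitionColumns(Seq(Seq("a"), Seq("A")), caseSensitive = false)  // ok: same column
checkPartitionColumns(Seq(Seq("a"), Seq("A")), caseSensitive = true)   // fails, as intended
```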

## How was this patch tested?

Add unit test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark SPARK-26230

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23186.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23186


commit 6d052130051a21b9aa7c3ffce56a556bee129a5e
Author: Gengliang Wang 
Date:   2018-11-30T09:23:32Z

 if case sensitive, validate partitions with original column names




---




[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...

2018-11-29 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23068
  
@gatorsmile thanks! For jobs that don't have an associated SQL query, the text `Associated SQL query` won't show up.


---




[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

2018-11-29 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23165#discussion_r237523177
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -250,7 +276,13 @@ object PartitioningUtils {
       val rawColumnValue = columnSpec.drop(equalSignIndex + 1)
       assert(rawColumnValue.nonEmpty, s"Empty partition column value in '$columnSpec'")
 
-      val literal = inferPartitionColumnValue(rawColumnValue, typeInference, timeZone)
+      val literal = if (userSpecifiedDataTypes.contains(columnName)) {
+        // SPARK-26188: if user provides corresponding column schema, process the column as String
+        //              type and cast it as user specified data type later.
+        inferPartitionColumnValue(rawColumnValue, false, timeZone)
--- End diff --

Thanks. I will use the Cast one.


---




[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

2018-11-29 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23165#discussion_r237516228
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -250,7 +276,13 @@ object PartitioningUtils {
       val rawColumnValue = columnSpec.drop(equalSignIndex + 1)
       assert(rawColumnValue.nonEmpty, s"Empty partition column value in '$columnSpec'")
 
-      val literal = inferPartitionColumnValue(rawColumnValue, typeInference, timeZone)
+      val literal = if (userSpecifiedDataTypes.contains(columnName)) {
+        // SPARK-26188: if user provides corresponding column schema, process the column as String
+        //              type and cast it as user specified data type later.
+        inferPartitionColumnValue(rawColumnValue, false, timeZone)
--- End diff --

See my reasons above.


---




[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

2018-11-29 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23165#discussion_r237515959
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -250,7 +276,13 @@ object PartitioningUtils {
       val rawColumnValue = columnSpec.drop(equalSignIndex + 1)
       assert(rawColumnValue.nonEmpty, s"Empty partition column value in '$columnSpec'")
 
-      val literal = inferPartitionColumnValue(rawColumnValue, typeInference, timeZone)
+      val literal = if (userSpecifiedDataTypes.contains(columnName)) {
+        // SPARK-26188: if user provides corresponding column schema, process the column as String
+        //              type and cast it as user specified data type later.
--- End diff --

1. The function returns `Option[(String, Literal)]`.
2. The function `inferPartitionColumnValue` is quite complex; I don't want to change it or duplicate its logic.


---




[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

2018-11-29 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23165#discussion_r237513657
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -94,18 +94,34 @@ object PartitioningUtils {
       paths: Seq[Path],
       typeInference: Boolean,
       basePaths: Set[Path],
+      userSpecifiedSchema: Option[StructType],
+      caseSensitive: Boolean,
       timeZoneId: String): PartitionSpec = {
-    parsePartitions(paths, typeInference, basePaths, DateTimeUtils.getTimeZone(timeZoneId))
+    parsePartitions(paths, typeInference, basePaths, userSpecifiedSchema,
+      caseSensitive, DateTimeUtils.getTimeZone(timeZoneId))
   }
 
   private[datasources] def parsePartitions(
       paths: Seq[Path],
       typeInference: Boolean,
       basePaths: Set[Path],
+      userSpecifiedSchema: Option[StructType],
+      caseSensitive: Boolean,
       timeZone: TimeZone): PartitionSpec = {
+    val userSpecifiedDataTypes = if (userSpecifiedSchema.isDefined) {
--- End diff --

Personally I prefer to keep the parameters simple and easy to understand, so that the caller's logic (outside `PartitioningUtils`) looks cleaner.


---




[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

2018-11-29 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23165#discussion_r237512471
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -94,18 +94,34 @@ object PartitioningUtils {
       paths: Seq[Path],
       typeInference: Boolean,
       basePaths: Set[Path],
+      userSpecifiedSchema: Option[StructType],
+      caseSensitive: Boolean,
       timeZoneId: String): PartitionSpec = {
-    parsePartitions(paths, typeInference, basePaths, DateTimeUtils.getTimeZone(timeZoneId))
+    parsePartitions(paths, typeInference, basePaths, userSpecifiedSchema,
+      caseSensitive, DateTimeUtils.getTimeZone(timeZoneId))
   }
 
   private[datasources] def parsePartitions(
       paths: Seq[Path],
       typeInference: Boolean,
       basePaths: Set[Path],
+      userSpecifiedSchema: Option[StructType],
+      caseSensitive: Boolean,
       timeZone: TimeZone): PartitionSpec = {
+    val userSpecifiedDataTypes = if (userSpecifiedSchema.isDefined) {
+      val nameToDataType = userSpecifiedSchema.get.fields.map(f => f.name -> f.dataType).toMap
+      if (caseSensitive) {
+        CaseInsensitiveMap(nameToDataType)
--- End diff --

Yes, thanks for pointing it out :)


---




[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

2018-11-29 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23165
  
@mgaido91 The user-specified schema might not match the full data schema. For the missing columns, we still need to infer their data types.
I will come up with a solution soon.


---




[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...

2018-11-28 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23165
  
@cloud-fan @mgaido91 


---




[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...

2018-11-28 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/23165

[SPARK-26188][SQL] FileIndex: don't infer data types of partition columns 
if user specifies schema

## What changes were proposed in this pull request?

This PR fixes a regression introduced in https://github.com/apache/spark/pull/21004/files#r236998030.

If the user specifies a schema, Spark doesn't need to infer the data types of partition columns; otherwise, the inferred type might not match the one the user provided.
E.g. for the partition directory `p=4d`, after data type inference the column value will be `4.0` (see the one-liner below).
See https://issues.apache.org/jira/browse/SPARK-26188 for more details.
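
The mechanism behind the wrong value is that partition value inference tries, among other things, a numeric parse of the raw string, and `"4d"` happens to be a valid double literal:

```scala
// "4d" parses as a double (the 'd' suffix is accepted), so with inference enabled
// the directory p=4d yields the value 4.0 instead of the string "4d".
println(java.lang.Double.parseDouble("4d"))  // prints 4.0
```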

## How was this patch tested?

Add unit test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark fixFileIndex

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23165.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23165


commit 2866a9e1c1a7d42e6cf53474733c6f39e812c680
Author: Gengliang Wang 
Date:   2018-11-28T16:11:22Z

fix




---




[GitHub] spark pull request #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIn...

2018-11-28 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21004#discussion_r237092452
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala ---
@@ -126,35 +126,32 @@ abstract class PartitioningAwareFileIndex(
     val caseInsensitiveOptions = CaseInsensitiveMap(parameters)
     val timeZoneId = caseInsensitiveOptions.get(DateTimeUtils.TIMEZONE_OPTION)
       .getOrElse(sparkSession.sessionState.conf.sessionLocalTimeZone)
-
-    userPartitionSchema match {
+    val inferredPartitionSpec = PartitioningUtils.parsePartitions(
+      leafDirs,
+      typeInference = sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
--- End diff --

@mgaido91 Thanks for the investigation!
I will fix it and add a test case.


---




[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...

2018-11-27 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23068
  
ping @vanzin. Please review this one :)


---




[GitHub] spark pull request #23125: [SPARK-26156][WebUI] Revise summary section of st...

2018-11-23 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23125#discussion_r235995724
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala ---
@@ -79,6 +79,9 @@ private[ui] class StagePage(parent: StagesTab, store: AppStatusStore) extends We
     localityNamesAndCounts.sorted.mkString("; ")
   }
 
+  private def jobURL(request: HttpServletRequest, jobId: Int): String =
--- End diff --

Make sense. I have updated the code.


---




[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...

2018-11-23 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23125
  
@srowen @vanzin 


---




[GitHub] spark pull request #23125: [SPARK-26156][WebUI] Revise summary section of st...

2018-11-23 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/23125

[SPARK-26156][WebUI] Revise summary section of stage page

## What changes were proposed in this pull request?
In the summary section of the stage page:

![image](https://user-images.githubusercontent.com/1097932/48935518-ebef2b00-ef42-11e8-8672-eaa4cac92c5e.png)

1. The following metric names can be revised:
Output => Output Size / Records
Shuffle Read => Shuffle Read Size / Records
Shuffle Write => Shuffle Write Size / Records

After the changes, the names are clearer and consistent with the other names on the same page.

2. The associated job ID URL should not contain the 3 trailing spaces. Reduce the number of spaces to one, and exclude the space from the link. This is consistent with the SQL execution page.

## How was this patch tested?

Manual check:

![image](https://user-images.githubusercontent.com/1097932/48935538-f7425680-ef42-11e8-8b2a-a4f388d3ea52.png)



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark reviseStagePage

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23125.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23125


commit 215d3d7b8e187716d68a67a9b16d9eb444a8099f
Author: Gengliang Wang 
Date:   2018-11-23T09:04:38Z

revise stage page




---




[GitHub] spark pull request #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support fo...

2018-11-21 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23098#discussion_r235317414
  
--- Diff: bin/load-spark-env.cmd ---
@@ -21,37 +21,42 @@ rem This script loads spark-env.cmd if it exists, and ensures it is only loaded
 rem spark-env.cmd is loaded from SPARK_CONF_DIR if set, or within the current directory's
 rem conf\ subdirectory.
 
+set SPARK_ENV_CMD="spark-env.cmd"
 if [%SPARK_ENV_LOADED%] == [] (
   set SPARK_ENV_LOADED=1
 
   if [%SPARK_CONF_DIR%] == [] (
     set SPARK_CONF_DIR=%~dp0..\conf
   )
 
-  call :LoadSparkEnv
+  set SPARK_ENV_CMD="%SPARK_CONF_DIR%/%SPARK_ENV_CMD%"
+  if exist "%SPARK_ENV_CMD%" (
+    call "%SPARK_ENV_CMD%"
+  )
 )
 
 rem Setting SPARK_SCALA_VERSION if not already set.
 
-set ASSEMBLY_DIR2="%SPARK_HOME%\assembly\target\scala-2.11"
-set ASSEMBLY_DIR1="%SPARK_HOME%\assembly\target\scala-2.12"
-
-if [%SPARK_SCALA_VERSION%] == [] (
-
-  if exist %ASSEMBLY_DIR2% if exist %ASSEMBLY_DIR1% (
-    echo "Presence of build for multiple Scala versions detected."
-    echo "Either clean one of them or, set SPARK_SCALA_VERSION in spark-env.cmd."
-    exit 1
-  )
-  if exist %ASSEMBLY_DIR2% (
-    set SPARK_SCALA_VERSION=2.11
-  ) else (
-    set SPARK_SCALA_VERSION=2.12
-  )
-)
+rem TODO: revisit for Scala 2.13 support
+set SPARK_SCALA_VERSION=2.12
--- End diff --

I am not familiar with .cmd scripts. Should we keep the quotes here, i.e. `"2.12"`?


---




[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...

2018-11-21 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23068
  
The web UI presents the "Jobs" tab as the default view. Showing the related SQL context is helpful, especially for new users.
I think the downside is quite small.


---




[GitHub] spark pull request #23081: [SPARK-26109][WebUI]Duration in the task summary ...

2018-11-20 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23081#discussion_r235071529
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala ---
@@ -996,7 +996,7 @@ private[ui] object ApiHelper {
 HEADER_EXECUTOR -> TaskIndexNames.EXECUTOR,
 HEADER_HOST -> TaskIndexNames.HOST,
 HEADER_LAUNCH_TIME -> TaskIndexNames.LAUNCH_TIME,
-HEADER_DURATION -> TaskIndexNames.DURATION,
+HEADER_DURATION -> TaskIndexNames.EXEC_RUN_TIME,
--- End diff --

Not a big deal, but there are two spaces after `SPARK-26109:`.


---




[GitHub] spark pull request #23081: [SPARK-26109][WebUI]Duration in the task summary ...

2018-11-20 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23081#discussion_r235065240
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala ---
@@ -996,7 +996,7 @@ private[ui] object ApiHelper {
 HEADER_EXECUTOR -> TaskIndexNames.EXECUTOR,
 HEADER_HOST -> TaskIndexNames.HOST,
 HEADER_LAUNCH_TIME -> TaskIndexNames.LAUNCH_TIME,
-HEADER_DURATION -> TaskIndexNames.DURATION,
+HEADER_DURATION -> TaskIndexNames.EXEC_RUN_TIME,
--- End diff --

Nit: add a comment to explain why it should be executor run time here.


---




[GitHub] spark issue #23081: [SPARK-26109][WebUI]Duration in the task summary metrics...

2018-11-20 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23081
  
retest this please.


---




[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...

2018-11-20 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23068
  
@srowen 


---




[GitHub] spark issue #23067: [SPARK-26097][Web UI] Add the new partitioning descripti...

2018-11-20 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23067
  
+1. 
We can also try overriding `toString` for better output.

@vanzin 


---




[GitHub] spark issue #23088: [SPARK-26119][CORE][WEBUI]Task summary table should cont...

2018-11-20 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23088
  
As the task section shows all the tasks, I think it might be better to show aggregated metrics for all the tasks.


---




[GitHub] spark pull request #23088: [SPARK-26119][CORE][WEBUI]Task summary table shou...

2018-11-20 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23088#discussion_r235027227
  
--- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala ---
@@ -238,8 +239,16 @@ private[spark] class AppStatusStore(
 val diff = idx - currentIdx
--- End diff --

`diff` could be negative here?


---




[GitHub] spark pull request #23038: [SPARK-25451][SPARK-26100][CORE]Aggregated metric...

2018-11-19 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23038#discussion_r234728748
  
--- Diff: core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala ---
@@ -1275,6 +1275,49 @@ class AppStatusListenerSuite extends SparkFunSuite with BeforeAndAfter {
     assert(allJobs.head.numFailedStages == 1)
   }
 
+  test("SPARK-25451: total tasks in the executor summary should match total stage tasks") {
+    val testConf = conf.clone.set(LIVE_ENTITY_UPDATE_PERIOD, Long.MaxValue)
+
+    val listener = new AppStatusListener(store, testConf, true)
+
+    val stage = new StageInfo(1, 0, "stage", 4, Nil, Nil, "details")
+    listener.onJobStart(SparkListenerJobStart(1, time, Seq(stage), null))
+    listener.onStageSubmitted(SparkListenerStageSubmitted(stage, new Properties()))
+
+    val tasks = createTasks(4, Array("1", "2"))
+    tasks.foreach { task =>
+      listener.onTaskStart(SparkListenerTaskStart(stage.stageId, stage.attemptNumber, task))
+    }
+
+    time += 1
+    tasks(0).markFinished(TaskState.FAILED, time)
+    listener.onTaskEnd(SparkListenerTaskEnd(stage.stageId, stage.attemptId, "taskType",
+      ExecutorLostFailure("1", true, Some("Lost executor")), tasks(0), null))
+    time += 1
+    tasks(1).markFinished(TaskState.FAILED, time)
+    listener.onTaskEnd(SparkListenerTaskEnd(stage.stageId, stage.attemptId, "taskType",
+      ExecutorLostFailure("2", true, Some("Lost executor")), tasks(1), null))
+
+    stage.failureReason = Some("Failed")
+    listener.onStageCompleted(SparkListenerStageCompleted(stage))
+    time += 1
+    listener.onJobEnd(SparkListenerJobEnd(1, time, JobFailed(new RuntimeException("Bad Executor"))))
+
+    time += 1
+    tasks(2).markFinished(TaskState.FAILED, time)
+    listener.onTaskEnd(SparkListenerTaskEnd(stage.stageId, stage.attemptId, "taskType",
+      ExecutorLostFailure("1", true, Some("Lost executor")), tasks(2), null))
+    time += 1
+    tasks(3).markFinished(TaskState.FAILED, time)
+    listener.onTaskEnd(SparkListenerTaskEnd(stage.stageId, stage.attemptId, "taskType",
+      ExecutorLostFailure("2", true, Some("Lost executor")), tasks(3), null))
+
+    val esummary = store.view(classOf[ExecutorStageSummaryWrapper]).asScala.map(_.info)
+    esummary.foreach { execSummary =>
+      assert(execSummary.failedTasks == 2)
--- End diff --

Nit: also check `succeededTasks` and `killedTasks`


---




[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...

2018-11-19 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23065
  
I tried it and it works.
There is a similar warning in UnionRDD.scala, which would cause a failure in Scala 2.11.


---




[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...

2018-11-19 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23065
  
Hi @srowen , 
Could you review and merge https://github.com/srowen/spark/pull/4 ?
I see a lot of warnings as well. We should fix them.


---




[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...

2018-11-17 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23068
  
@vanzin 


---




[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...

2018-11-17 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23068
  
retest this please.


---




[GitHub] spark pull request #23068: [SPARK-26098][WebUI] Show associated SQL query in...

2018-11-17 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23068#discussion_r234402149
  
--- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala ---
@@ -56,6 +56,11 @@ private[spark] class AppStatusStore(
 store.read(classOf[JobDataWrapper], jobId).info
   }
 
+  def jobWithAssociatedSql(jobId: Int): (v1.JobData, Option[Long]) = {
--- End diff --

Return the `JobData` and the `sqlExecutionId` within one lookup.
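
A sketch of the shape of that lookup, assuming `JobDataWrapper` carries the `sqlExecutionId` field added in this PR:

```scala
def jobWithAssociatedSql(jobId: Int): (v1.JobData, Option[Long]) = {
  val data = store.read(classOf[JobDataWrapper], jobId)
  // A single KVStore read returns both the job data and its SQL execution id (if any).
  (data.info, data.sqlExecutionId)
}
```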


---




[GitHub] spark pull request #23068: [SPARK-26098][WebUI] Show associated SQL query in...

2018-11-17 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23068#discussion_r234402077
  
--- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala ---
@@ -70,6 +70,8 @@ private[spark] class AppStatusListener(
   private val liveTasks = new HashMap[Long, LiveTask]()
   private val liveRDDs = new HashMap[Int, LiveRDD]()
   private val pools = new HashMap[String, SchedulerPool]()
+
+  private val SQL_EXECUTION_ID_KEY = "spark.sql.execution.id"
--- End diff --

Here we can't access `SQLExecution.EXECUTION_ID_KEY`, so we use a local constant for the key string.
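
A sketch of how the key can be consumed in the listener, assuming the id arrives as a job-local property on `SparkListenerJobStart` (it is set by `SQLExecution.withNewExecutionId` on the SQL side):

```scala
// Hypothetical helper: extract the SQL execution id attached to a job, if any.
private def sqlExecutionId(event: SparkListenerJobStart): Option[Long] =
  Option(event.properties)
    .flatMap(p => Option(p.getProperty(SQL_EXECUTION_ID_KEY)))
    .map(_.toLong)
```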


---




[GitHub] spark pull request #23068: [SPARK-26098][WebUI] Show associated SQL query in...

2018-11-17 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/23068

[SPARK-26098][WebUI] Show associated SQL query in Job page

## What changes were proposed in this pull request?

For jobs associated with SQL queries, showing the SQL query in the job detail page makes it easier to understand the context.
Before code change:


![image](https://user-images.githubusercontent.com/1097932/48659359-96baa180-ea8a-11e8-8419-a0a87c3f30fc.png)

After code change:

![image](https://user-images.githubusercontent.com/1097932/48659390-26f8e680-ea8b-11e8-8fdd-3b58909ea364.png)


We can see the whole context in the associated SQL detail page.

For jobs that don't have an associated SQL query, the text won't be shown.

## How was this patch tested?

Manual test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark addSQLID

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23068.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23068


commit e7c2ebbda949918034cb9cb92ac6ef30af17d943
Author: Gengliang Wang 
Date:   2018-11-17T05:42:26Z

add sql id




---




[GitHub] spark issue #23049: [SPARK-26076][Build][Minor] Revise ambiguous error messa...

2018-11-17 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23049
  
retest this please.


---




[GitHub] spark issue #23049: [SPARK-26076][Build][Minor] Revise ambiguous error messa...

2018-11-16 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23049
  
@vanzin I see your point. I will add a link to https://spark.apache.org/docs/latest/configuration.html. Thanks for the suggestion.
In my case, I didn't know where to find or edit `spark-env.sh` at that time. I tried running `find . -name spark-env.sh` and got nothing.
The script prints `spark-env.sh` without its location only if SPARK_ENV_LOADED is not set.


---




[GitHub] spark issue #23049: [SPARK-26076][Build][Minor] Revise ambiguous error messa...

2018-11-16 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23049
  
retest this please.


---




[GitHub] spark pull request #23047: [BACKPORT][SPARK-25883][SQL][MINOR] Override meth...

2018-11-15 Thread gengliangwang
Github user gengliangwang closed the pull request at:

https://github.com/apache/spark/pull/23047


---




[GitHub] spark issue #23049: [SPARK-26076][Build][Minor] Revise ambiguous error messa...

2018-11-15 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23049
  
Hi @vanzin ,
thanks for pointing it out! I have updated the script and PR description.


---




[GitHub] spark pull request #23049: [SPARK-26076][Build][Minor] Revise ambiguous erro...

2018-11-15 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23049#discussion_r233902416
  
--- Diff: bin/load-spark-env.sh ---
@@ -47,8 +47,8 @@ if [ -z "$SPARK_SCALA_VERSION" ]; then
   ASSEMBLY_DIR1="${SPARK_HOME}/assembly/target/scala-2.12"
 
   if [[ -d "$ASSEMBLY_DIR2" && -d "$ASSEMBLY_DIR1" ]]; then
-    echo -e "Presence of build for multiple Scala versions detected." 1>&2
-    echo -e 'Either clean one of them or, export SPARK_SCALA_VERSION in spark-env.sh.' 1>&2
+    echo -e "Presence of build for both scala versions(SCALA 2.11 and SCALA 2.12) detected." 1>&2
--- End diff --

Makes sense. Let me do a quick update.


---




[GitHub] spark issue #23049: [SPARK-26076][Build] Revise ambiguous error message from...

2018-11-15 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/23049
  
@srowen 


---




[GitHub] spark pull request #23049: [SPARK-26076][Build] Revise ambiguous error messa...

2018-11-15 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/23049

[SPARK-26076][Build] Revise ambiguous error message from load-spark-env.sh

## What changes were proposed in this pull request?

When I try to run a script (e.g. `./sbin/start-history-server.sh -h`) on the latest master, I get this error:
```
Presence of build for multiple Scala versions detected.

Either clean one of them or, export SPARK_SCALA_VERSION in spark-env.sh.
```

The error message is quite confusing, and there is no `spark-env.sh` in our code base.
Now, with https://github.com/apache/spark/pull/22967, we can revise the error message as follows:

```
Presence of build for both scala versions(SCALA 2.11 and SCALA 2.12) detected.
Either clean one of them or, export SPARK_SCALA_VERSION=2.12 in load-spark-env.sh.
```

## How was this patch tested?

Manual test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark reviseEnvScript

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23049.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23049


commit bf94264a3e037e00a7b7111a677467de980071c9
Author: Gengliang Wang 
Date:   2018-11-15T15:43:44Z

fix




---




[GitHub] spark pull request #23047: [SPARK-25883][SQL][MINOR] Override method `pretty...

2018-11-15 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/23047

[SPARK-25883][SQL][MINOR] Override method `prettyName` in 
`from_avro`/`to_avro`

## What changes were proposed in this pull request?

Previously in from_avro/to_avro, we overrode the methods `simpleString` and `sql` for the string output. However, the override only affects the alias naming:
```
Project [from_avro('col, 
...
, (mode,PERMISSIVE)) AS from_avro(col, struct, Map(mode -> PERMISSIVE))#11]
```
It only makes the alias name quite long: `from_avro(col, struct, Map(mode -> PERMISSIVE))`.

We should follow `from_csv`/`from_json` here and override only the method `prettyName`, which gives a clean alias name:

```
... AS from_avro(col)#11
```

## How was this patch tested?

Manual check

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark backport_avro_pretty_name

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23047.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23047


commit 3323fdb87120b8783a17728dee26dcb738451231
Author: Gengliang Wang 
Date:   2018-10-31T06:59:37Z

[SPARK-25883][SQL][MINOR] Override method `prettyName` in 
`from_avro`/`to_avro`

Previously in from_avro/to_avro, we overrode the methods `simpleString` and `sql` for the string output. However, the override only affects the alias naming:
```
Project [from_avro('col,
...
, (mode,PERMISSIVE)) AS from_avro(col, struct, Map(mode -> PERMISSIVE))#11]
```
It only makes the alias name quite long: `from_avro(col, struct, Map(mode -> PERMISSIVE))`.

We should follow `from_csv`/`from_json` here and override only the method `prettyName`, which gives a clean alias name

```
... AS from_avro(col)#11
```

Manual check

Closes #22890 from gengliangwang/revise_from_to_avro.

Authored-by: Gengliang Wang 
Signed-off-by: gatorsmile 




---




[GitHub] spark issue #22890: [SPARK-25883][SQL][Minor] Override method `prettyName` i...

2018-11-15 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22890
  
This PR turned out to be a bug fix for this issue:
https://issues.apache.org/jira/browse/SPARK-26063

Backporting this to branch-2.4.



---




[GitHub] spark issue #21688: [SPARK-21809] : Change Stage Page to use datatables to s...

2018-11-14 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/21688
  
Hi @pgandhi999 ,
Thanks for the work. One minor comment here:
Currently the table header looks like this:

![image](https://user-images.githubusercontent.com/1097932/48502853-841e4d80-e87b-11e8-8d14-291869d25719.png)

The arrow PNG file is on the far right side, which can be confusing.

In other tables, we use U+25BE and U+25B4, right next to the text, which is more straightforward:

![image](https://user-images.githubusercontent.com/1097932/48503125-3fdf7d00-e87c-11e8-9d7f-d30d1411df8e.png)

Can you adjust the position of the PNG files, or use the Unicode characters instead, to keep consistency?


---




[GitHub] spark issue #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark

2018-11-13 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22966
  
@dongjoon-hyun sure.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark

2018-11-13 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22966#discussion_r233097082
  
--- Diff: 
external/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroReadBenchmark.scala
 ---
@@ -0,0 +1,226 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.Random
+
+import org.apache.spark.SparkConf
+import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.catalyst.plans.SQLHelper
+import org.apache.spark.sql.types._
+
+/**
+ * Benchmark to measure Avro read performance.
+ * {{{
+ *   To run this benchmark:
+ *   1. without sbt: bin/spark-submit --class <this class>
+ *        --jars , 
+ *   2. build/sbt "avro/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "avro/test:runMain <this class>"
+ *      Results will be written to "benchmarks/AvroReadBenchmark-results.txt".
+ * }}}
+ */
+object AvroReadBenchmark extends BenchmarkBase with SQLHelper {
--- End diff --

I see. I have updated this one.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23002: [SPARK-26003] Improve SQLAppStatusListener.aggreg...

2018-11-12 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23002#discussion_r232713853
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala
 ---
@@ -159,7 +159,7 @@ class SQLAppStatusListener(
   }
 
   private def aggregateMetrics(exec: LiveExecutionData): Map[Long, String] 
= {
-val metricIds = exec.metrics.map(_.accumulatorId).sorted
+val metricIds = exec.metrics.map(_.accumulatorId).toSet
 val metricTypes = exec.metrics.map { m => (m.accumulatorId, 
m.metricType) }.toMap
 val metrics = exec.stages.toSeq
   .flatMap { stageId => Option(stageMetrics.get(stageId)) }
--- End diff --

I am also fine with the current code here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23002: [SPARK-26003] Improve SQLAppStatusListener.aggreg...

2018-11-12 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23002#discussion_r232713761
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala
 ---
@@ -159,7 +159,7 @@ class SQLAppStatusListener(
   }
 
   private def aggregateMetrics(exec: LiveExecutionData): Map[Long, String] 
= {
-val metricIds = exec.metrics.map(_.accumulatorId).sorted
+val metricIds = exec.metrics.map(_.accumulatorId).toSet
 val metricTypes = exec.metrics.map { m => (m.accumulatorId, 
m.metricType) }.toMap
 val metrics = exec.stages.toSeq
   .flatMap { stageId => Option(stageMetrics.get(stageId)) }
--- End diff --

If the metrics collection is large, using a while loop can reduce the number of 
traversals. And it is not complicated to do that in the code here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23002: [SPARK-26003] Improve SQLAppStatusListener.aggreg...

2018-11-12 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23002#discussion_r232706992
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala
 ---
@@ -159,7 +159,7 @@ class SQLAppStatusListener(
   }
 
   private def aggregateMetrics(exec: LiveExecutionData): Map[Long, String] 
= {
-val metricIds = exec.metrics.map(_.accumulatorId).sorted
+val metricIds = exec.metrics.map(_.accumulatorId).toSet
 val metricTypes = exec.metrics.map { m => (m.accumulatorId, 
m.metricType) }.toMap
 val metrics = exec.stages.toSeq
   .flatMap { stageId => Option(stageMetrics.get(stageId)) }
--- End diff --

Consider also changing the following `flatMap` / `filter` / `groupBy` calls into 
a `while` loop, as in the sketch below.
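
A generic sketch of the rewrite meant here (illustrative collections, not the actual listener code): manual `while` loops avoid the intermediate collections that a `flatMap`/`filter` chain builds on this hot path.

```
import scala.collection.mutable.ArrayBuffer

// Collect the even values from nested arrays in a single pass, where
// xss.flatMap(identity).filter(_ % 2 == 0) would build two intermediate
// collections.
def collectEven(xss: Array[Array[Int]]): Array[Int] = {
  val out = new ArrayBuffer[Int]
  var i = 0
  while (i < xss.length) {        // outer while replaces flatMap
    val xs = xss(i)
    var j = 0
    while (j < xs.length) {       // inner while replaces filter
      if (xs(j) % 2 == 0) out += xs(j)
      j += 1
    }
    i += 1
  }
  out.toArray
}
```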


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23002: [SPARK-26003] Improve SQLAppStatusListener.aggreg...

2018-11-12 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/23002#discussion_r232706543
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala
 ---
@@ -159,7 +159,7 @@ class SQLAppStatusListener(
   }
 
   private def aggregateMetrics(exec: LiveExecutionData): Map[Long, String] 
= {
-val metricIds = exec.metrics.map(_.accumulatorId).sorted
+val metricIds = exec.metrics.map(_.accumulatorId).toSet
--- End diff --

Actually this one can be merged into `metricTypes`.
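
That is, something along these lines (a sketch with a stand-in type for the listener's metric objects; the real type is `SQLPlanMetric`):

```
// Stand-in for the real metric type, which carries an accumulator id and
// a metric type name.
case class Metric(accumulatorId: Long, metricType: String)

val metrics = Seq(Metric(1L, "sum"), Metric(2L, "size"))

// One pass builds the id -> type map; the id set used for filtering is
// just its key set, so the separate map(...).toSet pass goes away.
val metricTypes = metrics.map(m => m.accumulatorId -> m.metricType).toMap
val metricIds = metricTypes.keySet
```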


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark

2018-11-11 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22966#discussion_r232550388
  
--- Diff: 
external/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroReadBenchmark.scala
 ---
@@ -0,0 +1,226 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.Random
+
+import org.apache.spark.SparkConf
+import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.catalyst.plans.SQLHelper
+import org.apache.spark.sql.types._
+
+/**
+ * Benchmark to measure Avro read performance.
+ * {{{
+ *   To run this benchmark:
+ *   1. without sbt: bin/spark-submit --class <this class>
+ *        --jars , 
+ *   2. build/sbt "avro/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "avro/test:runMain <this class>"
+ *      Results will be written to "benchmarks/AvroReadBenchmark-results.txt".
+ * }}}
+ */
+object AvroReadBenchmark extends BenchmarkBase with SQLHelper {
--- End diff --

@dongjoon-hyun OK, then I think this one is ready.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark

2018-11-10 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22966#discussion_r232461509
  
--- Diff: external/avro/benchmarks/AvroReadBenchmark-results.txt ---
@@ -0,0 +1,122 @@
+================================================================================================
+SQL Single Numeric Column Scan
+================================================================================================
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6
+Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
+SQL Single TINYINT Column Scan:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+Sum                                            2013 / 2071          7.8         128.0       1.0X
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6
+Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
+SQL Single SMALLINT Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+Sum                                            1955 / 1957          8.0         124.3       1.0X
--- End diff --

Actually, all these INT types are processed as INT. The difference is caused 
by JIT.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark

2018-11-09 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22966#discussion_r232272074
  
--- Diff: 
external/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroReadBenchmark.scala
 ---
@@ -0,0 +1,226 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.Random
+
+import org.apache.spark.SparkConf
+import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.catalyst.plans.SQLHelper
+import org.apache.spark.sql.types._
+
+/**
+ * Benchmark to measure Avro read performance.
+ * {{{
+ *   To run this benchmark:
+ *   1. without sbt: bin/spark-submit --class <this class>
+ *        --jars , 
+ *   2. build/sbt "avro/test:runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "avro/test:runMain <this class>"
+ *      Results will be written to "benchmarks/AvroReadBenchmark-results.txt".
+ * }}}
+ */
+object AvroReadBenchmark extends BenchmarkBase with SQLHelper {
--- End diff --

This is following `DataSourceReadBenchmark` and `OrcReadBenchmark`.
Should I change them as well?




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark

2018-11-08 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22966
  
@dongjoon-hyun I think we can merge this one first.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22987: [SPARK-25979][SQL] Window function: allow parentheses ar...

2018-11-08 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22987
  
@gatorsmile @cloud-fan 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22987: [SPARK-25979][SQL] Window function: allow parenth...

2018-11-08 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22987#discussion_r231962213
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/SQLWindowFunctionSuite.scala
 ---
@@ -31,6 +32,19 @@ class SQLWindowFunctionSuite extends QueryTest with 
SharedSQLContext {
 
   import testImplicits._
 
+  val empSalaryData = Seq(
--- End diff --

The table and data here are from: 
https://www.postgresql.org/docs/9.1/tutorial-window.html

The naming and data are easier to understand. In case we need to compare 
more behavior, we can use this one as an alternative.
I can also use the existing `WindowData` and remove this one.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22987: [SPARK-25979][SQL] Window function: allow parenth...

2018-11-08 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/22987

[SPARK-25979][SQL] Window function: allow parentheses around window 
reference

## What changes were proposed in this pull request?

Very minor parser bug, but possibly problematic for code-generated queries:

Consider the following two queries:
```
SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 1
```
and
```
SELECT avg(k) OVER w FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 1
```
The former, with parens around the OVER condition, fails to parse while the 
latter, without parens, succeeds:
```
Error in SQL statement: ParseException: 
mismatched input '(' expecting {<EOF>, ',', 'FROM', 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 19)

== SQL ==
SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 1
-------------------^^^
```
This was found when running the cockroach DB tests.

I tried PostgreSQL; the SQL with parentheses also works there.

## How was this patch tested?

Unit test
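
Roughly the shape of the added test (a sketch assuming a `QueryTest` suite with `SharedSQLContext`; the actual suite's data setup may differ):

```
test("window function: allow parentheses around window reference") {
  withTempView("kv") {
    Seq((1, "a"), (2, "a"), (3, "b")).toDF("k", "v").createOrReplaceTempView("kv")
    // Both forms should parse and produce the same result.
    val withParens =
      sql("SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v) ORDER BY 1")
    val withoutParens =
      sql("SELECT avg(k) OVER w FROM kv WINDOW w AS (PARTITION BY v) ORDER BY 1")
    checkAnswer(withParens, withoutParens)
  }
}
```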

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark windowParentheses

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22987.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22987


commit c9d01df0ee9851751469e231069ab587975f5a7c
Author: Gengliang Wang 
Date:   2018-11-08T16:07:46Z

fix




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark/DataSo...

2018-11-08 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22965
  
@dongjoon-hyun sure, done. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark...

2018-11-08 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22965#discussion_r231795384
  
--- Diff: sql/core/benchmarks/DataSourceReadBenchmark-results.txt ---
@@ -2,268 +2,268 @@
 SQL Single Numeric Column Scan
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6
+Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
 SQL Single TINYINT Column Scan:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-SQL CSV                                     21508 / 22112          0.7        1367.5       1.0X
-SQL Json                                      8705 / 8825          1.8         553.4       2.5X
-SQL Parquet Vectorized                         157 /  186        100.0          10.0     136.7X
-SQL Parquet MR                                1789 / 1794          8.8         113.8      12.0X
-SQL ORC Vectorized                             156 /  166        100.9           9.9     138.0X
-SQL ORC Vectorized with copy                   218 /  225         72.1          13.9      98.6X
-SQL ORC MR                                    1448 / 1492         10.9          92.0      14.9X
-
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+SQL CSV                                     14108 / 14263          1.1         896.9       1.0X
+SQL Json                                      5477 / 5509          2.9         348.2       2.6X
+SQL Parquet Vectorized                         115 /  125        137.1           7.3     122.9X
+SQL Parquet MR                                1318 / 1332         11.9          83.8      10.7X
+SQL ORC Vectorized                             150 /  159        104.9           9.5      94.1X
+SQL ORC Vectorized with copy                   206 /  208         76.4          13.1      68.5X
+SQL ORC MR                                    1072 / 1075         14.7          68.1      13.2X
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6
+Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
 Parquet Reader Single TINYINT Column Scan:  Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-ParquetReader Vectorized                       202 /  211         77.7          12.9       1.0X
-ParquetReader Vectorized -> Row                118 /  120        133.5           7.5       1.7X
+ParquetReader Vectorized                       138 /  152        114.0           8.8       1.0X
+ParquetReader Vectorized -> Row                 80 /   87        197.2           5.1       1.7X
 
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6
+Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
 SQL Single SMALLINT Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-SQL CSV                                     23282 / 23312          0.7        1480.2       1.0X
-SQL Json                                      9187 / 9189          1.7         584.1       2.5X
-SQL Parquet Vectorized                         204 /  218         77.0          13.0     114.0X
-SQL Parquet MR                                1941 / 1953          8.1         123.4      12.0X
-SQL ORC Vectorized                             217 /  225         72.6          13.8     107.5X
-SQL ORC Vectorized with copy                   279 /  289         56.3          17.8      83.4X
-SQL ORC MR                                    1541 / 1549         10.2          98.0      15.1X
-
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+SQL CSV                                     14495 / 14507          1.1         921.6       1.0X
+SQL Json                                      5615 / 5668          2.8         357.0       2.6X
+SQL Parquet Vectorized                         147 /  154        107.4           9.3      98.9X
+SQL Parquet MR                                1431 / 1454         11.0          91.0      10.1X
+SQL ORC Vectorized                             170 /  175         92.4          10.8      85.1X
+SQL ORC Vectorized wit

[GitHub] spark issue #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark/DataSo...

2018-11-08 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22965
  
retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark

2018-11-07 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22966
  
Cool, could you introduce it to Spark? That would be very helpful :)
@dbtsai  @jleach4 and @aokolnychyi


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark

2018-11-07 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22966
  
@dbtsai Great! 
I thought the benchmark in this PR was kind of simple, so I didn't add 
it for months.
The benchmark you mentioned should also be workable for other data sources, 
right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark...

2018-11-07 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22965#discussion_r231766852
  
--- Diff: sql/core/benchmarks/DataSourceReadBenchmark-results.txt ---
@@ -2,268 +2,268 @@
 SQL Single Numeric Column Scan
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6
+Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
 SQL Single TINYINT Column Scan:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-SQL CSV                                     21508 / 22112          0.7        1367.5       1.0X
-SQL Json                                      8705 / 8825          1.8         553.4       2.5X
-SQL Parquet Vectorized                         157 /  186        100.0          10.0     136.7X
-SQL Parquet MR                                1789 / 1794          8.8         113.8      12.0X
-SQL ORC Vectorized                             156 /  166        100.9           9.9     138.0X
-SQL ORC Vectorized with copy                   218 /  225         72.1          13.9      98.6X
-SQL ORC MR                                    1448 / 1492         10.9          92.0      14.9X
-
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+SQL CSV                                     15974 / 16222          1.0        1015.6       1.0X
+SQL Json                                      5917 / 6174          2.7         376.2       2.7X
+SQL Parquet Vectorized                         115 /  128        136.8           7.3     138.9X
+SQL Parquet MR                                1459 / 1571         10.8          92.8      10.9X
+SQL ORC Vectorized                             164 /  194         95.8          10.4      97.3X
+SQL ORC Vectorized with copy                   204 /  303         77.2          12.9      78.4X
+SQL ORC MR                                    1095 / 1143         14.4          69.6      14.6X
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6
+Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
 Parquet Reader Single TINYINT Column Scan:  Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-ParquetReader Vectorized                       202 /  211         77.7          12.9       1.0X
-ParquetReader Vectorized -> Row                118 /  120        133.5           7.5       1.7X
+ParquetReader Vectorized                       139 /  156        113.1           8.8       1.0X
+ParquetReader Vectorized -> Row                 83 /   89        188.7           5.3       1.7X
 
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6
+Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
 SQL Single SMALLINT Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-SQL CSV                                     23282 / 23312          0.7        1480.2       1.0X
-SQL Json                                      9187 / 9189          1.7         584.1       2.5X
-SQL Parquet Vectorized                         204 /  218         77.0          13.0     114.0X
-SQL Parquet MR                                1941 / 1953          8.1         123.4      12.0X
-SQL ORC Vectorized                             217 /  225         72.6          13.8     107.5X
-SQL ORC Vectorized with copy                   279 /  289         56.3          17.8      83.4X
-SQL ORC MR                                    1541 / 1549         10.2          98.0      15.1X
-
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+SQL CSV                                     16394 / 16643          1.0        1042.3       1.0X
+SQL Json                                      6014 / 6020          2.6         382.4       2.7X
+SQL Parquet Vectorized                         147 /  155        106.9           9.4     111.4X
+SQL Parquet MR                                1575 / 1581         10.0         100.1      10.4X
+SQL ORC Vectorized                             168 /  173         93.9          10.7      97.9X
+SQL ORC Vectorized wit

[GitHub] spark pull request #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark...

2018-11-07 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22965#discussion_r231765680
  
--- Diff: sql/core/benchmarks/DataSourceReadBenchmark-results.txt ---
@@ -2,268 +2,268 @@
 SQL Single Numeric Column Scan
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6
+Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
 SQL Single TINYINT Column Scan:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-SQL CSV                                     21508 / 22112          0.7        1367.5       1.0X
-SQL Json                                      8705 / 8825          1.8         553.4       2.5X
-SQL Parquet Vectorized                         157 /  186        100.0          10.0     136.7X
-SQL Parquet MR                                1789 / 1794          8.8         113.8      12.0X
-SQL ORC Vectorized                             156 /  166        100.9           9.9     138.0X
-SQL ORC Vectorized with copy                   218 /  225         72.1          13.9      98.6X
-SQL ORC MR                                    1448 / 1492         10.9          92.0      14.9X
-
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+SQL CSV                                     15974 / 16222          1.0        1015.6       1.0X
+SQL Json                                      5917 / 6174          2.7         376.2       2.7X
+SQL Parquet Vectorized                         115 /  128        136.8           7.3     138.9X
+SQL Parquet MR                                1459 / 1571         10.8          92.8      10.9X
+SQL ORC Vectorized                             164 /  194         95.8          10.4      97.3X
+SQL ORC Vectorized with copy                   204 /  303         77.2          12.9      78.4X
+SQL ORC MR                                    1095 / 1143         14.4          69.6      14.6X
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6
+Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
 Parquet Reader Single TINYINT Column Scan:  Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-ParquetReader Vectorized                       202 /  211         77.7          12.9       1.0X
-ParquetReader Vectorized -> Row                118 /  120        133.5           7.5       1.7X
+ParquetReader Vectorized                       139 /  156        113.1           8.8       1.0X
+ParquetReader Vectorized -> Row                 83 /   89        188.7           5.3       1.7X
 
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6
+Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
 SQL Single SMALLINT Column Scan:         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-SQL CSV                                     23282 / 23312          0.7        1480.2       1.0X
-SQL Json                                      9187 / 9189          1.7         584.1       2.5X
-SQL Parquet Vectorized                         204 /  218         77.0          13.0     114.0X
-SQL Parquet MR                                1941 / 1953          8.1         123.4      12.0X
-SQL ORC Vectorized                             217 /  225         72.6          13.8     107.5X
-SQL ORC Vectorized with copy                   279 /  289         56.3          17.8      83.4X
-SQL ORC MR                                    1541 / 1549         10.2          98.0      15.1X
-
-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+SQL CSV                                     16394 / 16643          1.0        1042.3       1.0X
+SQL Json                                      6014 / 6020          2.6         382.4       2.7X
+SQL Parquet Vectorized                         147 /  155        106.9           9.4     111.4X
+SQL Parquet MR                                1575 / 1581         10.0         100.1      10.4X
+SQL ORC Vectorized                             168 /  173         93.9          10.7      97.9X
+SQL ORC Vectorized wit

[GitHub] spark issue #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark

2018-11-07 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22966
  
Done, @dongjoon-hyun PTAL.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark/DataSo...

2018-11-07 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22965
  
@dongjoon-hyun @yucai 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22966: [SPARK-25965][SQL] Add avro read benchmark

2018-11-07 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22966
  
@dongjoon-hyun 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark...

2018-11-07 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22965#discussion_r231549251
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala ---
@@ -32,9 +32,11 @@ import org.apache.spark.sql.types._
  * Benchmark to measure ORC read performance.
  * {{{
  *   To run this benchmark:
- *   1. without sbt: bin/spark-submit --class <this class>
- *   2. build/sbt "sql/test:runMain <this class>"
- *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
+ *   1. without sbt: bin/spark-submit --class <this class>
+ *        --jars 
--- End diff --

The jars here are built by sbt. 
I am surprised that 5 jars are required.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22966: [SPARK-25965][SQL] Add avro read benchmark

2018-11-07 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/22966

[SPARK-25965][SQL] Add avro read benchmark

## What changes were proposed in this pull request?

Add a read benchmark for Avro, which has been missing for a while.
The benchmark is similar to `DataSourceReadBenchmark` and `OrcReadBenchmark`.

## How was this patch tested?

Manually run benchmark
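
For reference, the cases follow Spark's `Benchmark` harness; a condensed sketch of one case (assuming a `spark` session is in scope; the path is illustrative):

```
import org.apache.spark.benchmark.Benchmark

// Write the data once, then time one read variant against it.
val dir = "/tmp/avro-read-benchmark"   // illustrative path
val numRows = 1024 * 1024 * 15
val benchmark = new Benchmark("SQL Single Numeric Column Scan", numRows)

spark.range(numRows).write.format("avro").save(dir)

benchmark.addCase("Sum") { _ =>
  spark.read.format("avro").load(dir).selectExpr("sum(id)").collect()
}
benchmark.run()
```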

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark avroReadBenchmark

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22966.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22966


commit 2713d6e83e5349ba8237a2c680665fb180d14e94
Author: Gengliang Wang 
Date:   2018-11-07T09:09:20Z

add avro read benchmark




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark...

2018-11-07 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22965#discussion_r231541613
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala ---
@@ -266,8 +268,9 @@ object OrcReadBenchmark extends BenchmarkBase with 
SQLHelper {
 s"SELECT IF(RAND(1) < $fractionOfNulls, NULL, CAST(id as 
STRING)) AS c1, " +
 s"IF(RAND(2) < $fractionOfNulls, NULL, CAST(id as STRING)) AS 
c2 FROM t1"))
--- End diff --

@maropu It is trivial, but why is it RAND(2) here?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark...

2018-11-07 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/22965

[SPARK-25964][SQL][Minor] Revise OrcReadBenchmark/DataSourceReadBenchmark 
case names and execution instructions

## What changes were proposed in this pull request?

1. OrcReadBenchmark is under the hive module, so the way to run it should be:
```
build/sbt "hive/test:runMain <this class>"
```

2. The benchmark "String with Nulls Scan" should use the case name "String with 
Nulls Scan (5%/50%/95%)", not "(0.05%/0.5%/0.95%)".

3. Add the null value percentages to the test case names of 
DataSourceReadBenchmark for the benchmark "String with Nulls Scan".


## How was this patch tested?

Re-run benchmarks

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark fixHiveOrcReadBenchmark

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22965.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22965


commit f282331d12975687391a7648aacde19a58774936
Author: Gengliang Wang 
Date:   2018-11-07T13:03:05Z

fix




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...

2018-11-04 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22914
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22878: [SPARK-25789][SQL] Support for Dataset of Avro

2018-11-01 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22878#discussion_r230041592
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroEncoder.scala ---
@@ -0,0 +1,533 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.avro
+
+import java.io._
+import java.util.{Map => JMap}
+
+import scala.collection.JavaConverters._
+import scala.language.existentials
+import scala.reflect.ClassTag
+
+import org.apache.avro.Schema
+import org.apache.avro.Schema.Parser
+import org.apache.avro.Schema.Type._
+import org.apache.avro.generic.{GenericData, IndexedRecord}
+import org.apache.avro.reflect.ReflectData
+import org.apache.avro.specific.SpecificRecord
+
+import org.apache.spark.sql.Encoder
+import org.apache.spark.sql.avro.SchemaConverters._
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.{GetColumnByOrdinal, 
UnresolvedExtractValue}
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.codegen._
+import org.apache.spark.sql.catalyst.expressions.codegen.Block._
+import org.apache.spark.sql.catalyst.expressions.objects.{LambdaVariable 
=> _, _}
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, 
GenericArrayData}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * A Spark-SQL Encoder for Avro objects
+ */
+object AvroEncoder {
+  /**
+   * Provides an Encoder for Avro objects of the given class
+   *
+   * @param avroClass the class of the Avro object for which to generate 
the Encoder
+   * @tparam T the type of the Avro class, must implement SpecificRecord
+   * @return an Encoder for the given Avro class
+   */
+  def of[T <: SpecificRecord](avroClass: Class[T]): Encoder[T] = {
+AvroExpressionEncoder.of(avroClass)
+  }
+  /**
+   * Provides an Encoder for Avro objects implementing the given schema
+   *
+   * @param avroSchema the Schema of the Avro object for which to generate 
the Encoder
+   * @tparam T the type of the Avro class that implements the Schema, must 
implement IndexedRecord
+   * @return an Encoder for the given Avro Schema
+   */
+  def of[T <: IndexedRecord](avroSchema: Schema): Encoder[T] = {
--- End diff --

In `from_avro`, we are using the Avro schema as a JSON-format string; should we 
consider changing to that here?
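
For comparison, a minimal sketch of how `from_avro`/`to_avro` take the schema as a JSON-format string (assuming a `spark` session is in scope):

```
import org.apache.spark.sql.avro.{from_avro, to_avro}
import org.apache.spark.sql.functions.{col, struct}

val jsonFormatSchema =
  """{"type":"record","name":"rec","fields":[{"name":"id","type":"long"}]}"""

// Round trip: serialize a struct column to Avro binary with to_avro, then
// parse it back with from_avro using the JSON-format schema string.
val df = spark.range(3).toDF("id")
val roundTripped = df
  .select(to_avro(struct(col("id"))).as("avro"))
  .select(from_avro(col("avro"), jsonFormatSchema).as("rec"))
```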


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...

2018-11-01 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22914
  
> Returning to current page seems more code change is required. Because we 
are getting all the parameters related to page from the url. Currently we are 
not passing any parameter to get which page the user was on previously.

I came up with a new idea: how about navigating to the last page (max page 
number)? That makes more sense IMHO.
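
In code, the idea is just clamping the requested page into the valid range (a sketch, not the actual PagedTable code):

```
// Clamp an out-of-range page request instead of showing an error: too-large
// page numbers fall back to the last page, too-small ones to the first.
def sanitizePageNumber(requested: Int, totalPages: Int): Int =
  math.min(math.max(requested, 1), math.max(totalPages, 1))

assert(sanitizePageNumber(100, 7) == 7)
assert(sanitizePageNumber(0, 7) == 1)
```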


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22864: [SPARK-25861][Minor][WEBUI] Remove unused refreshInterva...

2018-11-01 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22864
  
retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22878: [SPARK-25789][SQL] Support for Dataset of Avro

2018-11-01 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22878#discussion_r229952971
  
--- Diff: 
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
@@ -1374,4 +1377,185 @@ class AvroSuite extends QueryTest with 
SharedSQLContext with SQLTestUtils {
   |}
 """.stripMargin)
   }
+
+  test("generic record converts to row and back") {
--- End diff --

I think we can move these test cases to a new suite. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22878: [SPARK-25789][SQL] Support for Dataset of Avro

2018-11-01 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22878#discussion_r229976669
  
--- Diff: 
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
@@ -1374,4 +1377,185 @@ class AvroSuite extends QueryTest with 
SharedSQLContext with SQLTestUtils {
   |}
 """.stripMargin)
   }
+
+  test("generic record converts to row and back") {
+val nested =
+  SchemaBuilder.record("simple_record").fields()
+.name("nested1").`type`("int").withDefault(0)
+.name("nested2").`type`("string").withDefault("string").endRecord()
+val mapDefault = new java.util.HashMap[String, String]()
+mapDefault.put("a", "A")
+val schema = SchemaBuilder.record("record").fields()
--- End diff --

We can use a JSON string to represent the schema.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...

2018-10-31 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22914
  
> May be we can highlight above the table, that "Invalid page number, 
falling back to first page"

Yes, that's what I mean. 
It's no big deal, but falling back to the first page seems unnecessary; we can 
return the same page again.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...

2018-10-31 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22914
  
I prefer to  just highlight the invalid output. E.g.


![image](https://user-images.githubusercontent.com/1097932/47831557-0e6ea800-ddcc-11e8-9fd1-c4d29f944c9d.png)


![image](https://user-images.githubusercontent.com/1097932/47831561-11699880-ddcc-11e8-9d83-20e49a0c517d.png)



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22895: [SPARK-25886][SQL][Minor] Improve error message o...

2018-10-30 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22895#discussion_r229564121
  
--- Diff: 
external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala 
---
@@ -100,9 +100,14 @@ case class AvroDataToCatalyst(
   case NonFatal(e) => parseMode match {
 case PermissiveMode => nullResultRow
 case FailFastMode =>
-  throw new SparkException("Malformed records are detected in 
record parsing. " +
+  val msg = if (e.getMessage != null) {
+e.getMessage + "\n"
+  } else {
+""
+  }
+  throw new SparkException(msg + "Malformed records are detected 
in record parsing. " +
 s"Current parse Mode: ${FailFastMode.name}. To process 
malformed records as null " +
-"result, try setting the option 'mode' as 'PERMISSIVE'.", 
e.getCause)
+"result, try setting the option 'mode' as 'PERMISSIVE'.", e)
--- End diff --

Agreed. When I posted the output, I had the same thought... haha.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


