[GitHub] spark pull request #23208: [SPARK-25530][SQL] data source v2 API refactor (b...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23208#discussion_r240101369 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -17,52 +17,49 @@ package org.apache.spark.sql.execution.datasources.v2 -import java.util.UUID - -import scala.collection.JavaConverters._ +import java.util.{Optional, UUID} import org.apache.spark.sql.{AnalysisException, SaveMode} import org.apache.spark.sql.catalyst.analysis.{MultiInstanceRelation, NamedRelation} import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression} import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan, Statistics} import org.apache.spark.sql.catalyst.util.truncatedString -import org.apache.spark.sql.sources.DataSourceRegister import org.apache.spark.sql.sources.v2._ import org.apache.spark.sql.sources.v2.reader._ import org.apache.spark.sql.sources.v2.writer.BatchWriteSupport import org.apache.spark.sql.types.StructType /** - * A logical plan representing a data source v2 scan. + * A logical plan representing a data source v2 table. * - * @param source An instance of a [[DataSourceV2]] implementation. - * @param options The options for this scan. Used to create fresh [[BatchWriteSupport]]. - * @param userSpecifiedSchema The user-specified schema for this scan. + * @param table The table that this relation represents. + * @param options The options for this table operation. It's used to create fresh [[ScanBuilder]] + *and [[BatchWriteSupport]]. */ case class DataSourceV2Relation( -// TODO: remove `source` when we finish API refactor for write. -source: TableProvider, -table: SupportsBatchRead, +table: Table, output: Seq[AttributeReference], -options: Map[String, String], -userSpecifiedSchema: Option[StructType] = None) +// TODO: use a simple case insensitive map instead. +options: DataSourceOptions) extends LeafNode with MultiInstanceRelation with NamedRelation { - import DataSourceV2Relation._ - override def name: String = table.name() override def simpleString: String = { s"RelationV2${truncatedString(output, "[", ", ", "]")} $name" } - def newWriteSupport(): BatchWriteSupport = source.createWriteSupport(options, schema) - - def newScanBuilder(): ScanBuilder = { -val dsOptions = new DataSourceOptions(options.asJava) -table.newScanBuilder(dsOptions) + def newWriteSupport(inputSchema: StructType, mode: SaveMode): Optional[BatchWriteSupport] = { --- End diff -- Nit: add comment for the method. Especially when it will return None. Although it is explained in `SupportsBatchWrite.createBatchWriteSupport` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23215: [SPARK-26263][SQL] Validate partition values with user p...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23215 Thank you @cloud-fan @viirya @HyukjinKwon . --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23215: [SPARK-26263][SQL] Validate partition values with user p...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23215 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23208: [SPARK-25530][SQL] data source v2 API refactor (b...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23208#discussion_r239454594 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/SupportsBatchWrite.java --- @@ -25,14 +25,14 @@ import org.apache.spark.sql.types.StructType; /** - * A mix-in interface for {@link DataSourceV2}. Data sources can implement this interface to + * A mix-in interface for {@link Table}. Data sources can implement this interface to * provide data writing ability for batch processing. * * This interface is used to create {@link BatchWriteSupport} instances when end users run * {@code Dataset.write.format(...).option(...).save()}. */ @Evolving -public interface BatchWriteSupportProvider extends DataSourceV2 { +public interface SupportsBatchWrite extends Table { --- End diff -- It is quite confusing to have `BatchWriteSupport` and `SupportsBatchWrite` to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23215: [SPARK-26263][SQL] Validate partition values with...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23215#discussion_r239423651 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -272,9 +279,13 @@ object PartitioningUtils { val literal = if (userSpecifiedDataTypes.contains(columnName)) { // SPARK-26188: if user provides corresponding column schema, get the column value without // inference, and then cast it as user specified data type. -val columnValue = inferPartitionColumnValue(rawColumnValue, false, timeZone) -val castedValue = - Cast(columnValue, userSpecifiedDataTypes(columnName), Option(timeZone.getID)).eval() +val dataType = userSpecifiedDataTypes(columnName) +val columnValueLiteral = inferPartitionColumnValue(rawColumnValue, false, timeZone) +val columnValue = columnValueLiteral.eval() +val castedValue = Cast(columnValueLiteral, dataType, Option(timeZone.getID)).eval() +if (validatePartitionColumns && columnValue != null && castedValue == null) { + throw new RuntimeException(s"Failed to cast partition value `$columnValue` to $dataType") --- End diff -- Good suggestion. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23215: [SPARK-26263][SQL] Validate partition values with user p...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23215 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23240: [SPARK-26281][WebUI] Duration column of task table shoul...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23240 Oh, I see. Close this one now. Please change the title in #23160 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23240: [SPARK-26281][WebUI] Duration column of task tabl...
Github user gengliangwang closed the pull request at: https://github.com/apache/spark/pull/23240 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23240: [SPARK-26281][WebUI] Duration column of task table shoul...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23240 @shahidki31 @pgandhi999 @srowen --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23240: [SPARK-26281][WebUI] Duration column of task tabl...
GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/23240 [SPARK-26281][WebUI] Duration column of task table should be executor run time instead of real duration ## What changes were proposed in this pull request? In PR https://github.com/apache/spark/pull/23081/ , the duration column is changed to executor run time. The behavior is consistent with the summary metrics table and previous Spark version. However, after PR https://github.com/apache/spark/pull/21688, the issue can be reproduced again. ## How was this patch tested? Before the change, we can see: 1. The minimum duration in aggregation table doesn't match with the task table below. 2. The sorting order is wrong. ![image](https://user-images.githubusercontent.com/1097932/49533048-f7eecb80-f8f8-11e8-9256-2eb524e81be0.png) After the change, the issues are fixed: ![image](https://user-images.githubusercontent.com/1097932/49533069-06d57e00-f8f9-11e8-872b-402e3014f557.png) You can merge this pull request into a Git repository by running: $ git pull https://github.com/gengliangwang/spark fixDuration Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23240.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23240 commit 612c4c7242f6289d3a1e424a69951be25cd126af Author: Gengliang Wang Date: 2018-12-05T17:44:55Z fix --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23215: [SPARK-26263][SQL] Throw exception when Partition...
GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/23215 [SPARK-26263][SQL] Throw exception when Partition column value can't be converted to user specified type ## What changes were proposed in this pull request? Currently if user provides data schema, partition column values are converted as per it. But if the conversion failed, e.g. converting string to int, the column value is null. We should throw exception in such case. For the following directory ``` /tmp/testDir âââ p=bar âââ p=foo ``` If we run: ``` val schema = StructType(Seq(StructField("p", IntegerType, false))) spark.read.schema(schema).csv("/tmp/testDir/").show() ``` We will get: ``` ++ | p| ++ |null| |null| ++ ``` This PR proposes to throw exception in such case, instead of converting into null value silently: 1. These null partition column values doesn't make sense to users in most cases. It is better to show the conversion failure, and then users can adjust the schema or ETL jobs to fix it. 2. There are always exceptions on such conversion failure for non-partition data columns. Partition columns should have the same behavior. ## How was this patch tested? Unit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/gengliangwang/spark SPARK-26263 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23215.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23215 commit 7060e127de339de42be12ed382ef0a4363ae325d Author: Gengliang Wang Date: 2018-12-04T09:43:03Z SPARK-26263: Throw exception when partition value can't be converted to specific type --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23215: [SPARK-26263][SQL] Throw exception when Partition column...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23215 @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23189: [SPARK-26235][Core] Change log level for ClassNotFoundEx...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23189 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23189: [SPARK-26235][Core] Change log level for ClassNot...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23189#discussion_r237972852 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -813,14 +813,14 @@ private[spark] class SparkSubmit extends Logging { mainClass = Utils.classForName(childMainClass) } catch { case e: ClassNotFoundException => -logWarning(s"Failed to load $childMainClass.", e) +logError(s"Failed to load class $childMainClass.") --- End diff -- Here I tried append `e.getMessage` to the output, but it seems the error message equals the class name `childMainClass`. e.g. ``` Error: Failed to load class foo: foo. ``` So I think we can output without the exception message. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23189: [SPARK-26235][Core] Change log level for ClassNotFoundEx...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23189 @vanzin Ah, I see. Thanks for pointing it out! But I am now thinking overriding `logError` by calling `printMessage("Error...")`. What do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23186#discussion_r237963856 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -345,15 +346,18 @@ object PartitioningUtils { */ def resolvePartitions( pathsWithPartitionValues: Seq[(Path, PartitionValues)], + caseSensitive: Boolean, timeZone: TimeZone): Seq[PartitionValues] = { if (pathsWithPartitionValues.isEmpty) { Seq.empty } else { - // TODO: Selective case sensitivity. - val distinctPartColNames = - pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())).distinct + val distinctPartColNames = if (caseSensitive) { +pathsWithPartitionValues.map(_._2.columnNames) + } else { +pathsWithPartitionValues.map(_._2.columnNames.map(_.toLowerCase())) + } assert( -distinctPartColNames.size == 1, +distinctPartColNames.distinct.size == 1, listConflictingPartitionColumns(pathsWithPartitionValues)) --- End diff -- The method `listConflictingPartitionColumns` also shows the suspicious paths. If case sensitive, the method works fine. If case insensitive, it will list all column names without any transformation. e.g. ``` Partition column name list #0: a Partition column name list #1: A Partition column name list #2: B ``` I can fix the method listConflictingPartitionColumns. But seems a bit trivial, we will have to display the original column names instead of transforming all to lower case . --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23189: [SPARK-26235][Core] Change log level for ClassNotFoundEx...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23189 @vanzin The `logWarning` call as the other handler below is also not overridden: ``` case e: NoClassDefFoundError => logWarning(s"Failed to load $childMainClass: ${e.getMessage()}") ``` My basic point is, the exception/error message should be in the category of `ERROR`, even previously Spark prints as warning to `System.err`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23189: [SPARK-26235][Core] Change log level for ClassNotFoundEx...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23189 cc @vanzin This is really trivial, but can be helpful to user in certain case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23189: [SPARK-26235][Core] Change log level for ClassNot...
GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/23189 [SPARK-26235][Core] Change log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error ## What changes were proposed in this pull request? In my local setup, I set log4j root category as ERROR (https://stackoverflow.com/questions/27781187/how-to-stop-info-messages-displaying-on-spark-console , first item show up if we google search "set spark log level".) When I run such command ``` spark-submit --class foo bar.jar ``` Nothing shows up, and the script exits. After quick investigation, I think the log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit should be ERROR instead of WARN. Since the whole process exit because of the exception/error. Before https://github.com/apache/spark/pull/20925, the message is not controlled by `log4j.rootCategory`. ## How was this patch tested? Manual check. You can merge this pull request into a Git repository by running: $ git pull https://github.com/gengliangwang/spark changeLogLevel Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23189.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23189 commit 158d421757989740345be4e52fbc333b08356b5b Author: Gengliang Wang Date: 2018-11-30T14:23:44Z change log level --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23160: [SPARK-26196]Total tasks title in the stage page is inco...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23160 cc @tgravescs @pgandhi999 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, validate...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23186 @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23186: [SPARK-26230][SQL]FileIndex: if case sensitive, v...
GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/23186 [SPARK-26230][SQL]FileIndex: if case sensitive, validate partitions with original column names ## What changes were proposed in this pull request? Partition column name is required to be unique under the same directory. The following paths are invalid partitioned directory: ``` hdfs://host:9000/path/a=1 hdfs://host:9000/path/b=2 ``` If case sensitive, the following paths should be invalid too: ``` hdfs://host:9000/path/a=1 hdfs://host:9000/path/A=2 ``` Since column 'a' and 'A' are different, and it is wrong to use either one as the column name in partition schema. Also, there is a `TODO` comment in the code. This PR is to resolve the problem. ## How was this patch tested? Add unit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/gengliangwang/spark SPARK-26230 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23186.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23186 commit 6d052130051a21b9aa7c3ffce56a556bee129a5e Author: Gengliang Wang Date: 2018-11-30T09:23:32Z if case sensitive, validate partitions with original column names --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23068 @gatorsmile thanks! For Jobs don't have associated SQL query, the text `Associated SQL query` won't show up. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23165#discussion_r237523177 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -250,7 +276,13 @@ object PartitioningUtils { val rawColumnValue = columnSpec.drop(equalSignIndex + 1) assert(rawColumnValue.nonEmpty, s"Empty partition column value in '$columnSpec'") - val literal = inferPartitionColumnValue(rawColumnValue, typeInference, timeZone) + val literal = if (userSpecifiedDataTypes.contains(columnName)) { +// SPARK-26188: if user provides corresponding column schema, process the column as String +// type and cast it as user specified data type later. +inferPartitionColumnValue(rawColumnValue, false, timeZone) --- End diff -- Thanks. I will use the Cast one. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23165#discussion_r237516228 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -250,7 +276,13 @@ object PartitioningUtils { val rawColumnValue = columnSpec.drop(equalSignIndex + 1) assert(rawColumnValue.nonEmpty, s"Empty partition column value in '$columnSpec'") - val literal = inferPartitionColumnValue(rawColumnValue, typeInference, timeZone) + val literal = if (userSpecifiedDataTypes.contains(columnName)) { +// SPARK-26188: if user provides corresponding column schema, process the column as String +// type and cast it as user specified data type later. +inferPartitionColumnValue(rawColumnValue, false, timeZone) --- End diff -- See my reasons above. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23165#discussion_r237515959 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -250,7 +276,13 @@ object PartitioningUtils { val rawColumnValue = columnSpec.drop(equalSignIndex + 1) assert(rawColumnValue.nonEmpty, s"Empty partition column value in '$columnSpec'") - val literal = inferPartitionColumnValue(rawColumnValue, typeInference, timeZone) + val literal = if (userSpecifiedDataTypes.contains(columnName)) { +// SPARK-26188: if user provides corresponding column schema, process the column as String +// type and cast it as user specified data type later. --- End diff -- 1. the function returns `Option[(String, Literal)]` 2. the function `inferPartitionColumnValue` is quite complex, don't want change it or write duplicated logic. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23165#discussion_r237513657 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -94,18 +94,34 @@ object PartitioningUtils { paths: Seq[Path], typeInference: Boolean, basePaths: Set[Path], + userSpecifiedSchema: Option[StructType], + caseSensitive: Boolean, timeZoneId: String): PartitionSpec = { -parsePartitions(paths, typeInference, basePaths, DateTimeUtils.getTimeZone(timeZoneId)) +parsePartitions(paths, typeInference, basePaths, userSpecifiedSchema, + caseSensitive, DateTimeUtils.getTimeZone(timeZoneId)) } private[datasources] def parsePartitions( paths: Seq[Path], typeInference: Boolean, basePaths: Set[Path], + userSpecifiedSchema: Option[StructType], + caseSensitive: Boolean, timeZone: TimeZone): PartitionSpec = { +val userSpecifiedDataTypes = if (userSpecifiedSchema.isDefined) { --- End diff -- Personally I prefer to make the parameter simple and easy to understand. So that the logic of caller(outside the `PartitioningUtils`) looks cleaner. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23165#discussion_r237512471 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala --- @@ -94,18 +94,34 @@ object PartitioningUtils { paths: Seq[Path], typeInference: Boolean, basePaths: Set[Path], + userSpecifiedSchema: Option[StructType], + caseSensitive: Boolean, timeZoneId: String): PartitionSpec = { -parsePartitions(paths, typeInference, basePaths, DateTimeUtils.getTimeZone(timeZoneId)) +parsePartitions(paths, typeInference, basePaths, userSpecifiedSchema, + caseSensitive, DateTimeUtils.getTimeZone(timeZoneId)) } private[datasources] def parsePartitions( paths: Seq[Path], typeInference: Boolean, basePaths: Set[Path], + userSpecifiedSchema: Option[StructType], + caseSensitive: Boolean, timeZone: TimeZone): PartitionSpec = { +val userSpecifiedDataTypes = if (userSpecifiedSchema.isDefined) { + val nameToDataType = userSpecifiedSchema.get.fields.map(f => f.name -> f.dataType).toMap + if (caseSensitive) { +CaseInsensitiveMap(nameToDataType) --- End diff -- Yes, thanks for pointing it out :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23165 @mgaido91 The user specified schema might not match the full data schema. For the missing columns, we still need to infer their data types. I will come up with a solution soon. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23165: [SPARK-26188][SQL] FileIndex: don't infer data types of ...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23165 @cloud-fan @mgaido91 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23165: [SPARK-26188][SQL] FileIndex: don't infer data ty...
GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/23165 [SPARK-26188][SQL] FileIndex: don't infer data types of partition columns if user specifies schema ## What changes were proposed in this pull request? This PR is to fix a regression introduced in: https://github.com/apache/spark/pull/21004/files#r236998030 If user specifies schema, Spark don't need to infer data type for of partition columns, otherwise the data type might not match with the one user provided. E.g. for partition directory `p=4d`, after data type inference the column value will be `4.0`. See https://issues.apache.org/jira/browse/SPARK-26188 for more details. ## How was this patch tested? Add unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/gengliangwang/spark fixFileIndex Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23165.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23165 commit 2866a9e1c1a7d42e6cf53474733c6f39e812c680 Author: Gengliang Wang Date: 2018-11-28T16:11:22Z fix --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIn...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/21004#discussion_r237092452 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala --- @@ -126,35 +126,32 @@ abstract class PartitioningAwareFileIndex( val caseInsensitiveOptions = CaseInsensitiveMap(parameters) val timeZoneId = caseInsensitiveOptions.get(DateTimeUtils.TIMEZONE_OPTION) .getOrElse(sparkSession.sessionState.conf.sessionLocalTimeZone) - -userPartitionSchema match { +val inferredPartitionSpec = PartitioningUtils.parsePartitions( + leafDirs, + typeInference = sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled, --- End diff -- @mgaido91 Thanks for the investigation!! I will fix it and add test case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23068 ping @vanzin . Please review this one :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23125: [SPARK-26156][WebUI] Revise summary section of st...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23125#discussion_r235995724 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala --- @@ -79,6 +79,9 @@ private[ui] class StagePage(parent: StagesTab, store: AppStatusStore) extends We localityNamesAndCounts.sorted.mkString("; ") } + private def jobURL(request: HttpServletRequest, jobId: Int): String = --- End diff -- Make sense. I have updated the code. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23125: [SPARK-26156][WebUI] Revise summary section of stage pag...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23125 @srowen @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23125: [SPARK-26156][WebUI] Revise summary section of st...
GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/23125 [SPARK-26156][WebUI] Revise summary section of stage page ## What changes were proposed in this pull request? In the summary section of stage page: ![image](https://user-images.githubusercontent.com/1097932/48935518-ebef2b00-ef42-11e8-8672-eaa4cac92c5e.png) 1. the following metrics names can be revised: Output => Output Size / Records Shuffle Read: => Shuffle Read Size / Records Shuffle Write => Shuffle Write Size / Records After changes, the names are more clear, and consistent with the other names in the same page. 2. The associated job id URL should not contain the 3 tails spaces. Reduce the number of spaces to one, and exclude the space from link. This is consistent with SQL execution page. ## How was this patch tested? Manual check: ![image](https://user-images.githubusercontent.com/1097932/48935538-f7425680-ef42-11e8-8b2a-a4f388d3ea52.png) You can merge this pull request into a Git repository by running: $ git pull https://github.com/gengliangwang/spark reviseStagePage Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23125.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23125 commit 215d3d7b8e187716d68a67a9b16d9eb444a8099f Author: Gengliang Wang Date: 2018-11-23T09:04:38Z revise stage page --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23098: [WIP][SPARK-26132][BUILD][CORE] Remove support fo...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23098#discussion_r235317414 --- Diff: bin/load-spark-env.cmd --- @@ -21,37 +21,42 @@ rem This script loads spark-env.cmd if it exists, and ensures it is only loaded rem spark-env.cmd is loaded from SPARK_CONF_DIR if set, or within the current directory's rem conf\ subdirectory. +set SPARK_ENV_CMD="spark-env.cmd" if [%SPARK_ENV_LOADED%] == [] ( set SPARK_ENV_LOADED=1 if [%SPARK_CONF_DIR%] == [] ( set SPARK_CONF_DIR=%~dp0..\conf ) - call :LoadSparkEnv + set SPARK_ENV_CMD="%SPARK_CONF_DIR%/%SPARK_ENV_CMD%" + if exist "%SPARK_ENV_CMD%" ( +call "%SPARK_ENV_CMD%" + ) ) rem Setting SPARK_SCALA_VERSION if not already set. -set ASSEMBLY_DIR2="%SPARK_HOME%\assembly\target\scala-2.11" -set ASSEMBLY_DIR1="%SPARK_HOME%\assembly\target\scala-2.12" - -if [%SPARK_SCALA_VERSION%] == [] ( - - if exist %ASSEMBLY_DIR2% if exist %ASSEMBLY_DIR1% ( -echo "Presence of build for multiple Scala versions detected." -echo "Either clean one of them or, set SPARK_SCALA_VERSION in spark-env.cmd." -exit 1 - ) - if exist %ASSEMBLY_DIR2% ( -set SPARK_SCALA_VERSION=2.11 - ) else ( -set SPARK_SCALA_VERSION=2.12 - ) -) +rem TODO: revisit for Scala 2.13 support +set SPARK_SCALA_VERSION=2.12 --- End diff -- I am not familiar with .cmd script. Should we keep the quote here, `"2.12"` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23068 The web UI presents "jobs" tab as the default view. Showing the related SQL context is helpful, especially for new users. I think the downside is quite little. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23081: [SPARK-26109][WebUI]Duration in the task summary ...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23081#discussion_r235071529 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala --- @@ -996,7 +996,7 @@ private[ui] object ApiHelper { HEADER_EXECUTOR -> TaskIndexNames.EXECUTOR, HEADER_HOST -> TaskIndexNames.HOST, HEADER_LAUNCH_TIME -> TaskIndexNames.LAUNCH_TIME, -HEADER_DURATION -> TaskIndexNames.DURATION, +HEADER_DURATION -> TaskIndexNames.EXEC_RUN_TIME, --- End diff -- Not big deal but there are two spaces after `SPARK-26109:` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23081: [SPARK-26109][WebUI]Duration in the task summary ...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23081#discussion_r235065240 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala --- @@ -996,7 +996,7 @@ private[ui] object ApiHelper { HEADER_EXECUTOR -> TaskIndexNames.EXECUTOR, HEADER_HOST -> TaskIndexNames.HOST, HEADER_LAUNCH_TIME -> TaskIndexNames.LAUNCH_TIME, -HEADER_DURATION -> TaskIndexNames.DURATION, +HEADER_DURATION -> TaskIndexNames.EXEC_RUN_TIME, --- End diff -- Nit: add comment to explain why it should be executor run time here --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23081: [SPARK-26109][WebUI]Duration in the task summary metrics...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23081 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23068 @srowen --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23067: [SPARK-26097][Web UI] Add the new partitioning descripti...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23067 +1. We can also try overriding `toString` for better output. @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23088: [SPARK-26119][CORE][WEBUI]Task summary table should cont...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23088 As the task section shows all the tasks, I think maybe it would be better to show aggregated metrics for all the tasks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23088: [SPARK-26119][CORE][WEBUI]Task summary table shou...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23088#discussion_r235027227 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala --- @@ -238,8 +239,16 @@ private[spark] class AppStatusStore( val diff = idx - currentIdx --- End diff -- `diff` could be negative here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23038: [SPARK-25451][SPARK-26100][CORE]Aggregated metric...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23038#discussion_r234728748 --- Diff: core/src/test/scala/org/apache/spark/status/AppStatusListenerSuite.scala --- @@ -1275,6 +1275,49 @@ class AppStatusListenerSuite extends SparkFunSuite with BeforeAndAfter { assert(allJobs.head.numFailedStages == 1) } + test("SPARK-25451: total tasks in the executor summary should match total stage tasks") { +val testConf = conf.clone.set(LIVE_ENTITY_UPDATE_PERIOD, Long.MaxValue) + +val listener = new AppStatusListener(store, testConf, true) + +val stage = new StageInfo(1, 0, "stage", 4, Nil, Nil, "details") +listener.onJobStart(SparkListenerJobStart(1, time, Seq(stage), null)) +listener.onStageSubmitted(SparkListenerStageSubmitted(stage, new Properties())) + +val tasks = createTasks(4, Array("1", "2")) +tasks.foreach { task => + listener.onTaskStart(SparkListenerTaskStart(stage.stageId, stage.attemptNumber, task)) +} + +time += 1 +tasks(0).markFinished(TaskState.FAILED, time) +listener.onTaskEnd(SparkListenerTaskEnd(stage.stageId, stage.attemptId, "taskType", + ExecutorLostFailure("1", true, Some("Lost executor")), tasks(0), null)) +time += 1 +tasks(1).markFinished(TaskState.FAILED, time) +listener.onTaskEnd(SparkListenerTaskEnd(stage.stageId, stage.attemptId, "taskType", + ExecutorLostFailure("2", true, Some("Lost executor")), tasks(1), null)) + +stage.failureReason = Some("Failed") +listener.onStageCompleted(SparkListenerStageCompleted(stage)) +time += 1 +listener.onJobEnd(SparkListenerJobEnd(1, time, JobFailed(new RuntimeException("Bad Executor" + +time += 1 +tasks(2).markFinished(TaskState.FAILED, time) +listener.onTaskEnd(SparkListenerTaskEnd(stage.stageId, stage.attemptId, "taskType", + ExecutorLostFailure("1", true, Some("Lost executor")), tasks(2), null)) +time += 1 +tasks(3).markFinished(TaskState.FAILED, time) +listener.onTaskEnd(SparkListenerTaskEnd(stage.stageId, stage.attemptId, "taskType", + ExecutorLostFailure("2", true, Some("Lost executor")), tasks(3), null)) + +val esummary = store.view(classOf[ExecutorStageSummaryWrapper]).asScala.map(_.info) +esummary.foreach { execSummary => + assert(execSummary.failedTasks == 2) --- End diff -- Nit: also check `succeededTasks` and `killedTasks` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23065 I tried and it works. There is a similar warning in UnionRDD.scala, which will cause failure in Scala 2.11. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23065 Hi @srowen , Could you review and merge https://github.com/srowen/spark/pull/4 ? I see a lot of warnings as well. We should fix them. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23068 @vanzin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job pa...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23068 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23068: [SPARK-26098][WebUI] Show associated SQL query in...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23068#discussion_r234402149 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala --- @@ -56,6 +56,11 @@ private[spark] class AppStatusStore( store.read(classOf[JobDataWrapper], jobId).info } + def jobWithAssociatedSql(jobId: Int): (v1.JobData, Option[Long]) = { --- End diff -- Return JobData and sqlExcuctionId within one look up. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23068: [SPARK-26098][WebUI] Show associated SQL query in...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23068#discussion_r234402077 --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusListener.scala --- @@ -70,6 +70,8 @@ private[spark] class AppStatusListener( private val liveTasks = new HashMap[Long, LiveTask]() private val liveRDDs = new HashMap[Int, LiveRDD]() private val pools = new HashMap[String, SchedulerPool]() + + private val SQL_EXECUTION_ID_KEY = "spark.sql.execution.id" --- End diff -- Here we can't access `SQLExecution.EXECUTION_ID_KEY`. So use a variable for the key string. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23068: [SPARK-26098][WebUI] Show associated SQL query in...
GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/23068 [SPARK-26098][WebUI] Show associated SQL query in Job page ## What changes were proposed in this pull request? For jobs associated to SQL queries, it would be easier to understand the context to showing the SQL query in Job detail page. Before code change: ![image](https://user-images.githubusercontent.com/1097932/48659359-96baa180-ea8a-11e8-8419-a0a87c3f30fc.png) After code change: ![image](https://user-images.githubusercontent.com/1097932/48659390-26f8e680-ea8b-11e8-8fdd-3b58909ea364.png) We can see the whole context when in the associated SQL detail page: ![Uploading image.pngâ¦]() For Jobs don't have associated SQL query, the text won't be shown. ## How was this patch tested? Manual test You can merge this pull request into a Git repository by running: $ git pull https://github.com/gengliangwang/spark addSQLID Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23068.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23068 commit e7c2ebbda949918034cb9cb92ac6ef30af17d943 Author: Gengliang Wang Date: 2018-11-17T05:42:26Z add sql id --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23049: [SPARK-26076][Build][Minor] Revise ambiguous error messa...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23049 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23049: [SPARK-26076][Build][Minor] Revise ambiguous error messa...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23049 @vanzin I see your point. I will add a link to https://spark.apache.org/docs/latest/configuration.html. Thanks for the suggestion. In my case, I didn't know where to find or edit `spark-env.sh` at that time. I tried run `find . -name spark-env.sh` and got nothing. The script will only print `spark-env.sh` without its location only if SPARK_ENV_LOADED is not set. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23049: [SPARK-26076][Build][Minor] Revise ambiguous error messa...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23049 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23047: [BACKPORT][SPARK-25883][SQL][MINOR] Override meth...
Github user gengliangwang closed the pull request at: https://github.com/apache/spark/pull/23047 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23049: [SPARK-26076][Build][Minor] Revise ambiguous error messa...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23049 Hi @vanzin , thanks for pointing it out! I have updated the script and PR description. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23049: [SPARK-26076][Build][Minor] Revise ambiguous erro...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23049#discussion_r233902416 --- Diff: bin/load-spark-env.sh --- @@ -47,8 +47,8 @@ if [ -z "$SPARK_SCALA_VERSION" ]; then ASSEMBLY_DIR1="${SPARK_HOME}/assembly/target/scala-2.12" if [[ -d "$ASSEMBLY_DIR2" && -d "$ASSEMBLY_DIR1" ]]; then -echo -e "Presence of build for multiple Scala versions detected." 1>&2 -echo -e 'Either clean one of them or, export SPARK_SCALA_VERSION in spark-env.sh.' 1>&2 +echo -e "Presence of build for both scala versions(SCALA 2.11 and SCALA 2.12) detected." 1>&2 --- End diff -- Make sense. Let me have a quick update --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23049: [SPARK-26076][Build] Revise ambiguous error message from...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/23049 @srowen --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23049: [SPARK-26076][Build] Revise ambiguous error messa...
GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/23049 [SPARK-26076][Build] Revise ambiguous error message from load-spark-env.sh ## What changes were proposed in this pull request? When I try to run scripts (e.g. `./sbin/start-history-server.sh -h` in latest master, I got such error: ``` Presence of build for multiple Scala versions detected. Either clean one of them or, export SPARK_SCALA_VERSION in spark-env.sh. ``` The error message is quite confusing and there is no `spark-env.sh` in our code base. As now with https://github.com/apache/spark/pull/22967, we can revise the error message as following: ``` Presence of build for both scala versions(SCALA 2.11 and SCALA 2.12) detected. Either clean one of them or, export SPARK_SCALA_VERSION=2.12 in load-spark-env.sh. ## How was this patch tested? Manual test You can merge this pull request into a Git repository by running: $ git pull https://github.com/gengliangwang/spark reviseEnvScript Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23049.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23049 commit bf94264a3e037e00a7b7111a677467de980071c9 Author: Gengliang Wang Date: 2018-11-15T15:43:44Z fix --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23047: [SPARK-25883][SQL][MINOR] Override method `pretty...
GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/23047 [SPARK-25883][SQL][MINOR] Override method `prettyName` in `from_avro`/`to_avro` ## What changes were proposed in this pull request? Previously in from_avro/to_avro, we override the method `simpleString` and `sql` for the string output. However, the override only affects the alias naming: ``` Project [from_avro('col, ... , (mode,PERMISSIVE)) AS from_avro(col, struct, Map(mode -> PERMISSIVE))#11] ``` It only makes the alias name quite long: `from_avro(col, struct, Map(mode -> PERMISSIVE))`). We should follow `from_csv`/`from_json` here, to override the method prettyName only, and we will get a clean alias name ``` ... AS from_avro(col)#11 ``` ## How was this patch tested? Manual check You can merge this pull request into a Git repository by running: $ git pull https://github.com/gengliangwang/spark backport_avro_pretty_name Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23047.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23047 commit 3323fdb87120b8783a17728dee26dcb738451231 Author: Gengliang Wang Date: 2018-10-31T06:59:37Z [SPARK-25883][SQL][MINOR] Override method `prettyName` in `from_avro`/`to_avro` Previously in from_avro/to_avro, we override the method `simpleString` and `sql` for the string output. However, the override only affects the alias naming: ``` Project [from_avro('col, ... , (mode,PERMISSIVE)) AS from_avro(col, struct, Map(mode -> PERMISSIVE))#11] ``` It only makes the alias name quite long: `from_avro(col, struct, Map(mode -> PERMISSIVE))`). We should follow `from_csv`/`from_json` here, to override the method prettyName only, and we will get a clean alias name ``` ... AS from_avro(col)#11 ``` Manual check Closes #22890 from gengliangwang/revise_from_to_avro. Authored-by: Gengliang Wang Signed-off-by: gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22890: [SPARK-25883][SQL][Minor] Override method `prettyName` i...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22890 This PR turn out to be a bug fix for this issue: https://issues.apache.org/jira/browse/SPARK-26063 Back port this to branch-2.4. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21688: [SPARK-21809] : Change Stage Page to use datatables to s...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/21688 Hi @pgandhi999 Thanks for the work. One minor comment here: Currently the table header looks like this ![image](https://user-images.githubusercontent.com/1097932/48502853-841e4d80-e87b-11e8-8d14-291869d25719.png) The arrow png file is on the most right side, which can be confusing. In other tables, we are using U+25BE and U+25B4, right next to the text, which is more straightforward: ![image](https://user-images.githubusercontent.com/1097932/48503125-3fdf7d00-e87c-11e8-9d7f-d30d1411df8e.png) Can you adjust the position of the png files, or use the unicode instead to keep consistency. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22966 @dongjoon-hyun sure. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22966#discussion_r233097082 --- Diff: external/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroReadBenchmark.scala --- @@ -0,0 +1,226 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.Random + +import org.apache.spark.SparkConf +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase} +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.catalyst.plans.SQLHelper +import org.apache.spark.sql.types._ + +/** + * Benchmark to measure Avro read performance. + * {{{ + * To run this benchmark: + * 1. without sbt: bin/spark-submit --class + *--jars , + * 2. build/sbt "avro/test:runMain " + * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "avro/test:runMain " + * Results will be written to "benchmarks/AvroReadBenchmark-results.txt". + * }}} + */ +object AvroReadBenchmark extends BenchmarkBase with SQLHelper { --- End diff -- I see. I have updated this one. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23002: [SPARK-26003] Improve SQLAppStatusListener.aggreg...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23002#discussion_r232713853 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala --- @@ -159,7 +159,7 @@ class SQLAppStatusListener( } private def aggregateMetrics(exec: LiveExecutionData): Map[Long, String] = { -val metricIds = exec.metrics.map(_.accumulatorId).sorted +val metricIds = exec.metrics.map(_.accumulatorId).toSet val metricTypes = exec.metrics.map { m => (m.accumulatorId, m.metricType) }.toMap val metrics = exec.stages.toSeq .flatMap { stageId => Option(stageMetrics.get(stageId)) } --- End diff -- I am also fine with the current code here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23002: [SPARK-26003] Improve SQLAppStatusListener.aggreg...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23002#discussion_r232713761 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala --- @@ -159,7 +159,7 @@ class SQLAppStatusListener( } private def aggregateMetrics(exec: LiveExecutionData): Map[Long, String] = { -val metricIds = exec.metrics.map(_.accumulatorId).sorted +val metricIds = exec.metrics.map(_.accumulatorId).toSet val metricTypes = exec.metrics.map { m => (m.accumulatorId, m.metricType) }.toMap val metrics = exec.stages.toSeq .flatMap { stageId => Option(stageMetrics.get(stageId)) } --- End diff -- If the metrics is large, then using a while loop can reduce the number of traversal loops. And it is not complicated to do it in the code here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23002: [SPARK-26003] Improve SQLAppStatusListener.aggreg...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23002#discussion_r232706992 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala --- @@ -159,7 +159,7 @@ class SQLAppStatusListener( } private def aggregateMetrics(exec: LiveExecutionData): Map[Long, String] = { -val metricIds = exec.metrics.map(_.accumulatorId).sorted +val metricIds = exec.metrics.map(_.accumulatorId).toSet val metricTypes = exec.metrics.map { m => (m.accumulatorId, m.metricType) }.toMap val metrics = exec.stages.toSeq .flatMap { stageId => Option(stageMetrics.get(stageId)) } --- End diff -- Consider also change the following `flatMap` / `filter` / `groupBy` into `while` loop --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23002: [SPARK-26003] Improve SQLAppStatusListener.aggreg...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/23002#discussion_r232706543 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLAppStatusListener.scala --- @@ -159,7 +159,7 @@ class SQLAppStatusListener( } private def aggregateMetrics(exec: LiveExecutionData): Map[Long, String] = { -val metricIds = exec.metrics.map(_.accumulatorId).sorted +val metricIds = exec.metrics.map(_.accumulatorId).toSet --- End diff -- Actually this one can be merged into `metricTypes`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22966#discussion_r232550388 --- Diff: external/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroReadBenchmark.scala --- @@ -0,0 +1,226 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.Random + +import org.apache.spark.SparkConf +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase} +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.catalyst.plans.SQLHelper +import org.apache.spark.sql.types._ + +/** + * Benchmark to measure Avro read performance. + * {{{ + * To run this benchmark: + * 1. without sbt: bin/spark-submit --class + *--jars , + * 2. build/sbt "avro/test:runMain " + * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "avro/test:runMain " + * Results will be written to "benchmarks/AvroReadBenchmark-results.txt". + * }}} + */ +object AvroReadBenchmark extends BenchmarkBase with SQLHelper { --- End diff -- @dongjoon-hyun OK, then I think this one is ready. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22966#discussion_r232461509 --- Diff: external/avro/benchmarks/AvroReadBenchmark-results.txt --- @@ -0,0 +1,122 @@ + +SQL Single Numeric Column Scan + + +Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6 +Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz +SQL Single TINYINT Column Scan: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +Sum 2013 / 2071 7.8 128.0 1.0X + +Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6 +Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz +SQL Single SMALLINT Column Scan: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative + +Sum 1955 / 1957 8.0 124.3 1.0X --- End diff -- Actually all these INT types are processed as INT. The difference is cause by JIT. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22966#discussion_r232272074 --- Diff: external/avro/src/test/scala/org/apache/spark/sql/execution/benchmark/AvroReadBenchmark.scala --- @@ -0,0 +1,226 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.benchmark + +import java.io.File + +import scala.util.Random + +import org.apache.spark.SparkConf +import org.apache.spark.benchmark.{Benchmark, BenchmarkBase} +import org.apache.spark.sql.{DataFrame, SparkSession} +import org.apache.spark.sql.catalyst.plans.SQLHelper +import org.apache.spark.sql.types._ + +/** + * Benchmark to measure Avro read performance. + * {{{ + * To run this benchmark: + * 1. without sbt: bin/spark-submit --class + *--jars , + * 2. build/sbt "avro/test:runMain " + * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "avro/test:runMain " + * Results will be written to "benchmarks/AvroReadBenchmark-results.txt". + * }}} + */ +object AvroReadBenchmark extends BenchmarkBase with SQLHelper { --- End diff -- This is following `DataSourceReadBenchmark` and `OrcReadBenchmark`. Should I change them as well? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22966 @dongjoon-hyun I think we can merge this one first. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22987: [SPARK-25979][SQL] Window function: allow parentheses ar...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22987 @gatorsmile @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22987: [SPARK-25979][SQL] Window function: allow parenth...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22987#discussion_r231962213 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/SQLWindowFunctionSuite.scala --- @@ -31,6 +32,19 @@ class SQLWindowFunctionSuite extends QueryTest with SharedSQLContext { import testImplicits._ + val empSalaryData = Seq( --- End diff -- The table and data here is from: https://www.postgresql.org/docs/9.1/tutorial-window.html The naming and data is easier to be understood. In case we need to compare more behavior, we can use this one as alternative. I can use the `WindowData` and remove this as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22987: [SPARK-25979][SQL] Window function: allow parenth...
GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/22987 [SPARK-25979][SQL] Window function: allow parentheses around window reference ## What changes were proposed in this pull request? Very minor parser bug, but possibly problematic for code-generated queries: Consider the following two queries: ``` SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 1 ``` and ``` SELECT avg(k) OVER w FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 1 ``` The former, with parens around the OVER condition, fails to parse while the latter, without parens, succeeds: ``` Error in SQL statement: ParseException: mismatched input '(' expecting {, ',', 'FROM', 'WHERE', 'GROUP', 'ORDER', 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 19) == SQL == SELECT avg(k) OVER (w) FROM kv WINDOW w AS (PARTITION BY v ORDER BY w) ORDER BY 1 ---^^^ ``` This was found when running the cockroach DB tests. I tried PostgreSQL, The SQL with parentheses is also workable. ## How was this patch tested? Unit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/gengliangwang/spark windowParentheses Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22987.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22987 commit c9d01df0ee9851751469e231069ab587975f5a7c Author: Gengliang Wang Date: 2018-11-08T16:07:46Z fix --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark/DataSo...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22965 @dongjoon-hyun sure, done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22965#discussion_r231795384 --- Diff: sql/core/benchmarks/DataSourceReadBenchmark-results.txt --- @@ -2,268 +2,268 @@ SQL Single Numeric Column Scan -OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64 -Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6 +Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz SQL Single TINYINT Column Scan: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative -SQL CSV 21508 / 22112 0.7 1367.5 1.0X -SQL Json 8705 / 8825 1.8 553.4 2.5X -SQL Parquet Vectorized 157 / 186100.0 10.0 136.7X -SQL Parquet MR1789 / 1794 8.8 113.8 12.0X -SQL ORC Vectorized 156 / 166100.9 9.9 138.0X -SQL ORC Vectorized with copy 218 / 225 72.1 13.9 98.6X -SQL ORC MR1448 / 1492 10.9 92.0 14.9X - -OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64 -Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +SQL CSV 14108 / 14263 1.1 896.9 1.0X +SQL Json 5477 / 5509 2.9 348.2 2.6X +SQL Parquet Vectorized 115 / 125137.1 7.3 122.9X +SQL Parquet MR1318 / 1332 11.9 83.8 10.7X +SQL ORC Vectorized 150 / 159104.9 9.5 94.1X +SQL ORC Vectorized with copy 206 / 208 76.4 13.1 68.5X +SQL ORC MR1072 / 1075 14.7 68.1 13.2X + +Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6 +Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz Parquet Reader Single TINYINT Column Scan: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative -ParquetReader Vectorized 202 / 211 77.7 12.9 1.0X -ParquetReader Vectorized -> Row118 / 120133.5 7.5 1.7X +ParquetReader Vectorized 138 / 152114.0 8.8 1.0X +ParquetReader Vectorized -> Row 80 / 87197.2 5.1 1.7X -OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64 -Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6 +Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz SQL Single SMALLINT Column Scan: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative -SQL CSV 23282 / 23312 0.7 1480.2 1.0X -SQL Json 9187 / 9189 1.7 584.1 2.5X -SQL Parquet Vectorized 204 / 218 77.0 13.0 114.0X -SQL Parquet MR1941 / 1953 8.1 123.4 12.0X -SQL ORC Vectorized 217 / 225 72.6 13.8 107.5X -SQL ORC Vectorized with copy 279 / 289 56.3 17.8 83.4X -SQL ORC MR1541 / 1549 10.2 98.0 15.1X - -OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64 -Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +SQL CSV 14495 / 14507 1.1 921.6 1.0X +SQL Json 5615 / 5668 2.8 357.0 2.6X +SQL Parquet Vectorized 147 / 154107.4 9.3 98.9X +SQL Parquet MR1431 / 1454 11.0 91.0 10.1X +SQL ORC Vectorized 170 / 175 92.4 10.8 85.1X +SQL ORC Vectorized wit
[GitHub] spark issue #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark/DataSo...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22965 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22966 Cool, could you introduce it to Spark? That would be very helpful :) @dbtsai @jleach4 and @aokolnychyi --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22966: [PARK-25965][SQL][TEST] Add avro read benchmark
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22966 @dbtsai Great! I was thinking the benchmark in this PR is kind of simple, so I didn't add it for over months.. The benchmark you mentioned should also workable for other data sources, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22965#discussion_r231766852 --- Diff: sql/core/benchmarks/DataSourceReadBenchmark-results.txt --- @@ -2,268 +2,268 @@ SQL Single Numeric Column Scan -OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64 -Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6 +Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz SQL Single TINYINT Column Scan: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative -SQL CSV 21508 / 22112 0.7 1367.5 1.0X -SQL Json 8705 / 8825 1.8 553.4 2.5X -SQL Parquet Vectorized 157 / 186100.0 10.0 136.7X -SQL Parquet MR1789 / 1794 8.8 113.8 12.0X -SQL ORC Vectorized 156 / 166100.9 9.9 138.0X -SQL ORC Vectorized with copy 218 / 225 72.1 13.9 98.6X -SQL ORC MR1448 / 1492 10.9 92.0 14.9X - -OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64 -Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +SQL CSV 15974 / 16222 1.0 1015.6 1.0X +SQL Json 5917 / 6174 2.7 376.2 2.7X +SQL Parquet Vectorized 115 / 128136.8 7.3 138.9X +SQL Parquet MR1459 / 1571 10.8 92.8 10.9X +SQL ORC Vectorized 164 / 194 95.8 10.4 97.3X +SQL ORC Vectorized with copy 204 / 303 77.2 12.9 78.4X +SQL ORC MR1095 / 1143 14.4 69.6 14.6X + +Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6 +Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz Parquet Reader Single TINYINT Column Scan: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative -ParquetReader Vectorized 202 / 211 77.7 12.9 1.0X -ParquetReader Vectorized -> Row118 / 120133.5 7.5 1.7X +ParquetReader Vectorized 139 / 156113.1 8.8 1.0X +ParquetReader Vectorized -> Row 83 / 89188.7 5.3 1.7X -OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64 -Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6 +Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz SQL Single SMALLINT Column Scan: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative -SQL CSV 23282 / 23312 0.7 1480.2 1.0X -SQL Json 9187 / 9189 1.7 584.1 2.5X -SQL Parquet Vectorized 204 / 218 77.0 13.0 114.0X -SQL Parquet MR1941 / 1953 8.1 123.4 12.0X -SQL ORC Vectorized 217 / 225 72.6 13.8 107.5X -SQL ORC Vectorized with copy 279 / 289 56.3 17.8 83.4X -SQL ORC MR1541 / 1549 10.2 98.0 15.1X - -OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64 -Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +SQL CSV 16394 / 16643 1.0 1042.3 1.0X +SQL Json 6014 / 6020 2.6 382.4 2.7X +SQL Parquet Vectorized 147 / 155106.9 9.4 111.4X +SQL Parquet MR1575 / 1581 10.0 100.1 10.4X +SQL ORC Vectorized 168 / 173 93.9 10.7 97.9X +SQL ORC Vectorized wit
[GitHub] spark pull request #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22965#discussion_r231765680 --- Diff: sql/core/benchmarks/DataSourceReadBenchmark-results.txt --- @@ -2,268 +2,268 @@ SQL Single Numeric Column Scan -OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64 -Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6 +Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz SQL Single TINYINT Column Scan: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative -SQL CSV 21508 / 22112 0.7 1367.5 1.0X -SQL Json 8705 / 8825 1.8 553.4 2.5X -SQL Parquet Vectorized 157 / 186100.0 10.0 136.7X -SQL Parquet MR1789 / 1794 8.8 113.8 12.0X -SQL ORC Vectorized 156 / 166100.9 9.9 138.0X -SQL ORC Vectorized with copy 218 / 225 72.1 13.9 98.6X -SQL ORC MR1448 / 1492 10.9 92.0 14.9X - -OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64 -Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +SQL CSV 15974 / 16222 1.0 1015.6 1.0X +SQL Json 5917 / 6174 2.7 376.2 2.7X +SQL Parquet Vectorized 115 / 128136.8 7.3 138.9X +SQL Parquet MR1459 / 1571 10.8 92.8 10.9X +SQL ORC Vectorized 164 / 194 95.8 10.4 97.3X +SQL ORC Vectorized with copy 204 / 303 77.2 12.9 78.4X +SQL ORC MR1095 / 1143 14.4 69.6 14.6X + +Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6 +Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz Parquet Reader Single TINYINT Column Scan: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative -ParquetReader Vectorized 202 / 211 77.7 12.9 1.0X -ParquetReader Vectorized -> Row118 / 120133.5 7.5 1.7X +ParquetReader Vectorized 139 / 156113.1 8.8 1.0X +ParquetReader Vectorized -> Row 83 / 89188.7 5.3 1.7X -OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64 -Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.13.6 +Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz SQL Single SMALLINT Column Scan: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative -SQL CSV 23282 / 23312 0.7 1480.2 1.0X -SQL Json 9187 / 9189 1.7 584.1 2.5X -SQL Parquet Vectorized 204 / 218 77.0 13.0 114.0X -SQL Parquet MR1941 / 1953 8.1 123.4 12.0X -SQL ORC Vectorized 217 / 225 72.6 13.8 107.5X -SQL ORC Vectorized with copy 279 / 289 56.3 17.8 83.4X -SQL ORC MR1541 / 1549 10.2 98.0 15.1X - -OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64 -Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz +SQL CSV 16394 / 16643 1.0 1042.3 1.0X +SQL Json 6014 / 6020 2.6 382.4 2.7X +SQL Parquet Vectorized 147 / 155106.9 9.4 111.4X +SQL Parquet MR1575 / 1581 10.0 100.1 10.4X +SQL ORC Vectorized 168 / 173 93.9 10.7 97.9X +SQL ORC Vectorized wit
[GitHub] spark issue #22966: [PARK-25965][SQL][TEST] Add avro read benchmark
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22966 Done, @dongjoon-hyun PTAL. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark/DataSo...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22965 @dongjoon-hyun @yucai --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22966: [PARK-25965][SQL] Add avro read benchmark
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22966 @dongjoon-hyun --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22965#discussion_r231549251 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala --- @@ -32,9 +32,11 @@ import org.apache.spark.sql.types._ * Benchmark to measure ORC read performance. * {{{ * To run this benchmark: - * 1. without sbt: bin/spark-submit --class - * 2. build/sbt "sql/test:runMain " - * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain " + * 1. without sbt: bin/spark-submit --class + *--jars --- End diff -- The jars here are built by sbt. I am surprise that 5 jars are required. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22966: [PARK-25965][SQL] Add avro read benchmark
GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/22966 [PARK-25965][SQL] Add avro read benchmark ## What changes were proposed in this pull request? Add read benchmark for Avro, which is missing for a period. The benchmark is similar to `DataSourceReadBenchmark` and `OrcReadBenchmark` ## How was this patch tested? Manually run benchmark You can merge this pull request into a Git repository by running: $ git pull https://github.com/gengliangwang/spark avroReadBenchmark Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22966.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22966 commit 2713d6e83e5349ba8237a2c680665fb180d14e94 Author: Gengliang Wang Date: 2018-11-07T09:09:20Z add avro read benchmark --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22965#discussion_r231541613 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala --- @@ -266,8 +268,9 @@ object OrcReadBenchmark extends BenchmarkBase with SQLHelper { s"SELECT IF(RAND(1) < $fractionOfNulls, NULL, CAST(id as STRING)) AS c1, " + s"IF(RAND(2) < $fractionOfNulls, NULL, CAST(id as STRING)) AS c2 FROM t1")) --- End diff -- @maropu It is trivial, but why it is RAND(2) here --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22965: [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark...
GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/22965 [SPARK-25964][SQL][Minor] Revise OrcReadBenchmark/DataSourceReadBenchmark case names and execution instructions ## What changes were proposed in this pull request? 1. OrcReadBenchmark is under hive module, so the way to run it should be ``` build/sbt "hive/test:runMain " ``` 2. The benchmark "String with Nulls Scan" should be with case "String with Nulls Scan(5%/50%/95%)", not "(0.05%/0.5%/0.95%)" 3. Add the null value percentages in the test case names of DataSourceReadBenchmark, for the benchmark "String with Nulls Scan" . ## How was this patch tested? Re-run benchmarks You can merge this pull request into a Git repository by running: $ git pull https://github.com/gengliangwang/spark fixHiveOrcReadBenchmark Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22965.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22965 commit f282331d12975687391a7648aacde19a58774936 Author: Gengliang Wang Date: 2018-11-07T13:03:05Z fix --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22914 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22878: [SPARK-25789][SQL] Support for Dataset of Avro
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22878#discussion_r230041592 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroEncoder.scala --- @@ -0,0 +1,533 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.avro + +import java.io._ +import java.util.{Map => JMap} + +import scala.collection.JavaConverters._ +import scala.language.existentials +import scala.reflect.ClassTag + +import org.apache.avro.Schema +import org.apache.avro.Schema.Parser +import org.apache.avro.Schema.Type._ +import org.apache.avro.generic.{GenericData, IndexedRecord} +import org.apache.avro.reflect.ReflectData +import org.apache.avro.specific.SpecificRecord + +import org.apache.spark.sql.Encoder +import org.apache.spark.sql.avro.SchemaConverters._ +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.analysis.{GetColumnByOrdinal, UnresolvedExtractValue} +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.codegen._ +import org.apache.spark.sql.catalyst.expressions.codegen.Block._ +import org.apache.spark.sql.catalyst.expressions.objects.{LambdaVariable => _, _} +import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, GenericArrayData} +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +/** + * A Spark-SQL Encoder for Avro objects + */ +object AvroEncoder { + /** + * Provides an Encoder for Avro objects of the given class + * + * @param avroClass the class of the Avro object for which to generate the Encoder + * @tparam T the type of the Avro class, must implement SpecificRecord + * @return an Encoder for the given Avro class + */ + def of[T <: SpecificRecord](avroClass: Class[T]): Encoder[T] = { +AvroExpressionEncoder.of(avroClass) + } + /** + * Provides an Encoder for Avro objects implementing the given schema + * + * @param avroSchema the Schema of the Avro object for which to generate the Encoder + * @tparam T the type of the Avro class that implements the Schema, must implement IndexedRecord + * @return an Encoder for the given Avro Schema + */ + def of[T <: IndexedRecord](avroSchema: Schema): Encoder[T] = { --- End diff -- In `from_avro`, we are using avro schema in json format string, should we consider change to that? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22914 > Returning to current page seems more code change is required. Because we are getting all the parameters related to page from the url. Currently we are not passing any parameter to get which page the user was on previously. I come up with a new idea: how about navigate to the last page(max page number)? That makes more sense IMHO. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22864: [SPARK-25861][Minor][WEBUI] Remove unused refreshInterva...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22864 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22878: [SPARK-25789][SQL] Support for Dataset of Avro
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22878#discussion_r229952971 --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala --- @@ -1374,4 +1377,185 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils { |} """.stripMargin) } + + test("generic record converts to row and back") { --- End diff -- I think we can move these test cases to a new suite. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22878: [SPARK-25789][SQL] Support for Dataset of Avro
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22878#discussion_r229976669 --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala --- @@ -1374,4 +1377,185 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils { |} """.stripMargin) } + + test("generic record converts to row and back") { +val nested = + SchemaBuilder.record("simple_record").fields() +.name("nested1").`type`("int").withDefault(0) +.name("nested2").`type`("string").withDefault("string").endRecord() +val mapDefault = new java.util.HashMap[String, String]() +mapDefault.put("a", "A") +val schema = SchemaBuilder.record("record").fields() --- End diff -- We can use a json string to present the schema. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22914 > May be we can highlight above the table, that "Invalid page number, falling back to first page" Yes, that's what I mean. No big deal but falling back to the first page seems unnecessary, we can return the same page again. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22914: [SPARK-25900][WEBUI]When the page number is more than th...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22914 I prefer to just highlight the invalid output. E.g. ![image](https://user-images.githubusercontent.com/1097932/47831557-0e6ea800-ddcc-11e8-9fd1-c4d29f944c9d.png) ![image](https://user-images.githubusercontent.com/1097932/47831561-11699880-ddcc-11e8-9d83-20e49a0c517d.png) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22895: [SPARK-25886][SQL][Minor] Improve error message o...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22895#discussion_r229564121 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala --- @@ -100,9 +100,14 @@ case class AvroDataToCatalyst( case NonFatal(e) => parseMode match { case PermissiveMode => nullResultRow case FailFastMode => - throw new SparkException("Malformed records are detected in record parsing. " + + val msg = if (e.getMessage != null) { +e.getMessage + "\n" + } else { +"" + } + throw new SparkException(msg + "Malformed records are detected in record parsing. " + s"Current parse Mode: ${FailFastMode.name}. To process malformed records as null " + -"result, try setting the option 'mode' as 'PERMISSIVE'.", e.getCause) +"result, try setting the option 'mode' as 'PERMISSIVE'.", e) --- End diff -- Agree, When I post the output, I have the same thought..Haha. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org