date:20171011

[GitHub] spark issue #19465: [SPARK-21988][SS]Implement StreamingRelation.computeStat...

2017-10-11 Thread joseph-torres

Github user joseph-torres commented on the issue:

https://github.com/apache/spark/pull/19465
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19269: [SPARK-22026][SQL][WIP] data source v2 write path

2017-10-11 Thread rdblue

Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/19269#discussion_r144097061
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataSourceV2Writer.java
 ---
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2.writer;
+
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SaveMode;
+import org.apache.spark.sql.sources.v2.DataSourceV2Options;
+import org.apache.spark.sql.sources.v2.WriteSupport;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * A data source writer that is returned by
+ * {@link WriteSupport#createWriter(StructType, SaveMode, 
DataSourceV2Options)}.
+ * It can mix in various writing optimization interfaces to speed up the 
data saving. The actual
+ * writing logic is delegated to {@link DataWriter}.
+ *
+ * The writing procedure is:
+ *   1. Create a writer factory by {@link #createWriterFactory()}, 
serialize and send it to all the
+ *  partitions of the input data(RDD).
+ *   2. For each partition, create the data writer, and write the data of 
the partition with this
+ *  writer. If all the data are written successfully, call {@link 
DataWriter#commit()}. If
+ *  exception happens during the writing, call {@link 
DataWriter#abort()}.
+ *   3. If all writers are successfully committed, call {@link 
#commit(WriterCommitMessage[])}. If
+ *  some writers are aborted, or the job failed with an unknown 
reason, call
+ *  {@link #abort(WriterCommitMessage[])}.
+ *
+ * Spark won't retry failed writing jobs, users should do it manually in 
their Spark applications if
+ * they want to retry.
+ *
+ * Please refer to the document of commit/abort methods for detailed 
specifications.
+ *
+ * Note that, this interface provides a protocol between Spark and data 
sources for transactional
+ * data writing, but the transaction here is Spark-level transaction, 
which may not be the
+ * underlying storage transaction. For example, Spark successfully writes 
data to a Cassandra data
+ * source, but Cassandra may need some more time to reach consistency at 
storage level.
+ */
+@InterfaceStability.Evolving
+public interface DataSourceV2Writer {
+
+  /**
+   * Creates a writer factory which will be serialized and sent to 
executors.
+   */
+  DataWriterFactory createWriterFactory();
+
+  /**
+   * Commits this writing job with a list of commit messages. The commit 
messages are collected from
+   * successful data writers and are produced by {@link 
DataWriter#commit()}. If this method
+   * fails(throw exception), this writing job is considered to be failed, 
and
+   * {@link #abort(WriterCommitMessage[])} will be called. The written 
data should only be visible
+   * to data source readers if this method successes.
+   *
+   * Note that, one partition may have multiple committed data writers 
because of speculative tasks.
+   * Spark will pick the first successful one and get its commit message. 
Implementations should be
+   * aware of this and handle it correctly, e.g., have a mechanism to make 
sure only one data writer
+   * can commit successfully, or have a way to clean up the data of 
already-committed writers.
+   */
+  void commit(WriterCommitMessage[] messages);
+
+  /**
+   * Aborts this writing job because some data writers are failed to write 
the records and aborted,
+   * or the Spark job fails with some unknown reasons, or {@link 
#commit(WriterCommitMessage[])}
+   * fails. If this method fails(throw exception), the underlying data 
source may have garbage that
+   * need to be cleaned manually, but these garbage should not be visible 
to data source readers.
+   *
+   * Unless the abortion is triggered by the failure of commit, the given 
messages should have some
--- End diff --

[GitHub] spark pull request #19466: [SPARK-22237] [CORE] Fix spark submit file downlo...

2017-10-11 Thread loneknightpy

Github user loneknightpy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19466#discussion_r144100747
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -366,6 +366,16 @@ object SparkSubmit extends CommandLineUtils with 
Logging {
   localPyFiles = Option(args.pyFiles).map {
 downloadFileList(_, targetDir, sparkConf, hadoopConf, secMgr)
   }.orNull
+
+  if (clusterManager == STANDALONE || clusterManager == LOCAL) {
+// Use local files for standalone client mode.
+args.primaryResource = localPrimaryResource
--- End diff --

Thanks for your input. It seems the issue is caused by a conflict with our 
internal patch. I will close this PR for now. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19466: [SPARK-22237] [CORE] Fix spark submit file downlo...

2017-10-11 Thread loneknightpy

Github user loneknightpy closed the pull request at:

https://github.com/apache/spark/pull/19466


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19474
  
**[Test build #82639 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82639/testReport)**
 for PR 19474 at commit 
[`0667ac8`](https://github.com/apache/spark/commit/0667ac8bc893c50a37b607dc4713c24db12300e8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait Command extends LeafNode `
  * `trait RunnableCommand extends Command `
  * `case class ExecutedCommandExec(cmd: RunnableCommand) extends 
LeafExecNode `


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19451: SPARK-22181 Adds ReplaceExceptWithNotFilter rule

2017-10-11 Thread sathiyapk

Github user sathiyapk commented on the issue:

https://github.com/apache/spark/pull/19451
  
@gengliangwang Ready for a next review :)

> put case ... in a new line

Are your sure? I thought according to the coding style, while calling on a 
partial function if there is only one case, we put the case on the same line as 
the function invocation. isn't it?

PS: `nonFilterChild` is really a perfect name, thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18386: [SPARK-21165] [SQL] [2.2] Use executedPlan instea...

2017-10-11 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18386#discussion_r144097084
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
 ---
@@ -111,9 +111,18 @@ object FileFormatWriter extends Logging {
 job.setOutputValueClass(classOf[InternalRow])
 FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))
 
-val allColumns = queryExecution.logical.output
+val allColumns = queryExecution.executedPlan.output
--- End diff --

Yes. We should always use `analyzed.output` 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19474
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19474
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82639/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTas...

2017-10-11 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18979#discussion_r144094437
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
 ---
@@ -57,7 +60,14 @@ class BasicWriteTaskStatsTracker(hadoopConf: 
Configuration)
   private def getFileSize(filePath: String): Long = {
 val path = new Path(filePath)
 val fs = path.getFileSystem(hadoopConf)
-fs.getFileStatus(path).getLen()
+try {
+  fs.getFileStatus(path).getLen()
+} catch {
+  case e: FileNotFoundException =>
+// may arise against eventually consistent object stores
+logInfo(s"File $path is not yet visible", e)
--- End diff --

Could you update the log message and indicate the size zero might be wrong? 
For example negative caching in S3 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-11 Thread akopich

Github user akopich commented on the issue:

https://github.com/apache/spark/pull/18924
  
@WeichenXu123, yes sure. But can this wait until this PR is merged? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...

2017-10-11 Thread rdblue

Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/19448#discussion_r144088592
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -138,6 +138,10 @@ class ParquetFileFormat
   conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
 }
 
+require(!conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
--- End diff --

`AnalysisException`? Shouldn't this be `SparkException`? By the time this 
runs, Spark has already analyzed, optimized, and planned the job. Doesn't seem 
like failing analysis is appropriate.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #30: SPARK-1004. PySpark on YARN

2017-10-11 Thread swaapnika-guntaka

Github user swaapnika-guntaka commented on the issue:

https://github.com/apache/spark/pull/30
  
I see the Java EOF Exception when I run python packaged jar(using JDK 8) 
using Spark-2.2
I'm trying to run this using the below command.
`time bash -x $SPARK_HOME/bin/spark-submit --driver-class-path 
.:: -v $PYTHONPATH/ >& run.log` 
```
Recent failure: Lost task 3.3 in stage 0.0 (TID 36, 10.15.163.25, executor 
0): java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:166)
at 
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
at 
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:395)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19474
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82638/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19474
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...

2017-10-11 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19448#discussion_r144086925
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala
 ---
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.FileNotFoundException
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
+import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
+import org.apache.parquet.hadoop.{ParquetOutputCommitter, 
ParquetOutputFormat}
+
+import org.apache.spark.{LocalSparkContext, SparkFunSuite}
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SQLTestUtils
+
+/**
+ * Test logic related to choice of output commtters
+ */
+class ParquetCommitterSuite extends SparkFunSuite with SQLTestUtils
+  with LocalSparkContext {
+
+  private val PARQUET_COMMITTER = 
classOf[ParquetOutputCommitter].getCanonicalName
+
+  protected var spark: SparkSession = _
+
+  /**
+   * Create a new [[SparkSession]] running in local-cluster mode with 
unsafe and codegen enabled.
+   */
+  override def beforeAll(): Unit = {
+super.beforeAll()
+spark = SparkSession.builder()
+  .master("local-cluster[2,1,1024]")
+  .appName("testing")
+  .getOrCreate()
+  }
+
+  override def afterAll(): Unit = {
+if (spark != null) {
+  spark.stop()
+  spark = null
+}
+super.afterAll()
--- End diff --



```Scala
try {
  ...
} finally {
  super.afterAll()
}
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19474
  
**[Test build #82638 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82638/testReport)**
 for PR 19474 at commit 
[`3b1174f`](https://github.com/apache/spark/commit/3b1174f7e1ed9caae890936ceeb4fb54e58eadcc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait Command extends LeafNode `
  * `trait RunnableCommand extends Command `
  * `case class ExecutedCommandExec(cmd: RunnableCommand) extends 
LeafExecNode `


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...

2017-10-11 Thread markhamstra

Github user markhamstra commented on the issue:

https://github.com/apache/spark/pull/19468
  
> iron out the kinks

A large chunk of the difficulty in identifying and ironing out kinks in 
such a project is the difficulty of writing adequate tests of the scheduler 
code. I'd expect test coverage to take roughly the same amount of effort as all 
of the rest of the scheduler plug-in effort.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19467: [SPARK-22238] Fix plan resolution bug caused by EnsureSt...

2017-10-11 Thread brkyvz

Github user brkyvz commented on the issue:

https://github.com/apache/spark/pull/19467
  
cc @tdas 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...

2017-10-11 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19448#discussion_r144082508
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -138,6 +138,10 @@ class ParquetFileFormat
   conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
 }
 
+require(!conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
--- End diff --

We need to issue an `AnalysisException` here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18460: [SPARK-21247][SQL] Type comparison should respect...

2017-10-11 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/18460#discussion_r144082145
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -100,6 +101,17 @@ object TypeCoercion {
 case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) 
=>
   Some(TimestampType)
 
+case (t1 @ StructType(fields1), t2 @ StructType(fields2)) if 
t1.sameType(t2) =>
+  Some(StructType(fields1.zip(fields2).map { case (f1, f2) =>
+// Since `t1.sameType(t2)` is true, two StructTypes have the same 
DataType
+// except `name` (in case of `spark.sql.caseSensitive=false`) and 
`nullable`.
+// - Different names: use a lower case name because 
findTightestCommonType is commutative.
+// - Different nullabilities: `nullable` is true iff one of them 
is nullable.
+val name = if (f1.name == f2.name) f1.name else 
f1.name.toLowerCase(Locale.ROOT)
+val dataType = findTightestCommonType(f1.dataType, f2.dataType).get
+StructField(name, dataType, nullable = f1.nullable || f2.nullable)
--- End diff --

Please see 
[TypeCoercionSuite.checkWidenType](https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala#L130-L142).

In order to use the first type name, we need to loosen this test helper 
function and to break the existing commutative assumption. I'm ok for that if 
you want.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19429: [SPARK-20055] [Docs] Added documentation for loading csv...

2017-10-11 Thread felixcheung

Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/19429
  
+1 for more detailed documentation (we should steer away from `inferSchema`)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18805
  
**[Test build #82644 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82644/testReport)**
 for PR 18805 at commit 
[`029a753`](https://github.com/apache/spark/commit/029a753ad4be6881c4e1721eecdfaad0f8b158bd).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-10-11 Thread vanzin

Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/18805
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18805: [SPARK-19112][CORE] Support for ZStandard codec

2017-10-11 Thread vanzin

Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/18805
  
The 
[code](https://github.com/luben/zstd-jni/blob/master/src/main/java/com/github/luben/zstd/util/Native.java)
 overwrites the original exception message that might shed some light on what's 
going on... and also ignores some exceptions it shouldn't be ignoring (like 
errors on close, which may indicate low disk space).

Anyway, let's try again to see if it's at least consistent.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18460: [SPARK-21247][SQL] Type comparison should respect...

2017-10-11 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/18460#discussion_r144078834
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -100,6 +101,17 @@ object TypeCoercion {
 case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) 
=>
   Some(TimestampType)
 
+case (t1 @ StructType(fields1), t2 @ StructType(fields2)) if 
t1.sameType(t2) =>
+  Some(StructType(fields1.zip(fields2).map { case (f1, f2) =>
+// Since `t1.sameType(t2)` is true, two StructTypes have the same 
DataType
+// except `name` (in case of `spark.sql.caseSensitive=false`) and 
`nullable`.
+// - Different names: use a lower case name because 
findTightestCommonType is commutative.
+// - Different nullabilities: `nullable` is true iff one of them 
is nullable.
+val name = if (f1.name == f2.name) f1.name else 
f1.name.toLowerCase(Locale.ROOT)
+val dataType = findTightestCommonType(f1.dataType, f2.dataType).get
+StructField(name, dataType, nullable = f1.nullable || f2.nullable)
--- End diff --

Sure, right. It's for commutativity.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18460: [SPARK-21247][SQL] Type comparison should respect...

2017-10-11 Thread gatorsmile

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18460#discussion_r144075681
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -100,6 +101,17 @@ object TypeCoercion {
 case (_: TimestampType, _: DateType) | (_: DateType, _: TimestampType) 
=>
   Some(TimestampType)
 
+case (t1 @ StructType(fields1), t2 @ StructType(fields2)) if 
t1.sameType(t2) =>
+  Some(StructType(fields1.zip(fields2).map { case (f1, f2) =>
+// Since `t1.sameType(t2)` is true, two StructTypes have the same 
DataType
+// except `name` (in case of `spark.sql.caseSensitive=false`) and 
`nullable`.
+// - Different names: use a lower case name because 
findTightestCommonType is commutative.
+// - Different nullabilities: `nullable` is true iff one of them 
is nullable.
+val name = if (f1.name == f2.name) f1.name else 
f1.name.toLowerCase(Locale.ROOT)
+val dataType = findTightestCommonType(f1.dataType, f2.dataType).get
+StructField(name, dataType, nullable = f1.nullable || f2.nullable)
--- End diff --

```
val name = if (f1.name == f2.name) f1.name else 
f1.name.toLowerCase(Locale.ROOT)
```

The above code changes the case, right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19473: [SPARK-22251][SQL] Metric 'aggregate time' is incorrect ...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19473
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19473: [SPARK-22251][SQL] Metric 'aggregate time' is incorrect ...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19473
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82637/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19473: [SPARK-22251][SQL] Metric 'aggregate time' is incorrect ...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19473
  
**[Test build #82637 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82637/testReport)**
 for PR 19473 at commit 
[`470b54f`](https://github.com/apache/spark/commit/470b54f41ca7da5331d8ace22911a1a722c9c51b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18460: [SPARK-21247][SQL] Type comparison should respect case-s...

2017-10-11 Thread dongjoon-hyun

Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/18460
  
Please let me know if there is something to do more~ Thank you always, 
@gatorsmile .


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19471: [SPARK-22245][SQL] partitioned data set should always pu...

2017-10-11 Thread dongjoon-hyun

Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19471
  
+1 for this change. BTW, wow, there are lots of test case failures: 81 
failures.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19448
  
**[Test build #82643 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82643/testReport)**
 for PR 19448 at commit 
[`d634f9e`](https://github.com/apache/spark/commit/d634f9e3467885759fa02657466a1cdd859ab704).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19466: [SPARK-22237] [CORE] Fix spark submit file download for ...

2017-10-11 Thread vanzin

Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/19466
  
> here with this change all the resources should be fetched from local 
driver

That's a good point. You should download resources just to add them to the 
driver's classpath, but executors can download them directly from the source.

As I requested, a test case would really help in understanding what is 
broken here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19472
  
**[Test build #82642 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82642/testReport)**
 for PR 19472 at commit 
[`a814eb3`](https://github.com/apache/spark/commit/a814eb3f08085b09a16f336b36fba8da24e4f34a).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19456: [SPARK] [Scheduler] Configurable default scheduling mode

2017-10-11 Thread blyncsy-david-lewis

Github user blyncsy-david-lewis commented on the issue:

https://github.com/apache/spark/pull/19456
  
I have a multiuser application where I use the userId as the name of the 
scheduling pool so that users are balanced equally by spark and within a user's 
workload I can set the scheduling mode to FAIR (or whatever I want). It is 
unreasonable to specify xml for each user and restart spark, so this patch 
allows me to specify the default configuration used by undefined pools.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-11 Thread BryanCutler

Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/18664
  
Thanks @ueshin , I agree it is better to convert the timezone to Python 
system local first and then localize to make tz-naive in case the Python system 
local tz is different that `DateTimeUtils.defaultTimeZone()`.  I'll apply your 
patch.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18664
  
**[Test build #82641 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82641/testReport)**
 for PR 18664 at commit 
[`d1617fd`](https://github.com/apache/spark/commit/d1617fde80697bc3423d661c23997300cbb896ce).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...

2017-10-11 Thread steveloughran

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/19448#discussion_r144065810
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -138,6 +138,13 @@ class ParquetFileFormat
   conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
 }
 
+if (conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
+  && 
!classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)) {
+  // output summary is requested, but the class is not a Parquet 
Committer
+  throw new RuntimeException(s"Committer $committerClass is not a 
ParquetOutputCommitter" +
+s" and cannot create job summaries.")
--- End diff --

aah. in the move to require() everything is going back onto a single line. 
so now moot


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19472
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...

2017-10-11 Thread steveloughran

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/19448#discussion_r144065074
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -138,6 +138,13 @@ class ParquetFileFormat
   conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
 }
 
+if (conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
+  && 
!classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)) {
+  // output summary is requested, but the class is not a Parquet 
Committer
+  throw new RuntimeException(s"Committer $committerClass is not a 
ParquetOutputCommitter" +
+s" and cannot create job summaries.")
--- End diff --

aah



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19472
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82636/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19472
  
**[Test build #82636 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82636/testReport)**
 for PR 19472 at commit 
[`0738359`](https://github.com/apache/spark/commit/0738359193f4c3a7e7cbeb85d1fc10967f06ff4f).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...

2017-10-11 Thread steveloughran

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/19448#discussion_r144065041
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala
 ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.FileNotFoundException
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
+import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
+import org.apache.parquet.hadoop.{ParquetOutputCommitter, 
ParquetOutputFormat}
+
+import org.apache.spark.{LocalSparkContext, SparkFunSuite}
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SQLTestUtils
+
+/**
+ * Test logic related to choice of output commtters
+ */
+class ParquetCommitterSuite extends SparkFunSuite with SQLTestUtils
+  with LocalSparkContext {
+
+  private val PARQUET_COMMITTER = 
classOf[ParquetOutputCommitter].getCanonicalName
+
+  protected var spark: SparkSession = _
+
+  /**
+   * Create a new [[SparkSession]] running in local-cluster mode with 
unsafe and codegen enabled.
+   */
+  override def beforeAll(): Unit = {
+super.beforeAll()
+spark = SparkSession.builder()
+  .master("local-cluster[2,1,1024]")
+  .appName("testing")
+  .getOrCreate()
+  }
+
+  override def afterAll(): Unit = {
+spark.stop()
+spark = null
--- End diff --

done, + will add a check for spark==null so if a failure happens during 
setup, the exception doesn't get lost in teardown


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18098: [SPARK-16944][Mesos] Improve data locality when launchin...

2017-10-11 Thread PerilousApricot

Github user PerilousApricot commented on the issue:

https://github.com/apache/spark/pull/18098
  
Is there any documentation for this feature? How would I expose my topology 
to mesos/spark?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19474
  
**[Test build #82640 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82640/testReport)**
 for PR 19474 at commit 
[`9d4c7a2`](https://github.com/apache/spark/commit/9d4c7a236d9d6e95f1ae355ed8cb07154df5f04e).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-11 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r144060858
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,24 @@ private[clustering] trait LDAParams extends Params 
with HasFeaturesCol with HasM
   /**
* For Online optimizer only: [[optimizer]] = "online".
*
+   * A (positive) learning parameter that controls the convergence of 
variational inference.
+   * Smaller value will lead to more accuracy model and longer training 
time.
--- End diff --

Thanks, I will update the doc.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

2017-10-11 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19474
  
For a simple command `Seq(1 -> "a").toDF("i", 
"j").write.parquet("/tmp/qwe")`, the UI before this PR:
https://user-images.githubusercontent.com/3182036/31452520-bc74bb44-aee1-11e7-8721-234925856411.png;>

The UI after this PR:
https://user-images.githubusercontent.com/3182036/31452534-c6ba5622-aee1-11e7-865d-b2af359d529d.png;>

The scan node is no longer visible above the insert node, I'll fix this 
later. The writer bug is more important and we should fix it ASAP.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

2017-10-11 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19474
  
cc @gatorsmile @viirya


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19474
  
**[Test build #82639 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82639/testReport)**
 for PR 19474 at commit 
[`0667ac8`](https://github.com/apache/spark/commit/0667ac8bc893c50a37b607dc4713c24db12300e8).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19468: [SPARK-18278] [Scheduler] Spark on Kubernetes - Basic Sc...

2017-10-11 Thread foxish

Github user foxish commented on the issue:

https://github.com/apache/spark/pull/19468
  
@skonto, there was some discussion about this on the SPIP. We see them as 
separate and independent issues with the pluggable API being a long term goal.  
It would involve a working group of people drawn from all cluster manager 
integration maintainers - because the changes are extensive and require 
scalability/performance testing in a variety of environments - probably over 
2-3 releases before we can iron out the kinks. 

In the short term however, adding the package here enables supporting the 
growing number of K8s users of Spark that currently rely on our fork, and the 
integration testing makes us confident about not impacting the release process 
or adding complexity to the maintainers workflow. The K8s community will set up 
all the requisite testing, and ensure health of this package. The integration 
testing would also carry forward to when we have pluggable APIs in the future. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-11 Thread steveloughran

Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/18979
  
@viirya : the new data writer API will allow for a broader set of stats to 
be propagated back from workers. When you are working with the object stores, 
an useful stat to get back is throttle count & retry count as they can be the 
cause of why things are slow ... and if it is due to throttling, throwing more 
workers at the job will actually slow things down. They'd be the ones to look 
at first


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19041: [SPARK-21097][CORE] Add option to recover cached data

2017-10-11 Thread brad-kaiser

Github user brad-kaiser commented on the issue:

https://github.com/apache/spark/pull/19041
  
Thanks @vanzin I will work on these comments.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19473: [SPARK-22251] Metric 'aggregate time' is incorrect when ...

2017-10-11 Thread hvanhovell

Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/19473
  
Could you add [SQL] to the title? That makes it easier for others to scan 
PRs.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19473: [SPARK-22251] Metric 'aggregate time' is incorrect when ...

2017-10-11 Thread hvanhovell

Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/19473
  
LGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19474: [SPARK-22252][SQL] FileFormatWriter should respect the i...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19474
  
**[Test build #82638 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82638/testReport)**
 for PR 19474 at commit 
[`3b1174f`](https://github.com/apache/spark/commit/3b1174f7e1ed9caae890936ceeb4fb54e58eadcc).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19474: [SPARK-22252][SQL] FileFormatWriter should respec...

2017-10-11 Thread cloud-fan

GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/19474

[SPARK-22252][SQL] FileFormatWriter should respect the input query schema

## What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/18064, we allowed `RunnableCommand` 
to have children in order to fix some UI issues. Then we made `InsertIntoXXX` 
commands take the input `query` as a child, when we do the actual writing, we 
just pass the physical plan to the writer(`FileFormatWriter.write`).

However this is problematic. In Spark SQL, optimizer and planner are 
allowed to change the schema names a little bit. e.g. `ColumnPruning` rule will 
remove no-op `Project`s, like `Project("A", Scan("a"))`, and thus change the 
output schema from "" to ``. When it comes to writing, 
especially for self-description data format like parquet, we may write the 
wrong schema to the file and cause null values at the read path.

Fortunately, in https://github.com/apache/spark/pull/18450 , we decided to 
allow nested execution and one query can map to multiple executions in the UI. 
This releases the major restriction in #18604 , and now we don't have to take 
the input `query` as child of `InsertIntoXXX` commands.

So the fix is simple, this PR partially revert #18064 and make 
`InsertIntoXXX` commands leaf nodes again.
 
## How was this patch tested?

new regression test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark bug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19474.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19474


commit 3b1174f7e1ed9caae890936ceeb4fb54e58eadcc
Author: Wenchen Fan 
Date:   2017-10-11T14:30:38Z

FileFormatWriter should respect the input query schema




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19077: [SPARK-21860][core]Improve memory reuse for heap ...

2017-10-11 Thread jiangxb1987

Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19077#discussion_r144038007
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -116,9 +116,10 @@ private [sql] object GenArrayData {
s"final ArrayData $arrayDataName = new 
$genericArrayClass($arrayName);",
arrayDataName)
 } else {
+  val numBytes = elementType.defaultSize * numElements
   val unsafeArraySizeInBytes =
 UnsafeArrayData.calculateHeaderPortionInBytes(numElements) +
-
ByteArrayMethods.roundNumberOfBytesToNearestWord(elementType.defaultSize * 
numElements)
+ByteArrayMethods.roundNumberOfBytesToNearestWord(numBytes).toInt
--- End diff --

We should really inline that.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19077: [SPARK-21860][core]Improve memory reuse for heap ...

2017-10-11 Thread jiangxb1987

Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19077#discussion_r144037194
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java ---
@@ -48,6 +49,15 @@ public long size() {
   }
 
   /**
+   * Reset the size of the memory block.
+   */
+  public void resetSize(long len) {
+assert (ByteArrayMethods.roundNumberOfBytesToNearestWord(length) ==
--- End diff --

Also leave some message if the check fails.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19077: [SPARK-21860][core]Improve memory reuse for heap ...

2017-10-11 Thread jiangxb1987

Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19077#discussion_r144037771
  
--- Diff: 
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java
 ---
@@ -57,7 +57,7 @@ public void initialize(BufferHolder holder, int 
numElements, int elementSize) {
 
 // Grows the global buffer ahead for header and fixed size data.
 int fixedPartInBytes =
-  ByteArrayMethods.roundNumberOfBytesToNearestWord(elementSize * 
numElements);
+  (int)ByteArrayMethods.roundNumberOfBytesToNearestWord(elementSize * 
numElements);
--- End diff --

nit: extra space after `(int) `, also please update the other similar 
changes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19077: [SPARK-21860][core]Improve memory reuse for heap ...

2017-10-11 Thread jiangxb1987

Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19077#discussion_r144037069
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/memory/MemoryBlock.java ---
@@ -48,6 +49,15 @@ public long size() {
   }
 
   /**
+   * Reset the size of the memory block.
+   */
+  public void resetSize(long len) {
+assert (ByteArrayMethods.roundNumberOfBytesToNearestWord(length) ==
--- End diff --

We'd better use `require` instead of `assert`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18386: [SPARK-21165] [SQL] [2.2] Use executedPlan instea...

2017-10-11 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18386#discussion_r144037286
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
 ---
@@ -111,9 +111,18 @@ object FileFormatWriter extends Logging {
 job.setOutputValueClass(classOf[InternalRow])
 FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))
 
-val allColumns = queryExecution.logical.output
+val allColumns = queryExecution.executedPlan.output
--- End diff --

This is problematic.  The physical plan may have different schema from 
logical plan(schema name may be different), and the writer should respect the 
logical schema as that what users expects.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19456: [SPARK] [Scheduler] Configurable default scheduling mode

2017-10-11 Thread jiangxb1987

Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/19456
  
Could you elaborate on the scenario that you should need to make these 
settings configurable?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-11 Thread jiangxb1987

Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19464#discussion_r144030461
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -196,7 +196,10 @@ class HadoopRDD[K, V](
 // add the credentials here as this can be called before SparkContext 
initialized
 SparkHadoopUtil.get.addCredentials(jobConf)
 val inputFormat = getInputFormat(jobConf)
-val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+var inputSplits = inputFormat.getSplits(jobConf, minPartitions)
--- End diff --

How about:
```
val inputSplits = if (..) {
inputFormat.getSplits(jobConf, minPartitions).filter(_.getLength > 0)
} else {
   inputFormat.getSplits(jobConf, minPartitions)
}
```
We should alway try to not use `var`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19316: [SPARK-22097][CORE]Request an accurate memory aft...

2017-10-11 Thread jiangxb1987

Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19316#discussion_r144034339
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -388,7 +388,13 @@ private[spark] class MemoryStore(
 // perform one final call to attempt to allocate additional memory if 
necessary.
 if (keepUnrolling) {
   serializationStream.close()
-  reserveAdditionalMemoryIfNecessary()
+  if (bbos.size > unrollMemoryUsedByThisBlock) {
+val amountToRequest = bbos.size - unrollMemoryUsedByThisBlock
--- End diff --

Why not just fix that in `reserveAdditionalMemoryIfNecessary` ?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-11 Thread jiangxb1987

Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19464#discussion_r144029852
  
--- Diff: docs/configuration.md ---
@@ -1211,6 +1211,14 @@ Apart from these, the following properties are also 
available, and may be useful
 data may need to be rewritten to pre-existing output directories 
during checkpoint recovery.
 
 
+spark.hadoop.filterOutEmptySplit
--- End diff --

We should add the config to `internal/config`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19473: [SPARK-22251] Metric 'aggregate time' is incorrect when ...

2017-10-11 Thread ala

Github user ala commented on the issue:

https://github.com/apache/spark/pull/19473
  
@hvanhovell 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-11 Thread jiangxb1987

Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19464#discussion_r144031728
  
--- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
@@ -510,4 +510,16 @@ class FileSuite extends SparkFunSuite with 
LocalSparkContext {
 }
   }
 
+  test("spark.hadoop.filterOutEmptySplit") {
+val sf = new SparkConf()
+
sf.setAppName("test").setMaster("local").set("spark.hadoop.filterOutEmptySplit",
 "true")
+sc = new SparkContext(sf)
+val emptyRDD = sc.parallelize(Array.empty[Tuple2[String, String]], 1)
+emptyRDD.saveAsHadoopFile[TextOutputFormat[String, 
String]](tempDir.getPath + "/output")
+assert(new File(tempDir.getPath + "/output/part-0").exists() === 
true)
+
+val hadoopRDD = sc.textFile(tempDir.getPath + "/output/part-0")
--- End diff --

We should also add the following test cases:
1. Ensure that if no split is empty, we don't lose any splits;
2. Ensure that if part of the splits are empty, we remove the splits 
correctly.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-11 Thread jiangxb1987

Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19464#discussion_r144030646
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -196,7 +196,10 @@ class HadoopRDD[K, V](
 // add the credentials here as this can be called before SparkContext 
initialized
 SparkHadoopUtil.get.addCredentials(jobConf)
 val inputFormat = getInputFormat(jobConf)
-val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+var inputSplits = inputFormat.getSplits(jobConf, minPartitions)
+if 
(sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) {
+  inputSplits = inputSplits.filter(_.getLength>0)
--- End diff --

nit: extra space around operator.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-11 Thread jiangxb1987

Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19464#discussion_r144031167
  
--- Diff: core/src/test/scala/org/apache/spark/FileSuite.scala ---
@@ -510,4 +510,16 @@ class FileSuite extends SparkFunSuite with 
LocalSparkContext {
 }
   }
 
+  test("spark.hadoop.filterOutEmptySplit") {
+val sf = new SparkConf()
+
sf.setAppName("test").setMaster("local").set("spark.hadoop.filterOutEmptySplit",
 "true")
+sc = new SparkContext(sf)
+val emptyRDD = sc.parallelize(Array.empty[Tuple2[String, String]], 1)
+emptyRDD.saveAsHadoopFile[TextOutputFormat[String, 
String]](tempDir.getPath + "/output")
+assert(new File(tempDir.getPath + "/output/part-0").exists() === 
true)
+
+val hadoopRDD = sc.textFile(tempDir.getPath + "/output/part-0")
+assert(hadoopRDD.partitions.length === 0)
--- End diff --

You should recycle the resources you required in the test case.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19250: [SPARK-12297] Table timezone correction for Timestamps

2017-10-11 Thread zivanfi

Github user zivanfi commented on the issue:

https://github.com/apache/spark/pull/19250
  
@attilajeges has just found a problem with the behavior specified in the 
requirements:
* Partitions of a table can use different file formats.
* As a result, a single table can have data files of different file formats 
at the same time.
* Timestamps are already handled differently in these formats (this was our 
original problem to begin with).
* As a result, **no uniform adjustment can fix timestamps for all file 
formats of the same table at the same time.**

We can still solve the issue using a file-format-specific table property 
though. @rxin, I would like to ask you opinion of such a solution:
* It **is** implemented in the analyzer, as you asked.
* It **is** writer-agnostic, as you asked.
* It **is not** file-format-agnostic, but Parquet-specific instead for the 
time being.

Would you find such a soltion be acceptable, given that a 
file-format-agnostic fix seems infeasible at this point?

Thanks,

Zoltan


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets

2017-10-11 Thread susanxhuynh

Github user susanxhuynh commented on the issue:

https://github.com/apache/spark/pull/19437
  
@vanzin Would you mind reviewing this PR? A followup to ArtRand's secrets 
PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19473: [SPARK-22251] Metric 'aggregate time' is incorrect when ...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19473
  
**[Test build #82637 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82637/testReport)**
 for PR 19473 at commit 
[`470b54f`](https://github.com/apache/spark/commit/470b54f41ca7da5331d8ace22911a1a722c9c51b).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19437: [SPARK-22131][MESOS] Mesos driver secrets

2017-10-11 Thread susanxhuynh

Github user susanxhuynh commented on the issue:

https://github.com/apache/spark/pull/19437
  
@skonto I haven't tested with TLS; I'll work on that in the next couple of 
days.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19473: [SPARK-22251] Metric 'aggregate time' is incorrec...

2017-10-11 Thread ala

GitHub user ala opened a pull request:

https://github.com/apache/spark/pull/19473

[SPARK-22251] Metric 'aggregate time' is incorrect when codegen is off

## What changes were proposed in this pull request?

Adding the code for setting 'aggregate time' metric to non-codegen path in 
HashAggregateExec.

## How was this patch tested?

Tested manually.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ala/spark fix-agg-time

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19473.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19473


commit 470b54f41ca7da5331d8ace22911a1a722c9c51b
Author: Ala Luszczak 
Date:   2017-10-11T13:59:35Z

Fix 'aggregate time' when codegen is off.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...

2017-10-11 Thread a10y

Github user a10y commented on a diff in the pull request:

https://github.com/apache/spark/pull/18931#discussion_r144025706
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -175,6 +175,25 @@ trait CodegenSupport extends SparkPlan {
   }
 
   /**
+   * In Java, a method descriptor is valid only if it represents method 
parameters with a total
+   * length of 255 or less. `this` contributes one unit and a parameter of 
type long or double
+   * contributes two units. Besides, for nullable parameters, we also need 
to pass a boolean
+   * for the null status.
+   */
+  private def isValidParamLength(ctx: CodegenContext): Boolean = {
+var paramLength = 1 // for `this` parameter.
+output.foreach { attr =>
--- End diff --

(nit: This could be written as a `foldLeft` and then you can eliminate the 
`var`)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18979: [SPARK-21762][SQL] FileFormatWriter/BasicWriteTaskStatsT...

2017-10-11 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18979
  
I don't have strong opinion against this. Incorrect size is an issue but I 
can't think a better solution for now...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19472
  
**[Test build #82636 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82636/testReport)**
 for PR 19472 at commit 
[`0738359`](https://github.com/apache/spark/commit/0738359193f4c3a7e7cbeb85d1fc10967f06ff4f).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...

2017-10-11 Thread WeichenXu123

Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/18924
  
@akopich  LGTM. and do you have time to create a PR to resolve random seed 
not working issue mentioned by @hhbyyh ?  Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19472
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82634/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19472
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19472: [WIP][SPARK-22246][SQL] Use MemoryBlock in UnsafeRow, Un...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19472
  
**[Test build #82634 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82634/testReport)**
 for PR 19472 at commit 
[`e51fb6a`](https://github.com/apache/spark/commit/e51fb6ae246a660ec2b0bdd1f6a53a0987d4a063).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public abstract class MemoryBlock implements Serializable `


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...

2017-10-11 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19448#discussion_r143996060
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -138,6 +138,13 @@ class ParquetFileFormat
   conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
 }
 
+if (conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
+  && 
!classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)) {
+  // output summary is requested, but the class is not a Parquet 
Committer
+  throw new RuntimeException(s"Committer $committerClass is not a 
ParquetOutputCommitter" +
+s" and cannot create job summaries.")
--- End diff --

Oh, I mean .. s in `s" .. "`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18029
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82635/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18029
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18029
  
**[Test build #82635 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82635/testReport)**
 for PR 18029 at commit 
[`72703a0`](https://github.com/apache/spark/commit/72703a072e34407b52d555bab98a435414d2ed25).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...

2017-10-11 Thread steveloughran

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/19448#discussion_r143992362
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetCommitterSuite.scala
 ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.FileNotFoundException
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
+import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
+import org.apache.parquet.hadoop.{ParquetOutputCommitter, 
ParquetOutputFormat}
+
+import org.apache.spark.{LocalSparkContext, SparkFunSuite}
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SQLTestUtils
+
+/**
+ * Test logic related to choice of output commtters
+ */
+class ParquetCommitterSuite extends SparkFunSuite with SQLTestUtils
+  with LocalSparkContext {
+
+  private val PARQUET_COMMITTER = 
classOf[ParquetOutputCommitter].getCanonicalName
+
+  protected var spark: SparkSession = _
+
+  /**
+   * Create a new [[SparkSession]] running in local-cluster mode with 
unsafe and codegen enabled.
+   */
+  override def beforeAll(): Unit = {
+super.beforeAll()
+spark = SparkSession.builder()
+  .master("local-cluster[2,1,1024]")
+  .appName("testing")
+  .getOrCreate()
+  }
+
+  override def afterAll(): Unit = {
+spark.stop()
+spark = null
+  }
+
+  test("alternative output committer, merge schema") {
+intercept[RuntimeException] {
+  val stat = writeDataFrame(MarkingFileOutput.COMMITTER, true, true)
+  logError(s"Created marker file $stat")
+}
+  }
+
+  test("alternative output committer, no merge schema") {
+writeDataFrame(MarkingFileOutput.COMMITTER, false, true)
--- End diff --

OK


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...

2017-10-11 Thread steveloughran

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/19448#discussion_r143992319
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -138,6 +138,13 @@ class ParquetFileFormat
   conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
 }
 
+if (conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
+  && 
!classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)) {
+  // output summary is requested, but the class is not a Parquet 
Committer
+  throw new RuntimeException(s"Committer $committerClass is not a 
ParquetOutputCommitter" +
+s" and cannot create job summaries.")
--- End diff --

Depends on the policy about "what to do if it's not a parquet committer 
*and* the option for  job summaries is set. It could just mean "you don't get 
summaries", which worksforme :). May want to log at info though?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19448: [SPARK-22217] [SQL] ParquetFileFormat to support ...

2017-10-11 Thread steveloughran

Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/19448#discussion_r143992018
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
 ---
@@ -138,6 +138,13 @@ class ParquetFileFormat
   conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
 }
 
+if (conf.getBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false)
+  && 
!classOf[ParquetOutputCommitter].isAssignableFrom(committerClass)) {
+  // output summary is requested, but the class is not a Parquet 
Committer
+  throw new RuntimeException(s"Committer $committerClass is not a 
ParquetOutputCommitter" +
--- End diff --

will do


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18029
  
**[Test build #82635 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82635/testReport)**
 for PR 18029 at commit 
[`72703a0`](https://github.com/apache/spark/commit/72703a072e34407b52d555bab98a435414d2ed25).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-11 Thread srowen

Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19464#discussion_r143984294
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -122,7 +122,10 @@ class NewHadoopRDD[K, V](
   case _ =>
 }
 val jobContext = new JobContextImpl(_conf, jobId)
-val rawSplits = inputFormat.getSplits(jobContext).toArray
+var rawSplits = 
inputFormat.getSplits(jobContext).toArray(Array.empty[InputSplit])
+if 
(sparkContext.getConf.getBoolean("spark.hadoop.filterOutEmptySplit", false)) {
+  rawSplits = rawSplits.filter(_.getLength>0)
--- End diff --

Space around operator.
You should filter before making an array.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19464: [SPARK-22233] [core] Allow user to filter out empty spli...

2017-10-11 Thread srowen

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19464
  
Interesting. On the one hand I don't like adding yet another flag that 
changes behavior, when the user often can't meaningfully decide to set it. 
There is probably no value in processing an empty partition, sure. Then again 
it does change behavior slightly, and I wonder if that impacts assumptions that 
apps rely on somehow. 

If there's no reason to expect downside, we could do this in Spark 3.x, or 
make the change now but yes introduce a flag as a safety valve to go back to 
old behavior, leaving the default to true.

But first are there any known impacts to skipping the empty partitions?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19471: [SPARK-22245][SQL] partitioned data set should always pu...

2017-10-11 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/19471
  
Fair enough to me. To check this change reasonable, we might be able to 
send a dev/user list email to social feedbacks. I saw marmbrus doing so when 
adding the json API;
https://github.com/apache/spark/pull/15274#issuecomment-250092074

http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-JSON-Column-Support-td19132.html
If we have no response or positive feedbacks, we could quickly/safely drop 
the support.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17819
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17819
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82632/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-10-11 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17819
  
**[Test build #82632 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82632/testReport)**
 for PR 17819 at commit 
[`bb19708`](https://github.com/apache/spark/commit/bb19708dcd359c434c0fac779125f949541f9b8c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19471: [SPARK-22245][SQL] partitioned data set should always pu...

2017-10-11 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19471
  
waiting for more feedbacks before moving forward :)

Another thing I wanna point out: for `sql("create table t using parquet 
options(skipHiveMetadata=true) location '/tmp/t'")`, it works in Spark 2.0, and 
the created table has a schema that the partition column is at the beginning. 
In Spark 2.1, it also works, and `DESC TABLE` also shows the table schema has 
partition column at the beginning. However, if you query the table, the output 
schema has partition column at the end.

It's been a long time since Spark 2.1 was released and no one reports this 
behavior change. It seems this is really a corner case and makes me feel we 
should not compilcate our code too much for it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19471: [SPARK-22245][SQL] partitioned data set should always pu...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19471
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82633/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19471: [SPARK-22245][SQL] partitioned data set should always pu...

2017-10-11 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19471
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #19471: [SPARK-22245][SQL] partitioned data set should always pu...

2017-10-11 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/19471
  
Does this change affect some other tests for the overlapped cases like 
[DataStreamReaderWriterSuite](https://github.com/apache/spark/blob/655f6f86f84ff5241d1d20766e1ef83bb32ca5e0/sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataStreamReaderWriterSuite.scala#L550)
 and `OrcPartitionDiscoverySuite`? Since we already have some amount of these 
tests in multiple places, (I know you've already considered this aspect 
though) I'm a little worried about if this change in minor releases makes 
users confused.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

< 1 2 3 4 >

201 - 300 of 394 matches

Mail list logo