[GitHub] spark issue #13595: [MINOR][SQL] Standardize 'continuous queries' to 'stream...

2016-06-13 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/13595
  
Thanks! 
This patch introduced a compilation error because `DataFrameReader.text`'s 
return type had very recently been changed back to `DataFrame`; I should 
have noticed this and updated the patch accordingly. Sorry for the 
inconvenience.
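
(For context, a hedged sketch of the two reader signatures involved, assuming Spark 2.0's `DataFrameReader`, where `text` returns a `DataFrame` again and `textFile` is the typed variant; the path is illustrative:)

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object TextSignaturesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    // After the revert, text() yields an untyped DataFrame again ...
    val df: DataFrame = spark.read.text("/tmp/some_input")
    // ... while textFile() is the Dataset[String] variant.
    val ds: Dataset[String] = spark.read.textFile("/tmp/some_input")
    spark.stop()
  }
}
```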





[GitHub] spark issue #13606: [SPARK-15086] [CORE] [STREAMING] Deprecate old Java accu...

2016-06-10 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/13606
  
@srowen, the [streaming programming guide - accumulators-and-broadcast-variables](https://github.com/apache/spark/blob/1e2c9311871968426e019164b129652fd6d0037f/docs/streaming-programming-guide.md#accumulators-and-broadcast-variables) section might also need an update to reflect the code change here, thanks!
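
(For context, a minimal runnable sketch of the `AccumulatorV2`-backed `longAccumulator` helper that the deprecated Java accumulator API gives way to; the names and data here are illustrative, not the guide's actual example:)

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.LongAccumulator

object AccumulatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("acc").getOrCreate()
    val sc = spark.sparkContext

    // longAccumulator is the AccumulatorV2-backed replacement for the
    // deprecated sc.accumulator(0) shown in older guide examples:
    val droppedWords: LongAccumulator = sc.longAccumulator("DroppedWords")

    sc.parallelize(Seq("keep", "dropMe", "keep", "dropMe")).foreach { word =>
      if (word == "dropMe") droppedWords.add(1)
    }
    println(s"dropped: ${droppedWords.value}")  // dropped: 2
    spark.stop()
  }
}
```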





[GitHub] spark issue #13595: [MINOR][SQL] Standardize 'continuous queries' to 'stream...

2016-06-10 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/13595
  
@zsxwing @tdas, sure, this can wait. Thanks!





[GitHub] spark issue #13597: [SPARK-15871][SQL] Add `assertNotPartitioned` check in `...

2016-06-10 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/13597
  
@marmbrus @cloud-fan @zsxwing , would you mind taking a look? Thanks! :-)





[GitHub] spark issue #13595: [MINOR][SQL] Standardize 'continuous queries' to 'stream...

2016-06-10 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/13595
  
@tdas @zsxwing , would you mind taking a look? Thanks! :-)





[GitHub] spark pull request #13597: [SPARK-15871][SQL] Add `assertNotPartitioned` che...

2016-06-10 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13597#discussion_r66596861
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -572,8 +573,13 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
 
   private def assertNotBucketed(operation: String): Unit = {
     if (numBuckets.isDefined || sortColumnNames.isDefined) {
-      throw new IllegalArgumentException(
--- End diff --

@cloud-fan, would you mind if this patch changes `IllegalArgumentException` 
here to `AnalysisException`? In most places in this class we throw 
`AnalysisException`. :-)

This change should be OK, as this code has not actually shipped in any 
non-preview release.
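
(For concreteness, a hedged sketch of the change being proposed; this is the guard from the diff above as it might read inside `DataFrameWriter`, and the message wording is illustrative:)

```scala
// Sketch only; lives inside org.apache.spark.sql.DataFrameWriter, where
// AnalysisException (package-private constructor) can be thrown directly.
private def assertNotBucketed(operation: String): Unit = {
  if (numBuckets.isDefined || sortColumnNames.isDefined) {
    throw new AnalysisException(s"'$operation' does not support bucketing right now")
  }
}
```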





[GitHub] spark pull request #13597: [SPARK-15871][SQL] Add `assertNotPartitioned` che...

2016-06-10 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/13597

[SPARK-15871][SQL] Add `assertNotPartitioned` check in `DataFrameWriter`

## What changes were proposed in this pull request?

Sometimes it doesn't make sense to specify partitioning parameters, e.g. 
when we write data out from Datasets/DataFrames into `jdbc` tables or streaming 
`ForeachWriter`s.

This patch adds `assertNotPartitioned` check in `DataFrameWriter`.

## How was this patch tested?

New dedicated tests.
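
(A hedged sketch of the added guard, mirroring the existing `assertNotBucketed`; `partitioningColumns` is assumed to be the writer's field, and the message wording is illustrative:)

```scala
// Sketch only; would sit next to assertNotBucketed in
// org.apache.spark.sql.DataFrameWriter. partitioningColumns is assumed to
// be the writer's Option[Seq[String]] field.
private def assertNotPartitioned(operation: String): Unit = {
  if (partitioningColumns.isDefined) {
    throw new AnalysisException(s"'$operation' does not support partitioning")
  }
}
```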



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark add-assertNotPartitioned

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13597.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13597


commit 83b2230def85beaac685f77ba666117089bf955f
Author: Liwei Lin 
Date:   2016-06-10T10:50:56Z

Add assertNotPartitioned check







[GitHub] spark pull request #13595: [MINOR][SQL] Standardize 'continuous queries' to ...

2016-06-10 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13595#discussion_r66589007
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataFrameReaderWriterSuite.scala ---
@@ -371,66 +371,80 @@ class DataFrameReaderWriterSuite extends StreamTest with BeforeAndAfter {
 
   private def newTextInput = Utils.createTempDir(namePrefix = "text").getCanonicalPath
 
-  test("check trigger() can only be called on continuous queries") {
+  test("check trigger() can only be called on streaming Datasets/DataFrames") {
     val df = spark.read.text(newTextInput)
     val w = df.write.option("checkpointLocation", newMetadataDir)
     val e = intercept[AnalysisException](w.trigger(ProcessingTime("10 seconds")))
-    assert(e.getMessage == "trigger() can only be called on continuous queries;")
+    assert(e.getMessage == "trigger() can only be called on streaming Datasets/DataFrames;")
   }
 
-  test("check queryName() can only be called on continuous queries") {
+  test("check queryName() can only be called on streaming Datasets/DataFrames") {
     val df = spark.read.text(newTextInput)
     val w = df.write.option("checkpointLocation", newMetadataDir)
     val e = intercept[AnalysisException](w.queryName("queryName"))
-    assert(e.getMessage == "queryName() can only be called on continuous queries;")
+    assert(e.getMessage == "queryName() can only be called on streaming Datasets/DataFrames;")
   }
 
-  test("check startStream() can only be called on continuous queries") {
+  test("check startStream() can only be called on streaming Datasets/DataFrames") {
     val df = spark.read.text(newTextInput)
     val w = df.write.option("checkpointLocation", newMetadataDir)
     val e = intercept[AnalysisException](w.startStream())
-    assert(e.getMessage == "startStream() can only be called on continuous queries;")
+    assert(e.getMessage == "startStream() can only be called on streaming Datasets/DataFrames;")
   }
 
-  test("check startStream(path) can only be called on continuous queries") {
+  test("check startStream(path) can only be called on streaming Datasets/DataFrames") {
     val df = spark.read.text(newTextInput)
     val w = df.write.option("checkpointLocation", newMetadataDir)
     val e = intercept[AnalysisException](w.startStream("non_exist_path"))
-    assert(e.getMessage == "startStream() can only be called on continuous queries;")
+    assert(e.getMessage == "startStream() can only be called on streaming Datasets/DataFrames;")
   }
 
-  test("check mode(SaveMode) can only be called on non-continuous queries") {
+  test("check foreach() can only be called on streaming Datasets/DataFrames") {
+    val df = spark.read.text(newTextInput)
+    val w = df.write.option("checkpointLocation", newMetadataDir)
+    val foreachWriter = new ForeachWriter[String] {
+      override def open(partitionId: Long, version: Long): Boolean = false
+      override def process(value: String): Unit = {}
+      override def close(errorOrNull: Throwable): Unit = {}
+    }
+    val e = intercept[AnalysisException](w.foreach(foreachWriter))
+    Seq("foreach()", "streaming Datasets/DataFrames").foreach { s =>
+      assert(e.getMessage.toLowerCase.contains(s.toLowerCase))
+    }
+  }
+
--- End diff --

Here we add a new test checking that foreach() can only be called on streaming 
Datasets/DataFrames.





[GitHub] spark pull request #13595: [MINOR][SQL] Standardize 'continuous queries' to ...

2016-06-10 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13595#discussion_r66588898
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -433,8 +433,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
   @Experimental
   def foreach(writer: ForeachWriter[T]): ContinuousQuery = {
     assertNotBucketed("foreach")
-    assertStreaming(
-      "foreach() can only be called on streaming Datasets/DataFrames.")
+    assertStreaming("foreach() can only be called on streaming Datasets/DataFrames.")
 
--- End diff --

These two lines are merged into one.





[GitHub] spark pull request #13595: [MINOR][SQL] Standardize 'continuous queries' to ...

2016-06-10 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/13595

[MINOR][SQL] Standardize 'continuous queries' to 'streaming 
Datasets/DataFrames'

## What changes were proposed in this pull request?

This patch does some replacing (since `streaming Datasets/DataFrames` is 
the term we've chosen in 
[SPARK-15593](https://github.com/apache/spark/commit/00c310133df4f3893dd90d801168c2ab9841b102)):
 - `continuous queries` -> `streaming Datasets/DataFrames`
 - `non-continuous queries` -> `non-streaming Datasets/DataFrames`


This patch also adds `test("check foreach() can only be called on streaming 
Datasets/DataFrames")`.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark 
continuous-queries-to-streaming-dss-dfs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13595.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13595


commit 4780124b6541e663afe5bc84cf5785ffd6ec2874
Author: Liwei Lin 
Date:   2016-06-10T09:42:35Z

continuous/non-continuous queries -> streaming/non-streaming 
Datasets/DataFrames

commit 9322954380b4dc6bfebb9dd4156a4e8c04388107
Author: Liwei Lin 
Date:   2016-06-10T09:42:56Z

Add test







[GitHub] spark pull request #13575: [SPARK-15472][SQL] Add support for writing in `cs...

2016-06-09 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13575#discussion_r66467191
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala ---
@@ -120,24 +109,31 @@ class TextFileFormat extends FileFormat with DataSourceRegister {
       }
     }
   }
+
+  override def buildWriter(
+      sqlContext: SQLContext,
+      dataSchema: StructType,
+      options: Map[String, String]): OutputWriterFactory = {
+    verifySchema(dataSchema)
+    new StreamingTextOutputWriterFactory(
+      sqlContext.conf,
+      dataSchema,
+      sqlContext.sparkContext.hadoopConfiguration,
+      options)
+  }
 }
 
-class TextOutputWriter(path: String, dataSchema: StructType, context: TaskAttemptContext)
+/**
+ * Base TextOutputWriter class for 'batch' TextOutputWriter and 'streaming' TextOutputWriter. The
+ * writing logic to a single file resides in this base class.
+ */
+private[text] abstract class TextOutputWriterBase(context: TaskAttemptContext)
--- End diff --

This `TextOutputWriterBase` is basically the original `TextOutputWriter`.





[GitHub] spark pull request #13575: [SPARK-15472][SQL] Add support for writing in `cs...

2016-06-09 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13575#discussion_r66467095
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -488,7 +488,12 @@ private[sql] class ParquetOutputWriterFactory(
     // Custom ParquetOutputFormat that disable use of committer and writes to the given path
     val outputFormat = new ParquetOutputFormat[InternalRow]() {
       override def getOutputCommitter(c: TaskAttemptContext): OutputCommitter = { null }
-      override def getDefaultWorkFile(c: TaskAttemptContext, ext: String): Path = { new Path(path) }
+      override def getDefaultWorkFile(c: TaskAttemptContext, ext: String): Path = {
+        // It has the `.parquet` extension at the end because (de)compression tools
+        // such as gunzip would not be able to decompress this as the compression
+        // is not applied on this whole file but on each "page" in Parquet format.
+        new Path(s"$path$ext")
+      }
--- End diff --

This patch appends an extension to the assigned `path`; the new `path` would 
look like `some_path.gz.parquet`.





[GitHub] spark pull request #13575: [SPARK-15472][SQL] Add support for writing in `cs...

2016-06-09 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13575#discussion_r66466672
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala ---
@@ -146,16 +173,53 @@ class JsonFileFormat extends FileFormat with DataSourceRegister {
     }
   }
 
-  override def toString: String = "JSON"
-
   override def hashCode(): Int = getClass.hashCode()
 
   override def equals(other: Any): Boolean = other.isInstanceOf[JsonFileFormat]
 }
 
-private[json] class JsonOutputWriter(
-    path: String,
-    bucketId: Option[Int],
+/**
+ * A factory for generating [[OutputWriter]]s for writing json files. This is implemented
+ * differently from the 'batch' JsonOutputWriter, as it does not use any [[OutputCommitter]].
+ * It simply writes the data to the path used to generate the output writer. Callers of this
+ * factory have to ensure which files are to be considered as committed.
+ */
+private[json] class StreamingJsonOutputWriterFactory(
+    sqlConf: SQLConf,
+    dataSchema: StructType,
+    hadoopConf: Configuration,
+    options: Map[String, String]) extends StreamingOutputWriterFactory {
+
+  private val serializableConf = {
+    val conf = Job.getInstance(hadoopConf).getConfiguration
+    JsonFileFormat.prepareConfForWriting(conf, options)
+    new SerializableConfiguration(conf)
+  }
+
+  /**
+   * Returns an [[OutputWriter]] that writes data to the given path without using an
+   * [[OutputCommitter]].
+   */
+  override private[sql] def newWriter(path: String): OutputWriter = {
+    val hadoopTaskAttempId = new TaskAttemptID(new TaskID(new JobID, TaskType.MAP, 0), 0)
+    val hadoopAttemptContext =
+      new TaskAttemptContextImpl(serializableConf.value, hadoopTaskAttempId)
+    // Returns a 'streaming' JsonOutputWriter
+    new JsonOutputWriterBase(dataSchema, hadoopAttemptContext) {
+      override private[json] val recordWriter: RecordWriter[NullWritable, Text] =
+        createNoCommitterTextRecordWriter(
+          path,
+          hadoopAttemptContext,
+          (c: TaskAttemptContext, ext: String) => { new Path(s"$path.json$ext") })
+    }
+  }
+}
+
+/**
+ * Base JsonOutputWriter class for 'batch' JsonOutputWriter and 'streaming' JsonOutputWriter. The
+ * writing logic to a single file resides in this base class.
+ */
+private[json] abstract class JsonOutputWriterBase(
--- End diff --

This `JsonOutputWriterBase` is basically the original `JsonOutputWriter`.





[GitHub] spark pull request #13575: [SPARK-15472][SQL] Add support for writing in `cs...

2016-06-09 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13575#discussion_r66466502
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
@@ -143,39 +146,99 @@ object CSVRelation extends Logging {
       if (nonEmptyLines.hasNext) nonEmptyLines.drop(1)
     }
   }
+
+  /**
+   * Sets up the writing configuration in the given [[Configuration]], then returns the
+   * wrapped [[CSVOptions]].
+   * Both the continuous-query and the non-continuous-query writing paths call this function.
+   */
+  private[csv] def prepareConfForWriting(
+      conf: Configuration,
+      options: Map[String, String]): CSVOptions = {
+    val csvOptions = new CSVOptions(options)
+    csvOptions.compressionCodec.foreach { codec =>
+      CompressionCodecs.setCodecConfiguration(conf, codec)
+    }
+    csvOptions
+  }
 }
 
-private[sql] class CSVOutputWriterFactory(params: CSVOptions) extends OutputWriterFactory {
+/**
+ * A factory for generating OutputWriters for writing csv files. This is implemented
+ * differently from the 'batch' CSVOutputWriter, as it does not use any [[OutputCommitter]].
+ * It simply writes the data to the path used to generate the output writer. Callers of this
+ * factory have to ensure which files are to be considered as committed.
+ */
+private[csv] class StreamingCSVOutputWriterFactory(
+    sqlConf: SQLConf,
+    dataSchema: StructType,
+    hadoopConf: Configuration,
+    options: Map[String, String]) extends StreamingOutputWriterFactory {
+
+  private val (csvOptions: CSVOptions, serializableConf: SerializableConfiguration) = {
+    val conf = Job.getInstance(hadoopConf).getConfiguration
+    val csvOptions = CSVRelation.prepareConfForWriting(conf, options)
+    (csvOptions, new SerializableConfiguration(conf))
+  }
+
+  /**
+   * Returns an [[OutputWriter]] that writes data to the given path without using an
+   * [[OutputCommitter]].
+   */
+  override private[sql] def newWriter(path: String): OutputWriter = {
+    val hadoopTaskAttempId = new TaskAttemptID(new TaskID(new JobID, TaskType.MAP, 0), 0)
+    val hadoopAttemptContext =
+      new TaskAttemptContextImpl(serializableConf.value, hadoopTaskAttempId)
+    // Returns a 'streaming' CSVOutputWriter
+    new CSVOutputWriterBase(dataSchema, hadoopAttemptContext, csvOptions) {
+      override private[csv] val recordWriter: RecordWriter[NullWritable, Text] =
+        createNoCommitterTextRecordWriter(
+          path,
+          hadoopAttemptContext,
+          (c: TaskAttemptContext, ext: String) => { new Path(s"$path.csv$ext") })
+    }
+  }
+}
+
+private[csv] class BatchCSVOutputWriterFactory(params: CSVOptions) extends OutputWriterFactory {
   override def newInstance(
       path: String,
       bucketId: Option[Int],
       dataSchema: StructType,
       context: TaskAttemptContext): OutputWriter = {
     if (bucketId.isDefined) sys.error("csv doesn't support bucketing")
-    new CsvOutputWriter(path, dataSchema, context, params)
+    // Returns a 'batch' CSVOutputWriter
+    new CSVOutputWriterBase(dataSchema, context, params) {
+      private[csv] override val recordWriter: RecordWriter[NullWritable, Text] = {
+        new TextOutputFormat[NullWritable, Text]() {
+          override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = {
+            val conf = context.getConfiguration
+            val uniqueWriteJobId = conf.get(CreateDataSourceTableUtils.DATASOURCE_WRITEJOBUUID)
+            val taskAttemptId = context.getTaskAttemptID
+            val split = taskAttemptId.getTaskID.getId
+            new Path(path, f"part-r-$split%05d-$uniqueWriteJobId.csv$extension")
+          }
+        }.getRecordWriter(context)
+      }
+    }
  }
 }
 
-private[sql] class CsvOutputWriter(
-    path: String,
+/**
+ * Base CSVOutputWriter class for 'batch' CSVOutputWriter and 'streaming' CSVOutputWriter. The
+ * writing logic to a single file resides in this base class.
+ */
+private[csv] abstract class CSVOutputWriterBase(
--- End diff --

This `CSVOutputWriterBase` is basically the original `CsvOutputWriter`.



[GitHub] spark pull request #13575: [SPARK-15472][SQL] Add support for writing in `cs...

2016-06-09 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13575#discussion_r66466310
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala ---
@@ -143,39 +146,99 @@ object CSVRelation extends Logging {
       if (nonEmptyLines.hasNext) nonEmptyLines.drop(1)
     }
   }
+
+  /**
+   * Sets up the writing configuration in the given [[Configuration]], then returns the
+   * wrapped [[CSVOptions]].
+   * Both the continuous-query and the non-continuous-query writing paths call this function.
+   */
+  private[csv] def prepareConfForWriting(
+      conf: Configuration,
+      options: Map[String, String]): CSVOptions = {
+    val csvOptions = new CSVOptions(options)
+    csvOptions.compressionCodec.foreach { codec =>
+      CompressionCodecs.setCodecConfiguration(conf, codec)
+    }
+    csvOptions
+  }
--- End diff --

These are mostly moved here from `CSVFileFormat.prepareWrite()`.





[GitHub] spark pull request #13575: [SPARK-15472][SQL] Add support for writing in `cs...

2016-06-09 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13575#discussion_r66465201
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -246,7 +247,12 @@ case class DataSource(
       case s: StreamSinkProvider =>
         s.createSink(sparkSession.sqlContext, options, partitionColumns, outputMode)
 
-      case parquet: parquet.ParquetFileFormat =>
+      // TODO: Remove the `isInstanceOf` check when other formats have been ported
+      case fileFormat: FileFormat
+        if (fileFormat.isInstanceOf[CSVFileFormat]
+          || fileFormat.isInstanceOf[JsonFileFormat]
--- End diff --

@ScrapCodes, thanks! But I'm afraid that syntax would raise a compilation error:
```
[ERROR] .../datasources/DataSource.scala:250: illegal variable in pattern alternative
[ERROR]   case fileFormat: CSVFileFormat | JsonFileFormat | ParquetFileFormat | TextFileFormat =>
[ERROR]        ^
```
A work-around can be the following, but I found it somewhat less intuitive:
```scala
case fileFormat @ (_: CSVFileFormat |
                   _: JsonFileFormat |
                   _: ParquetFileFormat |
                   _: TextFileFormat) =>
  // other code
  ... fileFormat.asInstanceOf[FileFormat] ...
```
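
(A self-contained illustration of the same point, with hypothetical `Format` classes rather than Spark's: the binder sits outside the alternatives, so it is legal, but it keeps the scrutinee's static type and therefore needs a cast:)

```scala
sealed trait Format { def name: String }
final class CsvFormat extends Format { val name = "csv" }
final class JsonFormat extends Format { val name = "json" }
final class OrcFormat extends Format { val name = "orc" }

def describe(x: Any): String = x match {
  // Binding inside the alternatives (e.g. `case A(x) | B(x)`) is what the
  // compiler rejects; binding over them with @ is fine, but fmt is
  // statically Any here, hence the asInstanceOf.
  case fmt @ (_: CsvFormat | _: JsonFormat) =>
    s"ported format: ${fmt.asInstanceOf[Format].name}"
  case _ =>
    "not ported yet"
}

// describe(new CsvFormat)  // => "ported format: csv"
// describe(new OrcFormat)  // => "not ported yet"
```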





[GitHub] spark issue #13575: [SPARK-15472][SQL] Add support for writing in `csv`, `js...

2016-06-09 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/13575
  
@marmbrus @tdas @zsxwing , would you mind taking a look? Thanks!





[GitHub] spark pull request #13575: [SPARK-15472][SQL] Add support for writing in `cs...

2016-06-09 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/13575

[SPARK-15472][SQL] Add support for writing in `csv`, `json`, `text` formats 
in Structured Streaming

## What changes were proposed in this pull request?

This patch adds support for writing in `csv`, `json`, `text` formats in 
Structured Streaming:

**1. at a high level, this patch forms the following hierarchy** (`text` as an example):
```
             OutputWriter
                  ↑
        TextOutputWriterBase
           ↗            ↖
BatchTextOutputWriter   StreamingTextOutputWriter
```
```
            OutputWriterFactory
               ↗            ↖
BatchTextOutputWriterFactory   StreamingOutputWriterFactory
                                            ↑
                          StreamingTextOutputWriterFactory
```
The `StreamingTextOutputWriter` and other 'streaming' output writers would 
write data **without** using an `OutputCommitter`. This was the same approach 
taken by [SPARK-14716](https://github.com/apache/spark/pull/12409).
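
(For reference, a minimal self-contained sketch of this hierarchy; the stub parents stand in for Spark's `OutputWriter` / `OutputWriterFactory`, with constructors and bodies elided:)

```scala
// Stub parents; in Spark these live under org.apache.spark.sql.execution.datasources.
trait OutputWriter
trait OutputWriterFactory

// Shared single-file writing logic sits in the base class:
abstract class TextOutputWriterBase extends OutputWriter
final class BatchTextOutputWriter extends TextOutputWriterBase       // commits via an OutputCommitter
final class StreamingTextOutputWriter extends TextOutputWriterBase   // writes without a committer

abstract class StreamingOutputWriterFactory extends OutputWriterFactory
final class BatchTextOutputWriterFactory extends OutputWriterFactory
final class StreamingTextOutputWriterFactory extends StreamingOutputWriterFactory
```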

**2. to support compression, this patch attaches an extension to the path 
assigned by `FileStreamSink`**, which is slightly different from 
[SPARK-14716](https://github.com/apache/spark/pull/12409). For example, if we 
are writing out using the `gzip` compression and `FileStreamSink` assigns path 
`${uuid}` to a text writer, then in the end the file written out will be 
`${uuid}.txt.gz` -- so that when we read the file back, we'll correctly 
interpret it as `gzip` compressed.

## How was this patch tested?

`FileStreamSinkSuite` is expanded to cover the added `csv`, `json`, and `text` formats:

```scala
test(" csv - unpartitioned data - codecs: none/gzip")
test("json - unpartitioned data - codecs: none/gzip")
test("text - unpartitioned data - codecs: none/gzip")

test(" csv - partitioned data - codecs: none/gzip")
test("json - partitioned data - codecs: none/gzip")
test("text - partitioned data - codecs: none/gzip")

test(" csv - unpartitioned writing and batch reading - codecs: none/gzip")
test("json - unpartitioned writing and batch reading - codecs: none/gzip")
test("text - unpartitioned writing and batch reading - codecs: none/gzip")

test(" csv - partitioned writing and batch reading - codecs: none/gzip")
test("json - partitioned writing and batch reading - codecs: none/gzip")
test("text - partitioned writing and batch reading - codecs: none/gzip")
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark add-csv-json-text-in-ss

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13575.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13575


commit c70083e9f76c20f6bf48e7ec821452f9bf63783a
Author: Liwei Lin 
Date:   2016-06-05T09:03:04Z

Add csv, json, text

commit bc28f4112ca9eca6a9f1602a891dd0388fa3185c
Author: Liwei Lin 
Date:   2016-06-09T03:31:59Z

Fix parquet extension







[GitHub] spark pull request #13518: [WIP][SPARK-15472][SQL] Add support for writing i...

2016-06-09 Thread lw-lin
Github user lw-lin closed the pull request at:

https://github.com/apache/spark/pull/13518





[GitHub] spark issue #13518: [WIP][SPARK-15472][SQL] Add support for writing in `csv`...

2016-06-08 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/13518
  
Jenkins retest this please





[GitHub] spark issue #13518: [WIP][SPARK-15472][SQL] Add support for writing in `csv`...

2016-06-08 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/13518
  
Jenkins retest this please





[GitHub] spark pull request #13518: [WIP][SPARK-15472][SQL] Add support for writing i...

2016-06-08 Thread lw-lin
GitHub user lw-lin reopened a pull request:

https://github.com/apache/spark/pull/13518

[WIP][SPARK-15472][SQL] Add support for writing in `csv`, `json`, `text` 
formats in Structured Streaming

## What changes were proposed in this pull request?

This patch adds support for writing in `csv`, `json`, `text` formats in 
Structured Streaming:

**1. at a high level, this patch forms the following hierarchy** (`text` as an example):
```
             OutputWriter
                  ↑
        TextOutputWriterBase
           ↗            ↖
BatchTextOutputWriter   StreamingTextOutputWriter
```
```
            OutputWriterFactory
               ↗            ↖
BatchTextOutputWriterFactory   StreamingOutputWriterFactory
                                            ↑
                          StreamingTextOutputWriterFactory
```
The `StreamingTextOutputWriter` and other 'streaming' output writers would 
write data **without** using an `OutputCommitter`. This was the same approach 
taken by [SPARK-14716](https://github.com/apache/spark/pull/12409).

**2. to support compression, this patch attaches an extension to the path 
assigned by `FileStreamSink`**, which is slightly different from 
[SPARK-14716](https://github.com/apache/spark/pull/12409). For example, if we 
are writing out using the `gzip` compression and `FileStreamSink` assigns path 
`${uuid}` to a text writer, then in the end the file written out will be 
`${uuid}.txt.gz` -- so that when we read the file back, we'll correctly 
interpret it as `gzip` compressed.

## How was this patch tested?

`FileStreamSinkSuite` is expanded to cover the added `csv`, `json`, and `text` formats:

```scala
test(" csv - unpartitioned data - codecs: none/gzip")
test("json - unpartitioned data - codecs: none/gzip")
test("text - unpartitioned data - codecs: none/gzip")

test(" csv - partitioned data - codecs: none/gzip")
test("json - partitioned data - codecs: none/gzip")
test("text - partitioned data - codecs: none/gzip")

test(" csv - unpartitioned writing and batch reading - codecs: none/gzip")
test("json - unpartitioned writing and batch reading - codecs: none/gzip")
test("text - unpartitioned writing and batch reading - codecs: none/gzip")

test(" csv - partitioned writing and batch reading - codecs: none/gzip")
test("json - partitioned writing and batch reading - codecs: none/gzip")
test("text - partitioned writing and batch reading - codecs: none/gzip")
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark add-csv-json-text-for-ss

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13518.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13518


commit 97034f9aeb092b10e1606e60a8e6b4878ebd54cf
Author: Liwei Lin 
Date:   2016-06-05T09:03:04Z

Add csv, json, text

commit 2035b597b44aa519d8da3b155036446f88b3050e
Author: Liwei Lin 
Date:   2016-06-05T09:03:15Z

Fix parquet extension

commit 4737361489fd680405b291ec498ab91374685ffe
Author: Liwei Lin 
Date:   2016-06-05T11:52:14Z

Fix style

commit 90d02c4a10c14af83bbed985e36ef99a1edaa48b
Author: Liwei Lin 
Date:   2016-06-06T08:02:32Z

Fix tests

commit daec480bd16ed52137d32f332debe3806953f4d2
Author: Liwei Lin 
Date:   2016-06-06T09:03:08Z

Revert "Fix tests"

This reverts commit 90d02c4a10c14af83bbed985e36ef99a1edaa48b.

commit 43b68d426e9b64061095eca7a1db0e762843adef
Author: Liwei Lin 
Date:   2016-06-06T09:09:10Z

Fix tests

commit 56dbb9b4f0f7e2bf76935e0d1d2fc6c6cdf141ff
Author: Liwei Lin 
Date:   2016-06-06T12:34:07Z

Investigate test

commit 91e51aed5caf663d5068057e5cf28f21eb768310
Author: Liwei Lin 
Date:   2016-06-06T12:43:39Z

Investigate test

commit eb2090ce9fa04efbd23370f7d3e6cb98fd0b4c74
Author: Liwei Lin 
Date:   2016-06-06T13:00:26Z

Update run-tests

commit 1e9af1cef892706de8b07728c192dd8ca5e5851e
Author: Liwei Lin 
Date:   2016-06-07T04:10:34Z

Investigate test

commit f76b1d9fe5e4970ad66dfb6d4b2fed939b68728c
Author: Liwei Lin 
Date:   2016-06-07T04:49:12Z

Fix tests

commit 7fca579d9fec2589f13403b0eb0f2d5f5e6bd52a
Author: Liwei Lin 
Date:   2016-06-07T04:50:24Z

Fix tests

commit 2ead307d01d8f908951fbc059b8e49bbc77947b1
Author: Liwei Lin 
Date:   2016-06-07T05:02:55Z

Fix tests

commit b64afc64d3121479eb5c3f8c8b5663b6e05349b7
Author: Liwei Lin 
Date:   2016-06-07T05:35:34Z

Fix tests

commit 2c52c2c6ac961d0b71c2dd3802cd53d51ba1926a
Author: Liwei Lin 
Date:   2016-06-07

[GitHub] spark pull request #13518: [WIP][SPARK-15472][SQL] Add support for writing i...

2016-06-07 Thread lw-lin
Github user lw-lin closed the pull request at:

https://github.com/apache/spark/pull/13518





[GitHub] spark issue #13518: [WIP][SPARK-15472][SQL] Add support for writing in `csv`...

2016-06-06 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/13518
  
Jenkins retest this please





[GitHub] spark pull request #13518: [WIP][SPARK-15472][SQL] Add support for writing i...

2016-06-06 Thread lw-lin
GitHub user lw-lin reopened a pull request:

https://github.com/apache/spark/pull/13518

[WIP][SPARK-15472][SQL] Add support for writing in `csv`, `json`, `text` 
formats in Structured Streaming

## What changes were proposed in this pull request?

This patch adds support for writing in `csv`, `json`, `text` formats in 
Structured Streaming:

**1. at a high level, this patch forms the following hierarchy** (`text` as an example):
```
             OutputWriter
                  ↑
        TextOutputWriterBase
           ↗            ↖
BatchTextOutputWriter   StreamingTextOutputWriter
```
```
            OutputWriterFactory
               ↗            ↖
BatchTextOutputWriterFactory   StreamingOutputWriterFactory
                                            ↑
                          StreamingTextOutputWriterFactory
```
The `StreamingTextOutputWriter` and other 'streaming' output writers would 
write data **without** using an `OutputCommitter`. This was the same approach 
taken by [SPARK-14716](https://github.com/apache/spark/pull/12409).

**2. to support compression, this patch attaches an extension to the path 
assigned by `FileStreamSink`**, which is slightly different from 
[SPARK-14716](https://github.com/apache/spark/pull/12409). For example, if we 
are writing out using the `gzip` compression and `FileStreamSink` assigns path 
`${uuid}` to a text writer, then in the end the file written out will be 
`${uuid}.txt.gz` -- so that when we read the file back, we'll correctly 
interpret it as `gzip` compressed.

## How was this patch tested?

`FileStreamSinkSuite` is expanded to cover the added `csv`, `json`, and `text` formats:

```scala
test(" csv - unpartitioned data - codecs: none/gzip")
test("json - unpartitioned data - codecs: none/gzip")
test("text - unpartitioned data - codecs: none/gzip")

test(" csv - partitioned data - codecs: none/gzip")
test("json - partitioned data - codecs: none/gzip")
test("text - partitioned data - codecs: none/gzip")

test(" csv - unpartitioned writing and batch reading - codecs: none/gzip")
test("json - unpartitioned writing and batch reading - codecs: none/gzip")
test("text - unpartitioned writing and batch reading - codecs: none/gzip")

test(" csv - partitioned writing and batch reading - codecs: none/gzip")
test("json - partitioned writing and batch reading - codecs: none/gzip")
test("text - partitioned writing and batch reading - codecs: none/gzip")
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark add-csv-json-text-for-ss

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13518.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13518


commit 97034f9aeb092b10e1606e60a8e6b4878ebd54cf
Author: Liwei Lin 
Date:   2016-06-05T09:03:04Z

Add csv, json, text

commit 2035b597b44aa519d8da3b155036446f88b3050e
Author: Liwei Lin 
Date:   2016-06-05T09:03:15Z

Fix parquet extension

commit 4737361489fd680405b291ec498ab91374685ffe
Author: Liwei Lin 
Date:   2016-06-05T11:52:14Z

Fix style

commit 90d02c4a10c14af83bbed985e36ef99a1edaa48b
Author: Liwei Lin 
Date:   2016-06-06T08:02:32Z

Fix tests

commit daec480bd16ed52137d32f332debe3806953f4d2
Author: Liwei Lin 
Date:   2016-06-06T09:03:08Z

Revert "Fix tests"

This reverts commit 90d02c4a10c14af83bbed985e36ef99a1edaa48b.

commit 43b68d426e9b64061095eca7a1db0e762843adef
Author: Liwei Lin 
Date:   2016-06-06T09:09:10Z

Fix tests

commit 56dbb9b4f0f7e2bf76935e0d1d2fc6c6cdf141ff
Author: Liwei Lin 
Date:   2016-06-06T12:34:07Z

Investigate test

commit 91e51aed5caf663d5068057e5cf28f21eb768310
Author: Liwei Lin 
Date:   2016-06-06T12:43:39Z

Investigate test

commit eb2090ce9fa04efbd23370f7d3e6cb98fd0b4c74
Author: Liwei Lin 
Date:   2016-06-06T13:00:26Z

Update run-tests

commit 1e9af1cef892706de8b07728c192dd8ca5e5851e
Author: Liwei Lin 
Date:   2016-06-07T04:10:34Z

Investigate test







[GitHub] spark pull request #13518: [WIP][SPARK-15472][SQL] Add support for writing i...

2016-06-06 Thread lw-lin
Github user lw-lin closed the pull request at:

https://github.com/apache/spark/pull/13518





[GitHub] spark issue #13518: [SPARK-15472][SQL] Add support for writing in `csv`, `js...

2016-06-05 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/13518
  
Jenkins retest this please





[GitHub] spark pull request #13518: [SPARK-15472][SQL] Add support for writing in `cs...

2016-06-05 Thread lw-lin
GitHub user lw-lin reopened a pull request:

https://github.com/apache/spark/pull/13518

[SPARK-15472][SQL] Add support for writing in `csv`, `json`, `text` formats 
in Structured Streaming

## What changes were proposed in this pull request?

This patch adds support for writing in `csv`, `json`, `text` formats in 
Structured Streaming:

**1. at a high level, this patch forms the following hierarchy** (`text` as an example):
```
             OutputWriter
                  ↑
        TextOutputWriterBase
           ↗            ↖
BatchTextOutputWriter   StreamingTextOutputWriter
```
```
            OutputWriterFactory
               ↗            ↖
BatchTextOutputWriterFactory   StreamingOutputWriterFactory
                                            ↑
                          StreamingTextOutputWriterFactory
```
The `StreamingTextOutputWriter` and other 'streaming' output writers would 
write data **without** using an `OutputCommitter`. This was the same approach 
taken by [SPARK-14716](https://github.com/apache/spark/pull/12409).

**2. to support compression, this patch attaches an extension to the path 
assigned by `FileStreamSink`**, which is slightly different from 
[SPARK-14716](https://github.com/apache/spark/pull/12409). For example, if we 
are writing out using the `gzip` compression and `FileStreamSink` assigns path 
`${uuid}` to a text writer, then in the end the file written out will be 
`${uuid}.txt.gz` -- so that when we read the file back, we'll correctly 
interpret it as `gzip` compressed.

## How was this patch tested?

`FileStreamSinkSuite` is expanded to cover the added `csv`, `json`, and `text` formats:

```scala
test(" csv - unpartitioned data - codecs: none/gzip")
test("json - unpartitioned data - codecs: none/gzip")
test("text - unpartitioned data - codecs: none/gzip")

test(" csv - partitioned data - codecs: none/gzip")
test("json - partitioned data - codecs: none/gzip")
test("text - partitioned data - codecs: none/gzip")

test(" csv - unpartitioned writing and batch reading - codecs: none/gzip")
test("json - unpartitioned writing and batch reading - codecs: none/gzip")
test("text - unpartitioned writing and batch reading - codecs: none/gzip")

test(" csv - partitioned writing and batch reading - codecs: none/gzip")
test("json - partitioned writing and batch reading - codecs: none/gzip")
test("text - partitioned writing and batch reading - codecs: none/gzip")
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark add-csv-json-text-for-ss

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13518.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13518


commit 97034f9aeb092b10e1606e60a8e6b4878ebd54cf
Author: Liwei Lin 
Date:   2016-06-05T09:03:04Z

Add csv, json, text

commit 2035b597b44aa519d8da3b155036446f88b3050e
Author: Liwei Lin 
Date:   2016-06-05T09:03:15Z

Fix parquet extension

commit 4737361489fd680405b291ec498ab91374685ffe
Author: Liwei Lin 
Date:   2016-06-05T11:52:14Z

Fix style







[GitHub] spark pull request #13518: [SPARK-15472][SQL] Add support for writing in `cs...

2016-06-05 Thread lw-lin
Github user lw-lin closed the pull request at:

https://github.com/apache/spark/pull/13518





[GitHub] spark pull request #13258: [SPARK-15472][SQL][Streaming] Add partitioned `cs...

2016-06-05 Thread lw-lin
Github user lw-lin closed the pull request at:

https://github.com/apache/spark/pull/13258





[GitHub] spark pull request #13518: [SPARK-15472][SQL] Add support for writing in `cs...

2016-06-05 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/13518

[SPARK-15472][SQL] Add support for writing in `csv`, `json`, `text` formats 
in Structured Streaming

## What changes were proposed in this pull request?

This patch adds support for writing in `csv`, `json`, `text` formats in 
Structured Streaming:

**1. at a high level, this patch forms the following hierarchy** (`text` as an example):
```
             OutputWriter
                  ↑
        TextOutputWriterBase
           ↗            ↖
BatchTextOutputWriter   StreamingTextOutputWriter
```
```
            OutputWriterFactory
               ↗            ↖
BatchTextOutputWriterFactory   StreamingOutputWriterFactory
                                            ↑
                          StreamingTextOutputWriterFactory
```
The `StreamingTextOutputWriter` and other 'streaming' output writers would 
write data **without** using an `OutputCommitter`. This was the same approach 
taken by [SPARK-14716](https://github.com/apache/spark/pull/12409).

**2. to support compression, this patch attaches an extension to the path 
assigned by `FileStreamSink`.** For example, if we are writing out using the 
`gzip` compression and `FileStreamSink` assigns path `${uuid}` to a text 
writer, then in the end the file written out will be `${uuid}.txt.gz` -- so 
that when we read the file back, we'll correctly interpret it as `gzip` 
compressed.


## How was this patch tested?

`FileStreamSinkSuite` is expanded to cover the added `csv`, `json`, and `text` formats:

```scala
test(" csv - unpartitioned data - codecs: none/gzip")
test("json - unpartitioned data - codecs: none/gzip")
test("text - unpartitioned data - codecs: none/gzip")

test(" csv - partitioned data - codecs: none/gzip")
test("json - partitioned data - codecs: none/gzip")
test("text - partitioned data - codecs: none/gzip")

test(" csv - unpartitioned writing and batch reading - codecs: none/gzip")
test("json - unpartitioned writing and batch reading - codecs: none/gzip")
test("text - unpartitioned writing and batch reading - codecs: none/gzip")

test(" csv - partitioned writing and batch reading - codecs: none/gzip")
test("json - partitioned writing and batch reading - codecs: none/gzip")
test("text - partitioned writing and batch reading - codecs: none/gzip")
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark add-csv-json-text-for-ss

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13518.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13518


commit 97034f9aeb092b10e1606e60a8e6b4878ebd54cf
Author: Liwei Lin 
Date:   2016-06-05T09:03:04Z

Add csv, json, text

commit 2035b597b44aa519d8da3b155036446f88b3050e
Author: Liwei Lin 
Date:   2016-06-05T09:03:15Z

Fix parquet extension







[GitHub] spark pull request #13515: [MINOR] Fix Typos 'an -> a'

2016-06-05 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13515#discussion_r65812444
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -366,7 +366,7 @@ object SparkSubmit {
       }
     }
 
-    // In YARN mode for an R app, add the SparkR package archive and the R package
+    // In YARN mode for a R app, add the SparkR package archive and the R package
--- End diff --

Maybe we don't want to modify this one; "an R app" is actually correct here.





[GitHub] spark issue #13507: [SPARK-15765][SQL][Streaming] Make continuous Parquet wr...

2016-06-03 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/13507
  
@liancheng @tdas @zsxwing would you mind taking a look? Thanks!





[GitHub] spark pull request #13507: [SPARK-15765][SQL][Streaming] Make continuous Par...

2016-06-03 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/13507

[SPARK-15765][SQL][Streaming] Make continuous Parquet writing consistent 
with non-continuous Parquet writing

## What changes were proposed in this pull request?

Currently there is some code duplication between continuous Parquet writing (as 
in Structured Streaming) and non-continuous batch writing; see 
[ParquetFileFormat#prepareWrite()](https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L68)
 and 
[ParquetFileFormat#ParquetOutputWriterFactory](https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L414).

This may lead to inconsistent behavior when we change only one piece of 
code but not the other.

By extracting the common code out, this patch fixes the inconsistency. As a 
result, Structured Streaming now also enjoys 
[SPARK-15719](https://github.com/apache/spark/pull/13455).
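
As a hedged illustration of the refactoring idea (hypothetical names, not 
the actual Spark code): both write paths delegate to one shared helper, so a 
configuration change such as SPARK-15719's automatically applies to both.

```scala
object ParquetWriteConfSketch {
  // The single place for settings shared by both code paths; a change here
  // reaches the batch writer and the streaming writer alike.
  private def commonConf(extra: Map[String, String]): Map[String, String] =
    Map("parquet.compression" -> "snappy") ++ extra

  def batchWriteConf(): Map[String, String]     = commonConf(Map("mode" -> "batch"))
  def streamingWriteConf(): Map[String, String] = commonConf(Map("mode" -> "streaming"))
}
```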

## How was this patch tested?

Just code refactoring without any logic change; this should be covered by 
existing test suites.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark parquet-conf-deduplicate

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13507.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13507


commit 60a2c8ee7c610a783e65b78ac21e25661b84f49d
Author: Liwei Lin 
Date:   2016-06-03T14:31:56Z

Make continuous writing consistent with non-continuous writing







[GitHub] spark issue #13258: [SPARK-15472][SQL][Streaming] Add partitioned `csv`, `js...

2016-06-02 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/13258
  
@zsxwing, yeah, let me update this in the next day or so - I was waiting 
for https://github.com/apache/spark/pull/13431. Thanks for the reminder!





[GitHub] spark issue #12981: [WIP][SPARK-15208][Core][Streaming][Docs] Update Spark e...

2016-06-02 Thread lw-lin
Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/12981
  
@srowen, there are no other examples that need an update :-)





[GitHub] spark pull request: [WIP][SPARK-15208][Core][Streaming][Docs] Update Spark e...

2016-06-01 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12981
  
@rxin sure, I'll resolve the conflicts very soon; and once the Java APIs 
are updated in the next couple of days, I'll update the Java examples 
accordingly.

Thank you for bringing this up!





[GitHub] spark pull request: [SPARK-15472][SQL][Streaming] Add partitioned ...

2016-05-26 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/13258#issuecomment-222035774
  
@zsxwing sure, let's add an abstraction layer. I'll rebase and do this in the 
next two days or so. Thanks for the review! :-)





[GitHub] spark pull request: [SPARK-15472][SQL][Streaming] Add partitioned ...

2016-05-23 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/13258#issuecomment-220917708
  
@marmbrus @tdas @zsxwing would you mind taking a look? Thanks!





[GitHub] spark pull request: [SPARK-15472][SQL][Streaming] Add partitioned ...

2016-05-22 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/13258

[SPARK-15472][SQL][Streaming] Add partitioned `csv`, `json`, `text` format 
support for FileStreamSink

## What changes were proposed in this pull request?

This patch adds partitioned `csv`, `json`, `text` format support for 
`FileStreamSink`.

For each format, this patch adds a dedicated `OutputWriterFactory`, which 
would return a dedicated `OutputWriter` writing data without using an 
`OutputCommitter`. This is the same approach taken by 
[SPARK-14716](https://github.com/apache/spark/pull/12409) in which we had added 
partitioned `parquet` format support.
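
To make the pattern concrete, here is a minimal hypothetical sketch 
(simplified stand-ins, not Spark's internal `OutputWriterFactory` / 
`OutputWriter` interfaces): the writer appends straight to its final file, 
with no `OutputCommitter` in the loop, which is what a streaming sink needs.

```scala
import java.io.{BufferedWriter, File, FileWriter}

// Simplified stand-ins for the internal interfaces.
trait SketchOutputWriter { def write(row: String): Unit; def close(): Unit }
trait SketchOutputWriterFactory { def newInstance(path: String): SketchOutputWriter }

// A dedicated factory for the text format: rows go directly to the target
// file, bypassing any commit protocol.
class TextSketchWriterFactory extends SketchOutputWriterFactory {
  override def newInstance(path: String): SketchOutputWriter = new SketchOutputWriter {
    private val out = new BufferedWriter(new FileWriter(new File(path)))
    override def write(row: String): Unit = { out.write(row); out.newLine() }
    override def close(): Unit = out.close()
  }
}
```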

## How was this patch tested?
`FileStreamSinkSuite` is expanded to cover the added `csv`, `json`, `text` 
format:

```scala
test(" csv - unpartitioned data")
test("json - unpartitioned data")
test("text - unpartitioned data")

test(" csv - partitioned data")
test("json - partitioned data")
test("text - partitioned data")

test(" csv - unpartitioned writing and batch reading")
test("json - unpartitioned writing and batch reading")
test("text - unpartitioned writing and batch reading")

test(" csv - partitioned writing and batch reading")
test("json - partitioned writing and batch reading")
test("text - partitioned writing and batch reading")
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark add-cvs-json-text

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13258.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13258


commit f3bac77b307cc56aa6c468ef75ab255b5ad85459
Author: Liwei Lin 
Date:   2016-05-22T08:54:11Z

Add csv, json, text







[GitHub] spark pull request: [SPARK-15472][SQL][Streaming] Add support for ...

2016-05-22 Thread lw-lin
Github user lw-lin closed the pull request at:

https://github.com/apache/spark/pull/13251





[GitHub] spark pull request: [SPARK-15472][SQL][Streaming] Add support for ...

2016-05-22 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/13251#issuecomment-220833527
  
This is still WIP; will reopen soon.





[GitHub] spark pull request: [SPARK-15472][SQL][Streaming] Add support for ...

2016-05-22 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/13251#issuecomment-220826352
  
Jenkins retest this please





[GitHub] spark pull request: [SPARK-15472][SQL][Streaming] Add support for ...

2016-05-22 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/13251

[SPARK-15472][SQL][Streaming] Add support for partitioned `text` format in 
FileStreamSink

## What changes were proposed in this pull request?

This patch adds support for partitioned `text` format in FileStreamSink.

Specifically, this patch adds a new `TextOutputWriterFactory`, which would 
return an `OutputWriter` writing data without using an `OutputCommitter`.

 This is the same manner employed by 
[SPARK-14716](https://github.com/apache/spark/pull/12409) in which we had added 
support for partitioned `parquet` format.

## How was this patch tested?

The `FileStreamSinkSuite` is expanded to cover tests added for the 
partitioned `text` format.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark add-text-format

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13251.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13251


commit b6da2287b8d7924a978106b953a83dabad288980
Author: Liwei Lin 
Date:   2016-05-22T08:54:11Z

Some refactor

commit b58e97f97930a81183d737319ba431e6722905a5
Author: Liwei Lin 
Date:   2016-05-22T08:55:26Z

Add `text` format support for FileStreamSink







[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] Reduce delay bet...

2016-05-15 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12725#issuecomment-219336043
  
@marmbrus @zsxwing maybe this is ready to go? Thanks!





[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] Reduce delay bet...

2016-05-09 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12725#issuecomment-218016900
  
@zsxwing would you take another look? Thanks!





[GitHub] spark pull request: [WIP][SPARK-15208][Core][Docs] Update spark ex...

2016-05-07 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12981#issuecomment-217689734
  
Once the AccumulatorV2 Java API is finalized, I will update the Java & 
Python parts. Thanks!





[GitHub] spark pull request: [WIP][SPARK-15208][Core][Docs] Update spark ex...

2016-05-07 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/12981

[WIP][SPARK-15208][Core][Docs] Update spark examples with AccumulatorV2

## What changes were proposed in this pull request?

This patch updates the code & docs in the examples module as well as the 
related doc module (a usage sketch of the new API follows the list):

- [ ] [docs] `streaming-programming-guide.md`
  - [ ] scala code part
  - [ ] java code part
  - [ ] python code part
- [ ] [examples] `RecoverableNetworkWordCount.scala`
- [ ] [examples] `JavaRecoverableNetworkWordCount.java`
- [ ] [examples] `recoverable_network_wordcount.py`
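
As a hedged illustration (a hypothetical standalone snippet, not one of the 
example files above), this is the AccumulatorV2-style pattern the examples 
move to, assuming a Spark 2.x `SparkSession`:

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorV2ExampleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("acc-sketch").getOrCreate()
    val sc = spark.sparkContext

    // sc.longAccumulator returns an AccumulatorV2-based LongAccumulator,
    // replacing the deprecated sc.accumulator(...) used by the old examples.
    val blankLines = sc.longAccumulator("blankLines")
    sc.parallelize(Seq("spark", "", "streaming", "")).foreach { line =>
      if (line.isEmpty) blankLines.add(1)
    }
    println(s"blank lines: ${blankLines.value}") // blank lines: 2
    spark.stop()
  }
}
```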

## How was this patch tested?

Ran the examples and verified results manually.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark accumulatorV2-examples

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12981.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12981


commit 2d5198265b3df9c48af2be4bf572a3c815fc9bb1
Author: Liwei Lin 
Date:   2016-05-08T03:31:20Z

update scala example code & doc







[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] Reduce delay bet...

2016-05-06 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12725#issuecomment-217608390
  
I've addressed comments and expanded tests; @zsxwing would you mind taking 
another look? Thanks!





[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] Reduce delay bet...

2016-05-06 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12725#discussion_r62410545
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala
 ---
@@ -27,12 +27,12 @@ import org.apache.spark.sql.execution.{QueryExecution, 
SparkPlan, SparkPlanner,
  * A variant of [[QueryExecution]] that allows the execution of the given 
[[LogicalPlan]]
  * plan incrementally. Possibly preserving state in between each execution.
  */
-class IncrementalExecution(
+class IncrementalExecution private[sql](
 sparkSession: SparkSession,
 logicalPlan: LogicalPlan,
 outputMode: OutputMode,
 checkpointLocation: String,
-currentBatchId: Long)
+val currentBatchId: Long)
--- End diff --

expose this to tests





[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] Reduce delay bet...

2016-05-06 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12725#discussion_r62410548
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -122,7 +122,7 @@ class StreamExecution(
* processing is done.  Thus, the Nth record in this log indicated data 
that is currently being
* processed and the N-1th entry indicates which offsets have been 
durably committed to the sink.
*/
-  private val offsetLog =
+  private[sql] val offsetLog =
--- End diff --

expose this to test suites





[GitHub] spark pull request: [SPARK-15022][SPARK-15023][SQL][Streaming] Add...

2016-05-03 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12797#issuecomment-216705209
  
@marmbrus @zsxwing thank you for the patient review! :-)





[GitHub] spark pull request: [SPARK-15022][SPARK-15023][SQL][Streaming] Add...

2016-05-03 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12797#discussion_r61977204
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala
 ---
@@ -65,8 +65,13 @@ case class ProcessingTimeExecutor(processingTime: 
ProcessingTime, clock: Clock =
   s"${intervalMs} milliseconds, but spent ${realElapsedTimeMs} 
milliseconds")
   }
 
-  /** Return the next multiple of intervalMs */
+  /**
+   * Returns the start time in milliseconds for the next batch interval, given the current time.
+   * Note that a batch interval is inclusive with respect to its start time, and thus calling
+   * `nextBatchTime` with the result of a previous call should return the next interval. (i.e. given
+   * an interval of `100 ms`, `nextBatchTime(nextBatchTime(0)) = 200` rather than `0`).
--- End diff --

Thank you always for the prompt review!





[GitHub] spark pull request: [SPARK-15022][SPARK-15023][SQL][Streaming] Add...

2016-05-03 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12797#discussion_r61976675
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala
 ---
@@ -65,8 +65,13 @@ case class ProcessingTimeExecutor(processingTime: 
ProcessingTime, clock: Clock =
   s"${intervalMs} milliseconds, but spent ${realElapsedTimeMs} 
milliseconds")
   }
 
-  /** Return the next multiple of intervalMs */
+  /**
+   * Returns the start time in milliseconds for the next batch interval, given the current time.
+   * Note that a batch interval is inclusive with respect to its start time, and thus calling
+   * `nextBatchTime` with the result of a previous call should return the next interval. (i.e. given
+   * an interval of `100 ms`, `nextBatchTime(nextBatchTime(0)) = 200` rather than `0`).
--- End diff --

`nextBatchTime(0) = 100`, so `nextBatchTime(nextBatchTime(0)) = 200`?
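
As a quick standalone check of that property (my own sketch, assuming 
`intervalMs = 100` and the corrected formula from this patch):

```scala
object NextBatchTimeCheck extends App {
  // The corrected formula: now / intervalMs * intervalMs + intervalMs
  def nextBatchTime(now: Long, intervalMs: Long = 100L): Long =
    now / intervalMs * intervalMs + intervalMs

  assert(nextBatchTime(0) == 100)
  assert(nextBatchTime(99) == 100)
  assert(nextBatchTime(100) == 200)              // a boundary advances to the next interval
  assert(nextBatchTime(nextBatchTime(0)) == 200) // composing moves exactly one interval ahead
}
```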





[GitHub] spark pull request: [SPARK-15022][SPARK-15023][SQL][Streaming] Add...

2016-05-03 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12797#discussion_r61976475
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/StreamTest.scala ---
@@ -142,7 +142,10 @@ trait StreamTest extends QueryTest with Timeouts {
   case object StopStream extends StreamAction with StreamMustBeRunning
 
   /** Starts the stream, resuming if data has already been processed.  It 
must not be running. */
-  case object StartStream extends StreamAction
+  case class StartStream(trigger: Trigger = null, triggerClock: Clock = null) extends StreamAction
--- End diff --

This layer of `null`s was intended to delegate the default values of 
`StreamExecution` into these tests, so that we don't have to set the same 
default values in many places and maintain their consistency. But since it 
seems very unlikely that we would change the default values, I've removed 
the `null`s and followed your comments.

Thanks!





[GitHub] spark pull request: [SPARK-15022][SPARK-15023][SQL][Streaming] Add...

2016-05-02 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12797#discussion_r61837747
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ProcessingTimeExecutorSuite.scala
 ---
@@ -21,19 +21,41 @@ import java.util.concurrent.{CountDownLatch, TimeUnit}
 
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql.ProcessingTime
-import org.apache.spark.util.ManualClock
+import org.apache.spark.util.{Clock, ManualClock, SystemClock}
 
 class ProcessingTimeExecutorSuite extends SparkFunSuite {
 
   test("nextBatchTime") {
 val processingTimeExecutor = 
ProcessingTimeExecutor(ProcessingTime(100))
+assert(processingTimeExecutor.nextBatchTime(0) === 100)
 assert(processingTimeExecutor.nextBatchTime(1) === 100)
 assert(processingTimeExecutor.nextBatchTime(99) === 100)
-assert(processingTimeExecutor.nextBatchTime(100) === 100)
+assert(processingTimeExecutor.nextBatchTime(100) === 200)
 assert(processingTimeExecutor.nextBatchTime(101) === 200)
 assert(processingTimeExecutor.nextBatchTime(150) === 200)
   }
 
+  private def testNextBatchTimeAgainstClock(clock: Clock) {
+val IntervalMS = 100
--- End diff --

Sure; let me fix this typo





[GitHub] spark pull request: [SPARK-15022][SPARK-15023][SQL][Streaming] Add...

2016-05-02 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12797#discussion_r61837727
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala
 ---
@@ -65,8 +65,22 @@ case class ProcessingTimeExecutor(processingTime: 
ProcessingTime, clock: Clock =
   s"${intervalMs} milliseconds, but spent ${realElapsedTimeMs} 
milliseconds")
   }
 
-  /** Return the next multiple of intervalMs */
+  /** Return the next multiple of intervalMs
+   *
+   * e.g. for intervalMs = 100
+   * nextBatchTime(0) = 100
+   * nextBatchTime(1) = 100
+   * ...
+   * nextBatchTime(99) = 100
+   * nextBatchTime(100) = 200
+   * nextBatchTime(101) = 200
+   * ...
+   * nextBatchTime(199) = 200
+   * nextBatchTime(200) = 300
+   *
+   * Note, this way, we'll get nextBatchTime(nextBatchTime(0)) = 200, rather than = 0
--- End diff --

Let me update it with your much clearer version! Thanks!





[GitHub] spark pull request: [SPARK-15022][SPARK-15023][SQL][Streaming] Add...

2016-05-02 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12797#discussion_r61837469
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala
 ---
@@ -65,8 +65,22 @@ case class ProcessingTimeExecutor(processingTime: 
ProcessingTime, clock: Clock =
   s"${intervalMs} milliseconds, but spent ${realElapsedTimeMs} 
milliseconds")
   }
 
-  /** Return the next multiple of intervalMs */
+  /** Return the next multiple of intervalMs
+   *
+   * e.g. for intervalMs = 100
+   * nextBatchTime(0) = 100
+   * nextBatchTime(1) = 100
+   * ...
+   * nextBatchTime(99) = 100
+   * nextBatchTime(100) = 200
+   * nextBatchTime(101) = 200
+   * ...
+   * nextBatchTime(199) = 200
+   * nextBatchTime(200) = 300
+   *
+   * Note, this way, we'll get nextBatchTime(nextBatchTime(0)) = 200, rather than = 0
+   * */
   def nextBatchTime(now: Long): Long = {
-(now - 1) / intervalMs * intervalMs + intervalMs
+now / intervalMs * intervalMs + intervalMs
--- End diff --

@zsxwing thanks for clarifying on this! :-)

[1]
The issue is triggered when both `batchElapsedTimeMs == 0` and 
`batchEndTimeMs` is a multiple of `intervalMs` hold; e.g. `batchStartTimeMs 
== 50` and `batchEndTimeMs == 50` given `intervalMs == 100` won't trigger 
the issue. So, we might have to do something like this:

```scala
if (batchElapsedTimeMs == 0 && batchEndTimeMs % intervalMs == 0) {
  clock.waitTillTime(batchEndTimeMs + intervalMs)
} else {
  clock.waitTillTime(nextBatchTime(batchEndTimeMs))
}
```

To me it seems a little hard to interpret...

[2]
>
... deal with one case: If a batch takes exactly intervalMs, we should run 
the next batch at once instead of sleeping intervalMs

This is a good point! I've done some calculations based on your comments, 
and it seems we would still run the next batch at once when the last job takes 
exactly `intervalMs`?

prior to this patch:
```
batch  | job
-
[  0,  99] |
[100, 199] | job x starts at 100, stops at 199, takes 100
[200, 299] |
```
after this patch, it's still the same:
```
batch  | job
-
[  0,  99] |
[100, 199] | job y starts at 100, stops at 199, takes 100
[200, 299] |
```
--
@zsxwing thoughts on the above [1] and [2]? Thanks! :-)








[GitHub] spark pull request: [SPARK-14884][SQL][Streaming][WebUI] Fix call ...

2016-05-02 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12650#issuecomment-216427977
  
@zsxwing I've made updates per your comments; would you take another look? 
Thanks!





[GitHub] spark pull request: [SPARK-14884][SQL][Streaming][WebUI] Fix call ...

2016-05-02 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12650#discussion_r61835377
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala ---
@@ -46,7 +45,11 @@ private[sql] object SQLExecution {
   val executionId = SQLExecution.nextExecutionId
   sc.setLocalProperty(EXECUTION_ID_KEY, executionId.toString)
   val r = try {
-val callSite = Utils.getCallSite()
+// We first try to pick up any call site that was set previously, then fall
+// back to Utils.getCallSite(), because calling Utils.getCallSite() on
+// continuous queries directly would give us a call site like "run at :0"
+val callSite = sparkSession.sparkContext.getCallSite()
--- End diff --

@zsxwing sorry for the vague comments. The fallback logic is contained in 
`sparkSession.sparkContext.getCallSite()` itself. I've updated the comments 
accordingly.
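
A hypothetical distillation of that fallback (illustrative only, not the 
actual `SparkContext.getCallSite()` implementation): prefer a call site 
recorded earlier, and compute one from the current stack only when nothing 
was recorded.

```scala
// `recorded` would come from e.g. a local property set when the query was
// started; `compute` stands in for the expensive stack-walking fallback.
def pickCallSite(recorded: Option[String])(compute: => String): String =
  recorded.filter(_.nonEmpty).getOrElse(compute)
```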





[GitHub] spark pull request: [SPARK-14884][SQL][Streaming][WebUI] Fix call ...

2016-05-02 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12650#discussion_r61835312
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -99,7 +99,15 @@ class StreamExecution(
   /** The thread that runs the micro-batches of this stream. */
   private[sql] val microBatchThread =
 new UninterruptibleThread(s"stream execution thread for $name") {
-  override def run(): Unit = { runBatches() }
+
+  private val callSite = Utils.getCallSite()
--- End diff --

@zsxwing making it a field of `StreamExecution` is surely more 
reasonable; I've updated this in the new commit. Thanks!





[GitHub] spark pull request: [SPARK-15022][SPARK-15023] Add support for tes...

2016-04-29 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12797#issuecomment-215928373
  
@marmbrus @tdas @zsxwing would you mind taking a look? Thanks!





[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] Reduce delay bet...

2016-04-29 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12725#issuecomment-215927916
  
To keep things easy to review, I've added the manually timed executor for 
testing general cases in [a separate 
PR](https://github.com/apache/spark/pull/12797). When that patch gets 
merged, I'll add some dedicated tests here for this patch.





[GitHub] spark pull request: [SPARK-15022][SPARK-15023] Add support for tes...

2016-04-29 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12797#discussion_r61663843
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala ---
@@ -136,6 +136,22 @@ class StreamSuite extends StreamTest with 
SharedSQLContext {
   testStream(ds)()
 }
   }
+
+  // This would fail for now -- error is "Timed out waiting for stream"
+  // Root cause is that data generated in batch 0 may not get processed in batch 1.
+  // Let's enable this after SPARK-14942: Reduce delay between batch construction and execution.
+  ignore("minimize delay between batch construction and execution") {
+val inputData = MemoryStream[Int]
+testStream(inputData.toDS())(
+  StartStream(ProcessingTime("10 seconds"), new ManualClock),
+  /* -- batch 0 --- */
+  AddData(inputData, 1),
+  AddData(inputData, 2),
+  AddData(inputData, 3),
+  AdvanceManualClock(10 * 1000), // 10 seconds
+  /* -- batch 1 --- */
+  CheckAnswer(1, 2, 3))
+  }
--- End diff --

The above test takes advantage of the new `StartStream` and 
`AdvanceManualClock` actions.
It tests against `SPARK-14942: Reduce delay between batch construction 
and execution` with a manually timed executor.





[GitHub] spark pull request: [SPARK-15022][SPARK-15023] Add support for tes...

2016-04-29 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12797#discussion_r61663787
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ProcessingTimeExecutorSuite.scala
 ---
@@ -21,19 +21,41 @@ import java.util.concurrent.{CountDownLatch, TimeUnit}
 
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql.ProcessingTime
-import org.apache.spark.util.ManualClock
+import org.apache.spark.util.{Clock, ManualClock, SystemClock}
 
 class ProcessingTimeExecutorSuite extends SparkFunSuite {
 
   test("nextBatchTime") {
 val processingTimeExecutor = 
ProcessingTimeExecutor(ProcessingTime(100))
+assert(processingTimeExecutor.nextBatchTime(0) === 100)
 assert(processingTimeExecutor.nextBatchTime(1) === 100)
 assert(processingTimeExecutor.nextBatchTime(99) === 100)
-assert(processingTimeExecutor.nextBatchTime(100) === 100)
+assert(processingTimeExecutor.nextBatchTime(100) === 200)
 assert(processingTimeExecutor.nextBatchTime(101) === 200)
 assert(processingTimeExecutor.nextBatchTime(150) === 200)
   }
 
+  private def testNextBatchTimeAgainstClock(clock: Clock) {
+val IntervalMS = 100
+val processingTimeExecutor = 
ProcessingTimeExecutor(ProcessingTime(IntervalMS), clock)
+
+val ITERATION = 10
+var nextBatchTime: Long = 0
+for (it <- 1 to ITERATION)
+  nextBatchTime = processingTimeExecutor.nextBatchTime(nextBatchTime)
+
+// nextBatchTime should be 1000
+assert(nextBatchTime === IntervalMS * ITERATION)
+  }
+
+  test("nextBatchTime against SystemClock") {
+testNextBatchTimeAgainstClock(new SystemClock)
+  }
+
+  test("nextBatchTime against ManualClock") {
--- End diff --

Please note that without this patch the `ProcessingTimeExecutor` issue would 
make this test fail; with this patch, it passes.





[GitHub] spark pull request: [SPARK-15022][SPARK-15023] Add support for tes...

2016-04-29 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/12797

[SPARK-15022][SPARK-15023] Add support for testing against the 
`ProcessingTime(intervalMs > 0)` trigger and `ManualClock`

## What changes were proposed in this pull request?

Currently in `StreamTest`, we have a `StartStream` which will start a 
streaming query against the trigger `ProcessingTime(intervalMs = 0)` and 
`SystemClock`.

We also need to test cases against `ProcessingTime(intervalMs > 0)`, which 
often requires a `ManualClock`.

This patch:
- fixes an issue in `ProcessingTimeExecutor`, where for a single batch it 
should run `batchRunner` only once but might run it multiple times under 
certain conditions;
- adds support for testing against the `ProcessingTime(intervalMs > 0)` 
trigger and `ManualClock`, by specifying them as fields of `StartStream` and 
by adding an `AdvanceManualClock` action (a sketch of such a clock follows 
this list);
- adds a test, which takes advantage of the new `StartStream` and 
`AdvanceManualClock`, to test against [PR#[SPARK-14942] Reduce delay between 
batch construction and execution](https://github.com/apache/spark/pull/12725).
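
For reference, a stripped-down sketch of a manual clock in the spirit of 
`org.apache.spark.util.ManualClock` (simplified, not the actual class): 
tests advance it explicitly, so trigger timing becomes deterministic.

```scala
class ManualClockSketch(private var now: Long = 0L) {
  def getTimeMillis(): Long = synchronized { now }

  // Called by the test (e.g. an AdvanceManualClock action) to move time forward.
  def advance(ms: Long): Unit = synchronized { now += ms; notifyAll() }

  // Called by the trigger executor; blocks until a test thread advances the clock.
  def waitTillTime(target: Long): Long = synchronized {
    while (now < target) wait()
    now
  }
}
```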

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark add-trigger-test-support

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12797.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12797


commit 90ed69285fc34ae43a0f454ceb25837618212e28
Author: Liwei Lin 
Date:   2016-04-30T01:32:48Z

fix an issue of nextBatchTime against ManualClock

commit 9d80b15e33151f5f87207d72f89f016c18a21b01
Author: Liwei Lin 
Date:   2016-04-30T01:33:48Z

Add support for testing against ProcessingTime(intervalMS>0)







[GitHub] spark pull request: [SPARK-15022][SPARK-15023] Add support for tes...

2016-04-29 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12797#issuecomment-215924685
  
still editing; will explain why `ProcessingTimeExecutor`, which should run 
`batchRunner` only once per batch, might run it multiple times under certain 
conditions





[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] Reduce delay bet...

2016-04-28 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12725#issuecomment-215405018
  
Sure, I'll add a manually timed executor and some dedicated tests as well.





[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...

2016-04-26 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12681#issuecomment-214983254
  
@davies thanks for the review & merging  :-)





[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] First construct ...

2016-04-26 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12725#issuecomment-214979767
  
@marmbrus @tdas @zsxwing would you mind taking a look? Thanks!





[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] First construct ...

2016-04-26 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/12725

[SPARK-14942][SQL][Streaming] First construct a batch then run the batch 
for continuous queries

## Problem

Currently in `StreamExecution`, we first run the batch, then construct the 
next:
```scala
if (dataAvailable) runBatch()
constructNextBatch()
```

This is good if we run batches ASAP, where data would get processed in the 
**very next batch**:


![1](https://cloud.githubusercontent.com/assets/15843379/14779964/2786e698-0b0d-11e6-9d2c-bb41513488b2.png)

However, if we run batches at a trigger like `ProcessingTime("1 minute")`, 
data - such as _y_ below - may not get processed in the very next batch, 
i.e. _batch 1_, but in _batch 2_:


![2](https://cloud.githubusercontent.com/assets/15843379/14779818/6f3bb064-0b0c-11e6-9f16-c1ce4897186b.png)

## What changes were proposed in this pull request?

This patch reverses the order of `constructNextBatch()` and `runBatch()`. 
After this patch, data would get processed in the **very next batch**, i.e. 
_batch 1_:


![3](https://cloud.githubusercontent.com/assets/15843379/14779816/6f36ee62-0b0c-11e6-9e53-bc8397fade18.png)

In addition, this patch alters when we do `currentBatchId += 1`: let's do 
that when the processing of the current batch of data is complete, so we won't 
bother passing `currentBatchId + 1` or  `currentBatchId - 1` to states or sinks.
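
Putting the pieces together, a minimal runnable sketch of the reordered loop 
(hypothetical stand-ins; the real `StreamExecution` does far more bookkeeping):

```scala
object StreamLoopSketch {
  private var isActive = true
  private var currentBatchId = 0L
  private var pending = 3 // pretend three rounds of data arrive

  private def constructNextBatch(): Unit = println(s"construct batch $currentBatchId")
  private def dataAvailable: Boolean = pending > 0
  private def runBatch(): Unit = { println(s"run batch $currentBatchId"); pending -= 1 }

  def main(args: Array[String]): Unit = {
    while (isActive) {
      constructNextBatch()           // 1. commit the offsets for the upcoming batch first
      if (dataAvailable) runBatch()  // 2. then process exactly the data just committed
      currentBatchId += 1            // 3. bump the id once this batch completes
      if (!dataAvailable) isActive = false
    }
  }
}
```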


## How was this patch tested?

This should be covered by existing test suites, including stress tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark construct-before-run-3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12725.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12725


commit 8c8d73a85550349faafef1cbde1c7094d384809e
Author: Liwei Lin 
Date:   2016-04-27T02:19:14Z

constructNextBatch() before runBatch()







[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...

2016-04-26 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12638#issuecomment-214943354
  
just rebased onto master to resolve some conflicts





[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...

2016-04-26 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12638#discussion_r61193722
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -88,7 +88,7 @@ class FileStreamSource(
   }
 
   /**
-   * Returns the next batch of data that is available after `start`, if 
any is available.
--- End diff --

This doc should be updated, following `Source.getBatch()`'s doc change from 
`Returns the next batch of data that is available after start, if any is 
available.` to `Returns the data that is between the offsets (start, end].`





[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...

2016-04-26 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12681#issuecomment-214939654
  
Jenkins retest this please





[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...

2016-04-26 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12681#issuecomment-214753491
  
Some flaky tests unrelated to this PR

Jenkins retest this please





[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...

2016-04-26 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12638#issuecomment-214714643
  
Jenkins retest this please





[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...

2016-04-26 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12681#issuecomment-214713457
  
Jenkins retest this please





[GitHub] spark pull request: [SPARK-14747][SQL] Add assertStreaming/assertN...

2016-04-26 Thread lw-lin
Github user lw-lin commented on a diff in the pull request:

https://github.com/apache/spark/pull/12521#discussion_r61041131
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/DataFrameReaderWriterSuite.scala
 ---
@@ -368,4 +368,79 @@ class DataFrameReaderWriterSuite extends StreamTest 
with SharedSQLContext with B
   "org.apache.spark.sql.streaming.test",
   Map.empty)
   }
+
+  test("check continuous query methods should not be called on 
non-continuous queries") {
--- End diff --

updated - thanks!





[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...

2016-04-25 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12681#issuecomment-214617805
  
Some flaky tests not caused by this PR.

Jenkins retest this please





[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...

2016-04-25 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12638#issuecomment-214609215
  
Some build issues unrelated to this PR.

Jenkins retest this please





[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...

2016-04-25 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12638#issuecomment-214606875
  
@marmbrus thanks for the patient reminder!

Since I've reverted the renaming, and I've checked there's no other 
completely unused class under the `o.a.s.sql.execution.streaming` package, 
maybe this is ready to go (pending tests). So would you mind taking another 
look?





[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...

2016-04-25 Thread lw-lin
GitHub user lw-lin reopened a pull request:

https://github.com/apache/spark/pull/12638

[SPARK-14874][SQL][Streaming] Remove the obsolete Batch representation

## What changes were proposed in this pull request?

The `Batch` class, which had been used to indicate progress in a stream, 
was abandoned by [[SPARK-13985][SQL] Deterministic batches with 
ids](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b)
 and then became useless.

This patch:
- removes the `Batch` class
- renames `getBatch(...)` to `getData(...)` for `Source`: 
 - prior to 
[SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b),
 it was: get**_NextBatch_**(start: Option[Offset]): **_Option[Batch]_**
 - after  
[SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b),
 it became: get**_Batch_**(start: Option[Offset], end: Offset): **_DataFrame_**
 - proposed in this patch: get**_Data_**(start: Option[Offset], end: 
Offset): DataFrame
- renames `addBatch(...)` to `addData(...)` for `Sink`:
 - prior to 
[SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b),
 it was: addBatch(**_batch: Batch_**)
 - after  
[SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b),
 it became: addBatch(batchId: Long, **_data: DataFrame_**)
 - proposed in this patch: add**_Data_**(batchId: Long, data: DataFrame)

The renaming of public methods should be OK since they have not been in any 
release yet.
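
For reference, here is a minimal sketch of what the renamed interfaces would look like (simplified; the real traits carry more members, e.g. `schema` and `getOffset` on `Source`):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Offset

trait Source {
  // Formerly getBatch(...): returns the data between the `start` offset
  // (exclusive; None means from the very beginning) and the `end` offset.
  def getData(start: Option[Offset], end: Offset): DataFrame
}

trait Sink {
  // Formerly addBatch(...): adds the data of the batch identified by `batchId`.
  def addData(batchId: Long, data: DataFrame): Unit
}
```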

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark remove-batch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12638.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12638


commit c79cba9059b7ac2d6398c81b57ceece50b6b7526
Author: Liwei Lin 
Date:   2016-04-23T10:15:51Z

remove the useless Batch class

commit 60aaf97c30f55f5791ec53f81f3ed90481586fed
Author: Liwei Lin 
Date:   2016-04-26T04:03:10Z

revert renaming

commit 14e690037fbfc7840672e9a78b9abbccbc07c3ee
Author: Liwei Lin 
Date:   2016-04-26T04:03:23Z

revert renaming







[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...

2016-04-25 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12681#issuecomment-214605511
  
@davies (who made the first change) might want to take a look?





[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...

2016-04-25 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12681#issuecomment-214601397
  
Actually this wouldn't cause any problem and wouldn't fail any test suites 
**_for now_**, because the read of `acquiredButNotUsed` is guaranteed to see 
the most recent value due to the existing `synchronized(this) {consumers...}` block 
before the read of `acquiredButNotUsed`. It is a kind of ["piggybacking" on 
synchronization](http://www.javamex.com/tutorials/synchronization_piggyback.shtml)
 -- but let's not rely on this, because it's vulnerable to future code changes. 
:-)
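
To make the pattern concrete, here is a self-contained sketch (not the actual Spark code, just the shape of the pattern) of the fragile "piggybacking" at play:

```scala
class PiggybackSketch {
  private val consumers = new java.util.ArrayList[AnyRef]()
  private var acquiredButNotUsed: Long = 0L

  def acquire(bytes: Long): Unit = synchronized {
    acquiredButNotUsed += bytes  // write guarded by the `this` lock
  }

  def cleanUpAllAllocatedMemory(): Long = {
    synchronized {
      consumers.clear()  // the later read "piggybacks" on this block's memory effects
    }
    acquiredButNotUsed   // unsynchronized read -- visible today, fragile tomorrow
  }
}
```

If someone later removes or reorders the `synchronized` block, the read loses its visibility guarantee without any compile-time warning.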





[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...

2016-04-25 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/12681

[SPARK-14911][Core] Fix a potential data race in TaskMemoryManager

## What changes were proposed in this pull request?

[[SPARK-13210][SQL] catch OOM when allocate memory and expand 
array](https://github.com/apache/spark/commit/37bc203c8dd5022cb11d53b697c28a737ee85bcc)
 introduced an `acquiredButNotUsed` field, but it might not be correctly 
synchronized:
- the write `acquiredButNotUsed += acquired` is guarded by `this` lock (see 
[here](https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L271));
- the read `memoryManager.releaseExecutionMemory(acquiredButNotUsed, 
taskAttemptId, tungstenMemoryMode)` (see 
[here](https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L400))
 might not be correctly synchronized, and thus might not see 
`acquiredButNotUsed`'s newly written value.

This patch makes `acquiredButNotUsed` volatile to fix this.
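
A minimal sketch of the shape of the fix (in Scala, though the real class is Java, and simplified from the actual code):

```scala
class TaskMemoryManagerSketch {
  @volatile private var acquiredButNotUsed: Long = 0L

  def recordAcquired(acquired: Long): Unit = synchronized {
    acquiredButNotUsed += acquired  // write still guarded by the `this` lock
  }

  // A volatile read always sees the most recent write, independent of any
  // other synchronized blocks that happen to precede it.
  def cleanUpAllAllocatedMemory(): Long =
    acquiredButNotUsed
}
```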

## How was this patch tested?

This should be covered by existing test suites.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark fix-acquiredButNotUsed

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12681.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12681


commit 6b72b963d54855771dcabc1fca8ed963be28303c
Author: Liwei Lin 
Date:   2016-04-26T03:11:53Z

fix a potential data race in TaskMemoryManager







[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...

2016-04-25 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12638#issuecomment-214586964
  
Sure, so I'm closing this PR since the removal itself is not worth 
committers' time to process.
@marmbrus thanks for the review!





[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...

2016-04-25 Thread lw-lin
Github user lw-lin closed the pull request at:

https://github.com/apache/spark/pull/12638





[GitHub] spark pull request: [SPARK-14747][SQL] Add assertStreaming/assertN...

2016-04-25 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12521#issuecomment-214585357
  
Updates: thanks to [[SPARK-14473][SQL] Define analysis rules to catch 
operations not supported in 
streaming](https://github.com/apache/spark/commit/775cf17eaaae1a38efe47b282b1d6bbdb99bd759),
 now we have friendly messages for most of the incorrect usages:
> Exception in thread "main" org.apache.spark.sql.AnalysisException: 
Queries with streaming sources must be executed with write.startStream();

and

> Exception in thread "main" org.apache.spark.sql.AnalysisException: 
Queries without streaming sources cannot be executed with write.startStream();

-

That leaves this patch to capture other incorrect usages, such as calling 
`.trigger()`/`.queryName()` on non-continuous queries, or calling 
`bucketBy()`/`sortBy()` on continuous queries.





[GitHub] spark pull request: [SPARK-14884][SQL][Streaming][WebUI] Fix call ...

2016-04-24 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12650#issuecomment-213919607
  
@andrewor14 @zsxwing would you mind taking a look? Thanks! :-)





[GitHub] spark pull request: [SPARK-14884][SQL][Streaming][WebUI] Fix call ...

2016-04-24 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/12650

[SPARK-14884][SQL][Streaming][WebUI] Fix call site for continuous queries

## What changes were proposed in this pull request?

Since we've been processing continuous queries in separate threads, the 
call sites show up as `run at <unknown>:0`. That's not wrong, but it provides very 
little information; in addition, we cannot distinguish two queries from 
their call sites alone. 

This patch fixes this.
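
One hypothetical way to do this -- a sketch under assumed names (`startQueryThread`, `CallSiteSketch`), not necessarily the actual patch's approach -- is to capture the user-code call site when the query is defined and install it on the query's own thread via `SparkContext.setCallSite` before any batches run:

```scala
import org.apache.spark.SparkContext

object CallSiteSketch {
  def startQueryThread(sc: SparkContext, queryName: String, callSite: String): Thread = {
    val t = new Thread(s"stream execution thread for $queryName") {
      override def run(): Unit = {
        sc.setCallSite(callSite)
        // ... process the query's batches; jobs submitted from this thread
        // now carry the given call site instead of "run at <unknown>:0" ...
      }
    }
    t.start()
    t
  }
}
```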

### Before
[Jobs Tab]

![s1a](https://cloud.githubusercontent.com/assets/15843379/14766101/a47246b2-0a30-11e6-8d81-06a9a600113b.png)
[SQL Tab]

![s1b](https://cloud.githubusercontent.com/assets/15843379/14766102/a4750226-0a30-11e6-9ada-773d977d902b.png)
### After
[Jobs Tab]

![s2a](https://cloud.githubusercontent.com/assets/15843379/14766104/a89705b6-0a30-11e6-9830-0d40ec68527b.png)
[SQL Tab]

![s2b](https://cloud.githubusercontent.com/assets/15843379/14766103/a8966728-0a30-11e6-8e4d-c2e326400478.png)


## How was this patch tested?

Manual checks - see screenshots above.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark fix-call-site

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12650.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12650


commit db15de5661ba9f5dbf94caedd02f5fdd90a8fe61
Author: Liwei Lin 
Date:   2016-04-24T07:05:15Z

fix call site for continuous queries

commit 44d93c780cbeb97937f211b50c79bd0bf8291909
Author: Liwei Lin 
Date:   2016-04-24T07:10:19Z

fix style







[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...

2016-04-23 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12638#issuecomment-213870571
  
@marmbrus @tdas would you mind taking a look? Thanks! :-)





[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...

2016-04-23 Thread lw-lin
GitHub user lw-lin reopened a pull request:

https://github.com/apache/spark/pull/12638

[SPARK-14874][SQL][Streaming] Remove the obsolete Batch representation

## What changes were proposed in this pull request?

The `Batch` class, which had been used to indicate progress in a stream, 
was abandoned by [[SPARK-13985][SQL] Deterministic batches with 
ids](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b)
 and then became useless.

This patch:
- removes the `Batch` class
- renames `getBatch(...)` to `getData(...)` for `Source`: 
 - before 
[SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b),
 it was: get**_NextBatch_**(start: Option[Offset]): **_Option[Batch]_**
 - after  
[SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b),
 it became: get**_Batch_**(start: Option[Offset], end: Offset): **_DataFrame_**
 - proposed in this patch: get**_Data_**(start: Option[Offset], end: 
Offset): DataFrame
- renames `addBatch(...)` to `addData(...)` for `Sink`:
 - before 
[SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b),
 it was: addBatch(**_batch: Batch_**)
 - after  
[SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b),
 it became: addBatch(batchId: Long, **_data: DataFrame_**)
 - proposed in this patch: add**_Data_**(batchId: Long, data: DataFrame)

The renaming of public methods should be OK since they have not been in any 
release yet.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark remove-batch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12638.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12638


commit c79cba9059b7ac2d6398c81b57ceece50b6b7526
Author: Liwei Lin 
Date:   2016-04-23T10:15:51Z

remove the useless Batch class







[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Cleanup the usel...

2016-04-23 Thread lw-lin
Github user lw-lin closed the pull request at:

https://github.com/apache/spark/pull/12638





[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Cleanup the usel...

2016-04-23 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/12638

[SPARK-14874][SQL][Streaming] Cleanup the useless Batch class

## What changes were proposed in this pull request?

The `Batch` class, which had been used to indicate progress in a stream, 
was abandoned by [[SPARK-13985][SQL] Deterministic batches with 
ids](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b)
 and then became useless.

This patch:
- removes the `Batch` class
- renames `getBatch(...)` to `getData(...)` for `Source`: 
 - before 
[SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b),
 it was: get**_NextBatch_**(start: Option[Offset]): **_Option[Batch]_**
 - after  
[SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b),
 it became: get**_Batch_**(start: Option[Offset], end: Offset): **_DataFrame_**
 - proposed in this patch: get**_Data_**(start: Option[Offset], end: 
Offset): DataFrame
- renames `addBatch(...)` to `addData(...)` for `Sink`:
 - before 
[SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b),
 it was: addBatch(**_batch: Batch_**)
 - after  
[SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b),
 it became: addBatch(batchId: Long, **_data: DataFrame_**)
 - proposed in this patch: add**_Data_**(batchId: Long, data: DataFrame)

## How was this patch tested?

The changes should be covered by existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark remove-batch

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12638.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12638


commit c79cba9059b7ac2d6398c81b57ceece50b6b7526
Author: Liwei Lin 
Date:   2016-04-23T10:15:51Z

remove the useless Batch class







[GitHub] spark pull request: [Spark-14687][Core][SQL][MLlib] Call path.getF...

2016-04-20 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12450#issuecomment-212405238
  
@srowen thank you for the review & merging :-)





[GitHub] spark pull request: [SPARK-14747][SQL] Add assertStreaming/assertN...

2016-04-20 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12521#issuecomment-212303861
  
@rxin @marmbrus would you mind taking a look when you have time? Thanks! :-)

And I'm not sure whether we should disallow calling methods like `parquet()` 
and `text()` on continuous queries.





[GitHub] spark pull request: [SPARK-14747][SQL] Add assertStreaming/assertN...

2016-04-20 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/12521

[SPARK-14747][SQL] Add assertStreaming/assertNoneStreaming checks in 
DataFrameWriter

## Problem

If an end user happens to mix continuous-query-oriented methods with 
non-continuous-query-oriented methods:

```scala
ctx.read
   .format("text")
   .stream("...")  // continuous query
   .write
   .text("...")// non-continuous query; should be startStream() here
```

He/she would get this somewhat confusing exception:

> Exception in thread "main" java.lang.AssertionError: assertion failed: No plan for FileSource[./continuous_query_test_input]
>   at scala.Predef$.assert(Predef.scala:170)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at ...

## What changes were proposed in this pull request?

This PR adds checks for continuous-query-oriented methods and 
non-continuous-query-oriented methods in `DataFrameWriter`:




| method | continuous query | non-continuous query |
| --- | :---: | :---: |
| mode | | √ |
| trigger | √ | |
| format | √ | √ |
| option/options | √ | √ |
| partitionBy | √ | √ |
| bucketBy | | √ |
| sortBy | | √ |
| save | | √ |
| queryName | √ | |
| startStream | √ | |
| insertInto | | √ |
| saveAsTable | | √ |
| jdbc | | √ |
| json | | √ |
| parquet | | √ |
| orc | | √ |
| text | | √ |
| csv | | √ |


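A minimal sketch of how such checks can work (hypothetical helper names mirroring the PR title; the real code would live inside `DataFrameWriter` and raise an `AnalysisException`, while `require()` keeps this sketch dependency-free):

```scala
class WriterChecksSketch(isContinuous: Boolean) {
  // Fails fast with a friendly message instead of a confusing planner assertion.
  private def assertStreaming(method: String): Unit =
    require(isContinuous, s"$method() can only be called on continuous queries")

  private def assertNotStreaming(method: String): Unit =
    require(!isContinuous, s"$method() can only be called on non-continuous queries")

  def trigger(): this.type = { assertStreaming("trigger"); this }
  def bucketBy(): this.type = { assertNotStreaming("bucketBy"); this }
}
```
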
## How was this patch tested?

Dedicated unit tests were added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark dataframe-writer-check

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12521.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12521


commit 449b7bd204970311b26638e1b8bc0de3d4faaa8c
Author: Liwei Lin 
Date:   2016-04-20T05:58:05Z

add checks for continuous-queries-oriented methods / 
non-continuous-queries-oriented methods







[GitHub] spark pull request: [SPARK-14701][Streaming] First stop the event ...

2016-04-18 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12489#issuecomment-211724261
  
@zsxwing would you mind taking a look? Thanks!





[GitHub] spark pull request: [SPARK-14701][Streaming] First stop the event ...

2016-04-18 Thread lw-lin
GitHub user lw-lin opened a pull request:

https://github.com/apache/spark/pull/12489

[SPARK-14701][Streaming] First stop the event loop, then stop the 
checkpoint writer in JobGenerator

## What changes were proposed in this pull request?

The stopping order of the `event loop` and the `checkpoint writer` was reversed; this patch stops the event loop first, and then stops the checkpoint writer.
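
A minimal sketch of the intended ordering, assuming simplified `Stoppable` components:

```scala
trait Stoppable { def stop(): Unit }

class JobGeneratorSketch(eventLoop: Stoppable, checkpointWriter: Stoppable) {
  def stop(): Unit = {
    eventLoop.stop()        // first: no more checkpointing events can be dispatched
    checkpointWriter.stop() // then: safe to shut down the writer those events used
  }
}
```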

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lw-lin/spark spark-14701

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12489.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12489


commit 48453fda59c1392a51d987962aa4eeda20fefbbc
Author: Liwei Lin 
Date:   2016-04-19T03:10:00Z

First stop the event loop, then stop the checkpoint writer in JobGenerator







[GitHub] spark pull request: [Spark-14687][Core][SQL][MLlib] Call path.getF...

2016-04-18 Thread lw-lin
Github user lw-lin commented on the pull request:

https://github.com/apache/spark/pull/12450#issuecomment-211679362
  
Jenkins retest this please




