[GitHub] spark issue #13595: [MINOR][SQL] Standardize 'continuous queries' to 'stream...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/13595 Thanks! This patch introduced an compilation error because `DataFrameReader.text`'s return type had been changed back to `DataFrame` very recently, and I should have noticed this and updated this patch accordingly. Sorry for the inconvenience. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13606: [SPARK-15086] [CORE] [STREAMING] Deprecate old Java accu...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/13606 @srowen , the [[streaming programming guide] - accumulators-and-broadcast-variables](https://github.com/apache/spark/blob/1e2c9311871968426e019164b129652fd6d0037f/docs/streaming-programming-guide.md#accumulators-and-broadcast-variables) section might also need an update to reflect the code change here, thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13595: [MINOR][SQL] Standardize 'continuous queries' to 'stream...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/13595 @zsxwing @tdas, sure, this can wait. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13597: [SPARK-15871][SQL] Add `assertNotPartitioned` check in `...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/13597 @marmbrus @cloud-fan @zsxwing , would you mind taking a look? Thanks! :-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13595: [MINOR][SQL] Standardize 'continuous queries' to 'stream...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/13595 @tdas @zsxwing , would you mind taking a look? Thanks! :-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13597: [SPARK-15871][SQL] Add `assertNotPartitioned` che...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/13597#discussion_r66596861 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -572,8 +573,13 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { private def assertNotBucketed(operation: String): Unit = { if (numBuckets.isDefined || sortColumnNames.isDefined) { - throw new IllegalArgumentException( --- End diff -- @cloud-fan, would you mind if this patch changes `IllegalArgumentException` here to `AnalysisException`? Since in most places in this class we throw `AnalysisException`, :-) This change should be OK as this has not been actually in any non-preview releases. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13597: [SPARK-15871][SQL] Add `assertNotPartitioned` che...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/13597 [SPARK-15871][SQL] Add `assertNotPartitioned` check in `DataFrameWriter` ## What changes were proposed in this pull request? Sometimes it doesn't make sense to specify partitioning parameters, e.g. when we write data out from Datasets/DataFrames into `jdbc` tables or streaming `ForeachWriter`s. This patch adds `assertNotPartitioned` check in `DataFrameWriter`. ## How was this patch tested? New dedicated tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark add-assertNotPartitioned Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13597.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13597 commit 83b2230def85beaac685f77ba666117089bf955f Author: Liwei Lin Date: 2016-06-10T10:50:56Z Add assertNotPartitioned check --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13595: [MINOR][SQL] Standardize 'continuous queries' to ...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/13595#discussion_r66589007 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/test/DataFrameReaderWriterSuite.scala --- @@ -371,66 +371,80 @@ class DataFrameReaderWriterSuite extends StreamTest with BeforeAndAfter { private def newTextInput = Utils.createTempDir(namePrefix = "text").getCanonicalPath - test("check trigger() can only be called on continuous queries") { + test("check trigger() can only be called on streaming Datasets/DataFrames") { val df = spark.read.text(newTextInput) val w = df.write.option("checkpointLocation", newMetadataDir) val e = intercept[AnalysisException](w.trigger(ProcessingTime("10 seconds"))) -assert(e.getMessage == "trigger() can only be called on continuous queries;") +assert(e.getMessage == "trigger() can only be called on streaming Datasets/DataFrames;") } - test("check queryName() can only be called on continuous queries") { + test("check queryName() can only be called on streaming Datasets/DataFrames") { val df = spark.read.text(newTextInput) val w = df.write.option("checkpointLocation", newMetadataDir) val e = intercept[AnalysisException](w.queryName("queryName")) -assert(e.getMessage == "queryName() can only be called on continuous queries;") +assert(e.getMessage == "queryName() can only be called on streaming Datasets/DataFrames;") } - test("check startStream() can only be called on continuous queries") { + test("check startStream() can only be called on streaming Datasets/DataFrames") { val df = spark.read.text(newTextInput) val w = df.write.option("checkpointLocation", newMetadataDir) val e = intercept[AnalysisException](w.startStream()) -assert(e.getMessage == "startStream() can only be called on continuous queries;") +assert(e.getMessage == "startStream() can only be called on streaming Datasets/DataFrames;") } - test("check startStream(path) can only be called on continuous queries") { + test("check startStream(path) can only be called on streaming Datasets/DataFrames") { val df = spark.read.text(newTextInput) val w = df.write.option("checkpointLocation", newMetadataDir) val e = intercept[AnalysisException](w.startStream("non_exist_path")) -assert(e.getMessage == "startStream() can only be called on continuous queries;") +assert(e.getMessage == "startStream() can only be called on streaming Datasets/DataFrames;") } - test("check mode(SaveMode) can only be called on non-continuous queries") { + test("check foreach() can only be called on streaming Datasets/DataFrames") { +val df = spark.read.text(newTextInput) +val w = df.write.option("checkpointLocation", newMetadataDir) +val foreachWriter = new ForeachWriter[String] { + override def open(partitionId: Long, version: Long): Boolean = false + override def process(value: String): Unit = {} + override def close(errorOrNull: Throwable): Unit = {} +} +val e = intercept[AnalysisException](w.foreach(foreachWriter)) +Seq("foreach()", "streaming Datasets/DataFrames").foreach { s => + assert(e.getMessage.toLowerCase.contains(s.toLowerCase)) +} + } + --- End diff -- here we add a new test, checking foreach() can only be called on streaming Datasets/DataFrames --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13595: [MINOR][SQL] Standardize 'continuous queries' to ...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/13595#discussion_r66588898 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala --- @@ -433,8 +433,7 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { @Experimental def foreach(writer: ForeachWriter[T]): ContinuousQuery = { assertNotBucketed("foreach") -assertStreaming( - "foreach() can only be called on streaming Datasets/DataFrames.") +assertStreaming("foreach() can only be called on streaming Datasets/DataFrames.") --- End diff -- these two lines are merged --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13595: [MINOR][SQL] Standardize 'continuous queries' to ...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/13595 [MINOR][SQL] Standardize 'continuous queries' to 'streaming Datasets/DataFrames' ## What changes were proposed in this pull request? This patch does some replacing (since `streaming Datasets/DataFrames` is the term we've chosen in [SPARK-15593](https://github.com/apache/spark/commit/00c310133df4f3893dd90d801168c2ab9841b102)): - `continuous queries` -> `streaming Datasets/DataFrames` - `non-continuous queries` -> `non-streaming Datasets/DataFrames` This patch also adds `test("check foreach() can only be called on streaming Datasets/DataFrames")`. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark continuous-queries-to-streaming-dss-dfs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13595.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13595 commit 4780124b6541e663afe5bc84cf5785ffd6ec2874 Author: Liwei Lin Date: 2016-06-10T09:42:35Z continuous/non-continuous queries -> streaming/non-streaming Datasets/DataFrames commit 9322954380b4dc6bfebb9dd4156a4e8c04388107 Author: Liwei Lin Date: 2016-06-10T09:42:56Z Add test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13575: [SPARK-15472][SQL] Add support for writing in `cs...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/13575#discussion_r66467191 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala --- @@ -120,24 +109,31 @@ class TextFileFormat extends FileFormat with DataSourceRegister { } } } + + override def buildWriter( + sqlContext: SQLContext, + dataSchema: StructType, + options: Map[String, String]): OutputWriterFactory = { +verifySchema(dataSchema) +new StreamingTextOutputWriterFactory( + sqlContext.conf, + dataSchema, + sqlContext.sparkContext.hadoopConfiguration, + options) + } } -class TextOutputWriter(path: String, dataSchema: StructType, context: TaskAttemptContext) +/** + * Base TextOutputWriter class for 'batch' TextOutputWriter and 'streaming' TextOutputWriter. The + * writing logic to a single file resides in this base class. + */ +private[text] abstract class TextOutputWriterBase(context: TaskAttemptContext) --- End diff -- This `TextOutputWriterBase` is basically the original `TextOutputWriter`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13575: [SPARK-15472][SQL] Add support for writing in `cs...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/13575#discussion_r66467095 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -488,7 +488,12 @@ private[sql] class ParquetOutputWriterFactory( // Custom ParquetOutputFormat that disable use of committer and writes to the given path val outputFormat = new ParquetOutputFormat[InternalRow]() { override def getOutputCommitter(c: TaskAttemptContext): OutputCommitter = { null } - override def getDefaultWorkFile(c: TaskAttemptContext, ext: String): Path = { new Path(path) } + override def getDefaultWorkFile(c: TaskAttemptContext, ext: String): Path = { +// It has the `.parquet` extension at the end because (de)compression tools +// such as gunzip would not be able to decompress this as the compression +// is not applied on this whole file but on each "page" in Parquet format. +new Path(s"$path$ext") + } --- End diff -- This patch appends an extension to the assigned `path`; new `path` would be like `some_path.gz.parquet`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13575: [SPARK-15472][SQL] Add support for writing in `cs...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/13575#discussion_r66466672 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala --- @@ -146,16 +173,53 @@ class JsonFileFormat extends FileFormat with DataSourceRegister { } } - override def toString: String = "JSON" - override def hashCode(): Int = getClass.hashCode() override def equals(other: Any): Boolean = other.isInstanceOf[JsonFileFormat] } -private[json] class JsonOutputWriter( -path: String, -bucketId: Option[Int], +/** + * A factory for generating [[OutputWriter]]s for writing json files. This is implemented different + * from the 'batch' JsonOutputWriter as this does not use any [[OutputCommitter]]. It simply + * writes the data to the path used to generate the output writer. Callers of this factory + * has to ensure which files are to be considered as committed. + */ +private[json] class StreamingJsonOutputWriterFactory( +sqlConf: SQLConf, +dataSchema: StructType, +hadoopConf: Configuration, +options: Map[String, String]) extends StreamingOutputWriterFactory { + + private val serializableConf = { +val conf = Job.getInstance(hadoopConf).getConfiguration +JsonFileFormat.prepareConfForWriting(conf, options) +new SerializableConfiguration(conf) + } + + /** + * Returns a [[OutputWriter]] that writes data to the give path without using an + * [[OutputCommitter]]. + */ + override private[sql] def newWriter(path: String): OutputWriter = { +val hadoopTaskAttempId = new TaskAttemptID(new TaskID(new JobID, TaskType.MAP, 0), 0) +val hadoopAttemptContext = + new TaskAttemptContextImpl(serializableConf.value, hadoopTaskAttempId) +// Returns a 'streaming' JsonOutputWriter +new JsonOutputWriterBase(dataSchema, hadoopAttemptContext) { + override private[json] val recordWriter: RecordWriter[NullWritable, Text] = +createNoCommitterTextRecordWriter( + path, + hadoopAttemptContext, + (c: TaskAttemptContext, ext: String) => { new Path(s"$path.json$ext") }) +} + } +} + +/** + * Base JsonOutputWriter class for 'batch' JsonOutputWriter and 'streaming' JsonOutputWriter. The + * writing logic to a single file resides in this base class. + */ +private[json] abstract class JsonOutputWriterBase( --- End diff -- This `JsonOutputWriterBase` is basically the original `JsonOutputWriter` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13575: [SPARK-15472][SQL] Add support for writing in `cs...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/13575#discussion_r66466502 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala --- @@ -143,39 +146,99 @@ object CSVRelation extends Logging { if (nonEmptyLines.hasNext) nonEmptyLines.drop(1) } } + + /** + * Setup writing configurations into the given [[Configuration]], and then return the + * wrapped [[CSVOptions]]. + * Both continuous-queries writing process and non-continuous-queries writing process will + * call this function. + */ + private[csv] def prepareConfForWriting( + conf: Configuration, + options: Map[String, String]): CSVOptions = { +val csvOptions = new CSVOptions(options) +csvOptions.compressionCodec.foreach { codec => + CompressionCodecs.setCodecConfiguration(conf, codec) +} +csvOptions + } } -private[sql] class CSVOutputWriterFactory(params: CSVOptions) extends OutputWriterFactory { +/** + * A factory for generating OutputWriters for writing csv files. This is implemented different + * from the 'batch' CSVOutputWriter as this does not use any [[OutputCommitter]]. It simply + * writes the data to the path used to generate the output writer. Callers of this factory + * has to ensure which files are to be considered as committed. + */ +private[csv] class StreamingCSVOutputWriterFactory( + sqlConf: SQLConf, + dataSchema: StructType, + hadoopConf: Configuration, + options: Map[String, String]) extends StreamingOutputWriterFactory { + + private val (csvOptions: CSVOptions, serializableConf: SerializableConfiguration) = { +val conf = Job.getInstance(hadoopConf).getConfiguration +val csvOptions = CSVRelation.prepareConfForWriting(conf, options) +(csvOptions, new SerializableConfiguration(conf)) + } + + /** + * Returns a [[OutputWriter]] that writes data to the give path without using an + * [[OutputCommitter]]. + */ + override private[sql] def newWriter(path: String): OutputWriter = { +val hadoopTaskAttempId = new TaskAttemptID(new TaskID(new JobID, TaskType.MAP, 0), 0) +val hadoopAttemptContext = + new TaskAttemptContextImpl(serializableConf.value, hadoopTaskAttempId) +// Returns a 'streaming' CSVOutputWriter +new CSVOutputWriterBase(dataSchema, hadoopAttemptContext, csvOptions) { + override private[csv] val recordWriter: RecordWriter[NullWritable, Text] = +createNoCommitterTextRecordWriter( + path, + hadoopAttemptContext, + (c: TaskAttemptContext, ext: String) => { new Path(s"$path.csv$ext") }) +} + } +} + +private[csv] class BatchCSVOutputWriterFactory(params: CSVOptions) extends OutputWriterFactory { override def newInstance( path: String, bucketId: Option[Int], dataSchema: StructType, context: TaskAttemptContext): OutputWriter = { if (bucketId.isDefined) sys.error("csv doesn't support bucketing") -new CsvOutputWriter(path, dataSchema, context, params) +// Returns a 'batch' CSVOutputWriter +new CSVOutputWriterBase(dataSchema, context, params) { + private[csv] override val recordWriter: RecordWriter[NullWritable, Text] = { +new TextOutputFormat[NullWritable, Text]() { + override def getDefaultWorkFile(context: TaskAttemptContext, extension: String): Path = { +val conf = context.getConfiguration +val uniqueWriteJobId = conf.get(CreateDataSourceTableUtils.DATASOURCE_WRITEJOBUUID) +val taskAttemptId = context.getTaskAttemptID +val split = taskAttemptId.getTaskID.getId +new Path(path, f"part-r-$split%05d-$uniqueWriteJobId.csv$extension") + } +}.getRecordWriter(context) + } +} } } -private[sql] class CsvOutputWriter( -path: String, +/** + * Base CSVOutputWriter class for 'batch' CSVOutputWriter and 'streaming' CSVOutputWriter. The + * writing logic to a single file resides in this base class. + */ +private[csv] abstract class CSVOutputWriterBase( --- End diff -- This `CSVOutputWriterBase` is basically the original `CsvOutputWriter`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. -
[GitHub] spark pull request #13575: [SPARK-15472][SQL] Add support for writing in `cs...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/13575#discussion_r66466310 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala --- @@ -143,39 +146,99 @@ object CSVRelation extends Logging { if (nonEmptyLines.hasNext) nonEmptyLines.drop(1) } } + + /** + * Setup writing configurations into the given [[Configuration]], and then return the + * wrapped [[CSVOptions]]. + * Both continuous-queries writing process and non-continuous-queries writing process will + * call this function. + */ + private[csv] def prepareConfForWriting( + conf: Configuration, + options: Map[String, String]): CSVOptions = { +val csvOptions = new CSVOptions(options) +csvOptions.compressionCodec.foreach { codec => + CompressionCodecs.setCodecConfiguration(conf, codec) +} +csvOptions + } --- End diff -- These mostly are moved from `CSVFileFormat.prepareWrite()` to here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13575: [SPARK-15472][SQL] Add support for writing in `cs...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/13575#discussion_r66465201 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -246,7 +247,12 @@ case class DataSource( case s: StreamSinkProvider => s.createSink(sparkSession.sqlContext, options, partitionColumns, outputMode) - case parquet: parquet.ParquetFileFormat => + // TODO: Remove the `isInstanceOf` check when other formats have been ported + case fileFormat: FileFormat +if (fileFormat.isInstanceOf[CSVFileFormat] + || fileFormat.isInstanceOf[JsonFileFormat] --- End diff -- @ScrapCodes , thanks! But I'm afraid that syntax would raise a compilation error: ``` [ERROR] .../datasources/DataSource.scala:250: illegal variable in pattern alternative [ERROR] case fileFormat: CSVFileFormat | JsonFileFormat | ParquetFileFormat | TextFileFormat => [ERROR]^ ``` A work-around can be the following, but I found it somewhat less intuitive: ```scala case fileFormat@(_: CSVFileFormat | _: JsonFileFormat | _: ParquetFileFormat | _: TextFileFormat) => // other code ... fileFormat.asInstanceOf[FileFormat] ... ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13575: [SPARK-15472][SQL] Add support for writing in `csv`, `js...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/13575 @marmbrus @tdas @zsxwing , would you mind taking a look? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13575: [SPARK-15472][SQL] Add support for writing in `cs...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/13575 [SPARK-15472][SQL] Add support for writing in `csv`, `json`, `text` formats in Structured Streaming ## What changes were proposed in this pull request? This patch adds support for writing in `csv`, `json`, `text` formats in Structured Streaming: **1. at a high level, this patch forms the following hierarchy**(`text` as an example): ``` â TextOutputWriterBase â â BatchTextOutputWriter StreamingTextOutputWriter ``` ``` â â BatchTextOutputWriterFactory StreamingOutputWriterFactory â StreamingTextOutputWriterFactory ``` The `StreamingTextOutputWriter` and other 'streaming' output writers would write data **without** using an `OutputCommitter`. This was the same approach taken by [SPARK-14716](https://github.com/apache/spark/pull/12409). **2. to support compression, this patch attaches an extension to the path assigned by `FileStreamSink`**, which is slightly different from [SPARK-14716](https://github.com/apache/spark/pull/12409). For example, if we are writing out using the `gzip` compression and `FileStreamSink` assigns path `${uuid}` to a text writer, then in the end the file written out will be `${uuid}.txt.gz` -- so that when we read the file back, we'll correctly interpret it as `gzip` compressed. ## How was this patch tested? `FileStreamSinkSuite` is expanded much more to cover the added `csv`, `json`, `text` formats: ```scala test(" csv - unpartitioned data - codecs: none/gzip") test("json - unpartitioned data - codecs: none/gzip") test("text - unpartitioned data - codecs: none/gzip") test(" csv - partitioned data - codecs: none/gzip") test("json - partitioned data - codecs: none/gzip") test("text - partitioned data - codecs: none/gzip") test(" csv - unpartitioned writing and batch reading - codecs: none/gzip") test("json - unpartitioned writing and batch reading - codecs: none/gzip") test("text - unpartitioned writing and batch reading - codecs: none/gzip") test(" csv - partitioned writing and batch reading - codecs: none/gzip") test("json - partitioned writing and batch reading - codecs: none/gzip") test("text - partitioned writing and batch reading - codecs: none/gzip") ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark add-csv-json-text-in-ss Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13575.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13575 commit c70083e9f76c20f6bf48e7ec821452f9bf63783a Author: Liwei Lin Date: 2016-06-05T09:03:04Z Add csv, json, text commit bc28f4112ca9eca6a9f1602a891dd0388fa3185c Author: Liwei Lin Date: 2016-06-09T03:31:59Z Fix parquet extension --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13518: [WIP][SPARK-15472][SQL] Add support for writing i...
Github user lw-lin closed the pull request at: https://github.com/apache/spark/pull/13518 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13518: [WIP][SPARK-15472][SQL] Add support for writing in `csv`...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/13518 Jenkins retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13518: [WIP][SPARK-15472][SQL] Add support for writing in `csv`...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/13518 Jenkins retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13518: [WIP][SPARK-15472][SQL] Add support for writing i...
GitHub user lw-lin reopened a pull request: https://github.com/apache/spark/pull/13518 [WIP][SPARK-15472][SQL] Add support for writing in `csv`, `json`, `text` formats in Structured Streaming ## What changes were proposed in this pull request? This patch adds support for writing in `csv`, `json`, `text` formats in Structured Streaming: **1. at a high level, this patch forms the following hierarchy**(`text` as an example): ``` â TextOutputWriterBase â â BatchTextOutputWriter StreamingTextOutputWriter ``` ``` â â BatchTextOutputWriterFactory StreamingOutputWriterFactory â StreamingTextOutputWriterFactory ``` The `StreamingTextOutputWriter` and other 'streaming' output writers would write data **without** using an `OutputCommitter`. This was the same approach taken by [SPARK-14716](https://github.com/apache/spark/pull/12409). **2. to support compression, this patch attaches an extension to the path assigned by `FileStreamSink`**, which is slightly different from [SPARK-14716](https://github.com/apache/spark/pull/12409). For example, if we are writing out using the `gzip` compression and `FileStreamSink` assigns path `${uuid}` to a text writer, then in the end the file written out will be `${uuid}.txt.gz` -- so that when we read the file back, we'll correctly interpret it as `gzip` compressed. ## How was this patch tested? `FileStreamSinkSuite` is expanded much more to cover the added `csv`, `json`, `text` formats: ```scala test(" csv - unpartitioned data - codecs: none/gzip") test("json - unpartitioned data - codecs: none/gzip") test("text - unpartitioned data - codecs: none/gzip") test(" csv - partitioned data - codecs: none/gzip") test("json - partitioned data - codecs: none/gzip") test("text - partitioned data - codecs: none/gzip") test(" csv - unpartitioned writing and batch reading - codecs: none/gzip") test("json - unpartitioned writing and batch reading - codecs: none/gzip") test("text - unpartitioned writing and batch reading - codecs: none/gzip") test(" csv - partitioned writing and batch reading - codecs: none/gzip") test("json - partitioned writing and batch reading - codecs: none/gzip") test("text - partitioned writing and batch reading - codecs: none/gzip") ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark add-csv-json-text-for-ss Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13518.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13518 commit 97034f9aeb092b10e1606e60a8e6b4878ebd54cf Author: Liwei Lin Date: 2016-06-05T09:03:04Z Add csv, json, text commit 2035b597b44aa519d8da3b155036446f88b3050e Author: Liwei Lin Date: 2016-06-05T09:03:15Z Fix parquet extension commit 4737361489fd680405b291ec498ab91374685ffe Author: Liwei Lin Date: 2016-06-05T11:52:14Z Fix style commit 90d02c4a10c14af83bbed985e36ef99a1edaa48b Author: Liwei Lin Date: 2016-06-06T08:02:32Z Fix tests commit daec480bd16ed52137d32f332debe3806953f4d2 Author: Liwei Lin Date: 2016-06-06T09:03:08Z Revert "Fix tests" This reverts commit 90d02c4a10c14af83bbed985e36ef99a1edaa48b. commit 43b68d426e9b64061095eca7a1db0e762843adef Author: Liwei Lin Date: 2016-06-06T09:09:10Z Fix tests commit 56dbb9b4f0f7e2bf76935e0d1d2fc6c6cdf141ff Author: Liwei Lin Date: 2016-06-06T12:34:07Z Investigate test commit 91e51aed5caf663d5068057e5cf28f21eb768310 Author: Liwei Lin Date: 2016-06-06T12:43:39Z Investigate test commit eb2090ce9fa04efbd23370f7d3e6cb98fd0b4c74 Author: Liwei Lin Date: 2016-06-06T13:00:26Z Update run-tests commit 1e9af1cef892706de8b07728c192dd8ca5e5851e Author: Liwei Lin Date: 2016-06-07T04:10:34Z Investigate test commit f76b1d9fe5e4970ad66dfb6d4b2fed939b68728c Author: Liwei Lin Date: 2016-06-07T04:49:12Z Fix tests commit 7fca579d9fec2589f13403b0eb0f2d5f5e6bd52a Author: Liwei Lin Date: 2016-06-07T04:50:24Z Fix tests commit 2ead307d01d8f908951fbc059b8e49bbc77947b1 Author: Liwei Lin Date: 2016-06-07T05:02:55Z Fix tests commit b64afc64d3121479eb5c3f8c8b5663b6e05349b7 Author: Liwei Lin Date: 2016-06-07T05:35:34Z Fix tests commit 2c52c2c6ac961d0b71c2dd3802cd53d51ba1926a Author: Liwei Lin Date: 2016-06-07
[GitHub] spark pull request #13518: [WIP][SPARK-15472][SQL] Add support for writing i...
Github user lw-lin closed the pull request at: https://github.com/apache/spark/pull/13518 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13518: [WIP][SPARK-15472][SQL] Add support for writing in `csv`...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/13518 Jenkins retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13518: [WIP][SPARK-15472][SQL] Add support for writing i...
GitHub user lw-lin reopened a pull request: https://github.com/apache/spark/pull/13518 [WIP][SPARK-15472][SQL] Add support for writing in `csv`, `json`, `text` formats in Structured Streaming ## What changes were proposed in this pull request? This patch adds support for writing in `csv`, `json`, `text` formats in Structured Streaming: **1. at a high level, this patch forms the following hierarchy**(`text` as an example): ``` â TextOutputWriterBase â â BatchTextOutputWriter StreamingTextOutputWriter ``` ``` â â BatchTextOutputWriterFactory StreamingOutputWriterFactory â StreamingTextOutputWriterFactory ``` The `StreamingTextOutputWriter` and other 'streaming' output writers would write data **without** using an `OutputCommitter`. This was the same approach taken by [SPARK-14716](https://github.com/apache/spark/pull/12409). **2. to support compression, this patch attaches an extension to the path assigned by `FileStreamSink`**, which is slightly different from [SPARK-14716](https://github.com/apache/spark/pull/12409). For example, if we are writing out using the `gzip` compression and `FileStreamSink` assigns path `${uuid}` to a text writer, then in the end the file written out will be `${uuid}.txt.gz` -- so that when we read the file back, we'll correctly interpret it as `gzip` compressed. ## How was this patch tested? `FileStreamSinkSuite` is expanded much more to cover the added `csv`, `json`, `text` formats: ```scala test(" csv - unpartitioned data - codecs: none/gzip") test("json - unpartitioned data - codecs: none/gzip") test("text - unpartitioned data - codecs: none/gzip") test(" csv - partitioned data - codecs: none/gzip") test("json - partitioned data - codecs: none/gzip") test("text - partitioned data - codecs: none/gzip") test(" csv - unpartitioned writing and batch reading - codecs: none/gzip") test("json - unpartitioned writing and batch reading - codecs: none/gzip") test("text - unpartitioned writing and batch reading - codecs: none/gzip") test(" csv - partitioned writing and batch reading - codecs: none/gzip") test("json - partitioned writing and batch reading - codecs: none/gzip") test("text - partitioned writing and batch reading - codecs: none/gzip") ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark add-csv-json-text-for-ss Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13518.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13518 commit 97034f9aeb092b10e1606e60a8e6b4878ebd54cf Author: Liwei Lin Date: 2016-06-05T09:03:04Z Add csv, json, text commit 2035b597b44aa519d8da3b155036446f88b3050e Author: Liwei Lin Date: 2016-06-05T09:03:15Z Fix parquet extension commit 4737361489fd680405b291ec498ab91374685ffe Author: Liwei Lin Date: 2016-06-05T11:52:14Z Fix style commit 90d02c4a10c14af83bbed985e36ef99a1edaa48b Author: Liwei Lin Date: 2016-06-06T08:02:32Z Fix tests commit daec480bd16ed52137d32f332debe3806953f4d2 Author: Liwei Lin Date: 2016-06-06T09:03:08Z Revert "Fix tests" This reverts commit 90d02c4a10c14af83bbed985e36ef99a1edaa48b. commit 43b68d426e9b64061095eca7a1db0e762843adef Author: Liwei Lin Date: 2016-06-06T09:09:10Z Fix tests commit 56dbb9b4f0f7e2bf76935e0d1d2fc6c6cdf141ff Author: Liwei Lin Date: 2016-06-06T12:34:07Z Investigate test commit 91e51aed5caf663d5068057e5cf28f21eb768310 Author: Liwei Lin Date: 2016-06-06T12:43:39Z Investigate test commit eb2090ce9fa04efbd23370f7d3e6cb98fd0b4c74 Author: Liwei Lin Date: 2016-06-06T13:00:26Z Update run-tests commit 1e9af1cef892706de8b07728c192dd8ca5e5851e Author: Liwei Lin Date: 2016-06-07T04:10:34Z Investigate test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13518: [WIP][SPARK-15472][SQL] Add support for writing i...
Github user lw-lin closed the pull request at: https://github.com/apache/spark/pull/13518 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13518: [SPARK-15472][SQL] Add support for writing in `csv`, `js...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/13518 Jenkins retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13518: [SPARK-15472][SQL] Add support for writing in `cs...
GitHub user lw-lin reopened a pull request: https://github.com/apache/spark/pull/13518 [SPARK-15472][SQL] Add support for writing in `csv`, `json`, `text` formats in Structured Streaming ## What changes were proposed in this pull request? This patch adds support for writing in `csv`, `json`, `text` formats in Structured Streaming: **1. at a high level, this patch forms the following hierarchy**(`text` as an example): ``` â TextOutputWriterBase â â BatchTextOutputWriter StreamingTextOutputWriter ``` ``` â â BatchTextOutputWriterFactory StreamingOutputWriterFactory â StreamingTextOutputWriterFactory ``` The `StreamingTextOutputWriter` and other 'streaming' output writers would write data **without** using an `OutputCommitter`. This was the same approach taken by [SPARK-14716](https://github.com/apache/spark/pull/12409). **2. to support compression, this patch attaches an extension to the path assigned by `FileStreamSink`**, which is slightly different from [SPARK-14716](https://github.com/apache/spark/pull/12409). For example, if we are writing out using the `gzip` compression and `FileStreamSink` assigns path `${uuid}` to a text writer, then in the end the file written out will be `${uuid}.txt.gz` -- so that when we read the file back, we'll correctly interpret it as `gzip` compressed. ## How was this patch tested? `FileStreamSinkSuite` is expanded much more to cover the added `csv`, `json`, `text` formats: ```scala test(" csv - unpartitioned data - codecs: none/gzip") test("json - unpartitioned data - codecs: none/gzip") test("text - unpartitioned data - codecs: none/gzip") test(" csv - partitioned data - codecs: none/gzip") test("json - partitioned data - codecs: none/gzip") test("text - partitioned data - codecs: none/gzip") test(" csv - unpartitioned writing and batch reading - codecs: none/gzip") test("json - unpartitioned writing and batch reading - codecs: none/gzip") test("text - unpartitioned writing and batch reading - codecs: none/gzip") test(" csv - partitioned writing and batch reading - codecs: none/gzip") test("json - partitioned writing and batch reading - codecs: none/gzip") test("text - partitioned writing and batch reading - codecs: none/gzip") ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark add-csv-json-text-for-ss Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13518.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13518 commit 97034f9aeb092b10e1606e60a8e6b4878ebd54cf Author: Liwei Lin Date: 2016-06-05T09:03:04Z Add csv, json, text commit 2035b597b44aa519d8da3b155036446f88b3050e Author: Liwei Lin Date: 2016-06-05T09:03:15Z Fix parquet extension commit 4737361489fd680405b291ec498ab91374685ffe Author: Liwei Lin Date: 2016-06-05T11:52:14Z Fix style --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13518: [SPARK-15472][SQL] Add support for writing in `cs...
Github user lw-lin closed the pull request at: https://github.com/apache/spark/pull/13518 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13258: [SPARK-15472][SQL][Streaming] Add partitioned `cs...
Github user lw-lin closed the pull request at: https://github.com/apache/spark/pull/13258 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13518: [SPARK-15472][SQL] Add support for writing in `cs...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/13518 [SPARK-15472][SQL] Add support for writing in `csv`, `json`, `text` formats in Structured Streaming ## What changes were proposed in this pull request? This patch adds support for writing in `csv`, `json`, `text` formats in Structured Streaming: **1. at a high level, this patch forms the following hierarchy**(`text` as an example): ``` â TextOutputWriterBase â â BatchTextOutputWriter StreamingTextOutputWriter ``` ``` â â BatchTextOutputWriterFactory StreamingOutputWriterFactory â StreamingTextOutputWriterFactory ``` The `StreamingTextOutputWriter` and other 'streaming' output writers would write data **without** using an `OutputCommitter`. This was the same approach taken by [SPARK-14716](https://github.com/apache/spark/pull/12409). **2. to support compression, this patch attaches an extension to the path assigned by `FileStreamSink`.** For example, if we are writing out using the `gzip` compression and `FileStreamSink` assigns path `{$uuid}` to a text writer, then in the end the file written out will be `${uuid}.txt.gz` -- so that when we read the file back, we'll correctly interpret it as `gzip` compressed. ## How was this patch tested? `FileStreamSinkSuite` is much more expanded to cover the added `csv`, `json`, `text` format: ```scala test(" csv - unpartitioned data - codecs: none/gzip") test("json - unpartitioned data - codecs: none/gzip") test("text - unpartitioned data - codecs: none/gzip") test(" csv - partitioned data - codecs: none/gzip") test("json - partitioned data - codecs: none/gzip") test("text - partitioned data - codecs: none/gzip") test(" csv - unpartitioned writing and batch reading - codecs: none/gzip") test("json - unpartitioned writing and batch reading - codecs: none/gzip") test("text - unpartitioned writing and batch reading - codecs: none/gzip") test(" csv - partitioned writing and batch reading - codecs: none/gzip") test("json - partitioned writing and batch reading - codecs: none/gzip") test("text - partitioned writing and batch reading - codecs: none/gzip") ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark add-csv-json-text-for-ss Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13518.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13518 commit 97034f9aeb092b10e1606e60a8e6b4878ebd54cf Author: Liwei Lin Date: 2016-06-05T09:03:04Z Add csv, json, text commit 2035b597b44aa519d8da3b155036446f88b3050e Author: Liwei Lin Date: 2016-06-05T09:03:15Z Fix parquet extension --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13515: [MINOR] Fix Typos 'an -> a'
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/13515#discussion_r65812444 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -366,7 +366,7 @@ object SparkSubmit { } } -// In YARN mode for an R app, add the SparkR package archive and the R package +// In YARN mode for a R app, add the SparkR package archive and the R package --- End diff -- maybe we don't want to change modify this one --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13507: [SPARK-15765][SQL][Streaming] Make continuous Parquet wr...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/13507 @liancheng @tdas @zsxwing would you mind taking a look? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13507: [SPARK-15765][SQL][Streaming] Make continuous Par...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/13507 [SPARK-15765][SQL][Streaming] Make continuous Parquet writing consistent with non-consistent Parquet writing ## What changes were proposed in this pull request? Currently there are some code duplicates in continuous Parquet writing (as in Structured Streaming) and non-continuous batch writing; see [ParquetFileFormat#prepareWrite()](https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L68) and [ParquetFileFormat#ParquetOutputWriterFactory](https://github.com/apache/spark/blob/431542765785304edb76a19885fbc5f9b8ae7d64/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L414). This may lead to inconsistent behavior, when we only change one piece of code but not the other. By extracting the common code out, this patch fixes the inconsistency. As a result, Structured Streaming now also enjoys [SPARK-15719](https://github.com/apache/spark/pull/13455). ## How was this patch tested? Just code refactoring without any logic change, this should be covered by existing suits. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark parquet-conf-deduplicate Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13507.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13507 commit 60a2c8ee7c610a783e65b78ac21e25661b84f49d Author: Liwei Lin Date: 2016-06-03T14:31:56Z Make continuous writing consistent with non-consistent writing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13258: [SPARK-15472][SQL][Streaming] Add partitioned `csv`, `js...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/13258 @zsxwing , yeah let me update this in the next one day or so - I was waiting for https://github.com/apache/spark/pull/13431. Thanks for the reminder! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #12981: [WIP][SPARK-15208][Core][Streaming][Docs] Update Spark e...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/12981 @srowen, there are no other examples that need an update :-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-15208][Core][Streaming][Docs] Update Spark e...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12981 @rxin sure, I'll resolve the conflicts very soon; and once the Java APIs are updated in the next couple of days, I'll update the Java examples accordingly. Thank you for bringing this up! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15472][SQL][Streaming] Add partitioned ...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/13258#issuecomment-222035774 @zsxwing sure let's add an abstract layer. I'll rebase and do this in the next two days or so. Thanks for the review! :-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15472][SQL][Streaming] Add partitioned ...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/13258#issuecomment-220917708 @marmbrus @tdas @zsxwing would you mind taking a look? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15472][SQL][Streaming] Add partitioned ...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/13258 [SPARK-15472][SQL][Streaming] Add partitioned `csv`, `json`, `text` format support for FileStreamSink ## What changes were proposed in this pull request? This patch adds partitioned `csv`, `json`, `text` format support for `FileStreamSink`. For each format, this patch adds a dedicated `OutputWriterFactory`, which would return a dedicated `OutputWriter` writing data without using an `OutputCommitter`. This is the same approach taken by [SPARK-14716](https://github.com/apache/spark/pull/12409) in which we had added partitioned `parquet` format support. ## How was this patch tested? `FileStreamSinkSuite` is expanded to cover the added `csv`, `json`, `text` format: ```scala test(" csv - unpartitioned data") test("json - unpartitioned data") test("text - unpartitioned data") test(" csv - partitioned data") test("json - partitioned data") test("text - partitioned data") test(" csv - unpartitioned writing and batch reading") test("json - unpartitioned writing and batch reading") test("text - unpartitioned writing and batch reading") test(" csv - partitioned writing and batch reading") test("json - partitioned writing and batch reading") test("text - partitioned writing and batch reading") ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark add-cvs-json-text Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13258.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13258 commit f3bac77b307cc56aa6c468ef75ab255b5ad85459 Author: Liwei Lin Date: 2016-05-22T08:54:11Z Add csv, json, text --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15472][SQL][Streaming] Add support for ...
Github user lw-lin closed the pull request at: https://github.com/apache/spark/pull/13251 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15472][SQL][Streaming] Add support for ...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/13251#issuecomment-220833527 This is still WIP; will reopen soon. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15472][SQL][Streaming] Add support for ...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/13251#issuecomment-220826352 Jenkins retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15472][SQL][Streaming] Add support for ...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/13251 [SPARK-15472][SQL][Streaming] Add support for partitioned `text` format in FileStreamSink ## What changes were proposed in this pull request? This patch adds support for partitioned `text` format in FileStreamSink. Specifically, this patch adds a new `TextOutputWriterFactory`, which would return an `OutputWriter` writing data without using an `OutputCommitter`. This is the same manner employed by [SPARK-14716](https://github.com/apache/spark/pull/12409) in which we had added support for partitioned `parquet` format. ## How was this patch tested? The `FileStreamSinkSuite` is expanded to cover tests added for the partitioned `text` format. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark add-text-format Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13251.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13251 commit b6da2287b8d7924a978106b953a83dabad288980 Author: Liwei Lin Date: 2016-05-22T08:54:11Z Some refactor commit b58e97f97930a81183d737319ba431e6722905a5 Author: Liwei Lin Date: 2016-05-22T08:55:26Z Add `text` format support for FileStreamSink --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] Reduce delay bet...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12725#issuecomment-219336043 @marmbrus @zsxwing maybe this is ready to go? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] Reduce delay bet...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12725#issuecomment-218016900 @zsxwing would you take another look? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-15208][Core][Docs] Update spark ex...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12981#issuecomment-217689734 Once the AccumulatorV2 Java API is finalized, I would update the Java & Python part. Thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [WIP][SPARK-15208][Core][Docs] Update spark ex...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/12981 [WIP][SPARK-15208][Core][Docs] Update spark examples with AccumulatorV2 ## What changes were proposed in this pull request? The patch updates the codes & docs in the example module as well as the related doc module: - [ ] [docs] `streaming-programming-guide.md` - [ ] scala code part - [ ] java code part - [ ] python code part - [ ] [examples] `RecoverableNetworkWordCount.scala` - [ ] [examples] `JavaRecoverableNetworkWordCount.java` - [ ] [examples] `recoverable_network_wordcount.py` ## How was this patch tested? Ran the examples and verified results manually. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark accumulatorV2-examples Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12981.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12981 commit 2d5198265b3df9c48af2be4bf572a3c815fc9bb1 Author: Liwei Lin Date: 2016-05-08T03:31:20Z update scala example code & doc --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] Reduce delay bet...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12725#issuecomment-217608390 I've addressed comments and expanded tests; @zsxwing would you mind taking another look? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] Reduce delay bet...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/12725#discussion_r62410545 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala --- @@ -27,12 +27,12 @@ import org.apache.spark.sql.execution.{QueryExecution, SparkPlan, SparkPlanner, * A variant of [[QueryExecution]] that allows the execution of the given [[LogicalPlan]] * plan incrementally. Possibly preserving state in between each execution. */ -class IncrementalExecution( +class IncrementalExecution private[sql]( sparkSession: SparkSession, logicalPlan: LogicalPlan, outputMode: OutputMode, checkpointLocation: String, -currentBatchId: Long) +val currentBatchId: Long) --- End diff -- expose this to tests --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] Reduce delay bet...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/12725#discussion_r62410548 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --- @@ -122,7 +122,7 @@ class StreamExecution( * processing is done. Thus, the Nth record in this log indicated data that is currently being * processed and the N-1th entry indicates which offsets have been durably committed to the sink. */ - private val offsetLog = + private[sql] val offsetLog = --- End diff -- expose this to test suits --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15022][SPARK-15023][SQL][Streaming] Add...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12797#issuecomment-216705209 @marmbrus @zsxwing thank you for the patient review! :-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15022][SPARK-15023][SQL][Streaming] Add...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/12797#discussion_r61977204 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala --- @@ -65,8 +65,13 @@ case class ProcessingTimeExecutor(processingTime: ProcessingTime, clock: Clock = s"${intervalMs} milliseconds, but spent ${realElapsedTimeMs} milliseconds") } - /** Return the next multiple of intervalMs */ + /** + * Returns the start time in milliseconds for the next batch interval, given the current time. + * Note that a batch interval is inclusive with respect to its start time, and thus calling + * `nextBatchTime` with the result of a previous call should return the next interval. (i.e. given + * an interval of `100 ms`, `nextBatchTime(nextBatchTime(0)) = 200` rather than `0`). --- End diff -- Thank you always for the prompt review! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15022][SPARK-15023][SQL][Streaming] Add...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/12797#discussion_r61976675 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala --- @@ -65,8 +65,13 @@ case class ProcessingTimeExecutor(processingTime: ProcessingTime, clock: Clock = s"${intervalMs} milliseconds, but spent ${realElapsedTimeMs} milliseconds") } - /** Return the next multiple of intervalMs */ + /** + * Returns the start time in milliseconds for the next batch interval, given the current time. + * Note that a batch interval is inclusive with respect to its start time, and thus calling + * `nextBatchTime` with the result of a previous call should return the next interval. (i.e. given + * an interval of `100 ms`, `nextBatchTime(nextBatchTime(0)) = 200` rather than `0`). --- End diff -- `nextBatchTime(0) = 100`, so `nextBatchTime(nextBatchTime(0)) = 200`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15022][SPARK-15023][SQL][Streaming] Add...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/12797#discussion_r61976475 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/StreamTest.scala --- @@ -142,7 +142,10 @@ trait StreamTest extends QueryTest with Timeouts { case object StopStream extends StreamAction with StreamMustBeRunning /** Starts the stream, resuming if data has already been processed. It must not be running. */ - case object StartStream extends StreamAction + case class StartStream(trigger: Trigger = null, triggerClock: Clock = null) extends StreamAction --- End diff -- This layer of `null`s was intended to delegate the default values of `StreamExecution` into these tests, so that we don't have to set the same default values in many places and maintain their consistency. But since it seems very unlikely that we would change the default values, so I've removed the `null`s and followed your comments. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15022][SPARK-15023][SQL][Streaming] Add...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/12797#discussion_r61837747 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ProcessingTimeExecutorSuite.scala --- @@ -21,19 +21,41 @@ import java.util.concurrent.{CountDownLatch, TimeUnit} import org.apache.spark.SparkFunSuite import org.apache.spark.sql.ProcessingTime -import org.apache.spark.util.ManualClock +import org.apache.spark.util.{Clock, ManualClock, SystemClock} class ProcessingTimeExecutorSuite extends SparkFunSuite { test("nextBatchTime") { val processingTimeExecutor = ProcessingTimeExecutor(ProcessingTime(100)) +assert(processingTimeExecutor.nextBatchTime(0) === 100) assert(processingTimeExecutor.nextBatchTime(1) === 100) assert(processingTimeExecutor.nextBatchTime(99) === 100) -assert(processingTimeExecutor.nextBatchTime(100) === 100) +assert(processingTimeExecutor.nextBatchTime(100) === 200) assert(processingTimeExecutor.nextBatchTime(101) === 200) assert(processingTimeExecutor.nextBatchTime(150) === 200) } + private def testNextBatchTimeAgainstClock(clock: Clock) { +val IntervalMS = 100 --- End diff -- Sure; let me fix this typo --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15022][SPARK-15023][SQL][Streaming] Add...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/12797#discussion_r61837727 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala --- @@ -65,8 +65,22 @@ case class ProcessingTimeExecutor(processingTime: ProcessingTime, clock: Clock = s"${intervalMs} milliseconds, but spent ${realElapsedTimeMs} milliseconds") } - /** Return the next multiple of intervalMs */ + /** Return the next multiple of intervalMs + * + * e.g. for intervalMs = 100 + * nextBatchTime(0) = 100 + * nextBatchTime(1) = 100 + * ... + * nextBatchTime(99) = 100 + * nextBatchTime(100) = 200 + * nextBatchTime(101) = 200 + * ... + * nextBatchTime(199) = 200 + * nextBatchTime(200) = 300 + * + * Note, this way, we'll get nextBatchTime(nextBatchTime(0)) = 200, rather than = 0 --- End diff -- Let me update it with your much clearer verson! Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15022][SPARK-15023][SQL][Streaming] Add...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/12797#discussion_r61837469 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TriggerExecutor.scala --- @@ -65,8 +65,22 @@ case class ProcessingTimeExecutor(processingTime: ProcessingTime, clock: Clock = s"${intervalMs} milliseconds, but spent ${realElapsedTimeMs} milliseconds") } - /** Return the next multiple of intervalMs */ + /** Return the next multiple of intervalMs + * + * e.g. for intervalMs = 100 + * nextBatchTime(0) = 100 + * nextBatchTime(1) = 100 + * ... + * nextBatchTime(99) = 100 + * nextBatchTime(100) = 200 + * nextBatchTime(101) = 200 + * ... + * nextBatchTime(199) = 200 + * nextBatchTime(200) = 300 + * + * Note, this way, we'll get nextBatchTime(nextBatchTime(0)) = 200, rather than = 0 + * */ def nextBatchTime(now: Long): Long = { -(now - 1) / intervalMs * intervalMs + intervalMs +now / intervalMs * intervalMs + intervalMs --- End diff -- @zsxwing thanks for clarifying on this! :-) [1] The issue is triggered when both `batchElapsedTimeMs == 0` and `batchEndTimeMs` is multiple of `intervalMS` hold, e.g. `batchStartTimeMs == 50` and `batchEndTimeMS == 50` given `intervalMS == 100` won't trigger the issue. So, we might have to do like this: ```scala if (batchElapsedTimeMs == 0 && batchEndTimeMs % intervalMS == 0) { clock.waitTillTime(batchEndTimeMs + intervalMs) } else { clock.waitTillTime(nextBatchTime(batchEndTimeMs)) } ``` For me It seems a little hard to interpret... [2] > ... deal with one case: If a batch takes exactly intervalMs, we should run the next batch at once instead of sleeping intervalMs This is a good point! I've done some calculations based on your comments, and it seems we would still run the next batch at once when the last job takes exactly `intervalMs`? prior to this path: ``` batch | job - [ 0, 99] | [100, 199] | job x starts at 100, stops at 199, takes 100 [200, 299] | ``` after this patch, it's still the same: ``` batch | job - [ 0, 99] | [100, 199] | job y starts at 100, stops at 199, takes 100 [200, 299] | ``` -- @zsxwing thoughts on the above [1] and [2]? Thanks! :-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14884][SQL][Streaming][WebUI] Fix call ...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12650#issuecomment-216427977 @zsxwing I've made updates per your comments; would you take a another look? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14884][SQL][Streaming][WebUI] Fix call ...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/12650#discussion_r61835377 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala --- @@ -46,7 +45,11 @@ private[sql] object SQLExecution { val executionId = SQLExecution.nextExecutionId sc.setLocalProperty(EXECUTION_ID_KEY, executionId.toString) val r = try { -val callSite = Utils.getCallSite() +// We first try to pick up any call site that was set previously, then fall back to +// Utils.getCallSite(); because call Utils.getCallSite() on continuous queries directly +// would give us call site like "run at :0" +val callSite = sparkSession.sparkContext.getCallSite() --- End diff -- @zsxwing sorry for the vague comments. The fall back logic is contained in the `sparkSession.sparkContext.getCallSite()` itself. I've updated the comments accordingly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14884][SQL][Streaming][WebUI] Fix call ...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/12650#discussion_r61835312 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --- @@ -99,7 +99,15 @@ class StreamExecution( /** The thread that runs the micro-batches of this stream. */ private[sql] val microBatchThread = new UninterruptibleThread(s"stream execution thread for $name") { - override def run(): Unit = { runBatches() } + + private val callSite = Utils.getCallSite() --- End diff -- @zsxwing making it as a field of `StreamExecution` is surely more reasonable; I've updated this within the new commit. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15022][SPARK-15023] Add support for tes...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12797#issuecomment-215928373 @marmbrus @tdas @zsxwing would you mind taking a look? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] Reduce delay bet...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12725#issuecomment-215927916 For things to be easy to review, I've added the manual timed executor for testing general cases in [a separate PR](https://github.com/apache/spark/pull/12797). When that patch get merged, I'll add some dedicated tests here for this patch. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15022][SPARK-15023] Add support for tes...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/12797#discussion_r61663843 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala --- @@ -136,6 +136,22 @@ class StreamSuite extends StreamTest with SharedSQLContext { testStream(ds)() } } + + // This would fail for now -- error is "Timed out waiting for stream" + // Root cause is that data generated in batch 0 may not get processed in batch 1 + // Let's enable this after SPARK-14942: Reduce delay between batch construction and execution + ignore("minimize delay between batch construction and execution") { +val inputData = MemoryStream[Int] +testStream(inputData.toDS())( + StartStream(ProcessingTime("10 seconds"), new ManualClock), + /* -- batch 0 --- */ + AddData(inputData, 1), + AddData(inputData, 2), + AddData(inputData, 3), + AdvanceManualClock(10 * 1000), // 10 seconds + /* -- batch 1 --- */ + CheckAnswer(1, 2, 3)) + } --- End diff -- The above test takes advantage of the new `StartStream` and `AdvanceManualClock` action. It's testing against `SPARK-14942: Reduce delay between batch construction and execution` with a manually timed executor. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15022][SPARK-15023] Add support for tes...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/12797#discussion_r61663787 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/ProcessingTimeExecutorSuite.scala --- @@ -21,19 +21,41 @@ import java.util.concurrent.{CountDownLatch, TimeUnit} import org.apache.spark.SparkFunSuite import org.apache.spark.sql.ProcessingTime -import org.apache.spark.util.ManualClock +import org.apache.spark.util.{Clock, ManualClock, SystemClock} class ProcessingTimeExecutorSuite extends SparkFunSuite { test("nextBatchTime") { val processingTimeExecutor = ProcessingTimeExecutor(ProcessingTime(100)) +assert(processingTimeExecutor.nextBatchTime(0) === 100) assert(processingTimeExecutor.nextBatchTime(1) === 100) assert(processingTimeExecutor.nextBatchTime(99) === 100) -assert(processingTimeExecutor.nextBatchTime(100) === 100) +assert(processingTimeExecutor.nextBatchTime(100) === 200) assert(processingTimeExecutor.nextBatchTime(101) === 200) assert(processingTimeExecutor.nextBatchTime(150) === 200) } + private def testNextBatchTimeAgainstClock(clock: Clock) { +val IntervalMS = 100 +val processingTimeExecutor = ProcessingTimeExecutor(ProcessingTime(IntervalMS), clock) + +val ITERATION = 10 +var nextBatchTime: Long = 0 +for (it <- 1 to ITERATION) + nextBatchTime = processingTimeExecutor.nextBatchTime(nextBatchTime) + +// nextBatchTime should be 1000 +assert(nextBatchTime === IntervalMS * ITERATION) + } + + test("nextBatchTime against SystemClock") { +testNextBatchTimeAgainstClock(new SystemClock) + } + + test("nextBatchTime against ManualClock") { --- End diff -- Please note the `ProcessingTimeExecutor` issue would fail this test without this patch, but would pass with this patch. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15022][SPARK-15023] Add support for tes...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/12797 [SPARK-15022][SPARK-15023] Add support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `ManualClock` ## What changes were proposed in this pull request? Currently in `StreamTest`, we have a `StartStream` which will start a streaming query against trigger `ProcessTime(intervalMS = 0)` and `SystemClock`. We also need to test cases against `ProcessTime(intervalMS > 0)`, which often requires `ManualClock`. This patch: - fixes an issue of `ProcessingTimeExecutor`, where for a batch it should run `batchRunner` only once but might run multiple times under certain conditions; - adds support for testing against the `ProcessingTime(intervalMS > 0)` trigger and `ManualClock`, by specifying them as fields for `StartStream`, and by adding an `AdvanceClock` action; - adds a test, which takes advantage of the new `StartStream` and `AdvanceClock`, to test against [PR#[SPARK-14942] Reduce delay between batch construction and execution ](https://github.com/apache/spark/pull/12725). ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark add-trigger-test-support Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12797.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12797 commit 90ed69285fc34ae43a0f454ceb25837618212e28 Author: Liwei Lin Date: 2016-04-30T01:32:48Z fix an issue of nextBatchTime against ManualClock commit 9d80b15e33151f5f87207d72f89f016c18a21b01 Author: Liwei Lin Date: 2016-04-30T01:33:48Z Add support for testing against ProcessingTime(intervalMS>0) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-15022][SPARK-15023] Add support for tes...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12797#issuecomment-215924685 still editing, will explain why ProcessingTimeExecutor, where for a batch it should run batchRunner only once but might run multiple times under certain conditions; --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] Reduce delay bet...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12725#issuecomment-215405018 Sure, I'll add a manual timed executor and some dedicated tests as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12681#issuecomment-214983254 @davies thanks for the review & merging :-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] First construct ...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12725#issuecomment-214979767 @marmbrus @tdas @zsxwing would you mind taking a look? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14942][SQL][Streaming] First construct ...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/12725 [SPARK-14942][SQL][Streaming] First construct a batch then run the batch for continuous queries ## Problem Currently in `StreamExecution`, we first run the batch, then construct the next: ```scala if (dataAvailable) runBatch() constructNextBatch() ``` This is good if we run batches ASAP, where data would get processed in the **very next batch**: ![1](https://cloud.githubusercontent.com/assets/15843379/14779964/2786e698-0b0d-11e6-9d2c-bb41513488b2.png) However, if we run batches at trigger like `ProcessTime("1 minute")`, data - such as _y_ below - may not get processed in the very next batch i.e. _batch 1_, but in _batch 2_: ![2](https://cloud.githubusercontent.com/assets/15843379/14779818/6f3bb064-0b0c-11e6-9f16-c1ce4897186b.png) ## What changes were proposed in this pull request? This patch reverse the order of `constructNextBatch()` and `runBatch()`. After this patch, data would get processed in the **very next batch**, i.e. _batch 1_: ![3](https://cloud.githubusercontent.com/assets/15843379/14779816/6f36ee62-0b0c-11e6-9e53-bc8397fade18.png) In addition, this patch alters when we do `currentBatchId += 1`: let's do that when the processing of the current batch of data is complete, so we won't bother passing `currentBatchId + 1` or `currentBatchId - 1` to states or sinks. ## How was this patch tested? This should be covered by existing test suits including stress tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark construct-before-run-3 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12725.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12725 commit 8c8d73a85550349faafef1cbde1c7094d384809e Author: Liwei Lin Date: 2016-04-27T02:19:14Z constructNextBatch() before runBatch() --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12638#issuecomment-214943354 just rebase to master to resolve some conflicts --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/12638#discussion_r61193722 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala --- @@ -88,7 +88,7 @@ class FileStreamSource( } /** - * Returns the next batch of data that is available after `start`, if any is available. --- End diff -- This doc should be updated, following `Source.getBatch()`'s doc change from `Returns the next batch of data that is available after start, if any is available.` to `Returns the data that is between the offsets (start, end].` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12681#issuecomment-214939654 Jenkins retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12681#issuecomment-214753491 Some flaky tests unrelated to this PR Jenkins retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12638#issuecomment-214714643 Jenkins retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12681#issuecomment-214713457 Jenkins retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14747][SQL] Add assertStreaming/assertN...
Github user lw-lin commented on a diff in the pull request: https://github.com/apache/spark/pull/12521#discussion_r61041131 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/DataFrameReaderWriterSuite.scala --- @@ -368,4 +368,79 @@ class DataFrameReaderWriterSuite extends StreamTest with SharedSQLContext with B "org.apache.spark.sql.streaming.test", Map.empty) } + + test("check continuous query methods should not be called on non-continuous queries") { --- End diff -- updated - thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12681#issuecomment-214617805 Some flaky tests not caused by this PR. Jenkins retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12638#issuecomment-214609215 some build issues unrelated to this PR. Jenkins retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12638#issuecomment-214606875 @marmbrus thanks for the patient reminder! Since I've reverted the renaming, and I've checked there's no other completely unused class under `o.a.s.sql.execution.streaming` package, maybe this is ready to go (pending tests). So would mind take another look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...
GitHub user lw-lin reopened a pull request: https://github.com/apache/spark/pull/12638 [SPARK-14874][SQL][Streaming] Remove the obsolete Batch representation ## What changes were proposed in this pull request? The `Batch` class, which had been used to indicate progress in a stream, was abandoned by [[SPARK-13985][SQL] Deterministic batches with ids](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b) and then became useless. This patch: - removes the `Batch` class - renames `getBatch(...)` to `getData(...)` for `Source`: - prior to [SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b), it was: get**_NextBatch_**(start: Option[Offset]): **_Option[Batch]_** - after [SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b), it became: get**_Batch_**(start: Option[Offset], end: Offset): **_DataFrame_** - proposed in this patch: get**_Data_**(start: Option[Offset], end: Offset): DataFrame - renames `addBatch(...)` to `addData(...)` for `Sink`: - prior to [SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b), it was: addBatch(**_batch: Batch_**) - after [SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b), it became: addBatch(batchId: Long, **_data: DataFrame_**) - proposed in this patch: add**_Data_**(batchId: Long, data: DataFrame) The renaming of public methods should be OK since they have not been in any release yet. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark remove-batch Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12638.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12638 commit c79cba9059b7ac2d6398c81b57ceece50b6b7526 Author: Liwei Lin Date: 2016-04-23T10:15:51Z remove the useless Batch class commit 60aaf97c30f55f5791ec53f81f3ed90481586fed Author: Liwei Lin Date: 2016-04-26T04:03:10Z revert renaming commit 14e690037fbfc7840672e9a78b9abbccbc07c3ee Author: Liwei Lin Date: 2016-04-26T04:03:23Z revert renaming --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12681#issuecomment-214605511 @davies (who made the first change) might want to take a look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12681#issuecomment-214601397 Actually this wouldn't cause any problem and wouldn't fail any test suits **_for now_**, because the read of `acquiredButNotUsed` is guaranteed to see most recent value due to the existing `synchronized(this) {consumers...}` block before the read of `acquiredButNotUsed`. It is kind of ["Piggybacking" on synchronization](http://www.javamex.com/tutorials/synchronization_piggyback.shtml) -- but let's not rely on this because it's vulnerable to future code changes? :-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14911][Core] Fix a potential data race ...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/12681 [SPARK-14911][Core] Fix a potential data race in TaskMemoryManager ## What changes were proposed in this pull request? [[SPARK-13210][SQL] catch OOM when allocate memory and expand array](https://github.com/apache/spark/commit/37bc203c8dd5022cb11d53b697c28a737ee85bcc) introduced an `acquiredButNotUsed` field, but it might not be correctly synchronized: - the write `acquiredButNotUsed += acquired` is guarded by `this` lock (see [here](https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L271)); - the read `memoryManager.releaseExecutionMemory(acquiredButNotUsed, taskAttemptId, tungstenMemoryMode)` (see [here](https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L400)) might not be correctly synchronized, and thus might not see `acquiredButNotUsed`'s new written value. This patch makes `acquiredButNotUsed` to fix this. ## How was this patch tested? This should be covered by existing suits. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark fix-acquiredButNotUsed Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12681.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12681 commit 6b72b963d54855771dcabc1fca8ed963be28303c Author: Liwei Lin Date: 2016-04-26T03:11:53Z fix a potential data race in TaskMemoryManager --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12638#issuecomment-214586964 Sure, so I'm closing this PR since the removal itself is not worthy for committers to process. @marmbrus thanks for the review! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...
Github user lw-lin closed the pull request at: https://github.com/apache/spark/pull/12638 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14747][SQL] Add assertStreaming/assertN...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12521#issuecomment-214585357 Updates: thanks to [[SPARK-14473][SQL] Define analysis rules to catch operations not supported in streaming](https://github.com/apache/spark/commit/775cf17eaaae1a38efe47b282b1d6bbdb99bd759), now we have friendly messages for most of the incorrect usages: > Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with write.startStream(); and > Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries without streaming sources cannot be executed with write.startStream(); - That leaves this patch capturing other incorrect usages such as calling `.trigger()`/`.queryName()` on non-continuous queries, or calling `bucketBy()`/`sortBy()` on continuous queries. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14884][SQL][Streaming][WebUI] Fix call ...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12650#issuecomment-213919607 @andrewor14 @zsxwing would you mind taking a look? Thanks! :-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14884][SQL][Streaming][WebUI] Fix call ...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/12650 [SPARK-14884][SQL][Streaming][WebUI] Fix call site for continuous queries ## What changes were proposed in this pull request? Since we've been processing continuous queries in separate threads, the call sites are then `run at :0`. It's not wrong but provides very little information; in addition, we can not distinguish two queries only from their call sites. This patch fixes this. ### Before [Jobs Tab] ![s1a](https://cloud.githubusercontent.com/assets/15843379/14766101/a47246b2-0a30-11e6-8d81-06a9a600113b.png) [SQL Tab] ![s1b](https://cloud.githubusercontent.com/assets/15843379/14766102/a4750226-0a30-11e6-9ada-773d977d902b.png) ### After [Jobs Tab] ![s2a](https://cloud.githubusercontent.com/assets/15843379/14766104/a89705b6-0a30-11e6-9830-0d40ec68527b.png) [SQL Tab] ![s2b](https://cloud.githubusercontent.com/assets/15843379/14766103/a8966728-0a30-11e6-8e4d-c2e326400478.png) ## How was this patch tested? Manually checks - see screenshots above. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark fix-call-site Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12650.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12650 commit db15de5661ba9f5dbf94caedd02f5fdd90a8fe61 Author: Liwei Lin Date: 2016-04-24T07:05:15Z fix call site for continuous queries commit 44d93c780cbeb97937f211b50c79bd0bf8291909 Author: Liwei Lin Date: 2016-04-24T07:10:19Z fix style --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12638#issuecomment-213870571 @marmbrus @tdas would you mind taking a look? Thanks! :-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Remove the obsol...
GitHub user lw-lin reopened a pull request: https://github.com/apache/spark/pull/12638 [SPARK-14874][SQL][Streaming] Remove the obsolete Batch representation ## What changes were proposed in this pull request? The `Batch` class, which had been used to indicate progress in a stream, was abandoned by [[SPARK-13985][SQL] Deterministic batches with ids](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b) and then became useless. This patch: - removes the `Batch` class - renames `getBatch(...)` to `getData(...)` for `Source`: - before [SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b), it was: get**_NextBatch_**(start: Option[Offset]): **_Option[Batch]_** - after [SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b), it became: get**_Batch_**(start: Option[Offset], end: Offset): **_DataFrame_** - proposed in this patch: get**_Data_**(start: Option[Offset], end: Offset): DataFrame - renames `addBatch(...)` to `addData(...)` for `Sink`: - before [SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b), it was: addBatch(**_batch: Batch_**) - after [SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b), it became: addBatch(batchId: Long, **_data: DataFrame_**) - proposed in this patch: add**_Data_**(batchId: Long, data: DataFrame) The renaming of public methods should be OK since they have not been in any release yet. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark remove-batch Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12638.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12638 commit c79cba9059b7ac2d6398c81b57ceece50b6b7526 Author: Liwei Lin Date: 2016-04-23T10:15:51Z remove the useless Batch class --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Cleanup the usel...
Github user lw-lin closed the pull request at: https://github.com/apache/spark/pull/12638 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14874][SQL][Streaming] Cleanup the usel...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/12638 [SPARK-14874][SQL][Streaming] Cleanup the useless Batch class ## What changes were proposed in this pull request? The `Batch` class, which had been used to indicate progress in a stream, was abandoned by [[SPARK-13985][SQL] Deterministic batches with ids](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b) and then became useless. This patch: - removes the `Batch` class - renames `getBatch(...)` to `getData(...)` for `Source`: - before [SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b), it was: get**_NextBatch_**(start: Option[Offset]): **_Option[Batch]_** - after [SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b), it became: get**_Batch_**(start: Option[Offset], end: Offset): **_DataFrame_** - proposed in this patch: get**_Data_**(start: Option[Offset], end: Offset): DataFrame - renames `addBatch(...)` to `addData(...)` for `Sink`: - before [SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b), it was: addBatch(**_batch: Batch_**) - after [SPARK-13985](https://github.com/apache/spark/commit/caea15214571d9b12dcf1553e5c1cc8b83a8ba5b), it became: addBatch(batchId: Long, **_data: DataFrame_**) - proposed in this patch: add**_Data_**(batchId: Long, data: DataFrame) ## How was this patch tested? The changes should be covered by existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark remove-batch Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12638.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12638 commit c79cba9059b7ac2d6398c81b57ceece50b6b7526 Author: Liwei Lin Date: 2016-04-23T10:15:51Z remove the useless Batch class --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [Spark-14687][Core][SQL][MLlib] Call path.getF...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12450#issuecomment-212405238 @srowen thank you for the review & merging :-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14747][SQL] Add assertStreaming/assertN...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12521#issuecomment-212303861 @rxin @marmbrus would you mind taking a look when you have time? Thanks! :-) And I'm not sure we should disallow calling methods like `parquet()`, `text()` on continuous queries or not. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14747][SQL] Add assertStreaming/assertN...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/12521 [SPARK-14747][SQL] Add assertStreaming/assertNoneStreaming checks in DataFrameWriter ## Problem If an end user happens to write code mixed with continuous-query-oriented methods and non-continuous-query-oriented methods: ```scala ctx.read .format("text") .stream("...") // continuous query .write .text("...")// non-continuous query; should be startStream() here ``` He/she would get this somehow confusing exception: > Exception in thread "main" java.lang.AssertionError: assertion failed: No plan for FileSource[./continuous_query_test_input] at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59) at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) at ... ## What changes were proposed in this pull request? This PR adds checks for continuous-query-oriented methods and non-continuous-query-oriented methods in `DataFrameWriter`: continuous query non-continuous query mode â trigger â format â â option/options â â partitionBy â â bucketBy â sortBy â save â queryName â startStream â insertInto â saveAsTable â jdbc â json â parquet â orc â text â csv â ## How was this patch tested? dedicated unit tests were added You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark dataframe-writer-check Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12521.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12521 commit 449b7bd204970311b26638e1b8bc0de3d4faaa8c Author: Liwei Lin Date: 2016-04-20T05:58:05Z add checks for continuous-queries-oriented methods / non-continuous-queries-oriented methods --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14701][Streaming] First stop the event ...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12489#issuecomment-211724261 @zsxwing would you mind taking a look? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14701][Streaming] First stop the event ...
GitHub user lw-lin opened a pull request: https://github.com/apache/spark/pull/12489 [SPARK-14701][Streaming] First stop the event loop, then stop the checkpoint writer in JobGenerator ## What changes were proposed in this pull request? The stopping order of the `event loop` and `checkpoint writer` was reversed. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/lw-lin/spark spark-14701 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12489.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12489 commit 48453fda59c1392a51d987962aa4eeda20fefbbc Author: Liwei Lin Date: 2016-04-19T03:10:00Z First stop the event loop, then stop the checkpoint writer in JobGenerator --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [Spark-14687][Core][SQL][MLlib] Call path.getF...
Github user lw-lin commented on the pull request: https://github.com/apache/spark/pull/12450#issuecomment-211679362 Jenkins retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org