[GitHub] spark pull request #17564: [MINOR][DOCS] Fix spacings in Structured Streamin...
GitHub user dongjinleekr opened a pull request:

    https://github.com/apache/spark/pull/17564

[MINOR][DOCS] Fix spacings in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

1. Omitted space between the sentences: `... on static data.The Spark SQL engine will ...` -> `... on static data. The Spark SQL engine will ...`
2. Omitted colon in Output Model section.

## How was this patch tested?

None.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjinleekr/spark feature/fix-programming-guide

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17564.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17564

commit e10843d0c6e1a45ef507f8a4c7514c0694570abf
Author: Lee Dongjin
Date: 2017-04-07T08:17:25Z

    [MINOR][DOCS] Fix spacings in Structured Streaming Programming Guide

---

If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17225: [CORE] Support ZStandard Compression
GitHub user dongjinleekr opened a pull request:

    https://github.com/apache/spark/pull/17225

[CORE] Support ZStandard Compression

## What changes were proposed in this pull request?

Hadoop will support ZStandard Compression from 2.9.0. This update enables saving a file in HDFS using ZStandard Codec, by implementing ZStandardCodec.

## How was this patch tested?

3 additional unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjinleekr/spark feature/z-compression

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17225.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17225

commit 01850446779b398bae281391de205cbda28d3efc
Author: Lee Dongjin
Date: 2017-03-09T15:43:56Z

    Implement ZStandardCompressionCodec.
[GitHub] spark pull request #17225: [CORE] Support ZStandard Compression
Github user dongjinleekr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17225#discussion_r105342714

--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---

    @@ -49,13 +50,14 @@ private[spark] object CompressionCodec {
       private[spark] def supportsConcatenationOfSerializedStreams(codec: CompressionCodec): Boolean = {
         (codec.isInstanceOf[SnappyCompressionCodec] || codec.isInstanceOf[LZFCompressionCodec]
    -      || codec.isInstanceOf[LZ4CompressionCodec])
    +      || codec.isInstanceOf[LZ4CompressionCodec] || codec.isInstanceOf[ZStandardCompressionCodec])
       }

       private val shortCompressionCodecNames = Map(
         "lz4" -> classOf[LZ4CompressionCodec].getName,
         "lzf" -> classOf[LZFCompressionCodec].getName,
    -    "snappy" -> classOf[SnappyCompressionCodec].getName)
    +    "snappy" -> classOf[SnappyCompressionCodec].getName,
    +    "zstd" -> classOf[SnappyCompressionCodec].getName)

--- End diff --

OMG, it is a typo.
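For reference, the intended mapping after fixing the typo flagged above would read as follows. This is a sketch only: the class names are written out as plain strings here (rather than `classOf[...].getName`) so the snippet stands alone without the Spark classes on the classpath.

```scala
// Sketch of the corrected short-name table: each short name must map to its
// own codec class; in the diff above, "zstd" mistakenly pointed at Snappy.
val shortCompressionCodecNames = Map(
  "lz4"    -> "org.apache.spark.io.LZ4CompressionCodec",
  "lzf"    -> "org.apache.spark.io.LZFCompressionCodec",
  "snappy" -> "org.apache.spark.io.SnappyCompressionCodec",
  "zstd"   -> "org.apache.spark.io.ZStandardCompressionCodec") // fixed entry

println(shortCompressionCodecNames("zstd"))
```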
[GitHub] spark pull request #17225: [CORE] Support ZStandard Compression
Github user dongjinleekr closed the pull request at: https://github.com/apache/spark/pull/17225
[GitHub] spark issue #17225: [CORE] Support ZStandard Compression
Github user dongjinleekr commented on the issue: https://github.com/apache/spark/pull/17225

@HyukjinKwon Thanks for the information. It seems both the JIRA issue and my PR are messed up - I will re-create the PR with the JIRA issue.
[GitHub] spark pull request #17303: [SPARK-19112][CORE] add codec for ZStandard
GitHub user dongjinleekr opened a pull request:

    https://github.com/apache/spark/pull/17303

[SPARK-19112][CORE] add codec for ZStandard

## What changes were proposed in this pull request?

Hadoop[^1] & HBase[^2] started to support ZStandard Compression in their recent releases. This update enables saving a file in HDFS using the ZStandard codec, by implementing ZStandardCodec. It also requires adding a new configuration for the default compression level, for example, 'spark.io.compression.zstandard.level.'

[^1]: https://issues.apache.org/jira/browse/HADOOP-13578
[^2]: https://issues.apache.org/jira/browse/HBASE-16710

## How was this patch tested?

3 additional unit tests in `CompressionCodecSuite.scala`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjinleekr/spark feature/SPARK-19112

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17303.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17303

commit 1927b91d0d8621e9e2dc2a88a93e07780cfc66bf
Author: Lee Dongjin
Date: 2017-03-15T08:09:56Z

    Implement ZStandardCompressionCodec
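Assuming the patch is applied, enabling the codec would look something like the following configuration fragment. The `zstd` short name and the level key are taken from the PR description above; treat them as proposed names, not the final merged configuration.

```
spark.io.compression.codec               zstd
spark.io.compression.zstandard.level     3
```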
[GitHub] spark pull request #16830: [MINOR][CORE] Fix incorrect documentation of Writ...
GitHub user dongjinleekr opened a pull request:

    https://github.com/apache/spark/pull/16830

[MINOR][CORE] Fix incorrect documentation of WritableConverter

## What changes were proposed in this pull request?

`WritableConverter` and `WritableFactory` work in opposite directions, but both of them are documented with the same description:

> A class encapsulating how to convert some type T to Writable. It stores both the Writable class corresponding to T (e.g. IntWritable for Int) and a function for doing the conversion.

This error is a result of commit d37978d, which resolves [SPARK-4795](https://issues.apache.org/jira/browse/SPARK-4795) by redesigning the "primitive type => Writable" implicit APIs. This PR fixes the documentation of `WritableConverter`.

## How was this patch tested?

Existing tests (no functional change anyway).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjinleekr/spark feature/fix-writableconverter-doc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16830.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16830

commit 6c04422d7ec343ba6766980e5fa9985e27659de8
Author: Lee Dongjin
Date: 2017-02-07T11:23:02Z

    Fix documentation of WritableConverter
[GitHub] spark issue #16830: [MINOR][CORE] Fix incorrect documentation of WritableCon...
Github user dongjinleekr commented on the issue: https://github.com/apache/spark/pull/16830

Updated with the problem description. Please have a look when you are free. I will review the other documentation descriptions ...
[GitHub] spark issue #16830: [MINOR][CORE] Fix incorrect documentation of WritableCon...
Github user dongjinleekr commented on the issue: https://github.com/apache/spark/pull/16830

Does this correspond exactly to your intention?
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
GitHub user dongjinleekr opened a pull request:

    https://github.com/apache/spark/pull/21501

[SPARK-15064][ML] Locale support in StopWordsRemover

## What changes were proposed in this pull request?

Add locale support for `StopWordsRemover`.

## How was this patch tested?

[Scala|Python] unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjinleekr/spark feature/SPARK-15064

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21501.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21501

commit 8379d88360651e0571698412e4aecc305ae7fe60
Author: Lee Dongjin
Date: 2018-03-21T15:03:17Z

    Implement SPARK-15064

commit 682df4afe6859119d01de1fa953f95ebaacb4889
Author: Lee Dongjin
Date: 2018-06-05T11:00:17Z

    Add test cases for StopWordsRemoverSuite

commit 67ae6ebaebbb537f24db51be1f96759d5c6463ff
Author: Lee Dongjin
Date: 2018-06-06T07:43:45Z

    Implement pyspark parity
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user dongjinleekr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21501#discussion_r194222392

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---

    @@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String
       @Since("1.5.0")
       def getCaseSensitive: Boolean = $(caseSensitive)

    -  setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
    +  /**
    +   * Locale of the input for case insensitive matching. Ignored when [[caseSensitive]]
    +   * is true.
    +   * Default: Locale.getDefault.toString

--- End diff --

As far as I understand, the locale makes a difference in how the lowercased string looks; it is not related to the stop words themselves. @mengxr That is why you recommended changing the default locale, right?
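The locale-sensitivity of lowercasing can be demonstrated with the JDK alone; the Turkish dotted/dotless i is the classic case. A minimal sketch using plain `java.util.Locale`, no Spark involved:

```scala
import java.util.Locale

// "I" lowercases to "i" in English but to the dotless "\u0131" in Turkish,
// which is why case-insensitive stop-word matching must know its locale.
val upper = "I"
val lowerEnglish = upper.toLowerCase(Locale.ENGLISH)   // "i"
val lowerTurkish = upper.toLowerCase(new Locale("tr")) // "\u0131" (dotless i)
println(lowerEnglish == lowerTurkish) // false
```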
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user dongjinleekr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21501#discussion_r194222407

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---

    @@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String
       @Since("1.5.0")
       def getCaseSensitive: Boolean = $(caseSensitive)

    -  setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
    +  /**
    +   * Locale of the input for case insensitive matching. Ignored when [[caseSensitive]]
    +   * is true.
    +   * Default: Locale.getDefault.toString
    +   * @see `StopWordsRemover.loadDefaultStopWords()`
    +   * @group param
    +   */
    +  @Since("2.4.0")
    +  val locale: Param[String] = new Param[String](this, "locale",
    +    "Locale of the input for case insensitive matching. Ignored when caseSensitive is true.",
    +    ParamValidators.inArray[String](Locale.getAvailableLocales.map(_.toString)))
    +
    +  /** @group setParam */
    +  @Since("2.4.0")
    +  def setLocale(value: String): this.type = set(locale, value)

--- End diff --

@mengxr What do you think? I think supporting two setters is reasonable, but it can introduce some asymmetrical signatures.
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user dongjinleekr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21501#discussion_r194296251

--- Diff: python/pyspark/ml/feature.py ---

    @@ -2610,6 +2610,9 @@ def setParams(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive
             Sets params for this StopWordRemover.
             """
             kwargs = self._input_kwargs
    +        if locale is None:
    +            sc = SparkContext._active_spark_context
    +            kwargs['locale'] = sc._gateway.jvm.org.spark.ml.util.LocaleUtils.getDefaultLocale()

--- End diff --

@viirya You mean `locale=SparkContext._active_spark_context.(...)` over `locale=None` with the ugly if statement, right?
[GitHub] spark pull request #21525: [SPARK-24513][ML] Attribute support in UnaryTrans...
GitHub user dongjinleekr opened a pull request:

    https://github.com/apache/spark/pull/21525

[SPARK-24513][ML] Attribute support in UnaryTransformer

## What changes were proposed in this pull request?

This PR adds Metadata support in `UnaryTransformer`, as a preliminary work of [SPARK-13998](https://issues.apache.org/jira/browse/SPARK-13998) and [SPARK-13964](https://issues.apache.org/jira/browse/SPARK-13964).

## How was this patch tested?

unit test: `build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.ml.feature.* test`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjinleekr/spark feature/SPARK-24513

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21525.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21525

commit 18b5b25b42e01680e873b915c9b77bfd724b3a45
Author: Lee Dongjin
Date: 2018-06-11T08:49:06Z

    [SPARK-24513][ML] Attribute support in UnaryTransformer
[GitHub] spark issue #21501: [SPARK-15064][ML] Locale support in StopWordsRemover
Github user dongjinleekr commented on the issue: https://github.com/apache/spark/pull/21501

Finally, I made it work by adding `StopWordsRemover.getDefaultLocale`! How about this approach? @mengxr @viirya
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user dongjinleekr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21501#discussion_r194342986

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---

    @@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String
       @Since("1.5.0")
       def getCaseSensitive: Boolean = $(caseSensitive)

    -  setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
    +  /**
    +   * Locale of the input for case insensitive matching. Ignored when [[caseSensitive]]
    +   * is true.
    +   * Default: Locale.getDefault.toString
    +   * @see `StopWordsRemover.loadDefaultStopWords()`

--- End diff --

... OMG; please wait.
[GitHub] spark issue #21525: [SPARK-24513][ML] Attribute support in UnaryTransformer
Github user dongjinleekr commented on the issue: https://github.com/apache/spark/pull/21525

If needed, I can propose a draft version of [SPARK-13998](https://issues.apache.org/jira/browse/SPARK-13998) implemented on top of this work.
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user dongjinleekr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21501#discussion_r194623958

--- Diff: python/pyspark/ml/feature.py ---

    @@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl
                                    typeConverter=TypeConverters.toListString)
         caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " +
                               "comparison over the stop words", typeConverter=TypeConverters.toBoolean)
    +    locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive " +
    +                   "is true", typeConverter=TypeConverters.toString)

--- End diff --

Thank you for the comment, but... is that necessary?
[GitHub] spark issue #21525: [SPARK-24513][ML] Attribute support in UnaryTransformer
Github user dongjinleekr commented on the issue: https://github.com/apache/spark/pull/21525

@jkbradley Excuse me, could you have a look when you are free? Thanks.
[GitHub] spark issue #21501: [SPARK-15064][ML] Locale support in StopWordsRemover
Github user dongjinleekr commented on the issue: https://github.com/apache/spark/pull/21501

@mengxr Thank you very much for your kind guidance. If you have some time, please have a look at another PR of mine, #21525. I found that issue while I was reading [an issue on HashingTF](https://issues.apache.org/jira/browse/SPARK-13998). @viirya Thank you again for your comments. Without you I could never have completed this feature. @dongjoon-hyun
[GitHub] spark issue #17564: [MINOR][DOCS] Fix spacings in Structured Streaming Progr...
Github user dongjinleekr commented on the issue: https://github.com/apache/spark/pull/17564

No problem! I will notify you after the review.
[GitHub] spark issue #17564: [MINOR][DOCS] Fix spacings in Structured Streaming Progr...
Github user dongjinleekr commented on the issue: https://github.com/apache/spark/pull/17564

@srowen I just completed the review. I could not find any other typos in this document. :)
[GitHub] spark pull request #22282: [SPARK-23539][SS] Add support for Kafka headers i...
GitHub user dongjinleekr opened a pull request:

    https://github.com/apache/spark/pull/22282

[SPARK-23539][SS] Add support for Kafka headers in Structured Streaming

## What changes were proposed in this pull request?

This update adds support for Kafka Headers functionality in Structured Streaming.

## How was this patch tested?

With the following unit tests:

- KafkaRelationSuite: "default starting and ending offsets with headers" (new)
- KafkaSinkSuite: "batch - write to kafka" (updated)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjinleekr/spark feature/SPARK-23539

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22282.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22282

commit ddd08612ef8bdb173f974059d2dc6311e1c7d9a3
Author: Lee Dongjin
Date: 2018-08-26T13:19:52Z

    Remove unused methods from UnsafeArrayData

commit 2af13899ab052cc7b52c25b57f154b78a2c45b2a
Author: Lee Dongjin
Date: 2018-08-26T13:25:25Z

    Implement UnsafeArrayData#fromBinaryArray

commit a8e5c5c0f478a795af1236771236da2074093f3e
Author: Lee Dongjin
Date: 2018-08-27T12:22:18Z

    Implement UnsafeArrayData#fromStringArray

commit 2ca181046cf1102aed14f4957e11e4dd901ba3c7
Author: Lee Dongjin
Date: 2018-08-27T13:28:58Z

    Implement UnsafeMapData#of

commit d0d746d99d0a19ecbb2dc098589adbfd1ef0b5ae
Author: Lee Dongjin
Date: 2018-08-29T13:25:57Z

    Allow empty UnsafeArrayData: does not throw IllegalArgumentException on empty or null array argument anymore.

commit b459cf3f391d6e4ee9cb77a7b5ed510d027d9ddd
Author: Lee Dongjin
Date: 2018-08-28T07:50:59Z

    Fix invalid formatting: UnsafeArraySuite

commit f077c5d75a83df3541a95a628726e1d74af8c153
Author: Lee Dongjin
Date: 2018-08-28T07:51:30Z

    Implemenet kafka headers functionality

commit 6b4d7754d01e4211d05c83746c848fdfd873f229
Author: Lee Dongjin
Date: 2018-08-29T09:39:55Z

    Add KafkaTestUtils#{sendMessage, sendMessages(String, Array[(String, String, Array[(String, String)])], Option[Int])}

commit c7fb9819989056da4910e2cfc81af332cd603d41
Author: Lee Dongjin
Date: 2018-08-29T13:24:09Z

    Extend KafkaRelationSuite, KafkaSinkSuite to test headers functionality

commit dd2d9390478e4c69b01a7c699e28bfe923ef0db1
Author: Lee Dongjin
Date: 2018-08-30T10:50:38Z

    Minor refinements

commit 229aac85442b03736fc850cae2c3b26becaedade
Author: Lee Dongjin
Date: 2018-08-30T12:21:30Z

    Specify #selectExpr on KafkaSourceSuiteBase's 'Kafka column types' test
[GitHub] spark issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Struc...
Github user dongjinleekr commented on the issue: https://github.com/apache/spark/pull/22282

As you can see, this PR consists of 3 parts:

1. Extend `UnsafeArrayData`, `UnsafeMapData` (commits 1-6)
2. Implement Kafka headers functionality (commits 7, 10)
3. Update unit tests (commits 8, 9)

I have the following questions:

1. Should I split group 1 out into a separate issue?
2. I found that KafkaSourceSuiteBase's 'Kafka column types' test is missing a select expression. The weird thing is that it works before the update but does not work after it. (That is why the last commit was added - without the select expression, this test does not pass.) Is this intended? Or am I misunderstanding something?

Please have a look when you are free. Thanks.
[GitHub] spark pull request #22282: [SPARK-23539][SS] Add support for Kafka headers i...
Github user dongjinleekr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22282#discussion_r214198620

--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriteTask.scala ---

    @@ -131,9 +158,25 @@ private[kafka010] abstract class KafkaRowWriter(
             throw new IllegalStateException(s"${KafkaWriter.VALUE_ATTRIBUTE_NAME} " +
               s"attribute unsupported type ${t.catalogString}")
         }
    +    val headersExpression = inputSchema
    +      .find(_.name == KafkaWriter.HEADERS_ATTRIBUTE_NAME).getOrElse(
    +        Literal(CatalystTypeConverters.convertToCatalyst(null), MapType(StringType, BinaryType))
    +      )
    +    headersExpression.dataType match {
    +      case MapType(StringType, BinaryType, true) => // good
    +      case t =>
    +        throw new IllegalStateException(s"${KafkaWriter.HEADERS_ATTRIBUTE_NAME} " +

--- End diff --

Just a typo.
[GitHub] spark pull request #22282: [SPARK-23539][SS] Add support for Kafka headers i...
Github user dongjinleekr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22282#discussion_r214345173

--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriteTask.scala ---

    @@ -131,9 +158,25 @@ private[kafka010] abstract class KafkaRowWriter(
             throw new IllegalStateException(s"${KafkaWriter.VALUE_ATTRIBUTE_NAME} " +
               s"attribute unsupported type ${t.catalogString}")
         }
    +    val headersExpression = inputSchema
    +      .find(_.name == KafkaWriter.HEADERS_ATTRIBUTE_NAME).getOrElse(
    +        Literal(CatalystTypeConverters.convertToCatalyst(null), MapType(StringType, BinaryType))
    +      )
    +    headersExpression.dataType match {
    +      case MapType(StringType, BinaryType, true) => // good
    +      case t =>
    +        throw new IllegalStateException(s"${KafkaWriter.HEADERS_ATTRIBUTE_NAME} " +

--- End diff --

Oh, I misunderstood; after reviewing the code, I found that `KafkaRowWriter#createProjection` throws `IllegalStateException` while `KafkaWriter#validateQuery` throws `AnalysisException`. I think the reason is the difference between the two methods: the former detects the error from the state of the `InternalRow`, while the latter does so by analyzing the expression's schema.
[GitHub] spark pull request #22282: [SPARK-23539][SS] Add support for Kafka headers i...
Github user dongjinleekr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22282#discussion_r214345393

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java ---

    @@ -473,17 +474,6 @@ public static UnsafeArrayData fromPrimitiveArray(
         return result;
       }

    -  public static UnsafeArrayData forPrimitiveArray(int offset, int length, int elementSize) {
    -    return fromPrimitiveArray(null, offset, length, elementSize);
    -  }
    -
    -  public static boolean shouldUseGenericArrayData(int elementSize, int length) {

--- End diff --

Thank you for your kind guidance. I dropped the commit removing some methods - it was totally wrong! :)
[GitHub] spark pull request #22282: [SPARK-23539][SS] Add support for Kafka headers i...
Github user dongjinleekr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22282#discussion_r214856258

--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriteTask.scala ---

    @@ -88,7 +92,30 @@ private[kafka010] abstract class KafkaRowWriter(
           throw new NullPointerException(s"null topic present in the data. Use the " +
             s"${KafkaSourceProvider.TOPIC_OPTION_KEY} option for setting a default topic.")
         }
    -    val record = new ProducerRecord[Array[Byte], Array[Byte]](topic.toString, key, value)
    +    val record = if (projectedRow.isNullAt(3)) {
    +      new ProducerRecord[Array[Byte], Array[Byte]](
    +        topic.toString,
    +        null,
    +        key,
    +        value
    +      )
    +    } else {
    +      val headerMap = projectedRow.getMap(3)
    +      val headers = (0 until headerMap.numElements()).toArray.map(
    +        i =>
    +          new RecordHeader(

--- End diff --

As of September 2018, `RecordHeader` is the only implementation provided by Kafka. As you can see [here](https://memorynotfound.com/spring-kafka-adding-custom-header-kafka-message-example/), this usage is widespread. I think it would be more natural for `RecordHeader` to be hidden behind some builder classes, but it is not; it seems like a missing spot.
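The header-materialization step under discussion can be sketched without Spark or Kafka on the classpath. `RecordHeader` below is a local stand-in with the same `(key, value)` shape as `org.apache.kafka.common.header.internals.RecordHeader`, so the snippet is self-contained; in the real writer the input is a map-typed column of an `InternalRow`, simplified here to a `Seq` of pairs.

```scala
// Local stand-in for Kafka's RecordHeader (same key/value shape), used so
// this sketch runs without the kafka-clients jar.
final case class RecordHeader(key: String, value: Array[Byte])

// Turn the row's map-typed headers column (modeled as key/value pairs here)
// into the array of headers a ProducerRecord would receive.
def toHeaders(headerPairs: Seq[(String, Array[Byte])]): Array[RecordHeader] =
  headerPairs.toArray.map { case (k, v) => RecordHeader(k, v) }

val headers = toHeaders(Seq("traceId" -> "abc".getBytes("UTF-8")))
```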
[GitHub] spark pull request #22342: Feature/fix kafka sql trivials
GitHub user dongjinleekr opened a pull request:

    https://github.com/apache/spark/pull/22342

Feature/fix kafka sql trivials

## What changes were proposed in this pull request?

Fix unused imports & outdated comments in the `kafka-0-10-sql` module. (Found while I was working on [SPARK-23539](https://github.com/apache/spark/pull/22282).)

## How was this patch tested?

Existing unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjinleekr/spark feature/fix-kafka-sql-trivials

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22342.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22342

commit 4a5a7e8c88a53f363b58dada46fe9df13a814c0f
Author: Lee Dongjin
Date: 2018-09-05T13:02:44Z

    Remove unused import from KafkaRelation

commit 3b9ea74bc5facf7a17f698bbaac6f08243fc11bc
Author: Lee Dongjin
Date: 2018-09-05T13:05:15Z

    Remove unused import from KafkaOffsetRangeCalculator

commit f116d9e7bb4de94510f67677f92b726f678d2bcc
Author: Lee Dongjin
Date: 2018-09-05T13:13:38Z

    Fix comments on KafkaStreamWriterFactory: it does not extend DataWriterFactory from commit e754887.
[GitHub] spark issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Struc...
Github user dongjinleekr commented on the issue: https://github.com/apache/spark/pull/22282

Retest this please.
[GitHub] spark issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Struc...
Github user dongjinleekr commented on the issue: https://github.com/apache/spark/pull/22282

retest this please.
[GitHub] spark pull request #21525: [SPARK-24513][ML] Attribute support in UnaryTrans...
Github user dongjinleekr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21525#discussion_r216333687

--- Diff: mllib/src/main/scala/org/apache/spark/ml/Transformer.scala ---

    @@ -116,10 +116,17 @@ abstract class UnaryTransformer[IN, OUT, T <: UnaryTransformer[IN, OUT, T]]
         StructType(outputFields)
       }

    +  /**
    +   * Returns [[Metadata]] to be attached to the output column.
    +   */
    +  protected def outputMetadata(outputSchema: StructType, dataset: Dataset[_]): Metadata =
    +    Metadata.empty
    +
       override def transform(dataset: Dataset[_]): DataFrame = {
    -    transformSchema(dataset.schema, logging = true)
    +    val outputSchema = transformSchema(dataset.schema, logging = true)
         val transformUDF = udf(this.createTransformFunc, outputDataType)
    -    dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol
    +    val metadata = outputMetadata(outputSchema, dataset)

--- End diff --

Sorry for the late reply. Here is the answer: **because the ultimate goal is [to make `HashingTF` extend `UnaryTransformer`](https://issues.apache.org/jira/browse/SPARK-13998), not just attaching an attribute**. Yes, you are right: `HashingTF` is an example of how metadata is created and attached to `outputSchema`. However, we need a method that wraps that metadata routine in order to replace `HashingTF extends Transformer with HasInputCol with HasOutputCol` with `HashingTF extends UnaryTransformer`. That is why. (Please refer to Joseph K. Bradley's comment at [SPARK-13998](https://issues.apache.org/jira/browse/SPARK-13998).)
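The hook being discussed is a template-method pattern: the base class supplies an empty default, and a subclass such as a HashingTF-like transformer overrides it. A reduced, self-contained sketch follows; the `Metadata` type and the class names here are stubs standing in for `org.apache.spark.sql.types.Metadata` and the real ML classes, and the attribute key is illustrative.

```scala
// Stub for org.apache.spark.sql.types.Metadata, just enough for the sketch.
final case class Metadata(entries: Map[String, String])
object Metadata { val empty: Metadata = Metadata(Map.empty) }

abstract class UnaryTransformerSketch {
  // Default hook: no metadata attached, mirroring the base-class behavior in the diff.
  protected def outputMetadata(): Metadata = Metadata.empty
  // In the real transform(), this result would be attached to the output column.
  def transform(): Metadata = outputMetadata()
}

// A HashingTF-like subclass overrides the hook to expose attribute information.
class HashingTFSketch(numFeatures: Int) extends UnaryTransformerSketch {
  override protected def outputMetadata(): Metadata =
    Metadata(Map("numFeatures" -> numFeatures.toString))
}

val meta = new HashingTFSketch(262144).transform()
```

The point of the hook is that subclasses only describe their metadata; the base class stays in charge of when and where it is attached.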
[GitHub] spark issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Struc...
Github user dongjinleekr commented on the issue: https://github.com/apache/spark/pull/22282

retest this please.
[GitHub] spark issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Struc...
Github user dongjinleekr commented on the issue: https://github.com/apache/spark/pull/22282

cc/ @zsxwing @tdas @dongjoon-hyun @srowen Rebased onto the latest master. Please have a look when you are free. Thanks in advance.