[GitHub] spark issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Struc...

2018-11-12 Thread dongjinleekr
Github user dongjinleekr commented on the issue:

https://github.com/apache/spark/pull/22282
  
cc/ @zsxwing @tdas @dongjoon-hyun @srowen Rebased onto the latest master. 
Please have a look when you are free. Thanks in advance.


---




[GitHub] spark issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Struc...

2018-09-10 Thread dongjinleekr
Github user dongjinleekr commented on the issue:

https://github.com/apache/spark/pull/22282
  
retest this please.


---




[GitHub] spark pull request #21525: [SPARK-24513][ML] Attribute support in UnaryTrans...

2018-09-10 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21525#discussion_r216333687
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Transformer.scala ---
@@ -116,10 +116,17 @@ abstract class UnaryTransformer[IN, OUT, T <: UnaryTransformer[IN, OUT, T]]
     StructType(outputFields)
   }
 
+  /**
+   * Returns [[Metadata]] to be attached to the output column.
+   */
+  protected def outputMetadata(outputSchema: StructType, dataset: Dataset[_]): Metadata =
+    Metadata.empty
+
   override def transform(dataset: Dataset[_]): DataFrame = {
-    transformSchema(dataset.schema, logging = true)
+    val outputSchema = transformSchema(dataset.schema, logging = true)
     val transformUDF = udf(this.createTransformFunc, outputDataType)
-    dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol))))
+    val metadata = outputMetadata(outputSchema, dataset)
--- End diff --

Sorry for the late reply. Here is the answer: **because the ultimate goal 
is [to make `HashingTF` extend 
`UnaryTransformer`](https://issues.apache.org/jira/browse/SPARK-13998), not 
just to attach an attribute**. Yes, you are right: `HashingTF` is an example of 
how metadata is created and attached to `outputSchema`. However, we need a 
method that wraps that metadata routine, so we can turn `HashingTF extends 
Transformer with HasInputCol with HasOutputCol` into `HashingTF extends 
UnaryTransformer`. That is why. (Please refer to Joseph K. Bradley's comment at 
[SPARK-13998](https://issues.apache.org/jira/browse/SPARK-13998).)
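
As an aside, here is a minimal sketch of how a subclass could use the new hook. `MyTokenCounter` is a hypothetical transformer invented purely for illustration (a stripped-down `HashingTF`), and it assumes the `outputMetadata` method added in this diff:

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.linalg.{SQLDataTypes, Vector, Vectors}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.types.{DataType, Metadata, StructType}

// Hypothetical transformer for illustration: a stripped-down HashingTF
// that counts tokens into a fixed-size vector.
class MyTokenCounter(override val uid: String)
  extends UnaryTransformer[Seq[String], Vector, MyTokenCounter] {

  def this() = this(Identifiable.randomUID("myTokenCounter"))

  private val numFeatures = 1 << 10

  override protected def createTransformFunc: Seq[String] => Vector = { tokens =>
    val counts = tokens
      .groupBy(t => java.lang.Math.floorMod(t.hashCode, numFeatures))
      .map { case (idx, ts) => (idx, ts.size.toDouble) }
      .toSeq
    Vectors.sparse(numFeatures, counts)
  }

  override protected def outputDataType: DataType = SQLDataTypes.VectorType

  // The hook added in this diff: attach ML attributes (here, just the
  // vector size) to the output column, the way HashingTF does today.
  override protected def outputMetadata(
      outputSchema: StructType, dataset: Dataset[_]): Metadata =
    new AttributeGroup($(outputCol), numFeatures).toMetadata()
}
```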


---




[GitHub] spark issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Struc...

2018-09-06 Thread dongjinleekr
Github user dongjinleekr commented on the issue:

https://github.com/apache/spark/pull/22282
  
retest this please.


---




[GitHub] spark issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Struc...

2018-09-06 Thread dongjinleekr
Github user dongjinleekr commented on the issue:

https://github.com/apache/spark/pull/22282
  
Retest this please.


---




[GitHub] spark pull request #22342: Feature/fix kafka sql trivials

2018-09-05 Thread dongjinleekr
GitHub user dongjinleekr opened a pull request:

https://github.com/apache/spark/pull/22342

Feature/fix kafka sql trivials

## What changes were proposed in this pull request?

Remove unused imports and fix outdated comments in the `kafka-0-10-sql` module. 
(Found while I was working on 
[SPARK-23539](https://github.com/apache/spark/pull/22282).)

## How was this patch tested?

Existing unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjinleekr/spark 
feature/fix-kafka-sql-trivials

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22342.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22342


commit 4a5a7e8c88a53f363b58dada46fe9df13a814c0f
Author: Lee Dongjin 
Date:   2018-09-05T13:02:44Z

Remove unused import from KafkaRelation

commit 3b9ea74bc5facf7a17f698bbaac6f08243fc11bc
Author: Lee Dongjin 
Date:   2018-09-05T13:05:15Z

Remove unused import from KafkaOffsetRangeCalculator

commit f116d9e7bb4de94510f67677f92b726f678d2bcc
Author: Lee Dongjin 
Date:   2018-09-05T13:13:38Z

Fix comment on KafkaStreamWriterFactory: it no longer extends 
DataWriterFactory as of commit e754887.




---




[GitHub] spark pull request #22282: [SPARK-23539][SS] Add support for Kafka headers i...

2018-09-04 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22282#discussion_r214856258
  
--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriteTask.scala ---
@@ -88,7 +92,30 @@ private[kafka010] abstract class KafkaRowWriter(
       throw new NullPointerException(s"null topic present in the data. Use the " +
         s"${KafkaSourceProvider.TOPIC_OPTION_KEY} option for setting a default topic.")
     }
-    val record = new ProducerRecord[Array[Byte], Array[Byte]](topic.toString, key, value)
+    val record = if (projectedRow.isNullAt(3)) {
+      new ProducerRecord[Array[Byte], Array[Byte]](
+        topic.toString,
+        null,
+        key,
+        value
+      )
+    } else {
+      val headerMap = projectedRow.getMap(3)
+      val headers = (0 until headerMap.numElements()).toArray.map(
+        i =>
+          new RecordHeader(
--- End diff --

As of September 2018, `RecordHeader` is the only `Header` implementation 
provided by Kafka. As you can see 
[here](https://memorynotfound.com/spring-kafka-adding-custom-header-kafka-message-example/),
this approach is widely used. I think it would be more natural for 
`RecordHeader` to be hidden behind some builder class, but it is not; it seems 
to be a missing spot in the API.
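
For reference, a minimal sketch of the usual pattern, assuming `kafka-clients` 0.11+ on the classpath (the topic and header names here are made up):

```scala
import java.nio.charset.StandardCharsets.UTF_8

import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.header.Header
import org.apache.kafka.common.header.internals.RecordHeader

object HeaderExample extends App {
  // RecordHeader is the only Header implementation shipped with Kafka,
  // so client code instantiates it directly.
  val headers: java.util.List[Header] =
    java.util.Arrays.asList[Header](new RecordHeader("trace-id", "abc-123".getBytes(UTF_8)))

  // ProducerRecord(topic, partition, key, value, headers); partition is left null.
  val record = new ProducerRecord[Array[Byte], Array[Byte]](
    "my-topic", null: Integer, "key".getBytes(UTF_8), "value".getBytes(UTF_8), headers)

  println(record.headers())
}
```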


---




[GitHub] spark pull request #22282: [SPARK-23539][SS] Add support for Kafka headers i...

2018-08-31 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22282#discussion_r214345393
  
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java ---
@@ -473,17 +474,6 @@ public static UnsafeArrayData fromPrimitiveArray(
     return result;
   }
 
-  public static UnsafeArrayData forPrimitiveArray(int offset, int length, int elementSize) {
-    return fromPrimitiveArray(null, offset, length, elementSize);
-  }
-
-  public static boolean shouldUseGenericArrayData(int elementSize, int length) {
--- End diff --

Thank you for your kind guidance. I dropped the commit that removed those 
methods - it was totally wrong! :)


---




[GitHub] spark pull request #22282: [SPARK-23539][SS] Add support for Kafka headers i...

2018-08-31 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22282#discussion_r214345173
  
--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriteTask.scala ---
@@ -131,9 +158,25 @@ private[kafka010] abstract class KafkaRowWriter(
         throw new IllegalStateException(s"${KafkaWriter.VALUE_ATTRIBUTE_NAME} " +
           s"attribute unsupported type ${t.catalogString}")
     }
+    val headersExpression = inputSchema
+      .find(_.name == KafkaWriter.HEADERS_ATTRIBUTE_NAME).getOrElse(
+        Literal(CatalystTypeConverters.convertToCatalyst(null), MapType(StringType, BinaryType))
+      )
+    headersExpression.dataType match {
+      case MapType(StringType, BinaryType, true) => // good
+      case t =>
+        throw new IllegalStateException(s"${KafkaWriter.HEADERS_ATTRIBUTE_NAME} " +
--- End diff --

Oh, I misunderstood. After reviewing the code, I found that 
`KafkaRowWriter#createProjection` throws `IllegalStateException` while 
`KafkaWriter#validateQuery` throws `AnalysisException`. I think the reason 
lies in the difference between the two methods: the former detects the error 
from the state of the `InternalRow`, while the latter does so by analyzing the 
expression's schema.
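
To illustrate the distinction, a sketch of the two failure modes (names and shapes here are illustrative only, not the actual Spark code):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.types.{BinaryType, MapType, StringType}

object ErrorKinds {
  // Analysis time: validateQuery only looks at the expression's data type,
  // before any data flows - in Spark that failure surfaces as an
  // AnalysisException.
  def headersTypeOk(attr: Attribute): Boolean = attr.dataType match {
    case MapType(StringType, BinaryType, _) => true
    case _ => false
  }

  // Runtime: createProjection's output is checked per InternalRow, so a
  // violation there is a broken invariant of the running task - an
  // IllegalStateException.
  def requireProjectedShape(row: InternalRow): Unit =
    if (row.numFields != 4) {
      throw new IllegalStateException(
        s"projected row has ${row.numFields} fields, expected 4")
    }
}
```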


---




[GitHub] spark pull request #22282: [SPARK-23539][SS] Add support for Kafka headers i...

2018-08-30 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22282#discussion_r214198620
  
--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriteTask.scala ---
@@ -131,9 +158,25 @@ private[kafka010] abstract class KafkaRowWriter(
         throw new IllegalStateException(s"${KafkaWriter.VALUE_ATTRIBUTE_NAME} " +
           s"attribute unsupported type ${t.catalogString}")
     }
+    val headersExpression = inputSchema
+      .find(_.name == KafkaWriter.HEADERS_ATTRIBUTE_NAME).getOrElse(
+        Literal(CatalystTypeConverters.convertToCatalyst(null), MapType(StringType, BinaryType))
+      )
+    headersExpression.dataType match {
+      case MapType(StringType, BinaryType, true) => // good
+      case t =>
+        throw new IllegalStateException(s"${KafkaWriter.HEADERS_ATTRIBUTE_NAME} " +
--- End diff --

Just a typo.


---




[GitHub] spark issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Struc...

2018-08-30 Thread dongjinleekr
Github user dongjinleekr commented on the issue:

https://github.com/apache/spark/pull/22282
  
As you can see, this PR consists of 3 parts:

1. Extend `UnsafeArrayData` and `UnsafeMapData` (commits 1~6)
2. Implement the Kafka headers functionality (commits 7 and 10)
3. Update the unit tests (commits 8 and 9)

I have the following questions:

1. Should I split group 1 out into a separate issue?
2. I found that `KafkaSourceSuiteBase`'s 'Kafka column types' test is missing 
a select expression. The weird thing is that it passes before this update but 
not after. (That is why the last commit was added: without the explicit 
selection, the test does not pass.) Is this intentional, or am I 
misunderstanding something?

Please have a look when you are free. Thanks.


---




[GitHub] spark pull request #22282: [SPARK-23539][SS] Add support for Kafka headers i...

2018-08-30 Thread dongjinleekr
GitHub user dongjinleekr opened a pull request:

https://github.com/apache/spark/pull/22282

[SPARK-23539][SS] Add support for Kafka headers in Structured Streaming

## What changes were proposed in this pull request?

This update adds support for Kafka headers in Structured 
Streaming.

## How was this patch tested?

With following unit tests:

- KafkaRelationSuite: "default starting and ending offsets with headers" 
(new)
- KafkaSinkSuite: "batch - write to kafka" (updated)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjinleekr/spark feature/SPARK-23539

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22282.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22282


commit ddd08612ef8bdb173f974059d2dc6311e1c7d9a3
Author: Lee Dongjin 
Date:   2018-08-26T13:19:52Z

Remove unused methods from UnsafeArrayData

commit 2af13899ab052cc7b52c25b57f154b78a2c45b2a
Author: Lee Dongjin 
Date:   2018-08-26T13:25:25Z

Implement UnsafeArrayData#fromBinaryArray

commit a8e5c5c0f478a795af1236771236da2074093f3e
Author: Lee Dongjin 
Date:   2018-08-27T12:22:18Z

Implement UnsafeArrayData#fromStringArray

commit 2ca181046cf1102aed14f4957e11e4dd901ba3c7
Author: Lee Dongjin 
Date:   2018-08-27T13:28:58Z

Implement UnsafeMapData#of

commit d0d746d99d0a19ecbb2dc098589adbfd1ef0b5ae
Author: Lee Dongjin 
Date:   2018-08-29T13:25:57Z

Allow empty UnsafeArrayData: does not throw IllegalArgumentException on 
empty or null array argument anymore.

commit b459cf3f391d6e4ee9cb77a7b5ed510d027d9ddd
Author: Lee Dongjin 
Date:   2018-08-28T07:50:59Z

Fix invalid formatting: UnsafeArraySuite

commit f077c5d75a83df3541a95a628726e1d74af8c153
Author: Lee Dongjin 
Date:   2018-08-28T07:51:30Z

Implement Kafka headers functionality

commit 6b4d7754d01e4211d05c83746c848fdfd873f229
Author: Lee Dongjin 
Date:   2018-08-29T09:39:55Z

Add KafkaTestUtils#{sendMessage, sendMessages(String, Array[(String, 
String, Array[(String, String)])], Option[Int])}

commit c7fb9819989056da4910e2cfc81af332cd603d41
Author: Lee Dongjin 
Date:   2018-08-29T13:24:09Z

Extend KafkaRelationSuite, KafkaSinkSuite to test headers functionality

commit dd2d9390478e4c69b01a7c699e28bfe923ef0db1
Author: Lee Dongjin 
Date:   2018-08-30T10:50:38Z

Minor refinements

commit 229aac85442b03736fc850cae2c3b26becaedade
Author: Lee Dongjin 
Date:   2018-08-30T12:21:30Z

Specify #selectExpr on KafkaSourceSuiteBase's 'Kafka column types' test




---




[GitHub] spark issue #21501: [SPARK-15064][ML] Locale support in StopWordsRemover

2018-06-12 Thread dongjinleekr
Github user dongjinleekr commented on the issue:

https://github.com/apache/spark/pull/21501
  
@mengxr Thank you very much for your kind guidance. If you have some time, 
please have a look at my other PR, #21525. I found this issue while I was 
reading [an issue on 
HashingTF](https://issues.apache.org/jira/browse/SPARK-13998).
@viirya Thank you again for your comments. Without you, I could never have 
completed this feature.
@dongjoon-hyun 😄


---




[GitHub] spark issue #21525: [SPARK-24513][ML] Attribute support in UnaryTransformer

2018-06-12 Thread dongjinleekr
Github user dongjinleekr commented on the issue:

https://github.com/apache/spark/pull/21525
  
@jkbradley Excuse me. Could you have a look when you are free? Thanks.


---




[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-12 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194623958
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl
                           typeConverter=TypeConverters.toListString)
     caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " +
                           "comparison over the stop words", typeConverter=TypeConverters.toBoolean)
+    locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive " +
+                   "is true", typeConverter=TypeConverters.toString)
--- End diff --

Thank you for the comment but... is that necessary?


---




[GitHub] spark issue #21525: [SPARK-24513][ML] Attribute support in UnaryTransformer

2018-06-11 Thread dongjinleekr
Github user dongjinleekr commented on the issue:

https://github.com/apache/spark/pull/21525
  
If needed, I can propose a draft version of 
[SPARK-13998](https://issues.apache.org/jira/browse/SPARK-13998) implemented on 
top of this work.


---




[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-11 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194342986
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * Locale of the input for case insensitive matching. Ignored when [[caseSensitive]]
+   * is true.
+   * Default: Locale.getDefault.toString
+   * @see `StopWordsRemover.loadDefaultStopWords()`
--- End diff --

... OMG; please wait.


---




[GitHub] spark issue #21501: [SPARK-15064][ML] Locale support in StopWordsRemover

2018-06-11 Thread dongjinleekr
Github user dongjinleekr commented on the issue:

https://github.com/apache/spark/pull/21501
  
Finally, I made it work by adding `StopWordsRemover.getDefaultLocale`! How 
about this approach? @mengxr @viirya
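
Roughly, the helper is just a single source of truth for the default locale string; a minimal sketch (the actual name and signature in the commit may differ):

```scala
import java.util.Locale

// Minimal sketch: a companion-object helper as a single source of truth
// for the default locale string, callable from Scala and, via Py4J, from
// PySpark.
object StopWordsRemoverLocale {
  def getDefaultLocale: String = Locale.getDefault.toString
}
```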


---




[GitHub] spark pull request #21525: [SPARK-24513][ML] Attribute support in UnaryTrans...

2018-06-11 Thread dongjinleekr
GitHub user dongjinleekr opened a pull request:

https://github.com/apache/spark/pull/21525

[SPARK-24513][ML] Attribute support in UnaryTransformer

## What changes were proposed in this pull request?

This PR adds Metadata support in `UnaryTransformer`, as a preliminary work 
of [SPARK-13998](https://issues.apache.org/jira/browse/SPARK-13998) and 
[SPARK-13964](https://issues.apache.org/jira/browse/SPARK-13964).

## How was this patch tested?

unit test: `build/mvn -Dtest=none 
-DwildcardSuites=org.apache.spark.ml.feature.* test`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjinleekr/spark feature/SPARK-24513

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21525.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21525


commit 18b5b25b42e01680e873b915c9b77bfd724b3a45
Author: Lee Dongjin 
Date:   2018-06-11T08:49:06Z

[SPARK-24513][ML] Attribute support in UnaryTransformer




---




[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-10 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194296251
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2610,6 +2610,9 @@ def setParams(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive
         Sets params for this StopWordRemover.
         """
         kwargs = self._input_kwargs
+        if locale is None:
+            sc = SparkContext._active_spark_context
+            kwargs['locale'] = sc._gateway.jvm.org.spark.ml.util.LocaleUtils.getDefaultLocale()
--- End diff --

@viirya You mean `locale=SparkContext._active_spark_context.(...)` over 
`locale=None` with the ugly if statement, right?


---




[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-09 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194222407
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * Locale of the input for case insensitive matching. Ignored when [[caseSensitive]]
+   * is true.
+   * Default: Locale.getDefault.toString
+   * @see `StopWordsRemover.loadDefaultStopWords()`
+   * @group param
+   */
+  @Since("2.4.0")
+  val locale: Param[String] = new Param[String](this, "locale",
+    "Locale of the input for case insensitive matching. Ignored when caseSensitive is true.",
+    ParamValidators.inArray[String](Locale.getAvailableLocales.map(_.toString)))
+
+  /** @group setParam */
+  @Since("2.4.0")
+  def setLocale(value: String): this.type = set(locale, value)
--- End diff --

@mengxr What do you think? I think supporting two setters is reasonable, but 
it can introduce some asymmetrical signatures.
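
For instance, the asymmetry would look something like this (an illustrative stand-in, not the actual Param machinery):

```scala
import java.util.Locale

// Illustrative stand-in for the Param machinery, showing the shape of the
// two setters and where the asymmetry creeps in: the getter can only hand
// back the String form.
class LocaleParamSketch {
  private var localeValue: String = Locale.getDefault.toString

  def setLocale(value: String): this.type = { localeValue = value; this }
  def setLocale(value: Locale): this.type = { localeValue = value.toString; this }

  // Only one getter is possible, and it returns a String.
  def getLocale: String = localeValue
}
```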


---




[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-09 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194222392
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * Locale of the input for case insensitive matching. Ignored when [[caseSensitive]]
+   * is true.
+   * Default: Locale.getDefault.toString
--- End diff --

As far as I understand, the locale affects how the lowercased string is 
produced; it is not related to the stop words themselves. @mengxr That is why 
you recommended changing the default locale, right?
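
A quick illustration of why the locale matters for lowercasing - the classic Turkish dotless-i case:

```scala
import java.util.Locale

object LocaleLowercase extends App {
  // String#toLowerCase is locale-sensitive: Turkish maps "I" to the
  // dotless "ı", so a stop-word list lowercased under en_US will not
  // match input lowercased under tr_TR.
  println("I".toLowerCase(Locale.US))               // i
  println("I".toLowerCase(new Locale("tr", "TR")))  // ı (U+0131)
}
```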


---




[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-06 Thread dongjinleekr
GitHub user dongjinleekr opened a pull request:

https://github.com/apache/spark/pull/21501

[SPARK-15064][ML] Locale support in StopWordsRemover

## What changes were proposed in this pull request?

Add locale support for `StopWordsRemover`.

## How was this patch tested?

[Scala|Python] unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjinleekr/spark feature/SPARK-15064

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21501.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21501


commit 8379d88360651e0571698412e4aecc305ae7fe60
Author: Lee Dongjin 
Date:   2018-03-21T15:03:17Z

Implement SPARK-15064

commit 682df4afe6859119d01de1fa953f95ebaacb4889
Author: Lee Dongjin 
Date:   2018-06-05T11:00:17Z

Add test cases for StopWordsRemoverSuite

commit 67ae6ebaebbb537f24db51be1f96759d5c6463ff
Author: Lee Dongjin 
Date:   2018-06-06T07:43:45Z

Implement pyspark parity




---




[GitHub] spark issue #17564: [MINOR][DOCS] Fix spacings in Structured Streaming Progr...

2017-04-11 Thread dongjinleekr
Github user dongjinleekr commented on the issue:

https://github.com/apache/spark/pull/17564
  
@srowen I just completed the review. I could not find any other typos in 
this document. :)


---



[GitHub] spark issue #17564: [MINOR][DOCS] Fix spacings in Structured Streaming Progr...

2017-04-07 Thread dongjinleekr
Github user dongjinleekr commented on the issue:

https://github.com/apache/spark/pull/17564
  
No problem! I will notify you after the review.


---



[GitHub] spark pull request #17564: [MINOR][DOCS] Fix spacings in Structured Streamin...

2017-04-07 Thread dongjinleekr
GitHub user dongjinleekr opened a pull request:

https://github.com/apache/spark/pull/17564

[MINOR][DOCS] Fix spacings in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

1. Added a missing space between sentences: `... on static data.The Spark SQL 
engine will ...` -> `... on static data. The Spark SQL engine will ...`
2. Added a missing colon in the Output Model section.

## How was this patch tested?

None.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjinleekr/spark 
feature/fix-programming-guide

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17564.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17564


commit e10843d0c6e1a45ef507f8a4c7514c0694570abf
Author: Lee Dongjin <dong...@apache.org>
Date:   2017-04-07T08:17:25Z

[MINOR][DOCS] Fix spacings in Structured Streaming Programming Guide




---



[GitHub] spark pull request #17303: [SPARK-19112][CORE] add codec for ZStandard

2017-03-15 Thread dongjinleekr
GitHub user dongjinleekr opened a pull request:

https://github.com/apache/spark/pull/17303

[SPARK-19112][CORE] add codec for ZStandard

## What changes were proposed in this pull request?

Hadoop[^1] and HBase[^2] started to support ZStandard compression in their 
recent releases. This update enables saving a file in HDFS using the ZStandard 
codec, by implementing `ZStandardCodec`. It also adds a new 
configuration for the default compression level, for example 
`spark.io.compression.zstandard.level`.

[^1]: https://issues.apache.org/jira/browse/HADOOP-13578
[^2]: https://issues.apache.org/jira/browse/HBASE-16710
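
For reference, a rough sketch of the shape of such a codec, assuming the stream classes from the `zstd-jni` library (the actual PR may wire the configuration differently):

```scala
import java.io.{InputStream, OutputStream}

import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}

import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

// Sketch only: the compression level is read from the proposed
// spark.io.compression.zstandard.level setting, defaulting to 1.
class ZStandardCompressionCodecSketch(conf: SparkConf) extends CompressionCodec {
  private val level = conf.getInt("spark.io.compression.zstandard.level", 1)

  override def compressedOutputStream(s: OutputStream): OutputStream =
    new ZstdOutputStream(s, level)

  override def compressedInputStream(s: InputStream): InputStream =
    new ZstdInputStream(s)
}
```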

## How was this patch tested?

3 additional unit tests in `CompressionCodecSuite.scala`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjinleekr/spark feature/SPARK-19112

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17303.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17303


commit 1927b91d0d8621e9e2dc2a88a93e07780cfc66bf
Author: Lee Dongjin <dong...@apache.org>
Date:   2017-03-15T08:09:56Z

Implement ZStandardCompressionCodec




---



[GitHub] spark issue #17225: [CORE] Support ZStandard Compression

2017-03-15 Thread dongjinleekr
Github user dongjinleekr commented on the issue:

https://github.com/apache/spark/pull/17225
  
@HyukjinKwon Thanks for the information. It seems like both the JIRA issue 
and my PR are messed up - I will re-create the PR against the JIRA issue.


---



[GitHub] spark pull request #17225: [CORE] Support ZStandard Compression

2017-03-15 Thread dongjinleekr
Github user dongjinleekr closed the pull request at:

https://github.com/apache/spark/pull/17225


---



[GitHub] spark pull request #17225: [CORE] Support ZStandard Compression

2017-03-09 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/17225#discussion_r105342714
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -49,13 +50,14 @@ private[spark] object CompressionCodec {
 
   private[spark] def supportsConcatenationOfSerializedStreams(codec: CompressionCodec): Boolean = {
     (codec.isInstanceOf[SnappyCompressionCodec] || codec.isInstanceOf[LZFCompressionCodec]
-      || codec.isInstanceOf[LZ4CompressionCodec])
+      || codec.isInstanceOf[LZ4CompressionCodec] || codec.isInstanceOf[ZStandardCompressionCodec])
   }
 
   private val shortCompressionCodecNames = Map(
     "lz4" -> classOf[LZ4CompressionCodec].getName,
     "lzf" -> classOf[LZFCompressionCodec].getName,
-    "snappy" -> classOf[SnappyCompressionCodec].getName)
+    "snappy" -> classOf[SnappyCompressionCodec].getName,
+    "zstd" -> classOf[SnappyCompressionCodec].getName)
--- End diff --

OMG, it is a typo.
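
For the record, the fix is just pointing the new key at the right class; in context (this fragment assumes the surrounding `CompressionCodec` object and the codec classes from this PR), the corrected map would presumably read:

```scala
private val shortCompressionCodecNames = Map(
  "lz4" -> classOf[LZ4CompressionCodec].getName,
  "lzf" -> classOf[LZFCompressionCodec].getName,
  "snappy" -> classOf[SnappyCompressionCodec].getName,
  "zstd" -> classOf[ZStandardCompressionCodec].getName)
```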


---



[GitHub] spark pull request #17225: [CORE] Support ZStandard Compression

2017-03-09 Thread dongjinleekr
GitHub user dongjinleekr opened a pull request:

https://github.com/apache/spark/pull/17225

[CORE] Support ZStandard Compression

## What changes were proposed in this pull request?

Hadoop will support ZStandard compression starting from 2.9.0. This update 
enables saving a file in HDFS using the ZStandard codec, by implementing 
`ZStandardCodec`.

## How was this patch tested?

3 additional unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjinleekr/spark feature/z-compression

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17225.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17225


commit 01850446779b398bae281391de205cbda28d3efc
Author: Lee Dongjin <dong...@apache.org>
Date:   2017-03-09T15:43:56Z

Implement ZStandardCompressionCodec.




---



[GitHub] spark issue #16830: [MINOR][CORE] Fix incorrect documentation of WritableCon...

2017-02-07 Thread dongjinleekr
Github user dongjinleekr commented on the issue:

https://github.com/apache/spark/pull/16830
  
Does this exactly correspond to your intention?


---



[GitHub] spark issue #16830: [MINOR][CORE] Fix incorrect documentation of WritableCon...

2017-02-07 Thread dongjinleekr
Github user dongjinleekr commented on the issue:

https://github.com/apache/spark/pull/16830
  
Updated with the problem description. Please have a look when you are free. 
I will review the other documentation descriptions ...


---



[GitHub] spark pull request #16830: [MINOR][CORE] Fix incorrect documentation of Writ...

2017-02-07 Thread dongjinleekr
GitHub user dongjinleekr opened a pull request:

https://github.com/apache/spark/pull/16830

[MINOR][CORE] Fix incorrect documentation of WritableConverter

## What changes were proposed in this pull request?

`WritableConverter` and `WritableFactory` work in opposite directions, but 
both of them are documented with the same description:

> A class encapsulating how to convert some type T to Writable. It stores 
both the Writable class corresponding to T (e.g. IntWritable for Int) and a 
function for doing the conversion.

This error is a result of commit d37978d, which resolves 
[SPARK-4795](https://issues.apache.org/jira/browse/SPARK-4795) by redesigning 
the "primitive type => Writable" implicit APIs.

This PR fixes the documentation of `WritableConverter`.
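
A simplified sketch of the two directions (the real classes also carry a `writableClass` field; these stand-ins are for illustration only):

```scala
import org.apache.hadoop.io.{IntWritable, Writable}

// Simplified stand-ins: WritableConverter goes Writable => T (reading
// sequence files), while WritableFactory goes T => Writable (writing
// them) - opposite directions, so they cannot share one doc comment.
case class SimpleWritableConverter[T](convert: Writable => T)
case class SimpleWritableFactory[T](convert: T => Writable)

object Directions extends App {
  val toInt = SimpleWritableConverter[Int](w => w.asInstanceOf[IntWritable].get)
  val fromInt = SimpleWritableFactory[Int](i => new IntWritable(i))

  println(toInt.convert(fromInt.convert(42)))  // 42, round-tripped
}
```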

## How was this patch tested?

Existing tests (no functional change anyway)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjinleekr/spark 
feature/fix-writableconverter-doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16830.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16830


commit 6c04422d7ec343ba6766980e5fa9985e27659de8
Author: Lee Dongjin <dong...@apache.org>
Date:   2017-02-07T11:23:02Z

Fix documentation of WritableConverter




---