[GitHub] spark pull request #18032: [SPARK-20806][DEPLOY] Launcher: redundant check f...

2017-05-19 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/18032

[SPARK-20806][DEPLOY] Launcher: redundant check for Spark lib dir

## What changes were proposed in this pull request?

Remove redundant check for libdir in CommandBuilderUtils

## How was this patch tested?

Existing tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-20806

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18032.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18032






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18030: [SPARK-20798] GenerateUnsafeProjection should che...

2017-05-19 Thread ala
Github user ala commented on a diff in the pull request:

https://github.com/apache/spark/pull/18030#discussion_r117433160
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala ---
@@ -50,10 +50,15 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro
   fieldTypes: Seq[DataType],
   bufferHolder: String): String = {
 val fieldEvals = fieldTypes.zipWithIndex.map { case (dt, i) =>
-  val fieldName = ctx.freshName("fieldName")
-  val code = s"final ${ctx.javaType(dt)} $fieldName = ${ctx.getValue(input, dt, i.toString)};"
-  val isNull = s"$input.isNullAt($i)"
-  ExprCode(code, isNull, fieldName)
+  val javaType = ctx.javaType(dt)
+  val isNullVar = ctx.freshName("isNull")
+  val valueVar = ctx.freshName("value")
+  val defaultValue = ctx.defaultValue(dt)
+  val readValue = ctx.getValue(input, dt, i.toString)
+  val code = s"""
+   boolean $isNullVar = $input.isNullAt($i);
+   $javaType $valueVar = $isNullVar ? $defaultValue : $readValue;"""
--- End diff --

Fixed. Now it looks like the rest of the file.
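For reference, the pattern the diff converges on can be sketched as a small helper that emits the null check before the read. This is an illustrative sketch only; the names (`genFieldEval`, `fresh`) are hypothetical stand-ins, not Spark's actual codegen API.

```scala
// Hypothetical sketch of the field-evaluation codegen pattern in the diff:
// test isNullAt first, and only read the value when the slot is non-null,
// falling back to the type's default. `fresh` stands in for ctx.freshName.
def genFieldEval(
    input: String,
    javaType: String,
    defaultValue: String,
    readValue: String,
    idx: Int,
    fresh: String => String): String = {
  val isNullVar = fresh("isNull")
  val valueVar = fresh("value")
  s"""boolean $isNullVar = $input.isNullAt($idx);
     |$javaType $valueVar = $isNullVar ? $defaultValue : $readValue;""".stripMargin
}
```

The key point, as in the `+` lines of the diff, is that the generated ternary never dereferences a null slot.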





[GitHub] spark pull request #18033: Add compression/decompression of column data to C...

2017-05-19 Thread kiszk
GitHub user kiszk opened a pull request:

https://github.com/apache/spark/pull/18033

Add compression/decompression of column data to ColumnVector

## What changes were proposed in this pull request?

This PR adds compression/decompression of column data to `ColumnVector`. 
While the current `CachedBatch` can compress column data using multiple 
compression schemes, `ColumnVector` cannot. Compression is mandatory for the 
table cache.

As a first step, this PR enables `RunLengthEncoding` for 
boolean/byte/short/int/long and `BooleanBitSet` for boolean. Follow-up JIRAs 
will add further compression schemes.

At a high level, when `ColumnVector.compress()` is called, data is compressed 
from an array of a primitive data type into a byte array in `ColumnVector`. 
When `ColumnVector.decompress()` is called, data is decompressed from that 
byte array back into an array of the primitive data type. For both 
operations, `ArrayBuffer` is used to access the data.


This PR adds and changes the following APIs:

`ArrayBuffer`
* This new class is similar to `java.nio.ByteBuffer`. `ArrayBuffer` can wrap 
an array of any primitive data type, such as `Array[Int]` or `Array[Long]`, 
and manages the current position being accessed.

`ColumnType.get(buffer: ArrayBuffer): jvmType`, `ColumnType.put(buffer: ArrayBuffer)`
* These APIs get a primitive value from, or put a primitive value into, the 
current position of the given `ArrayBuffer`.

`Encoder.gatherCompressibilityStats(in: ArrayBuffer)`
* This API calculates the uncompressed and compressed sizes for a given 
compression method.

`Encoder.compress(from: ArrayBuffer, to: ArrayBuffer): Unit`
* This API compresses the data in `from` and stores the compressed data in 
`to`. `to` must hold a byte array large enough for the compressed data.

`Decoder.decompress(values: ArrayBuffer): Unit`
* This API decompresses the data provided to the `Decoder`'s constructor and 
stores the uncompressed data in `values`. `values` must hold a byte array 
large enough for the uncompressed data.
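As a minimal illustration of the `RunLengthEncoding` idea this PR enables: consecutive equal values collapse into (value, run-length) pairs. This sketch works on a plain `Int` array; the real `Encoder`/`Decoder` operate over the `ArrayBuffer` wrapper and serialize to byte arrays, so this is only the conceptual core.

```scala
import scala.collection.mutable

// Minimal run-length encoding sketch: collapse consecutive equal values into
// (value, runLength) pairs, and expand them back on decode.
def rleEncode(values: Array[Int]): Array[(Int, Int)] = {
  val runs = mutable.ArrayBuffer.empty[(Int, Int)]
  for (v <- values) {
    if (runs.nonEmpty && runs.last._1 == v) {
      runs(runs.length - 1) = (v, runs.last._2 + 1)  // extend the current run
    } else {
      runs += ((v, 1))                               // start a new run
    }
  }
  runs.toArray
}

def rleDecode(runs: Array[(Int, Int)]): Array[Int] =
  runs.flatMap { case (v, n) => Array.fill(n)(v) }
```

Run-length encoding pays off exactly where the PR applies it first: low-cardinality columns (boolean/byte/short/int/long) with long runs of repeated values.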

## How was this patch tested?

Added new test suites

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kiszk/spark SPARK-20807

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18033.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18033


commit 6d5497ef38b3efff6ac1b1b48fe9e873f5c9394a
Author: Kazuaki Ishizaki 
Date:   2017-05-19T09:33:38Z

initial commit







[GitHub] spark issue #17936: [SPARK-20638][Core]Optimize the CartesianRDD to reduce r...

2017-05-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17936
  
How much of a difference does this make, compared with caching the two RDDs 
before doing the cartesian?





[GitHub] spark issue #17880: [SPARK-20620][TEST]Improve some unit tests for NullExpre...

2017-05-19 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/17880
  
I have fixed the Scala style issues.
The tests have not started; could you help trigger them? Thanks @HyukjinKwon 
@gatorsmile





[GitHub] spark issue #18033: Add compression/decompression of column data to ColumnVe...

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18033
  
Merged build finished. Test FAILed.





[GitHub] spark pull request #18034: [SPARK-20797][MLLIB]fix LocalLDAModel.save() bug.

2017-05-19 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/18034#discussion_r117443669
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala ---
@@ -468,7 +469,16 @@ object LocalLDAModel extends Loader[LocalLDAModel] {
   val topics = Range(0, k).map { topicInd =>
 Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), 
topicInd)
   }
-  spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
+
+  val bufferSize = Utils.byteStringAsBytes(
+spark.conf.get("spark.kryoserializer.buffer.max", "64m"))
+  // We calculate the approximate size of the model
+  // We only calculate the array size, considering an
+  // average string size of 15 bytes, the formula is:
+  // (floatSize * vectorSize + 15) * numWords
+  val approxSize = (4L * k + 15) * topicsMatrix.numRows
+  val nPartitions = ((approxSize / bufferSize) + 1).toInt
+  spark.createDataFrame(topics).repartition(nPartitions).write.parquet(Loader.dataPath(path))
--- End diff --

The problem is that this writes multiple files. I don't think you can do 
that.
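For reference, the sizing arithmetic in the diff reduces to the following sketch (4 bytes per float coefficient plus an assumed average of 15 bytes per string, as stated in the diff's comment; the function name is illustrative). Whether writing more than one file is acceptable is exactly the point under review here.

```scala
// Sketch of the partition-count computation from the diff: estimate the
// serialized model size as (floatSize * k + avgStringSize) * numWords and
// split it so each partition stays under the Kryo buffer limit.
def numPartitions(k: Int, numWords: Long, bufferSize: Long): Int = {
  val approxSize = (4L * k + 15) * numWords
  ((approxSize / bufferSize) + 1).toInt
}
```

With the default 64 MB `spark.kryoserializer.buffer.max`, small models still map to a single partition, while very large vocabularies spread across several.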





[GitHub] spark issue #18034: [SPARK-20797][MLLIB]fix LocalLDAModel.save() bug.

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18034
  
Can one of the admins verify this patch?





[GitHub] spark issue #17455: [Spark-20044][Web UI] Support Spark UI behind front-end ...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17455
  
**[Test build #3730 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3730/testReport)**
 for PR 17455 at commit 
[`af314fd`](https://github.com/apache/spark/commit/af314fd58f3f79a7dab670368dd7cb8417187868).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18035: [MINOR][SPARKR][ML] Fix coefficients issue and code clea...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18035
  
**[Test build #77094 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77094/testReport)**
 for PR 18035 at commit 
[`1ed3ba0`](https://github.com/apache/spark/commit/1ed3ba0c031b6d45fdaa025f3831020417ce164d).





[GitHub] spark issue #17936: [SPARK-20638][Core]Optimize the CartesianRDD to reduce r...

2017-05-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17936
  
I agree with @srowen. This adds quite a bit of complexity. If there is not 
much difference compared with caching the RDDs before doing the cartesian (or 
other approaches), it may not be worth the complexity.





[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

2017-05-19 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18031#discussion_r117443085
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
 } else {
   0
 }
+val threshold1 = Option(SparkEnv.get)
+  .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
+  .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
+val threshold2 = avgSize * Option(SparkEnv.get)
+  .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
+  .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
+val threshold = math.max(threshold1, threshold2)
--- End diff --

Suppose each map task produces a 90 MB bucket and many small buckets (skewed 
data); then avgSize is very small, and the threshold would be 100 MB. If the 
number of map tasks is large, OOM can still happen, right?
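The threshold logic under discussion reduces to the following sketch, with the skew scenario above plugged in (the function name and the 100 MB / 2x values are illustrative, not the actual defaults):

```scala
// Sketch of the threshold in the diff: a block's size is recorded accurately
// only if it exceeds both the absolute setting and timesAvg * avgSize.
def accurateThreshold(avgSize: Long, absolute: Long, timesAvg: Long): Long =
  math.max(absolute, avgSize * timesAvg)
```

In the skewed case, many tiny blocks keep `avgSize` small, so the effective threshold is driven by the absolute setting, and a 90 MB block still falls below it and gets summarized by the (tiny) average, which is the OOM risk being raised.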





[GitHub] spark issue #17936: [SPARK-20638][Core]Optimize the CartesianRDD to reduce r...

2017-05-19 Thread ConeyLiu
Github user ConeyLiu commented on the issue:

https://github.com/apache/spark/pull/17936
  
`Broadcast` first has to fetch all the blocks to the driver and cache them 
locally; then the executors fetch them from the driver. I think that's really 
time consuming.





[GitHub] spark issue #17868: [SPARK-20607][CORE]Add new unit tests to ShuffleSuite

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17868
  
**[Test build #3734 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3734/testReport)**
 for PR 17868 at commit 
[`b6eff6e`](https://github.com/apache/spark/commit/b6eff6e70ed98bdc396490382c2da0c0d96acbc2).





[GitHub] spark issue #17940: [SPARK-20687][MLLIB] mllib.Matrices.fromBreeze may crash...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17940
  
**[Test build #3733 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3733/testReport)**
 for PR 17940 at commit 
[`b40a706`](https://github.com/apache/spark/commit/b40a706e4fa5733d8205c31efd48ff67b9c75575).





[GitHub] spark issue #17869: [SPARK-20609][CORE]Run the SortShuffleSuite unit tests h...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17869
  
**[Test build #3736 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3736/testReport)**
 for PR 17869 at commit 
[`e901625`](https://github.com/apache/spark/commit/e9016258978908dcc317700330ec61eae6d5439d).





[GitHub] spark issue #18013: [SPARK-20781] the location of Dockerfile in docker.prope...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18013
  
**[Test build #3735 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3735/testReport)**
 for PR 18013 at commit 
[`a0830c1`](https://github.com/apache/spark/commit/a0830c16d7bbb6f55e0f21643513720248ffe71b).





[GitHub] spark issue #17936: [SPARK-20638][Core]Optimize the CartesianRDD to reduce r...

2017-05-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/17936
  
@viirya , this is slightly different from caching an RDD. It is more like 
broadcasting: the final state is that each executor holds the whole data of 
rdd2. The difference is that the sync is executor-to-executor, not 
driver-to-executor.

I also have a similar concern. The performance can vary by workload; we'd 
better test several different workloads to see whether the improvement is 
general.





[GitHub] spark issue #17936: [SPARK-20638][Core]Optimize the CartesianRDD to reduce r...

2017-05-19 Thread ConeyLiu
Github user ConeyLiu commented on the issue:

https://github.com/apache/spark/pull/17936
  
Sorry for the mistake; these test results are for the cached situation:
| Time (optimized) | Time (baseline) | Speedup |
| --| -- | -- |
| 15.877s | 2827.373s | 178x |
| 16.781s | 2809.502s | 167x |
| 16.320s | 2845.699s | 174x |
| 19.437s | 2860.387s | 147x |
| 16.793s | 2931.667s | 174x |

Test case:
```
// Imports were omitted in the original snippet.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object TestNetflixlib {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test Netflix mlib")
    val sc = new SparkContext(conf)

    val data = sc.textFile("hdfs://10.1.2.173:9000/nf_training_set.txt")

    val ratings = data.map(_.split("::") match {
      case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
    })

    val rank = 0
    val numIterations = 10
    val model = ALS.train(ratings, rank, numIterations, 0.01)
    val user = model.userFeatures
    val item = model.productFeatures
    val start = System.nanoTime()
    val rate = user.cartesian(item)
    println(rate.count())
    val time = (System.nanoTime() - start) / 1e9
    println(time)
  }
}
```

The RDDs (user and item) should be cached.





[GitHub] spark issue #17698: [SPARK-20403][SQL]Modify the instructions of some functi...

2017-05-19 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/17698
  
@gatorsmile   I have added test cases to the file `cast.sql` , thanks.





[GitHub] spark issue #17936: [SPARK-20638][Core]Optimize the CartesianRDD to reduce r...

2017-05-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17936
  
@jerryshao Since you mentioned broadcasting, another question might be: can we 
just use broadcasting to achieve similar performance without these changes?
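For reference, the pairing step of the broadcast alternative looks like this on plain collections (a hedged sketch; in Spark it would be `sc.broadcast(small.collect())` followed by a `flatMap` over the large RDD, and it is only viable when the small side fits in driver and executor memory):

```scala
// The core of a broadcast-based cartesian, shown on local collections:
// every element of the large side is paired with the whole (broadcast)
// small side. In Spark this runs inside flatMap on the large RDD.
def localCartesian[T, U](big: Seq[T], small: Seq[U]): Seq[(T, U)] =
  big.flatMap(t => small.map(u => (t, u)))
```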





[GitHub] spark issue #17992: [SPARK-20759] SCALA_VERSION in _config.yml should be con...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17992
  
**[Test build #3732 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3732/testReport)**
 for PR 17992 at commit 
[`7f0c75a`](https://github.com/apache/spark/commit/7f0c75a6d39632a65f112f06832cc7ad430d9b10).





[GitHub] spark pull request #18035: [MINOR][SPARKR][ML] Fix coefficients issue and co...

2017-05-19 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/18035

[MINOR][SPARKR][ML] Fix coefficients issue and code cleanup for SparkR 
linear SVM.

## What changes were proposed in this pull request?
Fix coefficients issue and code cleanup for SparkR linear SVM.

## How was this patch tested?
Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark svm-r

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18035.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18035


commit 1ed3ba0c031b6d45fdaa025f3831020417ce164d
Author: Yanbo Liang 
Date:   2017-05-19T11:33:27Z

Code reorg and cleanup for SparkR linear SVM.







[GitHub] spark issue #18014: [SPARK-20783][SQL] Enhance ColumnVector to keep UnsafeAr...

2017-05-19 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/18014
  
@cloud-fan What do you think?





[GitHub] spark issue #17936: [SPARK-20638][Core]Optimize the CartesianRDD to reduce r...

2017-05-19 Thread ConeyLiu
Github user ConeyLiu commented on the issue:

https://github.com/apache/spark/pull/17936
  
OK, I'll add it. From the test data, the performance improvement is still very 
clear, coming mainly from reduced network and disk overhead.





[GitHub] spark issue #18030: [SPARK-20798] GenerateUnsafeProjection should check if a...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18030
  
**[Test build #77090 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77090/testReport)**
 for PR 18030 at commit 
[`eea789d`](https://github.com/apache/spark/commit/eea789d74d819bc712165659b9c278dd6197ba43).





[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

2017-05-19 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18031#discussion_r117440029
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
 } else {
   0
 }
+val threshold1 = Option(SparkEnv.get)
+  .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
+  .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
+val threshold2 = avgSize * Option(SparkEnv.get)
+  .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
+  .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
+val threshold = math.max(threshold1, threshold2)
--- End diff --

Just out of curiosity: is there a reason we compute the threshold this way?





[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

2017-05-19 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18031#discussion_r117440204
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -121,48 +126,69 @@ private[spark] class CompressedMapStatus(
 }
 
 /**
- * A [[MapStatus]] implementation that only stores the average size of non-empty blocks,
- * plus a bitmap for tracking which blocks are empty.
+ * A [[MapStatus]] implementation that stores the accurate size of huge blocks, which are larger
+ * than both spark.shuffle.accurateBlockThreshold and
+ * spark.shuffle.accurateBlockThresholdByTimesAverage * averageSize. It stores the
+ * average size of other non-empty blocks, plus a bitmap for tracking which blocks are empty.
  *
  * @param loc location where the task is being executed
  * @param numNonEmptyBlocks the number of non-empty blocks
  * @param emptyBlocks a bitmap tracking which blocks are empty
  * @param avgSize average size of the non-empty blocks
+ * @param hugeBlockSizes sizes of huge blocks by their reduceId.
  */
 private[spark] class HighlyCompressedMapStatus private (
 private[this] var loc: BlockManagerId,
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
-private[this] var avgSize: Long)
+private[this] var avgSize: Long,
+@transient private var hugeBlockSizes: Map[Int, Byte])
--- End diff --

Can the size of this map become large?





[GitHub] spark issue #18032: [SPARK-20806][DEPLOY] Launcher: redundant check for Spar...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18032
  
**[Test build #77089 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77089/testReport)**
 for PR 18032 at commit 
[`df74d04`](https://github.com/apache/spark/commit/df74d048e8dc87e118000e164af713e86c053d56).





[GitHub] spark pull request #17936: [SPARK-20638][Core]Optimize the CartesianRDD to r...

2017-05-19 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17936#discussion_r117429928
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/CartesianRDD.scala ---
@@ -71,9 +72,92 @@ class CartesianRDD[T: ClassTag, U: ClassTag](
   }
 
   override def compute(split: Partition, context: TaskContext): 
Iterator[(T, U)] = {
+val blockManager = SparkEnv.get.blockManager
 val currSplit = split.asInstanceOf[CartesianPartition]
-for (x <- rdd1.iterator(currSplit.s1, context);
- y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
+val blockId2 = RDDBlockId(rdd2.id, currSplit.s2.index)
+var cachedInLocal = false
+var holdReadLock = false
+
+// Try to get data from the local, otherwise it will be cached to the 
local.
+def getOrElseCache(
--- End diff --

Btw, can we move those functions out of `compute`? Too many nested 
functions here and making `compute` too big.
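
A minimal sketch of the refactor being suggested — hoisting the nested helper out of `compute` into a private method so `compute` stays a thin driver. The class and method names below are illustrative only, not the PR's actual code:

```scala
// Sketch: instead of defining helpers inside compute, keep compute small
// and delegate to a named private method. Illustrative shell, not an RDD.
class CartesianLike[T, U](left: Seq[T], right: Seq[U]) {

  // compute stays a thin driver over the two inputs...
  def compute(): Iterator[(T, U)] =
    for (x <- left.iterator; y <- cachedRight()) yield (x, y)

  // ...while the (here trivial) caching logic lives in its own method,
  // where it can be documented and tested in isolation.
  private def cachedRight(): Iterator[U] = right.iterator
}
```

The same extraction applies to each nested function in the diff; the behavior is unchanged, only the nesting is reduced.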





[GitHub] spark pull request #17936: [SPARK-20638][Core]Optimize the CartesianRDD to r...

2017-05-19 Thread ConeyLiu
Github user ConeyLiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17936#discussion_r117432268
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/CartesianRDD.scala ---
@@ -71,9 +72,92 @@ class CartesianRDD[T: ClassTag, U: ClassTag](
   }
 
   override def compute(split: Partition, context: TaskContext): 
Iterator[(T, U)] = {
+val blockManager = SparkEnv.get.blockManager
 val currSplit = split.asInstanceOf[CartesianPartition]
-for (x <- rdd1.iterator(currSplit.s1, context);
- y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
+val blockId2 = RDDBlockId(rdd2.id, currSplit.s2.index)
+var cachedInLocal = false
+var holdReadLock = false
+
+// Try to get data from the local, otherwise it will be cached to the 
local.
+def getOrElseCache(
--- End diff --

Ok, I will change it too.





[GitHub] spark issue #18004: [SPARK-18838][CORE] Introduce blocking strategy for Live...

2017-05-19 Thread bOOm-X
Github user bOOm-X commented on the issue:

https://github.com/apache/spark/pull/18004
  
@markhamstra, @vanzin: Can I have a review, please?





[GitHub] spark issue #18016: [SPARK-20786][SQL]Improve ceil and floor handle the valu...

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18016
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18016: [SPARK-20786][SQL]Improve ceil and floor handle the valu...

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18016
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77087/
Test PASSed.





[GitHub] spark pull request #18030: [SPARK-20798] GenerateUnsafeProjection should che...

2017-05-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18030





[GitHub] spark pull request #17936: [SPARK-20638][Core]Optimize the CartesianRDD to r...

2017-05-19 Thread ConeyLiu
Github user ConeyLiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17936#discussion_r117427240
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/CartesianRDD.scala ---
@@ -71,9 +72,92 @@ class CartesianRDD[T: ClassTag, U: ClassTag](
   }
 
   override def compute(split: Partition, context: TaskContext): 
Iterator[(T, U)] = {
+val blockManager = SparkEnv.get.blockManager
 val currSplit = split.asInstanceOf[CartesianPartition]
-for (x <- rdd1.iterator(currSplit.s1, context);
- y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
+val blockId2 = RDDBlockId(rdd2.id, currSplit.s2.index)
+var cachedInLocal = false
+var holdReadLock = false
+
+// Try to get data from the local, otherwise it will be cached to the 
local.
+def getOrElseCache(
+rdd: RDD[U],
+partition: Partition,
+context: TaskContext,
+level: StorageLevel): Iterator[U] = {
+  getLocalValues() match {
+case Some(result) =>
+  return result
+case None => if (holdReadLock) {
+  throw new SparkException(s"get() failed for block $blockId2 even 
though we held a lock")
+}
+  }
+
+  val iterator = rdd.iterator(partition, context)
+  if (rdd.getStorageLevel != StorageLevel.NONE || 
rdd.isCheckpointedAndMaterialized) {
+// If the block is cached in local, wo shouldn't cache it again.
--- End diff --

Ok, I'll change it, thanks very much.





[GitHub] spark issue #17992: [SPARK-20759] SCALA_VERSION in _config.yml should be con...

2017-05-19 Thread liu-zhaokun
Github user liu-zhaokun commented on the issue:

https://github.com/apache/spark/pull/17992
  
@srowen 
Hi, do you know why this PR can't pass the test? I don't think it's my problem.





[GitHub] spark issue #17880: [SPARK-20620][TEST]Improve some unit tests for NullExpre...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17880
  
**[Test build #3731 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3731/testReport)**
 for PR 17880 at commit 
[`3110f0f`](https://github.com/apache/spark/commit/3110f0f0c1a09b28a5706674ae65fd47ce48b163).





[GitHub] spark pull request #18034: [SPARK-20797][MLLIB]fix LocalLDAModel.save() bug.

2017-05-19 Thread d0evi1
GitHub user d0evi1 opened a pull request:

https://github.com/apache/spark/pull/18034

[SPARK-20797][MLLIB]fix LocalLDAModel.save() bug.

## What changes were proposed in this pull request?

LocalLDAModel's save function has a bug:

please see: https://issues.apache.org/jira/browse/SPARK-20797

Add a repartition step, similar to word2vec's save method, to avoid this bug.

## How was this patch tested?

It's hard to test; it needs large data trained with the online LDA method.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/d0evi1/spark mllib-mod-pr

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18034.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18034









[GitHub] spark issue #18033: Add compression/decompression of column data to ColumnVe...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18033
  
**[Test build #77091 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77091/testReport)**
 for PR 18033 at commit 
[`6d5497e`](https://github.com/apache/spark/commit/6d5497ef38b3efff6ac1b1b48fe9e873f5c9394a).





[GitHub] spark issue #17770: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17770
  
**[Test build #77088 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77088/testReport)**
 for PR 17770 at commit 
[`6a7204c`](https://github.com/apache/spark/commit/6a7204c0fc00dbe2e43d6d65e722b3b13c3b35d0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17770: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17770
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77088/
Test PASSed.





[GitHub] spark issue #18033: [SPARK-20807][SQL] Add compression/decompression of colu...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18033
  
**[Test build #77092 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77092/testReport)**
 for PR 18033 at commit 
[`193a71b`](https://github.com/apache/spark/commit/193a71bb30cd38c5ca3d3c234bf2f1e2b8210f11).





[GitHub] spark issue #17770: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17770
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18011: [SPARK-19089][SQL] Add support for nested sequences

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18011
  
**[Test build #77093 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77093/testReport)**
 for PR 18011 at commit 
[`39e037c`](https://github.com/apache/spark/commit/39e037c22df05dd101b58ebbd33183e6c3a3ef9f).





[GitHub] spark issue #17923: [SPARK-20591][WEB UI] Succeeded tasks num not equal in a...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17923
  
**[Test build #3737 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3737/testReport)**
 for PR 17923 at commit 
[`d80f140`](https://github.com/apache/spark/commit/d80f14013d9a55fc1050220eb59a166a9eb9743d).





[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

2017-05-19 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18031#discussion_r117461089
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
 } else {
   0
 }
+val threshold1 = Option(SparkEnv.get)
+  .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
+  .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
+val threshold2 = avgSize * Option(SparkEnv.get)
+  
.map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
+  
.getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
+val threshold = math.max(threshold1, threshold2)
--- End diff --

Yes, the case you mentioned above is a really good one. But setting `spark.shuffle.accurateBlockThreshold` means we accept sacrificing accuracy for blocks smaller than that threshold. If we want it to be more accurate, we set it larger (in this case we can set it to 50M); then the size of the big bucket will be accurate.





[GitHub] spark issue #17936: [SPARK-20638][Core]Optimize the CartesianRDD to reduce r...

2017-05-19 Thread ConeyLiu
Github user ConeyLiu commented on the issue:

https://github.com/apache/spark/pull/17936
  
Yeah, I think I can do the performance comparison.





[GitHub] spark issue #18033: Add compression/decompression of column data to ColumnVe...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18033
  
**[Test build #77091 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77091/testReport)**
 for PR 18033 at commit 
[`6d5497e`](https://github.com/apache/spark/commit/6d5497ef38b3efff6ac1b1b48fe9e873f5c9394a).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class ArrayBuffer(array: Array[_]) `
  * `class ColumnVectorCompressionBuilder[T <: AtomicType](dataType: T) `





[GitHub] spark issue #18033: Add compression/decompression of column data to ColumnVe...

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18033
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77091/
Test FAILed.





[GitHub] spark issue #17936: [SPARK-20638][Core]Optimize the CartesianRDD to reduce r...

2017-05-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17936
  
@jerryshao Yeah, the reason I mentioned caching is to know how much re-computing the RDD costs in terms of performance. It seems to me that if re-computing is much more costly than transferring the data, only caching can be helpful.





[GitHub] spark issue #18030: [SPARK-20798] GenerateUnsafeProjection should check if a...

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18030
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18030: [SPARK-20798] GenerateUnsafeProjection should check if a...

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18030
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77090/
Test PASSed.





[GitHub] spark issue #18032: [SPARK-20806][DEPLOY] Launcher: redundant check for Spar...

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18032
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18030: [SPARK-20798] GenerateUnsafeProjection should check if a...

2017-05-19 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/18030
  
LGTM - merging to master. Thanks!





[GitHub] spark issue #17880: [SPARK-20620][TEST]Improve some unit tests for NullExpre...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17880
  
**[Test build #3731 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3731/testReport)**
 for PR 17880 at commit 
[`3110f0f`](https://github.com/apache/spark/commit/3110f0f0c1a09b28a5706674ae65fd47ce48b163).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18031: [SPARK-20801] Record accurate size of blocks in M...

2017-05-19 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18031#discussion_r117460170
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -193,8 +219,27 @@ private[spark] object HighlyCompressedMapStatus {
 } else {
   0
 }
+val threshold1 = Option(SparkEnv.get)
+  .map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD))
+  .getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD.defaultValue.get)
+val threshold2 = avgSize * Option(SparkEnv.get)
+  
.map(_.conf.get(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE))
+  
.getOrElse(config.SHUFFLE_ACCURATE_BLOCK_THRESHOLD_BY_TIMES_AVERAGE.defaultValue.get)
+val threshold = math.max(threshold1, threshold2)
--- End diff --

@wzhfy 
Thanks for taking time review this :) 
This PR is based on the discussion in #16989. The idea is to avoid underestimating big blocks in `HighlyCompressedMapStatus` while controlling its size at the same time.
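
A minimal sketch of the threshold logic under discussion: the reporting threshold is the max of an absolute config value and a multiple of the average block size, and only blocks at or above it keep their accurate sizes. The object and method names here are illustrative, with the config plumbing simplified away:

```scala
// Sketch: blocks at or above the threshold are recorded accurately;
// all other non-empty blocks share the average size.
object ThresholdSketch {

  // max(spark.shuffle.accurateBlockThreshold,
  //     spark.shuffle.accurateBlockThresholdByTimesAverage * avgSize)
  def threshold(absoluteThreshold: Long, timesAverage: Long, avgSize: Long): Long =
    math.max(absoluteThreshold, timesAverage * avgSize)

  // Map of reduceId -> accurate size for the "huge" blocks only,
  // mirroring the hugeBlockSizes map in the quoted diff.
  def hugeBlocks(sizes: Array[Long], t: Long): Map[Int, Long] =
    sizes.zipWithIndex.collect { case (s, i) if s >= t => i -> s }.toMap
}
```

Because the threshold scales with the average, the `hugeBlockSizes` map only grows for genuine outliers, which is what keeps the status size bounded.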





[GitHub] spark issue #17936: [SPARK-20638][Core]Optimize the CartesianRDD to reduce r...

2017-05-19 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/17936
  
I see. I think at least we should make this cache mechanism controllable by a flag. I'm guessing that in some HPC clusters or single-node clusters this problem is not so severe.
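
A sketch of gating the behavior behind a flag, as suggested above. The config key and class names here are hypothetical, for illustration only — not an actual Spark setting:

```scala
// Sketch: read a boolean flag with a safe default and branch on it,
// so the new caching path is opt-in. "spark.cartesian.cacheRemoteBlocks"
// is a made-up key used only for this example.
class ConfSketch(entries: Map[String, String]) {
  def getBoolean(key: String, default: Boolean): Boolean =
    entries.get(key).map(_.toBoolean).getOrElse(default)
}

object FlagGate {
  def computePath(conf: ConfSketch): String = {
    val useCache = conf.getBoolean("spark.cartesian.cacheRemoteBlocks", default = false)
    if (useCache) "cached-path" else "original-path"
  }
}
```

Defaulting the flag to `false` preserves the original behavior for clusters where the re-computation cost is not severe.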





[GitHub] spark issue #17936: [SPARK-20638][Core]Optimize the CartesianRDD to reduce r...

2017-05-19 Thread ConeyLiu
Github user ConeyLiu commented on the issue:

https://github.com/apache/spark/pull/17936
  
I did not directly test this situation. But I have tested this PR against the latest `ALS` (after merging #17742). In `ALS`, both RDDs are cached, and the iterator is also grouped (`iterator.grouped`). You can see the test results above; I will provide the direct test next week due to server maintenance.





[GitHub] spark issue #18016: [SPARK-20786][SQL]Improve ceil and floor handle the valu...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18016
  
**[Test build #77087 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77087/testReport)**
 for PR 18016 at commit 
[`1698e80`](https://github.com/apache/spark/commit/1698e805fdded2cc6760bc399a8f8e1724facfb4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17936: [SPARK-20638][Core]Optimize the CartesianRDD to reduce r...

2017-05-19 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17936
  
It seems it should still be better than the original cartesian, since it saves re-computing the RDD and re-transferring the data?





[GitHub] spark issue #18030: [SPARK-20798] GenerateUnsafeProjection should check if a...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18030
  
**[Test build #77090 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77090/testReport)**
 for PR 18030 at commit 
[`eea789d`](https://github.com/apache/spark/commit/eea789d74d819bc712165659b9c278dd6197ba43).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18032: [SPARK-20806][DEPLOY] Launcher: redundant check for Spar...

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18032
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77089/
Test PASSed.





[GitHub] spark issue #18032: [SPARK-20806][DEPLOY] Launcher: redundant check for Spar...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18032
  
**[Test build #77089 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77089/testReport)**
 for PR 18032 at commit 
[`df74d04`](https://github.com/apache/spark/commit/df74d048e8dc87e118000e164af713e86c053d56).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18023: [SPARK-12139] [SQL] REGEX Column Specification

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18023
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18023: [SPARK-12139] [SQL] REGEX Column Specification

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18023
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77108/
Test PASSed.





[GitHub] spark issue #17996: [SPARK-20506][DOCS] 2.2 migration guide

2017-05-19 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/17996
  
right, I reviewed them 
- 
[this](https://github.com/apache/spark/pull/17996/files#diff-a9770b923a4959616bc2126d4afd61eaR35)
 in ML could also affect R
- [this](https://github.com/apache/spark/blame/master/docs/sparkr.md#L657) 
in R might be good to include in ML guide too





[GitHub] spark pull request #18038: [MINOR][SPARKRSQL]Remove unnecessary comment in S...

2017-05-19 Thread lys0716
GitHub user lys0716 opened a pull request:

https://github.com/apache/spark/pull/18038

[MINOR][SPARKRSQL]Remove unnecessary comment in SqlBase.g4

## What changes were proposed in this pull request?

The issue (https://github.com/antlr/antlr4/issues/781) referenced in the 
comment has since been fixed (it was closed as a duplicate), so the comment is 
no longer needed.

## How was this patch tested?

Existing tests.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lys0716/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18038.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18038


commit 31e000ff87b299d9adfd4b8ee010817c57734bca
Author: lys0716 
Date:   2017-05-20T04:42:05Z

Remove unnecessary comment in SqlBase.g4

The issue (https://github.com/antlr/antlr4/issues/781) referenced in the 
comment has since been fixed (it was closed as a duplicate), so the comment is 
no longer needed.







[GitHub] spark issue #16697: [SPARK-19358][CORE] LiveListenerBus shall log the event ...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16697
  
**[Test build #77111 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77111/testReport)**
 for PR 16697 at commit 
[`554cd39`](https://github.com/apache/spark/commit/554cd391b3ddb5fb3f7c52950610e832ad40047b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16697: [SPARK-19358][CORE] LiveListenerBus shall log the event ...

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16697
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16697: [SPARK-19358][CORE] LiveListenerBus shall log the event ...

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16697
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77111/
Test PASSed.





[GitHub] spark pull request #17723: [SPARK-20434][YARN][CORE] Move kerberos delegatio...

2017-05-19 Thread mgummelt
Github user mgummelt commented on a diff in the pull request:

https://github.com/apache/spark/pull/17723#discussion_r117595679
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/security/HadoopAccessManager.scala 
---
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.security
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+
+/**
+ * Methods in [[HadoopAccessManager]] return scheduler-specific 
information related to how Hadoop
+ * delegation tokens should be fetched.
+ */
+private[spark] trait HadoopAccessManager {
+
+  /** The user allowed to renew delegation tokens */
+  def getTokenRenewer: String
+
+  /** The renewal interval, or [[None]] if the token shouldn't be renewed 
*/
+  def getTokenRenewalInterval: Option[Long]
--- End diff --

That's a good point.  They don't.  I have it factored out like this because 
the code which computes it depends on `spark.yarn.principal`, which of course 
should be specific to YARN, but once Mesos keytab/principal support gets 
merged, that variable will likely be deprecated and replaced with 
`spark.kerberos.principal` or something.

I'll go ahead and remove this function from this trait and put it back in 
`HadoopFSCredentialProvider`.  It will use `spark.yarn.principal` until we 
rename it later.
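
As an aside, a scheduler-specific implementation of the trait quoted above 
could look roughly like the following sketch. The trait is re-declared locally 
(without the Hadoop imports) so the snippet is self-contained; the 
`YarnLikeAccessManager` class, its config keys, and the renewal interval are 
all hypothetical, not Spark's actual code:

```scala
// Minimal local re-declaration of the trait quoted in the diff above,
// without the Hadoop imports, so this sketch compiles on its own.
trait HadoopAccessManager {
  /** The user allowed to renew delegation tokens. */
  def getTokenRenewer: String

  /** The renewal interval, or None if the token shouldn't be renewed. */
  def getTokenRenewalInterval: Option[Long]
}

// Hypothetical YARN-style implementation: renew as the configured principal,
// and only renew at all when a principal is present.
class YarnLikeAccessManager(conf: Map[String, String]) extends HadoopAccessManager {
  override def getTokenRenewer: String =
    conf.getOrElse("spark.yarn.principal", "yarn")

  // 24h is an illustrative interval, not a value taken from Spark.
  override def getTokenRenewalInterval: Option[Long] =
    conf.get("spark.yarn.principal").map(_ => 24L * 60 * 60 * 1000)
}
```

Factoring the renewer and interval behind a trait like this is what would let 
the YARN-specific `spark.yarn.principal` dependency be swapped out later, as 
the comment above anticipates.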





[GitHub] spark issue #17966: [SPARK-20727] Skip tests that use Hadoop utils on CRAN W...

2017-05-19 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/17966
  
Sorry I've been out traveling -- I'll try to update this by tonight





[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

2017-05-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17967#discussion_r117602233
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala 
---
@@ -38,29 +38,35 @@ import org.apache.spark.sql.types._
 private[feature] trait RFormulaBase extends HasFeaturesCol with 
HasLabelCol {
 
   /**
-   * Param for how to order labels of string column. The first label after 
ordering is assigned
-   * an index of 0.
-   * Options are:
-   *   - 'frequencyDesc': descending order by label frequency (most 
frequent label assigned 0)
-   *   - 'frequencyAsc': ascending order by label frequency (least 
frequent label assigned 0)
-   *   - 'alphabetDesc': descending alphabetical order
-   *   - 'alphabetAsc': ascending alphabetical order
-   * Default is 'frequencyDesc'.
-   * When the ordering is set to 'alphabetDesc', `RFormula` drops the same 
category as R
-   * when encoding strings.
+   * Param for how to order categories of a FEATURE string column used by 
`StringIndexer`.
+   * The last category after ordering is dropped when encoding strings.
+   * The options are explained using an example string: 'b', 'a', 'b', 
'a', 'c', 'b'
+   * |
+   * | Option | Category mapped to 0 by StringIndexer |  Category dropped 
by RFormula
--- End diff --

BTW, the table appears to be hidden in the API documentation. My suggestion is 
prose with a simple `-` list, or matching whatever format other classes in this 
package use if there are similar instances.

That said, keeping the current format is fine if it looks okay to you, 
@felixcheung.
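
For illustration, the four ordering options described in the diff can be 
sketched in plain Scala. This is a hypothetical re-implementation applied to 
the example string column 'b', 'a', 'b', 'a', 'c', 'b', not Spark's 
`StringIndexer` code, and it breaks frequency ties alphabetically for 
determinism:

```scala
// Returns the categories in index order (the first element maps to index 0;
// RFormula drops the last element when encoding strings).
def orderCategories(labels: Seq[String], orderType: String): Seq[String] = {
  val counts = labels.groupBy(identity).map { case (l, occ) => (l, occ.size) }
  orderType match {
    // most frequent label first (assigned index 0), ties broken alphabetically
    case "frequencyDesc" => counts.toSeq.sortBy { case (l, c) => (-c, l) }.map(_._1)
    // least frequent label first
    case "frequencyAsc"  => counts.toSeq.sortBy { case (l, c) => (c, l) }.map(_._1)
    case "alphabetDesc"  => counts.keys.toSeq.sorted.reverse
    case "alphabetAsc"   => counts.keys.toSeq.sorted
  }
}
```

Under 'frequencyDesc', `b` (the most frequent label) maps to index 0, and `c`, 
the last category after ordering, is the one dropped when encoding.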





[GitHub] spark pull request #18039: [SPARK-20751][SQL] Add cot test in MathExpression...

2017-05-19 Thread wangyum
GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/18039

[SPARK-20751][SQL] Add cot test in MathExpressionsSuite

## What changes were proposed in this pull request?

Add cot test in MathExpressionsSuite as 
https://github.com/apache/spark/pull/17999#issuecomment-302832794.

## How was this patch tested?

unit tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-20751-test

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18039.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18039


commit 57719f1177884cc136a279bc3e4a6ad3c0da174d
Author: Yuming Wang 
Date:   2017-05-20T04:42:55Z

Add cot test in MathExpressionsSuite.







[GitHub] spark issue #18039: [SPARK-20751][SQL] Add cot test in MathExpressionsSuite

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18039
  
**[Test build #77113 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77113/testReport)**
 for PR 18039 at commit 
[`57719f1`](https://github.com/apache/spark/commit/57719f1177884cc136a279bc3e4a6ad3c0da174d).





[GitHub] spark issue #17981: [SPARK-15767][ML][SparkR] Decision Tree wrapper in Spark...

2017-05-19 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/17981
  
any more comment?





[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...

2017-05-19 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/12646
  
Jenkins is about to shut down, we can retest this later





[GitHub] spark issue #17978: [SPARK-20736][Python] PySpark StringIndexer supports Str...

2017-05-19 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/17978
  
I'd hold this for another 3-4 days just in case..





[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...

2017-05-19 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/12646
  
retest this please





[GitHub] spark issue #16697: [SPARK-19358][CORE] LiveListenerBus shall log the event ...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16697
  
**[Test build #77111 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77111/testReport)**
 for PR 16697 at commit 
[`554cd39`](https://github.com/apache/spark/commit/554cd391b3ddb5fb3f7c52950610e832ad40047b).





[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17967
  
**[Test build #77110 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77110/testReport)**
 for PR 17967 at commit 
[`5f31d31`](https://github.com/apache/spark/commit/5f31d311c0c39da1968686dd4147376b3888cee3).





[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12646
  
**[Test build #77112 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77112/testReport)**
 for PR 12646 at commit 
[`11d5c10`](https://github.com/apache/spark/commit/11d5c1034b11f7b0bf1eeb1c5600e6c1b7739ad2).





[GitHub] spark issue #17966: [SPARK-20727] Skip tests that use Hadoop utils on CRAN W...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17966
  
**[Test build #77114 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77114/testReport)**
 for PR 17966 at commit 
[`701`](https://github.com/apache/spark/commit/701daf238993547b9ec77465f39427c4e1ed).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17966: [SPARK-20727] Skip tests that use Hadoop utils on CRAN W...

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17966
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17966: [SPARK-20727] Skip tests that use Hadoop utils on CRAN W...

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17966
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77114/
Test PASSed.





[GitHub] spark issue #18023: [SPARK-12139] [SQL] REGEX Column Specification

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18023
  
**[Test build #77108 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77108/testReport)**
 for PR 18023 at commit 
[`612bedf`](https://github.com/apache/spark/commit/612bedf9bb1181687fa536d2e927923901c19582).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17967: [SPARK-14659][ML] RFormula consistent with R when...

2017-05-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17967#discussion_r117602143
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala 
---
@@ -38,29 +38,35 @@ import org.apache.spark.sql.types._
 private[feature] trait RFormulaBase extends HasFeaturesCol with 
HasLabelCol {
 
   /**
-   * Param for how to order labels of string column. The first label after 
ordering is assigned
-   * an index of 0.
-   * Options are:
-   *   - 'frequencyDesc': descending order by label frequency (most 
frequent label assigned 0)
-   *   - 'frequencyAsc': ascending order by label frequency (least 
frequent label assigned 0)
-   *   - 'alphabetDesc': descending alphabetical order
-   *   - 'alphabetAsc': ascending alphabetical order
-   * Default is 'frequencyDesc'.
-   * When the ordering is set to 'alphabetDesc', `RFormula` drops the same 
category as R
-   * when encoding strings.
+   * Param for how to order categories of a FEATURE string column used by 
`StringIndexer`.
+   * The last category after ordering is dropped when encoding strings.
+   * The options are explained using an example string: 'b', 'a', 'b', 
'a', 'c', 'b'
+   * |
+   * | Option | Category mapped to 0 by StringIndexer |  Category dropped 
by RFormula
--- End diff --

Hm, to my knowledge it does not. I guess @actuaryzhang meant to just write 
these out as they are? Let me double-check myself.

Scaladoc

https://cloud.githubusercontent.com/assets/6477701/26273032/0d97fd84-3d62-11e7-8f18-1c89f539b1ae.png


Javadoc

https://cloud.githubusercontent.com/assets/6477701/26273031/0bac57cc-3d62-11e7-8875-6f897b093633.png







[GitHub] spark pull request #18034: [SPARK-20797][MLLIB]fix LocalLDAModel.save() bug.

2017-05-19 Thread d0evi1
Github user d0evi1 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18034#discussion_r117602669
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala ---
@@ -468,7 +469,16 @@ object LocalLDAModel extends Loader[LocalLDAModel] {
   val topics = Range(0, k).map { topicInd =>
 Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), 
topicInd)
   }
-  
spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
+
+  val bufferSize = Utils.byteStringAsBytes(
+spark.conf.get("spark.kryoserializer.buffer.max", "64m"))
+  // We calculate the approximate size of the model
+  // We only calculate the array size, considering an
+  // average string size of 15 bytes, the formula is:
+  // (floatSize * vectorSize + 15) * numWords
+  val approxSize = (4L * k + 15) * topicsMatrix.numRows
+  val nPartitions = ((approxSize / bufferSize) + 1).toInt
+  
spark.createDataFrame(topics).repartition(nPartitions).write.parquet(Loader.dataPath(path))
--- End diff --

Why not? I think it does work. The multiple parquet files may be in random 
order, but the topic indices are saved with the data. When the load process 
runs, parquet restores the dataframe; you can check LocalLDAModel's load 
method, which scans all of the dataframe's rows and uses the topic indices to 
rebuild the (topic x vocab) breeze matrix.
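
The sizing heuristic in the diff above can be written as a standalone 
function. The constants (4 bytes per float, an average of 15 bytes of string 
overhead per word) are taken from the diff's comment; the function itself is 
an illustrative sketch, not the code under review:

```scala
// Approximate the serialized model size and split it so that each partition
// stays under the serializer buffer limit (plus one partition of slack).
def numPartitions(k: Int, vocabSize: Long, bufferSizeBytes: Long): Int = {
  // (floatSize * k + 15) bytes per vocabulary word, per the diff's comment
  val approxSize = (4L * k + 15) * vocabSize
  ((approxSize / bufferSizeBytes) + 1).toInt
}
```

With the default 64 MB kryo buffer, a small model still gets a single 
partition, while a 1000-topic model over a 100k-word vocabulary is split into 
six.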





[GitHub] spark pull request #16648: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

2017-05-19 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/16648#discussion_r117602817
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -145,11 +145,85 @@ class CodegenContext {
*
* They will be kept as member variables in generated classes like 
`SpecificProjection`.
*/
-  val mutableStates: mutable.ArrayBuffer[(String, String, String)] =
-mutable.ArrayBuffer.empty[(String, String, String)]
+  val mutableState: mutable.ListBuffer[(String, String, String)] =
+mutable.ListBuffer.empty[(String, String, String)]
 
-  def addMutableState(javaType: String, variableName: String, initCode: 
String): Unit = {
-mutableStates += ((javaType, variableName, initCode))
+  // An array keyed by the tuple of mutable states' types and 
initialization codes, holds the
+  // current max index of the array
+  var mutableStateArrayIdx: mutable.Map[(String, String), Int] =
+mutable.Map.empty[(String, String), Int]
+
+  // An array keyed by the tuple of mutable states' types and 
initialization codes, holds the name
+  // of the mutableStateArray into which state of the given key will be 
compacted
+  var mutableStateArrayNames: mutable.Map[(String, String), String] =
+mutable.Map.empty[(String, String), String]
+
+  // An array keyed by the tuple of mutable states' types and 
initialization codes, holds the code
+  // that will initialize the mutableStateArray when initialized in loops
+  var mutableStateArrayInitCodes: mutable.Map[(String, String), String] =
+mutable.Map.empty[(String, String), String]
+
+  /**
+   * Adds an instance of globally-accessible mutable state. Mutable state 
may either be inlined
+   * as a private member variable to the class, or it may be compacted 
into arrays of the same
+   * type and initialization in order to avoid Constant Pool limit errors 
for both state declaration
+   * and initialization.
+   *
+   * We compact state into arrays when we can anticipate variables of the 
same type and initCode
+   * may appear numerous times. Variable names with integer suffixes (as 
given by the `freshName`
+   * function), that are either simply assigned (null or no 
initialization) or are primitive are
+   * good candidates for array compaction, as these variables types are 
likely to appear numerous
+   * times, and can be easily initialized in loops.
+   *
+   * @param javaType the javaType
+   * @param variableName the variable name
+   * @param initCode the initialization code for the variable
+   * @return the name of the mutable state variable, which is either the 
original name if the
+   * variable is inlined to the class, or an array access if the 
variable is to be stored
+   * in an array of variables of the same type and initialization.
+   */
+  def addMutableState(
--- End diff --

Thank you for updating the generated code. That makes it clear.

Looking at this method, `addMutableState` returns an array element whenever a 
variable is primitive or a simple object, even if the number of mutable states 
is small. Does always using an array element lead to performance overhead 
compared to using instance variables?
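
The compaction idea under discussion can be sketched roughly as follows. This 
is a simplified stand-in, not Spark's actual `CodegenContext`; the class and 
field names are illustrative.

```scala
// Illustrative sketch of mutable-state compaction: instead of declaring one
// named class field per piece of state (each costing constant-pool entries),
// states sharing the same (javaType, initCode) key are packed into one shared
// array and referenced as "typeArray[i]".
class CompactingContext {
  private var inlineCount = 0
  // (javaType, initCode) -> next free slot in that type's shared array
  private val arrayIdx =
    scala.collection.mutable.Map.empty[(String, String), Int]

  /** Returns the code fragment used to reference the new state variable. */
  def addMutableState(javaType: String, initCode: String,
                      inline: Boolean): String = {
    if (inline) {
      inlineCount += 1
      s"field$inlineCount"                 // one class field per state
    } else {
      val key = (javaType, initCode)
      val idx = arrayIdx.getOrElse(key, 0) // compact into a shared array
      arrayIdx(key) = idx + 1
      s"${javaType}Array[$idx]"
    }
  }
}
```

With compaction, hundreds of `int` states collapse into `intArray[0]`, 
`intArray[1]`, ..., so the generated class declares a single array field rather 
than hundreds of scalar fields, which is what avoids the Constant Pool limit. 
The trade-off raised above is that each access then goes through an array load 
instead of a direct field read.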



[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-19 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/17967
  
@yanboliang Thanks for the review and suggestion. Makes lots of sense. I 
made a new commit to address these. 



[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...

2017-05-19 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/12646#discussion_r117600840
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -510,6 +510,69 @@ public UTF8String trim() {
 }
   }
 
+  /**
+   * Removes the given trim string from both ends of a string
+   * @param trimString the trim character string
+   */
+  public UTF8String trim(UTF8String trimString) {
+// This method searches for each character in the source string, 
removes the character if it is found
+// in the trim string, stops at the first not found. It starts from 
left end, then right end.
+// It returns a new string in which both ends trim characters have 
been removed.
+int s = 0; // the searching byte position of the input string
+int i = 0; // the first beginning byte position of a non-matching 
character
+int e = 0; // the last byte position
+int numChars = 0; // number of characters from the input string
+int[] stringCharLen = new int[numBytes]; // array of character length 
for the input string
+int[] stringCharPos = new int[numBytes]; // array of the first byte 
position for each character in the input string
+int searchCharBytes;
+
+while (s < this.numBytes) {
+  UTF8String searchChar = copyUTF8String(s, s + 
numBytesForFirstByte(this.getByte(s)) - 1);
+  searchCharBytes = searchChar.numBytes;
+  // try to find the matching for the searchChar in the trimString set
+  if (trimString.find(searchChar, 0) >= 0) {
+i += searchCharBytes;
+  } else {
+// no matching, exit the search
+break;
+  }
+  s += searchCharBytes;
+}
+
+if (i >= this.numBytes) {
+  // empty string
+  return UTF8String.EMPTY_UTF8;
+} else {
+  //build the position and length array
--- End diff --

nit: add a space after `//`



[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...

2017-05-19 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/12646#discussion_r117601121
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -510,6 +510,69 @@ public UTF8String trim() {
 }
   }
 
+  /**
+   * Removes the given trim string from both ends of a string
+   * @param trimString the trim character string
+   */
+  public UTF8String trim(UTF8String trimString) {
+// This method searches for each character in the source string, 
removes the character if it is found
+// in the trim string, stops at the first not found. It starts from 
left end, then right end.
+// It returns a new string in which both ends trim characters have 
been removed.
+int s = 0; // the searching byte position of the input string
+int i = 0; // the first beginning byte position of a non-matching 
character
+int e = 0; // the last byte position
+int numChars = 0; // number of characters from the input string
+int[] stringCharLen = new int[numBytes]; // array of character length 
for the input string
+int[] stringCharPos = new int[numBytes]; // array of the first byte 
position for each character in the input string
+int searchCharBytes;
+
+while (s < this.numBytes) {
+  UTF8String searchChar = copyUTF8String(s, s + 
numBytesForFirstByte(this.getByte(s)) - 1);
+  searchCharBytes = searchChar.numBytes;
+  // try to find the matching for the searchChar in the trimString set
+  if (trimString.find(searchChar, 0) >= 0) {
+i += searchCharBytes;
+  } else {
+// no matching, exit the search
+break;
+  }
+  s += searchCharBytes;
+}
+
+if (i >= this.numBytes) {
+  // empty string
+  return UTF8String.EMPTY_UTF8;
+} else {
+  //build the position and length array
+  s = 0;
+  while (s < numBytes) {
+stringCharPos[numChars] = s;
+stringCharLen[numChars]= numBytesForFirstByte(getByte(s));
+s += stringCharLen[numChars];
--- End diff --

Since `numChars != numBytes`, can we use an `ArrayList` instead of an array? 
`numBytes` is only an upper bound, so allocating arrays of that size still 
seems strange to me.
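
The suggestion of collecting per-character positions in a growable list can be 
sketched as below. This is a hedged illustration over plain Java Strings (by 
code point), not Spark's byte-level `UTF8String`; the object name is made up 
for the example.

```scala
// Sketch of character-set trim using a growable ArrayBuffer for the
// per-character start positions, instead of a numBytes-sized array.
object CharSetTrim {
  def trimBoth(s: String, trimChars: String): String = {
    // record the start offset of every code point (grows only as needed)
    val starts = scala.collection.mutable.ArrayBuffer.empty[Int]
    var i = 0
    while (i < s.length) {
      starts += i
      i += Character.charCount(s.codePointAt(i))
    }
    val trimSet = trimChars.codePoints().toArray.toSet

    var lo = 0                              // trim from the left end
    while (lo < starts.length && trimSet(s.codePointAt(starts(lo)))) lo += 1
    var hi = starts.length - 1              // then from the right end
    while (hi >= lo && trimSet(s.codePointAt(starts(hi)))) hi -= 1

    if (lo > hi) ""
    else s.substring(starts(lo),
      if (hi + 1 < starts.length) starts(hi + 1) else s.length)
  }
}
```

Using a buffer sized by the actual character count sidesteps the 
`numChars != numBytes` mismatch: nothing is allocated for bytes that turn out 
to be continuation bytes of multi-byte characters.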



[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...

2017-05-19 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/12646#discussion_r117601355
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala 
---
@@ -2015,4 +2015,121 @@ class SQLQuerySuite extends QueryTest with 
SQLTestUtils with TestHiveSingleton {
   checkAnswer(table.filter($"p" === "p1\" and q=\"q1").select($"a"), 
Row(4))
 }
   }
+
+  test("TRIM function-BOTH") {
+withTable("trimBoth", "trimStrut") {
+  sql("create table trimBoth (c1 string, c2 char(1), c3 string, c4 
string, " +
+"c5 string, c6 string)")
+  // scalastyle:off
+  sql("insert into trimBoth select 'cc', 'c', ' cccbacc', 
'cccbacc数', '数据砖头', '数'")
+  // scalastyle:on
+  sql("create table trimStrut (c1 struct, c2 
string)")
+  sql("insert into trimStrut values ((100, 'abc'), 'ABC')")
+
+  intercept[AnalysisException] {
+sql("SELECT TRIM('c', C1, 'd') from trimBoth")
+  }
+  intercept[AnalysisException] {
+   sql("SELECT TRIM(C2, C1) from trimBoth").collect
+  }
+  intercept[AnalysisException] {
+   sql("SELECT TRIM(BOTH C2 FROM C1) from trimBoth").collect
+  }
+  intercept[AnalysisException] {
+sql("select trim(c1,c2) from trimStrut")
+  }
+  intercept[AnalysisException] {
+sql("select trim(c2,c1) from trimStrut")
+  }
+
+checkAnswer (sql("SELECT TRIM(BOTH 'c' FROM C1) from trimBoth"), 
Row(""))
--- End diff --

remove space after `checkAnswer`



[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17967
  
**[Test build #77110 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77110/testReport)**
 for PR 17967 at commit 
[`5f31d31`](https://github.com/apache/spark/commit/5f31d311c0c39da1968686dd4147376b3888cee3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...

2017-05-19 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/12646#discussion_r117601249
  
--- Diff: 
common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java 
---
@@ -730,4 +726,62 @@ public void testToLong() throws IOException {
   assertFalse(negativeInput, 
UTF8String.fromString(negativeInput).toLong(wrapper));
 }
   }
+  @Test
+  public void trim() {
+assertEquals(fromString("hello"), fromString("  hello 
").trim(fromString(" ")));
+assertEquals(fromString("o"), fromString("  hello ").trim(fromString(" 
hle")));
+assertEquals(fromString("h e"), fromString("ooh e 
ooo").trim(fromString("o ")));
+assertEquals(fromString(""), 
fromString("ooo...").trim(fromString("o.")));
+assertEquals(fromString("b"), 
fromString("%^b[]@").trim(fromString("][@^%")));
+
--- End diff --

unnecessary blank lines



[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...

2017-05-19 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/12646#discussion_r117601293
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/StringExpressionsSuite.scala
 ---
@@ -375,24 +374,61 @@ class StringExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper {
 
   test("TRIM/LTRIM/RTRIM") {
--- End diff --

Since the test case becomes large after this pr, I suggest split it into 
three.



[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...

2017-05-19 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/12646#discussion_r117600921
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -510,6 +510,69 @@ public UTF8String trim() {
 }
   }
 
+  /**
+   * Removes the given trim string from both ends of a string
+   * @param trimString the trim character string
+   */
+  public UTF8String trim(UTF8String trimString) {
+// This method searches for each character in the source string, 
removes the character if it is found
+// in the trim string, stops at the first not found. It starts from 
left end, then right end.
+// It returns a new string in which both ends trim characters have 
been removed.
+int s = 0; // the searching byte position of the input string
+int i = 0; // the first beginning byte position of a non-matching 
character
+int e = 0; // the last byte position
+int numChars = 0; // number of characters from the input string
+int[] stringCharLen = new int[numBytes]; // array of character length 
for the input string
+int[] stringCharPos = new int[numBytes]; // array of the first byte 
position for each character in the input string
+int searchCharBytes;
+
+while (s < this.numBytes) {
+  UTF8String searchChar = copyUTF8String(s, s + 
numBytesForFirstByte(this.getByte(s)) - 1);
+  searchCharBytes = searchChar.numBytes;
+  // try to find the matching for the searchChar in the trimString set
+  if (trimString.find(searchChar, 0) >= 0) {
+i += searchCharBytes;
+  } else {
+// no matching, exit the search
+break;
+  }
+  s += searchCharBytes;
+}
+
+if (i >= this.numBytes) {
+  // empty string
+  return UTF8String.EMPTY_UTF8;
+} else {
+  //build the position and length array
+  s = 0;
+  while (s < numBytes) {
+stringCharPos[numChars] = s;
+stringCharLen[numChars]= numBytesForFirstByte(getByte(s));
+s += stringCharLen[numChars];
+numChars ++;
+  }
+
+  e = this.numBytes - 1;
+  while (numChars > 0) {
+UTF8String searchChar =
+  copyUTF8String(stringCharPos[numChars-1], 
stringCharPos[numChars-1] + stringCharLen[numChars-1] - 1);
--- End diff --

nit: `numChars - 1`



[GitHub] spark pull request #12646: [SPARK-14878][SQL] Trim characters string functio...

2017-05-19 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/12646#discussion_r117600707
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -510,6 +510,69 @@ public UTF8String trim() {
 }
   }
 
+  /**
+   * Removes the given trim string from both ends of a string
+   * @param trimString the trim character string
+   */
+  public UTF8String trim(UTF8String trimString) {
+// This method searches for each character in the source string, 
removes the character if it is found
+// in the trim string, stops at the first not found. It starts from 
left end, then right end.
+// It returns a new string in which both ends trim characters have 
been removed.
+int s = 0; // the searching byte position of the input string
+int i = 0; // the first beginning byte position of a non-matching 
character
+int e = 0; // the last byte position
+int numChars = 0; // number of characters from the input string
+int[] stringCharLen = new int[numBytes]; // array of character length 
for the input string
+int[] stringCharPos = new int[numBytes]; // array of the first byte 
position for each character in the input string
+int searchCharBytes;
+
+while (s < this.numBytes) {
+  UTF8String searchChar = copyUTF8String(s, s + 
numBytesForFirstByte(this.getByte(s)) - 1);
+  searchCharBytes = searchChar.numBytes;
--- End diff --

move `searchCharBytes` declaration here



[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

2017-05-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17967
  
Merged build finished. Test PASSed.


