[GitHub] spark issue #13558: [SPARK-15820][PySpark][SQL]Add Catalog.refreshTable into...

2016-06-08 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/13558
  
Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13538: [MINOR] fix typo in documents

2016-06-06 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13538

[MINOR] fix typo in documents

## What changes were proposed in this pull request?

I used spell-check tools to find typos in the Spark documents and fixed them.

## How was this patch tested?

N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark fix_doc_typo

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13538.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13538


commit 8af1bf8f0d1a59aea35633780ca439f6c459bb78
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-06-06T14:23:10Z

fix typo in documents







[GitHub] spark pull request #13578: [SPARK-15837][ML][PySpark]Word2vec python add max...

2016-06-09 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13578

[SPARK-15837][ML][PySpark]Word2vec python add maxsentence parameter

## What changes were proposed in this pull request?

Add the max sentence length parameter to the Python Word2Vec API.
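
A hedged usage sketch (assuming the Python parameter mirrors the Scala Word2Vec API name `maxSentenceLength`, which is what this PR exposes):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("word2vec-sketch").getOrCreate()
doc = spark.createDataFrame([("a b c".split(" "),), ("d e f".split(" "),)], ["text"])

# maxSentenceLength caps sentence size; longer sentences are
# split into chunks of at most this many tokens.
w2v = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="vec",
               maxSentenceLength=1000)
model = w2v.fit(doc)
model.getVectors().show()
```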

## How was this patch tested?

Existing test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark word2vec_python_add_maxsentence

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13578.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13578


commit 57384a0a9ffbfe9befe44cb7a9ae226eff603c94
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-06-08T21:28:41Z

word2vec_python_add_maxsentence_param







[GitHub] spark issue #13578: [SPARK-15837][ML][PySpark]Word2vec python add maxsentenc...

2016-06-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/13578
  
@srowen Hi, I have another similar PR, #13558, which passed tests on my machine, but the official test failed. It seems to be a problem with the test server; could you help check it?





[GitHub] spark issue #13544: [SPARK-15805][SQL][Documents] update sql programming gui...

2016-06-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/13544
  
@rxin 
A small question: `HiveContext` has a `refreshTable` method for refreshing the metadata of a Hive table. With the new Hive-enabled SparkSession API, that method is removed. Does the new SparkSession API no longer require the user to refresh Hive table metadata?
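
For reference, a minimal sketch of the two APIs side by side (the Spark 2.0 call is the `Catalog.refreshTable` that PR #13558 below adds; the table name is hypothetical):

```python
from pyspark.sql import SparkSession

# Spark 1.x: HiveContext(sc).refreshTable("my_table")
# Spark 2.0: the equivalent lives on the session's catalog.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.catalog.refreshTable("my_table")  # "my_table" is a hypothetical Hive table
```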





[GitHub] spark pull request #13558: [SPARK-15820][pyspark][SQL] update python sql int...

2016-06-08 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13558

[SPARK-15820][pyspark][SQL] update python sql interface refreshTable

## What changes were proposed in this pull request?

Add the Catalog.refreshTable API to the Python interface for Spark SQL.

## How was this patch tested?

Existing test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark update_python_sql_interface_refreshTable

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13558.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13558


commit eedb961ebb141f2bafd1c9798bb09d5de939ea5c
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-06-07T20:04:52Z

update python sql interface refreshTable







[GitHub] spark pull request #13538: [MINOR] fix typo in documents

2016-06-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13538#discussion_r66059465
  
--- Diff: docs/streaming-programming-guide.md ---
@@ -2037,7 +2037,7 @@ and configuring them to receive different partitions of the data stream from the
 For example, a single Kafka input DStream receiving two topics of data can be split into two
 Kafka input streams, each receiving only one topic. This would run two receivers,
 allowing data to be received in parallel, thus increasing overall throughput. These multiple
-DStreams can be unioned together to create a single DStream. Then the transformations that were
+DStreams can be united together to create a single DStream. Then the transformations that were
--- End diff --

@srowen Restored 'unioned', thanks!





[GitHub] spark issue #13538: [MINOR] fix typo in documents

2016-06-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/13538
  
@srowen Yes, I checked each of the .md files, and I think it is done.





[GitHub] spark issue #13381: [SPARK-15608][ml][doc] add_isotonic_regression_doc

2016-06-06 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/13381
  
@yanboliang Done.





[GitHub] spark pull request #13525: [MINOR]fix typo a -> an

2016-06-06 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13525

[MINOR]fix typo a -> an

## What changes were proposed in this pull request?

Fix `a` -> `an`, similar to #13515.
I used commands like `find . -name '*.R' | xargs -i sh -c "grep -in ' a [aeiou]' {} && echo {}"` to generate candidates, and reviewed them one by one.
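
For illustration, a rough Python equivalent of that candidate search. Note the caveat (raised later in this thread) that `' a [aeiou]'` matches vowel letters, not vowel sounds, so hits like "a user" are false positives that need the manual review step:

```python
import pathlib
import re

pattern = re.compile(r" a [aeiou]", re.IGNORECASE)
for path in pathlib.Path(".").rglob("*.R"):
    for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
        if pattern.search(line):
            # Candidate only; a human must decide whether "an" is correct here.
            print(f"{path}:{lineno}: {line.strip()}")
```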

## How was this patch tested?

N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark fix_typo_a2an

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13525.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13525


commit 59918455fee4a7b88c5c27b60c29a508da3c4500
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-06-06T04:00:09Z

fix typo a -> an







[GitHub] spark pull request #13544: [SPARK-15805][SQL][Documents] update sql programm...

2016-06-07 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13544

[SPARK-15805][SQL][Documents] update sql programming guide

## What changes were proposed in this pull request?

Update the whole SQL programming guide doc file, including:
- use `SparkSession` instead of `SQLContext`
- use `SparkSession.builder.enableHiveSupport` instead of `HiveContext`
- use `dataFrame.write.saveAsTable` instead of `dataFrame.saveAsTable`
- use `sparkSession.catalog.cacheTable/uncacheTable` instead of `SQLContext.cacheTable/uncacheTable`
- and so on... (a combined sketch follows below)
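
A combined, hedged sketch of the replacements above (standard Spark 2.0 APIs; assumes a Hive-enabled build, and the table name is made up):

```python
from pyspark.sql import SparkSession

# SparkSession.builder with enableHiveSupport() replaces HiveContext;
# a plain builder chain replaces SQLContext.
spark = (SparkSession.builder
         .appName("migration-sketch")
         .enableHiveSupport()
         .getOrCreate())

df = spark.range(10)
df.write.saveAsTable("numbers")        # replaces df.saveAsTable("numbers")
spark.catalog.cacheTable("numbers")    # replaces sqlContext.cacheTable("numbers")
spark.catalog.uncacheTable("numbers")
```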

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark update_sql_prog_doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13544.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13544


commit 770abc52e2f3d34e8638b2126587daa4af3490c2
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-06-07T04:03:15Z

update sql programming guide







[GitHub] spark issue #13544: [SPARK-15805][SQL][Documents] update sql programming gui...

2016-06-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/13544
  
@rxin 





[GitHub] spark issue #13558: [SPARK-15820][PySpark][SQL]Add Catalog.refreshTable into...

2016-06-12 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/13558
  
Jenkins, test this please.





[GitHub] spark issue #13544: [SPARK-15805][SQL][Documents] update sql programming gui...

2016-06-10 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/13544
  
@liancheng OK, no problem!





[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

2016-06-10 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13592#discussion_r66699400
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1607,13 +1600,13 @@ a regular multi-line JSON file will most often fail.
 
 {% highlight r %}
 # sc is an existing SparkContext.
-sqlContext <- sparkRSQL.init(sc)
+spark <- sparkRSQL.init(sc)
--- End diff --

Currently, `sparkRSQL.init` calls `org.apache.spark.sql.api.r.SQLUtils.createSQLContext`, which returns a `SQLContext` object, not a `SparkSession` object. So it seems the R API needs to be updated here first?





[GitHub] spark pull request #13381: [SPARK-15608][ml][examples][doc] add examples and...

2016-06-10 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13381#discussion_r66700697
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/IsotonicRegressionExample.scala ---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.regression.IsotonicRegression
+// $example off$
+import org.apache.spark.sql.SparkSession
+
+/**
+ * An example demonstrating Isotonic Regression.
+ * Run with
+ * {{{
+ * bin/run-example ml.IsotonicRegressionExample
+ * }}}
+ */
+object IsotonicRegressionExample {
+
+  def main(args: Array[String]): Unit = {
+
+    // Creates a SparkSession.
+    val spark = SparkSession
+      .builder
+      .appName(s"${this.getClass.getSimpleName}")
+      .getOrCreate()
+
+    // $example on$
+    // Loads data.
+    val dataset = spark.read.format("libsvm")
+      .load("data/mllib/sample_isotonic_regression_libsvm_data.txt")
+
+    // Trains an isotonic regression model.
+    val ir = new IsotonicRegression()
+    val model = ir.fit(dataset)
+
+    println(s"Boundaries in increasing order: ${model.boundaries}")
+    println(s"Predictions associated with the boundaries: ${model.predictions}")
+
+    // Makes predictions.
+    model.transform(dataset).show
--- End diff --

@jkbradley Done.





[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

2016-06-10 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13592#discussion_r66700281
  
--- Diff: docs/sql-programming-guide.md ---
@@ -517,24 +517,26 @@ types such as Sequences or Arrays. This RDD can be implicitly converted to a Dat
 registered as a table. Tables can be used in subsequent SQL statements.
 
 {% highlight scala %}
-// sc is an existing SparkContext.
-val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+val spark: SparkSession // An existing SparkSession
 // this is used to implicitly convert an RDD to a DataFrame.
-import sqlContext.implicits._
+import spark.implicits._
 
 // Define the schema using a case class.
 // Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
 // you can use custom classes that implement the Product interface.
 case class Person(name: String, age: Int)
 
-// Create an RDD of Person objects and register it as a table.
-val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
+// Create an RDD of Person objects and register it as a temporary view.
+val people = sc
+  .textFile("examples/src/main/resources/people.txt")
+  .map(_.split(","))
+  .map(p => Person(p(0), p(1).trim.toInt))
+  .toDF()
 people.createOrReplaceTempView("people")
--- End diff --

Here it seems better to change the input data file to JSON format; then we can use `SparkSession.read.json('path/to/data.json')`, so we don't need to use SparkContext and can directly get a `DataFrame`, which simplifies the example code.
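
A minimal PySpark sketch of that suggestion (`people.json` ships in Spark's example resources; the query is only illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-sketch").getOrCreate()

# Reading JSON yields a DataFrame directly: no SparkContext,
# no case class, and no manual schema definition.
people = spark.read.json("examples/src/main/resources/people.json")
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19").show()
```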





[GitHub] spark issue #13558: [SPARK-15820][PySpark][SQL]Add Catalog.refreshTable into...

2016-06-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/13558
  
Jenkins, test this please.





[GitHub] spark pull request #13381: [SPARK-15608][ml][examples][doc] add examples and...

2016-06-11 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13381#discussion_r66715565
  
--- Diff: docs/ml-classification-regression.md ---
@@ -685,6 +685,76 @@ The implementation matches the result from R's survival function
 
 
 
+## Isotonic regression
+[Isotonic regression](http://en.wikipedia.org/wiki/Isotonic_regression)
+belongs to the family of regression algorithms. Formally isotonic regression is a problem where
+given a finite set of real numbers `$Y = {y_1, y_2, ..., y_n}$` representing observed responses
+and `$X = {x_1, x_2, ..., x_n}$` the unknown response values to be fitted
+finding a function that minimises
+
+`\begin{equation}
+  f(x) = \sum_{i=1}^n w_i (y_i - x_i)^2
+\end{equation}`
+
+with respect to complete order subject to
+`$x_1\le x_2\le ...\le x_n$` where `$w_i$` are positive weights.
+The resulting function is called isotonic regression and it is unique.
+It can be viewed as least squares problem under order restriction.
+Essentially isotonic regression is a
+[monotonic function](http://en.wikipedia.org/wiki/Monotonic_function)
+best fitting the original data points.
+
+In `spark.ml`, we implement a
--- End diff --

@jkbradley Done.





[GitHub] spark pull request #13381: [SPARK-15608][ml][examples][doc] add examples and...

2016-06-14 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13381#discussion_r66961848
  
--- Diff: examples/src/main/python/mllib/isotonic_regression_example.py ---
@@ -23,18 +23,22 @@
 from pyspark import SparkContext
 # $example on$
 import math
-from pyspark.mllib.regression import IsotonicRegression, IsotonicRegressionModel
+from pyspark.mllib.regression import LabeledPoint, IsotonicRegression, IsotonicRegressionModel
 # $example off$
 
 if __name__ == "__main__":
 
     sc = SparkContext(appName="PythonIsotonicRegressionExample")
 
     # $example on$
-    data = sc.textFile("data/mllib/sample_isotonic_regression_data.txt")
+    # Load and parse the data
+    def parsePoint(line):
+        values = [float(x) for x in line.replace(',', ' ').replace(':', ' ').split(' ')]
+        return (values[0], values[2], 1.0)
+    data = sc.textFile("data/mllib/sample_isotonic_regression_libsvm_data.txt")
--- End diff --

@yanboliang Done.





[GitHub] spark issue #13558: [SPARK-15820][PySpark][SQL]Add Catalog.refreshTable into...

2016-06-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/13558
  
@andrewor14 It looks strange; I tested on my own machine and everything is OK. Could it be a problem with the test server?





[GitHub] spark issue #13525: [MINOR]fix typo a -> an

2016-06-06 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/13525
  
@srowen 
OK, so the command doesn't work correctly.





[GitHub] spark pull request #13525: [MINOR]fix typo a -> an

2016-06-06 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at:

https://github.com/apache/spark/pull/13525





[GitHub] spark pull request: [SPARK-15608][ml][doc] add_isotonic_regression...

2016-05-28 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13381

[SPARK-15608][ml][doc] add_isotonic_regression_doc

## What changes were proposed in this pull request?

Add an ML doc section for isotonic regression.
Add a Scala example for ML isotonic regression.

## How was this patch tested?

N/A




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark add_isotonic_regression_doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13381.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13381


commit 1c974e31dba06491495f432b1d24e16e67baa13c
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-05-28T14:32:24Z

add_isotonic_regression_doc







[GitHub] spark pull request: [SPARK-15608][ml][doc] add_isotonic_regression...

2016-05-29 Thread WeichenXu123
Github user WeichenXu123 commented on the pull request:

https://github.com/apache/spark/pull/13381#issuecomment-222353670
  
@holdenk Java & Python examples added. Thanks!





[GitHub] spark pull request: [SPARK-15533][SQL]Deprecate Dataset.explode

2016-05-25 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13313

[SPARK-15533][SQL]Deprecate Dataset.explode

## What changes were proposed in this pull request?

Deprecate Dataset.explode
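
For reference, a hedged PySpark sketch of the usual replacement (to the best of my knowledge, the deprecation note points users at `select` with the `explode` function; the sample data here is made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("explode-sketch").getOrCreate()
df = spark.createDataFrame([("hello world",)], ["line"])

# One output row per token: the column-based explode() function is the
# suggested substitute for the deprecated Dataset.explode method.
df.select(explode(split(df.line, " ")).alias("word")).show()
```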


## How was this patch tested?

Existing test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark fix-SPARK-15533

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13313.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13313


commit aaac61a015133ac09a92a8edaeea423254d94d8b
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-05-25T23:15:40Z

Deprecate Dataset.explode







[GitHub] spark pull request: [SPARK-15533][SQL]Deprecate Dataset.explode

2016-05-25 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at:

https://github.com/apache/spark/pull/13313





[GitHub] spark pull request: [SPARK-15499][PySpark][Tests] Add python tests...

2016-05-27 Thread WeichenXu123
Github user WeichenXu123 commented on the pull request:

https://github.com/apache/spark/pull/13275#issuecomment-222069163
  
@jkbradley 
`--modules='pyspark-ml'` runs a batch of tests under the pyspark subdirectory in parallel, not a single Python file.
My purpose is to add a way to debug a single Python file. Because the Python test code has to be launched with bin/pyspark, it is not convenient to debug directly in an IDE.





[GitHub] spark pull request #13441: [SPARK-15702][Documentation]Update document progr...

2016-06-01 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13441

[SPARK-15702][Documentation]Update document programming-guide accumulator 
section

## What changes were proposed in this pull request?

Update the accumulator section of the programming guide (Scala only).
I did not modify the Java and Python versions, because those APIs are not done yet.


## How was this patch tested?

N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark update_doc_accumulatorV2_clean

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13441.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13441


commit 598b28dbe1e3ad3f0afd885e72c4a368c3cd9a09
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-06-01T04:01:08Z

Update document programming-guide accumulator section







[GitHub] spark pull request #12987: [spark-15212][SQL]CSV file reader when read file ...

2016-06-01 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at:

https://github.com/apache/spark/pull/12987





[GitHub] spark pull request: [SPARK-15670][Java API][Spark Core]label_accumulator_dep...

2016-05-31 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13412

[SPARK-15670][Java API][Spark Core]label_accumulator_deprecate_in_java_spark_context

## What changes were proposed in this pull request?

Add a deprecated annotation to the accumulator V1 interface in the JavaSparkContext class.

## How was this patch tested?

N/A



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark label_accumulator_deprecate_in_java_spark_context

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13412.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13412


commit 8761c52a8ba39de7ef8b0f99976b103ec7f1cd2e
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-05-31T03:01:34Z

label_accumulator_deprecate_in_java_spark_context







[GitHub] spark issue #13558: [SPARK-15820][PySpark][SQL]Add Catalog.refreshTable into...

2016-06-19 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/13558
  
@andrewor14 The PR is OK now.





[GitHub] spark pull request #13544: [SPARK-15805][SQL][Documents] update sql programm...

2016-06-27 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at:

https://github.com/apache/spark/pull/13544





[GitHub] spark pull request #13558: [SPARK-15820][PySpark][SQL]Add Catalog.refreshTab...

2016-06-17 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13558#discussion_r67496513
  
--- Diff: python/pyspark/sql/catalog.py ---
@@ -232,6 +232,11 @@ def clearCache(self):
 """Removes all cached tables from the in-memory cache."""
 self._jcatalog.clearCache()
 
+@since(2.0)
+def refreshTable(self, tableName):
+"""Invalidate and refresh all the cached the metadata of the given 
table."""
--- End diff --

@MLnick Done. I also corrected the comment with the extra "the" in the Scala code. Thanks!





[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...

2016-05-19 Thread WeichenXu123
Github user WeichenXu123 commented on the pull request:

https://github.com/apache/spark/pull/13172#issuecomment-220260517
  
@srowen 
Modified as you expected.





[GitHub] spark pull request: [SPARK-15461][tests][pyspark]modify python tes...

2016-05-20 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13240

[SPARK-15461][tests][pyspark]modify python test script to use version 2.7 by default

## What changes were proposed in this pull request?

Update the default Python version used in python/run-tests.py to Python 2.7.

## How was this patch tested?

Existing test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark modify_python_test_use_version_to_27

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13240.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13240


commit 04bd30c55134400f01dad845431374a73ffc0406
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-05-21T05:09:22Z

modify python test script using version default 2.7







[GitHub] spark pull request: [SPARK-15464][ML][MLlib][SQL][Tests] Replace S...

2016-05-22 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13242#discussion_r64147805
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -933,21 +933,20 @@ def getKeepLastCheckpoint(self):
 if __name__ == "__main__":
     import doctest
     import pyspark.ml.clustering
-    from pyspark.context import SparkContext
-    from pyspark.sql import SQLContext
+    from pyspark.sql import SparkSession
     globs = pyspark.ml.clustering.__dict__.copy()
     # The small batch size here ensures that we see multiple batches,
     # even in these small test examples:
-    sc = SparkContext("local[2]", "ml.clustering tests")
-    sqlContext = SQLContext(sc)
+    spark = SparkSession.builder.master("local[2]").appName("ml.clustering tests").getOrCreate()
--- End diff --

@techaddict Done.





[GitHub] spark pull request: [SPARK-15461][tests][pyspark]modify python tes...

2016-05-22 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at:

https://github.com/apache/spark/pull/13240





[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...

2016-05-22 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at:

https://github.com/apache/spark/pull/13007





[GitHub] spark pull request: [SPARK-15499][PySpark][Tests] Add python tests...

2016-05-24 Thread WeichenXu123
Github user WeichenXu123 commented on the pull request:

https://github.com/apache/spark/pull/13275#issuecomment-221203315
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-15499][PySpark][Tests] Add python tests...

2016-05-24 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13275

[SPARK-15499][PySpark][Tests] Add python testsuite with remote-debug and single-test parameters to help developers debug code more easily

## What changes were proposed in this pull request?

To the python/run-tests.py script, I add the following parameters:

    --single-test=SINGLE_TEST      specify a python module to run as a single python test
    --debug-server=DEBUG_SERVER    debug server host, only used with a single test
    --debug-port=DEBUG_PORT        debug server port, only used with a single test

For example, to debug only pyspark.tests, first start the debug server in your Python IDE (such as PyCharm or PyDev) on a machine with host MY_DEV_MACHINE, listening on debug port 5678, and make sure the machine running the test can connect to MY_DEV_MACHINE. Then run:

    python/run-tests --python-executables=python2.7 --single-test=pyspark.tests --debug-server=MY_DEV_MACHINE --debug-port=5678

The --single-test parameter specifies the single python testsuite you want to run; currently you can use the following values:

    pyspark.tests
    pyspark.ml.tests
    pyspark.mllib.tests
    pyspark.sql.tests
    pyspark.streaming.tests

## How was this patch tested?

Existing test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark add_python_remote_debug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13275.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13275


commit 872ebdedc80a76b9bf064bda52af6074d16e7efc
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-05-24T05:32:35Z

add python test with single test and remote debug param

commit e694d10f27d2da9607c6f52bb7b2a2e2fd50e0fd
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-05-24T05:41:53Z

remove useless code which used for debug by myself







[GitHub] spark pull request: [SPARK-15499][PySpark][Tests] Add python tests...

2016-05-24 Thread WeichenXu123
Github user WeichenXu123 commented on the pull request:

https://github.com/apache/spark/pull/13275#issuecomment-221210606
  
@davies What do you think about it?





[GitHub] spark pull request: [SPARK-15464][ML][MLlib][SQL][Tests] Replace S...

2016-05-21 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13242

[SPARK-15464][ML][MLlib][SQL][Tests] Replace SQLContext and SparkContext 
with SparkSession using builder pattern in python test code

## What changes were proposed in this pull request?

Replace SQLContext and SparkContext with SparkSession using builder pattern 
in python test code.
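
A minimal sketch of the replacement pattern (the before/after mirrors the clustering.py diff quoted earlier in this digest):

```python
from pyspark.sql import SparkSession

# Before: sc = SparkContext("local[2]", "tests"); sqlContext = SQLContext(sc)
spark = (SparkSession.builder
         .master("local[2]")
         .appName("tests")
         .getOrCreate())

# The underlying SparkContext is still reachable when an RDD API is needed.
sc = spark.sparkContext
```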

## How was this patch tested?

Existing test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark python_doctest_update_sparksession

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13242.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13242


commit e6ff76a5309b4b5d0b23e4eccba413cddab5daa0
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-05-21T16:09:18Z

replace SQLContext and SparkContext with SparkSession using builder pattern 
in python test code







[GitHub] spark pull request: [SPARK-15446][build][sql] modify catalyst usin...

2016-05-20 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at:

https://github.com/apache/spark/pull/13224





[GitHub] spark pull request: [SPARK-15446][build][sql] modify catalyst usin...

2016-05-20 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13224

[SPARK-15446][build][sql] modify catalyst: longValueExact is not supported on Java 7

## What changes were proposed in this pull request?

Change to BigInteger.longValue, which is supported on Java 7. (BigInteger.longValueExact was only added in Java 8.)

## How was this patch tested?



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark modify_compile_err_catalyst_longValueExact_error

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13224.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13224


commit 991e34932854194598d5aa05402d53440b750a1d
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-05-20T15:30:23Z

modify catalyst using longValueExact not supporting java 7







[GitHub] spark pull request: [SPARK-15350][mllib]add unit test function for...

2016-05-16 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13136

[SPARK-15350][mllib]add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite

## What changes were proposed in this pull request?

Add a unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite,
and update the test function names in the class to distinguish LR with SGD from LR with LBFGS.

## How was this patch tested?

The testsuite I modified.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark add_testsuite_LR_LBFGS

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13136.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13136


commit a8593ab94861a2332380a23dd9c849a1b95296fd
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-05-16T15:15:53Z

add unit test function for LogisticRegressionWithLBFGS in 
JavaLogisticRegressionSuite







[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...

2016-05-18 Thread WeichenXu123
Github user WeichenXu123 commented on the pull request:

https://github.com/apache/spark/pull/12978#issuecomment-220021066
  
In order to check this potential problem more carefully, we can add test code like the following (the for-loop is the added part):

    echo "$newpid" > "$pid"
    # Added test code: report the daemon's process name every 0.1s for 3s.
    for i in {1..30}; do
      echo "check daemon stage $i" `ps -p "$newpid" -o comm=`
      sleep 0.1
    done
    sleep 2
    # Check if the process has died; in that case we'll tail the log so the user can see
    if [[ ! $(ps -p "$newpid" -o comm=) =~ "java" ]]; then
      echo "failed to launch $command:"
      tail -2 "$log" | sed 's/^/  /'
      echo "full log in $log"
    fi

then run the start daemon script it will print such as:

sbin/start-master.sh 
starting org.apache.spark.deploy.master.Master, logging to 
/diskext/mySpark/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-n131.out
check daemon stage 1 bash
check daemon stage 2 bash
check daemon stage 3 bash
check daemon stage 4 bash
check daemon stage 5 java
check daemon stage 6 java
check daemon stage 7 java
check daemon stage 8 java


The output shows that the process stays in the bash stage for some time.
In particular, right after the machine boots, when the OS cache is cold, that
time may exceed 2s.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...

2016-05-18 Thread WeichenXu123
Github user WeichenXu123 commented on the pull request:

https://github.com/apache/spark/pull/12978#issuecomment-220016724
  
@srowen Following your suggestion, I added a loop to check whether it passes
STAGE-1 and launches the java daemon, and I restored the check statement `! $(ps -p 
"$newpid" -o comm=) =~ "java"`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...

2016-05-18 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/12978#discussion_r63700865
  
--- Diff: sbin/spark-daemon.sh ---
@@ -162,6 +162,15 @@ run_command() {
   esac
 
   echo "$newpid" > "$pid"
+
+  for i in {1..10}
+  do
+if [[ $(ps -p "$newpid" -o comm=) =~ "java" ]]; then
+   break
+fi
+sleep 0.5
--- End diff --

Oh, the `sleep 2` statement can't be moved.

The loop may take at most 5 seconds to check whether the daemon has turned
into a java daemon; once it becomes a java daemon, it breaks out of the loop.

**It must then still sleep 2 seconds,**
and only afterwards check whether the java daemon has died. The java daemon may 
run for a while before throwing an exception and dying, so we need to wait some 
time before that check.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15322][mllib][core][sql]update deprecat...

2016-05-18 Thread WeichenXu123
Github user WeichenXu123 commented on the pull request:

https://github.com/apache/spark/pull/13112#issuecomment-219980697
  
@srowen Updated; it seems fine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...

2016-05-18 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/12978#discussion_r63708450
  
--- Diff: sbin/spark-daemon.sh ---
@@ -162,6 +162,15 @@ run_command() {
   esac
 
   echo "$newpid" > "$pid"
+
+  for i in {1..10}
+  do
+if [[ $(ps -p "$newpid" -o comm=) =~ "java" ]]; then
+   break
+fi
+sleep 0.5
--- End diff --

new clean PR is here:
https://github.com/apache/spark/pull/13172
thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15203][Deploy]fix bug SPARK-15203

2016-05-18 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13172

[SPARK-15203][Deploy]fix bug SPARK-15203

## What changes were proposed in this pull request?

fix bug SPARK-15203

## How was this patch tested?

existing test.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark fix-spark-15203

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13172.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13172


commit 55d911f29137441ddaa1e9022bf85edac86f72af
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-05-18T13:52:14Z

fix bug SPARK-15203




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...

2016-05-18 Thread WeichenXu123
Github user WeichenXu123 closed the pull request at:

https://github.com/apache/spark/pull/12978


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15322][mllib]update deprecate accumulat...

2016-05-14 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13112

[SPARK-15322][mllib]update deprecate accumulator usage into accumulatorV2 
in mllib

## What changes were proposed in this pull request?

MLlib has two places that use the deprecated sc.accumulator method; this 
updates them:
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala line 282
mllib/src/main/scala/org/apache/spark/ml/util/stopwatches.scala line 106

## How was this patch tested?

Reran the build and tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark update_accuV2_in_mllib

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13112.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13112


commit 2761dff513eb2da87464735722807e3ea0ea7676
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-05-14T06:10:34Z

update deprecate accumulator usage into accumulatorV2 in mllib




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15322][mllib]update deprecate accumulat...

2016-05-15 Thread WeichenXu123
Github user WeichenXu123 commented on the pull request:

https://github.com/apache/spark/pull/13112#issuecomment-219291622
  
@srowen I used IntelliJ IDEA to search for usages of the deprecated 
SparkContext.accumulator across the whole Spark project and updated the 
code (except the test code for the accumulator method itself).
@HyukjinKwon I updated `import org.apache.spark.{SparkContext}` ==> `import 
org.apache.spark.SparkContext`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...

2016-05-10 Thread WeichenXu123
Github user WeichenXu123 commented on the pull request:

https://github.com/apache/spark/pull/12978#issuecomment-218089784
  
@srowen On my virtual machine, after the OS boots and the Spark daemon is 
started, stage 1 described above takes a long time, often exceeding 2s. I think 
the java daemon needs to load a huge number of jars, so it takes a while when 
started for the first time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...

2016-05-10 Thread WeichenXu123
Github user WeichenXu123 commented on the pull request:

https://github.com/apache/spark/pull/13007#issuecomment-218094762
  
@HyukjinKwon 
Hmm, the current CSV load code uses Hadoop `LineRecordReader`, which does not 
allow a row to span multiple lines, so I think the code should either disable 
the multi-line CSV format or replace `LineRecordReader` with a reader that 
supports multi-line CSV, as illustrated below.
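
For illustration, here is a legal CSV record whose quoted field spans two
physical lines; a line-oriented reader hands it to the parser as two separate
rows (a minimal sketch, not code from the PR):

    // A single logical CSV record with an embedded newline inside quotes.
    // Hadoop's LineRecordReader would split it at the newline.
    val record = "1997,Ford,E350,\"ac, abs,\nmoon roof\",3000.00"
    println(record)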


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: fix CSV file data-line with newline at first l...

2016-05-09 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/13007

fix CSV file data-line with newline at first line load error

## What changes were proposed in this pull request?

Fix the load error that occurs when a CSV data row with an embedded newline
appears at the first line of the file.


## How was this patch tested?

Tested as described here:
https://issues.apache.org/jira/browse/SPARK-15226



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark fix_firstline_csv

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13007.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13007


commit 049743be3672e5b04e5b02597add0ff49dd80de5
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-05-09T15:25:54Z

fix CSV file data-line with newline at first line load error




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...

2016-05-09 Thread WeichenXu123
Github user WeichenXu123 commented on the pull request:

https://github.com/apache/spark/pull/13007#issuecomment-218037385
  
@HyukjinKwon I ran the existing tests against this patch and they all pass. If 
needed I will add a new test in CSVSuite.
I think the only cause of the bug is the first-line handling in the reading 
code. A row spanning multiple lines is legal in the CSV format, and the current 
CSV datasource code, which uses the `com.univocity.parsers.csv` library, also 
supports it. Only a multi-line data row at the very beginning of the file 
triggers the bug, and I think this patch fixes it easily.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...

2016-05-09 Thread WeichenXu123
Github user WeichenXu123 commented on the pull request:

https://github.com/apache/spark/pull/12978#issuecomment-218040874
  
@srowen 
Because spark-daemon.sh uses the exec command to start the java daemon process, 
when the script runs, the spark-daemon.sh process exists for a short time 
(stage 1) and then becomes a java process (stage 2); if the java process then 
throws an exception, it dies.
If stage 1 takes more than 2s, the check `if [[ ! $(ps -p "$newpid" -o comm=) 
=~ "java" ]]` fails and the process is regarded as failed, even though it has 
not yet reached stage 2. On my virtual machine this happens often.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...

2016-05-10 Thread WeichenXu123
Github user WeichenXu123 commented on the pull request:

https://github.com/apache/spark/pull/12978#issuecomment-218097327
  
@srowen Er... I also find it strange that stage 1 can take so long, but it 
really has happened several times. If I find time I will do a more detailed 
test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [spark-15212][SQL]CSV file reader when read fi...

2016-05-08 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/12987

[spark-15212][SQL]CSV file reader when read file with first line schema do 
not filter blank in schema column name

## What changes were proposed in this pull request?

When loading a CSV file with a header schema, trim the schema column name 
strings to avoid problems caused by embedded blanks.

## How was this patch tested?

Construct a CSV data file whose first-line schema definition contains a column 
name with a leading blank, such as:

csv data file contains:
--
col1, col2,col3,col4,col5
1997,Ford,E350,"ac, abs, moon",3000.00


Note the blank before col2.
Test commands:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val reader = sqlContext.read
reader.option("header", true)
val df = reader.csv("path/to/csvfile")
df.select("col2")                        // check if OK
df.registerTempTable("tab1")
sqlContext.sql("select col2 from tab1")  // check if OK




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark bugfix-15212

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12987.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12987


commit 56cc34f37c712d172df6fbc27b2823a571160cda
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-05-08T12:52:09Z

fix bug: spark-15212




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-15203][Deploy]The spark daemon shell sc...

2016-05-07 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/12978

[SPARK-15203][Deploy]The spark daemon shell script error, daemon process 
start successfully but script output fail message.

This bug occurs because sbin/spark-daemon.sh uses bin/spark-class to start the 
daemon, then sleeps 2s and checks whether the daemon process exists, with a 
shell test like the following:
if [[ ! $(ps -p "$newpid" -o comm=) =~ "java" ]]
The problem is that a machine with poor performance may take a long time 
(more than 2s) to start the daemon yet still start it successfully. In that 
case the test fails, because at that point the $newpid process is still the 
shell process; only once the daemon has started does it become a java process. 
That is what causes the bug.
To reproduce the bug more easily, you can change 
sleep 2  ==>  sleep 0.01 (sbin/spark-daemon.sh, line 165)

To fix this bug, 
I replace
 ! $(ps -p "$newpid" -o comm=) =~ "java"   (sbin/spark-daemon.sh, line 167)
with 
-z $(ps --no-headers -p "$newpid")



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark fix_shell_001

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12978.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12978


commit 750078c3ab0d34f55db935b9518b3288f98cade1
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-05-07T13:00:58Z

fix shell start daemon script bug




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

2016-07-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14333
  
@srowen 
By default, the SparkContext runs a cleaner in the background to release 
unreferenced RDDs/broadcasts. But I think we had better release them ourselves, 
because the SparkContext auto-cleaner depends on Java GC: if GC is not 
triggered, the cleaner won't release the unused broadcasts.
As we can see, if a driver-side broadcast has been unpersisted but not 
destroyed, its metadata is kept in the context and the broadcast content is 
kept on the driver side. Here in KMeans, `bcNewCenters` is a two-dimensional 
array, so I think it is better to release these broadcasts as soon as possible.
As for the overhead: tracking the historical `bcNewCenters` broadcasts only 
needs a reference list, and they can be destroyed asynchronously, so the 
overhead is acceptable.
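
A minimal sketch of explicitly releasing a broadcast once no pending
computation can need it (illustrative code, not the KMeans source; the
zero-filled array just stands in for the centers):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").appName("bc-demo").getOrCreate()
    val sc = spark.sparkContext

    val bc = sc.broadcast(Array.fill(3, 3)(0.0))   // stand-in for bcNewCenters
    val sums = sc.parallelize(1 to 100)
      .map(i => bc.value(i % 3).sum + i)
      .collect()                                   // action completes here

    bc.destroy()  // release driver and executor copies now, instead of
                  // waiting for the GC-driven ContextCleaner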



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBatch met...

2016-07-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14335
  
@srowen 
`stats.unpersist(false)` ==> `stats.unpersist()` updated.
Is there anything else that needs updating?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

2016-07-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14333
  
@srowen 
I checked the places that reference `RDD.persist`:
AFTSurvivalRegression, LinearRegression, and LogisticRegression persist the 
input training RDD and unpersist it when `train` returns; that seems OK.
`recommend.ALS` persists many RDDs and appears to unpersist them all correctly.
The mllib `BisectingKMeans.run` contains a TODO "unpersist old indices"; I'll 
check it now.
The others seem OK.

The places that reference `Broadcast.persist` were already checked in this PR; 
I think they are all properly handled here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14203: update python dataframe.drop

2016-07-14 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/14203

update python dataframe.drop

## What changes were proposed in this pull request?

Make the `dataframe.drop` API in Python support multiple columns as 
parameters, so that it matches the Scala API.


## How was this patch tested?

The doc test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark drop_python_api

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14203.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14203






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

2016-07-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14333
  
Yeah, but `bcSyn0Global` in Word2Vec is a different case; it looks safe to 
destroy it there,
because in each loop iteration, the RDD transformation that uses `bcSyn0Global` 
ends with a `collect`.
After the `collect` action we no longer need the RDD (all of its computation 
has finished, with no further recovery possible), so we can also destroy 
`bcSyn0Global` directly in each loop iteration.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...

2016-07-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14335#discussion_r72003627
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
 gammaPart = gammad :: gammaPart
   }
   Iterator((stat, gammaPart))
-}
+}.persist(StorageLevel.MEMORY_AND_DISK)
 val statsSum: BDM[Double] = 
stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))(
   _ += _, _ += _)
-expElogbetaBc.unpersist()
 val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
   stats.map(_._2).flatMap(list => 
list).collect().map(_.toDenseMatrix): _*)
+stats.unpersist(false)
--- End diff --

@srowen yeah, here we'd better make it consistent with others. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] unused broadcast variables do d...

2016-07-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14333
  
@srowen 
`bcNewCenters` in `KMeans` has a problem.
Checking the code logic in detail, we can see that in each loop it should 
destroy the broadcast var `bcNewCenters` generated in the previous loop, not 
the one generated in the current loop, just as is done for `costs: RDD`, which 
uses a `preCosts` var to hold the one generated in the previous loop.
I updated the code. 

The second problem: what is the meaning of `broadcast.unpersist`? I think there 
is another scenario. Suppose there is an RDD lineage: in the normal case it 
executes successfully, and we can unpersist the now-useless broadcast var in 
time. But if some exception happens, Spark can recover by rebuilding the broken 
RDD from the lineage, and in that case it may re-use the broadcast var we had 
unpersisted. If we simply destroy it, the broadcast var cannot be recovered, 
so the recovery will fail.

So I think the safe place to call `broadcast.destroy` is where some action on 
the RDD has executed successfully and the whole RDD lineage is no longer 
needed.
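
A minimal sketch of the "destroy the previous iteration's broadcast" pattern
described above (illustrative names and update step; not the KMeans source):

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.rdd.RDD

    def runIterations(sc: SparkContext, data: RDD[Double], iters: Int): Unit = {
      var centers = Array(0.0, 1.0, 2.0)
      var bcPrev: Broadcast[Array[Double]] = null  // broadcast from the previous loop
      for (_ <- 0 until iters) {
        val bcCenters = sc.broadcast(centers)
        // Action: everything that reads bcCenters is materialized here.
        val shift = data.map(x => x + bcCenters.value.min).sum()
        centers = centers.map(_ + shift)           // illustrative update step
        if (bcPrev != null) bcPrev.destroy()       // previous loop's broadcast: safe now
        bcPrev = bcCenters
      }
      if (bcPrev != null) bcPrev.destroy()         // release the last one after the loop
    }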


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...

2016-07-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14335#discussion_r72003530
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
 gammaPart = gammad :: gammaPart
   }
   Iterator((stat, gammaPart))
-}
+}.persist(StorageLevel.MEMORY_AND_DISK)
 val statsSum: BDM[Double] = 
stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))(
   _ += _, _ += _)
-expElogbetaBc.unpersist()
 val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
   stats.map(_._2).flatMap(list => 
list).collect().map(_.toDenseMatrix): _*)
--- End diff --

@srowen
you mean change 
`stats.map(_._2).flatMap(list => list).collect().map(_.toDenseMatrix)`
to
`stats.map(_._2).flatMap(list => list).map(_.toDenseMatrix).collect()`
?

Will the latter one run faster?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...

2016-07-24 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14335#discussion_r72003428
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
 gammaPart = gammad :: gammaPart
   }
   Iterator((stat, gammaPart))
-}
+}.persist(StorageLevel.MEMORY_AND_DISK)
--- End diff --

@srowen The type of the RDD persisted here is fixed to 
RDD[(BDM[Double], List[BDV[Double]])], so what is the risk that it cannot be 
persisted?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

2016-07-27 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14333
  
@srowen yeah, the code logic here looks confusing, but I think it is right.
Let me explain it in a clearer way.
In essence, the logic can be expressed as follows:
A0 -> I1 -> A1 -> I2 -> A2 -> ...
A0 is the initial `assignments`, I1 is the step-1 `indices`, A1 is the step-1 
`assignments`, I2 is the step-2 `indices`, and so on.
The arrows show the dependencies between them.
The key point is that when we compute I(K), we must make sure I(K-1) is 
persisted, while I(K-2) and older ones can be unpersisted.
Now check my code logic: in each iteration I do the following (see the sketch 
after this comment):
1. unpersist I(K-1)
2. compute I(K+1) using A(K); because of the dependency, A(K) must use 
I(K), and I(K) is STILL PERSISTED.
3. compute A(K+1) using I(K+1)

But now I found another problem in BisectingKMeans:
at line 191 there is an iteration that also needs the pattern "persist the 
current step's RDD, unpersist the previous one",
and that iteration's first RDD relates to the code here.
The problem seems a little troublesome, so we'd better create another PR to 
handle it?
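
A minimal sketch of the "persist the current step's RDD, unpersist the one
before last" pattern, assuming a simple chained lineage (illustrative only):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def chainedSteps(sc: SparkContext, steps: Int): RDD[Int] = {
      var prev: RDD[Int] = null
      var curr: RDD[Int] = sc.parallelize(1 to 1000).persist()
      curr.count()                          // materialize step 0
      for (_ <- 1 to steps) {
        val next = curr.map(_ + 1).persist()
        next.count()                        // materialize while curr is still cached
        if (prev != null) prev.unpersist()  // only the grandparent is dropped
        prev = curr
        curr = next
      }
      curr
    }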





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14326: [SPARK-3181] [ML] Implement RobustRegression with huber ...

2016-07-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14326
  
@yanboliang I went through the code and there are several problems to solve:

The robust regression has a parameter `sigma` which must be > 0, so this is a 
bound-constrained optimization problem and should use LBFGS-B. But in my tests, 
the current breeze LBFGS-B has bugs: while iterating it sometimes generates NaN 
values and corrupts the computation. 

I added some log printing to help debug; here is a small fragment showing how 
LBFGS-B breaks down:
**(robust regression w/o intercept w/ regularization test)**
costFun: sigma param: 1.0
huberAggrLoss + reg: 18262.68068379334
cost grad- sigma: -630.1789355384457
costFun: sigma param: 631.1789355384457
huberAggrLoss + reg: 1.256602668595641E7
cost grad- sigma: -466.0711286869664
costFun: sigma param: 64.01789355384457
huberAggrLoss + reg: 483796.45119015244
cost grad- sigma: -448.1113667824356
costFun: sigma param: 9.849995439060637
huberAggrLoss + reg: 44154.79971484518
cost grad- sigma: -275.5999029061156
costFun: sigma param: 3.2447088269560513
huberAggrLoss + reg: 8513.171279631315
cost grad- sigma: -5.737776191290681
**costFun: sigma param: NaN
huberAggrLoss + reg: NaN**
cost grad- sigma: -822.49944

As shown above, once the sigma param becomes NaN during iteration, LBFGS-B has 
broken down and there is no point in continuing. 

When I traced LBFGS-B, I found that the
`LBFGSB.subspaceMinimization` method can make the output point become 
(NaN, NaN, ...) even when the input is fine, so I think there is a bug in 
`LBFGSB.subspaceMinimization`. 
I think this problem has no workaround and needs the Breeze community to 
fix it.

The second problem is whether the loss should be divided by N and whether the 
L2 reg should be divided by 2; I think it should stay consistent with the other 
GLM algorithms in mllib.
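
For reference, a common way to write the Huber objective with the concomitant
scale $\sigma$ (my reading of the formulation under discussion; a sketch, not
necessarily the exact PR objective) is:

$$
\min_{w,\;\sigma>0}\;\sum_{i=1}^{n}\left(\sigma
  + H_\epsilon\!\left(\frac{x_i^{\top}w - y_i}{\sigma}\right)\sigma\right)
  + \lambda\lVert w\rVert_2^2,
\qquad
H_\epsilon(z)=\begin{cases}z^2, & |z|\le\epsilon\\[2pt]
2\epsilon|z|-\epsilon^2, & |z|>\epsilon\end{cases}
$$

The $\sigma > 0$ bound is what forces a box-constrained solver such as LBFGS-B
rather than plain LBFGS.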



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14216: [SPARK-16561][MLLib] fix multivarOnlineSummary min/max b...

2016-07-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14216
  
@srowen Oh, I missed your comment about the loop braces; they are added now, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14265: [PySpark] add picklable SparseMatrix in pyspark.ml.commo...

2016-07-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14265
  
cc @jkbradley Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14216: [SPARK-16561][MLLib] fix multivarOnlineSummary min/max b...

2016-07-21 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14216
  
@srowen yeah, I have pushed it: "some minor update" 
https://github.com/apache/spark/pull/14216/commits/362074187d8845eeb40452eceec10f7e8ad805df


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14333: [SPARK-16696][ML][MLLib] unused broadcast variabl...

2016-07-24 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/14333

[SPARK-16696][ML][MLLib] unused broadcast variables do destroy call to 
release memory in time

## What changes were proposed in this pull request?

Update the unused broadcasts in KMeans/LDAOptimizer/Word2Vec to use 
destroy(false), releasing memory promptly.

Several destroy() calls are also updated to destroy(false) so that they are 
invoked asynchronously,
which is better than blocking. 

## How was this patch tested?

Existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark 
broadvar_unpersist_to_destroy

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14333.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14333


commit 52afc038c79ab8176bf760d65793e8d5f94d4d4a
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-07-20T14:12:41Z

broadvar_unpersist_to_destroy




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...

2016-07-24 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/14335

[SPARK-16697][ML][MLLib] improve LDA submitMiniBatch method to avoid 
redundant RDD computation

## What changes were proposed in this pull request?

In `LDAOptimizer.submitMiniBatch`, persist `stats: RDD[(BDM[Double], 
List[BDV[Double]])]`,
and move the unpersist of the `expElogbetaBc` broadcast variable
so that it is not unpersisted too early;
also update the previous `expElogbetaBc.unpersist()` to 
`expElogbetaBc.destroy(false)`.

## How was this patch tested?

Existing test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark improve_LDA

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14335.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14335


commit e5ed33b559a04215c784d0d81a1578d3f13d8804
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-07-20T16:39:27Z

improve_LDA




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14265: [PySpark] add picklable SparseMatrix in pyspark.ml.commo...

2016-07-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14265
  
@srowen We can check python/ml/tests.py: the `VectorTests.test_serialize` 
function contains a test for `SparseMatrix` serialization/deserialization, 
which confirms that this works.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14440: [SPARK-16835][ML] add training data unpersist han...

2016-08-01 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/14440

[SPARK-16835][ML] add training data unpersist handling when throw exception


## What changes were proposed in this pull request?

For `LinearRegression`, `LogisticRegression`, and `AFTSurvivalRegression`,
modify the `unpersist` call on the input training `Dataset` to make sure that
`unpersist` is called even when an exception is thrown, as sketched below.
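
A minimal sketch of the pattern under discussion (illustrative; `trainModel`
is a hypothetical stand-in for the real training routines):

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.storage.StorageLevel

    object UnpersistOnException {
      def trainModel(ds: DataFrame): Long = ds.count()  // hypothetical trainer; may throw

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
        val dataset = spark.range(100).toDF()
        val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
        if (handlePersistence) dataset.persist(StorageLevel.MEMORY_AND_DISK)
        try {
          trainModel(dataset)
        } finally {
          if (handlePersistence) dataset.unpersist()  // runs on success or exception
        }
        spark.stop()
      }
    }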

## How was this patch tested?

Existing test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark 
glm_traindata_unpersist_on_exception

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14440.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14440


commit 1a366e6d382ea1e5156ae5d75c6600a2d414295e
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-07-26T07:25:01Z

glm_traindata_unpersist_on_exception




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...

2016-07-25 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14335#discussion_r72013619
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
 gammaPart = gammad :: gammaPart
   }
   Iterator((stat, gammaPart))
-}
+}.persist(StorageLevel.MEMORY_AND_DISK)
 val statsSum: BDM[Double] = 
stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))(
   _ += _, _ += _)
-expElogbetaBc.unpersist()
 val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
   stats.map(_._2).flatMap(list => 
list).collect().map(_.toDenseMatrix): _*)
--- End diff --

`stats.map(_._2).flatMap(list => list).map(_.toDenseMatrix).collect()` it 
can also work well.
but I think the two ways have no big difference considering efficiency.
because `stats.map(_._2).flatMap(list => list)` already generate a 
RDD[DenseVector]
Does serialzing a "DenseVector" or a "one row DenseMatrix" have a big 
difference ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

2016-07-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14333
  
@srowen `KMeans.initKMeansParallel` already implements the pattern 
"persist the current step's RDD, unpersist the previous one", but I think a 
persisted RDD can also break down because of a disk error or similar. If we 
want to reach the goal that "the broadcast can safely be 
disposed (`broadcast.destroy`) at each iteration too, rather than only at the 
end", I think it would need `RDD.checkpoint` instead of `RDD.persist`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

2016-07-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14333
  
@srowen
I checked the code around KMeans `bcNewCenters` again: if we want to make sure 
the recovery of the RDD succeeds in any unexpected case, we have to keep all 
the `bcNewCenters` generated in each loop until the loop is done,
because in each loop `costs: RDD` is built from `preCosts: RDD`, so the RDDs 
form a chain, and the loop only ever keeps the latest two RDDs persisted. If 
those last two RDDs are broken, Spark will need to rebuild them from the first 
RDD, and in that case every historical `bcNewCenters` would be used.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14335: [SPARK-16697][ML][MLLib] improve LDA submitMiniBa...

2016-07-25 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14335#discussion_r72014278
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -472,12 +473,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
 gammaPart = gammad :: gammaPart
   }
   Iterator((stat, gammaPart))
-}
+}.persist(StorageLevel.MEMORY_AND_DISK)
 val statsSum: BDM[Double] = 
stats.map(_._1).treeAggregate(BDM.zeros[Double](k, vocabSize))(
   _ += _, _ += _)
-expElogbetaBc.unpersist()
 val gammat: BDM[Double] = breeze.linalg.DenseMatrix.vertcat(
   stats.map(_._2).flatMap(list => 
list).collect().map(_.toDenseMatrix): _*)
--- End diff --

Hmm, `DenseVector.toDenseMatrix` needs to copy the whole buffer in the vector, 
so there may be some performance impact if all of those copies are done on the 
driver side. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14333: [SPARK-16696][ML][MLLib] destroy KMeans bcNewCenters whe...

2016-07-30 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14333
  
@srowen 
I checked the code again; the problem I mentioned above
(`But now I found another problem in BisectingKMeans: at line 191 there is an 
iteration that also needs the pattern "persist the current step's RDD, 
unpersist the previous one"`)
is NOT A PROBLEM; I misread the code.

I also re-checked the other modifications in this PR and they seem fine. 

So I think the PR is OK now. Thanks!



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14604: [Doc] add config option spark.ui.enabled into doc...

2016-08-11 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/14604

[Doc] add config option spark.ui.enabled into document

## What changes were proposed in this pull request?

The configuration doc is missing the config option `spark.ui.enabled` (default 
value: `true`).
I think this option is important because in many cases we would like to turn 
the UI off,
so I added it.
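
For example, the option can also be set programmatically when building a
session (a small sketch; `spark.ui.enabled` itself is the documented option):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("no-ui")
      .master("local[*]")
      .config("spark.ui.enabled", "false")  // disable the Spark web UI
      .getOrCreate()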

## How was this patch tested?

N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark 
add_doc_param_spark_ui_enabled

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14604.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14604


commit 93fa7f7816017628cc5b898c1d7c6acf385fcde0
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-08-10T17:19:36Z

add config doc option: spark.ui.enabled




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14483: [SPARK-16880][ML][MLLib] make ann training data persiste...

2016-08-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14483
  
@srowen yeah, the other algorithms that use LBFGS all have this pattern; only 
ANN missed it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14156: [SPARK-16499][ML][MLLib] improve ApplyInPlace function i...

2016-08-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14156
  
yeah, currently it seems to add a little overhead (an extra copy), but I think 
it will be able to take advantage of breeze optimizations in the future, e.g. 
SIMD instructions or similar? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14156: [SPARK-16499][ML][MLLib] improve ApplyInPlace function i...

2016-08-03 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14156
  
@srowen
The `:=` operator on BDM simply copies one BDM into another, and it is widely 
used in the breeze source. For example, check the `DenseMatrix.copy` function 
in Breeze:
it first uses `DenseMatrix.create` to create a new matrix with the same 
dimensions,
`val result = DenseMatrix.create(...)`
and then uses
`result := this` to copy itself into the matrix just created.

The mechanism behind the `:=` operator for DenseMatrix is that DenseMatrix 
implements the `OpSet` trait.
Check the `DenseMatrix` source file in breeze: at line 985 there is
`implicit val setMV_D: OpSet.InPlaceImpl2[...] = new SetDMDVOp[Double]()`
so the implementation code is in the `SetDMDVOp` class,
and we can see that `SetDMDVOp` does type specialization for the Double type, 
so the compiled code is highly efficient.
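
A minimal illustration of the in-place `:=` copy (assumes breeze is on the
classpath; a sketch, not the breeze internals):

    import breeze.linalg.DenseMatrix

    val src = DenseMatrix((1.0, 2.0), (3.0, 4.0))
    val dst = DenseMatrix.zeros[Double](2, 2)
    dst := src        // in-place copy: dst's buffer is overwritten, no new allocation
    println(dst)      // 1.0  2.0
                      // 3.0  4.0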


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14156: [SPARK-16499][ML][MLLib] improve ApplyInPlace function i...

2016-08-04 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14156
  
@srowen yeah, the function supplied here cannot be turned into SIMD 
instructions when called, but I think it could still allow some parallelization 
on large matrices; for example, the matrix could be split into several blocks 
and the "in place transform" executed in parallel, although breeze doesn't 
currently do that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14483: [SPARK-16880][ML][MLLib] make ann training data p...

2016-08-03 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/14483

[SPARK-16880][ML][MLLib] make ann training data persisted if needed

## What changes were proposed in this pull request?

Make sure the ANN layer's input training data is persisted,
so that we avoid the overhead of recomputing the RDD from its 
lineage.


## How was this patch tested?

Existing Tests.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark 
add_ann_persist_training_data

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14483.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14483


commit 0bfece8fea0599a4a4493d6a6822d0c6e542745b
Author: WeichenXu <weichenxu...@outlook.com>
Date:   2016-08-03T12:11:19Z

add ann persist training data if needed







[GitHub] spark issue #14629: [WIP][SPARK-17046][SQL] prevent user using dataframe.sel...

2016-08-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14629
  
MySQL does not allow SELECT with zero columns, and I think `select()` with no
arguments is useless; no one would do such an operation on purpose. So would it
be better to generate a compile-time error when code uses `df.select()`, since
it is usually a coding mistake? Or is `df.select()` useful in some particular scenario?
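
A hypothetical sketch of one way to reject the zero-column call at compile time (not Spark's actual API; `selectNonEmpty` is illustrative): require one explicit column ahead of the varargs, so a no-argument call no longer type-checks:

```scala
import org.apache.spark.sql.{Column, DataFrame}

// selectNonEmpty(df)() fails to compile because `first` is mandatory,
// while selectNonEmpty(df)($"a", $"b") delegates to the normal select.
def selectNonEmpty(df: DataFrame)(first: Column, rest: Column*): DataFrame =
  df.select((first +: rest): _*)
```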





[GitHub] spark issue #14520: [SPARK-16934][ML][MLLib] Improve LogisticCostFun to avoi...

2016-08-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14520
  
@yanboliang Thanks for the careful review!





[GitHub] spark issue #14520: [SPARK-16934][ML][MLLib] Improve LogisticCostFun to avoi...

2016-08-14 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14520
  
@sethah I attached the test result, and it looks good.





[GitHub] spark issue #14520: [SPARK-16934][ML][MLLib] Improve LogisticCostFun to avoi...

2016-08-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/14520
  
cc @yanboliang Thanks!




