[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-06-07 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r193664416
  
--- Diff: docs/submitting-applications.md ---
@@ -218,6 +218,115 @@ These commands can be used with `pyspark`, 
`spark-shell`, and `spark-submit` to
 For Python, the equivalent `--py-files` option can be used to distribute 
`.egg`, `.zip` and `.py` libraries
 to executors.
 
+# VirtualEnv for PySpark
--- End diff --

Thanks @HyukjinKwon, the doc is updated.


---




[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark

2018-06-07 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/13599
  
Thanks for the interest in this PR and the info about `Pipfiles`. I think 
we could support that after this PR gets merged, so that we can provide users 
more options for virtualenv based on their environment.


---




[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...

2018-05-01 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/13493
  
Thanks @jkbradley. The failed tests seem unrelated. 


---




[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark

2018-02-27 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/13599
  
That would be awesome. 


---




[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark

2018-02-27 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/13599
  
I am afraid I won't be able to be present at Strata SJ; I live in Shanghai, China, 
and may not be able to travel at that time. 


---




[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark

2018-02-04 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/13599
  
ping @holdenk @HyukjinKwon 


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-29 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r164646157
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+private[spark] class VirtualEnvFactory(pythonExec: String, conf: 
SparkConf, isDriver: Boolean)
+  extends Logging {
+
+  private var virtualEnvType = conf.get("spark.pyspark.virtualenv.type", 
"native")
+  private var virtualEnvPath = 
conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by launcher when user want to use virtualenv in pyspark shell. 
Launcher need this class
+  // to create virtualenv for driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: 
java.lang.Boolean) {
--- End diff --

It is used by the launcher module, which doesn't depend on Scala. 


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-26 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r164069473
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala ---
@@ -39,12 +39,17 @@ object PythonRunner {
 val pyFiles = args(1)
 val otherArgs = args.slice(2, args.length)
 val sparkConf = new SparkConf()
-val pythonExec = sparkConf.get(PYSPARK_DRIVER_PYTHON)
+var pythonExec = sparkConf.get(PYSPARK_DRIVER_PYTHON)
--- End diff --

I am afraid I have to use a `var`, because `VirtualEnvFactory` also needs 
`pythonExec` as a constructor argument.


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-26 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r164068172
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: 
Boolean)
+  extends Logging {
+
+  private val virtualEnvType = conf.get("spark.pyspark.virtualenv.type", 
"native")
+  private val virtualEnvBinPath = 
conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private val initPythonPackages = 
conf.getOption("spark.pyspark.virtualenv.packages")
--- End diff --

For the executors that are already running, the only way to install additional 
packages is via `sc.install_packages`. `spark.pyspark.virtualenv.packages` 
is updated on the driver side first when `sc.install_packages` is called; newly 
allocated executors then fetch this property and install all the additional 
packages properly. 
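
To make that flow concrete, here is a minimal sketch from the user's side; it assumes a cluster where the virtualenv support from this PR is enabled, and the conda binary path is only illustrative:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.pyspark.virtualenv.enabled", "true")
        .set("spark.pyspark.virtualenv.type", "conda")
        .set("spark.pyspark.virtualenv.bin.path", "/opt/conda/bin/conda"))  # illustrative path
sc = SparkContext(conf=conf)

# Installs on the driver and, via a dummy RDD job, on the currently running executors.
# The package names are also recorded in spark.pyspark.virtualenv.packages, so executors
# allocated later pick them up when they set up their own virtualenv.
sc.install_packages("numpy")
sc.install_packages(["pandas", "scipy"])

print(sc.getConf().get("spark.pyspark.virtualenv.packages"))  # accumulated package list
```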


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-25 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r164037516
  
--- Diff: python/pyspark/context.py ---
@@ -1023,6 +1032,41 @@ def getConf(self):
 conf.setAll(self._conf.getAll())
 return conf
 
+def install_packages(self, packages):
+"""
+Install python packages on all executors and driver through pip. 
pip will be installed
+by default no matter using native virtualenv or conda. So it is 
guaranteed that pip is
+available if virtualenv is enabled.
+:param packages: string for single package or a list of string for 
multiple packages
+"""
+if self._conf.get("spark.pyspark.virtualenv.enabled") != "true":
+raise RuntimeError("install_packages can only use called when "
+   "spark.pyspark.virtualenv.enabled is set as 
true")
+if isinstance(packages, basestring):
+packages = [packages]
+# seems statusTracker.getExecutorInfos() will return driver + 
exeuctors, so -1 here.
+num_executors = 
len(self._jsc.sc().statusTracker().getExecutorInfos()) - 1
+dummyRDD = self.parallelize(range(num_executors), num_executors)
+
+def _run_pip(packages, iterator):
+import pip
+return pip.main(["install"] + packages)
+
+# install package on driver first. if installation succeeded, 
continue the installation
+# on executors, otherwise return directly.
+if _run_pip(packages, None) != 0:
+return
+
+virtualenvPackages = 
self._conf.get("spark.pyspark.virtualenv.packages")
+if virtualenvPackages:
+self._conf.set("spark.pyspark.virtualenv.packages", 
virtualenvPackages + "," +
+   ",".join(packages))
+else:
+self._conf.set("spark.pyspark.virtualenv.packages", 
",".join(packages))
--- End diff --

Good catch, I will use another separator. 
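
A quick illustration of why a bare comma is ambiguous here: a single pip requirement specifier may itself contain commas.

```python
packages = ["numpy>=1.13,<1.15", "pandas"]

joined = ",".join(packages)
print(joined)             # numpy>=1.13,<1.15,pandas
print(joined.split(","))  # ['numpy>=1.13', '<1.15', 'pandas'] -- the first spec is split apart
```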


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-25 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r164037239
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
 ---
@@ -98,7 +98,7 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
   private val reviveThread =
 
ThreadUtils.newDaemonSingleThreadScheduledExecutor("driver-revive-thread")
 
-  class DriverEndpoint(override val rpcEnv: RpcEnv, sparkProperties: 
Seq[(String, String)])
+  class DriverEndpoint(override val rpcEnv: RpcEnv)
--- End diff --

Without this change, the following scenario won't work (sketched below):
1. Launch the Spark app.
2. Call `sc.install_packages("numpy")`.
3. Run `sc.range(3).map(lambda x: np.__version__).collect()`.
4. Restart an executor (kill it; the scheduler will schedule another executor).
5. Run `sc.range(3).map(lambda x: np.__version__).collect()` again. This time it would fail, 
because the newly scheduled executor cannot set up the virtualenv correctly, as it cannot 
get the updated `spark.pyspark.virtualenv.packages`.

That's why I made this change in core. Now an executor always gets the 
updated SparkConf instead of the SparkConf created when the Spark app was started. 

There's some overhead, but I believe it is trivial and could be 
improved later.
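
A pyspark-shell sketch of the same scenario (assuming virtualenv is enabled as described in this PR's docs):

```python
# pyspark shell started with spark.pyspark.virtualenv.enabled=true (yarn, client mode)
sc.install_packages("numpy")   # installs on the driver and the currently running executors
import numpy as np

# Works: the running executors already have numpy in their virtualenv.
sc.range(3).map(lambda x: np.__version__).collect()

# Now kill an executor; the scheduler brings up a replacement. Without the
# DriverEndpoint change, the replacement builds its virtualenv from the original
# SparkConf, misses the updated spark.pyspark.virtualenv.packages, and this fails.
# With the change, it fetches the updated conf and succeeds:
sc.range(3).map(lambda x: np.__version__).collect()
```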


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-25 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r164037055
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java 
---
@@ -299,20 +300,34 @@
 // 4. environment variable PYSPARK_PYTHON
 // 5. python
List<String> pyargs = new ArrayList<>();
-pyargs.add(firstNonEmpty(conf.get(SparkLauncher.PYSPARK_DRIVER_PYTHON),
+String pythonExec = 
firstNonEmpty(conf.get(SparkLauncher.PYSPARK_DRIVER_PYTHON),
   conf.get(SparkLauncher.PYSPARK_PYTHON),
   System.getenv("PYSPARK_DRIVER_PYTHON"),
   System.getenv("PYSPARK_PYTHON"),
-  "python"));
-String pyOpts = System.getenv("PYSPARK_DRIVER_PYTHON_OPTS");
-if (conf.containsKey(SparkLauncher.PYSPARK_PYTHON)) {
-  // pass conf spark.pyspark.python to python by environment variable.
-  env.put("PYSPARK_PYTHON", conf.get(SparkLauncher.PYSPARK_PYTHON));
+  "python");
+if (conf.getOrDefault("spark.pyspark.virtualenv.enabled", 
"false").equals("true")) {
+  try {
+// setup virtualenv in launcher when virtualenv is enabled in 
pyspark shell
+Class virtualEnvClazz = 
Class.forName("org.apache.spark.api.python.VirtualEnvFactory");
--- End diff --

Because the launcher module does not depend on the core module explicitly in pom.xml.


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-25 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r164036488
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: 
Boolean)
+  extends Logging {
+
+  private val virtualEnvType = conf.get("spark.pyspark.virtualenv.type", 
"native")
+  private val virtualEnvBinPath = 
conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private val initPythonPackages = 
conf.getOption("spark.pyspark.virtualenv.packages")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by launcher when user want to use virtualenv in pyspark shell. 
Launcher need this class
+  // to create virtualenv for driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: 
java.lang.Boolean) {
+this(pythonExec, new SparkConf().setAll(properties.asScala), isDriver)
+this.isLauncher = true
+  }
+
+  /*
+   * Create virtualenv using native virtualenv or conda
+   *
+   */
+  def setupVirtualEnv(): String = {
+/*
+ *
+ * Native Virtualenv:
+ *   -  Execute command: virtualenv -p <pythonExec> --no-site-packages <virtualenvBasedir>
+ *   -  Execute command: python -m pip --cache-dir <cacheDir> install -r <requirementsFile>
+ *
+ * Conda
+ *   -  Execute command: conda create --prefix <virtualenvBasedir> --file <requirementsFile> -y
+ *
+ */
+logInfo("Start to setup virtualenv...")
+logDebug("user.dir=" + System.getProperty("user.dir"))
+logDebug("user.home=" + System.getProperty("user.home"))
+
+require(virtualEnvType == "native" || virtualEnvType == "conda",
+  s"VirtualEnvType: $virtualEnvType is not supported." )
+require(new File(virtualEnvBinPath).exists(),
+  s"VirtualEnvBinPath: $virtualEnvBinPath is not defined or doesn't 
exist.")
+// Two scenarios of creating virtualenv:
+// 1. created in yarn container. Yarn will clean it up after container 
is exited
+// 2. created outside yarn container. Spark need to create temp 
directory and clean it after app
+//finish.
+//  - driver of PySpark shell
+//  - driver of yarn-client mode
+if (isLauncher ||
+  (isDriver && conf.get("spark.submit.deployMode") == "client")) {
+  val virtualenvBasedir = Files.createTempDir()
+  virtualenvBasedir.deleteOnExit()
+  virtualEnvName = virtualenvBasedir.getAbsolutePath
+} else if (isDriver && conf.get("spark.submit.deployMode") == 
"cluster") {
+  virtualEnvName = "virtualenv_driver"
+} else {
+  // use the working directory of Executor
+  virtualEnvName = "virtualenv_" + conf.getAppId + "_" + 
VIRTUALENV_ID.getAndIncrement()
+}
+
+// Use the absolute path of requirement file in the following cases
+// 1. driver of pyspark shell
+// 2. driver of yarn-client mode
+// otherwise just use filename as it would be downloaded to the 
working directory of Executor
+val pysparkRequirements =
+  if (isLauncher ||
+(isDriver && conf.get("spark.subm

[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-25 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r164034980
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: 
Boolean)
+  extends Logging {
+
+  private val virtualEnvType = conf.get("spark.pyspark.virtualenv.type", 
"native")
+  private val virtualEnvBinPath = 
conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private val initPythonPackages = 
conf.getOption("spark.pyspark.virtualenv.packages")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by launcher when user want to use virtualenv in pyspark shell. 
Launcher need this class
+  // to create virtualenv for driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: 
java.lang.Boolean) {
+this(pythonExec, new SparkConf().setAll(properties.asScala), isDriver)
+this.isLauncher = true
+  }
+
+  /*
+   * Create virtualenv using native virtualenv or conda
+   *
+   */
+  def setupVirtualEnv(): String = {
+/*
+ *
+ * Native Virtualenv:
+ *   -  Execute command: virtualenv -p <pythonExec> --no-site-packages <virtualenvBasedir>
+ *   -  Execute command: python -m pip --cache-dir <cacheDir> install -r <requirementsFile>
+ *
+ * Conda
+ *   -  Execute command: conda create --prefix <virtualenvBasedir> --file <requirementsFile> -y
+ *
+ */
+logInfo("Start to setup virtualenv...")
+logDebug("user.dir=" + System.getProperty("user.dir"))
+logDebug("user.home=" + System.getProperty("user.home"))
+
+require(virtualEnvType == "native" || virtualEnvType == "conda",
+  s"VirtualEnvType: $virtualEnvType is not supported." )
+require(new File(virtualEnvBinPath).exists(),
+  s"VirtualEnvBinPath: $virtualEnvBinPath is not defined or doesn't 
exist.")
+// Two scenarios of creating virtualenv:
+// 1. created in yarn container. Yarn will clean it up after container 
is exited
+// 2. created outside yarn container. Spark need to create temp 
directory and clean it after app
+//finish.
+//  - driver of PySpark shell
+//  - driver of yarn-client mode
+if (isLauncher ||
+  (isDriver && conf.get("spark.submit.deployMode") == "client")) {
+  val virtualenvBasedir = Files.createTempDir()
+  virtualenvBasedir.deleteOnExit()
+  virtualEnvName = virtualenvBasedir.getAbsolutePath
+} else if (isDriver && conf.get("spark.submit.deployMode") == 
"cluster") {
+  virtualEnvName = "virtualenv_driver"
+} else {
+  // use the working directory of Executor
+  virtualEnvName = "virtualenv_" + conf.getAppId + "_" + 
VIRTUALENV_ID.getAndIncrement()
+}
+
+// Use the absolute path of requirement file in the following cases
+// 1. driver of pyspark shell
+// 2. driver of yarn-client mode
+// otherwise just use filename as it would be downloaded to the 
working directory of Executor
+val pysparkRequirements =
+  if (isLauncher ||
+(isDriver && conf.get("spark.subm

[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-25 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r164034871
  
--- Diff: python/pyspark/context.py ---
@@ -1023,6 +1032,41 @@ def getConf(self):
 conf.setAll(self._conf.getAll())
 return conf
 
+def install_packages(self, packages):
+"""
+Install python packages on all executors and driver through pip. 
pip will be installed
+by default no matter using native virtualenv or conda. So it is 
guaranteed that pip is
+available if virtualenv is enabled.
+:param packages: string for single package or a list of string for 
multiple packages
+"""
+if self._conf.get("spark.pyspark.virtualenv.enabled") != "true":
+raise RuntimeError("install_packages can only use called when "
+   "spark.pyspark.virtualenv.enabled is set as 
true")
+if isinstance(packages, basestring):
+packages = [packages]
+# seems statusTracker.getExecutorInfos() will return driver + 
exeuctors, so -1 here.
+num_executors = 
len(self._jsc.sc().statusTracker().getExecutorInfos()) - 1
+dummyRDD = self.parallelize(range(num_executors), num_executors)
+
+def _run_pip(packages, iterator):
+import pip
+return pip.main(["install"] + packages)
+
+# install package on driver first. if installation succeeded, 
continue the installation
+# on executors, otherwise return directly.
+if _run_pip(packages, None) != 0:
+return
--- End diff --

The pip install output will be displayed.




---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-23 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r163427975
  
--- Diff: python/pyspark/context.py ---
@@ -1023,6 +1032,42 @@ def getConf(self):
 conf.setAll(self._conf.getAll())
 return conf
 
+def install_packages(self, packages):
+"""
+install python packages on all executors and driver through pip. 
pip will be installed
+by default no matter using native virtualenv or conda. So it is 
guaranteed that pip is
+available if virtualenv is enabled.
+:param packages: string for single package or a list of string for 
multiple packages
+:param install_driver: whether to install packages in client
+"""
+if self._conf.get("spark.pyspark.virtualenv.enabled") != "true":
+raise RuntimeError("install_packages can only use called when "
+   "spark.pyspark.virtualenv.enabled is set as 
true")
+if isinstance(packages, basestring):
+packages = [packages]
+# seems statusTracker.getExecutorInfos() will return driver + 
exeuctors, so -1 here.
+num_executors = 
len(self._jsc.sc().statusTracker().getExecutorInfos()) - 1
+dummyRDD = self.parallelize(range(num_executors), num_executors)
+
+def _run_pip(packages, iterator):
+import pip
+return pip.main(["install"] + packages)
+
+# install package on driver first. if installation succeeded, 
continue the installation
+# on executors, otherwise return directly.
+if _run_pip(packages, None) != 0:
+return
+
+virtualenvPackages = 
self._conf.get("spark.pyspark.virtualenv.packages")
+if virtualenvPackages:
+self._conf.set("spark.pyspark.virtualenv.packages", 
virtualenvPackages + "," +
+   ",".join(packages))
+else:
+self._conf.set("spark.pyspark.virtualenv.packages", 
",".join(packages))
+
+import functools
+dummyRDD.foreachPartition(functools.partial(_run_pip, packages))
--- End diff --

You are right, it is not guaranteed. From my experiments it works pretty 
well most of the time, and even if it is not executed on all executors in this RDD 
operation, the packages will be installed when the Python daemon is later started on 
that executor. So in any case, the Python packages will be installed in all Python 
daemons.


---




[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark

2018-01-22 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/13599
  
@holdenk @HyukjinKwon @ueshin I have updated the PR; now it also works 
when an executor is restarted, and even when dynamic allocation is enabled. The only 
overhead is on the driver side when an executor asks for its configuration (changes in 
https://github.com/apache/spark/pull/13599/files#diff-7d99a7c7a051e5e851aaaefb275a44a1):
new properties are now constructed when an executor asks for them, 
instead of, as before, creating the properties up front and never changing them 
even when the SparkConf has changed. But I believe the overhead is trivial, so that 
improvement could be done in a separate PR. 



---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-09 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r160572606
  
--- Diff: python/pyspark/context.py ---
@@ -1023,6 +1032,35 @@ def getConf(self):
 conf.setAll(self._conf.getAll())
 return conf
 
+def install_packages(self, packages, install_driver=True):
+"""
+install python packages on all executors and driver through pip. 
pip will be installed
+by default no matter using native virtualenv or conda. So it is 
guaranteed that pip is
+available if virtualenv is enabled.
+:param packages: string for single package or a list of string for 
multiple packages
+:param install_driver: whether to install packages in client
+"""
+if self._conf.get("spark.pyspark.virtualenv.enabled") != "true":
+raise RuntimeError("install_packages can only use called when "
+   "spark.pyspark.virtualenv.enabled set as 
true")
+if isinstance(packages, basestring):
+packages = [packages]
+# seems statusTracker.getExecutorInfos() will return driver + 
exeuctors, so -1 here.
+num_executors = 
len(self._jsc.sc().statusTracker().getExecutorInfos()) - 1
+dummyRDD = self.parallelize(range(num_executors), num_executors)
+
+def _run_pip(packages, iterator):
+import pip
+pip.main(["install"] + packages)
+
+# run it in the main thread. Will do it in a separated thread after
+# https://github.com/pypa/pip/issues/2553 is fixed
+if install_driver:
+_run_pip(packages, None)
+
+import functools
+dummyRDD.foreachPartition(functools.partial(_run_pip, packages))
--- End diff --

It makes sense to mark this feature as experimental. Although it 
is not reliable in some cases, it is still pretty useful in interactive mode; 
e.g. in a notebook, it is not possible to pin down all the dependent packages 
before launching the Spark app, so installing packages at runtime is very useful. 
And since a notebook is usually an experimental phase rather than a 
production phase, these corner cases should be acceptable IMHO as long as we 
document them and make users aware. 


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-08 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r160310618
  
--- Diff: python/pyspark/context.py ---
@@ -1023,6 +1039,33 @@ def getConf(self):
 conf.setAll(self._conf.getAll())
 return conf
 
+def install_packages(self, packages, install_driver=True):
+"""
+install python packages on all executors and driver through pip
+:param packages: string for single package or a list of string for 
multiple packages
+:param install_driver: whether to install packages in client
--- End diff --

@HyukjinKwon Could you guide me on how to skip such doctests? Thanks
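
(For reference, a hedged sketch of the standard doctest directive that is typically used to skip such examples; the docstring shown is abridged:)

```python
def install_packages(self, packages, install_driver=True):
    """
    Install python packages on all executors and the driver through pip.

    >>> sc.install_packages("numpy")  # doctest: +SKIP
    """
```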


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-08 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r160308321
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java 
---
@@ -299,20 +301,39 @@
 // 4. environment variable PYSPARK_PYTHON
 // 5. python
List<String> pyargs = new ArrayList<>();
-pyargs.add(firstNonEmpty(conf.get(SparkLauncher.PYSPARK_DRIVER_PYTHON),
-  conf.get(SparkLauncher.PYSPARK_PYTHON),
-  System.getenv("PYSPARK_DRIVER_PYTHON"),
-  System.getenv("PYSPARK_PYTHON"),
-  "python"));
-String pyOpts = System.getenv("PYSPARK_DRIVER_PYTHON_OPTS");
-if (conf.containsKey(SparkLauncher.PYSPARK_PYTHON)) {
-  // pass conf spark.pyspark.python to python by environment variable.
-  env.put("PYSPARK_PYTHON", conf.get(SparkLauncher.PYSPARK_PYTHON));
-}
-if (!isEmpty(pyOpts)) {
-  pyargs.addAll(parseOptionString(pyOpts));
-}
+String pythonExec = 
firstNonEmpty(conf.get(SparkLauncher.PYSPARK_DRIVER_PYTHON),
+conf.get(SparkLauncher.PYSPARK_PYTHON),
+System.getenv("PYSPARK_DRIVER_PYTHON"),
+System.getenv("PYSPARK_PYTHON"),
+"python");
+if (conf.getOrDefault("spark.pyspark.virtualenv.enabled", 
"false").equals("true")) {
+  try {
+// setup virtualenv in launcher when virtualenv is enabled in 
pyspark shell
+Class virtualEnvClazz = 
getClass().forName("org.apache.spark.api.python.VirtualEnvFactory");
+Object virtualEnv = virtualEnvClazz.getConstructor(String.class, 
Map.class, Boolean.class)
+.newInstance(pythonExec, conf, true);
+Method virtualEnvMethod = 
virtualEnv.getClass().getMethod("setupVirtualEnv");
+pythonExec = (String) virtualEnvMethod.invoke(virtualEnv);
+pyargs.add(pythonExec);
+  } catch (Exception e) {
+throw new IOException(e);
+  }
+} else {
+  
pyargs.add(firstNonEmpty(conf.get(SparkLauncher.PYSPARK_DRIVER_PYTHON),
+  conf.get(SparkLauncher.PYSPARK_PYTHON),
+  System.getenv("PYSPARK_DRIVER_PYTHON"),
+  System.getenv("PYSPARK_PYTHON"),
+  "python"));
+  String pyOpts = System.getenv("PYSPARK_DRIVER_PYTHON_OPTS");
+  if (conf.containsKey(SparkLauncher.PYSPARK_PYTHON)) {
+// pass conf spark.pyspark.python to python by environment 
variable.
+env.put("PYSPARK_PYTHON", conf.get(SparkLauncher.PYSPARK_PYTHON));
+  }
 
+  if (!isEmpty(pyOpts)) {
+pyargs.addAll(parseOptionString(pyOpts));
+  }
--- End diff --

Good point, fixed


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-08 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r160285377
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+private[spark] class VirtualEnvFactory(pythonExec: String, conf: 
SparkConf, isDriver: Boolean)
+  extends Logging {
+
+  private var virtualEnvType = conf.get("spark.pyspark.virtualenv.type", 
"native")
+  private var virtualEnvPath = 
conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by launcher when user want to use virtualenv in pyspark shell. 
Launcher need this class
+  // to create virtualenv for driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: 
java.lang.Boolean) {
+this(pythonExec, new SparkConf(), isDriver)
--- End diff --

I am afraid not, because this method is used by the launcher module, where 
SparkConf is unknown.


---




[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark

2018-01-08 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/13599
  
@holdenk @ueshin @HyukjinKwon Thanks for reviewing this long-pending PR. Will 
refine it soon. 


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-08 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r160141363
  
--- Diff: python/pyspark/context.py ---
@@ -1023,6 +1039,33 @@ def getConf(self):
 conf.setAll(self._conf.getAll())
 return conf
 
+def install_packages(self, packages, install_driver=True):
--- End diff --

Users can call this method to install packages at runtime, e.g. a user can 
install a package in the pyspark shell. 


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-08 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r160140451
  
--- Diff: docs/submitting-applications.md ---
@@ -218,6 +218,73 @@ These commands can be used with `pyspark`, 
`spark-shell`, and `spark-submit` to
 For Python, the equivalent `--py-files` option can be used to distribute 
`.egg`, `.zip` and `.py` libraries
 to executors.
 
+# VirtualEnv for Pyspark
+For simple PySpark application, we can use `--py-files` to add its 
dependencies. While for a large PySpark application,
+usually you will have many dependencies which may also have transitive 
dependencies and even some dependencies need to be compiled
+to be installed. In this case `--py-files` is not so convenient. Luckily, 
in python world we have virtualenv/conda to help create isolated
+python work environment. We also implement virtualenv in PySpark (It is 
only supported in yarn mode for now). 
+
+# Prerequisites
+- Each node have virtualenv/conda, python-devel installed
+- Each node is internet accessible (for downloading packages)
+
+{% highlight bash %}
+# Setup virtualenv using native virtualenv on yarn-client mode
+bin/spark-submit \
+--master yarn \
+--deploy-mode client \
+--conf "spark.pyspark.virtualenv.enabled=true" \
+--conf "spark.pyspark.virtualenv.type=native" \
+--conf "spark.pyspark.virtualenv.requirements=<requirements_file>" \
+--conf "spark.pyspark.virtualenv.bin.path=<virtualenv_bin_path>" \
+
+
+# Setup virtualenv using conda on yarn-client mode
+bin/spark-submit \
+--master yarn \
+--deploy-mode client \
+--conf "spark.pyspark.virtualenv.enabled=true" \
+--conf "spark.pyspark.virtualenv.type=conda" \
+--conf "spark.pyspark.virtualenv.requirements=<requirements_file>" \
+--conf "spark.pyspark.virtualenv.bin.path=<conda_bin_path>" \
+
+{% endhighlight %}
+
+## PySpark VirtualEnv Configurations
+
+Property Name | Default | Meaning
+
+  spark.pyspark.virtualenv.enabled
+  false
+  Whether to enable virtualenv
+
+
+  Spark.pyspark.virtualenv.type
+  virtualenv
--- End diff --

I don't have a strong preference for `native`. The reason I use it is that 
I already use `virtualenv` in all the virtualenv-related configuration names, and I just 
don't want to introduce `virtualenv` here again, to avoid confusion. 


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-08 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r160139161
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+private[spark] class VirtualEnvFactory(pythonExec: String, conf: 
SparkConf, isDriver: Boolean)
+  extends Logging {
+
+  private var virtualEnvType = conf.get("spark.pyspark.virtualenv.type", 
"native")
+  private var virtualEnvPath = 
conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by launcher when user want to use virtualenv in pyspark shell. 
Launcher need this class
+  // to create virtualenv for driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: 
java.lang.Boolean) {
+this(pythonExec, new SparkConf(), isDriver)
+properties.asScala.foreach(entry => this.conf.set(entry._1, entry._2))
+virtualEnvType = conf.get("spark.pyspark.virtualenv.type", "native")
+virtualEnvPath = conf.get("spark.pyspark.virtualenv.bin.path")
+this.isLauncher = true
+  }
+
+  /*
+   * Create virtualenv using native virtualenv or conda
+   *
+   * Native Virtualenv:
+   *   -  Execute command: virtualenv -p <pythonExec> --no-site-packages <virtualenvBasedir>
+   *   -  Execute command: python -m pip --cache-dir <cacheDir> install -r <requirementsFile>
+   *
+   * Conda
+   *   -  Execute command: conda create --prefix <virtualenvBasedir> --file <requirementsFile> -y
+   *
+   */
+  def setupVirtualEnv(): String = {
+logInfo("Start to setup virtualenv...")
+logDebug("user.dir=" + System.getProperty("user.dir"))
+logDebug("user.home=" + System.getProperty("user.home"))
+
+require(virtualEnvType == "native" || virtualEnvType == "conda",
+  s"VirtualEnvType: ${virtualEnvType} is not supported." )
+require(new File(virtualEnvPath).exists(),
+  s"VirtualEnvPath: ${virtualEnvPath} is not defined or doesn't 
exist.")
+// Use a temp directory for virtualenv in the following cases:
+// 1. driver of pyspark shell
+// 2. driver of yarn-client mode
+// otherwise create the virtualenv folder under the executor working 
directory.
+if (isLauncher ||
+  (isDriver && conf.get("spark.submit.deployMode") == "client")) {
+  val virtualenv_basedir = Files.createTempDir()
+  virtualenv_basedir.deleteOnExit()
+  virtualEnvName = virtualenv_basedir.getAbsolutePath
+} else if (isDriver && conf.get("spark.submit.deployMode") == 
"cluster") {
+  virtualEnvName = "virtualenv_driver"
+} else {
+  // use the working directory of Executor
+  virtualEnvName = "virtualenv_" + conf.getAppId + "_" + 
VIRTUALENV_ID.getAndIncrement()
+}
+
+// Use the absolute path of requirement file in the following cases
+// 1. driver of pyspark shell
+// 2. driver of yarn-client mode
+// otherwise just use filename as it would be downloaded to the 
working directory of Executor
+val pyspark_requirements =
--- End diff --

Fixed


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-08 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r160138782
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+private[spark] class VirtualEnvFactory(pythonExec: String, conf: 
SparkConf, isDriver: Boolean)
+  extends Logging {
+
+  private var virtualEnvType = conf.get("spark.pyspark.virtualenv.type", 
"native")
+  private var virtualEnvPath = 
conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by launcher when user want to use virtualenv in pyspark shell. 
Launcher need this class
+  // to create virtualenv for driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: 
java.lang.Boolean) {
--- End diff --

This is because it will also be called on the Java side via reflection. 


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-08 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r160138391
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -60,6 +66,12 @@ private[spark] class PythonWorkerFactory(pythonExec: 
String, envVars: Map[String
 envVars.getOrElse("PYTHONPATH", ""),
 sys.env.getOrElse("PYTHONPATH", ""))
 
+
+  if (virtualEnvEnabled) {
+val virtualEnvFactory = new VirtualEnvFactory(pythonExec, conf, false)
+pythonExec = virtualEnvFactory.setupVirtualEnv()
--- End diff --

Fixed


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-08 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r160138349
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -29,7 +30,10 @@ import org.apache.spark._
 import org.apache.spark.internal.Logging
 import org.apache.spark.util.{RedirectThread, Utils}
 
-private[spark] class PythonWorkerFactory(pythonExec: String, envVars: 
Map[String, String])
+
+private[spark] class PythonWorkerFactory(var pythonExec: String,
--- End diff --

Fixed


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-07 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r160070613
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+private[spark] class VirtualEnvFactory(pythonExec: String, conf: 
SparkConf, isDriver: Boolean)
+  extends Logging {
+
+  private var virtualEnvType = conf.get("spark.pyspark.virtualenv.type", 
"native")
+  private var virtualEnvPath = 
conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by launcher when user want to use virtualenv in pyspark shell. 
Launcher need this class
+  // to create virtualenv for driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: 
java.lang.Boolean) {
+this(pythonExec, new SparkConf(), isDriver)
+properties.asScala.foreach(entry => this.conf.set(entry._1, entry._2))
+virtualEnvType = conf.get("spark.pyspark.virtualenv.type", "native")
+virtualEnvPath = conf.get("spark.pyspark.virtualenv.bin.path")
+this.isLauncher = true
+  }
+
+  /*
+   * Create virtualenv using native virtualenv or conda
+   *
+   * Native Virtualenv:
+   *   -  Execute command: virtualenv -p <pythonExec> --no-site-packages <virtualenvBasedir>
+   *   -  Execute command: python -m pip --cache-dir <cacheDir> install -r <requirementsFile>
+   *
+   * Conda
+   *   -  Execute command: conda create --prefix <virtualenvBasedir> --file <requirementsFile> -y
+   *
+   */
+  def setupVirtualEnv(): String = {
+logInfo("Start to setup virtualenv...")
+logDebug("user.dir=" + System.getProperty("user.dir"))
+logDebug("user.home=" + System.getProperty("user.home"))
+
+require(virtualEnvType == "native" || virtualEnvType == "conda",
+  s"VirtualEnvType: ${virtualEnvType} is not supported." )
+require(new File(virtualEnvPath).exists(),
--- End diff --

Sorry, the variable name is a little misleading; it should be 
`virtualEnvBinPath`, which is `spark.pyspark.virtualenv.bin.path`.


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-07 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r160070518
  
--- Diff: python/pyspark/context.py ---
@@ -980,6 +996,33 @@ def getConf(self):
 conf.setAll(self._conf.getAll())
 return conf
 
+def install_packages(self, packages, install_driver=True):
+"""
+install python packages on all executors and driver through pip
+:param packages: string for single package or a list of string for 
multiple packages
+:param install_driver: whether to install packages in client
+"""
+if self._conf.get("spark.pyspark.virtualenv.enabled") != "true":
+raise Exception("install_packages can only use called when "
+"spark.pyspark.virtualenv.enabled set as true")
+if isinstance(packages, basestring):
+packages = [packages]
+num_executors = int(self._conf.get("spark.executor.instances"))
+dummyRDD = self.parallelize(range(num_executors), num_executors)
--- End diff --

Right, it would not work when dynamic allocation is enabled. Will add 
a condition for that. 
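
For reference, the later revisions quoted earlier in this thread avoid `spark.executor.instances` entirely and ask the status tracker instead; roughly, with `sc` as the active SparkContext:

```python
# Works whether or not dynamic allocation is enabled: ask the live status tracker
# instead of reading a static conf value. getExecutorInfos() includes the driver,
# hence the -1.
num_executors = len(sc._jsc.sc().statusTracker().getExecutorInfos()) - 1
dummy_rdd = sc.parallelize(range(num_executors), num_executors)
```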


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-07 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r160070457
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -475,6 +475,19 @@ object SparkSubmit extends CommandLineUtils with 
Logging {
   }
 }
 
+// for pyspark virtualenv
+if (args.isPython) {
+  if (clusterManager != YARN &&
+args.sparkProperties.getOrElse("spark.pyspark.virtualenv.enabled", 
"false") == "true") {
+printErrorAndExit("virtualenv is only supported in yarn mode")
--- End diff --

I haven't tested it in standalone mode, so that is not guaranteed; supporting 
standalone mode later is on my plan. 


---




[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-07-05 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17222
  
Thanks @viirya 


---



[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-07-05 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17222
  
This PR fails the PySpark pip packaging tests, but I don't know what's 
wrong here. @holdenk Is the `PySpark pip packaging test` a known issue?


---



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-06-24 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17222#discussion_r123876794
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala
 ---
@@ -20,16 +20,19 @@ package org.apache.spark.sql.hive.execution
 import scala.collection.JavaConverters._
 import scala.util.Random
 
+import test.org.apache.spark.sql.MyDoubleAvg
+import test.org.apache.spark.sql.MyDoubleSum
+
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.expressions.UnsafeRow
 import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
UserDefinedAggregateFunction}
 import org.apache.spark.sql.functions._
-import org.apache.spark.sql.hive.aggregate.{MyDoubleAvg, MyDoubleSum}
 import org.apache.spark.sql.hive.test.TestHiveSingleton
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.SQLTestUtils
 import org.apache.spark.sql.types._
 
+
--- End diff --

It depends on some hive stuff (`TestHiveSingleton`), so I guess it is 
intended to be put in sql/hive. 


---



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-06-24 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17222#discussion_r123871674
  
--- Diff: 
sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java ---
@@ -31,7 +31,7 @@
 import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
 import static org.apache.spark.sql.functions.*;
 import org.apache.spark.sql.hive.test.TestHive$;
-import org.apache.spark.sql.hive.aggregate.MyDoubleSum;
+import test.org.apache.spark.sql.MyDoubleSum;
 
 public class JavaDataFrameSuite {
--- End diff --

Do you mean moving JavaDataFrameSuite to sql/core?


---



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-06-24 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17222#discussion_r123871670
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala
 ---
@@ -20,16 +20,19 @@ package org.apache.spark.sql.hive.execution
 import scala.collection.JavaConverters._
 import scala.util.Random
 
+import test.org.apache.spark.sql.MyDoubleAvg
+import test.org.apache.spark.sql.MyDoubleSum
+
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.expressions.UnsafeRow
 import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
UserDefinedAggregateFunction}
 import org.apache.spark.sql.functions._
-import org.apache.spark.sql.hive.aggregate.{MyDoubleAvg, MyDoubleSum}
 import org.apache.spark.sql.hive.test.TestHiveSingleton
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.SQLTestUtils
 import org.apache.spark.sql.types._
 
+
--- End diff --

I didn't add any test in this file. Or do you mean moving 
AggregationQuerySuite.scala to sql/core?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Support for deploying Anaconda an...

2017-06-16 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/14180
  
Here's my approach in #13599 for virtualenv and conda support; any 
comments and reviews are welcome.



https://docs.google.com/document/d/1EGNEf4vFmpGXSd2DPOLu_HL23Xhw9aWKeUrzzxsEbQs/edit?usp=sharing
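
For reference, a rough sketch of how a job might opt in to the proposed 
virtualenv support. The configuration names below are taken from the design 
doc / PR above and are assumptions, not settled API; they may change before 
merge.

```
# Hedged sketch: the spark.pyspark.virtualenv.* conf names are the ones
# proposed in this PR and may differ in the final version.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("virtualenv-demo")
         # enable per-executor virtualenv creation (proposed conf)
         .config("spark.pyspark.virtualenv.enabled", "true")
         # "native" for virtualenv/pip, "conda" for conda (proposed conf)
         .config("spark.pyspark.virtualenv.type", "native")
         # requirements file distributed to executors (proposed conf)
         .config("spark.pyspark.virtualenv.requirements", "requirements.txt")
         # path to the virtualenv/conda binary on each node (proposed conf)
         .config("spark.pyspark.virtualenv.bin.path", "/usr/bin/virtualenv")
         .getOrCreate())

# After this, packages listed in requirements.txt should be importable inside
# UDFs / RDD functions running on the executors.
```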



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-06-12 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17222
  
@gatorsmile Sorry for the late response, will update it soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-05-12 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17222#discussion_r116325723
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -491,20 +491,42 @@ class UDFRegistration private[sql] (functionRegistry: 
FunctionRegistry) extends
 case 21 => register(name, udf.asInstanceOf[UDF20[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
 case 22 => register(name, udf.asInstanceOf[UDF21[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
 case 23 => register(name, udf.asInstanceOf[UDF22[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
-case n => logError(s"UDF class with ${n} type arguments is not 
supported ")
+case n =>
+  throw new IOException(s"UDF class with ${n} type arguments 
is not supported.")
--- End diff --

Sorry, I missed your last comment; fixed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-05-12 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17222#discussion_r116293947
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -491,20 +491,42 @@ class UDFRegistration private[sql] (functionRegistry: 
FunctionRegistry) extends
 case 21 => register(name, udf.asInstanceOf[UDF20[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
 case 22 => register(name, udf.asInstanceOf[UDF21[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
 case 23 => register(name, udf.asInstanceOf[UDF22[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
-case n => logError(s"UDF class with ${n} type arguments is not 
supported ")
+case n =>
+  throw new IOException(s"UDF class with ${n} type arguments 
is not supported.")
--- End diff --

I didn't find a more appropriate exception type, so I just used IOException. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-05-12 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17222#discussion_r116293890
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala
 ---
@@ -20,16 +20,19 @@ package org.apache.spark.sql.hive.execution
 import scala.collection.JavaConverters._
 import scala.util.Random
 
+import _root_.test.org.apache.spark.sql.MyDoubleAvg
+import _root_.test.org.apache.spark.sql.MyDoubleSum
--- End diff --

oops, fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-05-05 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17222#discussion_r114963449
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -475,20 +475,42 @@ class UDFRegistration private[sql] (functionRegistry: 
FunctionRegistry) extends
 case 21 => register(name, udf.asInstanceOf[UDF20[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
 case 22 => register(name, udf.asInstanceOf[UDF21[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
 case 23 => register(name, udf.asInstanceOf[UDF22[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
-case n => logError(s"UDF class with ${n} type arguments is not 
supported ")
+case n =>
+  throw new IOException(s"UDF class with ${n} type arguments 
is not supported.")
   }
 } catch {
   case e @ (_: InstantiationException | _: 
IllegalArgumentException) =>
-logError(s"Can not instantiate class ${className}, please make 
sure it has public non argument constructor")
+throw new IOException(s"Can not instantiate class 
${className}, please make sure it has public non argument constructor")
 }
   }
 } catch {
-  case e: ClassNotFoundException => logError(s"Can not load class 
${className}, please make sure it is on the classpath")
+  case e: ClassNotFoundException => throw new IOException(s"Can not 
load class ${className}, please make sure it is on the classpath")
 }
 
   }
 
   /**
+   * Register a Java UDAF class using reflection, for use from pyspark
+   *
+   * @param name UDAF name
+   * @param classNamefully qualified class name of UDAF
+   */
+  private[sql] def registerJavaUDAF(name: String, className: String): Unit 
= {
--- End diff --

This is because on the Scala side, `registerJava` of `UDFRegistration` needs 
the returnType. Yeah, it does look a little weird for the Python side to require 
returnType. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-05-05 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17222#discussion_r114962484
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -475,20 +475,42 @@ class UDFRegistration private[sql] (functionRegistry: 
FunctionRegistry) extends
 case 21 => register(name, udf.asInstanceOf[UDF20[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
 case 22 => register(name, udf.asInstanceOf[UDF21[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
 case 23 => register(name, udf.asInstanceOf[UDF22[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
-case n => logError(s"UDF class with ${n} type arguments is not 
supported ")
+case n =>
+  throw new IOException(s"UDF class with ${n} type arguments 
is not supported.")
   }
 } catch {
   case e @ (_: InstantiationException | _: 
IllegalArgumentException) =>
-logError(s"Can not instantiate class ${className}, please make 
sure it has public non argument constructor")
+throw new IOException(s"Can not instantiate class 
${className}, please make sure it has public non argument constructor")
 }
   }
 } catch {
-  case e: ClassNotFoundException => logError(s"Can not load class 
${className}, please make sure it is on the classpath")
+  case e: ClassNotFoundException => throw new IOException(s"Can not 
load class ${className}, please make sure it is on the classpath")
 }
 
   }
 
   /**
+   * Register a Java UDAF class using reflection, for use from pyspark
+   *
+   * @param name UDAF name
+   * @param classNamefully qualified class name of UDAF
+   */
+  private[sql] def registerJavaUDAF(name: String, className: String): Unit 
= {
--- End diff --

I mean `registerJavaUDAF` in `context.py` doesn't have `returnType`, so here 
on the Scala side I don't provide `returnType` either, since this Scala method is 
only used for `registerJavaUDAF` in PySpark


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-05-05 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17222#discussion_r114948086
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -475,20 +475,42 @@ class UDFRegistration private[sql] (functionRegistry: 
FunctionRegistry) extends
 case 21 => register(name, udf.asInstanceOf[UDF20[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
 case 22 => register(name, udf.asInstanceOf[UDF21[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
 case 23 => register(name, udf.asInstanceOf[UDF22[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
-case n => logError(s"UDF class with ${n} type arguments is not 
supported ")
+case n =>
+  throw new IOException(s"UDF class with ${n} type arguments 
is not supported.")
   }
 } catch {
   case e @ (_: InstantiationException | _: 
IllegalArgumentException) =>
-logError(s"Can not instantiate class ${className}, please make 
sure it has public non argument constructor")
+throw new IOException(s"Can not instantiate class 
${className}, please make sure it has public non argument constructor")
 }
   }
 } catch {
-  case e: ClassNotFoundException => logError(s"Can not load class 
${className}, please make sure it is on the classpath")
+  case e: ClassNotFoundException => throw new IOException(s"Can not 
load class ${className}, please make sure it is on the classpath")
 }
 
   }
 
   /**
+   * Register a Java UDAF class using reflection, for use from pyspark
+   *
+   * @param name UDAF name
+   * @param classNamefully qualified class name of UDAF
+   */
+  private[sql] def registerJavaUDAF(name: String, className: String): Unit 
= {
--- End diff --

The PySpark side doesn't need `returnType`, so I didn't use `returnType` here; and 
it is a private function, so it stays open for adding `returnType` in the future if 
necessary. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-05-04 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17222
  
@cloud-fan This is not about using a Python UDF; it allows PySpark to 
use a Java UDF (no Python daemon will be launched). So it would actually improve 
performance. 
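
To make the point concrete, a small sketch of the difference using the existing 
`registerJavaFunction` API. It assumes a `sqlContext` is already available (as in 
the shell), and `com.example.StringLengthUDF` is a purely hypothetical Java UDF1 
class that would have to be on the classpath (e.g. via `--jars`).

```
from pyspark.sql.types import IntegerType

# Python UDF: every row is shipped to a Python worker process.
sqlContext.registerFunction("py_strlen", lambda s: len(s), IntegerType())

# Java UDF registered by reflection: evaluated inside the JVM,
# no Python worker round trip.
sqlContext.registerJavaFunction("java_strlen",
                                "com.example.StringLengthUDF",  # hypothetical class
                                IntegerType())

df = sqlContext.createDataFrame([("a",), ("bcd",)], ["name"])
df.registerTempTable("t")
sqlContext.sql("SELECT py_strlen(name), java_strlen(name) FROM t").show()
```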


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-05-04 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17222
  
@holdenk @gatorsmile Any more comments ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-04-24 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17222
  
@holdenk The link you pasted is for the case of using a Scala closure to 
create a UDF, while `registerJava` uses Java reflection to create the UDF. This is 
what I use in `registerJava`: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala#L528
 It returns Unit.
Maybe it is possible to create a `registerScala` that returns a Scala UDF, but it 
seems that is not possible for a Java UDF. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-04-24 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17222#discussion_r113085517
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -475,20 +475,42 @@ class UDFRegistration private[sql] (functionRegistry: 
FunctionRegistry) extends
 case 21 => register(name, udf.asInstanceOf[UDF20[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
 case 22 => register(name, udf.asInstanceOf[UDF21[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
 case 23 => register(name, udf.asInstanceOf[UDF22[_, _, _, _, 
_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType)
-case n => logError(s"UDF class with ${n} type arguments is not 
supported ")
+case n =>
+  throw new IOException(s"UDF class with ${n} type arguments 
is not supported.")
   }
 } catch {
   case e @ (_: InstantiationException | _: 
IllegalArgumentException) =>
-logError(s"Can not instantiate class ${className}, please make 
sure it has public non argument constructor")
+throw new IOException(s"Can not instantiate class 
${className}, please make sure it has public non argument constructor")
 }
   }
 } catch {
-  case e: ClassNotFoundException => logError(s"Can not load class 
${className}, please make sure it is on the classpath")
+  case e: ClassNotFoundException => throw new IOException(s"Can not 
load class ${className}, please make sure it is on the classpath")
 }
 
   }
 
   /**
+   * Register a Java UDAF class using reflection, for use from pyspark
+   *
+   * @param name UDAF name
+   * @param classNamefully qualified class name of UDAF
--- End diff --

Is @since needed for a private function?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-04-24 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17222
  
@holdenk But it has nothing to return, because the Scala side returns Unit.  See 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala#L528


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-04-20 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17222
  
ping @holdenk 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17586: [SPARK-20249][ML][PYSPARK] Add summary for Linear...

2017-04-12 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17586#discussion_r111279320
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -172,6 +172,47 @@ def intercept(self):
 """
 return self._call_java("intercept")
 
+@property
+@since("2.2.0")
--- End diff --

yes


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-04-11 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17222
  
Good catch ! @holdenk `return` is removed. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17586: [SPARK-20249][ML][PYSPARK] Add summary for Linear...

2017-04-11 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17586#discussion_r111042227
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -355,6 +368,19 @@ object LinearSVCModel extends 
MLReadable[LinearSVCModel] {
 }
 
 /**
+ * Abstraction for Linear SVC Training results.
+ * Currently, the training summary ignores the training weights except
+ * for the objective trace.
+ */
+case class LinearSVCTrainingSummary(
--- End diff --

The classes below `LinearSVCTrainingSummary` are private classes, so I 
think it would be better to keep `LinearSVCTrainingSummary` there (above the private 
classes)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17586: [SPARK-20249][ML][PYSPARK] Add summary for Linear...

2017-04-11 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17586#discussion_r111042049
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -355,6 +368,19 @@ object LinearSVCModel extends 
MLReadable[LinearSVCModel] {
 }
 
 /**
+ * Abstraction for Linear SVC Training results.
+ * Currently, the training summary ignores the training weights except
+ * for the objective trace.
+ */
--- End diff --

The weight column is also not included in `LogisticRegressionTrainingSummary`; 
should I add it as well?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16906: [SPARK-19570][PYSPARK] Allow to disable hive in pyspark ...

2017-04-11 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/16906
  
Kindly ping @holdenk 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17586: [SPARK-20249][ML][PYSPARK] Add summary for LinearSVCMode...

2017-04-10 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17586
  
@hhbyyh @jkbradley Please help review. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17586: [SPARK-20249][ML][PYSPARK] Add summary for LinearSVCMode...

2017-04-10 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17586
  
I didn't add metrics like ROC to this summary yet; I can add them if 
necessary.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17586: [SPARK-20249][ML][PYSPARK] Add summary for Linear...

2017-04-10 Thread zjffdu
GitHub user zjffdu opened a pull request:

https://github.com/apache/spark/pull/17586

[SPARK-20249][ML][PYSPARK] Add summary for LinearSVCModel

## What changes were proposed in this pull request?

Add a summary for LinearSVCModel so that users can get the training process 
status, such as the loss value at each iteration and the total number of iterations. 
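
A rough sketch of the intended usage from PySpark. The field names 
(`objectiveHistory`, `totalIterations`) are assumed to mirror 
`LogisticRegressionTrainingSummary`, based on the description above; they are not 
the final API.

```
# Hedged sketch: adjust accessor names to whatever the merged API exposes.
from pyspark.ml.classification import LinearSVC
from pyspark.ml.linalg import Vectors

train = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 1.1)), (0.0, Vectors.dense(2.0, 1.0)),
     (1.0, Vectors.dense(0.1, 1.5)), (0.0, Vectors.dense(1.8, 0.2))],
    ["label", "features"])

model = LinearSVC(maxIter=10, regParam=0.1).fit(train)

summary = model.summary                # training summary proposed in this PR
print(summary.totalIterations)         # total number of iterations
print(summary.objectiveHistory)        # loss value at each iteration
```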

## How was this patch tested?

Tested manually with both the Spark example and the PySpark example. 



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zjffdu/spark SPARK-20249

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17586.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17586


commit 368af49770f9bee35ea364c6c8a47c96eecf2eaa
Author: Jeff Zhang <zjf...@apache.org>
Date:   2017-04-10T06:17:48Z

[SPARK-20249][ML][PYSPARK] Add summary for LinearSVCModel




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-04-06 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17222
  
@viirya Thanks for the careful review. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-04-06 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17222#discussion_r110113683
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -436,6 +436,20 @@ def test_udf_with_order_by_and_limit(self):
 res.explain(True)
 self.assertEqual(res.collect(), [Row(id=0, copy=0)])
 
+def test_non_existed_udf(self):
+try:
+self.spark.udf.registerJavaFunction("udf1", "non_existed_udf")
+self.fail("should fail due to can not load java udf class")
+except py4j.protocol.Py4JError as e:
+self.assertTrue("Can not load class non_existed_udf" in e.desc)
+
+def test_non_existed_udaf(self):
+try:
+self.spark.udf.registerJavaFunction("udf1", "non_existed_udaf")
--- End diff --

Correct, fixed 😄 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-04-06 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17222#discussion_r110108533
  
--- Diff: python/pyspark/sql/context.py ---
@@ -228,6 +228,24 @@ def registerJavaFunction(self, name, javaClassName, 
returnType=None):
 jdt = 
self.sparkSession._jsparkSession.parseDataType(returnType.json())
 self.sparkSession._jsparkSession.udf().registerJava(name, 
javaClassName, jdt)
 
+@ignore_unicode_prefix
+@since(2.2)
+def registerJavaUDAF(self, name, javaClassName):
+"""Register a java UDAF so it can be used in SQL statements.
+
+:param name:  name of the UDF
+:param javaClassName: fully qualified name of java class
+
+>>> sqlContext.registerJavaUDAF("javaUDAF",
+...   "org.apache.spark.sql.hive.aggregate.MyDoubleAvg")
+>>> df = sqlContext.createDataFrame([(1, "a"),(2, "b"), (3, 
"a")],["id", "name"])
+>>> df.registerTempTable("df")
+>>> sqlContext.sql("SELECT name,javaUDAF(id) as avg from df group 
by name").collect()
+[Row(name=u'b', avg=102.0), Row(name=u'a', avg=102.0)]
--- End diff --

Good point, will add that


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-04-05 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17222
  
@holdenk Mind reviewing it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-03-29 Thread zjffdu
GitHub user zjffdu reopened a pull request:

https://github.com/apache/spark/pull/17222

[SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFunction Should Support 
UDAFs

## What changes were proposed in this pull request?

Support registering Java UDAFs in PySpark so that users can use Java UDAFs in 
PySpark. Besides that, I also add an API in `UDFRegistration`

## How was this patch tested?

Unit test is added




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zjffdu/spark SPARK-19439

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17222.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17222


commit 8c1e837e2e97c08c4a5753c79aea71da772b0eaa
Author: Jeff Zhang <zjf...@apache.org>
Date:   2017-03-09T07:06:50Z

[SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFunction Should Support 
UDAFs

commit 89b8d6588d4d6258f9c4d84339775544d93e6e3c
Author: Jeff Zhang <zjf...@apache.org>
Date:   2017-03-10T00:28:12Z

add scala doc




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-03-29 Thread zjffdu
Github user zjffdu closed the pull request at:

https://github.com/apache/spark/pull/17222


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17367: [MINOR][PYSPARK] Remove _inferSchema in context.py

2017-03-21 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17367
  
Closing it, as `_inferSchema` is still used in many places. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17367: [MINOR][PYSPARK] Remove _inferSchema in context.p...

2017-03-21 Thread zjffdu
Github user zjffdu closed the pull request at:

https://github.com/apache/spark/pull/17367


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16906: [SPARK-19570][PYSPARK] Allow to disable hive in pyspark ...

2017-03-20 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/16906
  
Yeah, makes sense. Fixed it. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17367: [MINOR][PYSPARK] Remove _inferSchema in context.p...

2017-03-20 Thread zjffdu
GitHub user zjffdu opened a pull request:

https://github.com/apache/spark/pull/17367

[MINOR][PYSPARK] Remove _inferSchema in context.py

## What changes were proposed in this pull request?

`_inferSchema` is not used in context.py; everything has been moved to 
`SparkSession`.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zjffdu/spark minor_pyspark

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17367.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17367


commit 42211f5e6bb1a281f7f99552095090a41a248ffa
Author: Jeff Zhang <zjf...@apache.org>
Date:   2017-03-21T03:37:12Z

[MINOR][PYSPARK] Remove _inferSchema in context.py




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark

2017-03-14 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/13599
  
@holdenk  Do you have time to review this ? Thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark

2017-03-14 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/13599
  
I created a Google doc about how to use it: 
https://docs.google.com/document/d/1KB9RYW8_bSeOzwVqZFc_zy_vXqqqctwrU5TROP_16Ds/edit?usp=sharing



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-03-10 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17222#discussion_r105392650
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala ---
@@ -484,6 +484,21 @@ class UDFRegistration private[sql] (functionRegistry: 
FunctionRegistry) extends
 
   }
 
+  private[sql] def registerJavaUDAF(name: String, className: String): Unit 
= {
--- End diff --

ScalaDoc is added


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...

2017-03-09 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17222
  
@holdenk @marmbrus Please help review


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...

2017-03-08 Thread zjffdu
GitHub user zjffdu opened a pull request:

https://github.com/apache/spark/pull/17222

[SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFunction Should Support 
UDAFs

## What changes were proposed in this pull request?

Support registering Java UDAFs in PySpark so that users can use Java UDAFs in 
PySpark.

## How was this patch tested?

Unit test is added




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zjffdu/spark SPARK-19439

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17222.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17222


commit 391c9591537f6c35d1aaffd4fa8238a4d13191e6
Author: Jeff Zhang <zjf...@apache.org>
Date:   2017-03-09T07:06:50Z

[SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFunction Should Support 
UDAFs




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17194: Add new aggregates EVERY and ANY (SOME).

2017-03-08 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/17194
  
@ptkool Please update the title to include the JIRA ID so that it can be 
linked to JIRA automatically. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #10307: [SPARK-12334][SQL][PYSPARK] Support read from mul...

2017-03-08 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/10307#discussion_r104874993
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -282,6 +282,23 @@ def parquet(self, *paths):
 """
 return self._df(self._jreader.parquet(_to_seq(self._spark._sc, 
paths)))
 
+@since(2.2)
+def parquet(self, path):
--- End diff --

Thanks @holdenk, I learned something new about Python. I reverted the changes 
on parquet. It would be very weird to change it to `def parquet(self, *paths, 
path=None):`, and `def parquet(self, **kwargs):` would break code that does not 
use keyword arguments, e.g. `parquet("p_file")`
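
A plain-Python illustration of the point (not a Spark API), showing why a 
keyword-only `path` after `*paths` reads oddly and why a pure `**kwargs` signature 
breaks positional callers. Note the keyword-only parameter is Python 3 syntax.

```
# Plain Python illustration only.

def parquet_varargs(*paths):                   # current signature style
    return list(paths)

def parquet_keyword_only(*paths, path=None):   # keyword-only arg (Python 3)
    return list(paths) + ([path] if path is not None else [])

def parquet_kwargs(**kwargs):                  # keyword-arguments-only style
    return kwargs.get("path")

parquet_varargs("p_file")         # ['p_file'] -- existing calls keep working
parquet_keyword_only("p_file")    # ['p_file'] -- works, but the mix is confusing
# parquet_kwargs("p_file")        # TypeError: takes 0 positional arguments
parquet_kwargs(path="p_file")     # only works with an explicit keyword
```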


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #10307: [SPARK-12334][SQL][PYSPARK] Support read from mul...

2017-03-07 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/10307#discussion_r104829036
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -407,15 +424,17 @@ def csv(self, path, schema=None, sep=None, 
encoding=None, quote=None, escape=Non
 
 @since(1.5)
 def orc(self, path):
-"""Loads an ORC file, returning the result as a :class:`DataFrame`.
+"""Loads ORC files, returning the result as a :class:`DataFrame`.
--- End diff --

It is in `python/pyspark/sql/tests.py`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #10307: [SPARK-12334][SQL][PYSPARK] Support read from mul...

2017-02-28 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/10307#discussion_r103600310
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -388,16 +388,18 @@ def csv(self, path, schema=None, sep=None, 
encoding=None, quote=None, escape=Non
 return 
self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
 
 @since(1.5)
-def orc(self, path):
-"""Loads an ORC file, returning the result as a :class:`DataFrame`.
+def orc(self, paths):
--- End diff --

Good catch, I should not break compatibility.  BTW, I found that 
`DataFrameReader.parquet` uses variable-length arguments, which is not consistent 
with other file formats such as text, json, and orc that take a string or a list of 
strings. I can fix this in this PR or do it in another PR to make them 
consistent. What do you think?

```
@since(1.4)
def parquet(self, *paths):
"""Loads Parquet files, returning the result as a 
:class:`DataFrame`.
```
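
For illustration, the two calling conventions side by side (file paths are made 
up; json and text accept a list of paths in the Python API, while parquet takes 
varargs):

```
# Illustrative file paths only.
df1 = spark.read.parquet("data/p1.parquet", "data/p2.parquet")   # varargs
df2 = spark.read.json(["data/a.json", "data/b.json"])            # list of paths
df3 = spark.read.text(["data/a.txt", "data/b.txt"])              # list of paths
```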


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16907: [SPARK-19572][SPARKR] Allow to disable hive in sp...

2017-02-28 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/16907#discussion_r103599670
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala 
---
@@ -47,12 +47,15 @@ private[sql] object SQLUtils extends Logging {
   jsc: JavaSparkContext,
   sparkConfigMap: JMap[Object, Object],
   enableHiveSupport: Boolean): SparkSession = {
-val spark = if (SparkSession.hiveClassesArePresent && 
enableHiveSupport) {
+val spark = if (SparkSession.hiveClassesArePresent && enableHiveSupport
+&& jsc.sc.conf.get(CATALOG_IMPLEMENTATION.key, "hive").toLowerCase 
== "hive") {
   
SparkSession.builder().sparkContext(withHiveExternalCatalog(jsc.sc)).getOrCreate()
 } else {
-  if (enableHiveSupport) {
+  if (enableHiveSupport
+&& jsc.sc.conf.get(CATALOG_IMPLEMENTATION.key, "hive").toLowerCase 
== "hive") {
 logWarning("SparkR: enableHiveSupport is requested for 
SparkSession but " +
-  "Spark is not built with Hive; falling back to without Hive 
support.")
+  s"Spark is not built with Hive or ${CATALOG_IMPLEMENTATION.key} 
is not set to 'hive', " +
--- End diff --

Good catch, I updated the if condition to match the message. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16907: [SPARK-19572][SPARKR] Allow to disable hive in sparkR sh...

2017-02-24 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/16907
  
Yeah, it would be nice to have this merged into 2.1 as well. Thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16907: [SPARK-19572][SPARKR] Allow to disable hive in sparkR sh...

2017-02-24 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/16907
  
Seems like a flaky test, let me trigger the build


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16907: [SPARK-19572][SPARKR] Allow to disable hive in sp...

2017-02-23 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/16907#discussion_r102865692
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala 
---
@@ -48,13 +48,14 @@ private[sql] object SQLUtils extends Logging {
   sparkConfigMap: JMap[Object, Object],
   enableHiveSupport: Boolean): SparkSession = {
 val spark = if (SparkSession.hiveClassesArePresent && enableHiveSupport
-&& jsc.sc.conf.get(CATALOG_IMPLEMENTATION.key, "hive") == "hive") {
+&& jsc.sc.conf.get(CATALOG_IMPLEMENTATION.key, "hive").toLowerCase 
== "hive") {
   
SparkSession.builder().sparkContext(withHiveExternalCatalog(jsc.sc)).getOrCreate()
 } else {
   if (enableHiveSupport
-&& jsc.sc.conf.get(CATALOG_IMPLEMENTATION.key, "hive") == "hive") {
+&& jsc.sc.conf.get(CATALOG_IMPLEMENTATION.key, "hive").toLowerCase 
== "hive") {
 logWarning("SparkR: enableHiveSupport is requested for 
SparkSession but " +
-  "Spark is not built with Hive; falling back to without Hive 
support.")
+  s"Spark is not built with Hive or Hive is disabled via 
${CATALOG_IMPLEMENTATION.key} " +
--- End diff --

Ping @felixcheung, message is updated.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #11211: [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated t...

2017-02-22 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/11211
  
@holdenk The description is updated. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #11211: [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated t...

2017-02-20 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/11211
  
Ping @holdenk @HyukjinKwon, the PR is updated; please help review. Thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #11211: [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated t...

2017-02-13 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/11211
  
Sorry for the late reply, I may come back to this issue later this week. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16907: [SPARK-19572][SPARKR] Allow to disable hive in sparkR sh...

2017-02-13 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/16907
  
Addressed the comments. @felixcheung, correct, `shell.R` is not supposed to 
be used externally. This ticket is mainly for disabling Hive in the sparkR shell; 
sparkR batch mode already supports this feature. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16907: [SPARK-19572][SPARKR] Allow to disable hive in sparkR sh...

2017-02-13 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/16907
  
@felixcheung  Please help review. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16907: [SPARK-19582][SPARKR] Allow to disable hive in sp...

2017-02-12 Thread zjffdu
GitHub user zjffdu opened a pull request:

https://github.com/apache/spark/pull/16907

[SPARK-19582][SPARKR] Allow to disable hive in sparkR shell

## What changes were proposed in this pull request?
SPARK-15236 did this for the Scala shell; this ticket does the same for the SparkR shell. This 
is not only for SparkR itself, but can also benefit downstream projects like Livy, which uses 
shell.R for its interactive sessions. For now, Livy has no control over whether Hive is enabled 
or not.
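
As a rough illustration, here is a minimal Scala sketch of the kind of check this relies on (a 
hypothetical helper, not the actual shell.R change; it assumes Spark SQL and, for the Hive path, 
the Hive classes are on the classpath):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch only: enable Hive support unless the user asked for the in-memory catalog,
// e.g. via --conf spark.sql.catalogImplementation=in-memory.
object ShellSessionSketch {
  def createSession(conf: SparkConf): SparkSession = {
    val builder = SparkSession.builder.config(conf)
    if (conf.get("spark.sql.catalogImplementation", "hive") == "hive") {
      builder.enableHiveSupport().getOrCreate()
    } else {
      builder.getOrCreate()
    }
  }
}
```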

## How was this patch tested?

Tested it manually: ran `bin/sparkR --master local --conf 
spark.sql.catalogImplementation=in-memory` and verified Hive is not enabled. 



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zjffdu/spark SPARK-19572

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16907.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16907


commit 8329be6dce176022d08bb3109dc994434bf7c84a
Author: Jeff Zhang <zjf...@apache.org>
Date:   2017-02-13T05:52:22Z

[SPARK-19582][SPARKR] Allow to disable hive in sparkR shell




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16906: [SPARK-19570][PYSPARK] Allow to disable hive in pyspark ...

2017-02-12 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/16906
  
 @holdenk Please help review 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16906: [SPARK-19570][PYSPARK] Allow to disable hive in p...

2017-02-12 Thread zjffdu
GitHub user zjffdu opened a pull request:

https://github.com/apache/spark/pull/16906

[SPARK-19570][PYSPARK] Allow to disable hive in pyspark shell

## What changes were proposed in this pull request?

SPARK-15236 did this for the Scala shell; this ticket does the same for the PySpark shell. This 
is not only for PySpark itself, but can also benefit downstream projects like Livy, which uses 
shell.py for its interactive sessions. For now, Livy has no control over whether Hive is enabled 
or not.

## How was this patch tested?

I didn't find a way to add a test for it, so I just tested it manually: ran 
`bin/pyspark --master local --conf spark.sql.catalogImplementation=in-memory` 
and verified Hive is not enabled. 




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zjffdu/spark SPARK-19570

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16906.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16906


commit 431bcf8d332afe9d971b1f44a51e5dd2ca32ff81
Author: Jeff Zhang <zjf...@apache.org>
Date:   2017-02-13T02:03:40Z

[SPARK-19570][PYSPARK] Allow to disable hive in pyspark shell




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13557: [SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of...

2016-11-28 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/13557
  
@sethah Thanks for the review, I have updated the PR. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15669: [SPARK-18160][CORE][YARN] spark.files & spark.jars shoul...

2016-11-01 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/15669
  
Hmm, I notice spark.files is still passed to SparkContext in yarn-client 
mode; it seems I need to handle that in SparkSubmit.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15669: [SPARK-18160][CORE][YARN] spark.files should not be pass...

2016-11-01 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/15669
  
That's correct, this PR will also fix the yarn-client case. PR title is 
updated. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15669: [SPARK-18160][CORE][YARN] spark.files should not ...

2016-11-01 Thread zjffdu
Github user zjffdu closed the pull request at:

https://github.com/apache/spark/pull/15669


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15669: [SPARK-18160][CORE][YARN] spark.files should not ...

2016-11-01 Thread zjffdu
GitHub user zjffdu reopened a pull request:

https://github.com/apache/spark/pull/15669

[SPARK-18160][CORE][YARN] spark.files should not be passed to driver in 
yarn-cluster mode

## What changes were proposed in this pull request?

spark.files is still passed to the driver in yarn mode, so SparkContext will 
still handle it, which causes the error described in the JIRA.
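
For illustration only, a minimal Scala sketch of the idea (a hypothetical helper; the actual 
change lives in SparkSubmit/SparkContext and differs in detail): since YARN already ships these 
files through its distributed cache, the corresponding entries can be dropped from the conf the 
driver sees.

```scala
import org.apache.spark.SparkConf

// Sketch only: drop conf entries that YARN already handles via the distributed
// cache, so that SparkContext does not try to add the files again on the driver.
object DropYarnHandledEntries {
  def sanitize(conf: SparkConf): SparkConf = {
    if (conf.get("spark.master", "") == "yarn") {
      conf.remove("spark.files")
      conf.remove("spark.jars")
    }
    conf
  }
}
```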

## How was this patch tested?

Tested manually in a 5-node cluster. As this issue only happens in a multi-node 
cluster, I didn't write a test for it. 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zjffdu/spark SPARK-18160

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15669.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15669


commit 67a5ccf7a9d02a8b930ab97e10c0858b4d046496
Author: Jeff Zhang <zjf...@apache.org>
Date:   2016-10-28T07:50:59Z

[SPARK-18160][CORE][YARN] SparkContext.addFile doesn't work in yarn-cluster 
mode

commit 8033bd1bce9aa2fa0b05fe53c66ea072656dbd23
Author: Jeff Zhang <zjf...@apache.org>
Date:   2016-10-31T23:23:53Z

Revert "[SPARK-18160][CORE][YARN] SparkContext.addFile doesn't work in 
yarn-cluster mode"

This reverts commit 67a5ccf7a9d02a8b930ab97e10c0858b4d046496.

commit 230d56c90ce7ff30251a03afe6c677fe9df8faca
Author: Jeff Zhang <zjf...@apache.org>
Date:   2016-11-01T03:58:08Z

remove spark.files & spark.jars from SparkConf in yarn mode




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2016-10-31 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r85877935
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -69,6 +84,66 @@ private[spark] class PythonWorkerFactory(pythonExec: 
String, envVars: Map[String
   }
 
   /**
+   * Create virtualenv using native virtualenv or conda
+   *
+   * Native Virtualenv:
+   *   -  Execute command: virtualenv -p pythonExec --no-site-packages 
virtualenvName
+   *   -  Execute command: python -m pip --cache-dir cache-dir install -r 
requirement_file
+   *
+   * Conda
+   *   -  Execute command: conda create --prefix prefix --file 
requirement_file -y
+   *
+   */
+  def setupVirtualEnv(): Unit = {
+logDebug("Start to setup virtualenv...")
+logDebug("user.dir=" + System.getProperty("user.dir"))
+logDebug("user.home=" + System.getProperty("user.home"))
+
+require(virtualEnvType == "native" || virtualEnvType == "conda",
+  s"VirtualEnvType: ${virtualEnvType} is not supported" )
+virtualEnvName = "virtualenv_" + conf.getAppId + "_" + 
VIRTUALENV_ID.getAndIncrement()
+// use the absolute path when it is local mode otherwise just use 
filename as it would be
+// fetched from FileServer
+val pyspark_requirements =
+  if (Utils.isLocalMaster(conf)) {
+conf.get("spark.pyspark.virtualenv.requirements")
+  } else {
+conf.get("spark.pyspark.virtualenv.requirements").split("/").last
+  }
+
+val createEnvCommand =
+  if (virtualEnvType == "native") {
+Arrays.asList(virtualEnvPath,
+  "-p", pythonExec,
+  "--no-site-packages", virtualEnvName)
+  } else {
+Arrays.asList(virtualEnvPath,
+  "create", "--prefix", System.getProperty("user.dir") + "/" + 
virtualEnvName,
+  "--file", pyspark_requirements, "-y")
+  }
+execCommand(createEnvCommand)
+// virtualenv will be created in the working directory of Executor.
+virtualPythonExec = virtualEnvName + "/bin/python"
--- End diff --

Not tested yet on Windows. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2016-10-31 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r85877793
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -69,6 +84,66 @@ private[spark] class PythonWorkerFactory(pythonExec: 
String, envVars: Map[String
   }
 
   /**
+   * Create virtualenv using native virtualenv or conda
+   *
+   * Native Virtualenv:
+   *   -  Execute command: virtualenv -p pythonExec --no-site-packages 
virtualenvName
+   *   -  Execute command: python -m pip --cache-dir cache-dir install -r 
requirement_file
+   *
+   * Conda
+   *   -  Execute command: conda create --prefix prefix --file 
requirement_file -y
+   *
+   */
+  def setupVirtualEnv(): Unit = {
+logDebug("Start to setup virtualenv...")
+logDebug("user.dir=" + System.getProperty("user.dir"))
+logDebug("user.home=" + System.getProperty("user.home"))
+
+require(virtualEnvType == "native" || virtualEnvType == "conda",
+  s"VirtualEnvType: ${virtualEnvType} is not supported" )
+virtualEnvName = "virtualenv_" + conf.getAppId + "_" + 
VIRTUALENV_ID.getAndIncrement()
+// use the absolute path when it is local mode otherwise just use 
filename as it would be
+// fetched from FileServer
+val pyspark_requirements =
+  if (Utils.isLocalMaster(conf)) {
+conf.get("spark.pyspark.virtualenv.requirements")
+  } else {
+conf.get("spark.pyspark.virtualenv.requirements").split("/").last
+  }
+
+val createEnvCommand =
+  if (virtualEnvType == "native") {
+Arrays.asList(virtualEnvPath,
+  "-p", pythonExec,
+  "--no-site-packages", virtualEnvName)
+  } else {
+Arrays.asList(virtualEnvPath,
+  "create", "--prefix", System.getProperty("user.dir") + "/" + 
virtualEnvName,
+  "--file", pyspark_requirements, "-y")
+  }
+execCommand(createEnvCommand)
+// virtualenv will be created in the working directory of Executor.
+virtualPythonExec = virtualEnvName + "/bin/python"
+if (virtualEnvType == "native") {
+  execCommand(Arrays.asList(virtualPythonExec, "-m", "pip",
+"--cache-dir", System.getProperty("user.home"),
+"install", "-r", pyspark_requirements))
+}
+  }
+
+  def execCommand(commands: java.util.List[String]): Unit = {
+logDebug("Running command:" + commands.asScala.mkString(" "))
+val pb = new ProcessBuilder(commands).inheritIO()
+// pip internally use environment variable `HOME`
+pb.environment().put("HOME", System.getProperty("user.home"))
--- End diff --

In YARN mode, HOME is "/home/", which is not correct, so here I get it from the 
system property `user.home`. From launch_container.sh:
```
export HOME="/home/"
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark

2016-10-31 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/13599
  
Thanks for the review @mridulm. This approach tries to move the overhead from the user to the 
cluster: the user just needs to specify the requirements file and Spark will set up the 
virtualenv automatically, which is consistent with how virtualenv is used on a single machine. 
`SPARK-16367` takes another approach, distributing the dependencies via spark-submit, but it 
requires the cluster to be homogeneous.
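
For readers wondering what this looks like from the user side, a rough sketch of the configuration 
surface follows. Only `spark.pyspark.virtualenv.requirements` is visible in the diff above; the 
other property names and values are assumptions made for illustration, not confirmed names.

```scala
import org.apache.spark.SparkConf

// Sketch only: apart from spark.pyspark.virtualenv.requirements, the property
// names below are assumptions about this PR's configuration surface.
object VirtualenvConfSketch {
  def example(): SparkConf = new SparkConf()
    .set("spark.pyspark.virtualenv.enabled", "true")                  // assumed enable flag
    .set("spark.pyspark.virtualenv.type", "conda")                    // "native" or "conda", per the diff
    .set("spark.pyspark.virtualenv.requirements", "requirements.txt") // in cluster mode only the file name is used, per the diff
    .set("spark.pyspark.virtualenv.bin.path", "/usr/bin/conda")       // assumed path to the virtualenv/conda binary
}
```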


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15669: [SPARK-18160][CORE][YARN] spark.files should not ...

2016-10-31 Thread zjffdu
Github user zjffdu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15669#discussion_r85872343
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1716,29 +1716,12 @@ class SparkContext(config: SparkConf) extends 
Logging {
 key = uri.getScheme match {
   // A JAR file which exists only on the driver node
   case null | "file" =>
-if (master == "yarn" && deployMode == "cluster") {
-  // In order for this to work in yarn cluster mode the user 
must specify the
-  // --addJars option to the client to upload the file into 
the distributed cache
-  // of the AM to make it show up in the current working 
directory.
-  val fileName = new Path(uri.getPath).getName()
-  try {
-env.rpcEnv.fileServer.addJar(new File(fileName))
-  } catch {
-case e: Exception =>
-  // For now just log an error but allow to go through so 
spark examples work.
-  // The spark examples don't really need the jar 
distributed since its also
-  // the app jar.
-  logError("Error adding jar (" + e + "), was the 
--addJars option used?")
-  null
-  }
-} else {
-  try {
-env.rpcEnv.fileServer.addJar(new File(uri.getPath))
-  } catch {
-case exc: FileNotFoundException =>
-  logError(s"Jar not found at $path")
-  null
-  }
--- End diff --

These lines are obsolete code IMO, so I removed them. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15669: [SPARK-18160][CORE][YARN] SparkContext.addFile doesn't w...

2016-10-31 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/15669
  
That's correct, it is due to `spark.files`; the JIRA has been updated. Will 
update the PR soon.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15669: [SPARK-18160][CORE][YARN] SparkContext.addFile doesn't w...

2016-10-31 Thread zjffdu
Github user zjffdu commented on the issue:

https://github.com/apache/spark/pull/15669
  
spark.files would still be passed to the driver even in yarn-cluster mode, as you can see in the 
following code:

https://github.com/apache/spark/blob/7bf8a4049866b2ec7fdf0406b1ad0c3a12488645/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L609
```
// Load any properties specified through --conf and the default properties file
for ((k, v) <- args.sparkProperties) {
  sysProps.getOrElseUpdate(k, v)
}
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


