[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r193664416

--- Diff: docs/submitting-applications.md ---
@@ -218,6 +218,115 @@ These commands can be used with `pyspark`, `spark-shell`, and `spark-submit` to
 For Python, the equivalent `--py-files` option can be used to distribute `.egg`, `.zip` and `.py` libraries to executors.

+# VirtualEnv for PySpark
--- End diff --

Thanks @HyukjinKwon, the doc is updated.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/13599

Thanks for the interest in this PR and the info about `Pipfiles`. I think we could support that after this PR gets merged, so that we can provide users more options for virtualenv based on their environment.
[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/13493

Thanks @jkbradley. The failed tests seem unrelated.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/13599

That would be awesome.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/13599

I am afraid I will not be able to attend Strata SJ; I live in Shanghai, China, and may not be able to travel at that time.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/13599

ping @holdenk @HyukjinKwon
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r164646157

--- Diff: core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+private[spark] class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: Boolean)
+  extends Logging {
+
+  private var virtualEnvType = conf.get("spark.pyspark.virtualenv.type", "native")
+  private var virtualEnvPath = conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by the launcher when the user wants to use virtualenv in the pyspark shell. The
+  // launcher needs this class to create the virtualenv for the driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: java.lang.Boolean) {
--- End diff --

It is used by the launcher module, which doesn't depend on Scala.
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r164069473

--- Diff: core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala ---
@@ -39,12 +39,17 @@ object PythonRunner {
     val pyFiles = args(1)
     val otherArgs = args.slice(2, args.length)
     val sparkConf = new SparkConf()
-    val pythonExec = sparkConf.get(PYSPARK_DRIVER_PYTHON)
+    var pythonExec = sparkConf.get(PYSPARK_DRIVER_PYTHON)
--- End diff --

I am afraid I have to use `var`, because `VirtualEnvFactory` also needs `pythonExec` as a constructor arg.
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r164068172

--- Diff: core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: Boolean)
+  extends Logging {
+
+  private val virtualEnvType = conf.get("spark.pyspark.virtualenv.type", "native")
+  private val virtualEnvBinPath = conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private val initPythonPackages = conf.getOption("spark.pyspark.virtualenv.packages")
--- End diff --

For an existing running executor, the only way to install additional packages is via `sc.install_packages`. `spark.pyspark.virtualenv.packages` is updated on the driver side first when `sc.install_packages` is called; a newly allocated executor then fetches this property and installs all the additional packages properly.
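To make the flow described here concrete, a minimal usage sketch of the API this PR proposes, assuming a session launched with `spark.pyspark.virtualenv.enabled=true` and `sc` as the active SparkContext:

```python
# Installs on the driver first, then on all currently running executors.
# The driver also appends the packages to spark.pyspark.virtualenv.packages,
# so executors allocated later install them when their virtualenv is created.
sc.install_packages("numpy")               # a single package as a string
sc.install_packages(["pandas", "scipy"])   # or a list of packages

import numpy as np
sc.range(3).map(lambda x: np.__version__).collect()
```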
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r164037516

--- Diff: python/pyspark/context.py ---
@@ -1023,6 +1032,41 @@ def getConf(self):
         conf.setAll(self._conf.getAll())
         return conf

+    def install_packages(self, packages):
+        """
+        Install python packages on all executors and the driver through pip. pip will be
+        installed by default no matter whether using native virtualenv or conda, so it is
+        guaranteed that pip is available if virtualenv is enabled.
+        :param packages: string for a single package or a list of strings for multiple packages
+        """
+        if self._conf.get("spark.pyspark.virtualenv.enabled") != "true":
+            raise RuntimeError("install_packages can only be called when "
+                               "spark.pyspark.virtualenv.enabled is set to true")
+        if isinstance(packages, basestring):
+            packages = [packages]
+        # statusTracker.getExecutorInfos() returns driver + executors, so -1 here.
+        num_executors = len(self._jsc.sc().statusTracker().getExecutorInfos()) - 1
+        dummyRDD = self.parallelize(range(num_executors), num_executors)
+
+        def _run_pip(packages, iterator):
+            import pip
+            return pip.main(["install"] + packages)
+
+        # install packages on the driver first. if installation succeeded, continue the
+        # installation on executors, otherwise return directly.
+        if _run_pip(packages, None) != 0:
+            return
+
+        virtualenvPackages = self._conf.get("spark.pyspark.virtualenv.packages")
+        if virtualenvPackages:
+            self._conf.set("spark.pyspark.virtualenv.packages",
+                           virtualenvPackages + "," + ",".join(packages))
+        else:
+            self._conf.set("spark.pyspark.virtualenv.packages", ",".join(packages))
--- End diff --

Good catch, I will use a different separator.
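The catch behind this exchange: a bare comma cannot safely delimit the package list, because pip requirement specifiers may themselves contain commas. A small illustration; the newline separator below is only an example, since the PR does not say which separator was chosen:

```python
packages = ["numpy>=1.13,<2.0", "pandas"]

# Comma-joining corrupts the list on the way back:
",".join(packages).split(",")   # ['numpy>=1.13', '<2.0', 'pandas']

# Any character that cannot occur in a requirement specifier round-trips:
SEP = "\n"
assert SEP.join(packages).split(SEP) == packages
```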
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r164037239

--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala ---
@@ -98,7 +98,7 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
   private val reviveThread =
     ThreadUtils.newDaemonSingleThreadScheduledExecutor("driver-revive-thread")

-  class DriverEndpoint(override val rpcEnv: RpcEnv, sparkProperties: Seq[(String, String)])
+  class DriverEndpoint(override val rpcEnv: RpcEnv)
--- End diff --

Without this change, the following scenario won't work:

1. Launch the Spark app
2. Call `sc.install_packages("numpy")`
3. Run `sc.range(3).map(lambda x: np.__version__).collect()`
4. Restart an executor (by killing it; the scheduler will schedule another executor)
5. Run `sc.range(3).map(lambda x: np.__version__).collect()` again; this time it would fail, because the newly scheduled executor cannot set up the virtualenv correctly, as it cannot get the updated `spark.pyspark.virtualenv.packages`.

That's why I made this change in core. Now an executor always gets the updated SparkConf, instead of the SparkConf snapshot created when the Spark app started. There's some overhead, but I believe it is trivial, and it could be improved later.
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r164037055

--- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java ---
@@ -299,20 +300,34 @@
     // 4. environment variable PYSPARK_PYTHON
     // 5. python
     List<String> pyargs = new ArrayList<>();
-    pyargs.add(firstNonEmpty(conf.get(SparkLauncher.PYSPARK_DRIVER_PYTHON),
+    String pythonExec = firstNonEmpty(conf.get(SparkLauncher.PYSPARK_DRIVER_PYTHON),
       conf.get(SparkLauncher.PYSPARK_PYTHON),
       System.getenv("PYSPARK_DRIVER_PYTHON"),
       System.getenv("PYSPARK_PYTHON"),
-      "python"));
-    String pyOpts = System.getenv("PYSPARK_DRIVER_PYTHON_OPTS");
-    if (conf.containsKey(SparkLauncher.PYSPARK_PYTHON)) {
-      // pass conf spark.pyspark.python to python by environment variable.
-      env.put("PYSPARK_PYTHON", conf.get(SparkLauncher.PYSPARK_PYTHON));
+      "python");
+    if (conf.getOrDefault("spark.pyspark.virtualenv.enabled", "false").equals("true")) {
+      try {
+        // setup virtualenv in launcher when virtualenv is enabled in pyspark shell
+        Class virtualEnvClazz = Class.forName("org.apache.spark.api.python.VirtualEnvFactory");
--- End diff --

Because the launcher module does not depend on the core module explicitly in pom.xml.
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r164036488

--- Diff: core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: Boolean)
+  extends Logging {
+
+  private val virtualEnvType = conf.get("spark.pyspark.virtualenv.type", "native")
+  private val virtualEnvBinPath = conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private val initPythonPackages = conf.getOption("spark.pyspark.virtualenv.packages")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by the launcher when the user wants to use virtualenv in the pyspark shell. The
+  // launcher needs this class to create the virtualenv for the driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: java.lang.Boolean) {
+    this(pythonExec, new SparkConf().setAll(properties.asScala), isDriver)
+    this.isLauncher = true
+  }
+
+  /*
+   * Create virtualenv using native virtualenv or conda
+   */
+  def setupVirtualEnv(): String = {
+    /*
+     * Native Virtualenv:
+     *   - Execute command: virtualenv -p <pythonExec> --no-site-packages <virtualenvDir>
+     *   - Execute command: python -m pip --cache-dir <cacheDir> install -r <requirementsFile>
+     *
+     * Conda
+     *   - Execute command: conda create --prefix <virtualenvDir> --file <requirementsFile> -y
+     */
+    logInfo("Start to setup virtualenv...")
+    logDebug("user.dir=" + System.getProperty("user.dir"))
+    logDebug("user.home=" + System.getProperty("user.home"))
+
+    require(virtualEnvType == "native" || virtualEnvType == "conda",
+      s"VirtualEnvType: $virtualEnvType is not supported.")
+    require(new File(virtualEnvBinPath).exists(),
+      s"VirtualEnvBinPath: $virtualEnvBinPath is not defined or doesn't exist.")
+    // Two scenarios of creating virtualenv:
+    // 1. created in yarn container. Yarn will clean it up after container is exited
+    // 2. created outside yarn container. Spark need to create temp directory and clean it
+    //    after app finish.
+    //      - driver of PySpark shell
+    //      - driver of yarn-client mode
+    if (isLauncher ||
+        (isDriver && conf.get("spark.submit.deployMode") == "client")) {
+      val virtualenvBasedir = Files.createTempDir()
+      virtualenvBasedir.deleteOnExit()
+      virtualEnvName = virtualenvBasedir.getAbsolutePath
+    } else if (isDriver && conf.get("spark.submit.deployMode") == "cluster") {
+      virtualEnvName = "virtualenv_driver"
+    } else {
+      // use the working directory of Executor
+      virtualEnvName = "virtualenv_" + conf.getAppId + "_" + VIRTUALENV_ID.getAndIncrement()
+    }
+
+    // Use the absolute path of the requirement file in the following cases:
+    // 1. driver of pyspark shell
+    // 2. driver of yarn-client mode
+    // otherwise just use the filename as it would be downloaded to the working directory of Executor
+    val pysparkRequirements =
+      if (isLauncher ||
+          (isDriver && conf.get("spark.subm
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r164034871

--- Diff: python/pyspark/context.py ---
@@ -1023,6 +1032,41 @@ def getConf(self):
         conf.setAll(self._conf.getAll())
         return conf

+    def install_packages(self, packages):
+        """
+        Install python packages on all executors and the driver through pip. pip will be
+        installed by default no matter whether using native virtualenv or conda, so it is
+        guaranteed that pip is available if virtualenv is enabled.
+        :param packages: string for a single package or a list of strings for multiple packages
+        """
+        if self._conf.get("spark.pyspark.virtualenv.enabled") != "true":
+            raise RuntimeError("install_packages can only be called when "
+                               "spark.pyspark.virtualenv.enabled is set to true")
+        if isinstance(packages, basestring):
+            packages = [packages]
+        # statusTracker.getExecutorInfos() returns driver + executors, so -1 here.
+        num_executors = len(self._jsc.sc().statusTracker().getExecutorInfos()) - 1
+        dummyRDD = self.parallelize(range(num_executors), num_executors)
+
+        def _run_pip(packages, iterator):
+            import pip
+            return pip.main(["install"] + packages)
+
+        # install packages on the driver first. if installation succeeded, continue the
+        # installation on executors, otherwise return directly.
+        if _run_pip(packages, None) != 0:
+            return
--- End diff --

The pip install output will be displayed.
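For context, `pip.main` behaves like pip's command-line entry point: it writes the normal install log to stdout/stderr (hence the visible output) and returns a process-style exit code, which the driver-first check in the diff above relies on. A sketch; note that `pip.main` only exists in pip < 10, the pip of this PR's era:

```python
import pip

status = pip.main(["install", "numpy"])  # prints pip's usual output
if status != 0:
    # mirrors the PR: a failed driver install skips the executor installs
    print("driver-side install failed; not installing on executors")
```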
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r163427975

--- Diff: python/pyspark/context.py ---
@@ -1023,6 +1032,42 @@ def getConf(self):
         conf.setAll(self._conf.getAll())
         return conf

+    def install_packages(self, packages):
+        """
+        install python packages on all executors and the driver through pip. pip will be
+        installed by default no matter whether using native virtualenv or conda, so it is
+        guaranteed that pip is available if virtualenv is enabled.
+        :param packages: string for a single package or a list of strings for multiple packages
+        :param install_driver: whether to install packages in the client
+        """
+        if self._conf.get("spark.pyspark.virtualenv.enabled") != "true":
+            raise RuntimeError("install_packages can only be called when "
+                               "spark.pyspark.virtualenv.enabled is set to true")
+        if isinstance(packages, basestring):
+            packages = [packages]
+        # statusTracker.getExecutorInfos() returns driver + executors, so -1 here.
+        num_executors = len(self._jsc.sc().statusTracker().getExecutorInfos()) - 1
+        dummyRDD = self.parallelize(range(num_executors), num_executors)
+
+        def _run_pip(packages, iterator):
+            import pip
+            return pip.main(["install"] + packages)
+
+        # install packages on the driver first. if installation succeeded, continue the
+        # installation on executors, otherwise return directly.
+        if _run_pip(packages, None) != 0:
+            return
+
+        virtualenvPackages = self._conf.get("spark.pyspark.virtualenv.packages")
+        if virtualenvPackages:
+            self._conf.set("spark.pyspark.virtualenv.packages",
+                           virtualenvPackages + "," + ",".join(packages))
+        else:
+            self._conf.set("spark.pyspark.virtualenv.packages", ",".join(packages))
+
+        import functools
+        dummyRDD.foreachPartition(functools.partial(_run_pip, packages))
--- End diff --

You are right, it is not guaranteed. From my experiments, it works pretty well most of the time. And even if it is not executed on all executors in this RDD operation, the packages will still be installed when the Python daemon is later started on that executor. So in any case, the Python packages will be installed in all Python daemons.
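A standalone view of the best-effort trick under discussion, and why it is not guaranteed; `sc` is assumed to be an active SparkContext with virtualenv enabled:

```python
import functools

def _run_pip(packages, iterator):
    # executed inside an executor's Python worker
    import pip
    pip.main(["install"] + packages)

# One partition per believed executor; Spark may still run several of these
# partitions on the same executor, leaving another executor untouched.
num_executors = len(sc._jsc.sc().statusTracker().getExecutorInfos()) - 1
sc.parallelize(range(num_executors), num_executors) \
  .foreachPartition(functools.partial(_run_pip, ["numpy"]))
# Fallback per the comment above: a missed executor installs the packages
# anyway when its Python daemon next starts, reading the updated
# spark.pyspark.virtualenv.packages property.
```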
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/13599

@holdenk @HyukjinKwon @ueshin I have updated the PR; it now also works when an executor is restarted, and even when dynamic allocation is enabled. The only overhead is on the driver side when an executor asks for configuration (changes in https://github.com/apache/spark/pull/13599/files#diff-7d99a7c7a051e5e851aaaefb275a44a1): I now construct new properties each time an executor asks for them, instead of, as before, creating the properties up front and never changing them even when the SparkConf changes. But I believe the overhead is trivial, so the improvement could be done in a separate PR.
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160572606

--- Diff: python/pyspark/context.py ---
@@ -1023,6 +1032,35 @@ def getConf(self):
         conf.setAll(self._conf.getAll())
         return conf

+    def install_packages(self, packages, install_driver=True):
+        """
+        install python packages on all executors and the driver through pip. pip will be
+        installed by default no matter whether using native virtualenv or conda, so it is
+        guaranteed that pip is available if virtualenv is enabled.
+        :param packages: string for a single package or a list of strings for multiple packages
+        :param install_driver: whether to install packages in the client
+        """
+        if self._conf.get("spark.pyspark.virtualenv.enabled") != "true":
+            raise RuntimeError("install_packages can only be called when "
+                               "spark.pyspark.virtualenv.enabled is set to true")
+        if isinstance(packages, basestring):
+            packages = [packages]
+        # statusTracker.getExecutorInfos() returns driver + executors, so -1 here.
+        num_executors = len(self._jsc.sc().statusTracker().getExecutorInfos()) - 1
+        dummyRDD = self.parallelize(range(num_executors), num_executors)
+
+        def _run_pip(packages, iterator):
+            import pip
+            pip.main(["install"] + packages)
+
+        # run it in the main thread. Will do it in a separate thread after
+        # https://github.com/pypa/pip/issues/2553 is fixed
+        if install_driver:
+            _run_pip(packages, None)
+
+        import functools
+        dummyRDD.foreachPartition(functools.partial(_run_pip, packages))
--- End diff --

It makes sense to mark this feature as experimental. Although it is not reliable in some cases, it is still pretty useful in interactive mode: in a notebook, for example, it is not possible to pin down all the dependent packages before launching the Spark app, so installing packages at runtime is very useful. And since a notebook is usually an experimental phase rather than a production phase, these corner cases should be acceptable IMHO as long as we document them and make users aware.
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160310618

--- Diff: python/pyspark/context.py ---
@@ -1023,6 +1039,33 @@ def getConf(self):
         conf.setAll(self._conf.getAll())
         return conf

+    def install_packages(self, packages, install_driver=True):
+        """
+        install python packages on all executors and the driver through pip
+        :param packages: string for a single package or a list of strings for multiple packages
+        :param install_driver: whether to install packages in the client
--- End diff --

@HyukjinKwon Could you guide me on how to skip such doctests? Thanks.
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160308321

--- Diff: launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java ---
@@ -299,20 +301,39 @@
     // 4. environment variable PYSPARK_PYTHON
     // 5. python
     List<String> pyargs = new ArrayList<>();
-    pyargs.add(firstNonEmpty(conf.get(SparkLauncher.PYSPARK_DRIVER_PYTHON),
-      conf.get(SparkLauncher.PYSPARK_PYTHON),
-      System.getenv("PYSPARK_DRIVER_PYTHON"),
-      System.getenv("PYSPARK_PYTHON"),
-      "python"));
-    String pyOpts = System.getenv("PYSPARK_DRIVER_PYTHON_OPTS");
-    if (conf.containsKey(SparkLauncher.PYSPARK_PYTHON)) {
-      // pass conf spark.pyspark.python to python by environment variable.
-      env.put("PYSPARK_PYTHON", conf.get(SparkLauncher.PYSPARK_PYTHON));
-    }
-    if (!isEmpty(pyOpts)) {
-      pyargs.addAll(parseOptionString(pyOpts));
-    }
+    String pythonExec = firstNonEmpty(conf.get(SparkLauncher.PYSPARK_DRIVER_PYTHON),
+        conf.get(SparkLauncher.PYSPARK_PYTHON),
+        System.getenv("PYSPARK_DRIVER_PYTHON"),
+        System.getenv("PYSPARK_PYTHON"),
+        "python");
+    if (conf.getOrDefault("spark.pyspark.virtualenv.enabled", "false").equals("true")) {
+      try {
+        // setup virtualenv in launcher when virtualenv is enabled in pyspark shell
+        Class virtualEnvClazz = getClass().forName("org.apache.spark.api.python.VirtualEnvFactory");
+        Object virtualEnv = virtualEnvClazz.getConstructor(String.class, Map.class, Boolean.class)
+            .newInstance(pythonExec, conf, true);
+        Method virtualEnvMethod = virtualEnv.getClass().getMethod("setupVirtualEnv");
+        pythonExec = (String) virtualEnvMethod.invoke(virtualEnv);
+        pyargs.add(pythonExec);
+      } catch (Exception e) {
+        throw new IOException(e);
+      }
+    } else {
+      pyargs.add(firstNonEmpty(conf.get(SparkLauncher.PYSPARK_DRIVER_PYTHON),
+          conf.get(SparkLauncher.PYSPARK_PYTHON),
+          System.getenv("PYSPARK_DRIVER_PYTHON"),
+          System.getenv("PYSPARK_PYTHON"),
+          "python"));
+      String pyOpts = System.getenv("PYSPARK_DRIVER_PYTHON_OPTS");
+      if (conf.containsKey(SparkLauncher.PYSPARK_PYTHON)) {
+        // pass conf spark.pyspark.python to python by environment variable.
+        env.put("PYSPARK_PYTHON", conf.get(SparkLauncher.PYSPARK_PYTHON));
+      }
+      if (!isEmpty(pyOpts)) {
+        pyargs.addAll(parseOptionString(pyOpts));
+      }
--- End diff --

Good point, fixed.
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160285377

--- Diff: core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+private[spark] class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: Boolean)
+  extends Logging {
+
+  private var virtualEnvType = conf.get("spark.pyspark.virtualenv.type", "native")
+  private var virtualEnvPath = conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by the launcher when the user wants to use virtualenv in the pyspark shell. The
+  // launcher needs this class to create the virtualenv for the driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: java.lang.Boolean) {
+    this(pythonExec, new SparkConf(), isDriver)
--- End diff --

I am afraid not, because this method is used by the launcher module, where the SparkConf is unknown.
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/13599

@holdenk @ueshin @HyukjinKwon Thanks for reviewing this long-pending PR. Will refine it soon.
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160141363

--- Diff: python/pyspark/context.py ---
@@ -1023,6 +1039,33 @@ def getConf(self):
         conf.setAll(self._conf.getAll())
         return conf

+    def install_packages(self, packages, install_driver=True):
--- End diff --

Users can call this method to install packages at runtime, e.g. they can install a package in the pyspark shell.
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160140451

--- Diff: docs/submitting-applications.md ---
@@ -218,6 +218,73 @@ These commands can be used with `pyspark`, `spark-shell`, and `spark-submit` to
 For Python, the equivalent `--py-files` option can be used to distribute `.egg`, `.zip` and `.py` libraries to executors.

+# VirtualEnv for PySpark
+For a simple PySpark application, we can use `--py-files` to add its dependencies. For a large PySpark application,
+however, you will usually have many dependencies, which may in turn have transitive dependencies, and some dependencies
+even need to be compiled to be installed. In this case `--py-files` is not so convenient. Luckily, in the Python world
+we have virtualenv/conda to help create an isolated Python working environment. We also implement virtualenv support
+in PySpark (it is only supported in yarn mode for now).
+
+# Prerequisites
+- Each node has virtualenv/conda and python-devel installed
+- Each node is internet accessible (for downloading packages)
+
+{% highlight bash %}
+# Setup virtualenv using native virtualenv in yarn-client mode
+bin/spark-submit \
+    --master yarn \
+    --deploy-mode client \
+    --conf "spark.pyspark.virtualenv.enabled=true" \
+    --conf "spark.pyspark.virtualenv.type=native" \
+    --conf "spark.pyspark.virtualenv.requirements=<requirements_file>" \
+    --conf "spark.pyspark.virtualenv.bin.path=<virtualenv_bin_path>" \
+    <your_python_file>
+
+# Setup virtualenv using conda in yarn-client mode
+bin/spark-submit \
+    --master yarn \
+    --deploy-mode client \
+    --conf "spark.pyspark.virtualenv.enabled=true" \
+    --conf "spark.pyspark.virtualenv.type=conda" \
+    --conf "spark.pyspark.virtualenv.requirements=<requirements_file>" \
+    --conf "spark.pyspark.virtualenv.bin.path=<conda_bin_path>" \
+    <your_python_file>
+{% endhighlight %}
+
+## PySpark VirtualEnv Configurations
+
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr>
+  <td>spark.pyspark.virtualenv.enabled</td>
+  <td>false</td>
+  <td>Whether to enable virtualenv</td>
+</tr>
+<tr>
+  <td>spark.pyspark.virtualenv.type</td>
+  <td>virtualenv</td>
--- End diff --

I don't have a strong preference for `native`. The reason I use it is that `virtualenv` already appears in all the virtualenv-related configuration names; I just don't want to introduce `virtualenv` here again, to avoid confusion.
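The same settings can also be supplied programmatically; a minimal sketch using the configuration keys from the doc above (both paths are placeholders, and per the doc this only works on YARN for now):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.pyspark.virtualenv.enabled", "true")
        .set("spark.pyspark.virtualenv.type", "native")
        .set("spark.pyspark.virtualenv.requirements", "/path/to/requirements.txt")
        .set("spark.pyspark.virtualenv.bin.path", "/usr/bin/virtualenv"))
sc = SparkContext(conf=conf)
```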
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160139161

--- Diff: core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+private[spark] class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: Boolean)
+  extends Logging {
+
+  private var virtualEnvType = conf.get("spark.pyspark.virtualenv.type", "native")
+  private var virtualEnvPath = conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by the launcher when the user wants to use virtualenv in the pyspark shell. The
+  // launcher needs this class to create the virtualenv for the driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: java.lang.Boolean) {
+    this(pythonExec, new SparkConf(), isDriver)
+    properties.asScala.foreach(entry => this.conf.set(entry._1, entry._2))
+    virtualEnvType = conf.get("spark.pyspark.virtualenv.type", "native")
+    virtualEnvPath = conf.get("spark.pyspark.virtualenv.bin.path")
+    this.isLauncher = true
+  }
+
+  /*
+   * Create virtualenv using native virtualenv or conda
+   *
+   * Native Virtualenv:
+   *   - Execute command: virtualenv -p <pythonExec> --no-site-packages <virtualenvDir>
+   *   - Execute command: python -m pip --cache-dir <cacheDir> install -r <requirementsFile>
+   *
+   * Conda
+   *   - Execute command: conda create --prefix <virtualenvDir> --file <requirementsFile> -y
+   */
+  def setupVirtualEnv(): String = {
+    logInfo("Start to setup virtualenv...")
+    logDebug("user.dir=" + System.getProperty("user.dir"))
+    logDebug("user.home=" + System.getProperty("user.home"))
+
+    require(virtualEnvType == "native" || virtualEnvType == "conda",
+      s"VirtualEnvType: ${virtualEnvType} is not supported.")
+    require(new File(virtualEnvPath).exists(),
+      s"VirtualEnvPath: ${virtualEnvPath} is not defined or doesn't exist.")
+    // Use a temp directory for virtualenv in the following cases:
+    // 1. driver of pyspark shell
+    // 2. driver of yarn-client mode
+    // otherwise create the virtualenv folder under the executor working directory.
+    if (isLauncher ||
+        (isDriver && conf.get("spark.submit.deployMode") == "client")) {
+      val virtualenv_basedir = Files.createTempDir()
+      virtualenv_basedir.deleteOnExit()
+      virtualEnvName = virtualenv_basedir.getAbsolutePath
+    } else if (isDriver && conf.get("spark.submit.deployMode") == "cluster") {
+      virtualEnvName = "virtualenv_driver"
+    } else {
+      // use the working directory of Executor
+      virtualEnvName = "virtualenv_" + conf.getAppId + "_" + VIRTUALENV_ID.getAndIncrement()
+    }
+
+    // Use the absolute path of the requirement file in the following cases:
+    // 1. driver of pyspark shell
+    // 2. driver of yarn-client mode
+    // otherwise just use the filename as it would be downloaded to the working directory of Executor
+    val pyspark_requirements =
--- End diff --

Fixed.
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160138782

--- Diff: core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+private[spark] class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: Boolean)
+  extends Logging {
+
+  private var virtualEnvType = conf.get("spark.pyspark.virtualenv.type", "native")
+  private var virtualEnvPath = conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by the launcher when the user wants to use virtualenv in the pyspark shell. The
+  // launcher needs this class to create the virtualenv for the driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: java.lang.Boolean) {
--- End diff --

This is because it will also be called on the Java side via reflection.
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160138391

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -60,6 +66,12 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
     envVars.getOrElse("PYTHONPATH", ""),
     sys.env.getOrElse("PYTHONPATH", ""))
+
+  if (virtualEnvEnabled) {
+    val virtualEnvFactory = new VirtualEnvFactory(pythonExec, conf, false)
+    pythonExec = virtualEnvFactory.setupVirtualEnv()
--- End diff --

Fixed.
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160138349

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -29,7 +30,10 @@ import org.apache.spark._
 import org.apache.spark.internal.Logging
 import org.apache.spark.util.{RedirectThread, Utils}

-private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String, String])
+
+private[spark] class PythonWorkerFactory(var pythonExec: String,
--- End diff --

Fixed.
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160070613

--- Diff: core/src/main/scala/org/apache/spark/api/python/VirtualEnvFactory.scala ---
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io.File
+import java.util.{Map => JMap}
+import java.util.Arrays
+import java.util.concurrent.atomic.AtomicInteger
+
+import scala.collection.JavaConverters._
+
+import com.google.common.io.Files
+
+import org.apache.spark.SparkConf
+import org.apache.spark.internal.Logging
+
+
+private[spark] class VirtualEnvFactory(pythonExec: String, conf: SparkConf, isDriver: Boolean)
+  extends Logging {
+
+  private var virtualEnvType = conf.get("spark.pyspark.virtualenv.type", "native")
+  private var virtualEnvPath = conf.get("spark.pyspark.virtualenv.bin.path", "")
+  private var virtualEnvName: String = _
+  private var virtualPythonExec: String = _
+  private val VIRTUALENV_ID = new AtomicInteger()
+  private var isLauncher: Boolean = false
+
+  // used by the launcher when the user wants to use virtualenv in the pyspark shell. The
+  // launcher needs this class to create the virtualenv for the driver.
+  def this(pythonExec: String, properties: JMap[String, String], isDriver: java.lang.Boolean) {
+    this(pythonExec, new SparkConf(), isDriver)
+    properties.asScala.foreach(entry => this.conf.set(entry._1, entry._2))
+    virtualEnvType = conf.get("spark.pyspark.virtualenv.type", "native")
+    virtualEnvPath = conf.get("spark.pyspark.virtualenv.bin.path")
+    this.isLauncher = true
+  }
+
+  /*
+   * Create virtualenv using native virtualenv or conda
+   *
+   * Native Virtualenv:
+   *   - Execute command: virtualenv -p <pythonExec> --no-site-packages <virtualenvDir>
+   *   - Execute command: python -m pip --cache-dir <cacheDir> install -r <requirementsFile>
+   *
+   * Conda
+   *   - Execute command: conda create --prefix <virtualenvDir> --file <requirementsFile> -y
+   */
+  def setupVirtualEnv(): String = {
+    logInfo("Start to setup virtualenv...")
+    logDebug("user.dir=" + System.getProperty("user.dir"))
+    logDebug("user.home=" + System.getProperty("user.home"))
+
+    require(virtualEnvType == "native" || virtualEnvType == "conda",
+      s"VirtualEnvType: ${virtualEnvType} is not supported.")
+    require(new File(virtualEnvPath).exists(),
--- End diff --

Sorry, the variable name is a little misleading; it should be `virtualEnvBinPath`, which holds `spark.pyspark.virtualenv.bin.path`.
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160070518

--- Diff: python/pyspark/context.py ---
@@ -980,6 +996,33 @@ def getConf(self):
         conf.setAll(self._conf.getAll())
         return conf

+    def install_packages(self, packages, install_driver=True):
+        """
+        install python packages on all executors and the driver through pip
+        :param packages: string for a single package or a list of strings for multiple packages
+        :param install_driver: whether to install packages in the client
+        """
+        if self._conf.get("spark.pyspark.virtualenv.enabled") != "true":
+            raise Exception("install_packages can only be called when "
+                            "spark.pyspark.virtualenv.enabled is set to true")
+        if isinstance(packages, basestring):
+            packages = [packages]
+        num_executors = int(self._conf.get("spark.executor.instances"))
+        dummyRDD = self.parallelize(range(num_executors), num_executors)
--- End diff --

Right, it would not work when dynamic allocation is enabled; will add a condition for that.
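The approach the later revisions settle on (visible in the context.py diffs earlier in this thread) sidesteps `spark.executor.instances` entirely by asking the status tracker for live executors:

```python
# getExecutorInfos() includes the driver, hence the -1 (per the PR's comment);
# this also reflects executors added or removed by dynamic allocation.
num_executors = len(sc._jsc.sc().statusTracker().getExecutorInfos()) - 1
dummyRDD = sc.parallelize(range(num_executors), num_executors)
```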
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160070457

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -475,6 +475,19 @@ object SparkSubmit extends CommandLineUtils with Logging {
       }
     }

+    // for pyspark virtualenv
+    if (args.isPython) {
+      if (clusterManager != YARN &&
+          args.sparkProperties.getOrElse("spark.pyspark.virtualenv.enabled", "false") == "true") {
+        printErrorAndExit("virtualenv is only supported in yarn mode")
--- End diff --

I haven't tested it in standalone mode, so it is not guaranteed there; supporting standalone mode is on my plan for later.
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17222

Thanks @viirya
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17222

This PR fails the PySpark pip packaging tests, but I don't know what's wrong here. @holdenk, is the `PySpark pip packaging test` a known issue?
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17222#discussion_r123876794

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala ---
@@ -20,16 +20,19 @@ package org.apache.spark.sql.hive.execution
 import scala.collection.JavaConverters._
 import scala.util.Random

+import test.org.apache.spark.sql.MyDoubleAvg
+import test.org.apache.spark.sql.MyDoubleSum
+
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.expressions.UnsafeRow
 import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
 import org.apache.spark.sql.functions._
-import org.apache.spark.sql.hive.aggregate.{MyDoubleAvg, MyDoubleSum}
 import org.apache.spark.sql.hive.test.TestHiveSingleton
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.SQLTestUtils
 import org.apache.spark.sql.types._
+
--- End diff --

It depends on some Hive stuff (`TestHiveSingleton`), so I guess it is intended to stay in sql/hive.
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17222#discussion_r123871674

--- Diff: sql/hive/src/test/java/org/apache/spark/sql/hive/JavaDataFrameSuite.java ---
@@ -31,7 +31,7 @@
 import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
 import static org.apache.spark.sql.functions.*;
 import org.apache.spark.sql.hive.test.TestHive$;
-import org.apache.spark.sql.hive.aggregate.MyDoubleSum;
+import test.org.apache.spark.sql.MyDoubleSum;

 public class JavaDataFrameSuite {
--- End diff --

Do you mean moving JavaDataFrameSuite to sql/core?
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17222#discussion_r123871670

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala ---
@@ -20,16 +20,19 @@ package org.apache.spark.sql.hive.execution
 import scala.collection.JavaConverters._
 import scala.util.Random

+import test.org.apache.spark.sql.MyDoubleAvg
+import test.org.apache.spark.sql.MyDoubleSum
+
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.expressions.UnsafeRow
 import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
 import org.apache.spark.sql.functions._
-import org.apache.spark.sql.hive.aggregate.{MyDoubleAvg, MyDoubleSum}
 import org.apache.spark.sql.hive.test.TestHiveSingleton
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.SQLTestUtils
 import org.apache.spark.sql.types._
+
--- End diff --

I didn't add any tests in this file. Or do you mean moving AggregationQuerySuite.scala to sql/core?
[GitHub] spark issue #14180: [SPARK-16367][PYSPARK] Support for deploying Anaconda an...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/14180 Here's my approach in #13599 for virtualenv and conda support; any comments and reviews are welcome: https://docs.google.com/document/d/1EGNEf4vFmpGXSd2DPOLu_HL23Xhw9aWKeUrzzxsEbQs/edit?usp=sharing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17222 @gatorsmile Sorry for the late response, will update it soon. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17222#discussion_r116325723 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -491,20 +491,42 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends case 21 => register(name, udf.asInstanceOf[UDF20[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) case 22 => register(name, udf.asInstanceOf[UDF21[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) case 23 => register(name, udf.asInstanceOf[UDF22[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) -case n => logError(s"UDF class with ${n} type arguments is not supported ") +case n => + throw new IOException(s"UDF class with ${n} type arguments is not supported.") --- End diff -- Sorry, I missed your last comment; fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17222#discussion_r116293947 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -491,20 +491,42 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends case 21 => register(name, udf.asInstanceOf[UDF20[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) case 22 => register(name, udf.asInstanceOf[UDF21[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) case 23 => register(name, udf.asInstanceOf[UDF22[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) -case n => logError(s"UDF class with ${n} type arguments is not supported ") +case n => + throw new IOException(s"UDF class with ${n} type arguments is not supported.") --- End diff -- I didn't find a more appropriate exception type, so I just used IOException. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17222#discussion_r116293890 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala --- @@ -20,16 +20,19 @@ package org.apache.spark.sql.hive.execution import scala.collection.JavaConverters._ import scala.util.Random +import _root_.test.org.apache.spark.sql.MyDoubleAvg +import _root_.test.org.apache.spark.sql.MyDoubleSum --- End diff -- oops, fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17222#discussion_r114963449 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -475,20 +475,42 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends case 21 => register(name, udf.asInstanceOf[UDF20[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) case 22 => register(name, udf.asInstanceOf[UDF21[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) case 23 => register(name, udf.asInstanceOf[UDF22[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) -case n => logError(s"UDF class with ${n} type arguments is not supported ") +case n => + throw new IOException(s"UDF class with ${n} type arguments is not supported.") } } catch { case e @ (_: InstantiationException | _: IllegalArgumentException) => -logError(s"Can not instantiate class ${className}, please make sure it has public non argument constructor") +throw new IOException(s"Can not instantiate class ${className}, please make sure it has public non argument constructor") } } } catch { - case e: ClassNotFoundException => logError(s"Can not load class ${className}, please make sure it is on the classpath") + case e: ClassNotFoundException => throw new IOException(s"Can not load class ${className}, please make sure it is on the classpath") } } /** + * Register a Java UDAF class using reflection, for use from pyspark + * + * @param name UDAF name + * @param className fully qualified class name of UDAF + */ + private[sql] def registerJavaUDAF(name: String, className: String): Unit = { --- End diff -- This is because on the Scala side, `registerJava` in `UDFRegistration` needs the returnType. Yeah, it does look a little weird for the Python side to require the returnType. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17222#discussion_r114962484 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -475,20 +475,42 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends case 21 => register(name, udf.asInstanceOf[UDF20[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) case 22 => register(name, udf.asInstanceOf[UDF21[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) case 23 => register(name, udf.asInstanceOf[UDF22[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) -case n => logError(s"UDF class with ${n} type arguments is not supported ") +case n => + throw new IOException(s"UDF class with ${n} type arguments is not supported.") } } catch { case e @ (_: InstantiationException | _: IllegalArgumentException) => -logError(s"Can not instantiate class ${className}, please make sure it has public non argument constructor") +throw new IOException(s"Can not instantiate class ${className}, please make sure it has public non argument constructor") } } } catch { - case e: ClassNotFoundException => logError(s"Can not load class ${className}, please make sure it is on the classpath") + case e: ClassNotFoundException => throw new IOException(s"Can not load class ${className}, please make sure it is on the classpath") } } /** + * Register a Java UDAF class using reflection, for use from pyspark + * + * @param name UDAF name + * @param className fully qualified class name of UDAF + */ + private[sql] def registerJavaUDAF(name: String, className: String): Unit = { --- End diff -- I mean `registerJavaUDAF` in `context.py` doesn't have `returnType`, so here on the Scala side I don't provide `returnType` either, since this Scala method is only used for `registerJavaUDAF` in PySpark. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17222#discussion_r114948086 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -475,20 +475,42 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends case 21 => register(name, udf.asInstanceOf[UDF20[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) case 22 => register(name, udf.asInstanceOf[UDF21[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) case 23 => register(name, udf.asInstanceOf[UDF22[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) -case n => logError(s"UDF class with ${n} type arguments is not supported ") +case n => + throw new IOException(s"UDF class with ${n} type arguments is not supported.") } } catch { case e @ (_: InstantiationException | _: IllegalArgumentException) => -logError(s"Can not instantiate class ${className}, please make sure it has public non argument constructor") +throw new IOException(s"Can not instantiate class ${className}, please make sure it has public non argument constructor") } } } catch { - case e: ClassNotFoundException => logError(s"Can not load class ${className}, please make sure it is on the classpath") + case e: ClassNotFoundException => throw new IOException(s"Can not load class ${className}, please make sure it is on the classpath") } } /** + * Register a Java UDAF class using reflection, for use from pyspark + * + * @param name UDAF name + * @param className fully qualified class name of UDAF + */ + private[sql] def registerJavaUDAF(name: String, className: String): Unit = { --- End diff -- The PySpark side doesn't need `returnType`, so I didn't use it here, and it is a private function, so it is open to adding `returnType` in the future if necessary. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
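[Editor's sketch] To make the point above concrete, here is a hedged sketch of the API shape from the PySpark side: the Java UDF registration needs a return type, while the Java UDAF registration does not, because a UDAF's result type is defined by the Java class's own `dataType`. The class names are hypothetical placeholders, not classes shipped with Spark.

```python
# Assumes a running pyspark shell where `sqlContext` is in scope.
from pyspark.sql.types import DoubleType

# Java UDF: the return type must be supplied by the caller.
sqlContext.registerJavaFunction("javaUdf", "com.example.MyUdf", DoubleType())

# Java UDAF: no returnType parameter; the type comes from the UDAF itself.
sqlContext.registerJavaUDAF("javaUdaf", "com.example.MyUdaf")
```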
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17222 @cloud-fan This is not about using Python UDFs; it is about allowing PySpark to use Java UDFs (no Python daemon will be launched). So it would actually improve performance. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
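[Editor's sketch] A hedged illustration of the performance point: once registered, the Java UDF is invoked from SQL and executes entirely inside the JVM, so no per-row round trip to a Python worker is needed. The UDF class name is a hypothetical example, not part of Spark.

```python
# Assumes a running pyspark shell where `sqlContext` is in scope.
from pyspark.sql.types import IntegerType

# "com.example.StringLengthUdf" is a hypothetical class implementing UDF1.
sqlContext.registerJavaFunction("strLen", "com.example.StringLengthUdf", IntegerType())

df = sqlContext.createDataFrame([("a",), ("bb",)], ["name"])
df.registerTempTable("people")
# The strLen call below runs in the JVM; no Python daemon is launched for it.
sqlContext.sql("SELECT name, strLen(name) AS len FROM people").show()
```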
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17222 @holdenk @gatorsmile Any more comments ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17222 @holdenk The link you pasted is for the case of using a Scala closure to create a UDF, while `registerJava` uses Java reflection to create the UDF. This is what I use in `registerJava`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala#L528 It returns Unit. Maybe it is possible to create a `registerScala` that returns a Scala UDF, but it seems that is not possible for a Java UDF. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17222#discussion_r113085517 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -475,20 +475,42 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends case 21 => register(name, udf.asInstanceOf[UDF20[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) case 22 => register(name, udf.asInstanceOf[UDF21[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) case 23 => register(name, udf.asInstanceOf[UDF22[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _]], returnType) -case n => logError(s"UDF class with ${n} type arguments is not supported ") +case n => + throw new IOException(s"UDF class with ${n} type arguments is not supported.") } } catch { case e @ (_: InstantiationException | _: IllegalArgumentException) => -logError(s"Can not instantiate class ${className}, please make sure it has public non argument constructor") +throw new IOException(s"Can not instantiate class ${className}, please make sure it has public non argument constructor") } } } catch { - case e: ClassNotFoundException => logError(s"Can not load class ${className}, please make sure it is on the classpath") + case e: ClassNotFoundException => throw new IOException(s"Can not load class ${className}, please make sure it is on the classpath") } } /** + * Register a Java UDAF class using reflection, for use from pyspark + * + * @param name UDAF name + * @param className fully qualified class name of UDAF --- End diff -- Is `@since` needed for a private function? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17222 @holdenk But it has nothing to return, because the Scala side returns Unit. See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala#L528 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17222 ping @holdenk --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17586: [SPARK-20249][ML][PYSPARK] Add summary for Linear...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17586#discussion_r111279320 --- Diff: python/pyspark/ml/classification.py --- @@ -172,6 +172,47 @@ def intercept(self): """ return self._call_java("intercept") +@property +@since("2.2.0") --- End diff -- yes --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17222 Good catch ! @holdenk `return` is removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17586: [SPARK-20249][ML][PYSPARK] Add summary for Linear...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17586#discussion_r111042227 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -355,6 +368,19 @@ object LinearSVCModel extends MLReadable[LinearSVCModel] { } /** + * Abstraction for Linear SVC Training results. + * Currently, the training summary ignores the training weights except + * for the objective trace. + */ +case class LinearSVCTrainingSummary( --- End diff -- The classes below `LinearSVCTrainingSummary` are private classes, so I think it would be better to keep `LinearSVCTrainingSummary` there (above the private classes). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17586: [SPARK-20249][ML][PYSPARK] Add summary for Linear...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17586#discussion_r111042049 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -355,6 +368,19 @@ object LinearSVCModel extends MLReadable[LinearSVCModel] { } /** + * Abstraction for Linear SVC Training results. + * Currently, the training summary ignores the training weights except + * for the objective trace. + */ --- End diff -- The weight column is also not included in `LogisticRegressionTrainingSummary`; should I add that as well? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16906: [SPARK-19570][PYSPARK] Allow to disable hive in pyspark ...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/16906 Kindly ping @holdenk --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17586: [SPARK-20249][ML][PYSPARK] Add summary for LinearSVCMode...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17586 @hhbyyh @jkbradley Please help review. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17586: [SPARK-20249][ML][PYSPARK] Add summary for LinearSVCMode...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17586 I didn't add metrics like ROC to this summary yet; I can add them if necessary. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17586: [SPARK-20249][ML][PYSPARK] Add summary for Linear...
GitHub user zjffdu opened a pull request: https://github.com/apache/spark/pull/17586 [SPARK-20249][ML][PYSPARK] Add summary for LinearSVCModel ## What changes were proposed in this pull request? Add a summary for LinearSVCModel so that users can get the training process status, such as the loss value of each iteration and the total iteration number. ## How was this patch tested? Tested manually with both the Spark example and the PySpark example. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zjffdu/spark SPARK-20249 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17586.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17586 commit 368af49770f9bee35ea364c6c8a47c96eecf2eaa Author: Jeff Zhang <zjf...@apache.org> Date: 2017-04-10T06:17:48Z [SPARK-20249][ML][PYSPARK] Add summary for LinearSVCModel --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
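[Editor's sketch] A hedged sketch of what the added summary would enable from PySpark. The attribute names (`summary`, `objectiveHistory`, `totalIterations`) are assumptions modeled on the existing LogisticRegression training summary convention mentioned in the review, not confirmed from this diff; `training` is assumed to be a DataFrame of (label, features) rows prepared elsewhere.

```python
from pyspark.ml.classification import LinearSVC

svc = LinearSVC(maxIter=10, regParam=0.1)
model = svc.fit(training)

summary = model.summary             # training summary added by this PR
print(summary.totalIterations)      # total iteration number
print(summary.objectiveHistory)     # loss value of each iteration
```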
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17222 @viirya Thanks for the careful review. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17222#discussion_r110113683 --- Diff: python/pyspark/sql/tests.py --- @@ -436,6 +436,20 @@ def test_udf_with_order_by_and_limit(self): res.explain(True) self.assertEqual(res.collect(), [Row(id=0, copy=0)]) +def test_non_existed_udf(self): +try: +self.spark.udf.registerJavaFunction("udf1", "non_existed_udf") +self.fail("should fail due to can not load java udf class") +except py4j.protocol.Py4JError as e: +self.assertTrue("Can not load class non_existed_udf" in e.desc) + +def test_non_existed_udaf(self): +try: +self.spark.udf.registerJavaFunction("udf1", "non_existed_udaf") --- End diff -- Correct, fixed 😄 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17222#discussion_r110108533 --- Diff: python/pyspark/sql/context.py --- @@ -228,6 +228,24 @@ def registerJavaFunction(self, name, javaClassName, returnType=None): jdt = self.sparkSession._jsparkSession.parseDataType(returnType.json()) self.sparkSession._jsparkSession.udf().registerJava(name, javaClassName, jdt) +@ignore_unicode_prefix +@since(2.2) +def registerJavaUDAF(self, name, javaClassName): +"""Register a java UDAF so it can be used in SQL statements. + +:param name: name of the UDF +:param javaClassName: fully qualified name of java class + +>>> sqlContext.registerJavaUDAF("javaUDAF", +... "org.apache.spark.sql.hive.aggregate.MyDoubleAvg") +>>> df = sqlContext.createDataFrame([(1, "a"),(2, "b"), (3, "a")],["id", "name"]) +>>> df.registerTempTable("df") +>>> sqlContext.sql("SELECT name,javaUDAF(id) as avg from df group by name").collect() +[Row(name=u'b', avg=102.0), Row(name=u'a', avg=102.0)] --- End diff -- Good point, will add that --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17222 @holdenk Mind reviewing it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
GitHub user zjffdu reopened a pull request: https://github.com/apache/spark/pull/17222 [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFunction Should Support UDAFs ## What changes were proposed in this pull request? Support registering Java UDAFs in PySpark so that users can use Java UDAFs in PySpark. Besides that, I also added an API in `UDFRegistration`. ## How was this patch tested? A unit test is added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zjffdu/spark SPARK-19439 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17222.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17222 commit 8c1e837e2e97c08c4a5753c79aea71da772b0eaa Author: Jeff Zhang <zjf...@apache.org> Date: 2017-03-09T07:06:50Z [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFunction Should Support UDAFs commit 89b8d6588d4d6258f9c4d84339775544d93e6e3c Author: Jeff Zhang <zjf...@apache.org> Date: 2017-03-10T00:28:12Z add scala doc --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
Github user zjffdu closed the pull request at: https://github.com/apache/spark/pull/17222 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17367: [MINOR][PYSPARK] Remove _inferSchema in context.py
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17367 Closing it, as `_inferSchema` is still used in many places. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17367: [MINOR][PYSPARK] Remove _inferSchema in context.p...
Github user zjffdu closed the pull request at: https://github.com/apache/spark/pull/17367 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16906: [SPARK-19570][PYSPARK] Allow to disable hive in pyspark ...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/16906 Yeah, makes sense. Fixed it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17367: [MINOR][PYSPARK] Remove _inferSchema in context.p...
GitHub user zjffdu opened a pull request: https://github.com/apache/spark/pull/17367 [MINOR][PYSPARK] Remove _inferSchema in context.py ## What changes were proposed in this pull request? `_inferSchema` is not used in context.py; everything has been moved to `SparkSession`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zjffdu/spark minor_pyspark Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17367.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17367 commit 42211f5e6bb1a281f7f99552095090a41a248ffa Author: Jeff Zhang <zjf...@apache.org> Date: 2017-03-21T03:37:12Z [MINOR][PYSPARK] Remove _inferSchema in context.py --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/13599 @holdenk Do you have time to review this ? Thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/13599 I created a google doc about how to use it, https://docs.google.com/document/d/1KB9RYW8_bSeOzwVqZFc_zy_vXqqqctwrU5TROP_16Ds/edit?usp=sharing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/17222#discussion_r105392650 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala --- @@ -484,6 +484,21 @@ class UDFRegistration private[sql] (functionRegistry: FunctionRegistry) extends } + private[sql] def registerJavaUDAF(name: String, className: String): Unit = { --- End diff -- ScalaDoc is added --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFuncti...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17222 @holdenk @marmbrus Please help review --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17222: [SPARK-19439][PYSPARK][SQL] PySpark's registerJav...
GitHub user zjffdu opened a pull request: https://github.com/apache/spark/pull/17222 [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFunction Should Support UDAFs ## What changes were proposed in this pull request? Support registering Java UDAFs in PySpark so that users can use Java UDAFs in PySpark. ## How was this patch tested? A unit test is added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zjffdu/spark SPARK-19439 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17222.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17222 commit 391c9591537f6c35d1aaffd4fa8238a4d13191e6 Author: Jeff Zhang <zjf...@apache.org> Date: 2017-03-09T07:06:50Z [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFunction Should Support UDAFs --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17194: Add new aggregates EVERY and ANY (SOME).
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/17194 @ptkool Please update the title to include the JIRA ID so that it can be linked to the JIRA automatically. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #10307: [SPARK-12334][SQL][PYSPARK] Support read from mul...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/10307#discussion_r104874993 --- Diff: python/pyspark/sql/readwriter.py --- @@ -282,6 +282,23 @@ def parquet(self, *paths): """ return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths))) +@since(2.2) +def parquet(self, path): --- End diff -- Thanks @holdenk, I learned a new thing about Python. I reverted the changes on parquet. It would be very weird to change it to `def parquet(self, *paths, path=None):`, and `def parquet(self, **kwargs):` would break code that doesn't use the keyword argument, e.g. `parquet("p_file")`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
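[Editor's sketch] To make the compatibility point concrete, a small self-contained illustration with plain functions standing in for the `DataFrameReader` methods (these are not the real API):

```python
def parquet_varargs(*paths):        # current API shape: positional var-args
    return list(paths)

def parquet_kwargs(**kwargs):       # hypothetical keyword-only rewrite
    return kwargs.get("path")

print(parquet_varargs("p_file"))    # works: ['p_file']
try:
    parquet_kwargs("p_file")        # existing positional call sites break
except TypeError as e:
    print(e)                        # takes 0 positional arguments but 1 was given
```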
[GitHub] spark pull request #10307: [SPARK-12334][SQL][PYSPARK] Support read from mul...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/10307#discussion_r104829036 --- Diff: python/pyspark/sql/readwriter.py --- @@ -407,15 +424,17 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non @since(1.5) def orc(self, path): -"""Loads an ORC file, returning the result as a :class:`DataFrame`. +"""Loads ORC files, returning the result as a :class:`DataFrame`. --- End diff -- It is in `python/pyspark/sql/tests.py` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #10307: [SPARK-12334][SQL][PYSPARK] Support read from mul...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/10307#discussion_r103600310 --- Diff: python/pyspark/sql/readwriter.py --- @@ -388,16 +388,18 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path))) @since(1.5) -def orc(self, path): -"""Loads an ORC file, returning the result as a :class:`DataFrame`. +def orc(self, paths): --- End diff -- Good catch, I should not break compatibility. BTW, I found that `DataFrameReader.parquet` uses variable-length arguments, which is not consistent with other file formats such as text, json and orc that use a string or a list of strings. I can fix this in this PR or do it in another PR to make them consistent. What do you think? ``` @since(1.4) def parquet(self, *paths): """Loads Parquet files, returning the result as a :class:`DataFrame`. ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
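[Editor's sketch] A quick sketch of the inconsistency being described, using a simplified stand-in class rather than the real reader:

```python
class Reader(object):
    def parquet(self, *paths):      # var-args convention: r.parquet("a", "b")
        return list(paths)

    def orc(self, path):            # single-argument convention: str or list of str
        return path if isinstance(path, list) else [path]

r = Reader()
assert r.parquet("a", "b") == ["a", "b"]   # paths passed positionally
assert r.orc(["a", "b"]) == ["a", "b"]     # paths passed as one list
```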
[GitHub] spark pull request #16907: [SPARK-19572][SPARKR] Allow to disable hive in sp...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/16907#discussion_r103599670 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala --- @@ -47,12 +47,15 @@ private[sql] object SQLUtils extends Logging { jsc: JavaSparkContext, sparkConfigMap: JMap[Object, Object], enableHiveSupport: Boolean): SparkSession = { -val spark = if (SparkSession.hiveClassesArePresent && enableHiveSupport) { +val spark = if (SparkSession.hiveClassesArePresent && enableHiveSupport +&& jsc.sc.conf.get(CATALOG_IMPLEMENTATION.key, "hive").toLowerCase == "hive") { SparkSession.builder().sparkContext(withHiveExternalCatalog(jsc.sc)).getOrCreate() } else { - if (enableHiveSupport) { + if (enableHiveSupport +&& jsc.sc.conf.get(CATALOG_IMPLEMENTATION.key, "hive").toLowerCase == "hive") { logWarning("SparkR: enableHiveSupport is requested for SparkSession but " + - "Spark is not built with Hive; falling back to without Hive support.") + s"Spark is not built with Hive or ${CATALOG_IMPLEMENTATION.key} is not set to 'hive', " + --- End diff -- Good catch, I updated the if condition to match the message. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16907: [SPARK-19572][SPARKR] Allow to disable hive in sparkR sh...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/16907 Yeah, it would be nice to be merged into 2.1 as well. Thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16907: [SPARK-19572][SPARKR] Allow to disable hive in sparkR sh...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/16907 Seems like a flaky test; let me trigger the build. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16907: [SPARK-19572][SPARKR] Allow to disable hive in sp...
Github user zjffdu commented on a diff in the pull request: https://github.com/apache/spark/pull/16907#discussion_r102865692 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala --- @@ -48,13 +48,14 @@ private[sql] object SQLUtils extends Logging { sparkConfigMap: JMap[Object, Object], enableHiveSupport: Boolean): SparkSession = { val spark = if (SparkSession.hiveClassesArePresent && enableHiveSupport -&& jsc.sc.conf.get(CATALOG_IMPLEMENTATION.key, "hive") == "hive") { +&& jsc.sc.conf.get(CATALOG_IMPLEMENTATION.key, "hive").toLowerCase == "hive") { SparkSession.builder().sparkContext(withHiveExternalCatalog(jsc.sc)).getOrCreate() } else { if (enableHiveSupport -&& jsc.sc.conf.get(CATALOG_IMPLEMENTATION.key, "hive") == "hive") { +&& jsc.sc.conf.get(CATALOG_IMPLEMENTATION.key, "hive").toLowerCase == "hive") { logWarning("SparkR: enableHiveSupport is requested for SparkSession but " + - "Spark is not built with Hive; falling back to without Hive support.") + s"Spark is not built with Hive or Hive is disabled via ${CATALOG_IMPLEMENTATION.key} " + --- End diff -- Ping @felixcheung, message is updated. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11211: [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated t...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/11211 @holdenk description is updated. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11211: [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated t...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/11211 ping @holdenk @HyukjinKwon PR is updated, please help review. Thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11211: [SPARK-13330][PYSPARK] PYTHONHASHSEED is not propgated t...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/11211 Sorry for the late reply; I may come back to this issue late this week. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16907: [SPARK-19572][SPARKR] Allow to disable hive in sparkR sh...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/16907 Addressed the comments. @felixcheung, correct, `shell.R` is not supposed to be used outside. This ticket is mainly for disabling hive in the sparkR shell; sparkR batch mode already supports this feature. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16907: [SPARK-19572][SPARKR] Allow to disable hive in sparkR sh...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/16907 @felixcheung Please help review. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16907: [SPARK-19572][SPARKR] Allow to disable hive in sp...
GitHub user zjffdu opened a pull request: https://github.com/apache/spark/pull/16907 [SPARK-19572][SPARKR] Allow to disable hive in sparkR shell ## What changes were proposed in this pull request? SPARK-15236 did this for the Scala shell; this ticket is for the sparkR shell. This is not only for sparkR itself, but can also benefit downstream projects like Livy which use shell.R for their interactive sessions. For now, Livy has no control over whether hive is enabled or not. ## How was this patch tested? Tested it manually: run `bin/sparkR --master local --conf spark.sql.catalogImplementation=in-memory` and verify hive is not enabled. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zjffdu/spark SPARK-19572 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16907.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16907 commit 8329be6dce176022d08bb3109dc994434bf7c84a Author: Jeff Zhang <zjf...@apache.org> Date: 2017-02-13T05:52:22Z [SPARK-19572][SPARKR] Allow to disable hive in sparkR shell --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16906: [SPARK-19570][PYSPARK] Allow to disable hive in pyspark ...
Github user zjffdu commented on the issue: https://github.com/apache/spark/pull/16906 @holdenk Please help review --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16906: [SPARK-19570][PYSPARK] Allow to disable hive in p...
GitHub user zjffdu opened a pull request: https://github.com/apache/spark/pull/16906 [SPARK-19570][PYSPARK] Allow to disable hive in pyspark shell ## What changes were proposed in this pull request? SPARK-15236 did this for the Scala shell; this ticket is for the pyspark shell. This is not only for pyspark itself, but can also benefit downstream projects like Livy which use shell.py for their interactive sessions. For now, Livy has no control over whether hive is enabled or not. ## How was this patch tested? I didn't find a way to add a test for it, so I just tested it manually: run `bin/pyspark --master local --conf spark.sql.catalogImplementation=in-memory` and verify hive is not enabled. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zjffdu/spark SPARK-19570 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16906.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16906 commit 431bcf8d332afe9d971b1f44a51e5dd2ca32ff81 Author: Jeff Zhang <zjf...@apache.org> Date: 2017-02-13T02:03:40Z [SPARK-19570][PYSPARK] Allow to disable hive in pyspark shell --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
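[Editor's sketch] One hedged way to check the behavior from inside the launched shell: read the setting back from the SparkConf the shell was started with. The expected value follows from the `--conf` flag shown above; this is a verification sketch, not part of the PR.

```python
# Inside a pyspark shell started with:
#   bin/pyspark --master local --conf spark.sql.catalogImplementation=in-memory
# `spark` is the shell's SparkSession.
print(spark.sparkContext.getConf().get("spark.sql.catalogImplementation"))
# expected output: in-memory
```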
[GitHub] spark issue #13557: [SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of...
Github user zjffdu commented on the issue:

    https://github.com/apache/spark/pull/13557

    @sethah Thanks for the review, I have updated the PR.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15669: [SPARK-18160][CORE][YARN] spark.files & spark.jars shoul...
Github user zjffdu commented on the issue:

    https://github.com/apache/spark/pull/15669

    Hmm, I notice that spark.files is still passed to SparkContext in yarn-client mode; it seems I need to handle that in SparkSubmit as well.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15669: [SPARK-18160][CORE][YARN] spark.files should not be pass...
Github user zjffdu commented on the issue:

    https://github.com/apache/spark/pull/15669

    That's correct; this PR will also fix the yarn-client case. The PR title has been updated.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15669: [SPARK-18160][CORE][YARN] spark.files should not ...
Github user zjffdu closed the pull request at:

    https://github.com/apache/spark/pull/15669

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15669: [SPARK-18160][CORE][YARN] spark.files should not ...
GitHub user zjffdu reopened a pull request:

    https://github.com/apache/spark/pull/15669

    [SPARK-18160][CORE][YARN] spark.files should not be passed to driver in yarn-cluster mode

## What changes were proposed in this pull request?

spark.files is still passed to the driver in yarn mode, so SparkContext will still handle it, which causes the error described in the JIRA.

## How was this patch tested?

Tested manually on a 5-node cluster. Since this issue only occurs on a multi-node cluster, I didn't write an automated test for it.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zjffdu/spark SPARK-18160

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15669.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15669

commit 67a5ccf7a9d02a8b930ab97e10c0858b4d046496
Author: Jeff Zhang <zjf...@apache.org>
Date: 2016-10-28T07:50:59Z

    [SPARK-18160][CORE][YARN] SparkContext.addFile doesn't work in yarn-cluster mode

commit 8033bd1bce9aa2fa0b05fe53c66ea072656dbd23
Author: Jeff Zhang <zjf...@apache.org>
Date: 2016-10-31T23:23:53Z

    Revert "[SPARK-18160][CORE][YARN] SparkContext.addFile doesn't work in yarn-cluster mode"

    This reverts commit 67a5ccf7a9d02a8b930ab97e10c0858b4d046496.

commit 230d56c90ce7ff30251a03afe6c677fe9df8faca
Author: Jeff Zhang <zjf...@apache.org>
Date: 2016-11-01T03:58:08Z

    remove spark.files & spark.jars from SparkConf in yarn mode

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
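To make the intent of the last commit concrete, here is a minimal sketch of the idea; it is not the actual SparkSubmit code, and the helper name is invented. In yarn mode the YARN client distributes these files itself, so they are dropped from the properties that reach the driver's SparkConf:

```scala
// Sketch (assumed helper, not the real SparkSubmit code): remove confs
// that YARN's own file distribution already handles, so SparkContext
// does not try to re-add them on the driver.
object YarnPropPruningSketch {
  def pruneYarnHandledProps(
      sysProps: Map[String, String],
      master: String): Map[String, String] = {
    if (master == "yarn") {
      sysProps - "spark.files" - "spark.jars"
    } else {
      sysProps
    }
  }
}
```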
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13599#discussion_r85877935

    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
    @@ -69,6 +84,66 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
       }

       /**
    +   * Create virtualenv using native virtualenv or conda
    +   *
    +   * Native Virtualenv:
    +   *   - Execute command: virtualenv -p pythonExec --no-site-packages virtualenvName
    +   *   - Execute command: python -m pip --cache-dir cache-dir install -r requirement_file
    +   *
    +   * Conda
    +   *   - Execute command: conda create --prefix prefix --file requirement_file -y
    +   *
    +   */
    +  def setupVirtualEnv(): Unit = {
    +    logDebug("Start to setup virtualenv...")
    +    logDebug("user.dir=" + System.getProperty("user.dir"))
    +    logDebug("user.home=" + System.getProperty("user.home"))
    +
    +    require(virtualEnvType == "native" || virtualEnvType == "conda",
    +      s"VirtualEnvType: ${virtualEnvType} is not supported")
    +    virtualEnvName = "virtualenv_" + conf.getAppId + "_" + VIRTUALENV_ID.getAndIncrement()
    +    // use the absolute path when it is local mode otherwise just use filename as it would be
    +    // fetched from FileServer
    +    val pyspark_requirements =
    +      if (Utils.isLocalMaster(conf)) {
    +        conf.get("spark.pyspark.virtualenv.requirements")
    +      } else {
    +        conf.get("spark.pyspark.virtualenv.requirements").split("/").last
    +      }
    +
    +    val createEnvCommand =
    +      if (virtualEnvType == "native") {
    +        Arrays.asList(virtualEnvPath,
    +          "-p", pythonExec,
    +          "--no-site-packages", virtualEnvName)
    +      } else {
    +        Arrays.asList(virtualEnvPath,
    +          "create", "--prefix", System.getProperty("user.dir") + "/" + virtualEnvName,
    +          "--file", pyspark_requirements, "-y")
    +      }
    +    execCommand(createEnvCommand)
    +    // virtualenv will be created in the working directory of Executor.
    +    virtualPythonExec = virtualEnvName + "/bin/python"
    --- End diff --

    Not tested on Windows yet.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
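To make the command construction in the diff above concrete, a standalone sketch of what the two branches produce; the tool paths, app id, and file names are illustrative placeholders, not values from the patch:

```scala
import java.util.Arrays

object CreateEnvCommandSketch {
  def main(args: Array[String]): Unit = {
    // Native branch: virtualenv -p <pythonExec> --no-site-packages <envName>
    val native = Arrays.asList("/usr/local/bin/virtualenv",
      "-p", "/usr/bin/python", "--no-site-packages", "virtualenv_app-123_0")
    // Conda branch: conda create --prefix <cwd>/<envName> --file <requirements> -y
    val conda = Arrays.asList("/opt/conda/bin/conda", "create",
      "--prefix", System.getProperty("user.dir") + "/virtualenv_app-123_0",
      "--file", "requirements.txt", "-y")
    println(native)
    println(conda)
  }
}
```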
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user zjffdu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13599#discussion_r85877793

    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
    @@ -69,6 +84,66 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
    +    if (virtualEnvType == "native") {
    +      execCommand(Arrays.asList(virtualPythonExec, "-m", "pip",
    +        "--cache-dir", System.getProperty("user.home"),
    +        "install", "-r", pyspark_requirements))
    +    }
    +  }
    +
    +  def execCommand(commands: java.util.List[String]): Unit = {
    +    logDebug("Running command:" + commands.asScala.mkString(" "))
    +    val pb = new ProcessBuilder(commands).inheritIO()
    +    // pip internally uses the environment variable `HOME`
    +    pb.environment().put("HOME", System.getProperty("user.home"))
    --- End diff --

    In yarn mode, HOME is set to "/home/", which is not correct, so here I get it from the system property user.home. From launch_container.sh:

    ```
    export HOME="/home/"
    ```

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user zjffdu commented on the issue:

    https://github.com/apache/spark/pull/13599

    Thanks for the review @mridulm. This approach moves the setup overhead from the user to the cluster: the user only needs to specify a requirements file, and Spark sets up the virtualenv automatically. This is consistent with how virtualenv is used on a single machine. SPARK-16367 takes a different approach, distributing the dependencies via spark-submit, but it requires the cluster to be homogeneous.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
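As a hedged illustration of the user-facing side of this approach, using only the three conf keys that appear in the diffs above (the virtualenv path and requirements file are placeholders, and any additional enable flag the PR may define is not shown):

```scala
import org.apache.spark.SparkConf

object VirtualenvConfSketch {
  // The user declares the tool and a requirements file; executors then
  // build their own environments from it.
  def withVirtualenv(conf: SparkConf): SparkConf = conf
    .set("spark.pyspark.virtualenv.type", "native") // or "conda"
    .set("spark.pyspark.virtualenv.bin.path", "/usr/local/bin/virtualenv")
    .set("spark.pyspark.virtualenv.requirements", "/path/to/requirements.txt")
}
```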
[GitHub] spark pull request #15669: [SPARK-18160][CORE][YARN] spark.files should not ...
Github user zjffdu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15669#discussion_r85872343

    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -1716,29 +1716,12 @@ class SparkContext(config: SparkConf) extends Logging {
           key = uri.getScheme match {
             // A JAR file which exists only on the driver node
             case null | "file" =>
    -          if (master == "yarn" && deployMode == "cluster") {
    -            // In order for this to work in yarn cluster mode the user must specify the
    -            // --addJars option to the client to upload the file into the distributed cache
    -            // of the AM to make it show up in the current working directory.
    -            val fileName = new Path(uri.getPath).getName()
    -            try {
    -              env.rpcEnv.fileServer.addJar(new File(fileName))
    -            } catch {
    -              case e: Exception =>
    -                // For now just log an error but allow to go through so spark examples work.
    -                // The spark examples don't really need the jar distributed since its also
    -                // the app jar.
    -                logError("Error adding jar (" + e + "), was the --addJars option used?")
    -                null
    -            }
    -          } else {
    -            try {
    -              env.rpcEnv.fileServer.addJar(new File(uri.getPath))
    -            } catch {
    -              case exc: FileNotFoundException =>
    -                logError(s"Jar not found at $path")
    -                null
    -            }
    --- End diff --

    This code is obsolete IMO, so I removed it.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15669: [SPARK-18160][CORE][YARN] SparkContext.addFile doesn't w...
Github user zjffdu commented on the issue:

    https://github.com/apache/spark/pull/15669

    That's correct; it is due to `spark.files`. The JIRA has been updated, and I will update the PR soon.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15669: [SPARK-18160][CORE][YARN] SparkContext.addFile doesn't w...
Github user zjffdu commented on the issue:

    https://github.com/apache/spark/pull/15669

    spark.files would still be passed to the driver even in yarn-cluster mode, as you can see from the following code:
    https://github.com/apache/spark/blob/7bf8a4049866b2ec7fdf0406b1ad0c3a12488645/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L609

    ```
    // Load any properties specified through --conf and the default properties file
    for ((k, v) <- args.sparkProperties) {
      sysProps.getOrElseUpdate(k, v)
    }
    ```

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
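A self-contained illustration of why the property leaks through (the value is made up): `getOrElseUpdate` only skips keys that are already set, so a `spark.files` entry from `--conf` or the defaults file always lands in `sysProps` and reaches the driver.

```scala
import scala.collection.mutable

object GetOrElseUpdateSketch {
  def main(args: Array[String]): Unit = {
    val sysProps = mutable.HashMap[String, String]()
    val sparkProperties = Map("spark.files" -> "file:/tmp/data.txt") // illustrative
    // Mirrors the SparkSubmit loop quoted above.
    for ((k, v) <- sparkProperties) {
      sysProps.getOrElseUpdate(k, v)
    }
    assert(sysProps("spark.files") == "file:/tmp/data.txt")
  }
}
```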