Re: Add python library with native code
Not totally sure it's gonna help your use case, but I'd recommend that you consider these too:

* pex <https://github.com/pantsbuild/pex>: a library and tool for generating .pex (Python EXecutable) files
* cluster-pack <https://github.com/criteo/cluster-pack>: a library on top of either pex or conda-pack to make your Python code easily available on a cluster

Masood

______
Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works

On 6/5/20 4:29 AM, Stone Zhong wrote:

Thanks Dark. I looked at that article. I think the article describes approach B; let me summarize both approaches:

A) Put the libraries on a network share, mount it on each node, and manually set PYTHONPATH in your code.
B) In your code, manually install the necessary packages using "pip install -r ".

I think approach B is very similar to approach A; both have pros and cons. With B, your cluster needs internet access (in my case, our cluster runs in an isolated environment for security reasons), though you can set up a private pip server and stage the needed packages there. With A, you need admin permission to mount the network share, which is also a DevOps burden. I am wondering if Spark could add a new API to tackle this scenario instead of these workarounds, which I suppose would be cleaner and more elegant.

Regards,
Stone

On Fri, Jun 5, 2020 at 1:02 AM Dark Crusader <relinquisheddra...@gmail.com> wrote:

Hi Stone,

Have you looked into this article? https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987

I haven't tried it with .so files, however I did use the approach he recommends to install my other dependencies. I hope it helps.

On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong <stone.zh...@gmail.com> wrote:

Hi,

My pyspark app depends on some Python libraries. At first this was not a problem: I packed all the dependencies into a file, libs.zip, called sc.addPyFile("libs.zip"), and it worked pretty well for a while. Then I encountered a problem: if any of my libraries depends on a binary file (like a .so file), this approach does not work, mainly because when you put a zip file on PYTHONPATH, Python does not look up binary libraries (e.g. .so files) inside the zip file; this is a Python limitation. So I found a workaround:

1) Do not call sc.addPyFile; instead, extract libs.zip into the current directory.
2) When my Python code starts, manually call sys.path.insert(0, f"{os.getcwd()}/libs") to set PYTHONPATH.

This workaround works well for me. Then I hit another problem: what if my code in an executor needs a Python library that has binary code? Below is an example:

def do_something(p):
    ...

rdd = sc.parallelize([
    {"x": 1, "y": 2},
    {"x": 2, "y": 3},
    {"x": 3, "y": 4},
])
a = rdd.map(do_something)

What if the function "do_something" needs a Python library that has binary code? My current solution is to extract libs.zip onto an NFS share (or an SMB share) and manually do sys.path.insert(0, f"{share_mount_dir}/libs") in my "do_something" function, but adding such code to each function looks ugly. Is there any better/more elegant solution?

Thanks,
Stone
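For illustration, one way to keep that per-function bookkeeping out of the business logic: a minimal sketch assuming libs.zip has been extracted to a share mounted at /mnt/libs (the path and the native_dep library are hypothetical, not from the thread):

import sys

LIBS_DIR = "/mnt/libs"   # hypothetical mount point of the extracted libs.zip

def ensure_libs():
    # Idempotent: each executor's Python worker adds the path at most once,
    # no matter how many records it processes.
    if LIBS_DIR not in sys.path:
        sys.path.insert(0, LIBS_DIR)

def do_something(p):
    ensure_libs()
    import native_dep     # hypothetical library that ships .so files
    return native_dep.transform(p)

a = rdd.map(do_something)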
Re: spark-submit exit status on k8s
Another, simpler solution that I just thought of: just add an operation at the end of your Spark program to write an empty file somewhere, with a filename such as SUCCESS. Add a stage to your Airflow graph to check for the existence of this file after running spark-submit. If the file is absent, then the Spark app must have failed. The above should work if you want to avoid dealing with the REST API for monitoring.

Masood

__
Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works

On 4/4/20 10:54 AM, Masood Krohy wrote:

I'm not on the Spark dev team, so I cannot tell you why that priority was chosen for the JIRA issue or whether anyone is about to finish the work on it; I'll let others jump in if they know. I just wanted to offer a potential solution so that you can move ahead in the meantime.

Masood

__
Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works

On 4/4/20 7:49 AM, Marshall Markham wrote:

Thank you very much, Masood, for your fast response. Last question: is the current status in Jira representative of the status of the ticket within the project team? This seems like a big deal for the K8s implementation, and we were surprised to find it marked as priority low. Is there any discussion of picking up this work in the near future?

Thanks,
Marshall

From: Masood Krohy
Sent: Friday, April 3, 2020 9:34 PM
To: Marshall Markham; user
Subject: Re: spark-submit exit status on k8s

While you wait for a fix on that JIRA ticket, you may be able to add an intermediary step in your Airflow graph that calls Spark's REST API after submitting the job, digs into the actual status of the application, and makes a success/fail decision accordingly. You can make repeated calls to the REST API in a loop, with a few seconds' delay between calls, while the execution is in progress, until the application fails or succeeds.

https://spark.apache.org/docs/latest/monitoring.html#rest-api

Hope this helps.

Masood

__
Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works

On 4/3/20 8:23 AM, Marshall Markham wrote:

Hi Team,

My team recently conducted a POC of Kubernetes/Airflow/Spark with great success. The major concern we have about this system, after the completion of our POC, is a behavior of spark-submit: when called with a Kubernetes API endpoint as master, spark-submit seems to always return exit status 0. This is obviously a major issue, preventing us from conditioning job graphs on the success or failure of our Spark jobs. I found JIRA ticket SPARK-27697 under the Apache issues covering this bug. The ticket is listed as minor and does not seem to have had any recent activity. I would like to upvote it and ask if there is anything I can do to move this forward. This could be the one thing standing between my team and our preferred batch workload implementation. Thank you.
Marshall Markham
Data Engineer
PrecisionLender, a Q2 Company

NOTE: This communication and any attachments are for the sole use of the intended recipient(s) and may contain confidential and/or privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by replying to this email, and destroy all copies of the original message.
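For illustration, a minimal sketch of the sentinel-file idea above; the paths, the stand-in job action, and the Airflow callable are assumptions, not code from the thread:

import os

MARKER = "/mnt/shared/myjob.SUCCESS"   # illustrative path visible to both Spark and Airflow

# ... the Spark job's real work; if any action below raises, the marker is never written ...
row_count = df.count()                 # `df` stands in for the app's actual computation

# Reached only if everything above succeeded.
with open(MARKER, "w"):
    pass

# In the Airflow check stage (e.g., the callable of a PythonOperator):
def check_spark_success():
    if not os.path.exists(MARKER):
        raise RuntimeError("SUCCESS marker missing: treating the Spark run as failed")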
Re: spark-submit exit status on k8s
I'm not on the Spark dev team, so I cannot tell you why that priority was chosen for the JIRA issue or whether anyone is about to finish the work on it; I'll let others jump in if they know. I just wanted to offer a potential solution so that you can move ahead in the meantime.

Masood

__
Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works

On 4/4/20 7:49 AM, Marshall Markham wrote:

Thank you very much, Masood, for your fast response. Last question: is the current status in Jira representative of the status of the ticket within the project team? This seems like a big deal for the K8s implementation, and we were surprised to find it marked as priority low. Is there any discussion of picking up this work in the near future?

Thanks,
Marshall

From: Masood Krohy
Sent: Friday, April 3, 2020 9:34 PM
To: Marshall Markham; user
Subject: Re: spark-submit exit status on k8s

While you wait for a fix on that JIRA ticket, you may be able to add an intermediary step in your Airflow graph that calls Spark's REST API after submitting the job, digs into the actual status of the application, and makes a success/fail decision accordingly. You can make repeated calls to the REST API in a loop, with a few seconds' delay between calls, while the execution is in progress, until the application fails or succeeds.

https://spark.apache.org/docs/latest/monitoring.html#rest-api

Hope this helps.

Masood

__
Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works

On 4/3/20 8:23 AM, Marshall Markham wrote:

Hi Team,

My team recently conducted a POC of Kubernetes/Airflow/Spark with great success. The major concern we have about this system, after the completion of our POC, is a behavior of spark-submit: when called with a Kubernetes API endpoint as master, spark-submit seems to always return exit status 0. This is obviously a major issue, preventing us from conditioning job graphs on the success or failure of our Spark jobs. I found JIRA ticket SPARK-27697 under the Apache issues covering this bug. The ticket is listed as minor and does not seem to have had any recent activity. I would like to upvote it and ask if there is anything I can do to move this forward. This could be the one thing standing between my team and our preferred batch workload implementation. Thank you.

Marshall Markham
Data Engineer
PrecisionLender, a Q2 Company

NOTE: This communication and any attachments are for the sole use of the intended recipient(s) and may contain confidential and/or privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by replying to this email, and destroy all copies of the original message.
Re: spark-submit exit status on k8s
While you wait for a fix on that JIRA ticket, you may be able to add an intermediary step in your Airflow graph that calls Spark's REST API after submitting the job, digs into the actual status of the application, and makes a success/fail decision accordingly. You can make repeated calls to the REST API in a loop, with a few seconds' delay between calls, while the execution is in progress, until the application fails or succeeds.

https://spark.apache.org/docs/latest/monitoring.html#rest-api

Hope this helps.

Masood

__
Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works

On 4/3/20 8:23 AM, Marshall Markham wrote:

Hi Team,

My team recently conducted a POC of Kubernetes/Airflow/Spark with great success. The major concern we have about this system, after the completion of our POC, is a behavior of spark-submit: when called with a Kubernetes API endpoint as master, spark-submit seems to always return exit status 0. This is obviously a major issue, preventing us from conditioning job graphs on the success or failure of our Spark jobs. I found JIRA ticket SPARK-27697 under the Apache issues covering this bug. The ticket is listed as minor and does not seem to have had any recent activity. I would like to upvote it and ask if there is anything I can do to move this forward. This could be the one thing standing between my team and our preferred batch workload implementation. Thank you.

Marshall Markham
Data Engineer
PrecisionLender, a Q2 Company

NOTE: This communication and any attachments are for the sole use of the intended recipient(s) and may contain confidential and/or privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by replying to this email, and destroy all copies of the original message.
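For illustration, a rough sketch of that polling loop against the Spark REST API. The host/port are assumptions, and note that a driver's live UI endpoint disappears once the app exits, so in practice you may need to point this at a history server instead:

import time
import requests   # assumed available where the check runs

API = "http://spark-ui-host:4040/api/v1"   # illustrative host/port

def spark_app_succeeded(app_id, poll_secs=5, timeout_secs=3600):
    # Poll until every job of the application reaches a terminal state.
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        jobs = requests.get("%s/applications/%s/jobs" % (API, app_id)).json()
        statuses = {j["status"] for j in jobs}
        if "FAILED" in statuses:
            return False
        if jobs and statuses == {"SUCCEEDED"}:
            return True
        time.sleep(poll_secs)
    return False   # treat a timeout as a failure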
Re: [Pyspark 2.3+] Timeseries with Spark
Hi Rishi,

Spark and Flint are useful during the data engineering phase, but you'd need to look elsewhere after that. I'm not aware of any active Spark-native project to do ML/forecasting on time series data. If the data that you want to train the model on can fit in one node's memory, you can use libs and models like ARIMA, Prophet, or LSTM-based NNs to train a model and use it for forecasting. You can then use Spark to parallelize the grid search over the space of hyperparameters to get the optimal model faster, as the grid search is a perfectly parallel job (a.k.a. embarrassingly parallel). I gave a talk on this which you may find useful: https://www.analytical.works/Talk-spark-ml.html

Masood

__
Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works

On 12/29/19 11:30 AM, Rishi Shah wrote:

Hi All,

Checking in to see if anyone has input on time series libraries for Spark. I am interested mainly in financial forecasting models & regression at this point. Input is a bunch of pricing data points. I have read a lot about the spark-timeseries and Flint libraries, but I am not sure of the best ways/use cases to use these libraries, or whether there's any other preferred way of tackling time series problems at scale.

Thanks,
-Shraddha

On Sun, Jun 16, 2019 at 9:17 AM Rishi Shah <rishishah.s...@gmail.com> wrote:

Thanks Jörn. I am interested in time series forecasting for now, but in general I was unable to find a good way to work with different time series methods using Spark.

On Fri, Jun 14, 2019 at 1:55 AM Jörn Franke <jornfra...@gmail.com> wrote:

Time series can mean a lot of different things and algorithms. Can you describe more what you mean by your time series use case, i.e., what is the input, what would you like to do with it, and what is the output?

> On 14.06.2019, at 06:01, Rishi Shah <rishishah.s...@gmail.com> wrote:
>
> Hi All,
>
> I have a time series use case which I would like to implement in Spark... What would be the best way to do so? Any built-in libraries?
>
> --
> Regards,
>
> Rishi Shah

--
Regards,
Rishi Shah

--
Regards,
Rishi Shah
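For illustration, a sketch of the perfectly-parallel grid search described above, assuming the series fits in one node's memory and using statsmodels' ARIMA as the single-node model (both are assumptions; `prices` is an illustrative in-memory series):

from itertools import product
from statsmodels.tsa.arima.model import ARIMA   # assumed installed on all executors

bc_series = sc.broadcast(prices)   # small series: shipped once to each node

def fit_and_score(order):
    # Each Spark task fits one (p, d, q) candidate on the full series.
    fitted = ARIMA(bc_series.value, order=order).fit()
    return (order, fitted.aic)

grid = list(product([0, 1, 2], [0, 1], [0, 1, 2]))   # (p, d, q) candidates
best_order, best_aic = (sc.parallelize(grid, len(grid))
                          .map(fit_and_score)
                          .sortBy(lambda t: t[1])    # lower AIC is better
                          .first())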
RE: build models in parallel
You can use your groupId as a grid parameter and filter your dataset on this id in a pipeline stage, before feeding it to the model. The following may help:

http://spark.apache.org/docs/latest/ml-tuning.html
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder

The above should work, but I haven't tried it myself. What I have tried is the following Embarrassingly Parallel architecture (TensorFlow was a requirement in that use case). See a PySpark/TensorFlow example here:

https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html

A relevant excerpt from the notebook mentioned above (http://go.databricks.com/hubfs/notebooks/TensorFlow/Test_distributed_processing_of_images_using_TensorFlow.html):

num_nodes = 4
n = max(2, int(len(all_experiments) // num_nodes))
grouped_experiments = [all_experiments[i:i+n] for i in range(0, len(all_experiments), n)]
all_exps_rdd = sc.parallelize(grouped_experiments, numSlices=len(grouped_experiments))
results = all_exps_rdd.flatMap(lambda z: [run(*y) for y in z]).collect()

Again, like above, you use your groupId as a parameter in the grid search; this works if your full dataset fits in the memory of a single machine. You can broadcast the dataset in a compressed format and do the preprocessing and feature engineering after filtering on groupId, to maximize the size of the dataset that can use this modeling approach.

Masood

--
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R
Intact Financial Corporation
http://ca.linkedin.com/in/masoodkh

De : Xiaomeng Wan <shawn...@gmail.com>
A : User <user@spark.apache.org>
Date : 2016-11-29 11:54
Objet : build models in parallel

I want to divide big data into groups (e.g. group by some id) and build one model for each group. I am wondering whether I can parallelize the model-building process by implementing a UDAF (e.g. running linear regression in its evaluate method). Is that good practice? Does anybody have experience with this? Thanks!

Regards,
Shawn
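For illustration, the same pattern with groupId driving the grid; the column names and the least-squares stand-in model are assumptions, not the original poster's setup:

import numpy as np

bc_rows = sc.broadcast(df.collect())   # assumes the full dataset fits in memory

def train_one_group(group_id):
    # Each Spark task trains one model, on one group's slice of the data.
    rows = [r for r in bc_rows.value if r["groupId"] == group_id]
    X = np.array([r["features"] for r in rows])
    y = np.array([r["label"] for r in rows])
    # Ordinary least squares as a stand-in for any per-group model.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return (group_id, w.tolist())

group_ids = [r["groupId"] for r in df.select("groupId").distinct().collect()]
models = (sc.parallelize(group_ids, len(group_ids))
            .map(train_one_group)
            .collect())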
Re: Cluster deploy mode driver location
You may also try distributing your JARs along with your Spark app; see the options below. You put whatever is necessary on the client node and submit it all in each run. There is also a --files option, which you can remove below, but it may be helpful for some configs. You do not need to specify all the arguments; the default values are picked up when not explicitly given.

spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 2 \
--driver-memory 4g \
--executor-memory 8g \
--files /usr/hdp/current/spark-client/conf/hive-site.xml \
--jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar \
--class "SparkApp" \
/pathToAppOnTheClientNode/SparkApp.jar (arguments passed to the Spark app, if any)

Masood

------
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R
Intact Financial Corporation
http://ca.linkedin.com/in/masoodkh

De : Silvio Fiorito <silvio.fior...@granturing.com>
A : "saif.a.ell...@wellsfargo.com" <saif.a.ell...@wellsfargo.com>, "user@spark.apache.org" <user@spark.apache.org>
Date : 2016-11-22 08:02
Objet : Re: Cluster deploy mode driver location

Hi Saif!

Unfortunately, I don't think this is possible for YARN driver-cluster mode. Regarding the JARs you're referring to, can you place them on HDFS so you can have them in a central location and refer to them that way for dependencies?

http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management

Thanks,
Silvio

From: saif.a.ell...@wellsfargo.com <saif.a.ell...@wellsfargo.com>
Sent: Monday, November 21, 2016 2:04:06 PM
To: user@spark.apache.org
Subject: Cluster deploy mode driver location

Hello there,

I have a Spark program on 1.6.1; however, when I submit it to the cluster, it randomly picks the driver. I know there is a driver specification option, but along with it, it is mandatory to define many other options I am not familiar with. The trouble is, the jars I am launching need to be available at the driver host, and I would like to have these jars on just a specific host, which I would like to be the driver. Any help?

Thanks!
Saif
Re: LinearRegressionWithSGD and Rank Features By Importance
No, you do not scale back the predicted value. The output values (labels) were never scaled; only the input features were scaled. For prediction on new samples, you scale the new sample first, using the avg/std that you calculated for each feature when you trained your model, then feed it to the trained model. If it's a classification problem, then you're done here: a class is predicted based on the trained model. If it's a regression problem, then the predicted value does not need scaling back; it is on the same scale as the original output values you used when you trained your model.

This is now becoming more of a Data Science/ML question than a Spark issue, and is probably best kept off this list. Do some reading on the topic and get back to me directly; I'll respond when possible. Hope this has helped.

Masood

--
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R
Intact Financial Corporation
http://ca.linkedin.com/in/masoodkh

De : Carlo.Allocca <carlo.allo...@open.ac.uk>
A : Masood Krohy <masood.kr...@intact.net>
Cc : Carlo.Allocca <carlo.allo...@open.ac.uk>, Mohit Jaggi <mohitja...@gmail.com>, "user@spark.apache.org" <user@spark.apache.org>
Date : 2016-11-08 11:02
Objet : Re: LinearRegressionWithSGD and Rank Features By Importance

Hi Masood,

Thank you again for your suggestion. I have got a question about the following: "For prediction on new samples, you need to scale each sample first before making predictions using your trained model." When applying the ML linear model as suggested above, it means that the predicted value is scaled. My question: does it need to be scaled back? I mean, applying the inverse of "calculate the average and std for each feature, deduct the avg, then divide by std" to the predicted value; in practice, (predicted-value * std) + avg? Is that correct? Am I missing anything?

Many Thanks in advance.
Best Regards,
Carlo

On 7 Nov 2016, at 17:14, carlo allocca <ca6...@open.ac.uk> wrote:

I found it; just Googled it: http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

Thanks.
Carlo

On 7 Nov 2016, at 17:12, carlo allocca <ca6...@open.ac.uk> wrote:

Hi Masood,

Thank you very much for your insight. I am going to scale all my features as you described. As I am a beginner, is there any paper/book that explains the suggested approaches? I would love to read it.

Many Thanks,
Best Regards,
Carlo

On 7 Nov 2016, at 16:27, Masood Krohy <masood.kr...@intact.net> wrote:

Yes, you would want to scale those features before feeding them into any algorithm. One typical way is to calculate the average and std for each feature, deduct the avg, then divide by std. Dividing by "max - min" is also a good option if you're sure there is no outlier shooting up your max or lowering your min significantly for any feature. After you have scaled each feature, you can feed the data into the algo for training. For prediction on new samples, you need to scale each sample first before making predictions using your trained model. It's not too complicated to implement manually, but the Spark API already has some support for this:

ML: http://spark.apache.org/docs/latest/ml-features.html#standardscaler
MLlib: http://spark.apache.org/docs/latest/mllib-feature-extraction.html#standardscaler

Masood

--
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R
Intact Financial Corporation
http://ca.linkedin.com/in/masoodkh

De : Carlo.Allocca <carlo.allo...@open.ac.uk>
A : Masood Krohy <masood.kr...@intact.net>
Cc : Carlo.Allocca <carlo.allo...@open.ac.uk>, Mohit Jaggi <mohitja...@gmail.com>, "user@spark.apache.org" <user@spark.apache.org>
Date : 2016-11-07 10:50
Objet : Re: LinearRegressionWithSGD and Rank Features By Importance

Hi Masood,

Thank you very much for the reply. It is a very good point, as I am getting very bad results so far. If I understood well, what you suggest is to scale the data below (it is part of my dataset) before applying linear regression SGD. Is that correct?

Many Thanks in advance.
Best Regards,
Carlo

On 7 Nov 2016, at 15:31, Masood Krohy <masood.kr...@intact.net> wrote:

If you go down this route (look at actual coefficients/weights), then make sure your features are scaled first and have more or less the same mean when feeding them into the algo. If not, the actual coefficients/weights won't tell you much. In any case, SGD performs badly with unscaled features, so you gain by scaling the features beforehand.

Masood

--
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R
Intact Financial Corporation
http://ca.linkedin.com/in/masoodkh

De : Carlo.Allocca <carlo.allo...@open.ac.uk>
A : Mohit Jaggi <moh
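For illustration, a minimal PySpark sketch of the scale-train-predict flow discussed in this thread, using MLlib's StandardScaler (variable names are illustrative; withMean=True assumes dense feature vectors):

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# `data`: RDD[LabeledPoint]; fit the scaler on the training features only.
labels = data.map(lambda lp: lp.label)
features = data.map(lambda lp: lp.features)
scaler = StandardScaler(withMean=True, withStd=True).fit(features)

# Scale the features; the labels stay on their original scale.
scaled = labels.zip(scaler.transform(features)).map(lambda lf: LabeledPoint(lf[0], lf[1]))
model = LinearRegressionWithSGD.train(scaled, iterations=100)

# A new sample is scaled with the SAME fitted scaler before predicting;
# the prediction itself needs no scaling back.
prediction = model.predict(scaler.transform(new_sample_features))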
Re: Live data visualisations with Spark
+1 for Zeppelin. See https://community.hortonworks.com/articles/10365/apache-zeppelin-and-sparkr.html

--
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R
Intact Financial Corporation
http://ca.linkedin.com/in/masoodkh

De : Vadim Semenov <vadim.seme...@datadoghq.com>
A : Andrew Holway <andrew.hol...@otternetworks.de>
Cc : user <user@spark.apache.org>
Date : 2016-11-08 11:17
Objet : Re: Live data visualisations with Spark

Take a look at https://zeppelin.apache.org

On Tue, Nov 8, 2016 at 11:13 AM, Andrew Holway <andrew.hol...@otternetworks.de> wrote:

Hello,

A colleague and I are trying to work out the best way to provide live data visualisations based on Spark. Is it possible to explore a dataset in Spark from a web browser, and to set up predefined functions that the user can click on which return datasets? We are using a lot of R here. Is this something that could be accomplished with Shiny Server, for instance?

Thanks,
Andrew Holway
Re: LinearRegressionWithSGD and Rank Features By Importance
If you go down this route (look at actual coefficients/weights), then make sure your features are scaled first and have more or less the same mean when feeding them into the algo. If not, the actual coefficients/weights won't tell you much. In any case, SGD performs badly with unscaled features, so you gain by scaling the features beforehand.

Masood

--
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R
Intact Financial Corporation
http://ca.linkedin.com/in/masoodkh

De : Carlo.Allocca <carlo.allo...@open.ac.uk>
A : Mohit Jaggi <mohitja...@gmail.com>
Cc : Carlo.Allocca <carlo.allo...@open.ac.uk>, "user@spark.apache.org" <user@spark.apache.org>
Date : 2016-11-04 03:39
Objet : Re: LinearRegressionWithSGD and Rank Features By Importance

Hi Mohit,

Thank you for your reply. OK, it means coefficients with high scores are more important than those with low scores…

Many Thanks,
Best Regards,
Carlo

> On 3 Nov 2016, at 20:41, Mohit Jaggi <mohitja...@gmail.com> wrote:
>
> For linear regression, it should be fairly easy. Just sort the coefficients :)
>
> Mohit Jaggi
> Founder,
> Data Orchard LLC
> www.dataorchardllc.com
>
>> On Nov 3, 2016, at 3:35 AM, Carlo.Allocca <carlo.allo...@open.ac.uk> wrote:
>>
>> Hi All,
>>
>> I am using Spark and in particular the MLlib library.
>>
>> import org.apache.spark.mllib.regression.LabeledPoint;
>> import org.apache.spark.mllib.regression.LinearRegressionModel;
>> import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
>>
>> For my problem I am using LinearRegressionWithSGD, and I would like to perform a "Rank Features By Importance".
>>
>> I checked the documentation and it seems that it does not provide such methods.
>>
>> Am I missing anything? Please, could you provide any help on this? Should I change the approach?
>>
>> Many Thanks in advance,
>>
>> Best Regards,
>> Carlo
>>
>> -- The Open University is incorporated by Royal Charter (RC 000391), an exempt charity in England & Wales and a charity registered in Scotland (SC 038302). The Open University is authorised and regulated by the Financial Conduct Authority.
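For illustration, a short PySpark sketch of "scale first, then sort the coefficients"; `scaled_data` is assumed to be an RDD[LabeledPoint] whose features were standardized beforehand (e.g., as in the StandardScaler thread above):

from pyspark.mllib.regression import LinearRegressionWithSGD

# Standardized features make the weight magnitudes comparable across features.
model = LinearRegressionWithSGD.train(scaled_data, iterations=100, step=0.1)

# Rank features by the absolute value of their learned weight.
ranked = sorted(enumerate(model.weights), key=lambda iw: abs(iw[1]), reverse=True)
for idx, w in ranked:
    print("feature %d: weight %.4f" % (idx, w))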
Re: Deep learning libraries for scala
If you need ConvNets and RNNs and want to stay in Scala/Java, then Deep Learning for Java (DL4J) might be the most mature option. If you want ConvNets and RNNs as implemented in TensorFlow, along with all the bells and whistles, then you might want to switch to PySpark + TensorFlow and write the entire pipeline in Python. You'd do the data preparation/ingestion in PySpark and pass the data to TensorFlow for the ML part. There are two supported modes here:

1) Simultaneous multi-model training (a.k.a. embarrassingly parallel: each node has the entire data and model):
https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html

2) Data parallelism (data is distributed; each node has the entire model). There are some prototypes out there, and TensorSpark seems to be the most mature: https://github.com/adatao/tensorspark It implements Downpour/asynchronous SGD for the distributed training; it remains to be stress-tested with large datasets, however. More info:
https://arimo.com/machine-learning/deep-learning/2016/arimo-distributed-tensorflow-on-spark/

TensorFrames does not allow distributed training, and I did not see any performance benchmarks last time I checked. Alexander Ulanov of HP gave a presentation on the options a few months ago:
https://www.oreilly.com/learning/distributed-deep-learning-on-spark

Masood

--
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R
Intact Financial Corporation

De : Benjamin Kim <bbuil...@gmail.com>
A : janardhan shetty <janardhan...@gmail.com>
Cc : Gourav Sengupta <gourav.sengu...@gmail.com>, user <user@spark.apache.org>
Date : 2016-11-01 13:14
Objet : Re: Deep learning libraries for scala

To add, I see that Databricks has been busy integrating deep learning more into their product and put out a new article about this:
https://databricks.com/blog/2016/10/27/gpu-acceleration-in-databricks.html

An interesting tidbit at the bottom of the article mentions TensorFrames:
https://github.com/databricks/tensorframes

Seems like an interesting direction…

Cheers,
Ben

On Oct 19, 2016, at 9:05 AM, janardhan shetty <janardhan...@gmail.com> wrote:

Agreed. But as it states, deeper integration with Scala is yet to be developed. Any thoughts on how to use TensorFlow with Scala? We'd need to write wrappers, I think.

On Oct 19, 2016 7:56 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:

On that note, here is an article that Databricks wrote regarding using TensorFlow in conjunction with Spark:
https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html

Cheers,
Ben

On Oct 19, 2016, at 3:09 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

While using deep learning you might want to stay as close to TensorFlow as possible. There is very little translation loss, and you get access to stable, scalable and tested libraries from the best brains in the industry. As far as Scala goes, it helps a lot to think of the language as a tool for accessing algorithms in this instance, unless you want to start developing algorithms from the ground up (in which case you might not need any libraries at all).

On Sat, Oct 1, 2016 at 3:30 AM, janardhan shetty <janardhan...@gmail.com> wrote:

Hi,

Are there any good libraries which can be used for Scala deep learning models? How can we integrate TensorFlow with Scala ML?
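For illustration, a skeleton of mode 1 above (each task trains one complete model on the broadcast full dataset); `run_experiment` and the hyperparameter names are placeholders, not the blog's actual code:

bc_data = sc.broadcast(training_data)   # assumes the data fits on one node

def train_and_eval(params):
    # Placeholder: build and train a TensorFlow (or any) model with `params`
    # on bc_data.value, returning a validation score.
    score = run_experiment(bc_data.value, **params)   # hypothetical helper
    return (params, score)

grid = [{"lr": lr, "layers": n} for lr in (0.01, 0.001) for n in (2, 3)]
results = sc.parallelize(grid, len(grid)).map(train_and_eval).collect()
best_params, best_score = max(results, key=lambda ps: ps[1])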
Re: Getting the IP address of Spark Driver in yarn-cluster mode
Thanks Steve. Here is the Python pseudo code that got it working for me:

import time
import urllib2

nodes = {'worker1_hostname': 'worker1_ip', ... }

YARN_app_queue = 'default'
YARN_address = 'http://YARN_IP:8088'
# We allow 3,600 sec from start of the app up to this point
YARN_app_startedTimeBegin = str(int(time.time() - 3600))

requestedURL = (YARN_address +
                '/ws/v1/cluster/apps?states=RUNNING&applicationTypes=SPARK&limit=1' +
                '&queue=' + YARN_app_queue +
                '&startedTimeBegin=' + YARN_app_startedTimeBegin)
print 'Sent request to YARN: ' + requestedURL

response = urllib2.urlopen(requestedURL)
html = response.read()

# Crude JSON parsing: locate the amHostHttpAddress field in the response.
amHost_start = html.find('amHostHttpAddress') + len('amHostHttpAddress":"')
amHost_length = len('worker1_hostname')
amHost = html[amHost_start : amHost_start + amHost_length]
print 'amHostHttpAddress is: ' + amHost

try:
    self.websock = ...
    print 'Connected to server running on %s' % nodes[amHost]
except:
    print 'Could not connect to server on %s' % nodes[amHost]

Masood

------
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R
Intact Financial Corporation

De : Steve Loughran <ste...@hortonworks.com>
A : Masood Krohy <masood.kr...@intact.net>
Cc : "user@spark.apache.org" <user@spark.apache.org>
Date : 2016-10-24 17:09
Objet : Re: Getting the IP address of Spark Driver in yarn-cluster mode

On 24 Oct 2016, at 19:34, Masood Krohy <masood.kr...@intact.net> wrote:

Hi everyone,

Is there a way to set the IP address/hostname that the Spark Driver is going to be running on when launching a program through spark-submit in yarn-cluster mode (PySpark 1.6.0)? I do not see an option for this. If not, is there a way to get this IP address after the Spark app has started running (through an API call at the beginning of the program, to be used in the rest of the program)? spark-submit outputs "ApplicationMaster host: 10.0.0.9" in the console (and it changes on every run due to yarn-cluster mode), and I am wondering if this can be accessed within the program. It does not seem to me that a YARN node label can be used to tie the Spark Driver/AM to a node while allowing the Executors to run on all the nodes.

You can grab it off the YARN API itself; there's a REST view as well as a fussier RPC level. That is, assuming you want the web view, which is what is registered. If you know the application ID, you can also construct a URL through the YARN proxy; any attempt to talk directly to the AM is going to get 302'd back there anyway, so any Kerberos credentials can be verified.
Getting the IP address of Spark Driver in yarn-cluster mode
Hi everyone,

Is there a way to set the IP address/hostname that the Spark Driver is going to be running on when launching a program through spark-submit in yarn-cluster mode (PySpark 1.6.0)? I do not see an option for this. If not, is there a way to get this IP address after the Spark app has started running (through an API call at the beginning of the program, to be used in the rest of the program)? spark-submit outputs "ApplicationMaster host: 10.0.0.9" in the console (and it changes on every run due to yarn-cluster mode), and I am wondering if this can be accessed within the program. It does not seem to me that a YARN node label can be used to tie the Spark Driver/AM to a node while allowing the Executors to run on all the nodes.

I am running a parameter server along with the Spark Driver that needs to be contacted during program execution; I need the Driver's IP so that the other executors can call back to this server. I need to stick to yarn-cluster mode.

Thanks for any hints in advance.

Masood

PS: the closest code I was able to write is this, which does not output what I need:

print sc.statusTracker().getJobInfo( sc.statusTracker().getActiveJobsIds()[0] )
# output in YARN stdout log:
# SparkJobInfo(jobId=4, stageIds=JavaObject id=o101, status='SUCCEEDED')

--
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R
Intact Financial Corporation
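For illustration, one way to read the driver's address from inside the running app: `spark.driver.host` is a standard Spark property set at runtime, though `sc._conf` is an internal handle, so verify this on your Spark version:

import socket

# Spark sets spark.driver.host when the driver starts; read it from the conf.
driver_host = sc._conf.get("spark.driver.host")

# Alternative: in yarn-cluster mode this code runs in the driver process,
# so resolving our own hostname also yields the driver's address.
driver_ip = socket.gethostbyname(socket.gethostname())

print("Driver is running on %s (%s)" % (driver_host, driver_ip))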