Re: Add python library with native code

2020-06-05 Thread Masood Krohy
Not totally sure it's gonna help your use case, but I'd recommend that 
you consider these too:


 * pex <https://github.com/pantsbuild/pex>: a library and tool for
   generating .pex (Python EXecutable) files
 * cluster-pack <https://github.com/criteo/cluster-pack>: a library on
   top of either pex or conda-pack that makes your Python code easily
   available on a cluster

Masood

______

Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works

On 6/5/20 4:29 AM, Stone Zhong wrote:
Thanks Dark. Looked at that article. I think the article describes 
approach B; let me summarize both approach A and approach B:
A) Put the libraries on a network share, mount it on each node, and in 
your code manually set PYTHONPATH.
B) In your code, manually install the necessary packages using "pip 
install -r "


I think approach B is very similar to approach A; both have pros and 
cons. With B), your cluster needs internet access (in my case, our 
cluster runs in an isolated environment for security reasons), though 
you can set up a private pip server and stage the needed packages 
there. For A, you need admin permission to be able to mount the 
network share, which is also a DevOps burden.
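
For what it's worth, a minimal sketch of approach B with a private index 
could look like the snippet below; the index URL and the requirements file 
name are just placeholders:

import subprocess
import sys

def install_requirements(requirements_path="requirements.txt",
                         index_url="http://pypi.internal.example/simple"):
    # Install packages from the private index before any Spark code imports them.
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "-r", requirements_path,
        "--index-url", index_url,
    ])

install_requirements()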


I am wondering if Spark could add a new API to tackle this scenario 
instead of these workarounds, which I suppose would be cleaner and 
more elegant.


Regards,
Stone


On Fri, Jun 5, 2020 at 1:02 AM Dark Crusader 
<relinquisheddra...@gmail.com> wrote:


Hi Stone,

Have you looked into this article?
https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987


I haven't tried it with .so files; however, I did use the approach
he recommends to install my other dependencies.
I hope it helps.

On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong <stone.zh...@gmail.com> wrote:

Hi,

So my PySpark app depends on some Python libraries. That by itself is
not a problem: I pack all the dependencies into a file libs.zip,
then call *sc.addPyFile("libs.zip")*, and it worked pretty
well for a while.

Then I encountered a problem: if any of my libraries has a
binary dependency (like .so files), this approach does
not work, mainly because when you set PYTHONPATH to a zip
file, Python does not look up a needed binary library (e.g. a
.so file) inside the zip file; this is a Python
*limitation*. So I got a workaround:

1) Do not call sc.addPyFile; instead, extract libs.zip into the
current directory.
2) When my Python code starts, manually call
*sys.path.insert(0, f"{os.getcwd()}/libs")* to extend the import path.
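
Roughly, the workaround looks like this (assuming libs.zip unpacks into a 
top-level libs/ folder):

import os
import sys
import zipfile

# 1) Extract the dependency archive into the current working directory.
with zipfile.ZipFile("libs.zip") as zf:
    zf.extractall(os.getcwd())

# 2) Put the extracted directory on the import path before importing
#    anything that ships native .so files.
sys.path.insert(0, os.path.join(os.getcwd(), "libs"))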

This workaround works well for me. Then I hit another problem:
what if my code running in an executor needs a Python library that
has binary code? Below is an example:

def do_something(p):
    ...

rdd = sc.parallelize([
    {"x": 1, "y": 2},
    {"x": 2, "y": 3},
    {"x": 3, "y": 4},
])
a = rdd.map(do_something)

What if the function "do_something" needs a Python library
that has binary code? My current solution is to extract libs.zip
onto an NFS share (or an SMB share) and manually do
*sys.path.insert(0, f"share_mount_dir/libs")* in my
"do_something" function, but adding such code to each function
looks ugly. Is there a better/more elegant solution?
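
Spelled out, what I do today looks roughly like this; the mount point and 
the numpy import are just placeholders:

import sys

SHARED_LIBS = "/mnt/shared/libs"  # placeholder for the NFS/SMB mount point

def do_something(p):
    # This block has to be repeated in every function that runs on an
    # executor -- this is the part that feels ugly.
    if SHARED_LIBS not in sys.path:
        sys.path.insert(0, SHARED_LIBS)
    import numpy as np  # stands in for any dependency with native code
    return {"x": p["x"], "y": p["y"], "norm": float(np.hypot(p["x"], p["y"]))}

rdd = sc.parallelize([
    {"x": 1, "y": 2},
    {"x": 2, "y": 3},
    {"x": 3, "y": 4},
])
result = rdd.map(do_something).collect()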

Thanks,
Stone



Re: spark-submit exit status on k8s

2020-04-05 Thread Masood Krohy
Another, simpler solution that I just thought of: just add an operation 
at the end of your Spark program to write an empty file somewhere, with 
filename SUCCESS for example. Add a stage to your AirFlow graph to check 
the existence of this file after running spark-submit. If the file is 
absent, then the Spark app must have failed.


The above should work if you want to avoid dealing with the REST API for 
monitoring.
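
A rough sketch of both halves, assuming some storage that both the Spark 
driver and AirFlow can reach (a mounted path below; HDFS or object storage 
would work the same way):

import os

SHARED = "/mnt/shared/markers/my_job"   # example location; clear it before each run

# --- last lines of the Spark application, after all actions have finished ---
def write_success_marker():
    os.makedirs(SHARED, exist_ok=True)
    open(os.path.join(SHARED, "SUCCESS"), "w").close()

# --- follow-up task in the AirFlow graph ---
def check_success_marker():
    if not os.path.exists(os.path.join(SHARED, "SUCCESS")):
        raise RuntimeError("Spark app did not write its SUCCESS marker")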


Masood

__

Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works

On 4/4/20 10:54 AM, Masood Krohy wrote:


I'm not in the Spark dev team, so cannot tell you why that priority 
was chosen for the JIRA issue or if anyone is about to finish the work 
on that; I'll let others jump in if they know.


Just wanted to offer a potential solution so that you can move ahead 
in the meantime.


Masood

__

Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works
On 4/4/20 7:49 AM, Marshall Markham wrote:


Thank you very much Masood for your fast response. Last question: is 
the current status in Jira representative of the status of the ticket 
within the project team? This seems like a big deal for the K8s 
implementation, and we were surprised to find it marked as priority 
low. Is there any discussion of picking up this work in the near future?


Thanks,

Marshall

*From:*Masood Krohy 
*Sent:* Friday, April 3, 2020 9:34 PM
*To:* Marshall Markham ; user 


*Subject:* Re: spark-submit exit status on k8s

While you wait for a fix on that JIRA ticket, you may be able to add 
an intermediary step in your AirFlow graph, calling Spark's REST API 
after submitting the job, digging into the actual status of the 
application, and making a success/fail decision accordingly. You can 
make repeated calls in a loop to the REST API, with a few seconds' 
delay between calls, while the execution is in progress, until the 
application fails or succeeds.


https://spark.apache.org/docs/latest/monitoring.html#rest-api 


Hope this helps.

Masood

__
Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works  

On 4/3/20 8:23 AM, Marshall Markham wrote:

Hi Team,

My team recently conducted a POC of Kubernetes/Airflow/Spark with
great success. The major concern we have about this system, after
the completion of our POC, is a behavior of spark-submit. When
called with a Kubernetes API endpoint as master, spark-submit
seems to always return exit status 0. This is obviously a major
issue preventing us from conditioning job graphs on the success
or failure of our Spark jobs. I found Jira ticket SPARK-27697
under the Apache issues covering this bug. The ticket is listed
as minor and does not seem to have had any activity recently. I
would like to upvote it and ask if there is anything I can do to
move this forward. This could be the one thing standing between
my team and our preferred batch workload implementation. Thank you.

*Marshall Markham*

Data Engineer

PrecisionLender, a Q2 Company

NOTE: This communication and any attachments are for the sole use
of the intended recipient(s) and may contain confidential and/or
privileged information. Any unauthorized review, use, disclosure
or distribution is prohibited. If you are not the intended
recipient, please contact the sender by replying to this email,
and destroy all copies of the original message.



Re: spark-submit exit status on k8s

2020-04-04 Thread Masood Krohy
I'm not in the Spark dev team, so cannot tell you why that priority was 
chosen for the JIRA issue or if anyone is about to finish the work on 
that; I'll let others jump in if they know.


Just wanted to offer a potential solution so that you can move ahead in 
the meantime.


Masood

__

Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works

On 4/4/20 7:49 AM, Marshall Markham wrote:


Thank you very much Masood for your fast response. Last question: is 
the current status in Jira representative of the status of the ticket 
within the project team? This seems like a big deal for the K8s 
implementation, and we were surprised to find it marked as priority 
low. Is there any discussion of picking up this work in the near future?


Thanks,

Marshall

*From:*Masood Krohy 
*Sent:* Friday, April 3, 2020 9:34 PM
*To:* Marshall Markham ; user 


*Subject:* Re: spark-submit exit status on k8s

While you wait for a fix on that JIRA ticket, you may be able to add 
an intermediary step in your AirFlow graph, calling Spark's REST API 
after submitting the job, digging into the actual status of the 
application, and making a success/fail decision accordingly. You can 
make repeated calls in a loop to the REST API, with a few seconds' 
delay between calls, while the execution is in progress, until the 
application fails or succeeds.


https://spark.apache.org/docs/latest/monitoring.html#rest-api 


Hope this helps.

Masood

__
Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works  

On 4/3/20 8:23 AM, Marshall Markham wrote:

Hi Team,

My team recently conducted a POC of Kubernetes/Airflow/Spark with
great success. The major concern we have about this system, after
the completion of our POC, is a behavior of spark-submit. When
called with a Kubernetes API endpoint as master, spark-submit seems
to always return exit status 0. This is obviously a major issue
preventing us from conditioning job graphs on the success or
failure of our Spark jobs. I found Jira ticket SPARK-27697 under
the Apache issues covering this bug. The ticket is listed as minor
and does not seem to have had any activity recently. I would like
to upvote it and ask if there is anything I can do to move this
forward. This could be the one thing standing between my team and
our preferred batch workload implementation. Thank you.

*Marshall Markham*

Data Engineer

PrecisionLender, a Q2 Company

NOTE: This communication and any attachments are for the sole use
of the intended recipient(s) and may contain confidential and/or
privileged information. Any unauthorized review, use, disclosure
or distribution is prohibited. If you are not the intended
recipient, please contact the sender by replying to this email,
and destroy all copies of the original message.



Re: spark-submit exit status on k8s

2020-04-03 Thread Masood Krohy
While you wait for a fix on that JIRA ticket, you may be able to add an 
intermediary step in your AirFlow graph, calling Spark's REST API after 
submitting the job, digging into the actual status of the application, 
and making a success/fail decision accordingly. You can make repeated 
calls in a loop to the REST API, with a few seconds' delay between calls, 
while the execution is in progress, until the application fails or succeeds.


https://spark.apache.org/docs/latest/monitoring.html#rest-api
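
Untested, but the polling loop could be roughly like the following; the 
base URL and application id are placeholders, and a real check would also 
confirm that the application itself has completed:

import json
import time
import urllib.request

BASE = "http://spark-ui-host:4040/api/v1"   # driver UI or history server
APP_ID = "app-20200403-0001"

def job_statuses():
    url = "{}/applications/{}/jobs".format(BASE, APP_ID)
    with urllib.request.urlopen(url) as resp:
        return [job["status"] for job in json.load(resp)]

while True:
    statuses = job_statuses()
    if any(s == "FAILED" for s in statuses):
        raise RuntimeError("Spark application failed")
    if statuses and all(s == "SUCCEEDED" for s in statuses):
        break
    time.sleep(5)   # a few seconds between calls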

Hope this helps.

Masood

__

Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works

On 4/3/20 8:23 AM, Marshall Markham wrote:


Hi Team,

My team recently conducted a POC of Kubernetes/Airflow/Spark with 
great success. The major concern we have about this system, after the 
completion of our POC, is a behavior of spark-submit. When called with 
a Kubernetes API endpoint as master, spark-submit seems to always 
return exit status 0. This is obviously a major issue preventing us 
from conditioning job graphs on the success or failure of our Spark 
jobs. I found Jira ticket SPARK-27697 under the Apache issues covering 
this bug. The ticket is listed as minor and does not seem to have had 
any activity recently. I would like to upvote it and ask if there is 
anything I can do to move this forward. This could be the one thing 
standing between my team and our preferred batch workload 
implementation. Thank you.


*Marshall Markham*

Data Engineer

PrecisionLender, a Q2 Company

NOTE: This communication and any attachments are for the sole use of 
the intended recipient(s) and may contain confidential and/or 
privileged information. Any unauthorized review, use, disclosure or 
distribution is prohibited. If you are not the intended recipient, 
please contact the sender by replying to this email, and destroy all 
copies of the original message. 


Re: [Pyspark 2.3+] Timeseries with Spark

2019-12-29 Thread Masood Krohy

Hi Rishi,

Spark and Flint are useful during the data engineering phase, but you'd 
need to look elsewhere after that. I'm not aware of any active 
Spark-native project for ML/forecasting on time series data.


If the data that you want to train the model on can fit in one node's 
memory, you can use libraries and models like ARIMA, Prophet, or 
LSTM-based NNs to train a model and use it for forecasting. You can then 
use Spark to parallelize the grid search over the space of hyperparameters 
to get the optimal model faster, as the grid search is a 
perfectly parallel job (a.k.a. embarrassingly parallel). I gave a talk 
on this which you may find useful: 
https://www.analytical.works/Talk-spark-ml.html
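
As an illustration only (nothing Spark-specific about ARIMA here; 
statsmodels on the workers and an in-memory series are assumptions):

import itertools

# Hyperparameter grid for ARIMA orders (p, d, q).
grid = list(itertools.product(range(0, 4), range(0, 2), range(0, 4)))
b_series = sc.broadcast(series)   # series: the in-memory price series

def fit_one(order):
    # Failed fits are not handled; a real run would wrap this in try/except.
    from statsmodels.tsa.arima.model import ARIMA
    result = ARIMA(b_series.value, order=order).fit()
    return order, result.aic

# One task per hyperparameter combination; keep the order with the lowest AIC.
best_order, best_aic = min(
    sc.parallelize(grid, numSlices=len(grid)).map(fit_one).collect(),
    key=lambda t: t[1],
)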


Masood

__

Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works

On 12/29/19 11:30 AM, Rishi Shah wrote:

Hi All,

Checking in to see if anyone has input on time series libraries 
for Spark. I am interested in financial forecasting models & 
regression mainly at this point. The input is a bunch of pricing data points.


I have read a lot about the spark-timeseries and Flint libraries, but I am 
not sure of the best ways/use cases to use these libraries for, or whether 
there's any other preferred way of tackling time series problems at scale.


Thanks,
-Shraddha

On Sun, Jun 16, 2019 at 9:17 AM Rishi Shah <rishishah.s...@gmail.com> wrote:


Thanks Jörn. I am interested in time series forecasting for now, but
in general I was unable to find a good way to work with different
time series methods using Spark.

On Fri, Jun 14, 2019 at 1:55 AM Jörn Franke <jornfra...@gmail.com> wrote:

Time series can mean a lot of different things and algorithms.
Can you describe more what you mean by a time series use case,
i.e. what is the input, what would you like to do with the input,
and what is the output?

> On Jun 14, 2019, at 06:01, Rishi Shah
<rishishah.s...@gmail.com> wrote:
>
> Hi All,
>
> I have a time series use case which I would like to
implement in Spark... What would be the best way to do so? Any
built in libraries?
>
> --
> Regards,
>
> Rishi Shah



-- 
Regards,


Rishi Shah



--
Regards,

Rishi Shah


RE: build models in parallel

2016-12-01 Thread Masood Krohy
You can use your groupId as a grid parameter and filter your dataset using 
this id in a pipeline stage, before feeding it to the model.
The following may help:
http://spark.apache.org/docs/latest/ml-tuning.html
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder
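
Untested, but one way that could look, with a SQLTransformer doing the 
per-group filtering (the column names below are made up):

from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder

# One pipeline; the first stage filters the data down to a single group.
filter_stage = SQLTransformer()
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="y")
pipeline = Pipeline(stages=[filter_stage, assembler, lr])

group_ids = [1, 2, 3]
param_maps = (ParamGridBuilder()
              .addGrid(filter_stage.statement,
                       ["SELECT * FROM __THIS__ WHERE group_id = {}".format(g)
                        for g in group_ids])
              .build())

# fit() with a list of param maps returns one fitted PipelineModel per map,
# i.e. one model per group; df is your full DataFrame.
models = pipeline.fit(df, params=param_maps)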

The above should work, but I haven't tried it myself. What I have tried is 
the following embarrassingly parallel architecture (as TensorFlow was a 
requirement in the use case):

See a PySpark/TensorFlow example here:
https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html

A relevant excerpt from the notebook mentioned above:
http://go.databricks.com/hubfs/notebooks/TensorFlow/Test_distributed_processing_of_images_using_TensorFlow.html
num_nodes = 4
n = max(2, int(len(all_experiments) // num_nodes))
grouped_experiments = [all_experiments[i:i+n] for i in range(0, len(all_experiments), n)]
all_exps_rdd = sc.parallelize(grouped_experiments, numSlices=len(grouped_experiments))
results = all_exps_rdd.flatMap(lambda z: [run(*y) for y in z]).collect()

Again, as above, you use your groupId as a parameter in the grid search; 
it works if your full dataset fits in the memory of a single machine. You 
can broadcast the dataset in a compressed format and do the preprocessing 
and feature engineering after you've filtered on groupId, to 
maximize the size of the dataset that can use this modeling approach.
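
A stripped-down sketch of that (scikit-learn and the column names are just 
examples; pdf is assumed to be a pandas DataFrame holding the full dataset, 
small enough to broadcast):

b_data = sc.broadcast(pdf)

def fit_group(group_id):
    # scikit-learn must be available on the workers for this to run there.
    from sklearn.linear_model import LinearRegression
    part = b_data.value[b_data.value["group_id"] == group_id]
    model = LinearRegression().fit(part[["x1", "x2"]], part["y"])
    return group_id, model.coef_.tolist(), float(model.intercept_)

group_ids = pdf["group_id"].unique().tolist()
models = (sc.parallelize(group_ids, numSlices=len(group_ids))
            .map(fit_group)
            .collect())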
Masood


--
Masood Krohy, Ph.D. 
Data Scientist, Intact Lab-R 
Intact Financial Corporation 
http://ca.linkedin.com/in/masoodkh 



From: Xiaomeng Wan <shawn...@gmail.com>
To: User <user@spark.apache.org>
Date: 2016-11-29 11:54
Subject: build models in parallel



I want to divide big data into groups (e.g. group by some id) and build one 
model for each group. I am wondering whether I can parallelize the model 
building process by implementing a UDAF (e.g. running linear regression in 
its evaluate method). Is it good practice? Does anybody have experience? Thanks!

Regards,
Shawn



Re: Cluster deploy mode driver location

2016-11-22 Thread Masood Krohy
You may also try distributing your JARs along with your Spark app; see the 
options below. You put on the client node whatever is necessary and 
submit it all in each run. There is also a --files option, which you can 
remove below, but it may be helpful for some configs.

You do not need to specify all the arguments; the default values are 
picked up when not explicitly given.

spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 2 \
--driver-memory 4g \
--executor-memory 8g \
--files /usr/hdp/current/spark-client/conf/hive-site.xml \
--jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar \
--class "SparkApp" \
/pathToAppOnTheClientNode/SparkApp.jar (if any, arguments passed to the Spark App here)

Masood


------
Masood Krohy, Ph.D. 
Data Scientist, Intact Lab-R 
Intact Financial Corporation 
http://ca.linkedin.com/in/masoodkh 



From: Silvio Fiorito <silvio.fior...@granturing.com>
To: "saif.a.ell...@wellsfargo.com" <saif.a.ell...@wellsfargo.com>, 
"user@spark.apache.org" <user@spark.apache.org>
Date: 2016-11-22 08:02
Subject: Re: Cluster deploy mode driver location



Hi Saif!

Unfortunately, I don't think this is possible for the YARN cluster deploy 
mode. Regarding the JARs you're referring to, can you place them on HDFS 
so you can then have them in a central location and refer to them that way 
for dependencies?

http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management

Thanks,
Silvio

From: saif.a.ell...@wellsfargo.com <saif.a.ell...@wellsfargo.com>
Sent: Monday, November 21, 2016 2:04:06 PM
To: user@spark.apache.org
Subject: Cluster deploy mode driver location 
 
Hello there,
 
I have a Spark program on 1.6.1; however, when I submit it to the cluster, 
it randomly picks the driver node.

I know there is a driver specification option, but along with it, it is 
mandatory to define many other options I am not familiar with. The trouble 
is, the .jars I am launching need to be available at the driver host, and 
I would like to have these jars on just one specific host, which I would 
like to be the driver.
 
Any help?
 
Thanks!
Saif
 



Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-08 Thread Masood Krohy
No, you do not scale back the predicted value. The output values (labels) 
were never scaled; only the input features were scaled.

For prediction on new samples, you scale the new sample first using the 
avg/std that you calculated for each feature when you trained your model, 
then feed it to the trained model. If it's a classification problem, then 
you're done here; a class is predicted based on the trained model. If it's 
a regression problem, the predicted value does not need scaling back; 
it is on the same scale as the original output values you used when you 
trained your model.
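
As a toy numeric illustration of that point (numbers made up):

import numpy as np

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)

X_train_scaled = (X_train - mu) / sigma   # what the model was trained on

x_new = np.array([2.5, 500.0])
x_new_scaled = (x_new - mu) / sigma       # same mu/sigma as at training time

# y_pred = model.predict(x_new_scaled) is already in the original units of y;
# no inverse transform is needed because y was never scaled.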

This is now becoming more of a data science/ML question than a Spark 
issue and is probably best kept off this list. Do some reading on the 
topic and get back to me directly; I'll respond when possible.

Hope this has helped.

Masood

--
Masood Krohy, Ph.D. 
Data Scientist, Intact Lab-R 
Intact Financial Corporation 
http://ca.linkedin.com/in/masoodkh 



From: Carlo.Allocca <carlo.allo...@open.ac.uk>
To: Masood Krohy <masood.kr...@intact.net>
Cc: Carlo.Allocca <carlo.allo...@open.ac.uk>, Mohit Jaggi 
<mohitja...@gmail.com>, "user@spark.apache.org" <user@spark.apache.org>
Date: 2016-11-08 11:02
Subject: Re: LinearRegressionWithSGD and Rank Features By Importance



Hi Masood, 

Thank you again for your suggestion. 
I have got a question about the following: 

For prediction on new samples, you need to scale each sample first before 
making predictions using your trained model. 


When applying the ML linear model as suggested above, the 
predicted value is scaled. My question: does it need to be scaled back? I 
mean, to apply the inverse of "calculate the average and std for each 
feature, deduct the avg, then divide by std" to the predicted value?
In practice, (predicted-value * std) + avg? 

Is that correct? Am I missing anything?

Many Thanks in advance. 
Best Regards,
Carlo


On 7 Nov 2016, at 17:14, carlo allocca <ca6...@open.ac.uk> wrote:

I found it by just googling: 
http://sebastianraschka.com/Articles/2014_about_feature_scaling.html 

Thanks.
Carlo
On 7 Nov 2016, at 17:12, carlo allocca <ca6...@open.ac.uk> wrote:

Hi Masood, 

Thank you very much for your insight. 
I am going to scale all my features as you described. 

As I am a beginner, is there any paper/book that explains the 
suggested approaches? I would love to read it. 

Many Thanks,
Best Regards,
Carlo

 



On 7 Nov 2016, at 16:27, Masood Krohy <masood.kr...@intact.net> wrote:

Yes, you would want to scale those features before feeding them into any 
algorithm. One typical way would be to calculate the average and std for 
each feature, deduct the avg, then divide by std. Dividing by "max - min" 
is also a good option if you're sure there is no outlier shooting up your 
max or lowering your min significantly for any feature. After you have 
scaled each feature, you can feed the data into the algo for 
training. 

For prediction on new samples, you need to scale each sample first before 
making predictions using your trained model. 

It's not too complicated to implement manually, but Spark API has some 
support for this already: 
ML: http://spark.apache.org/docs/latest/ml-features.html#standardscaler 
MLlib: 
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#standardscaler
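
For the ML (DataFrame-based) flavor, a rough sketch with made-up column 
names (train_df and new_df are assumed to exist):

from pyspark.ml.feature import StandardScaler, VectorAssembler

assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features",
                        withMean=True, withStd=True)

# Fit the scaler on the training data only, then reuse the same fitted
# scaler for any new samples (same avg/std as at training time).
train_vec = assembler.transform(train_df)
scaler_model = scaler.fit(train_vec)

train_scaled = scaler_model.transform(train_vec)
new_scaled = scaler_model.transform(assembler.transform(new_df))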
 

Masood 


--
Masood Krohy, Ph.D. 
Data Scientist, Intact Lab-R 
Intact Financial Corporation 
http://ca.linkedin.com/in/masoodkh 



From: Carlo.Allocca <carlo.allo...@open.ac.uk> 
To: Masood Krohy <masood.kr...@intact.net> 
Cc: Carlo.Allocca <carlo.allo...@open.ac.uk>, Mohit Jaggi 
<mohitja...@gmail.com>, "user@spark.apache.org" <user@spark.apache.org> 
Date: 2016-11-07 10:50 
Subject: Re: LinearRegressionWithSGD and Rank Features By Importance 




Hi Masood, 

thank you very much for the reply. It is a very good point, as I am getting 
very bad results so far. 

If I understood well, what you suggest is to scale the data below (it is 
part of my dataset) before applying linear regression with SGD. 

Is that correct? 

Many Thanks in advance. 

Best Regards, 
Carlo 

 

On 7 Nov 2016, at 15:31, Masood Krohy <masood.kr...@intact.net> wrote: 

If you go down this route (look at actual coefficients/weights), then make 
sure your features are scaled first and have more or less the same mean 
when feeding them into the algo. If not, then actual coefficients/weights 
wouldn't tell you much. In any case, SGD performs badly with unscaled 
features, so you gain if you scale the features beforehand. 
Masood 

--
Masood Krohy, Ph.D. 
Data Scientist, Intact Lab-R 
Intact Financial Corporation 
http://ca.linkedin.com/in/masoodkh 



From: Carlo.Allocca <carlo.allo...@open.ac.uk> 
To: Mohit Jaggi <moh

Re: Live data visualisations with Spark

2016-11-08 Thread Masood Krohy
+1 for Zeppelin.

See 
https://community.hortonworks.com/articles/10365/apache-zeppelin-and-sparkr.html


--
Masood Krohy, Ph.D. 
Data Scientist, Intact Lab-R 
Intact Financial Corporation 
http://ca.linkedin.com/in/masoodkh 



From: Vadim Semenov <vadim.seme...@datadoghq.com>
To: Andrew Holway <andrew.hol...@otternetworks.de>
Cc: user <user@spark.apache.org>
Date: 2016-11-08 11:17
Subject: Re: Live data visualisations with Spark



Take a look at https://zeppelin.apache.org

On Tue, Nov 8, 2016 at 11:13 AM, Andrew Holway <
andrew.hol...@otternetworks.de> wrote:
Hello,

A colleague and I are trying to work out the best way to provide live data 
visualisations based on Spark. Is it possible to explore a dataset in 
Spark from a web browser, and set up predefined functions that the user 
can click on which return datasets?

We are using a lot of R here. Is this something that could be accomplished 
with Shiny Server, for instance?

Thanks,

Andrew Holway




Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-07 Thread Masood Krohy
If you go down this route (look at actual coefficients/weights), then make 
sure your features are scaled first and have more or less the same mean 
when feeding them into the algo. If not, then actual coefficients/weights 
wouldn't tell you much. In any case, SGD performs badly with unscaled 
features, so you gain if you scale the features beforehand.
Masood

--
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R
Intact Financial Corporation
http://ca.linkedin.com/in/masoodkh



From: Carlo.Allocca <carlo.allo...@open.ac.uk>
To: Mohit Jaggi <mohitja...@gmail.com>
Cc: Carlo.Allocca <carlo.allo...@open.ac.uk>, "user@spark.apache.org" 
<user@spark.apache.org>
Date: 2016-11-04 03:39
Subject: Re: LinearRegressionWithSGD and Rank Features By Importance



Hi Mohit, 

Thank you for your reply. 
OK, it means coefficients with a high score are more important than others 
with a low score…

Many Thanks,
Best Regards,
Carlo


> On 3 Nov 2016, at 20:41, Mohit Jaggi <mohitja...@gmail.com> wrote:
> 
> For linear regression, it should be fairly easy. Just sort the 
> coefficients :)
> 
> Mohit Jaggi
> Founder,
> Data Orchard LLC
> www.dataorchardllc.com
> 
> 
> 
> 
>> On Nov 3, 2016, at 3:35 AM, Carlo.Allocca <carlo.allo...@open.ac.uk> 
wrote:
>> 
>> Hi All,
>> 
>> I am using SPARK and in particular the MLib library.
>> 
>> import org.apache.spark.mllib.regression.LabeledPoint;
>> import org.apache.spark.mllib.regression.LinearRegressionModel;
>> import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
>> 
>> For my problem I am using the LinearRegressionWithSGD and I would like 
to perform a “Rank Features By Importance”.
>> 
>> I checked the documentation and it seems that it does not provide such 
>> methods.
>> 
>> Am I missing anything?  Please, could you provide any help on this?
>> Should I change the approach?
>> 
>> Many Thanks in advance,
>> 
>> Best Regards,
>> Carlo
>> 
>> 
>> -- The Open University is incorporated by Royal Charter (RC 000391), an 
exempt charity in England & Wales and a charity registered in Scotland (SC 
038302). The Open University is authorised and regulated by the Financial 
Conduct Authority.
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org






Re: Deep learning libraries for scala

2016-11-04 Thread Masood Krohy
If you need ConvNets and RNNs and want to stay in Scala/Java, then Deep 
Learning for Java (DL4J) might be the most mature option.

If you want ConvNets and RNNs, as implemented in TensorFlow, along with 
all the bells and whistles, then you might want to switch to PySpark + 
TensorFlow and write the entire pipeline in Python. You'd do the data 
preparation/ingestion in PySpark and pass the data to TensorFlow for the 
ML part. There are 2 supported modes here:
1) Simultaneous multi-model training (a.k.a. embarrassingly parallel: each 
node has the entire data and model):
https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html

2) Data parallelism (data is distributed, each node has the entire model): 

There are some prototypes out there, and TensorSpark seems to be the most 
mature: https://github.com/adatao/tensorspark
It implements Downpour/Asynchronous SGD for the distributed training; it 
remains to be stress-tested with large datasets, however.
More info: 
https://arimo.com/machine-learning/deep-learning/2016/arimo-distributed-tensorflow-on-spark/

TensorFrames does not allow distributed training and I did not see any 
performance benchmarks last time I checked.

Alexander Ulanov of HP made a presentation on the options a few months ago:
https://www.oreilly.com/learning/distributed-deep-learning-on-spark

Masood


--
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R
Intact Financial Corporation




From: Benjamin Kim <bbuil...@gmail.com>
To: janardhan shetty <janardhan...@gmail.com>
Cc: Gourav Sengupta <gourav.sengu...@gmail.com>, user 
<user@spark.apache.org>
Date: 2016-11-01 13:14
Subject: Re: Deep learning libraries for scala



To add, I see that Databricks has been busy integrating deep learning more 
into their product and put out a new article about this.

https://databricks.com/blog/2016/10/27/gpu-acceleration-in-databricks.html

An interesting tidbit is at the bottom of the article mentioning 
TensorFrames.

https://github.com/databricks/tensorframes

Seems like an interesting direction…

Cheers,
Ben


On Oct 19, 2016, at 9:05 AM, janardhan shetty <janardhan...@gmail.com> 
wrote:

Agreed. But as it states, deeper integration with Scala is yet to be 
developed. 
Any thoughts on how to use TensorFlow with Scala? Wrappers need to be 
written, I think. 

On Oct 19, 2016 7:56 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
On that note, here is an article that Databricks made regarding using 
Tensorflow in conjunction with Spark.

https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html

Cheers,
Ben


On Oct 19, 2016, at 3:09 AM, Gourav Sengupta <gourav.sengu...@gmail.com> 
wrote:

While using deep learning you might want to stay as close to TensorFlow as 
possible. There is very little translation loss, and you get access to 
stable, scalable and tested libraries from the best brains in the industry. 
As far as Scala goes, it helps a lot to think about using the language as a 
tool to access algorithms in this instance, unless you want to start 
developing algorithms from the ground up (in which case you might not 
require any libraries at all).

On Sat, Oct 1, 2016 at 3:30 AM, janardhan shetty <janardhan...@gmail.com> 
wrote:
Hi,

Are there any good libraries which can be used for Scala deep learning 
models?
How can we integrate TensorFlow with Scala ML?







Re: Getting the IP address of Spark Driver in yarn-cluster mode

2016-10-25 Thread Masood Krohy
Thanks Steve.

Here is the Python pseudo code that got it working for me:

  import time
  import urllib2

  nodes = {'worker1_hostname': 'worker1_ip', ... }
  YARN_app_queue = 'default'
  YARN_address = 'http://YARN_IP:8088'

  # We allow 3,600 sec from start of the app up to this point
  YARN_app_startedTimeBegin = str(int(time.time() - 3600))

  requestedURL = (YARN_address +
      '/ws/v1/cluster/apps?states=RUNNING&applicationTypes=SPARK&limit=1' +
      '&queue=' + YARN_app_queue +
      '&startedTimeBegin=' + YARN_app_startedTimeBegin)
  print 'Sent request to YARN: ' + requestedURL
  response = urllib2.urlopen(requestedURL)
  html = response.read()
  amHost_start = html.find('amHostHttpAddress') + len('amHostHttpAddress":"')
  amHost_length = len('worker1_hostname')
  amHost = html[amHost_start : amHost_start + amHost_length]
  print 'amHostHttpAddress is: ' + amHost
  try:
      self.websock = ...
      print 'Connected to server running on %s' % nodes[amHost]
  except:
      print 'Could not connect to server on %s' % nodes[amHost]



------
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R
Intact Financial Corporation




From: Steve Loughran <ste...@hortonworks.com>
To: Masood Krohy <masood.kr...@intact.net>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Date: 2016-10-24 17:09
Subject: Re: Getting the IP address of Spark Driver in yarn-cluster mode




On 24 Oct 2016, at 19:34, Masood Krohy <masood.kr...@intact.net> wrote:

Hi everyone, 
Is there a way to set the IP address/hostname that the Spark Driver is 
going to be running on when launching a program through spark-submit in 
yarn-cluster mode (PySpark 1.6.0)? 
I do not see an option for this. If not, is there a way to get this IP 
address after the Spark app has started running? (through an API call at 
the beginning of the program to be used in the rest of the program). 
spark-submit outputs “ApplicationMaster host: 10.0.0.9” in the console 
(and changes on every run due to yarn cluster mode) and I am wondering if 
this can be accessed within the program. It does not seem to me that a 
YARN node label can be used to tie the Spark Driver/AM to a node, while 
allowing the Executors to run on all the nodes. 



you can grab it off the YARN API itself; there's a REST view as well as a 
fussier RPC level. That is, assuming you want the web view, which is what 
is registered. 

If you know the application ID, you can also construct a URL through the 
YARN proxy; any attempt to talk directly to the AM is going to get 302'd 
back there anyway, so any Kerberos credentials can be verified.





Getting the IP address of Spark Driver in yarn-cluster mode

2016-10-24 Thread Masood Krohy
Hi everyone,
Is there a way to set the IP address/hostname that the Spark Driver is 
going to be running on when launching a program through spark-submit in 
yarn-cluster mode (PySpark 1.6.0)?
I do not see an option for this. If not, is there a way to get this IP 
address after the Spark app has started running? (through an API call at 
the beginning of the program to be used in the rest of the program). 
spark-submit outputs “ApplicationMaster host: 10.0.0.9” in the console 
(and changes on every run due to yarn cluster mode) and I am wondering if 
this can be accessed within the program. It does not seem to me that a 
YARN node label can be used to tie the Spark Driver/AM to a node, while 
allowing the Executors to run on all the nodes.
I am running a parameter server along with the Spark Driver that needs to 
be contacted during the program execution; I need the Driver’s IP so that 
other executors can call back to this server. I need to stick to the 
yarn-cluster mode.
Thanks for any hints in advance.
Masood
PS: the closest code I was able to write is this, which is not outputting 
what I need:
print sc.statusTracker().getJobInfo(sc.statusTracker().getActiveJobsIds()[0])
# output in YARN stdout log: SparkJobInfo(jobId=4, stageIds=JavaObject id=o101, status='SUCCEEDED')

--
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R
Intact Financial Corporation