Hi,

I am currently experimenting with linear regression via SGD in Spark MLlib
1.2. At this point I need to tune the hyper-parameters, which I do (for now)
with an exhaustive grid search over the step size and the number of
iterations. I am running on a dual-core machine that acts as the master
(local mode for now, but I will be adding Spark workers later). In order to
maximize throughput I need to run each instance of the linear regression
algorithm in parallel.

According to the documentation, parallel jobs may be scheduled if they are
submitted from separate threads [1]. This brings me to my first question:
does this mean I am CPU-bound by the Spark master? In other words, is the
maximum number of concurrent jobs limited to the maximum number of threads
the OS can run?
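To make that concrete, this is the kind of thread-based submission I have in
mind. It is only a Python sketch: `train_model` here is a stand-in for the
real `LinearRegressionWithSGD.train(data, numIterations, stepSize)` call,
which would need a live SparkContext.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for LinearRegressionWithSGD.train(data, numIterations, stepSize).
# In a real run, calling this from a thread submits a separate Spark job.
def train_model(step_size, num_iterations):
    return {"stepSize": step_size, "numIterations": num_iterations}

step_sizes = [0.001, 0.01, 0.1, 1.0]
iteration_counts = [10, 50, 100]

# Each submitted task would become its own Spark job; with the default FIFO
# scheduler the jobs still compete for the same cores on the master.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(train_model, s, n)
               for s in step_sizes
               for n in iteration_counts]
    results = [f.result() for f in futures]

print(len(results))  # one trained model per grid point
```

My worry is exactly whether the `max_workers` here is ultimately bounded by
the threads available on the master.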

I searched the mailing list but did not find anything regarding MLlib
itself. I even peeked into the new MLlib API that uses pipelines and has
support for parameter tuning. However, it looks like each job (instance of
the learning algorithm) is executed in sequence. Can anyone confirm this?
This brings me to my second question: is there any example that shows how
one can execute MLlib algorithms as parallel jobs?

Finally, is there any general technique I can use to execute an algorithm in
a distributed manner using Spark? More specifically, I would like to have
several MLlib algorithms run in parallel. Can anyone show me an example of
how to do this?
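To show the shape of what I am after, here is a second Python sketch. The
two train functions are stand-ins: in the real thing each would be a
different MLlib call, e.g. `LinearRegressionWithSGD.train` and
`RidgeRegressionWithSGD.train`, run against the same input RDD.

```python
import threading

# Stand-ins for two different MLlib training calls; each invocation would
# launch its own Spark job when called from a separate thread.
def train_linear():
    return "linear-model"

def train_ridge():
    return "ridge-model"

results = {}
lock = threading.Lock()

def run(name, trainer):
    model = trainer()          # would block on the Spark job here
    with lock:                 # guard the shared results dict
        results[name] = model

threads = [threading.Thread(target=run, args=("linear", train_linear)),
           threading.Thread(target=run, args=("ridge", train_ridge))]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # ['linear', 'ridge']
```

Is this thread-per-algorithm pattern the recommended way, or is there a
better mechanism I am missing?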

TIA.
Hugo F.

[1] https://spark.apache.org/docs/1.2.0/job-scheduling.html
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parallel-parameter-tuning-distributed-execution-of-MLlib-algorithms-tp23031.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
