Hi,
I am currently experimenting with linear regression via SGD (Spark MLlib,
version 1.2). At this point I need to fine-tune the hyper-parameters,
which I do (for now) with an exhaustive grid search over the step size
and the number of iterations. Currently I am on a dual-core machine that
acts as the master (local mode for now, but I will be adding Spark
workers later). In order to maximize throughput I need to run each
instance of the linear regression algorithm in parallel.
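For concreteness, the grid I have in mind is just the cross product of a
few candidate values (the specific numbers below are placeholders, not
tuned choices):

```python
from itertools import product

# Candidate hyper-parameter values (placeholders, not tuned choices).
step_sizes = [0.001, 0.01, 0.1, 1.0]
num_iterations = [50, 100, 200]

# Every (step size, iteration count) combination to be evaluated.
grid = list(product(step_sizes, num_iterations))
print(len(grid))  # 12 combinations
```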
According to the documentation, parallel jobs may be scheduled if they
are submitted from separate threads [1]. This brings me to my first
question: does this mean I am CPU-bound by the Spark master? In other
words, is the maximum number of concurrent jobs equal to the maximum
number of threads the OS supports?
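To illustrate what I mean by "separate threads", here is a rough Python
sketch. The `train` function is a stand-in for the actual MLlib call
(e.g. `LinearRegressionWithSGD.train(data, iterations=..., step=...)`),
not real Spark code; the point is only the thread-per-job submission
pattern:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def train(step, iterations):
    # Stand-in for the real MLlib training call. In actual Spark code,
    # each call would trigger one Spark job, and jobs submitted from
    # different driver threads can be scheduled concurrently.
    return (step, iterations, 0.0)  # pretend the last field is an error metric

grid = list(product([0.01, 0.1, 1.0], [100, 200]))

# One thread per grid point: the driver submits all jobs concurrently
# and the Spark scheduler (FIFO or fair) interleaves their stages.
with ThreadPoolExecutor(max_workers=len(grid)) as pool:
    results = list(pool.map(lambda p: train(*p), grid))
```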
I searched the mailing list but did not find anything regarding MLlib
itself. I even peeked into the new MLlib pipeline API, which has support
for parameter tuning; however, it looks like each job (each instance of
the learning algorithm) is executed in sequence. Can anyone confirm
this? This brings me to my second question: is there any example that
shows how to execute MLlib algorithms as parallel jobs?
Finally, is there any general technique I can use to execute an
algorithm in a distributed manner using Spark? More specifically, I
would like to have several MLlib algorithms running in parallel. Can
anyone show me an example of how to do this?
TIA.
Hugo F.
[1] https://spark.apache.org/docs/1.2.0/job-scheduling.html