Re: Running K-Means on a cluster setup

Kal El Wed, 22 Jan 2014 07:06:24 -0800

@Mayur: I do not see any tutorial there on how to run MLlib on a cluster, just 
a basic presentation unrelated to actually running the algorithm.


@Ognen: Thanks, I have figured that out :)) That's why I need some tutorials.



On Wednesday, January 22, 2014 4:59 PM, Mayur Rustagi <[email protected]> 
wrote:
 
How about http://spark.incubator.apache.org/docs/latest/mllib-guide.html ?
Regards
Mayur


Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi



On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski <[email protected]> 
wrote:

>Hello,
>
>I have found that you generally need two separate pools of knowledge to be 
>successful in this game :). One is enough knowledge of network 
>topologies, systems, Java, Scala and whatever else to actually set up the 
>whole system (esp. if your requirements differ from running on a local 
>machine or in the EC2 cluster supported by the scripts that come with Spark).
>
>The other is actual knowledge of the API: how it works and how to express 
>and solve your problems using the primitives offered by Spark.
>
>There is also a third: since you can supply any function to a Spark primitive, 
>you generally need to know Scala or Java (or Python?) to actually solve your 
>problem.
>
>I am not sure this list is viewed as an appropriate place to offer advice on 
>how to actually solve these problems. Not that I would mind seeing various 
>solutions to various problems :) and also optimizations.
>
>For example, I am trying to do rudimentary retention analysis. I am a total 
>beginner in the whole map/reduce way of solving problems. I have come up with 
>a solution for the simplest problem that is pretty slow but takes only 5 or 6 
>lines of code. However, my files are 20 GB each, all JSON strings. 
>Figuring out what the limiting factor is (my suspicion is network bandwidth, 
>since I am accessing things via S3) is somewhat of black magic to me at this 
>point. I think for most of this stuff you will have to read the 
>code. The bigger question after that is optimizing your solutions to be faster 
>:). I would love to see practical tutorials on doing such things and I am 
>willing to put my attempts at solving problems out there to eventually get 
>cannibalized, ridiculed and reimplemented properly :).
>
>Sorry for this long-winded email, it did not really answer your question 
>anyway :)
>
>Ognen
>
>
>
>
>On Wed, Jan 22, 2014 at 2:35 PM, Kal El <[email protected]> wrote:
>
>>I have created a cluster setup with 2 workers (one of them is also the master).
>>
>>
>>Can anyone point me to a tutorial on how to run, for example, K-Means on this 
>>cluster (it would be better to run it from a command line outside the cluster)?
>>
>>
>>I am mostly interested in how to initialize the SparkContext (which jars do I 
>>need to add?):
>>new SparkContext(master, appName, [sparkHome], [jars])
>>and what other steps I need to run.
>>
>>
>>I am using the standalone spark cluster.
>>
>>
>>Thanks
>>
>>
>>
>>
>
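
For the SparkContext question in the original message, a minimal Scala sketch
of a standalone-cluster K-Means driver might look like the following. It is
written against the pre-1.0 MLlib API current at the time of this thread, where
KMeans.train takes an RDD[Array[Double]] (later releases take MLlib Vector
instead). The master URL, Spark home, jar path and input path are all
hypothetical placeholders, and the input is assumed to be one
whitespace-separated point per line. The jars argument would typically list
your own application jar plus any dependencies not already available on the
workers.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans

object KMeansOnCluster {
  // Parse one whitespace-separated line of numbers into a feature array.
  def parsePoint(line: String): Array[Double] =
    line.trim.split("\\s+").map(_.toDouble)

  def main(args: Array[String]): Unit = {
    // Hypothetical values, replace with your own setup.
    val master    = "spark://master-host:7077"   // standalone master URL
    val sparkHome = "/opt/spark"                 // Spark install dir on the workers
    val appJar    = "target/kmeans-app.jar"      // jar containing this class
    val input     = "hdfs://master-host:9000/data/points.txt"

    // The jars listed here are shipped to the workers.
    val sc = new SparkContext(master, "KMeansOnCluster", sparkHome, Seq(appJar))

    // Cache the points because K-Means iterates over them repeatedly.
    val points = sc.textFile(input).map(parsePoint).cache()

    // k = 2 clusters, at most 20 iterations (arbitrary illustrative values).
    val model = KMeans.train(points, 2, 20)

    model.clusterCenters.foreach(c => println(c.mkString(" ")))
    sc.stop()
  }
}
```

Because the master URL is passed explicitly, the driver can in principle run on
a machine outside the cluster, as long as that machine and the workers can
reach each other over the network and the input path is readable from the
workers.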
