@Mayur: I do not see any tutorial about how to run MLlib on a cluster, just a basic presentation unrelated to actually running the algorithm.
@Ognen: Thanks, I have figured that out :)) That's why I need some tutorials.

On Wednesday, January 22, 2014 4:59 PM, Mayur Rustagi <[email protected]> wrote:

How about http://spark.incubator.apache.org/docs/latest/mllib-guide.html ?

Regards
Mayur

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi

On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski <[email protected]> wrote:

Hello,
>
>I have found that you generally need two separate pools of knowledge to be
>successful in this game :). One is enough knowledge of network topologies,
>systems, Java, Scala and whatever else to actually set up the whole system
>(especially if your requirements differ from running on a local machine or in
>the EC2 cluster supported by the scripts that come with Spark).
>
>The other is actual knowledge of the API: how it works and how to express and
>solve your problems using the primitives offered by Spark.
>
>There is also a third: since you can supply any function to a Spark primitive,
>you generally need to know Scala or Java (or Python?) to actually solve your
>problem.
>
>I am not sure this list is viewed as an appropriate place to offer advice on
>how to actually solve these problems. Not that I would mind seeing various
>solutions to various problems :) and also optimizations.
>
>For example, I am trying to do rudimentary retention analysis. I am a total
>beginner in the whole map/reduce way of solving problems. I have come up with
>a solution that is pretty slow but implemented in 5 or 6 lines of code for the
>simplest problem. However, my files are 20 GB in size each, all JSON strings.
>Figuring out what the limiting factor is (my suspicion is network bandwidth,
>since I am accessing things via S3) is somewhat of a black magic to me at this
>point. I think for most of this stuff you will have to read the code. The
>bigger question after that is optimizing your solutions to be faster :).
>I would love to see practical tutorials on doing such things, and I am willing
>to put my attempts at solving problems out there to eventually get
>cannibalized, ridiculed and reimplemented properly :).
>
>Sorry for this long-winded email, it did not really answer your question
>anyway :)
>
>Ognen
>
>On Wed, Jan 22, 2014 at 2:35 PM, Kal El <[email protected]> wrote:
>
>>I have created a cluster setup with 2 workers (one of them is also the
>>master).
>>
>>Can anyone help me with a tutorial on how to run K-Means, for example, on this
>>cluster (it would be better to run it from the command line outside the
>>cluster)?
>>
>>I am mostly interested in how to initialize the SparkContext (which jars do I
>>need to add? : new SparkContext(master, appName, [sparkHome], [jars])) and
>>what other steps I need to run.
>>
>>I am using the standalone Spark cluster.
>>
>>Thanks
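
For the original question (creating the SparkContext against a standalone master and running K-Means), here is a minimal sketch in Python rather than Scala. It is not taken from this thread or from the MLlib guide. It assumes a Spark release that ships the Python MLlib API, and input consisting of text lines of whitespace-separated numbers. The names `parse_point`, `run_kmeans` and the app name `KMeansSketch` are made up for illustration. In PySpark the `pyFiles` argument plays roughly the role that the `jars` argument plays in the Scala constructor.

```python
def parse_point(line):
    """Turn one line of whitespace-separated numbers into a list of floats."""
    return [float(x) for x in line.split()]


def run_kmeans(master, input_path, k, max_iterations=20,
               spark_home=None, py_files=None):
    """Connect to `master`, cluster the points in `input_path`, return centers.

    Assumption: `input_path` is readable from every worker (for example an
    HDFS or S3 path, or a file present at the same local path on all nodes).
    """
    # Imported here so that parse_point can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    # master is the standalone master URL, e.g. "spark://<host>:7077".
    # py_files lists extra .py/.zip files to ship to the workers; if
    # parse_point lives in its own module, that module would go here.
    sc = SparkContext(master, "KMeansSketch",
                      sparkHome=spark_home, pyFiles=py_files)
    try:
        # Cache the parsed points because K-Means iterates over them.
        points = sc.textFile(input_path).map(parse_point).cache()
        model = KMeans.train(points, k, maxIterations=max_iterations)
        return [list(center) for center in model.clusterCenters]
    finally:
        # Release the cluster resources even if training fails.
        sc.stop()
```

From a machine outside the cluster that can reach the master, a call would look something like `run_kmeans("spark://master-host:7077", "hdfs:///some/points.txt", k=3)`, where both the host name and the path are placeholders. The driver machine needs a Spark installation of the same version as the cluster, and the workers must be able to connect back to it.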
