How about http://spark.incubator.apache.org/docs/latest/mllib-guide.html ?

Regards,
Mayur
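To make the pointer above concrete for the K-Means question below, here is a minimal sketch of creating a `SparkContext` against a standalone master and calling MLlib's `KMeans.train` (Spark 0.8/0.9-era API, where points are `Array[Double]`). The master URL, `sparkHome` path, jar name, and input path are all placeholders that you would replace with your cluster's values:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans

object KMeansOnCluster {
  def main(args: Array[String]): Unit = {
    // Connect to the standalone master from outside the cluster.
    // Host, port, sparkHome and jar path below are assumptions.
    val sc = new SparkContext(
      "spark://master-host:7077",        // standalone master URL (placeholder)
      "KMeansExample",                   // app name
      "/opt/spark",                      // sparkHome on the workers (placeholder)
      Seq("target/kmeans-example.jar")   // jar containing this class (placeholder)
    )

    // One point per line, whitespace-separated numeric features.
    val data = sc.textFile("hdfs://master-host:9000/data/points.txt")
      .map(_.split(' ').map(_.toDouble))

    // Cluster into k = 2 groups, at most 20 iterations.
    val model = KMeans.train(data, 2, 20)
    model.clusterCenters.foreach(c => println(c.mkString(" ")))

    sc.stop()
  }
}
```

The jar passed in the fourth argument is shipped to the workers, so the only hard requirement is that it contains this class and is reachable from wherever you launch the driver.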
Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi


On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski <[email protected]> wrote:

> Hello,
>
> I have found that you generally need two separate pools of knowledge to be successful in this game :). One is enough knowledge of network topologies, systems, Java, Scala and whatever else to actually set up the whole system (especially if your requirements differ from running on a local machine or in the EC2 cluster supported by the scripts that come with Spark).
>
> The other is actual knowledge of the API: how it works, and how to express and solve your problems using the primitives offered by Spark.
>
> There is also a third: since you can supply any function to a Spark primitive, you generally need to know Scala or Java (or Python?) to actually solve your problem.
>
> I am not sure this list is viewed as an appropriate place to offer advice on how to actually solve these problems. Not that I would mind seeing various solutions to various problems :) and also optimizations.
>
> For example, I am trying to do rudimentary retention analysis. I am a total beginner in the whole map/reduce way of solving problems. I have come up with a solution that is pretty slow but implemented in 5 or 6 lines of code for the simplest problem. However, my files are 20 GB in size each, all JSON strings. Figuring out what the limiting factor is (my suspicion is network bandwidth, since I am accessing things via S3) is somewhat of a black magic to me at this point. I think for most of this stuff you will have to read the code. The bigger question after that is optimizing your solutions to be faster :). I would love to see practical tutorials on doing such things, and I am willing to put my attempts at solving problems out there to eventually get cannibalized, ridiculed and reimplemented properly :).
> Sorry for this long-winded email; it did not really answer your question anyway :)
>
> Ognen
>
>
> On Wed, Jan 22, 2014 at 2:35 PM, Kal El <[email protected]> wrote:
>
>> I have created a cluster setup with 2 workers (one of them is also the master).
>>
>> Can anyone help me with a tutorial on how to run K-Means, for example, on this cluster (it would be better to run it from outside the cluster command line)?
>>
>> I am mostly interested in how I initiate the SparkContext (what jars do I need to add? :
>> new SparkContext(master, appName, [sparkHome], [jars])) and what other steps I need to run.
>>
>> I am using the standalone Spark cluster.
>>
>> Thanks
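Ognen's "5 or 6 lines" retention setup could plausibly look like the sketch below: read line-per-record JSON from S3 and reduce by a user id. This is an illustration only, not his actual code; the bucket path and the `user_id` field name are assumptions, and a regex stands in for a proper JSON parser to keep the sketch dependency-free:

```scala
import org.apache.spark.SparkContext

object RetentionSketch {
  def main(args: Array[String]): Unit = {
    // Local master for illustration; a real run would point at the cluster.
    val sc = new SparkContext("local", "RetentionSketch")

    // Each line is one JSON event; the s3n:// path is a placeholder.
    val events = sc.textFile("s3n://my-bucket/events/2014-01-22.json")

    // Crude field extraction with a regex (assumed "user_id" field),
    // then count events per user with a classic map/reduceByKey.
    val userRe = """"user_id"\s*:\s*"([^"]+)"""".r
    val perUser = events
      .flatMap(line => userRe.findFirstMatchIn(line).map(_.group(1)))
      .map(u => (u, 1))
      .reduceByKey(_ + _)

    perUser.take(10).foreach(println)
    sc.stop()
  }
}
```

With 20 GB files pulled over the network, a job this shape is indeed usually I/O-bound on the S3 reads rather than on the map/reduce work itself, which is consistent with Ognen's suspicion.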
