Nice!
On Wed, Jan 22, 2014 at 2:58 PM, Mayur Rustagi <[email protected]> wrote:

> How about http://spark.incubator.apache.org/docs/latest/mllib-guide.html ?
>
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +919632149971
> http://www.sigmoidanalytics.com
> https://twitter.com/mayur_rustagi
>
>
> On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski <[email protected]> wrote:
>
>> Hello,
>>
>> I have found that you generally need two separate pools of knowledge to
>> be successful in this game :). One is enough knowledge of network
>> topologies, systems, Java, Scala and whatever else to actually set up
>> the whole system (especially if your requirements differ from running on
>> a local machine or in the EC2 cluster supported by the scripts that come
>> with Spark).
>>
>> The other is actual knowledge of the API: how it works, and how to
>> express and solve your problems using the primitives Spark offers.
>>
>> There is also a third: since you can supply any function to a Spark
>> primitive, you generally need to know Scala or Java (or Python?) to
>> actually solve your problem.
>>
>> I am not sure this list is viewed as an appropriate place to offer
>> advice on how to actually solve these problems. Not that I would mind
>> seeing various solutions to various problems :) and also optimizations.
>>
>> For example, I am trying to do rudimentary retention analysis. I am a
>> total beginner in the whole map/reduce way of solving problems. I have
>> come up with a solution that is pretty slow but implemented in 5 or 6
>> lines of code for the simplest problem. However, my files are 20 GB
>> each, all JSON strings. Figuring out the limiting factor (network
>> bandwidth is my suspicion, since I am accessing the data via S3) is
>> somewhat of a black art to me at this point. I think for most of this
>> stuff you will have to read the code. The bigger question after that is
>> optimizing your solutions to be faster :). I would love to see practical
>> tutorials on such things, and I am willing to put my attempts at solving
>> problems out there to eventually get cannibalized, ridiculed and
>> reimplemented properly :).
>>
>> Sorry for this long-winded email; it did not really answer your question
>> anyway :)
>> Ognen
>>
>>
>> On Wed, Jan 22, 2014 at 2:35 PM, Kal El <[email protected]> wrote:
>>
>>> I have created a cluster setup with 2 workers (one of them is also the
>>> master).
>>>
>>> Can anyone help me with a tutorial on how to run K-Means, for example,
>>> on this cluster? (It would be better to run it from outside the
>>> cluster's command line.)
>>>
>>> I am mostly interested in how to initiate the SparkContext (what jars
>>> do I need to add?):
>>> new SparkContext(master, appName, [sparkHome], [jars])
>>> and what other steps I need to run.
>>>
>>> I am using the standalone Spark cluster.
>>>
>>> Thanks
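For the K-Means question at the bottom of the thread, here is a minimal
sketch of a standalone driver program, assuming the current (0.8/0.9-era)
API. The master URL, sparkHome, jar name and input path below are all
placeholders; substitute your own. The jar you pass to the SparkContext
should be the one containing this compiled class, so the workers can
deserialize the closures the driver ships to them.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans

object KMeansExample {
  def main(args: Array[String]) {
    // All four constructor arguments below are placeholders.
    val sc = new SparkContext(
      "spark://master-host:7077",       // URL of the standalone master
      "KMeansExample",                  // app name shown in the web UI
      "/opt/spark",                     // sparkHome on the worker nodes
      Seq("target/kmeans-example.jar")) // jar(s) shipped to the workers

    // MLlib's K-Means in this version expects RDD[Array[Double]];
    // each input line here is assumed to be space-separated doubles.
    val data = sc.textFile("hdfs://master-host:9000/kmeans_data.txt")
    val parsed = data.map(_.split(' ').map(_.toDouble)).cache()

    val clusters = KMeans.train(parsed, 2, 20) // k = 2, 20 iterations
    println("Centers:\n" +
      clusters.clusterCenters.map(_.mkString(", ")).mkString("\n"))

    sc.stop()
  }
}

Package it (e.g. with sbt package) and run the main class from any machine
that can reach the master on port 7077; the driver does not have to live
on the cluster, which sounds like what you want.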

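And since Ognen offered to put his attempts out there, here is roughly
what a rudimentary "did day-1 users come back on day 2" retention query
might look like over line-delimited JSON logs on S3. This is a sketch,
not his actual solution: the bucket/path and the "user" and "ts" field
names are made up for illustration, and the S3 credentials are assumed
to be in your Hadoop configuration.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // implicits for pair-RDD ops (join)
import scala.util.parsing.json.JSON

object Retention {
  def main(args: Array[String]) {
    val sc = new SparkContext("spark://master-host:7077", "Retention")

    // Distinct (user, day) pairs from one-JSON-object-per-line logs.
    // "user" and "ts" are hypothetical field names.
    val userDays = sc.textFile("s3n://my-bucket/events/2014-01-0*.json")
      .flatMap { line =>
        JSON.parseFull(line) match {
          case Some(m: Map[_, _]) =>
            val e = m.asInstanceOf[Map[String, Any]]
            Some((e("user").toString, e("ts").toString.take(10)))
          case _ => None // skip unparseable lines
        }
      }
      .distinct()
      .cache()

    val day1 = userDays.filter(_._2 == "2014-01-01").map(p => (p._1, ()))
    val day2 = userDays.filter(_._2 == "2014-01-02").map(p => (p._1, ()))

    // Users seen on both days, as a fraction of day-1 users.
    val retained = day1.join(day2).count().toDouble / day1.count()
    println("Day 1 -> day 2 retention: " + retained)

    sc.stop()
  }
}

On the S3 bandwidth suspicion: for uncompressed files you can ask for more
splits with sc.textFile(path, minSplits), so more tasks read from S3 in
parallel. That is usually the first knob to try before anything fancier.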