Thanks a lot Sean, Daniel, Matei and Jerry. I really appreciate your replies, and I also understand that I should be a little more patient: when I myself am not able to reply within the next 5 hours, how can I expect a question to be answered in that time?
And yes, the idea of using a separate clustering library sounds correct. Since I am using Python, I will be using scikit-learn instead of Weka. A minimal sketch of what I have in mind follows below the quoted thread.

Thanks,

On Tue, Jul 15, 2014 at 12:51 AM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi there,
>
> I think the question is interesting; a spark of sparks = spark
> I wonder if you can use the spark job server (
> https://github.com/ooyala/spark-jobserver)?
>
> So in the spark task that requires a new spark context, instead of
> creating it in the task, contact the job server to create one and use the
> data in the task as the data source either via hdfs/tachyon/s3. Wait until
> the sub-task is done then continue. Since the job server has the notion of
> job id, you might use it as a reference to the sub-task.
>
> I don't know if this is a good idea or bad one. Maybe this is an
> anti-pattern of spark, but maybe not.
>
> HTH,
>
> Jerry
>
>
> On Mon, Jul 14, 2014 at 3:09 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
>> You currently can't use SparkContext inside a Spark task, so in this case
>> you'd have to call some kind of local K-means library. One example you can
>> try to use is Weka (http://www.cs.waikato.ac.nz/ml/weka/). You can then
>> load your text files as an RDD of strings with SparkContext.wholeTextFiles
>> and call Weka on each one.
>>
>> Matei
>>
>> On Jul 14, 2014, at 11:30 AM, Rahul Bhojwani <rahulbhojwani2...@gmail.com>
>> wrote:
>>
>> I understand that the question is very unprofessional, but I am a newbie.
>> If you could share some link where I can ask such questions, if not here.
>>
>> But please answer.
>>
>>
>> On Mon, Jul 14, 2014 at 6:52 PM, Rahul Bhojwani <
>> rahulbhojwani2...@gmail.com> wrote:
>>
>>> Hey, my question is for this situation:
>>> Suppose we have 100000 files, each containing a list of features in each
>>> row.
>>>
>>> The task is: for each file, cluster the features in that file and write
>>> the corresponding cluster along with them to a new file. So we have to
>>> generate 100000 more files by applying clustering to each file
>>> individually.
>>>
>>> Can I do it this way: get an RDD of the list of files and apply map.
>>> Inside the mapper function, which will be handling each file, get another
>>> Spark context and use MLlib k-means to get the clustered output file.
>>>
>>> Please suggest the appropriate method to tackle this problem.
>>>
>>> Thanks,
>>> Rahul Kumar Bhojwani
>>> 3rd year, B.Tech
>>> Computer Science Engineering
>>> National Institute Of Technology, Karnataka
>>> 9945197359
>>
>>
>>
>> --
>> Rahul K Bhojwani
>> 3rd Year B.Tech
>> Computer Science and Engineering
>> National Institute of Technology, Karnataka
>

--
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka
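P.S. Here is roughly the per-file clustering I have in mind, following Matei's wholeTextFiles suggestion but with scikit-learn instead of Weka. This is only a minimal sketch, not tested code: it assumes scikit-learn and NumPy are installed on every worker, that each row is a whitespace-separated numeric feature vector, and the number of clusters (3) and the HDFS paths are just placeholders.

import numpy as np
from sklearn.cluster import KMeans
from pyspark import SparkContext

sc = SparkContext(appName="per-file-kmeans")

def cluster_one_file(path_and_text):
    path, text = path_and_text
    # Parse each non-empty row of this file as a whitespace-separated feature vector.
    features = np.array([[float(x) for x in line.split()]
                         for line in text.splitlines() if line.strip()])
    # Cluster this single file's features locally with scikit-learn
    # (no nested SparkContext needed inside the task).
    labels = KMeans(n_clusters=3).fit_predict(features)
    # Pair every row with the cluster it was assigned to.
    return path, [(row.tolist(), int(label)) for row, label in zip(features, labels)]

# wholeTextFiles yields one (path, contents) pair per file, so each file is
# clustered independently inside an ordinary map over the RDD.
clustered = sc.wholeTextFiles("hdfs:///input/feature_files").map(cluster_one_file)
clustered.saveAsTextFile("hdfs:///output/clustered")

The only operational requirement I can see is that scikit-learn and NumPy must be available on every worker node, since the clustering runs inside the map function on the executors rather than on the driver.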