Thanks a lot Sean, Daniel, Matei and Jerry. I really appreciate your replies.
I also understand that I should be a little more patient; when I myself am
not able to reply within the next 5 hours, how can I expect my question to
be answered in that time?

And yes, the idea of using a separate clustering library sounds right.
Since I am using Python, I will be using scikit-learn instead of Weka.
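
Roughly what I have in mind for the per-file step (just a sketch; I am
assuming each row of a file is a comma-separated feature vector, and k=3 is
only a placeholder). I would then map this function over sc.wholeTextFiles
as Matei suggested:

import numpy as np
from sklearn.cluster import KMeans

def cluster_file_contents(text, k=3):
    # Parse one file's contents: one comma-separated feature vector per row.
    rows = [line.strip() for line in text.splitlines() if line.strip()]
    features = np.array([[float(x) for x in row.split(",")] for row in rows])
    # Cluster locally with scikit-learn and tag each row with its cluster id.
    labels = KMeans(n_clusters=k).fit_predict(features)
    return ["%s,%d" % (row, label) for row, label in zip(rows, labels)]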

Thanks,


On Tue, Jul 15, 2014 at 12:51 AM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi there,
>
> I think the question is interesting; a spark of sparks = spark
> I wonder if you can use the spark job server (
> https://github.com/ooyala/spark-jobserver)?
>
> So in the Spark task that requires a new Spark context, instead of
> creating it in the task, contact the job server to create one, and pass
> the task's data to it as the data source via HDFS/Tachyon/S3. Wait until
> the sub-task is done, then continue. Since the job server has the notion
> of a job id, you might use it as a reference to the sub-task.
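>
> Something like this, maybe (an untested sketch; the port, endpoint and
> response fields are from my reading of the job server README, and the app
> name, class path and config key are placeholders for whatever job you
> package and upload yourself):
>
> import time
> import requests
>
> JOBSERVER = "http://localhost:8090"  # default job server port, per its README
>
> def run_clustering_subtask(input_path):
>     # Submit the sub-task asynchronously; appName/classPath are placeholders.
>     resp = requests.post(JOBSERVER + "/jobs",
>                          params={"appName": "kmeans-app",
>                                  "classPath": "example.KMeansJob"},
>                          data='input.path = "%s"' % input_path)
>     job_id = resp.json()["result"]["jobId"]  # response shape assumed from the README
>     # Poll the job id until the sub-task is done, then continue.
>     while requests.get(JOBSERVER + "/jobs/" + job_id).json().get("status") == "RUNNING":
>         time.sleep(1)
>     return job_id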
>
> I don't know if this is a good idea or bad one. Maybe this is an
> anti-pattern of spark, but maybe not.
>
> HTH,
>
> Jerry
>
>
>
> On Mon, Jul 14, 2014 at 3:09 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
>> You currently can't use SparkContext inside a Spark task, so in this case
>> you'd have to call some kind of local K-means library. One example you can
>> try to use is Weka (http://www.cs.waikato.ac.nz/ml/weka/). You can then
>> load your text files as an RDD of strings with SparkContext.wholeTextFiles
>> and call Weka on each one.
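>>
>> Roughly, the structure would look like this (a PySpark sketch;
>> cluster_one_file and write_one_output stand in for whatever local
>> clustering call and output code you end up using):
>>
>> # Each element is a (filename, file contents) pair; the clustering runs
>> # locally inside the map function on the workers.
>> files = sc.wholeTextFiles("hdfs:///path/to/input")  # placeholder path
>> clustered = files.map(lambda kv: (kv[0], cluster_one_file(kv[1])))
>> clustered.foreach(write_one_output)  # e.g. one output file per input file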
>>
>> Matei
>>
>> On Jul 14, 2014, at 11:30 AM, Rahul Bhojwani <rahulbhojwani2...@gmail.com>
>> wrote:
>>
>> I understand that the question is very unprofessional, but I am a newbie.
>> If this is not the right place, could you share a link to somewhere I can
>> ask such questions?
>>
>> But please answer.
>>
>>
>> On Mon, Jul 14, 2014 at 6:52 PM, Rahul Bhojwani <
>> rahulbhojwani2...@gmail.com> wrote:
>>
>>> Hey, my question is about this situation:
>>> Suppose we have 100000 files, each containing a list of features in each
>>> row.
>>>
>>> The task is, for each file, to cluster the features in that file and to
>>> write each row along with its cluster to a new file. So we have to
>>> generate 100000 more files by applying clustering to each file
>>> individually.
>>>
>>> Can I do it this way: get an RDD of the list of files and apply a map,
>>> and inside the mapper function that handles each file, get another Spark
>>> context and use MLlib k-means to produce the clustered output file?
>>>
>>> Please suggest the appropriate method to tackle this problem.
>>>
>>> Thanks,
>>> Rahul Kumar Bhojwani
>>> 3rd year, B.Tech
>>> Computer Science Engineering
>>> National Institute Of Technology, Karnataka
>>> 9945197359
>>>
>>
>>
>>
>> --
>> Rahul K Bhojwani
>> 3rd Year B.Tech
>> Computer Science and Engineering
>> National Institute of Technology, Karnataka
>>
>>
>>
>


-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka
