Sorry, I meant the master branch of https://github.com/apache/spark. -Xiangrui

On Mon, Mar 24, 2014 at 6:27 PM, Tsai Li Ming <mailingl...@ltsai.com> wrote:
> Thanks again.
>
>> If you use the KMeans implementation from MLlib, the
>> initialization stage is done on master,
>
> The "master" here is the app/driver/spark-shell?
>
> Thanks!
>
> On 25 Mar, 2014, at 1:03 am, Xiangrui Meng <men...@gmail.com> wrote:
>
>> The number of rows doesn't matter much as long as you have enough
>> workers to distribute the work. K-means has complexity O(n * d * k),
>> where n is the number of points, d is the dimension, and k is the
>> number of clusters. If you use the KMeans implementation from MLlib, the
>> initialization stage is done on master, so a large k would slow down
>> the initialization stage. If your data is sparse, the latest change to
>> KMeans will also help with speed, depending on how sparse it is.
>> -Xiangrui
>>
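
(For reference, a minimal sketch of the MLlib call being suggested here, assuming the 0.9.x API in which each point is a plain Array[Double] and sc is the spark-shell context; the input path and k = 1000 are placeholders, not values from this thread:)

    import org.apache.spark.mllib.clustering.KMeans

    // Parse whitespace-separated numeric features, one point per line (placeholder path).
    val points = sc.textFile("/data/kmeans_data.txt")
      .map(line => line.split(' ').map(_.toDouble))
      .cache()

    // Train with a modest k, 10 iterations, a single run, and k-means|| initialization.
    val model = KMeans.train(points, 1000, 10, 1, KMeans.K_MEANS_PARALLEL)

    // Within-cluster sum of squared distances, as a quick sanity check.
    val cost = model.computeCost(points)
    println("Clustering cost: " + cost)
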
>> On Mon, Mar 24, 2014 at 12:44 AM, Tsai Li Ming <mailingl...@ltsai.com> wrote:
>>> Thanks! Let me try with a smaller K.
>>>
>>> Does the size of the input data matter for the example? Currently I have
>>> 50M rows. What is a reasonable size to demonstrate the capability of Spark?
>>>
>>>
>>>
>>>
>>>
>>> On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng <men...@gmail.com> wrote:
>>>
>>>> K = 500000 is certainly a large number for k-means. If there is no
>>>> particular reason to have 500000 clusters, could you try to reduce it
>>>> to, e.g., 100 or 1000? Also, the example code is not meant for large-scale
>>>> problems. You should use the KMeans algorithm in mllib.clustering for
>>>> your problem.
>>>>
>>>> -Xiangrui
>>>>
>>>> On Sun, Mar 23, 2014 at 11:53 PM, Tsai Li Ming <mailingl...@ltsai.com> 
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> This is on a 4 nodes cluster each with 32 cores/256GB Ram.
>>>>>
>>>>> Spark (0.9.0) is deployed in standalone mode.
>>>>>
>>>>> Each worker is configured with 192GB. Spark executor memory is also 192GB.
>>>>>
>>>>> This is on the first iteration. K=500000. Here's the code I use:
>>>>> http://pastebin.com/2yXL3y8i , which is a copy-and-paste of the example.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
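
(As an aside, a sketch of how settings like these might be expressed with SparkConf, available since 0.9.0, when not running in the spark-shell; the master URL, app name, and memory value are placeholders, and in standalone mode the per-worker limit itself is typically set via SPARK_WORKER_MEMORY in spark-env.sh:)

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder master URL and app name; the executor memory mirrors the
    // setup described above.
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")
      .setAppName("KMeansAtScale")
      .set("spark.executor.memory", "192g")
    val sc = new SparkContext(conf)
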
>>>>> On 24 Mar, 2014, at 2:46 pm, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>
>>>>>> Hi Tsai,
>>>>>>
>>>>>> Could you share more information about the machine you used and the
>>>>>> training parameters (runs, k, and iterations)? That will help in
>>>>>> diagnosing your issue. Thanks!
>>>>>>
>>>>>> Best,
>>>>>> Xiangrui
>>>>>>
>>>>>> On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming <mailingl...@ltsai.com> 
>>>>>> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> At the reduceByKey stage, it takes a few minutes before the tasks
>>>>>>> start working.
>>>>>>>
>>>>>>> I have set -Dspark.default.parallelism=127, i.e. the total number of cores minus one (n-1).
>>>>>>>
>>>>>>> CPU/Network/IO is idling across all nodes when this is happening.
>>>>>>>
>>>>>>> And there is nothing in particular in the master log file. From the
>>>>>>> spark-shell:
>>>>>>>
>>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:124 as TID 538 
>>>>>>> on executor 2: XXX (PROCESS_LOCAL)
>>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:124 as 
>>>>>>> 38765155 bytes in 193 ms
>>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:125 as TID 539 
>>>>>>> on executor 1: XXX (PROCESS_LOCAL)
>>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:125 as 
>>>>>>> 38765155 bytes in 96 ms
>>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Starting task 3.0:126 as TID 540 
>>>>>>> on executor 0: XXX (PROCESS_LOCAL)
>>>>>>> 14/03/23 18:13:50 INFO TaskSetManager: Serialized task 3.0:126 as 
>>>>>>> 38765155 bytes in 100 ms
>>>>>>>
>>>>>>> But it sits there for a significant amount of time before anything moves.
>>>>>>>
>>>>>>> In the stage detail of the UI, I can see that there are 127 tasks
>>>>>>> running, but each one takes at least a few minutes.
>>>>>>>
>>>>>>> I'm working off local storage (not HDFS), and the k-means data is about
>>>>>>> 6.5GB (50M rows).
>>>>>>>
>>>>>>> Is this normal behaviour?
>>>>>>>
>>>>>>> Thanks!
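
(For what it's worth, a sketch of two knobs that are sometimes relevant to this kind of pause, assuming sc is the spark-shell context; the path and partition count are placeholders, not taken from the thread:)

    // Give the input an explicit number of partitions instead of relying
    // solely on spark.default.parallelism (128 is a placeholder).
    val points = sc.textFile("/data/kmeans_data.txt", 128)
      .map(line => line.split(' ').map(_.toDouble))
      .cache()

    // Materialize the cache once up front, so the per-iteration stages
    // (e.g. the reduceByKey) are not also paying the cost of reading and
    // parsing the 6.5GB input on the first pass.
    points.count()
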
>>>>>
>>>
>
