Hey Xi,

Have you tried Spark 1.3.0? The initialization happens on the driver node,
and we fixed an issue with that initialization in 1.3.0. Again, please start
with a smaller k and increase it gradually. Let us know at what k the
problem happens.
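
For example, something like this (a minimal sketch, assuming data is your
cached RDD[Vector]; the k values and iteration count are only illustrative):

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  // Train with increasing k and log the wall-clock time of each run,
  // so we can see where the long pause starts.
  def probeK(data: RDD[Vector]): Unit = {
    for (k <- Seq(100, 500, 1000, 2000, 5000)) {
      val start = System.currentTimeMillis()
      KMeans.train(data, k, 20)   // 20 max iterations
      println(s"k=$k finished in ${System.currentTimeMillis() - start} ms")
    }
  }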

Best,
Xiangrui

On Sat, Mar 28, 2015 at 3:11 AM, Xi Shen <davidshe...@gmail.com> wrote:

> My vector dimension is 360 or so. The data count is about 270k. My
> driver has 2.9G of memory. I attached a screenshot of the current executor
> status. I submitted this job with "--master yarn-cluster". I have a total of
> 7 worker nodes, and one of them acts as the driver. In the screenshot, you
> can see that all worker nodes have loaded some data, but the driver has not
> loaded any data.
>
> But the funny thing is, when I logged on to the driver and checked its CPU
> & memory status, I saw one Java process using about 18% of the CPU and about
> 1.6 GB of memory.
>
> [image: Inline image 1]
>
>
> On Sat, Mar 28, 2015 at 7:06 PM Reza Zadeh <r...@databricks.com> wrote:
>
>> How many dimensions does your data have? The size of the k-means model is
>> k * d, where d is the dimension of the data.
>>
>> Since you're using k=1000, if your data has a dimension higher than, say,
>> 10,000, you will have trouble, because k*d doubles have to fit in the
>> driver's memory.
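>>
>> As rough arithmetic (assuming double precision, 8 bytes per value): with
>> k = 1000 and d = 360 the centers take about 1000 * 360 * 8 bytes, roughly
>> 2.9 MB, which is fine; with d = 10,000 they take about 80 MB, which has to
>> be held on the driver and sent to the executors on every iteration.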
>>
>> Reza
>>
>> On Sat, Mar 28, 2015 at 12:27 AM, Xi Shen <davidshe...@gmail.com> wrote:
>>
>>> I have put more detail on my problem at
>>> http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed
>>>
>>> I would really appreciate it if you could help me take a look at this
>>> problem. I have tried various settings and ways to load/partition my data,
>>> but I just cannot get rid of that long pause.
>>>
>>>
>>> Thanks,
>>> David
>>>
>>> Xi Shen
>>> about.me/davidshen
>>>
>>> On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen <davidshe...@gmail.com> wrote:
>>>
>>>> Yes, I have tried repartitioning.
>>>>
>>>> I tried to repartition to the number of cores in my cluster. Not
>>>> helping...
>>>> I tried to repartition to the number of centroids (k value). Not
>>>> helping...
>>>>
>>>>
>>>> On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley <jos...@databricks.com>
>>>> wrote:
>>>>
>>>>> Can you try specifying the number of partitions when you load the data
>>>>> to equal the number of executors?  If your ETL changes the number of
>>>>> partitions, you can also repartition before calling KMeans.
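>>>>>
>>>>> Roughly like this (just a sketch; the path, the parse function, and the
>>>>> partition count below are placeholders for whatever your ETL actually
>>>>> does):
>>>>>
>>>>>   import org.apache.spark.mllib.clustering.KMeans
>>>>>
>>>>>   val numPartitions = 4 * 8     // e.g. number of executors * cores each
>>>>>   val raw = sc.textFile("hdfs://.../input", numPartitions)  // minPartitions hint
>>>>>   val vectors = raw.map(parseToDenseVector)   // placeholder for your parsing
>>>>>     .repartition(numPartitions)               // force an even spread
>>>>>     .cache()
>>>>>   vectors.count()                             // materialize the cache first
>>>>>   val model = KMeans.train(vectors, 1000, 20)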
>>>>>
>>>>>
>>>>> On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen <davidshe...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a large data set, and I expect to get 5000 clusters.
>>>>>>
>>>>>> I load the raw data and convert it into DenseVectors; then I repartition
>>>>>> and cache; finally I give the RDD[Vector] to KMeans.train().
>>>>>>
>>>>>> Now the job is running and the data is loaded. But according to the
>>>>>> Spark UI, all the data is loaded onto one executor. I checked that
>>>>>> executor, and its CPU load is very low; I think it is using only 1 of
>>>>>> its 8 cores. All the other 3 executors are idle.
>>>>>>
>>>>>> Did I miss something? Is it possible to distribute the workload to
>>>>>> all 4 executors?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>>
>>>>>
>>>
>>
