Re: high GC in the Kmeans algorithm

2015-02-20 Thread Xiangrui Meng
A single vector of size 10^7 won't hit that bound. How many clusters
did you set? The broadcast variable size is 10^7 * k and you can
calculate the amount of memory it needs. Try to reduce the number of
tasks and see whether it helps. -Xiangrui
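
A rough sketch of the calculation suggested above, assuming the cluster centers are held as dense arrays of 8-byte doubles (the thread does not state the storage format). The function names and the values of k are hypothetical and chosen only for illustration.

```python
BYTES_PER_DOUBLE = 8  # assumption: dense double-precision storage


def broadcast_centers_bytes(dim: int, k: int) -> int:
    """Approximate bytes needed to hold k dense centers of dimension dim."""
    return dim * k * BYTES_PER_DOUBLE


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    dim = 10**7
    for k in (1, 10, 100):  # hypothetical cluster counts
        size = broadcast_centers_bytes(dim, k)
        print(f"k={k:>3}: {size} bytes (~{to_gib(size):.2f} GiB)")
```

Under these assumptions a single center of dimension 10^7 takes 80,000,000 bytes, well below 2GB, but the total grows linearly with k and every task that deserializes or updates a copy of the centers adds to the heap pressure.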

On Tue, Feb 17, 2015 at 7:20 PM, lihu  wrote:
> Thanks for your answer. Yes, I cached the data; I can see from the
> web UI that all of the data is cached in memory.
>
> What worries me is the dimension, not the total size.
>
> Sean Owen once told me that a broadcast supports a maximum array size
> of 2GB, so is 10^7 a little too large?
>
> On Wed, Feb 18, 2015 at 5:43 AM, Xiangrui Meng  wrote:
>>
>> Did you cache the data? Was it fully cached? The k-means
>> implementation doesn't create many temporary objects. I guess you need
>> more RAM to avoid triggering GC so frequently. Please monitor the memory
>> usage using YourKit or VisualVM. -Xiangrui
>>
>> On Wed, Feb 11, 2015 at 1:35 AM, lihu  wrote:
>> > I just want to make the best use of the CPU and test the performance
>> > of Spark when there are many tasks on a single node.
>> >
>> > On Wed, Feb 11, 2015 at 5:29 PM, Sean Owen  wrote:
>> >>
>> >> Good, worth double-checking that's what you got. That's barely 1GB per
>> >> task though. Why run 48 if you have 24 cores?
>> >>
>> >> On Wed, Feb 11, 2015 at 9:03 AM, lihu  wrote:
>> >> > I give 50GB to the executor, so it seems there is no reason for the
>> >> > memory to be insufficient.
>> >> >
>> >> > On Wed, Feb 11, 2015 at 4:50 PM, Sean Owen wrote:
>> >> >>
>> >> >> Meaning, you have 128GB per machine, but how much memory are you
>> >> >> giving the executors?
>> >> >>
>> >> >> On Wed, Feb 11, 2015 at 8:49 AM, lihu  wrote:
>> >> >> > What do you mean? Yes, I can see in the web UI that some data
>> >> >> > is held in memory.
>> >> >> >
>> >> >> > On Wed, Feb 11, 2015 at 4:25 PM, Sean Owen wrote:
>> >> >> >>
>> >> >> >> Are you actually using that memory for executors?
>> >> >> >>
>> >> >> >> On Wed, Feb 11, 2015 at 8:17 AM, lihu  wrote:
>> >> >> >> > Hi,
>> >> >> >> > I run k-means (MLlib) on a cluster with 12 workers. Every
>> >> >> >> > worker has 128GB of RAM and 24 cores. I run 48 tasks on one
>> >> >> >> > machine. The total data is just 40GB.
>> >> >> >> >
>> >> >> >> > When the dimension of the data set is about 10^7, the duration
>> >> >> >> > of every task is about 30s, but the GC cost is about 20s.
>> >> >> >> >
>> >> >> >> > When I reduce the dimension to 10^4, the GC time is small.
>> >> >> >> >
>> >> >> >> > So why is GC so high when the dimension is larger? Or is this
>> >> >> >> > caused by MLlib?
>> >> >> >
>> >> >> > --
>> >> >> > Best Wishes!
>> >> >> >
>> >> >> > Li Hu(李浒) | Graduate Student
>> >> >> > Institute for Interdisciplinary Information Sciences(IIIS)
>> >> >> > Tsinghua University, China
>> >> >> >
>> >> >> > Email: lihu...@gmail.com
>> >> >> > Homepage: http://iiis.tsinghua.edu.cn/zh/lihu/

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: high GC in the Kmeans algorithm

2015-02-17 Thread lihu
Thanks for your answer. Yes, I cached the data; I can see from the
web UI that all of the data is cached in memory.

What worries me is the dimension, not the total size.

Sean Owen once told me that a broadcast supports a maximum array size
of 2GB, so is 10^7 a little too large?



Re: high GC in the Kmeans algorithm

2015-02-17 Thread Xiangrui Meng
Did you cache the data? Was it fully cached? The k-means
implementation doesn't create many temporary objects. I guess you need
more RAM to avoid triggering GC so frequently. Please monitor the memory
usage using YourKit or VisualVM. -Xiangrui




Re: high GC in the Kmeans algorithm

2015-02-11 Thread lihu
I just want to make the best use of the CPU and test the performance of
Spark when there are many tasks on a single node.






Re: high GC in the Kmeans algorithm

2015-02-11 Thread Sean Owen
Good, worth double-checking that's what you got. That's barely 1GB per
task though. Why run 48 if you have 24 cores?
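
The "barely 1GB per task" figure follows from dividing the executor memory by the number of concurrent tasks. A minimal sketch of that arithmetic, using the 50GB and 48-task values from the thread; the function name is hypothetical, and the division ignores any memory set aside for cached data, so the real working space per task would be smaller.

```python
def memory_per_task_gb(executor_memory_gb: float, concurrent_tasks: int) -> float:
    """Naive even split of executor memory across concurrently running tasks."""
    if concurrent_tasks <= 0:
        raise ValueError("concurrent_tasks must be positive")
    return executor_memory_gb / concurrent_tasks


if __name__ == "__main__":
    executor_memory_gb = 50  # value reported in the thread
    for tasks in (48, 24):   # 48 as configured; 24 matches the core count
        share = memory_per_task_gb(executor_memory_gb, tasks)
        print(f"{tasks} tasks: ~{share:.2f} GB per task")
```

Running one task per core (24 instead of 48) roughly doubles the share available to each task under this simple model.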
