Re: Data growth vs Cluster Size planning

2019-02-12 Thread Phillip Henry
Too little information to give an answer, if indeed an answer a priori is possible. However, I would do the following on your test instances:
- Run jstat -gc on all your nodes. It might be that the GC is taking a lot of time.
- Poll with jstack semi-frequently. I can give you a fairly good idea
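
To make the jstat suggestion concrete, here is a minimal Python sketch that reads the text printed by `jstat -gc <pid>` and estimates how much wall-clock time went to GC between two samples. The function names, the SAMPLE text, and the idea of comparing two samples are illustrative assumptions, not something from the thread; the column names (YGC, FGC, GCT) are the ones jstat prints in its header row.

```python
# Hypothetical helper names; only the jstat column names come from the tool itself.

SAMPLE = """\
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
1024.0 1024.0  0.0   512.0  8192.0   4096.0   20480.0    10240.0  4864.0 2500.0 512.0  300.0      10    0.250   2      1.500    1.750
"""


def parse_jstat_gc(output: str) -> dict:
    """Return GC counts and total GC seconds from `jstat -gc` output.

    Uses the header row to locate columns, so extra columns added by
    newer JDKs do not shift the values we read.
    """
    lines = [ln for ln in output.splitlines() if ln.strip()]
    header = lines[0].split()
    values = lines[-1].split()  # last row = most recent sample
    row = dict(zip(header, values))
    return {
        "young_gc_count": int(float(row["YGC"])),
        "full_gc_count": int(float(row["FGC"])),
        "total_gc_seconds": float(row["GCT"]),
    }


def gc_time_fraction(earlier: dict, later: dict, interval_seconds: float) -> float:
    """Fraction of the interval between two samples that was spent in GC."""
    spent = later["total_gc_seconds"] - earlier["total_gc_seconds"]
    return spent / interval_seconds


if __name__ == "__main__":
    print(parse_jstat_gc(SAMPLE))
```

A fraction that stays high across samples on a node would support the suspicion that GC is taking a lot of time; what counts as "high" depends on the job and is not defined here.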

Data growth vs Cluster Size planning

2019-02-11 Thread Aakash Basu
Hi, I ran a dataset of *200 columns and 0.2M records* on a cluster of *1 master 18 GB, 2 slaves 32 GB each, 16 cores/slave*, which took around *772 minutes* for a *very large ML tuning based job* (training). Now, my requirement is to run the *same operation on 3M records*. Any idea on how we should
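
As a rough baseline for the numbers above, the following Python sketch extrapolates the measured 772 minutes from 0.2M to 3M records on the same cluster. The function name and the scaling exponent are assumptions for illustration only: an exponent of 1.0 means runtime grows linearly with record count, which may well not hold for an ML tuning job, so the result is a starting point for measurement rather than a prediction.

```python
def estimate_minutes(measured_minutes: float,
                     measured_records: int,
                     target_records: int,
                     exponent: float = 1.0) -> float:
    """Extrapolate runtime on an unchanged cluster.

    exponent is a hypothetical scaling factor:
    1.0 = linear in record count, >1.0 = worse than linear.
    """
    ratio = target_records / measured_records
    return measured_minutes * ratio ** exponent


if __name__ == "__main__":
    minutes = estimate_minutes(772, 200_000, 3_000_000)
    # 3M / 0.2M = 15x the data, so the linear baseline is 772 * 15 minutes.
    print(f"{minutes:.0f} minutes (~{minutes / 60:.0f} hours)")
```

Running the job on a few intermediate sizes and fitting the exponent from those timings would give a better-founded figure than assuming 1.0.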