Can you please share all available logs?
On Fri, Jul 8, 2016 at 5:57 AM, Saliya Ekanayake <esal...@gmail.com> wrote:
> Hi,
>
> I've been trying to run the provided KMeans example on a 16 node cluster. I
> was testing with 2 Task Managers (TM) per node because each node has 2
> sockets (CPUs). A socket contains 12 cores, so I've set the number of slots
> per TM to 12. The total parallelism is 384 (12 slots x 2 TMs x 16 nodes).
>
> However, Flink TMs keep failing from time to time, causing KMeans to fail.
> The only explanation I could find in the logs is that TMs unregister from
> the Job Manager. I've increased the Akka timeout to 1000s as well.
>
> Any suggestions on this?
>
> The data sizes I tried were 10k points, 250k points, and 1mil points. The
> number of centers ranged from 100 to 1000. None of these sizes completed.
>
> Thank you,
> Saliya
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
>