Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

2016-07-11 Thread Saliya Ekanayake
Yes, I have password-less SSH access to the job manager node.

On Mon, Jul 11, 2016 at 4:53 PM, Greg Hogan wrote:
> pdsh is only used for starting taskmanagers. How did you work around this?
> You are able to passwordless-ssh to the jobmanager?
>
> The error looks to be from config.sh:318 in rotateLogFile

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

2016-07-11 Thread Greg Hogan
pdsh is only used for starting taskmanagers. How did you work around this? You are able to passwordless-ssh to the jobmanager? The error looks to be from config.sh:318 in rotateLogFile. The way we generate the taskmanager index assumes that taskmanagers are started sequentially (flink-daemon.sh:10

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

2016-07-11 Thread Saliya Ekanayake
Looking at what happens with pdsh, there are two things that go wrong.

1. pdsh is installed on a node other than the one where the job manager would run, so invoking *start-cluster* from there does not spawn a job manager. Only if I do start-cluster from the node I specify as the job manager's node that i

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

2016-07-11 Thread Saliya Ekanayake
I meant, I'll check when current jobs are done and will let you know.

On Mon, Jul 11, 2016 at 12:19 PM, Saliya Ekanayake wrote:
> I am running some jobs now. I'll stop and restart using pdsh to see what
> was the issue again
>
> On Mon, Jul 11, 2016 at 12:15 PM, Greg Hogan wrote:
>
>> I'd defin

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

2016-07-11 Thread Saliya Ekanayake
I am running some jobs now. I'll stop and restart using pdsh to see what the issue was again.

On Mon, Jul 11, 2016 at 12:15 PM, Greg Hogan wrote:
> I'd definitely be interested to hear any insight into what failed when
> starting the taskmanagers with pdsh. Did the command fail, or fallback to
>

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

2016-07-11 Thread Greg Hogan
I'd definitely be interested to hear any insight into what failed when starting the taskmanagers with pdsh. Did the command fail, fall back to standard ssh, or hit a parse error on the slaves file? I'm wondering if we need to escape PDSH_SSH_ARGS_APPEND=$FLINK_SSH_OPTS as PDSH_SSH_ARGS_APPEND="${FL
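To illustrate the quoting question, here is a minimal sketch using a made-up option string (the value below is an assumption, not Flink's actual FLINK_SSH_OPTS, and count_args is a hypothetical helper). It shows that an unquoted expansion is split into several words when it is passed as a command argument, while the quoted form stays a single word. Whether this is what actually went wrong with pdsh is not established in the thread.

```shell
# Hypothetical value standing in for FLINK_SSH_OPTS (assumption for illustration).
FLINK_SSH_OPTS="-o StrictHostKeyChecking=no -p 22"

# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Unquoted: the shell splits the expanded value on whitespace into separate arguments.
unquoted=$(count_args PDSH_SSH_ARGS_APPEND=$FLINK_SSH_OPTS)

# Quoted: the whole value is passed as one argument.
quoted=$(count_args PDSH_SSH_ARGS_APPEND="${FLINK_SSH_OPTS}")

echo "unquoted: $unquoted, quoted: $quoted"
```

Note that a bare VAR=$OTHER assignment is not word-split by the shell, so the difference only appears when the text is handed on as an argument, for example to env or to a remote shell.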

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

2016-07-10 Thread Saliya Ekanayake
pdsh is available on the head node only, but when I tried to do *start-cluster* from the head node (note that the job manager node is not the head node) it didn't work, which is why I modified the scripts.

Yes, exactly, this is what I was trying to do. My research area has been on these NUMA related issues and binding

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

2016-07-10 Thread Greg Hogan
Hi Saliya,

Would you happen to have pdsh (parallel distributed shell) installed? If so, the TaskManager startup in start-cluster.sh will run in parallel.

As to running 24 TaskManagers together, are these running across multiple NUMA nodes? I had filed FLINK-3163 ( https://issues.apache.org/jira/br

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

2016-07-10 Thread Saliya Ekanayake
Thank you. Yes, the previous format is still supported. This change only kicks in if a number is specified after the hostname.

On Sun, Jul 10, 2016 at 5:42 PM, Gyula Fóra wrote:
> Hi,
>
> I think this would be a nice addition especially for Flink clusters
> running on big machines wh
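
As a rough illustration of the format described above (a hostname optionally followed by a number), here is a minimal sketch of how a start script could expand such a slaves file into one entry per TaskManager. The function name expand_slaves, the "host index" output, and the sample hostnames are hypothetical and are not taken from the actual modified scripts.

```shell
# Hypothetical helper: reads slaves-file lines on stdin and prints one
# "host index" pair per TaskManager to start.
expand_slaves() {
    while read -r host count rest; do
        # Skip blank lines and comment lines.
        case "$host" in ''|\#*) continue ;; esac
        # Old format (hostname only) or a non-numeric value: one TaskManager.
        case "$count" in ''|*[!0-9]*) count=1 ;; esac
        i=0
        while [ "$i" -lt "$count" ]; do
            echo "$host $i"
            i=$((i + 1))
        done
    done
}

# Example input: the second line uses the previous, hostname-only format.
printf 'node1 2\nnode2\n' | expand_slaves
```

A line without a number falls through to a count of one, which is one way to keep the previous format working as the default.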

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

2016-07-10 Thread Gyula Fóra
Hi, I think this would be a nice addition especially for Flink clusters running on big machines where you might want to run multiple task managers just to split the memory between multiple java processes. In any case the previous config format should also be supported as the default. I am curiou