So the JobManager was running on host1. This also explains why I didn't see the problem until I had asked for a sizeable degree of parallelism, since at lower values it probably never assigned a task to host3.
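For anyone who hits the same symptom, here is a rough sketch of the kind of check that would have flagged host3 (it was missing an entry in /etc/hosts, so it registered as localhost). It asks the OS what the machine's own hostname resolves to and reports whether that is a loopback address. This is not Flink's own address-selection logic, just a standalone sanity check; the function names `is_loopback` and `check_local_hostname` are made up for illustration.

```python
import ipaddress
import socket
from typing import Tuple


def is_loopback(ip: str) -> bool:
    """Return True if the given IPv4/IPv6 address string is a loopback address."""
    return ipaddress.ip_address(ip).is_loopback


def check_local_hostname() -> Tuple[str, str, bool]:
    """Resolve this machine's own hostname and say whether it maps to loopback.

    Assumption: a hostname that resolves to loopback is the symptom to look
    for, because other machines cannot reach this one at that address.
    """
    hostname = socket.gethostname()
    ip = socket.gethostbyname(hostname)  # raises socket.gaierror if unresolvable
    return hostname, ip, is_loopback(ip)


if __name__ == "__main__":
    try:
        name, addr, loopback = check_local_hostname()
    except socket.gaierror as exc:
        print(f"hostname does not resolve at all: {exc}")
    else:
        if loopback:
            print(f"WARNING: {name} resolves to loopback address {addr}")
        else:
            print(f"OK: {name} resolves to {addr}")
```

Running it on each machine listed in the slaves file should show quickly whether any of them would announce an address the others cannot connect to.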
Thanks for your help

On Thu, Jun 25, 2015 at 3:34 AM, Stephan Ewen <se...@apache.org> wrote:

> Nice!
>
> TaskManagers need to announce where they listen for connections.
>
> We do not yet block "localhost" as an acceptable address, to not prohibit
> local test setups.
>
> There are some routines that try to select an interface that can
> communicate with the outside world.
>
> Is host3 running on the same machine as the JobManager? Or did you
> experience a long delay until TaskManager 3 was registered?
>
> Thanks for helping us debug this,
> Stephan
>
> On Wed, Jun 24, 2015 at 11:58 PM, Aaron Jackson <ajack...@pobox.com>
> wrote:
>
>> That was it. host3 was showing localhost - looked a little further and
>> it was missing an entry in /etc/hosts.
>>
>> Thanks for looking into this.
>>
>> Aaron
>>
>> On Wed, Jun 24, 2015 at 2:13 PM, Stephan Ewen <se...@apache.org> wrote:
>>
>>> Aaron,
>>>
>>> Can you check how the TaskManagers register at the JobManager? When you
>>> look at the 'TaskManagers' section in the JobManager's web Interface (at
>>> port 8081), what does it say as the TaskManager host names?
>>>
>>> Does it list "host1", "host2", "host3"...?
>>>
>>> Thanks,
>>> Stephan
>>>
>>> On 24.06.2015 at 20:31, "Ufuk Celebi" <u...@apache.org> wrote:
>>>
>>>> On 24 Jun 2015, at 16:22, Aaron Jackson <ajack...@pobox.com> wrote:
>>>>
>>>> > Thanks. My setup is actually 3 task managers x 4 slots. I played
>>>> > with the parallelism and found that at low values, the error did not
>>>> > occur. I can only conclude that there is some form of data shuffling
>>>> > that is occurring that is sensitive to the data source. Yes, seems a
>>>> > little odd to me as well. OOC, did you load the file into HDFS or use
>>>> > it from a local file system (e.g. file:///tmp/data.csv) - my results
>>>> > have shown that so far, HDFS does not appear to be sensitive to this
>>>> > issue.
>>>> >
>>>> > I updated the example to include my configuration and slaves, but for
>>>> > brevity, I'll include the configurable bits here:
>>>> >
>>>> > jobmanager.rpc.address: host01
>>>> > jobmanager.rpc.port: 6123
>>>> > jobmanager.heap.mb: 512
>>>> > taskmanager.heap.mb: 2048
>>>> > taskmanager.numberOfTaskSlots: 4
>>>> > parallelization.degree.default: 1
>>>> > jobmanager.web.port: 8081
>>>> > webclient.port: 8080
>>>> > taskmanager.network.numberOfBuffers: 8192
>>>> > taskmanager.tmp.dirs: /datassd/flink/tmp
>>>> >
>>>> > And the slaves ...
>>>> >
>>>> > host01
>>>> > host02
>>>> > host03
>>>> >
>>>> > I did notice an extra empty line at the end of the slaves. And while
>>>> > I highly doubt it makes ANY difference, I'm still going to re-run with
>>>> > it removed.
>>>> >
>>>> > Thanks for looking into it.
>>>>
>>>> Thank you for being so helpful. I've tried it with the local filesystem.
>>>>
>>>> On 23 Jun 2015, at 07:11, Aaron Jackson <ajack...@pobox.com> wrote:
>>>>
>>>> > I have 12 task managers across 3 machines - so it's a small setup.
>>>>
>>>> Sorry for my misunderstanding. I've tried it with both 12 task managers
>>>> and 3 as well now. What's odd is that the stack trace shows that it is
>>>> trying to connect to "localhost" for the remote channel although localhost
>>>> is not configured anywhere. Let me think about that. ;)
>>>>
>>>> – Ufuk