Hi,
I'm having problems with a Storm cluster: nimbus rebalances the topology
too often because it thinks that some workers are down.
My setup runs apache-storm 0.9.3:
2 storm nodes running storm-supervisor (storm-1 and storm-2)
1 server running storm-nimbus and storm-ui
a cluster of 3 servers with zookeeper and kafka.
Looking at the logs, I don't see any obvious errors. However, I do see
entries like:
2015-03-29T01:13:26.138+0000 b.s.m.n.Client [INFO] connection established to a remote host Netty-Client-localhost/127.0.0.1:6700, [id: 0xc64a1aa8, /127.0.0.1:41471 => localhost/127.0.0.1:6700]
2015-03-29T01:13:26.139+0000 b.s.m.n.Client [INFO] Closing Netty Client Netty-Client-localhost/127.0.0.1:6700
2015-03-29T01:13:26.139+0000 b.s.m.n.Client [INFO] Waiting for pending batchs to be sent with Netty-Client-localhost/127.0.0.1:6700..., timeout: 600000ms, pendings: 0
These entries only appear in the worker logs when nimbus starts rebalancing
the topology over and over again (every 4-5 minutes).
If I kill the topology, restart storm-supervisor and deploy the topology
again, everything works again for a while (it may be hours or even days).
My first guess was a problem with the server hostname, because it resolves
to 127.0.0.1. That is how our infrastructure works, and I would rather not
change it if possible. I then found that I could set storm.local.hostname
in storm.yaml to use a different hostname that always points to our private
IP instead of localhost. I had the feeling that this change gave us what we
needed; however, 4 days later we started having problems again.
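For reference, the override looks roughly like the fragment below on each
supervisor node. This is a sketch rather than a copy of our file: the
hostname value is assumed from the names nimbus reports in the assignment
later in this message, and each node would use its own name.

```yaml
# storm.yaml on storm-1 (illustrative sketch, not the exact file)
# Assumption: the value is the name that resolves to the node's private IP,
# taken here from the node->host map in the nimbus log.
storm.local.hostname: "storm-1.wdc.sl.serverdensity.net"
```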
This is the nimbus output when it rebalances the topology:
2015-04-01T16:42:56.859+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[4 4] not alive
2015-04-01T16:42:56.860+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[36 36] not alive
2015-04-01T16:42:56.860+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[68 68] not alive
2015-04-01T16:42:56.860+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[8 8] not alive
2015-04-01T16:42:56.860+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[40 40] not alive
2015-04-01T16:42:56.860+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[12 12] not alive
2015-04-01T16:42:56.860+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[44 44] not alive
2015-04-01T16:42:56.860+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[16 16] not alive
2015-04-01T16:42:56.860+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[48 48] not alive
2015-04-01T16:42:56.860+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[20 20] not alive
2015-04-01T16:42:56.860+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[52 52] not alive
2015-04-01T16:42:56.861+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[24 24] not alive
2015-04-01T16:42:56.861+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[56 56] not alive
2015-04-01T16:42:56.861+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[28 28] not alive
2015-04-01T16:42:56.861+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[60 60] not alive
2015-04-01T16:42:56.861+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[32 32] not alive
2015-04-01T16:42:56.861+0000 b.s.d.nimbus [INFO] Executor payload-processing-13-1427853690:[64 64] not alive
2015-04-01T16:42:56.869+0000 b.s.s.EvenScheduler [INFO] Available slots:
(["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6700]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6703]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6700])
2015-04-01T16:42:56.870+0000 b.s.d.nimbus [INFO] Reassigning payload-processing-13-1427853690 to 4 slots
2015-04-01T16:42:56.870+0000 b.s.d.nimbus [INFO] Reassign executors: [[4 4] [36 36] [68 68] [8 8] [40 40] [12 12] [44 44] [16 16] [48 48] [20 20] [52 52] [24 24] [56 56] [28 28] [60 60] [32 32] [64 64]]
2015-04-01T16:42:56.876+0000 b.s.d.nimbus [INFO] Setting new assignment for
topology id payload-processing-13-1427853690:
#backtype.storm.daemon.common.Assignment{:master-code-dir
"/usr/lib/storm/storm-local/nimbus/stormdist/payload-processing-13-1427853690",
:node->host {"e4495dd1-2d2e-468d-b0a5-18bb55be79d5" "storm-1.wdc.sl.serverdensity.net",
"90d0157f-aecb-46aa-a824-99e62a33f9b3" "storm-2.wdc.sl.serverdensity.net"},
:executor->node+port {[2 2]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [34 34]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [66 66]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [3 3]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [35 35]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [67 67]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [4 4]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [36 36]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [68 68]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [5 5]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [37 37]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [69 69]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [6 6]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [38 38]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [70 70]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [7 7]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [39 39]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [8 8]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [40 40]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [9 9]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [41 41]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [10 10]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [42 42]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [11 11]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [43 43]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [12 12]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [44 44]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [13 13]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [45 45]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [14 14]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [46 46]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [15 15]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [47 47]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [16 16]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [48 48]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [17 17]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [49 49]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [18 18]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [50 50]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [19 19]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [51 51]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [20 20]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [52 52]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [21 21]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [53 53]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [22 22]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [54 54]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [23 23]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [55 55]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [24 24]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [56 56]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [25 25]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [57 57]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [26 26]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [58 58]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [27 27]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [59 59]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [28 28]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [60 60]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [29 29]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [61 61]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [30 30]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [62 62]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6703], [31 31]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [63 63]
["e4495dd1-2d2e-468d-b0a5-18bb55be79d5" 6702], [32 32]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [64 64]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6701], [1 1]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [33 33]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702], [65 65]
["90d0157f-aecb-46aa-a824-99e62a33f9b3" 6702]}, :executor->start-time-secs
{[2 2] 1427906144, [34 34] 1427906144, [66 66] 1427906144, [3 3]
1427906375, [35 35] 1427906375, [67 67] 1427906375, [4 4] 1427906576, [36
36] 1427906576, [68 68] 1427906576, [5 5] 1427906375, [37 37] 1427906375,
[69 69] 1427906375, [6 6] 1427906144, [38 38] 1427906144, [70 70]
1427906144, [7 7] 1427906375, [39 39] 1427906375, [8 8] 1427906576, [40 40]
1427906576, [9 9] 1427906375, [41 41] 1427906375, [10 10] 1427906144, [42
42] 1427906144, [11 11] 1427906375, [43 43] 1427906375, [12 12] 1427906576,
[44 44] 1427906576, [13 13] 1427906375, [45 45] 1427906375, [14 14]
1427906144, [46 46] 1427906144, [15 15] 1427906375, [47 47] 1427906375, [16
16] 1427906576, [48 48] 1427906576, [17 17] 1427906375, [49 49] 1427906375,
[18 18] 1427906144, [50 50] 1427906144, [19 19] 1427906375, [51 51]
1427906375, [20 20] 1427906576, [52 52] 1427906576, [21 21] 1427906375, [53
53] 1427906375, [22 22] 1427906144, [54 54] 1427906144, [23 23] 1427906375,
[55 55] 1427906375, [24 24] 1427906576, [56 56] 1427906576, [25 25]
1427906375, [57 57] 1427906375, [26 26] 1427906144, [58 58] 1427906144, [27
27] 1427906375, [59 59] 1427906375, [28 28] 1427906576, [60 60] 1427906576,
[29 29] 1427906375, [61 61] 1427906375, [30 30] 1427906144, [62 62]
1427906144, [31 31] 1427906375, [63 63] 1427906375, [32 32] 1427906576, [64
64] 1427906576, [1 1] 1427906375, [33 33] 1427906375, [65 65] 1427906375}}
Is there anything I'm missing?
Thanks in advance.
--
Carlos Perelló Marín
https://www.serverdensity.com