It was suggested to me that I try running scrub on the other nodes in the cluster, since the runtime exceptions I was seeing might be related to some bad data. I am going to try that this morning and see how things go. I'm not sure how long is long enough for nodetool scrub to run on a box, though.
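Roughly what I plan to run, one node at a time. Take this as a sketch rather than the exact commands: the keyspace "pi" and table "__shardindex" are just the ones that show up in the compaction errors below, and the host is one of the nodes from the status output.

    # scrub only the table that appears in the compaction errors
    nodetool -h 10.86.123.86 scrub pi __shardindex

    # keep an eye on progress; scrub shows up in the compaction queue
    nodetool -h 10.86.123.86 compactionstats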
As for the load... Here's the spread on the current cluster:

[stan.lemon@cass-d101 ~]$ nodetool status
Note: Ownership information does not include topology; for complete information, specify a keyspace
Datacenter: DALLAS
==================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns   Host ID                               Rack
UN  10.86.123.86   276.19 GB  256     4.0%   2386b94b-fe99-4cb0-8053-321c0540db45  RAC1
UN  10.81.122.66   261.38 GB  256     4.4%   b4533802-83c3-4e57-bbea-6b63294ba377  RAC1
UN  10.81.122.64   266.85 GB  256     4.3%   391a6dfc-254a-43cf-8f25-5518e8ab6511  RAC1
UN  10.86.123.84   290.27 GB  256     4.2%   14979aeb-e0a8-4f7d-866e-0e701a4f774f  RAC1
UN  10.86.123.82   289.96 GB  256     4.5%   65df8d81-0ec1-4f67-81c1-06e86e48593a  RAC1
UN  10.86.123.80   290.81 GB  256     4.4%   c4276398-0c76-4802-b92e-e08a3a0e319f  RAC1
UN  10.84.78.120   290.74 GB  256     4.5%   fce37c3d-c142-40b5-978c-ab8e59939b2f  RAC1
UN  10.84.78.118   287.85 GB  256     4.3%   cfd64c76-fb08-4a3a-b88e-bc19c45115c6  RAC1
UN  10.86.123.78   290.96 GB  256     4.1%   32cc866f-7b5f-4310-ac4a-e0f5dd650b78  RAC1
UN  10.86.123.76   295.52 GB  256     4.1%   bb1b80ba-28bf-4a39-9623-16e326eaaf09  RAC1
UN  10.81.122.62   286.81 GB  256     4.1%   ef255fd1-beee-4dc0-80f5-9ae2271c6398  RAC1
UN  10.86.123.74   303.25 GB  256     4.3%   041d7ab7-d1bd-4a79-afb7-9c6ab1857ee9  RAC1
Datacenter: SEATTLE
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns   Host ID                               Rack
UN  10.29.82.80    297.11 GB  256     4.3%   a0e61c1e-e48f-4ccd-afa4-5069d5671382  RAC1
UN  10.29.82.156   304.74 GB  256     4.3%   d17abc57-eb47-41de-8cd5-a341a38b16de  RAC1
UN  10.29.82.158   289.63 GB  256     4.4%   f47d4019-7fd9-4620-9465-d1199311de36  RAC1
UN  10.29.82.152   285.99 GB  256     4.1%   23ee0c6f-5ac7-475a-be13-7d0536619da3  RAC1
UN  10.29.82.168   285.39 GB  256     3.8%   f5f2f55c-e316-4281-b472-f572601c7618  RAC1
UN  10.29.82.154   287.8 GB   256     4.0%   29cd9781-985a-49ed-9910-46279f50bbba  RAC1
UN  10.29.82.166   282.9 GB   256     4.1%   627b0a9e-c0d0-4a90-9cbe-22f7fbb81f9f  RAC1
UN  10.29.82.148   291.17 GB  256     4.0%   c52b467f-8960-4c4f-951a-b4232bbd25ee  RAC1
UN  10.29.82.164   269.74 GB  256     3.9%   7fba7779-c705-45bb-a0ae-26a5dff93374  RAC1
UN  10.29.82.150   281.93 GB  256     4.1%   63165266-bfda-4bd5-b339-e103546bb853  RAC1
UN  10.29.82.162   294.11 GB  256     3.9%   933a495f-4ed7-4bf9-97d7-2ce2c58f5200  RAC1
UN  10.29.82.160   261.22 GB  256     4.0%   7baaeb81-b46b-441a-bb29-914247ec3fac  RAC1

On Wed, Aug 5, 2015 at 9:54 PM, Sebastian Estevez <sebastian.este...@datastax.com> wrote:

> What's your average data per node? Is 230gb close?
>
> All the best,
>
> Sebastián Estévez
> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
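A quick way to average those Load figures straight from the output above (a rough sketch; it assumes every node reports its load in GB, as all of them do here):

    nodetool status | awk '/^UN/ {sum += $3; n++} END {printf "average load: %.2f GB across %d nodes\n", sum / n, n}'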
>
> On Wed, Aug 5, 2015 at 8:33 AM, Stan Lemon <sle...@salesforce.com> wrote:
>
>> I set the stream timeout to 1 hour this morning and started fresh trying
>> to join this node. It took about an hour to stream over 230gb of data, and
>> then into hour two I wound up back where I was yesterday: the node's load
>> is slowly shrinking, and netstats does not show it sending or receiving
>> anything. I'm not sure how long I should wait before I throw in the towel
>> on this attempt. I'm also not really sure what to try next...
>>
>> The only things in the logs currently are three entries like this:
>>
>> ERROR 07:39:44,447 Exception in thread Thread[CompactionExecutor:31,1,main]
>> java.lang.RuntimeException: Last written key DecoratedKey(8633837336094175369, 003076697369746f725f706167655f766965623936636232346331623661313935313634346638303838393465313132373700004930303030663264632d303030302d303033302d343030302d3030303030303030663264633a66376436366166382d383564352d313165342d383030302d30303030303035343764623600) >= current key DecoratedKey(-6568345298384940765, 003076697369746f725f706167655f766965623936636232346331623661313935313634346638303838393465313132373700004930303030376464652d303030302d303033302d343030302d3030303030303030376464653a64633930336533382d643766342d313165342d383030302d30303030303730626338386300) writing into /var/lib/cassandra/data/pi/__shardindex/pi-__shardindex-tmp-jb-644-Data.db
>>   at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:143)
>>   at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:166)
>>   at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:170)
>>   at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>>   at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
>>   at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
>>   at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:198)
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>   at java.lang.Thread.run(Thread.java:745)
>>
>> ANY help is greatly appreciated.
>>
>> Thanks,
>> Stan
>>
>> On Tue, Aug 4, 2015 at 2:23 PM, Sebastian Estevez <sebastian.este...@datastax.com> wrote:
>>
>>> That's the one. I set it to an hour to be safe (if a stream goes above
>>> the timeout it will get restarted), but it can probably be lower.
>>>
>>> All the best,
>>>
>>> Sebastián Estévez
>>> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
>>>
>>> On Tue, Aug 4, 2015 at 2:21 PM, Stan Lemon <sle...@salesforce.com> wrote:
>>>
>>>> Sebastian,
>>>> You're referring to streaming_socket_timeout_in_ms, correct? What value
>>>> do you recommend? All of my nodes are currently at the default of 0.
>>>>
>>>> Thanks,
>>>> Stan
>>>>
>>>> On Tue, Aug 4, 2015 at 2:16 PM, Sebastian Estevez <sebastian.este...@datastax.com> wrote:
>>>>
>>>>> It helps to set the stream socket timeout in the yaml so that you don't
>>>>> hang forever on a lost or broken stream.
>>>>>
>>>>> All the best,
>>>>>
>>>>> Sebastián Estévez
>>>>> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
>>>>>
>>>>> On Tue, Aug 4, 2015 at 2:14 PM, Robert Coli <rc...@eventbrite.com> wrote:
>>>>>
>>>>>> On Tue, Aug 4, 2015 at 11:02 AM, Stan Lemon <sle...@salesforce.com> wrote:
>>>>>>
>>>>>>> I am attempting to add a 13th node in one of the datacenters. I have
>>>>>>> been monitoring this process from the node itself with nodetool netstats
>>>>>>> and from one of the existing nodes using nodetool status.
>>>>>>>
>>>>>>> On the existing node I see the new node as UJ. I have watched the
>>>>>>> load steadily climb up to about 203.4gb, and then over the last two hours
>>>>>>> it has fluctuated a bit and slowly dropped back to about 203.1gb.
>>>>>>
>>>>>> It's probably hung. If I were you I'd probably wipe the node and
>>>>>> re-bootstrap.
>>>>>>
>>>>>> (What version of Cassandra / what network are you on (AWS?) / etc.)
>>>>>>
>>>>>> =Rob
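Spelling out the two suggestions above as commands, for the record. This is a rough sketch: the cassandra.yaml path assumes a package install, and the wipe paths are the stock Cassandra defaults, so adjust for your layout.

    # 1) Sebastian's suggestion: set the stream socket timeout to 1 hour (3600000 ms)
    #    instead of the default 0 (never time out), then restart so it takes effect
    sudo sed -i 's/^streaming_socket_timeout_in_ms:.*/streaming_socket_timeout_in_ms: 3600000/' /etc/cassandra/conf/cassandra.yaml
    sudo service cassandra restart

    # 2) Rob's suggestion: wipe the stuck joining node and bootstrap it again from scratch
    #    (only on the new node that never finished joining, never on a node that already owns data)
    sudo service cassandra stop
    sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
    sudo service cassandra start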