Re: endless full gc on one node

2016-01-17 Thread Kai Wang
DuyHai,

In this case I didn't use batches; I just bound a single PreparedStatement and
executed it. Nor did I see any warning/error in the log about a batch being too
large.

Thanks.
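
For reference, the distinction being discussed here is between executing each
INSERT on its own (binding and executing a prepared statement per row) and
grouping many INSERTs into a single CQL batch. A minimal sketch, with a
hypothetical keyspace/table, of what the two shapes look like in CQL:

  -- executed one at a time, each bound and executed as its own prepared statement
  INSERT INTO myks.mytable (id, value) VALUES (?, ?);

  -- the batch form asked about in the quoted reply below; a large unlogged
  -- batch can put noticeable pressure on the coordinator's heap
  BEGIN UNLOGGED BATCH
    INSERT INTO myks.mytable (id, value) VALUES (1, 'a');
    INSERT INTO myks.mytable (id, value) VALUES (2, 'b');
  APPLY BATCH;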

On Sat, Jan 16, 2016 at 6:27 PM, DuyHai Doan  wrote:

> "As soon as inserting started, one node started non-stop full GC. The
> other two nodes were totally fine"
>
> Just a guess: how did you insert the data? Did you use batch statements?
>
> On Sat, Jan 16, 2016 at 10:12 PM, Kai Wang  wrote:
>
>> Hi,
>>
>> Recently I saw some strange behavior on one of the nodes of a 3-node
>> cluster. A while ago I created a table and put some data (about 150M) in it
>> for testing. A few days ago I started to import full data into that table
>> using normal cql INSERT statements. As soon as inserting started, one node
>> started non-stop full GC. The other two nodes were totally fine. I stopped
>> the inserting process, restarted C* on all the nodes. All nodes are fine.
>> But once I started inserting again, full GC kicked in on that node within a
>> minute. The insertion speed is moderate. Again, the other two nodes were
>> fine. I tried this process a couple of times. Every time the same node
>> jumped into full GC. I even rebooted all the boxes. I checked system.log
>> but found no errors or warnings before full GC started.
>>
>> Finally I deleted and recreated the table. All of a sudden the problem went
>> away. The only thing I can think of is that the table was created using STCS.
>> After I inserted 150M data into it, I switched it to LCS. Then I ran
>> incremental repair a couple of times. I saw validation and normal
>> compaction on that table as expected. When I recreated the table, I created
>> it with LCS.
>>
>> I don't have the problem any more but just want to share the experience.
>> Maybe someone has a theory on this? BTW I am running C* 2.2.4 with CentOS
>> 7 and Java 8. All boxes have the identical configurations.
>>
>> Thanks.
>>
>
>
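
For reference, the compaction-strategy change described above (switching an
existing STCS table to LCS, and later recreating the table directly with LCS)
would normally be expressed as CQL along these lines; keyspace, table and
column names here are placeholders:

  ALTER TABLE myks.mytable
    WITH compaction = {'class': 'LeveledCompactionStrategy'};

  CREATE TABLE myks.mytable (
    id int PRIMARY KEY,
    value text
  ) WITH compaction = {'class': 'LeveledCompactionStrategy'};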


Re: In UJ status for over a week trying to rejoin cluster in Cassandra 3.0.1

2016-01-17 Thread Kai Wang
Carlos,

So you essentially replaced the .33 node. Did you follow this:
https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html?
The link is for 2.x; I'm not sure about 3.x. What if you change the new node's
address to .34?
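
The procedure in that document essentially amounts to starting the rebuilt node
with the replace-address flag instead of letting it bootstrap as a brand-new
member. A rough sketch for 2.x (newer versions also have a
replace_address_first_boot variant, if I recall correctly); the IP is the one
from this thread:

  # in cassandra-env.sh on the rebuilt box, before starting Cassandra
  JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=192.168.1.33"

  # after startup, watch the streaming/join progress
  nodetool netstats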



On Mon, Jan 11, 2016 at 12:57 AM, Carlos A  wrote:

> Hello all,
>
> I have a small dev environment with 4 machines. I removed one of them (.33)
> from the cluster because I wanted to upgrade its HD to an SSD. I then
> reinstalled it and tried to rejoin. It has been in UJ status for a week now
> with no changes.
>
> I have tried nodetool repair etc., but nothing changed.
>
> nodetool status output
>
> Datacenter: DC1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address   Load   Tokens   OwnsHost ID
>   Rack
> UN  192.168.1.30  16.13 MB   256  ?
> 0e524b1c-b254-45d0-98ee-63b8f34a8531  RAC1
> UN  192.168.1.31  20.12 MB   256  ?
> 1f8000f5-026c-42c7-8189-cf19fbede566  RAC1
> UN  192.168.1.32  17.73 MB   256  ?
> 7b06f9e9-7c41-4364-ab18-f6976fd359e4  RAC1
> UJ  192.168.1.33  877.6 KB   256  ?
> 7a1507b5-198e-4a3a-a9fd-7af9e588fde2  RAC1
>
> Note: Non-system keyspaces don't have the same replication settings,
> effective ownership information is meaningless
>
> Any tips on fixing this?
>
> Thanks,
>
> C.
>


Re: In UJ status for over a week trying to rejoin cluster in Cassandra 3.0.1

2016-01-17 Thread daemeon reiydelle
What do the logs say on the seed node (and on the UJ node)?

Look for timeout messages.

This problem has occurred for me when there was high network utilization
between the seed and the joining node, and also when there were routing issues.
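
A hedged sketch of what that check might look like (the log path assumes a
package install; adjust to your layout):

  # on the seed and on the joining node
  grep -iE 'timeout|timed out|stream' /var/log/cassandra/system.log

  # on the joining node, confirm whether any streams are actually in flight
  nodetool netstats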



"Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside in
a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming
'Wow! What a Ride!'" - Hunter Thompson

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Sun, Jan 17, 2016 at 2:24 PM, Kai Wang  wrote:

> Carlos,
>
> So you essentially replaced the .33 node. Did you follow this:
> https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html?
> The link is for 2.x; I'm not sure about 3.x. What if you change the new node's
> address to .34?
>
>
>
> On Mon, Jan 11, 2016 at 12:57 AM, Carlos A  wrote:
>
>> Hello all,
>>
>> I have a small dev environment with 4 machines. I removed one of them (.33)
>> from the cluster because I wanted to upgrade its HD to an SSD. I then
>> reinstalled it and tried to rejoin. It has been in UJ status for a week now
>> with no changes.
>>
>> I have tried nodetool repair etc., but nothing changed.
>>
>> nodetool status output
>>
>> Datacenter: DC1
>> ===
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address   Load   Tokens   OwnsHost ID
>>   Rack
>> UN  192.168.1.30  16.13 MB   256  ?
>> 0e524b1c-b254-45d0-98ee-63b8f34a8531  RAC1
>> UN  192.168.1.31  20.12 MB   256  ?
>> 1f8000f5-026c-42c7-8189-cf19fbede566  RAC1
>> UN  192.168.1.32  17.73 MB   256  ?
>> 7b06f9e9-7c41-4364-ab18-f6976fd359e4  RAC1
>> UJ  192.168.1.33  877.6 KB   256  ?
>> 7a1507b5-198e-4a3a-a9fd-7af9e588fde2  RAC1
>>
>> Note: Non-system keyspaces don't have the same replication settings,
>> effective ownership information is meaningless
>>
>> Any tips on fixing this?
>>
>> Thanks,
>>
>> C.
>>
>
>


broadcast_address in multi data center setups

2016-01-17 Thread Francisco Reyes

Setting up my first Cassandra cluster.

Does one need to set broadcast_address to the public IP on all the nodes, like
this?

node 1 - colo 1 - broadcast points to public IP
node 2 - colo 1 - broadcast points to public IP
.
node n - colo 1 - broadcast points to public IP

node 4 - colo 2 - broadcast points to public IP
node 5 - colo 2 - broadcast points to public IP

Or can it be like:
node 1 - colo 1 - broadcast points to internal
node 2 - colo 1 - broadcast points to internal
.
node n - colo 1 - broadcast points to public IP

node 4 - colo 2 - broadcast points to internal
node 5 - colo 2 - broadcast points to public IP

Is there a way to restrict which IPs are allowed to connect to the DB at the
Cassandra level, or does one have to set up a firewall at the OS level?
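
As a rough sketch of the first arrangement, with placeholder addresses (whether
the internal-only variant works depends on whether nodes in the other colo can
route to those internal addresses):

  # cassandra.yaml on node 1 in colo 1
  listen_address: 10.0.1.11        # the local interface the node binds to
  broadcast_address: 203.0.113.11  # the address nodes in other colos should use

As far as I know there is no built-in IP allow-list in Cassandra itself; client
access is usually restricted with authentication plus an OS-level or network
firewall.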


Re: New node has high network and disk usage.

2016-01-17 Thread Kai Wang
James,

Thanks for sharing. Anyway, good to know there's one more thing to add to
the checklist.

On Sun, Jan 17, 2016 at 12:23 PM, James Griffin <
james.grif...@idioplatform.com> wrote:

> Hi all,
>
> Just to let you know, we finally figured this out on Friday. It turns out
> the new nodes had an older version of the kernel installed. Upgrading the
> kernel solved our issues. For reference, the "bad" kernel was
> 3.2.0-75-virtual, upgrading to 3.2.0-86-virtual resolved the issue. We
> still don't fully understand why this kernel bug didn't affect all our
> nodes (in the end we had three nodes with that kernel, only two of them
> exhibited this issue), but there we go.
>
> Thanks everyone for your help
>
> Cheers,
> Griff
>
> On 14 January 2016 at 15:14, James Griffin  > wrote:
>
>> Hi Kai,
>>
>> Well observed - running `nodetool status` without specifying keyspace
>> does report ~33% on each node. We have two keyspaces on this cluster - if I
>> specify either of them the ownership reported by each node is 100%, so I
>> believe the repair completed successfully.
>>
>> Best wishes,
>>
>> Griff
>>
>> James "Griff" Griffin
>> CTO, idio (http://idioplatform.com)
>> Switchboard: +44 (0)20 3540 1920 | Direct: +44 (0)7763 139 206
>>
>> On 14 January 2016 at 15:08, Kai Wang  wrote:
>>
>>> James,
>>>
>>> I may be missing something. You mentioned your cluster had RF=3. Then why
>>> does "nodetool status" show each node owns 1/3 of the data especially after
>>> a full repair?
>>>
>>> On Thu, Jan 14, 2016 at 9:56 AM, James Griffin <
>>> james.grif...@idioplatform.com> wrote:
>>>
 Hi Kai,

 Below - nothing going on that I can see

 $ nodetool netstats
 Mode: NORMAL
 Not sending any streams.
 Read Repair Statistics:
 Attempted: 0
 Mismatch (Blocking): 0
 Mismatch (Background): 0
 Pool Name                    Active   Pending      Completed
 Commands                        n/a         0           6326
 Responses                       n/a         0         219356



 Best wishes,

 Griff

 James "Griff" Griffin
 CTO, idio (http://idioplatform.com)
 Switchboard: +44 (0)20 3540 1920 | Direct: +44 (0)7763 139 206

 On 14 January 2016 at 14:22, Kai Wang  wrote:

> James,
>
> Can you post the result of "nodetool netstats" on the bad node?
>
> On Thu, Jan 14, 2016 at 9:09 AM, James Griffin <
> james.grif...@idioplatform.com> wrote:
>
>> A summary of what we've done this morning:
>>
>>- Noted that there are no GCInspector lines in system.log on bad
>>node (there are GCInspector logs on other healthy nodes)
>>    - Turned on GC logging, noted logs stating that the total time for which
>>    application threads were stopped was high - ~10s.
>>    - Not seeing failures of any kind (promotion or concurrent mark)
>>    - Attached VisualVM: noted that heap usage was very low (~5% usage and
>>    stable) and it didn't display the hallmarks of GC activity. PermGen was
>>    also very stable
>>- Downloaded GC logs and examined in GC Viewer. Noted that:
>>- We had lots of pauses (again around 10s), but no full GC.
>>   - From a 2,300s sample, just over 2,000s were spent with
>>   threads paused
>>   - 

Re:Re: endless full gc on one node

2016-01-17 Thread xutom
Hi Kai Wang,
I also encountered such an issue a few days ago. I have 6 nodes, and I found 2
nodes doing endless full GC when I exported ALL the data from C* using "SELECT *
FROM table". I removed all data from those 2 nodes and reinstalled Cassandra,
and the problem went away.



At 2016-01-18 06:18:46, "Kai Wang"  wrote:

DuyHai,


In this case I didn't use batches; I just bound a single PreparedStatement and
executed it. Nor did I see any warning/error in the log about a batch being too large.


Thanks.



On Sat, Jan 16, 2016 at 6:27 PM, DuyHai Doan  wrote:

"As soon as inserting started, one node started non-stop full GC. The other two 
nodes were totally fine"


Just a guess: how did you insert the data? Did you use batch statements?


On Sat, Jan 16, 2016 at 10:12 PM, Kai Wang  wrote:

Hi,


Recently I saw some strange behavior on one of the nodes of a 3-node cluster. A 
while ago I created a table and put some data (about 150M) in it for testing. A 
few days ago I started to import full data into that table using normal cql 
INSERT statements. As soon as inserting started, one node started non-stop full 
GC. The other two nodes were totally fine. I stopped the inserting process, 
restarted C* on all the nodes. All nodes are fine. But once I started inserting 
again, full GC kicked in on that node within a minute. The insertion speed is
moderate. Again, the other two nodes were fine. I tried this process a couple 
of times. Every time the same node jumped into full GC. I even rebooted all the 
boxes. I checked system.log but found no errors or warnings before full GC 
started.


Finally I deleted and recreated the table. All of a sudden the problem went away.
The only thing I can think of is that the table was created using STCS. After I
inserted 150M data into it, I switched it to LCS. Then I ran incremental repair 
a couple of times. I saw validation and normal compaction on that table as 
expected. When I recreated the table, I created it with LCS.


I don't have the problem any more but just want to share the experience. Maybe 
someone has a theory on this? BTW I am running C* 2.2.4 with CentOS 7 and Java
8. All boxes have the identical configurations.



Thanks.






Re: Too many compactions, maybe keyspace system?

2016-01-17 Thread Shuo Chen
I documented this on JIRA. Please see CASSANDRA-11025
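
For anyone hitting the same symptom, the pending-compaction backlog discussed
below can at least be observed from the command line; these show counts and
currently running compactions rather than per-pending-task details:

  nodetool compactionstats   # pending task count plus any compactions in progress
  nodetool tpstats           # CompactionExecutor active / pending / blocked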


On Sun, Jan 17, 2016 at 11:48 PM, Sebastian Estevez <
sebastian.este...@datastax.com> wrote:

> I agree that this may be worth a jira.
>
> Can you clarify this statement?
>
> >>5 keyspaces and about 100 cfs months
>
> How many total empty tables did you create? Creating hundreds of tables is
> a bad practice in Cassandra but I was not aware of a compaction impact like
> what you're describing.
>
> all the best,
>
> Sebastián
> On Jan 16, 2016 4:43 AM, "DuyHai Doan"  wrote:
>
>> Interesting, maybe it's worth filing a JIRA. Empty tables should not slow
>> down compaction of other tables
>>
>> On Sat, Jan 16, 2016 at 10:33 AM, Shuo Chen 
>> wrote:
>>
>>> Hi, Robert,
>>>
>>> I think I found the cause of the too many compactions. I used jmap to
>>> dump the heap and used Eclipse memory analyzer plugin to extract the heap.
>>>
>>> In the previous reply, it shows that there are too many pending jobs in the
>>> blocking queue. I checked the CFs of the compaction task objects. There are
>>> many CFs concerning some empty CFs I created before.
>>>
>>> I created 5 keyspaces and about 100 CFs months ago using cassandra-cli and
>>> did not put any data in them yet. In fact, there is only 1 keyspace I created
>>> that contains data, and the other 5 keyspaces are empty.
>>>
>>> When I dropped these 5 keyspaces and restarted the high-compaction node,
>>> it ran normally with a normal amount of compactions.
>>>
>>> So maybe there is a compaction bug related to empty column families?
>>>
>>> On Wed, Jan 13, 2016 at 2:39 AM, Robert Coli 
>>> wrote:
>>>
 On Mon, Jan 11, 2016 at 9:12 PM, Shuo Chen 
 wrote:

> I have an assumption that lots of pending compaction tasks jam the
> memory and trigger full GC. The full GC chokes the process and slows down
> compaction. And this causes more pending compaction tasks and more pressure
> on memory.
>

 The question is why there are so many pending compactions, because your
 log doesn't show that much compaction is happening. What keyspaces /
 columnfamilies do you expect to be compacting, and how many SSTables do
 they contain?


> Is there a method to list the concrete details of pending compaction
> tasks?
>

 Nope.

 For the record, this type of extended operational debugging is often
 best carried out interactively on #cassandra on freenode IRC.. :)

 =Rob

>>>
>>>
>>>
>>> --
>>> *陈硕* *Shuo Chen*
>>> chenatu2...@gmail.com
>>> chens...@whaty.com
>>>
>>
>>


-- 
*陈硕* *Shuo Chen*
chenatu2...@gmail.com
chens...@whaty.com


Re: Basic query in setting up secure inter-dc cluster

2016-01-17 Thread Ajay Garg
Hi All.

A gentle reminder about my query.

I would be grateful for a brief technical overview of how secure communication
occurs between two nodes in a cluster.

Please note that I am looking for information on how it works under the hood,
NOT on how to set it up.



Thanks and Regards,
Ajay
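
For reference, the settings involved in the handshake described in the quoted
message below live under server_encryption_options in cassandra.yaml. A minimal
sketch, with placeholder paths and passwords:

  server_encryption_options:
      internode_encryption: dc        # per the docs quoted below: encrypt traffic between data centers
      keystore: conf/.keystore        # this node's own certificate and private key
      keystore_password: cassandra
      truststore: conf/.truststore    # certificates this node is willing to trust
      truststore_password: cassandra
      require_client_auth: false      # false: the accepting node does not verify the
                                      # connecting node's certificate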

On Wed, Jan 6, 2016 at 4:16 PM, Ajay Garg  wrote:

> Thanks everyone for the reply.
>
> I actually have quite a few questions, but it would be nice if someone
> could please tell me the flow, implementation-wise, of how node-to-node
> encryption works in a cluster.
>
> Let's say node1 from DC1 wishes to talk securely to node2 from DC2 (with
> "require_client_auth: false").
> I presume it would be like below (please correct me if I am wrong):
>
> a)
> node1 tries to connect to node2, using the certificate *as defined on
> node1* in cassandra.yaml.
>
> b)
> node2 will confirm if the certificate being offered by node1 is in the
> truststore *as defined on node2* in cassandra.yaml.
> If it is, secure communication is allowed.
>
>
> Is my thinking right?
>
> On Wed, Jan 6, 2016 at 1:55 PM, Neha Dave  wrote:
>
>> Hi Ajay,
>> Have a look here :
>> https://docs.datastax.com/en/cassandra/1.2/cassandra/security/secureSSLNodeToNode_t.html
>>
>> You can configure for DC level Security:
>>
>> Procedure
>>
>> On each node, under server_encryption_options:
>>
>>- Enable internode_encryption.
>>The available options are:
>>   - all
>>   - none
>>   - dc: Cassandra encrypts the traffic between the data centers.
>>   - rack: Cassandra encrypts the traffic between the racks.
>>
>> regards
>>
>> Neha
>>
>>
>>
>> On Wed, Jan 6, 2016 at 12:48 PM, Singh, Abhijeet > > wrote:
>>
>>> Security is a very wide concept. What exactly do you want to achieve?
>>>
>>>
>>>
>>> *From:* Ajay Garg [mailto:ajaygargn...@gmail.com]
>>> *Sent:* Wednesday, January 06, 2016 11:27 AM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Basic query in setting up secure inter-dc cluster
>>>
>>>
>>>
>>> Hi All.
>>>
>>> We have a 2*2 cluster deployed, but no security as of now.
>>>
>>> As a first stage, we wish to implement inter-dc security.
>>>
>>> Is it possible to enable security one machine at a time?
>>>
>>> For example, let's say the machines are DC1M1, DC1M2, DC2M1, DC2M2.
>>>
>>> If I make the changes JUST IN DC2M2 and restart it, will the traffic
>>> between DC1M1/DC1M2 and DC2M2 be secure? Or security will kick in ONLY
>>> AFTER the changes are made in all the 4 machines?
>>>
>>> Asking here, because I don't want to screw up a live cluster due to my
>>> lack of experience.
>>>
>>> Looking forward to some pointers.
>>>
>>>
>>> --
>>>
>>> Regards,
>>> Ajay
>>>
>>
>>
>
>
> --
> Regards,
> Ajay
>



-- 
Regards,
Ajay


Re: Too many compactions, maybe keyspace system?

2016-01-17 Thread Sebastian Estevez
I agree that this may be worth a jira.

Can you clarify this statement?

>>5 keyspaces and about 100 cfs months

How many total empty tables did you create? Creating hundreds of tables is
a bad practice in Cassandra but I was not aware of a compaction impact like
what you're describing.

all the best,

Sebastián
On Jan 16, 2016 4:43 AM, "DuyHai Doan"  wrote:

> Interesting, maybe it's worth filing a JIRA. Empty tables should not slow
> down compaction of other tables
>
> On Sat, Jan 16, 2016 at 10:33 AM, Shuo Chen  wrote:
>
>> Hi, Robert,
>>
>> I think I found the cause of the too many compactions. I used jmap to
>> dump the heap and used Eclipse memory analyzer plugin to extract the heap.
>>
>> In the previous reply, it shows that there are too many pending jobs in the
>> blocking queue. I checked the CFs of the compaction task objects. There are
>> many CFs concerning some empty CFs I created before.
>>
>> I created 5 keyspaces and about 100 CFs months ago using cassandra-cli and
>> did not put any data in them yet. In fact, there is only 1 keyspace I created
>> that contains data, and the other 5 keyspaces are empty.
>>
>> When I dropped these 5 keyspaces and restarted the high-compaction node,
>> it ran normally with a normal amount of compactions.
>>
>> So maybe there is a compaction bug related to empty column families?
>>
>> On Wed, Jan 13, 2016 at 2:39 AM, Robert Coli 
>> wrote:
>>
>>> On Mon, Jan 11, 2016 at 9:12 PM, Shuo Chen 
>>> wrote:
>>>
 I have an assumption that lots of pending compaction tasks jam the
 memory and trigger full GC. The full GC chokes the process and slows down
 compaction. And this causes more pending compaction tasks and more pressure
 on memory.

>>>
>>> The question is why there are so many pending compactions, because your
>>> log doesn't show that much compaction is happening. What keyspaces /
>>> columnfamilies do you expect to be compacting, and how many SSTables do
>>> they contain?
>>>
>>>
 Is there a method to list the concrete details of pending compaction
 tasks?

>>>
>>> Nope.
>>>
>>> For the record, this type of extended operational debugging is often
>>> best carried out interactively on #cassandra on freenode IRC.. :)
>>>
>>> =Rob
>>>
>>
>>
>>
>> --
>> *陈硕* *Shuo Chen*
>> chenatu2...@gmail.com
>> chens...@whaty.com
>>
>
>


Re: New node has high network and disk usage.

2016-01-17 Thread James Griffin
Hi all,

Just to let you know, we finally figured this out on Friday. It turns out
the new nodes had an older version of the kernel installed. Upgrading the
kernel solved our issues. For reference, the "bad" kernel was
3.2.0-75-virtual, upgrading to 3.2.0-86-virtual resolved the issue. We
still don't fully understand why this kernel bug didn't affect all our
nodes (in the end we had three nodes with that kernel, only two of them
exhibited this issue), but there we go.

Thanks everyone for your help

Cheers,
Griff
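
For anyone checking their own fleet, a quick way to spot this kind of drift is
simply to compare kernels across nodes:

  # run on every node (e.g. via ssh in a loop or your config management tool)
  uname -r
  # in this thread the problematic kernel was 3.2.0-75-virtual; 3.2.0-86-virtual was fine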

On 14 January 2016 at 15:14, James Griffin 
wrote:

> Hi Kai,
>
> Well observed - running `nodetool status` without specifying keyspace does
> report ~33% on each node. We have two keyspaces on this cluster - if I
> specify either of them the ownership reported by each node is 100%, so I
> believe the repair completed successfully.
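
For reference, the per-keyspace ownership mentioned here comes from passing the
keyspace name explicitly, since effective ownership is only meaningful per
keyspace when replication settings differ:

  nodetool status <keyspace_name>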
>
> Best wishes,
>
> Griff
>
> James "Griff" Griffin
> CTO, idio (http://idioplatform.com)
> Switchboard: +44 (0)20 3540 1920 | Direct: +44 (0)7763 139 206
>
> On 14 January 2016 at 15:08, Kai Wang  wrote:
>
>> James,
>>
>> I may be missing something. You mentioned your cluster had RF=3. Then why does
>> "nodetool status" show each node owns 1/3 of the data especially after a
>> full repair?
>>
>> On Thu, Jan 14, 2016 at 9:56 AM, James Griffin <
>> james.grif...@idioplatform.com> wrote:
>>
>>> Hi Kai,
>>>
>>> Below - nothing going on that I can see
>>>
>>> $ nodetool netstats
>>> Mode: NORMAL
>>> Not sending any streams.
>>> Read Repair Statistics:
>>> Attempted: 0
>>> Mismatch (Blocking): 0
>>> Mismatch (Background): 0
>>> Pool Name                    Active   Pending      Completed
>>> Commands                        n/a         0           6326
>>> Responses                       n/a         0         219356
>>>
>>>
>>>
>>> Best wishes,
>>>
>>> Griff
>>>
>>> James "Griff" Griffin
>>> CTO, idio (http://idioplatform.com)
>>> Switchboard: +44 (0)20 3540 1920 | Direct: +44 (0)7763 139 206
>>>
>>> On 14 January 2016 at 14:22, Kai Wang  wrote:
>>>
 James,

 Can you post the result of "nodetool netstats" on the bad node?

 On Thu, Jan 14, 2016 at 9:09 AM, James Griffin <
 james.grif...@idioplatform.com> wrote:

> A summary of what we've done this morning:
>
>- Noted that there are no GCInspector lines in system.log on bad
>node (there are GCInspector logs on other healthy nodes)
>    - Turned on GC logging, noted logs stating that the total time for which
>    application threads were stopped was high - ~10s.
>    - Not seeing failures of any kind (promotion or concurrent mark)
>    - Attached VisualVM: noted that heap usage was very low (~5% usage and
>    stable) and it didn't display the hallmarks of GC activity. PermGen was
>    also very stable
>- Downloaded GC logs and examined in GC Viewer. Noted that:
>- We had lots of pauses (again around 10s), but no full GC.
>   - From a 2,300s sample, just over 2,000s were spent with
>   threads paused
>   - Spotted many small GCs in the new space - realised that Xmn
>   value was very low (200M against a heap size of 3750M). Increased 
> Xmn to
>   937M - no change in server behaviour (high load, high reads/s on 
> disk, high
>   CPU wait)
>
> Current output of jstat:
>
>   S0
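
For reference, the heap and new-generation sizes being adjusted above normally
live in cassandra-env.sh; a sketch using the figures mentioned in the thread
(by default the script calculates these itself):

  # cassandra-env.sh
  MAX_HEAP_SIZE="3750M"
  HEAP_NEWSIZE="937M"

Output with an S0 column header, as above, typically comes from jstat's
-gcutil (or similar) mode, e.g. "jstat -gcutil <cassandra-pid> 5000".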