Re: Datafile Corruption

2019-10-09 Thread Philip Ó Condúin
Just to follow up on this issue as others may see it in the future, we
cracked it!

Our datafile corruption issues came down to a block still belonging to a C*
data file being wrongly treated as free: the block was discarded as if
unused and later reused for another file.

For example:
C* deletes a file after compaction, and the OS collects the now-free blocks
and sends a TRIM command to the SSD. From time to time, though, the SSD
trims the wrong block - not the one the OS reported - zeroing a block that
still belongs to a live datafile, and later hands that block to a different
file.
So the symptom is: we suddenly see 4096 zeroes in the datafile, meaning the
SSD has just trimmed the block; after some time we see other data written to
those bytes, meaning the block is now used by another file, which leaves us
with a corrupt file.
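
One quick way to spot such runs, for anyone checking their own files (a
sketch; the path is just the example file from earlier in this thread):

xxd -a /data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big-Data.db | less

xxd's autoskip collapses repeated all-zero lines into a single '*', so a
trimmed 4 KiB block shows up as a '*' between two offsets exactly 0x1000
apart.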

We turned off the scheduled TRIM function on the OS and we are no longer
getting corruptions.
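
For anyone else chasing this: on RHEL 7 the scheduled TRIM is normally the
fstrim.timer systemd unit, so a quick way to check for it and switch it off
(a sketch, assuming a systemd-based setup) is:

systemctl list-timers fstrim.timer   # is a scheduled TRIM job active?
systemctl stop fstrim.timer          # stop the periodic run
systemctl disable fstrim.timer       # keep it off across reboots
grep discard /etc/fstab              # continuous TRIM via mount option?
lsblk --discard                      # which devices advertise TRIM support

If the discard mount option shows up in fstab, that is continuous TRIM and
would need to be removed separately from the scheduled job.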

This was very difficult to pinpoint.

On Thu, 15 Aug 2019 at 00:09, Patrick McFadin  wrote:

> If you hadn't mentioned the fact you are using physical disk I would have
> guessed you were using virtual disks on a SAN. I've seen this sort of thing
> happen a lot there. Are there any virtual layers between the cassandra
> process and the hardware? Just a reminder, fsync can be a liar and the
> virtual layer can mock the response back to user land while the actual bits
> can be dropped before hitting the disk.
>
> If not, you should be looking hard at your disk options. fstab,
> schedulers, etc. In that case, you need this:
> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>
>
> Patrick
>
> On Wed, Aug 14, 2019 at 2:03 PM Forkalsrud, Erik 
> wrote:
>
>> The dmesg command will usually show information about hardware errors.
>>
>> An example from a spinning disk:
>> sd 0:0:10:0: [sdi] Unhandled sense code
>> sd 0:0:10:0: [sdi] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>> sd 0:0:10:0: [sdi] Sense Key : Medium Error [current]
>> Info fld=0x6fc72
>> sd 0:0:10:0: [sdi] Add. Sense: Unrecovered read error
>> sd 0:0:10:0: [sdi] CDB: Read(10): 28 00 00 06 fc 70 00 00 08 00
>>
>>
>> Also, you can read the file like
>> "cat  /data/ssd2/data/KeyspaceMetadata/x-x/lb-26203-big-Data.db >
>> /dev/null"
>> If you get an error message, it's probably a hardware issue.
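>>
>> To sweep every sstable at once, the same check can be wrapped in find (a
>> sketch; adjust the data path to your layout):
>>
>> find /data -name '*-Data.db' \
>>   -exec sh -c 'cat "$1" > /dev/null || echo "READ ERROR: $1"' _ {} \;
>>
>> Any file that prints a READ ERROR line is worth cross-checking in dmesg.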
>>
>> - Erik -
>>
>> --
>> *From:* Philip Ó Condúin 
>> *Sent:* Thursday, August 8, 2019 09:58
>> *To:* user@cassandra.apache.org 
>> *Subject:* Re: Datafile Corruption
>>
>> Hi Jon,
>>
>> Good question, I'm not sure if we're using NVMe; I don't see /dev/nvme,
>> but we could still be using it.
>> We're using *Cisco UCS C220 M4 SFF* so I'm just going to check the spec.
>>
>> Our kernel is the following; we're using Red Hat, so I'm told we can't
>> upgrade the version until the next major release anyway.
>> root@cass 0 17:32:28 ~ # uname -r
>> 3.10.0-957.5.1.el7.x86_64
>>
>> Cheers,
>> Phil
>>
>> On Thu, 8 Aug 2019 at 17:35, Jon Haddad  wrote:
>>
>> Any chance you're using NVMe with an older Linux kernel?  I've seen a
>> *lot* of filesystem errors from using older CentOS versions.  You'll want to
>> be using a version > 4.15.
>>
>> On Thu, Aug 8, 2019 at 9:31 AM Philip Ó Condúin 
>> wrote:
>>
>> *@Jeff *- If it was hardware, that would explain it all, but do you think
>> it's possible for every server in the cluster to have a hardware issue?
>> The data is sensitive and the customer would lose their mind if I sent it
>> off-site, which is a pity because I could really do with the help.
>> The corruption is occurring irregularly on every server, instance, and
>> column family in the cluster.  Out of 72 instances, we are getting maybe 10
>> corrupt files per day.
>> We are using vnodes (256) and it is happening in both DCs.
>>
>> *@Asad *- internode compression is set to ALL on every server.  I have
>> checked the packets for the private interconnect and I can't see any
>> dropped packets; there are dropped packets for other interfaces, but not
>> for the private ones.  I will get the network team to double-check this.
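>>
>> As a cross-check on ifconfig, ip -s link prints per-interface RX/TX
>> statistics, including a dropped counter (a sketch; the interface name is
>> just a placeholder):
>>
>> ip -s link show eth0
>>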
>> The corruption is only on the application schema; we are not getting
>> corruption on any system or cass keyspaces.  Corruption is happening in
>> both DCs.  We are getting corruption for the one application schema we have,
>> across all tables in the keyspace; it's not limited to one table.
>> I'm not sure why the app team decided not to use the default compression;
>> I must ask them.
>>
>>
>>
>> I have

Re: Datafile Corruption

2019-08-09 Thread Philip Ó Condúin
at
org.apache.cassandra.db.ColumnIndex$Builder.buildForCompaction(ColumnIndex.java:174)
~[apache-cassandra-2.2.13.jar:2.2.13]

at
org.apache.cassandra.db.compaction.LazilyCompactedRow.update(LazilyCompactedRow.java:187)
~[apache-cassandra-2.2.13.jar:2.2.13]

at org.apache.cassandra.repair.Validator.rowHash(Validator.java:201)
~[apache-cassandra-2.2.13.jar:2.2.13]

at org.apache.cassandra.repair.Validator.add(Validator.java:150)
~[apache-cassandra-2.2.13.jar:2.2.13]

at
org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:1166)
~[apache-cassandra-2.2.13.jar:2.2.13]

at
org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:76)
~[apache-cassandra-2.2.13.jar:2.2.13]

at
org.apache.cassandra.db.compaction.CompactionManager$10.call(CompactionManager.java:736)
~[apache-cassandra-2.2.13.jar:2.2.13]

at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_172]

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
~[na:1.8.0_172]

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[na:1.8.0_172]

at java.lang.Thread.run(Thread.java:748) [na:1.8.0_172]

Caused by: org.apache.cassandra.io.sstable.CorruptSSTableException:
Corrupted: /data/ssd2/data/KeyspaceMetadata/CF_ToIndex/lb-26203-big-Data.db

at
org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferMmap(CompressedRandomAccessReader.java:216)
~[apache-cassandra-2.2.13.jar:2.2.13]

at
org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBuffer(CompressedRandomAccessReader.java:226)
~[apache-cassandra-2.2.13.jar:2.2.13]

at
org.apache.cassandra.io.compress.CompressedThrottledReader.reBuffer(CompressedThrottledReader.java:42)
~[apache-cassandra-2.2.13.jar:2.2.13]

at
org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:352)
~[apache-cassandra-2.2.13.jar:2.2.13]

... 27 common frames omitted

Caused by: org.apache.cassandra.io.compress.CorruptBlockException:
(/data/ssd2/data/KeyspaceMetadata/CF_ToIndex/lb-26203-big-Data.db):
corruption detected, chunk at 1173600152 of length 20802.

at
org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferMmap(CompressedRandomAccessReader.java:185)
~[apache-cassandra-2.2.13.jar:2.2.13]

... 30 common frames omitted

INFO  21:30:33 Not a global repair, will not do anticompaction

ERROR 21:30:33 Stopping gossiper

WARN  21:30:33 Stopping gossip by operator request

INFO  21:30:33 Announcing shutdown

INFO  21:30:33 Node /x.x.x.x state jump to shutdown

INFO  21:30:34 [Stream #961933a1-b95a-11e9-b642-255c22db0481] Session with /
10.2.57.48 is complete

INFO  21:30:34 [Stream #961933a1-b95a-11e9-b642-255c22db0481] All sessions
completed

INFO  21:30:34 [Stream #96190c90-b95a-11e9-8a18-dbd9268d5b6a] Session with /
10.2.57.47 is complete

INFO  21:30:34 [Stream #96190c90-b95a-11e9-8a18-dbd9268d5b6a] All sessions
completed

INFO  21:30:34 [repair #9587a200-b95a-11e9-8920-9f72868b8375] streaming
task succeed, returning response to /x.x.x.x

INFO  21:30:34 [repair #9587a200-b95a-11e9-8920-9f72868b8375] Sending
completed merkle tree to /x.x.x.x for KeyspaceMetadata.CF_CcIndex

ERROR 21:30:35 Stopping RPC server

INFO  21:30:35 Stop listening to thrift clients

ERROR 21:30:35 Stopping native transport

INFO  21:30:35 Stop listening for CQL clients



On Thu, 8 Aug 2019 at 17:58, Philip Ó Condúin 
wrote:

> Hi Jon,
>
> Good question, I'm not sure if we're using NVMe; I don't see /dev/nvme,
> but we could still be using it.
> We're using *Cisco UCS C220 M4 SFF* so I'm just going to check the spec.
>
> Our kernel is the following; we're using Red Hat, so I'm told we can't
> upgrade the version until the next major release anyway.
> root@cass 0 17:32:28 ~ # uname -r
> 3.10.0-957.5.1.el7.x86_64
>
> Cheers,
> Phil
>
> On Thu, 8 Aug 2019 at 17:35, Jon Haddad  wrote:
>
>> Any chance you're using NVMe with an older Linux kernel?  I've seen a
>> *lot* of filesystem errors from using older CentOS versions.  You'll want to
>> be using a version > 4.15.
>>
>> On Thu, Aug 8, 2019 at 9:31 AM Philip Ó Condúin 
>> wrote:
>>
>>> *@Jeff *- If it was hardware, that would explain it all, but do you
>>> think it's possible for every server in the cluster to have a hardware
>>> issue?
>>> The data is sensitive and the customer would lose their mind if I sent
>>> it off-site, which is a pity because I could really do with the help.
>>> The corruption is occurring irregularly on every server, instance, and
>>> column family in the cluster.  Out of 72 instances, we are getting maybe 10
>>> corrupt files per day.
>>> We are using vnodes (256) and it is happening in both DCs.
>>>
>>> *@Asad *- internode compression is set to ALL on every server.  I have
>>> checked the packets for the private in

Re: Datafile Corruption

2019-08-08 Thread Philip Ó Condúin
Hi Jon,

Good question, I'm not sure if we're using NVMe; I don't see /dev/nvme, but
we could still be using it.
We're using *Cisco UCS C220 M4 SFF* so I'm just going to check the spec.

Our kernel is the following; we're using Red Hat, so I'm told we can't
upgrade the version until the next major release anyway.
root@cass 0 17:32:28 ~ # uname -r
3.10.0-957.5.1.el7.x86_64

Cheers,
Phil

On Thu, 8 Aug 2019 at 17:35, Jon Haddad  wrote:

> Any chance you're using NVMe with an older Linux kernel?  I've seen a
> *lot* of filesystem errors from using older CentOS versions.  You'll want to
> be using a version > 4.15.
>
> On Thu, Aug 8, 2019 at 9:31 AM Philip Ó Condúin 
> wrote:
>
>> *@Jeff *- If it was hardware, that would explain it all, but do you think
>> it's possible for every server in the cluster to have a hardware issue?
>> The data is sensitive and the customer would lose their mind if I sent it
>> off-site, which is a pity because I could really do with the help.
>> The corruption is occurring irregularly on every server, instance, and
>> column family in the cluster.  Out of 72 instances, we are getting maybe 10
>> corrupt files per day.
>> We are using vnodes (256) and it is happening in both DCs.
>>
>> *@Asad *- internode compression is set to ALL on every server.  I have
>> checked the packets for the private interconnect and I can't see any
>> dropped packets; there are dropped packets for other interfaces, but not
>> for the private ones.  I will get the network team to double-check this.
>> The corruption is only on the application schema; we are not getting
>> corruption on any system or cass keyspaces.  Corruption is happening in
>> both DCs.  We are getting corruption for the one application schema we have,
>> across all tables in the keyspace; it's not limited to one table.
>> I'm not sure why the app team decided not to use the default compression;
>> I must ask them.
>>
>>
>>
>> I have been checking /var/log/messages today, going back a few weeks, and
>> can see a serious number of broken-pipe errors across all servers and
>> instances.
>> Here is a snippet from one server, but most pipe errors are similar:
>>
>> Jul  9 03:00:08  cassandra: INFO  02:00:08 Writing
>> Memtable-sstable_activity@1126262628(43.631KiB serialized bytes, 18072
>> ops, 0%/0% of on/off-heap limit)
>> Jul  9 03:00:13  kernel: fnic_handle_fip_timer: 8 callbacks suppressed
>> Jul  9 03:00:19  kernel: fnic_handle_fip_timer: 8 callbacks suppressed
>> Jul  9 03:00:22  cassandra: ERROR 02:00:22 Got an IOException during
>> write!
>> Jul  9 03:00:22  cassandra: java.io.IOException: Broken pipe
>> Jul  9 03:00:22  cassandra: at
>> sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_172]
>> Jul  9 03:00:22  cassandra: at
>> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_172]
>> Jul  9 03:00:22  cassandra: at
>> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_172]
>> Jul  9 03:00:22  cassandra: at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>> ~[na:1.8.0_172]
>> Jul  9 03:00:22  cassandra: at
>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
>> ~[na:1.8.0_172]
>> Jul  9 03:00:22  cassandra: at
>> org.apache.thrift.transport.TNonblockingSocket.write(TNonblockingSocket.java:165)
>> ~[libthrift-0.9.2.jar:0.9.2]
>> Jul  9 03:00:22  cassandra: at
>> com.thinkaurelius.thrift.util.mem.Buffer.writeTo(Buffer.java:104)
>> ~[thrift-server-0.3.7.jar:na]
>> Jul  9 03:00:22  cassandra: at
>> com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.streamTo(FastMemoryOutputTransport.java:112)
>> ~[thrift-server-0.3.7.jar:na]
>> Jul  9 03:00:22  cassandra: at
>> com.thinkaurelius.thrift.Message.write(Message.java:222)
>> ~[thrift-server-0.3.7.jar:na]
>> Jul  9 03:00:22  cassandra: at
>> com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.handleWrite(TDisruptorServer.java:598)
>> [thrift-server-0.3.7.jar:na]
>> Jul  9 03:00:22  cassandra: at
>> com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.processKey(TDisruptorServer.java:569)
>> [thrift-server-0.3.7.jar:na]
>> Jul  9 03:00:22  cassandra: at
>> com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.select(TDisruptorServer.java:423)
>> [thrift-server-0.3.7.jar:na]
>> Jul  9 03:00:22  cassandra: at
>> com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.run(TDisruptorServer.java:383)
>> [thrift-server-0.3.7.jar:na]
>> Jul  9 03:00:25  kernel: fnic_handle_fip_timer: 8 callbacks suppressed
>> Jul  9 03:00:30  cassandra: ERROR 02:00:30 

Re: Datafile Corruption

2019-08-08 Thread Philip Ó Condúin
: at
com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.processKey(TDisruptorServer.java:569)
[thrift-server-0.3.7.jar:na]
Jul  9 03:00:30  cassandra: at
com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.select(TDisruptorServer.java:423)
[thrift-server-0.3.7.jar:na]
Jul  9 03:00:30  cassandra: at
com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.run(TDisruptorServer.java:383)
[thrift-server-0.3.7.jar:na]
Jul  9 03:00:31  kernel: fnic_handle_fip_timer: 8 callbacks suppressed
Jul  9 03:00:37  kernel: fnic_handle_fip_timer: 8 callbacks suppressed
Jul  9 03:00:43  kernel: fnic_handle_fip_timer: 8 callbacks suppressed



On Thu, 8 Aug 2019 at 15:42, ZAIDI, ASAD A  wrote:

> Did you check whether packets are being dropped on the network interfaces
> the Cassandra instances are using (ifconfig -a)? Internode compression is
> set for all endpoints - maybe the network is playing a role here?
>
> Is this corruption limited to certain keyspaces/tables or DCs, or is it
> widespread? From the log snippet you shared it looked like only a specific
> keyspace/table is affected - is that correct?
>
> When you remove a corrupted sstable of a certain table, I guess you verify
> all nodes for corrupted sstables of the same table (maybe with the nodetool
> scrub tool) so as to limit the spread of corruption - right?
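>
> For reference, a sketch of that verification step (the keyspace/table
> names are just the ones from the logs above): nodetool scrub rewrites
> sstables through a live node, while sstablescrub does the same offline
> with Cassandra stopped:
>
> nodetool scrub KeyspaceMetadata CF_ToIndex
> sstablescrub KeyspaceMetadata CF_ToIndex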
>
> Just curious - you're not using the lz4/default compressor for all tables;
> there must be some reason for it.
>
>
>
>
>
>
>
> *From:* Philip Ó Condúin [mailto:philipocond...@gmail.com]
> *Sent:* Thursday, August 08, 2019 6:20 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Datafile Corruption
>
>
>
> Hi All,
>
> Thank you so much for the replies.
>
> Currently, I have the following list of things that can potentially cause
> some sort of corruption in a Cassandra cluster.
>
>- Sudden Power cut  -  *We have had no power cuts in the datacenters*
>- Network Issues - *no network issues from what I can tell*
>- Disk full - *I don't think this is an issue for us, see disks below.*
>    - An issue in the Cassandra version, like CASSANDRA-13752 - *couldn't find
>    any Jira issues similar to ours.*
>- Bit Flips -* we have compression enabled so I don't think this
>should be an issue.*
>- Repair during upgrade has caused corruption too -* we have not
>upgraded*
>- Dropping and adding columns with the same name but a different type
>- *I will need to ask the apps team how they are using the database.*
>
>
>
> OK, let me try and explain the issue we are having.  I am under a lot of
> pressure from above to get this fixed, and I can't figure it out.
>
> This is a PRE-PROD environment.
>
>- 2 datacenters.
>- 9 physical servers in each datacenter
>- 4 Cassandra instances on each server
>- 72 Cassandra instances across the 2 data centres, 36 in site A, 36
>in site B.
>
>
> We also have 2 Reaper nodes we use for repair: one Reaper node in each
> datacenter, each running with its own Cassandra back end, in a cluster
> together.
>
> OS Details [Red Hat Linux]
> cass_a@x 0 10:53:01 ~ $ uname -a
> Linux x 3.10.0-957.5.1.el7.x86_64 #1 SMP Wed Dec 19 10:46:58 EST 2018
> x86_64 x86_64 x86_64 GNU/Linux
>
> cass_a@x 0 10:57:31 ~ $ cat /etc/*release
> NAME="Red Hat Enterprise Linux Server"
> VERSION="7.6 (Maipo)"
> ID="rhel"
>
> Storage Layout
> cass_a@xx 0 10:46:28 ~ $ df -h
> Filesystem Size  Used Avail Use% Mounted on
> /dev/mapper/vg01-lv_root20G  2.2G   18G  11% /
> devtmpfs63G 0   63G   0% /dev
> tmpfs   63G 0   63G   0% /dev/shm
> tmpfs   63G  4.1G   59G   7% /run
> tmpfs   63G 0   63G   0% /sys/fs/cgroup
> >> 4 cassandra instances
> /dev/sdd   1.5T  802G  688G  54% /data/ssd4
> /dev/sda   1.5T  798G  692G  54% /data/ssd1
> /dev/sdb   1.5T  681G  810G  46% /data/ssd2
> /dev/sdc   1.5T  558G  932G  38% /data/ssd3
>
> Cassandra load is about 200GB and the rest of the space is snapshots
>
> CPU
> cass_a@x 127 10:58:47 ~ $ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
> CPU(s):64
> Thread(s) per core:2
> Core(s) per socket:16
> Socket(s): 2
>
> *Description of problem:*
> During repair of the cluster, we are seeing multiple corruptions in the
> log files on a lot of instances.  There seems to be no pattern to the
> corruption.  It seems that the repair job is finding all the corrupted
> files for us.  The repair will hang on the node where the corrupted file is
> f

Re: Datafile Corruption

2019-08-08 Thread Philip Ó Condúin
ble-compactions_in_progress@1455399453(0.281KiB serialized bytes, 16
ops, 0%/0% of on/off-heap limit)
Jul 30 15:56:04 x tag_audit_log: type=USER_CMD
msg=audit(1564498555.190:457951): pid=19294 uid=509 auid=4294967295
ses=4294967295 msg='cwd="/"
cmd=72756E75736572202D73202F62696E2F62617368202D6C20636173735F62202D632063617373616E6472612D6D6574612F63617373616E6472612F62696E2F6E6F6465746F6F6C2074707374617473
terminal=? res=success'



We have checked a number of other things, like NTP settings, but nothing
is telling us what could cause so many corruptions across the entire
cluster.
Things were healthy with this cluster for months.  The only thing I can
think of is that we increased the data load from about 20GB per instance to
the 200GB it sits at now; maybe this just surfaced the issue.



Compaction and compression on the keyspace's CFs [a mixture]
All CFs are using compression.

AND compaction = {'min_threshold': '4', 'class':
'org.apache.cassandra.db.compaction.*SizeTieredCompactionStrategy*',
'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.
*SnappyCompressor*'}

AND compaction = {'min_threshold': '4', 'class':
'org.apache.cassandra.db.compaction.*SizeTieredCompactionStrategy*',
'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.
*LZ4Compressor*'}

AND compaction = {'class': 'org.apache.cassandra.db.compaction.
*LeveledCompactionStrategy*'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.
*SnappyCompressor*'}

--We are also using internode network compression:
internode_compression: all



Does anyone have any idea what I should check next?
Our next theory is that there may be an issue with checksums, but I'm not
sure where to go with this.
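
One possible starting point: Cassandra 2.2 ships verification tools that
re-read sstables and validate their checksums (a sketch, using the
keyspace/table names from the logs above; nodetool verify runs through a
live node, sstableverify runs offline):

nodetool verify KeyspaceMetadata CF_ToIndex
sstableverify KeyspaceMetadata CF_ToIndex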

Any help would be very much appreciated before I lose the last bit of hair
I have on my head.

Kind Regards,
Phil

On Wed, 7 Aug 2019 at 20:51, Nitan Kainth  wrote:

> Repair during an upgrade has caused corruption too.
>
> Also, dropping and adding columns with the same name but a different type.
>
>
> Regards,
>
> Nitan
>
> Cell: 510 449 9629
>
> On Aug 7, 2019, at 2:42 PM, Jeff Jirsa  wrote:
>
> Is compression enabled?
>
> If not, bit flips on disk can corrupt data files and reads + repair may
> send that corruption to other hosts in the cluster
>
>
> On Aug 7, 2019, at 3:46 AM, Philip Ó Condúin 
> wrote:
>
> Hi All,
>
> I am currently experiencing multiple datafile corruptions across most
> nodes in my cluster; there seems to be no pattern to the corruption.  I'm
> starting to think it might be a bug; we're using Cassandra 2.2.13.
>
> Without going into detail about the issue I just want to confirm something.
>
> Can someone share with me a list of scenarios that would cause corruption?
>
> 1. OS failure
> 2. Cassandra interrupted during a write
>
> etc etc.
>
> I need to investigate each scenario and don't want to leave any out.
>
> --
> Regards,
> Phil
>
>

-- 
Regards,
Phil


Datafile Corruption

2019-08-07 Thread Philip Ó Condúin
Hi All,

I am currently experiencing multiple datafile corruptions across most nodes
in my cluster; there seems to be no pattern to the corruption.  I'm
starting to think it might be a bug; we're using Cassandra 2.2.13.

Without going into detail about the issue I just want to confirm something.

Can someone share with me a list of scenarios that would cause corruption?

1. OS failure
2. Cassandra interrupted during a write

etc etc.

I need to investigate each scenario and don't want to leave any out.

-- 
Regards,
Phil


Change LISTEN_ADDRESS

2019-05-27 Thread Philip Ó Condúin
Hi All,

I currently have one node in a DC.  I am trying to add a second node, which
is in a different DC, into the cluster.

Obviously, Cassandra on node 1 will need to be able to talk to node 2 over
port 7000 and therefore the firewall rules will need to be correct.

I have been told by the team responsible for the comms that I need to use a
different interface on node 1 to be able to communicate on port 7000 with
node 2 (node 2 is not set up yet).

Problem:  I need to change the LISTEN_ADDRESS on node 1.
I've tried to find steps online for how to do this without messing things up
internally, but can't find any.

Would someone be able to point me in the right direction?
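
For reference, the usual minimal sequence seems to be roughly the following
(a sketch, unverified; it assumes the node keeps its data, nothing else
references the old address, and the service name matches the install):

nodetool drain                 # flush memtables, stop accepting writes
sudo systemctl stop cassandra  # or however the service is managed here
# edit cassandra.yaml: listen_address (plus broadcast_address/seeds if set)
sudo systemctl start cassandra
nodetool status                # confirm the node comes back Up/Normal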

Kind Regards,
Phil


Re: Insert from Select - CQL

2018-10-25 Thread Philip Ó Condúin
Hi Alain,

That is exactly what I did yesterday in the end.  I ran the selects and
output the results to a file, then ran some greps on that file to leave
myself with just the data rows, removing any whitespace and headers.
I then copied this data into a notepad on my local machine and saved it as
a CSV.  Luckily the results of the selects were delimited by pipes ("|"), so
I imported the CSV into a spreadsheet and was able to separate the values
into columns.

From here I was able to build up the insert statements and now have 4K
insert statements as a backup.
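
For what it's worth, cqlsh's COPY can replay such a CSV directly, which
avoids hand-building the INSERT statements (a sketch; the keyspace, table,
and column names are placeholders):

COPY myks.mytable (id, col_a, col_b) FROM '/tmp/deleted_rows.csv' WITH HEADER = true;

COPY FROM issues each CSV row as a regular write, so it doubles as the
restore path if the deletes ever need to be rolled back.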

Thanks a lot for your reply.

Kind regards,
Phil

On Thu, 25 Oct 2018 at 11:59, Alain RODRIGUEZ  wrote:

>
> Does anyone have any ideas of what I can do to generate inserts based on
>> primary key numbers in an Excel spreadsheet?
>
>
> A quick thought:
>
> What about using a column of the spreadsheet to actually store the SELECT
> result and generate the INSERT statement (and I would probably do the
> DELETE too) corresponding to each row using the power of the spreadsheet to
> write the query once and have it for all the partitions with the proper
> values?
>
> The spreadsheet would then be your backup somehow.
>
> We are a bit far from any Cassandra advice, but that's my first thought on
> your problem: use the spreadsheet :).
> Another option is probably to SELECT these rows and INSERT them into some
> other Cassandra table (same cluster or not). Here you would have to code it,
> I think (a client app of any kind).
> This might not be a good fit, but just in case, you might want to check
> the 'COPY' statement:
> https://stackoverflow.com/questions/21363046/how-to-select-data-from-a-table-and-insert-into-another-table
> I'm not too sure what suits you best.
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Wed, 24 Oct 2018 at 12:46, Philip Ó Condúin 
> wrote:
>
>> Hi All,
>>
>> I have a problem that I'm trying to work out and can't find anything
>> online that may help me.
>>
>> I have been asked to delete 4K records from a Column Family that has a
>> total of 1.8 million rows.  I have been given an excel spreadsheet with a
>> list of the 4K PRIMARY KEY numbers to be deleted.  Great, the delete will
>> be easy anyway.
>>
>> But before I delete them I want to take a backup of what I'm deleting
>> before I do, so that if the customer comes along and says they got the
>> wrong numbers then I can quickly restore one or all of them.
>> I have been trying to figure out how I can generate inserts from a select
>> but it looks like this is not possible.
>>
>> I'm using CentOS and Cassandra 2.11
>>
>> Does anyone have any ideas of what I can do to generate inserts based on
>> primary key numbers in an Excel spreadsheet?
>>
>> Kind Regards,
>> Phil
>>
>>
>>

-- 
Regards,
Phil


Insert from Select - CQL

2018-10-24 Thread Philip Ó Condúin
Hi All,

I have a problem that I'm trying to work out and can't find anything online
that may help me.

I have been asked to delete 4K records from a Column Family that has a
total of 1.8 million rows.  I have been given an excel spreadsheet with a
list of the 4K PRIMARY KEY numbers to be deleted.  Great, the delete will
be easy anyway.

But before I delete them I want to take a backup of what I'm deleting
before I do, so that if the customer comes along and says they got the
wrong numbers then I can quickly restore one or all of them.
I have been trying to figure out how I can generate inserts from a select
but it looks like this is not possible.

I'm using CentOS and Cassandra 2.11

Does anyone have any ideas of what I can do to generate inserts based on
primary key numbers in an Excel spreadsheet?

Kind Regards,
Phil


Re: jmxterm "#NullPointerException: No such PID "

2018-09-20 Thread Philip Ó Condúin
Thank you Yuki, this explains it.
I am used to working on C* 2.1 in production where this JVM flag is not
enabled.
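
(For a box where jps visibility matters more than the paging issue described
in CASSANDRA-9242, the flag can be commented out - a sketch, assuming the
default 3.11 layout where it lives in conf/jvm.options:

sed -i 's/^-XX:+PerfDisableSharedMem/#&/' conf/jvm.options

then restart Cassandra, and jps/jvms should list CassandraDaemon again.)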


On Wed, 19 Sep 2018 at 00:29, Yuki Morishita  wrote:

> This is because Cassandra sets the -XX:+PerfDisableSharedMem JVM option by
> default.
> This prevents tools such as jps from listing JVM processes.
> See https://issues.apache.org/jira/browse/CASSANDRA-9242 for detail.
>
> You can work around by doing what Riccardo said.
> On Tue, Sep 18, 2018 at 9:41 PM Philip Ó Condúin
>  wrote:
> >
> > Hi Riccardo,
> >
> > Yes that works for me:
> >
> > Welcome to JMX terminal. Type "help" for available commands.
> > $> open localhost:7199
> > #Connection to localhost:7199 is opened
> > $>domains
> > #following domains are available
> > JMImplementation
> > ch.qos.logback.classic
> > com.sun.management
> > java.lang
> > java.nio
> > java.util.logging
> > org.apache.cassandra.db
> > org.apache.cassandra.hints
> > org.apache.cassandra.internal
> > org.apache.cassandra.metrics
> > org.apache.cassandra.net
> > org.apache.cassandra.request
> > org.apache.cassandra.service
> > $>
> >
> > I can work with this :-)
> >
> > Not sure why the JVM is not listed when issuing the jvms command; maybe
> it's a server setting, since our production servers do find the Cass JVM.  I've
> spent half the day trying to figure it out, so I think I'll just put it to
> bed now and work on something else.
> >
> > Regards,
> > Phil
> >
> > On Tue, 18 Sep 2018 at 13:34, Riccardo Ferrari 
> wrote:
> >>
> >> Hi Philip,
> >>
> >> I've used jmxterm myself without any particular problems. On
> my systems too, I don't get the cassandra daemon listed when issuing the
> `jvms` command, but I never spent much time investigating it.
> >> Assuming you have not changed anything relevant in the cassandra-env.sh
> you can connect using jmxterm by issuing 'open 127.0.0.1:7199'. Would
> that work for you?
> >>
> >> HTH,
> >>
> >>
> >>
> >> On Tue, Sep 18, 2018 at 2:00 PM, Philip Ó Condúin <
> philipocond...@gmail.com> wrote:
> >>>
> >>> Further info:
> >>>
> >>> I would expect to see the following when I list the jvm's:
> >>>
> >>> Welcome to JMX terminal. Type "help" for available commands.
> >>> $>jvms
> >>> 25815    (m) - org.apache.cassandra.service.CassandraDaemon
> >>> 17628( ) - jmxterm-1.0-alpha-4-uber.jar
> >>>
> >>> But jmxterm is not picking up the JVM for Cassandra for some reason.
> >>>
> >>> Can someone point me in the right direction?  Are there settings in the
> cassandra-env.sh file I need to amend to get jmxterm to find the cass jvm?
> >>>
> >>> I'm not finding much about it on Google.
> >>>
> >>> Thanks,
> >>> Phil
> >>>
> >>>
> >>> On Tue, 18 Sep 2018 at 12:09, Philip Ó Condúin <
> philipocond...@gmail.com> wrote:
> >>>>
> >>>> Hi All,
> >>>>
> >>>> I need a little advice.  I'm trying to access the JMX terminal using
> jmxterm-1.0-alpha-4-uber.jar with a very simple default install of C* 3.11.3
> >>>>
> >>>> I keep getting the following:
> >>>>
> >>>> [cassandra@reaper-1 conf]$ java -jar jmxterm-1.0-alpha-4-uber.jar
> >>>> Welcome to JMX terminal. Type "help" for available commands.
> >>>> $>open 1666
> >>>> #NullPointerException: No such PID 1666
> >>>> $>
> >>>>
> >>>> C* is running with a PID of 1666.  I've tried setting JMX_LOCAL=no
> and have even created a new VM to test it.
> >>>>
> >>>> Does anyone know what I might be doing wrong here?
> >>>>
> >>>> Kind Regards,
> >>>> Phil
> >>>>
> >>>
> >>>
> >>> --
> >>> Regards,
> >>> Phil
> >>
> >>
> >
> >
> > --
> > Regards,
> > Phil
>


-- 
Regards,
Phil


Re: jmxterm "#NullPointerException: No such PID "

2018-09-18 Thread Philip Ó Condúin
Hi Riccardo,

Yes that works for me:

Welcome to JMX terminal. Type "help" for available commands.
$> open localhost:7199
#Connection to localhost:7199 is opened
$>domains
#following domains are available
JMImplementation
ch.qos.logback.classic
com.sun.management
java.lang
java.nio
java.util.logging
org.apache.cassandra.db
org.apache.cassandra.hints
org.apache.cassandra.internal
org.apache.cassandra.metrics
org.apache.cassandra.net
org.apache.cassandra.request
org.apache.cassandra.service
$>

I can work with this :-)

Not sure why the JVM is not listed when issuing the jvms command; maybe it's
a server setting, since our production servers do find the Cass JVM.  I've
spent half the day trying to figure it out, so I think I'll just put it to
bed now and work on something else.
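
As a side note, jmxterm can also run one-off commands non-interactively,
which is handy for scripting (a sketch; -l sets the JMX location and -n
suppresses the interactive prompt):

echo 'get -b java.lang:type=Memory HeapMemoryUsage' | \
  java -jar jmxterm-1.0-alpha-4-uber.jar -l localhost:7199 -n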

Regards,
Phil

On Tue, 18 Sep 2018 at 13:34, Riccardo Ferrari  wrote:

> Hi Philip,
>
> I've used jmxterm myself without any particular problems. On my
> systems too, I don't get the cassandra daemon listed when issuing the
> `jvms` command, but I never spent much time investigating it.
> Assuming you have not changed anything relevant in the cassandra-env.sh
> you can connect using jmxterm by issuing 'open 127.0.0.1:7199'. Would
> that work for you?
>
> HTH,
>
>
>
> On Tue, Sep 18, 2018 at 2:00 PM, Philip Ó Condúin <
> philipocond...@gmail.com> wrote:
>
>> Further info:
>>
>> I would expect to see the following when I list the jvm's:
>>
>> Welcome to JMX terminal. Type "help" for available commands.
>> $>jvms
>> *25815(m) - org.apache.cassandra.service.CassandraDaemon*
>> 17628( ) - jmxterm-1.0-alpha-4-uber.jar
>>
>> But jmxterm is not picking up the JVM for Cassandra for some reason.
>>
>> Can someone point me in the right direction?  Are there settings in the
>> cassandra-env.sh file I need to amend to get jmxterm to find the cass jvm?
>>
>> I'm not finding much about it on Google.
>>
>> Thanks,
>> Phil
>>
>>
>> On Tue, 18 Sep 2018 at 12:09, Philip Ó Condúin 
>> wrote:
>>
>>> Hi All,
>>>
>>> I need a little advice.  I'm trying to access the JMX terminal using
>>> *jmxterm-1.0-alpha-4-uber.jar* with a very simple default install of C*
>>> 3.11.3
>>>
>>> I keep getting the following:
>>>
>>> [cassandra@reaper-1 conf]$ java -jar jmxterm-1.0-alpha-4-uber.jar
>>> Welcome to JMX terminal. Type "help" for available commands.
>>> $>open 1666
>>> *#NullPointerException: No such PID 1666*
>>> $>
>>>
>>> C* is running with a PID of 1666.  I've tried setting JMX_LOCAL=no and
>>> have even created a new VM to test it.
>>>
>>> Does anyone know what I might be doing wrong here?
>>>
>>> Kind Regards,
>>> Phil
>>>
>>>
>>
>> --
>> Regards,
>> Phil
>>
>
>

-- 
Regards,
Phil


Re: jmxterm "#NullPointerException: No such PID "

2018-09-18 Thread Philip Ó Condúin
Further info:

I would expect to see the following when I list the jvm's:

Welcome to JMX terminal. Type "help" for available commands.
$>jvms
*25815(m) - org.apache.cassandra.service.CassandraDaemon*
17628( ) - jmxterm-1.0-alpha-4-uber.jar

But jmxterm is not picking up the JVM for Cassandra for some reason.

Can someone point me in the right direction?  Are there settings in the
cassandra-env.sh file I need to amend to get jmxterm to find the cass jvm?

I'm not finding much about it on Google.

Thanks,
Phil


On Tue, 18 Sep 2018 at 12:09, Philip Ó Condúin 
wrote:

> Hi All,
>
> I need a little advice.  I'm trying to access the JMX terminal using
> *jmxterm-1.0-alpha-4-uber.jar* with a very simple default install of C*
> 3.11.3
>
> I keep getting the following:
>
> [cassandra@reaper-1 conf]$ java -jar jmxterm-1.0-alpha-4-uber.jar
> Welcome to JMX terminal. Type "help" for available commands.
> $>open 1666
> *#NullPointerException: No such PID 1666*
> $>
>
> C* is running with a PID of 1666.  I've tried setting JMX_LOCAL=no and
> have even created a new VM to test it.
>
> Does anyone know what I might be doing wrong here?
>
> Kind Regards,
> Phil
>
>

-- 
Regards,
Phil


jmxterm "#NullPointerException: No such PID "

2018-09-18 Thread Philip Ó Condúin
Hi All,

I need a little advice.  I'm trying to access the JMX terminal using
*jmxterm-1.0-alpha-4-uber.jar* with a very simple default install of C*
3.11.3

I keep getting the following:

[cassandra@reaper-1 conf]$ java -jar jmxterm-1.0-alpha-4-uber.jar
Welcome to JMX terminal. Type "help" for available commands.
$>open 1666
*#NullPointerException: No such PID 1666*
$>

C* is running with a PID of 1666.  I've tried setting JMX_LOCAL=no and have
even created a new VM to test it.

Does anyone know what I might be doing wrong here?

Kind Regards,
Phil