Re: Corrupt SSTABLE over and over

2016-08-15 Thread Bryan Cheng
Hi Alaa,

Sounds like you have problems that go beyond Cassandra, likely filesystem
corruption or bad disks. I don't know enough about Windows to give you any
specific advice, but I'd try a run of chkdsk to start.
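For what it's worth, a typical invocation would look something like this (the drive letter is an assumption; substitute the volume holding the Cassandra data directory):

```
:: run from an elevated Command Prompt; E: is a placeholder for the data volume
:: /f fixes filesystem errors, /r also locates bad sectors (much slower)
chkdsk E: /f /r
```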

--Bryan

On Fri, Aug 12, 2016 at 5:19 PM, Alaa Zubaidi (PDF) 
wrote:

> Hi Bryan,
>
> Changing disk_failure_policy to best_effort and running nodetool scrub
> did not work; it generated another error:
> java.nio.file.AccessDeniedException
>
> Also tried to remove all files (data, commitlog, savedcaches) and restart
> the node fresh, and still I am getting corruption.
>
> And still nothing that indicates there is a HW issue?
> All other nodes are fine
>
> Regards,
> Alaa
>
>
> On Fri, Aug 12, 2016 at 12:00 PM, Bryan Cheng 
> wrote:
>
>> Should also add that if the scope of corruption is _very_ large, and you
>> have a good, aggressive repair policy (read: you are confident in the
>> consistency of the data elsewhere in the cluster), you may just want to
>> decommission and rebuild that node.
>>
>> On Fri, Aug 12, 2016 at 11:55 AM, Bryan Cheng 
>> wrote:
>>
>>> Looks like you're doing the offline scrub. Have you tried online?
>>>
>>> Here's my typical process for corrupt SSTables.
>>>
>>> With disk_failure_policy set to stop, examine the failing sstables. If
>>> they are very small (in the range of kbs), it is unlikely that there is any
>>> salvageable data there. Just delete them, start the machine, and schedule a
>>> repair ASAP.
>>>
>>> If they are large, then it may be worth salvaging. If the scope of
>>> corruption is reasonable (limited to a few sstables scattered among
>>> different keyspaces), set disk_failure_policy to best_effort, start the
>>> machine up, and run nodetool scrub. This is the online scrub, faster than
>>> offline scrub (at least as of 2.1.12, the last time I had to do this).
>>>
>>> Only if all else fails, attempt the very painful offline sstablescrub.
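The triage described above can be sketched as a small helper. The size threshold below is an assumption, not a number from this thread; the point is only the decision rule that tiny corrupt SSTables rarely hold salvageable data.

```python
import os

# Illustrative threshold (an assumption): corrupt SSTables of only a few KB
# rarely contain salvageable data, so deleting them and repairing from
# replicas is cheaper than scrubbing.
SMALL_SSTABLE_BYTES = 64 * 1024

def triage(size_bytes):
    """Suggest an action for a corrupt SSTable of the given size."""
    if size_bytes < SMALL_SSTABLE_BYTES:
        return "delete it, start the node, schedule a repair ASAP"
    return "set disk_failure_policy: best_effort, then run online nodetool scrub"

def triage_file(path):
    """Same decision, driven by an actual file on disk."""
    return triage(os.path.getsize(path))
```

Only if the online scrub fails too does the slower offline sstablescrub come into play.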
>>>
>>> Is the VMware client Windows? (Trying to make sure it's not just the
>>> host.) YMMV, but in the past Windows was somewhat of a neglected platform
>>> wrt Cassandra. I think you'd have a lot easier time getting help if running
>>> Linux is an option here.
>>>
>>>
>>>
>>> On Fri, Aug 12, 2016 at 9:16 AM, Alaa Zubaidi (PDF) <
>>> alaa.zuba...@pdf.com> wrote:
>>>
 Hi Jason,

 Thanks for your input...
 That's what I am afraid of.
 Did you find any HW errors in the VMware and HW logs? Any indication
 that the HW is the reason? I need to make sure this is the cause before
 asking the customer to spend more money.

 Thanks,
 Alaa

 On Thu, Aug 11, 2016 at 11:02 PM, Jason Wee  wrote:

> cassandra run on virtual server (vmware)?
>
> > I tried sstablescrub but it crashed with hs-err-pid-...
> maybe try with a larger heap allocated to sstablescrub
>
> this sstable corruption I ran into as well (on Cassandra 1.2): first I
> tried nodetool scrub, it still persisted; then offline sstablescrub, it
> still persisted; I wiped the node and it happened again; then I changed
> the hardware (disk and mem) and things went good.
>
> hth
>
> jason
>
>
> On Fri, Aug 12, 2016 at 9:20 AM, Alaa Zubaidi (PDF)
>  wrote:
> > Hi,
> >
> > I have a 16-node cluster, Cassandra 2.2.1 on Windows, local installation
> > (NOT in the cloud)
> >
> > and I am getting
> > ERROR [CompactionExecutor:2] 2016-08-12 06:51:52,983
> > CassandraDaemon.java:183 - Exception in thread
> > Thread[CompactionExecutor:2,1,main]
> > org.apache.cassandra.io.FSReadError:
> > org.apache.cassandra.io.sstable.CorruptSSTableException:
> > org.apache.cassandra.io.compress.CorruptBlockException:
> > (E:\\la-4886-big-Data.db): corruption detected, chunk at 4969092 of
> > length 10208.
> > at org.apache.cassandra.io.util.RandomAccessReader.readBytes(
> > RandomAccessReader.java:357) ~[apache-cassandra-2.2.1.jar:2.2.1]
> > 
> > 
> > ERROR [CompactionExecutor:2] ... FileUtils.java:463 - Exiting forcefully
> > due to file system exception on startup, disk failure policy "stop"
> >
> > I tried sstablescrub but it crashed with hs-err-pid-...
> > I removed the corrupted file and started the node again; after one day
> > the corruption came back. I removed the files and restarted Cassandra,
> > and it worked for a few days. Then I ran "nodetool repair"; after it
> > finished, Cassandra failed again, but with commitlog corruption. After
> > removing the commitlog files, it failed again with another sstable
> > corruption.
> >
> > I was also checking the HW, file system, and memory: the VMware logs
> > showed no HW errors, and the HW management logs showed NO problems or
> > issues.
> 

Re: New node block in autobootstrap

2016-08-15 Thread Paulo Motta
What version are you on? This seems like a typical case where there was a
problem with streaming (hanging, etc.). Do you have access to the logs?
Maybe look for streaming errors? Typically streaming errors are related to
timeouts, so you should review your Cassandra
streaming_socket_timeout_in_ms and kernel tcp_keepalive settings.
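For reference, the Cassandra side of that is a single cassandra.yaml knob; the kernel side is the net.ipv4.tcp_keepalive_* sysctls, whose two-hour defaults can outlive firewall/NAT connection state. The value below is illustrative, not a recommendation:

```yaml
# cassandra.yaml -- fail idle streaming sockets instead of letting
# them hang indefinitely (illustrative value)
streaming_socket_timeout_in_ms: 3600000    # 1 hour
```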

If you're on 2.2+ you can resume a failed bootstrap with nodetool bootstrap
resume. There were also some streaming hang problems fixed recently, so
I'd advise you to upgrade to the latest release of your particular series
for a more robust version.

Is there any reason why you didn't use the replace procedure
(-Dreplace_address) to replace the node with the same tokens? That would be
a bit faster than the remove + bootstrap procedure.
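For completeness, replacement is an ordinary bootstrap started with one extra JVM flag (the address below is a placeholder for the dead node's IP):

```shell
# cassandra-env.sh on the replacement node (remove the flag once it has joined)
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.5"
```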

2016-08-15 15:37 GMT-03:00 Jérôme Mainaud :

> Hello,
>
> A client of mine has problems when adding a node to the cluster.
> After 4 days, the node is still in joining mode; it doesn't have the same
> level of load as the others, and there seems to be no streaming from or to
> the new node.
>
> This node has a history.
>
>1. At the beginning, it was a seed in the cluster.
>2. Ops detected that client had problems with it.
>3. They tried to reset it but failed. In the process they launched
>several repair and rebuild processes on the node.
>4. Then they asked me to help them.
>5. We stopped the node,
>6. removed it from the list of seeds (more precisely it was replaced
>by another node),
>7. removed it from the cluster (I chose not to use decommission since
>node data was compromised)
>8. deleted all files from data, commitlog and savedcache directories.
>9. after the leaving process ended, it was started as a fresh new node
>and began autobootstrap.
>
>
> As I don’t have direct access to the cluster I don't have a lot of
> information, but I will have tomorrow (logs and results of some commands).
> And I can ask people for any required information.
>
> Does someone have any idea of what could have happened and what I should
> investigate first?
> What would you do to unblock the situation?
>
> Context: The cluster consists of two DC, each with 15 nodes. Average load
> is around 3 TB per node. The joining node froze a little after 2 TB.
>
> Thank you for your help.
> Cheers,
>
>
> --
> Jérôme Mainaud
> jer...@mainaud.com
>


New node block in autobootstrap

2016-08-15 Thread Jérôme Mainaud
Hello,

A client of mine has problems when adding a node to the cluster.
After 4 days, the node is still in joining mode; it doesn't have the same
level of load as the others, and there seems to be no streaming from or to
the new node.

This node has a history.

   1. At the beginning, it was a seed in the cluster.
   2. Ops detected that client had problems with it.
   3. They tried to reset it but failed. In the process they launched
   several repair and rebuild processes on the node.
   4. Then they asked me to help them.
   5. We stopped the node,
   6. removed it from the list of seeds (more precisely it was replaced by
   another node),
   7. removed it from the cluster (I chose not to use decommission since
   node data was compromised)
   8. deleted all files from data, commitlog and savedcache directories.
   9. after the leaving process ended, it was started as a fresh new node
   and began autobootstrap.
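A rough sketch of steps 5-9 as commands (the host ID is a placeholder, and the directory paths depend on the install; this assumes a standard Linux package layout):

```shell
# on any live node: drop the stopped node from the ring
nodetool removenode <host-id-from-nodetool-status>

# on the node being reset: wipe all state so it comes back as a brand-new node
rm -rf /var/lib/cassandra/data/* \
       /var/lib/cassandra/commitlog/* \
       /var/lib/cassandra/saved_caches/*

# once removenode has completed cluster-wide, start it up; with
# auto_bootstrap at its default of true it will stream its token ranges
service cassandra start
```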


As I don’t have direct access to the cluster I don't have a lot of
information, but I will have tomorrow (logs and results of some commands).
And I can ask people for any required information.

Does someone have any idea of what could have happened and what I should
investigate first?
What would you do to unblock the situation?

Context: The cluster consists of two DC, each with 15 nodes. Average load
is around 3 TB per node. The joining node froze a little after 2 TB.

Thank you for your help.
Cheers,


-- 
Jérôme Mainaud
jer...@mainaud.com


Re: unsubscibe

2016-08-15 Thread James Carman
On Mon, Aug 15, 2016 at 10:28 AM Eric Evans 
wrote:

>
> I'm always surprised when a Google search for 'unsubscribe cassandra'
> doesn't return mailing list results from people nicely telling someone
> how to unsubscribe.
>
>
Agreed.  It doesn't make for a very welcoming community to attract new
contributors, either.


Re: unsubscibe

2016-08-15 Thread Eric Evans
On Sat, Aug 13, 2016 at 7:24 PM, James Carman
 wrote:
> Was the Google stuff really necessary? Couldn't you have just nicely told
> them how to unsubscribe?

I'm always surprised when a Google search for 'unsubscribe cassandra'
doesn't return mailing list results from people nicely telling someone
how to unsubscribe.


-- 
Eric Evans
john.eric.ev...@gmail.com


unsubscribe

2016-08-15 Thread Radoslav Smilyanov
unsubscribe


Failure when setting up cassandra in cluster

2016-08-15 Thread Raimund Klein
Hi all,

Sorry if this is a fairly stupid question, but we've all only been exposed
to Cassandra very recently.

We're trying to configure a 2-node cluster with non-default credentials.
Here's what I've been doing so far based on my understanding of the
documentation. The platform is RHEL 7:


   1. Use an RPM I found from DataStax to perform a basic Cassandra
   installation.
   2. Change the temporary directory in cassandra-env.sh, because nobody is
   allowed to execute anything in /tmp.
   3. In cassandra.yaml,
   - change the cluster_name
   - empty the listen_address entry
   - define both VMs as seeds
   4. Open port 7000 in the firewall.
   5. Start cassandra.
   6. In cassandra.yaml, change the authenticator to PasswordAuthenticator.
   7. Run cqlsh -u cassandra -p cassandra -e "ALTER KEYSPACE system_auth
   WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 2
   };"
   8. Restart cassandra
   9. Perform 1-8 on the second node
   10. To create a new user, run cqlsh -u cassandra -p cassandra -e "CREATE
   USER ${CASSANDRA_USERNAME} WITH PASSWORD '${CASSANDRA_PASSWORD}' SUPERUSER;"
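Steps 3 and 6 amount to something like this fragment of cassandra.yaml (cluster name and addresses are placeholders):

```yaml
cluster_name: 'MyCluster'      # placeholder
listen_address:                # left empty: Cassandra falls back to the hostname
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.0.0.2"   # both VMs, placeholder addresses
authenticator: PasswordAuthenticator # step 6, switched on after the first start
```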

Step 10 fails with this error:

Connection error: ('Unable to connect to any servers', {'127.0.0.1':
AuthenticationFailed(u'Failed to authenticate to 127.0.0.1: code=0100 [Bad
credentials] message="org.apache.cassandra.exceptions.UnavailableException:
Cannot achieve consistency level QUORUM"',)})


What am I missing?


Cheers

Raimund


Cassandra 2.1.16 Release

2016-08-15 Thread Malte Pickhan
Hey,

I'd like to ask when you are going to release Cassandra 2.1.16, especially
because of https://issues.apache.org/jira/browse/CASSANDRA-11850

Best,

Malte