upgradesstables/cleanup/compaction strategy change

2016-05-23 Thread Erik Forsberg

Hi!

I have a 2.0.13 cluster which I have just extended, and I'm now looking 
into upgrading it to 2.1.


* The cleanup after the extension is partially done.
* I'm also looking into changing a few tables into Leveled Compaction 
Strategy.


In the interest of speeding things up by avoiding unnecessary rewrites 
of data, I'm pondering whether I can:


1. Upgrade to 2.1, then run cleanup instead of upgradesstables, getting 
cleanup + an upgrade of the sstable format to ka at the same time?


2. Upgrade to 2.1, then change the compaction strategy, getting LCS + an 
upgrade of the sstable format to ka at the same time?


Comments on that?
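
To make the two options concrete, here's a rough sketch in nodetool/cqlsh
terms; keyspace and table names are placeholders, and whether either fully
replaces upgradesstables is exactly what I'm asking about:

# Option 1: cleanup rewrites each node's sstables minus the ranges it no
# longer owns, and anything it rewrites comes out in the running node's
# current (ka) format.
nodetool cleanup my_keyspace my_table

# Option 2: switching the table to LCS makes the existing sstables get
# recompacted into levels, which also rewrites them.
cqlsh -e "ALTER TABLE my_keyspace.my_table
          WITH compaction = {'class': 'LeveledCompactionStrategy'};"

# Plain upgradesstables remains the fallback for anything neither
# operation happens to touch.
nodetool upgradesstables my_keyspace my_table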

Thanks,
\EF


Re: Extending a partially upgraded cluster - supported

2016-05-18 Thread Erik Forsberg



On 2016-05-18 20:19, Jeff Jirsa wrote:

You can’t stream between versions, so in order to grow the cluster, you’ll need 
to be entirely on 2.0 or entirely on 2.1.


OK. I was sure you can't stream between a 2.0 node and a 2.1 node, but 
if I understand you correctly you can't stream between two 2.1 nodes 
unless the sstables on the source node have been upgraded to "ka", i.e. 
the 2.1 sstable version?


Looks like it's extend first, upgrade later, given that we're a bit 
close on disk capacity.


Thanks,
\EF


If you go to 2.1 first, be sure you run upgradesstables before you try to 
extend the cluster.
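
A quick way to sanity-check that step, as a sketch (the data directory is
the stock default and may differ per install):

# Rewrite any sstables still in an older format, then count what's left:
nodetool upgradesstables
find /var/lib/cassandra/data -name '*-Data.db' ! -name '*-ka-*' | wc -l
# 0 means everything on the node is already in the 2.1 (ka) format
# (snapshots and backups will also show up in this rough count).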





On 5/18/16, 11:17 AM, "Erik Forsberg" <forsb...@opera.com> wrote:


Hi!

I have a 2.0.13 cluster which I need to do two things with:

* Extend it
* Upgrade to 2.1.14

I'm pondering in what order to do things. Is it a supported operation to
extend a partially upgraded cluster, i.e. a cluster upgraded to 2.1
where not all sstables have been upgraded?

If I do that, will the sstables written on the new nodes be in the 2.1
format when I add them? Or will they be written in the 2.0 format so
I'll have to run upgradesstables anyway?

The cleanup I do on the existing nodes will write the new 2.1 format,
right?

There might be other reasons not to do this, one being that it's seldom
wise to do many operations at once. So please enlighten me on how bad an
idea this is :-)

Thanks,
\EF




Extending a partially upgraded cluster - supported

2016-05-18 Thread Erik Forsberg

Hi!

I have a 2.0.13 cluster which I need to do two things with:

* Extend it
* Upgrade to 2.1.14

I'm pondering in what order to do things. Is it a supported operation to 
extend a partially upgraded cluster, i.e. a cluster upgraded to 2.1 
where not all sstables have been upgraded?


If I do that, will the sstables written on the new nodes be in the 2.1 
format when I add them? Or will they be written in the 2.0 format so 
I'll have to run upgradesstables anyway?


The cleanup I do on the existing nodes will write the new 2.1 format, 
right?


There might be other reasons not to do this, one being that it's seldom 
wise to do many operations at once. So please enlighten me on how bad an 
idea this is :-)


Thanks,
\EF


Lots of hints, but only on a few nodes

2016-05-10 Thread Erik Forsberg
I have this situation where a few (like, 3-4 out of 84) nodes misbehave. 
Very long GC pauses, dropping out of cluster etc.


This happens while loading data (via CQL), and analyzing metrics it 
looks like on these few nodes, a lot of hints are being generated close 
to the time when they start to misbehave.


Since this is Cassandra 2.0.13, which has a less than optimal hints 
implementation, large numbers of hints are a GC troublemaker.


Again looking at metrics, it looks like hints are being generated for a 
large number of nodes, so it doesn't look like the destination nodes are 
at fault. So, I'm confused.


Any Hints (pun intended) on what could cause a few nodes to generate 
more hints than the rest of the cluster?


Regards,
\EF


Re: A few misbehaving nodes

2016-04-21 Thread Erik Forsberg



On 2016-04-19 15:54, sai krishnam raju potturi wrote:

hi;
   do we see any hung process like Repairs on those 3 nodes?  what 
does "nodetool netstats" show??


No hung process from what I can see.

root@cssa02-06:~# nodetool tpstats
Pool Name                Active   Pending   Completed   Blocked   All time blocked
ReadStage                     0         0     1530227         0                  0
RequestResponseStage          0         0    19230947         0                  0
MutationStage                 0         0    37059234         0                  0
ReadRepairStage               0         0       80178         0                  0
ReplicateOnWriteStage         0         0           0         0                  0
GossipStage                   0         0       43003         0                  0
CacheCleanupExecutor          0         0           0         0                  0
MigrationStage                0         0           0         0                  0
MemoryMeter                   0         0         267         0                  0
FlushWriter                   0         0         202         0                  5
ValidationExecutor            0         0         212         0                  0
InternalResponseStage         0         0           0         0                  0
AntiEntropyStage              0         0         427         0                  0
MemtablePostFlusher           0         0         669         0                  0
MiscStage                     0         0         212         0                  0
PendingRangeCalculator        0         0          70         0                  0
CompactionExecutor            0         0        1206         0                  0
commitlog_archiver            0         0           0         0                  0
HintedHandoff                 0         1         113         0                  0


Message type   Dropped
RANGE_SLICE  1
READ_REPAIR  0
PAGED_RANGE  0
BINARY   0
READ   219
MUTATION 3
_TRACE   0
REQUEST_RESPONSE 2
COUNTER_MUTATION 0

root@cssa02-06:~# nodetool netstats
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 75317
Mismatch (Blocking): 0
Mismatch (Background): 11
Pool NameActive   Pending  Completed
Commandsn/a 1   19248846
Responses   n/a 0   19875699

\EF


How are writes handled while adding nodes to cluster?

2015-10-06 Thread Erik Forsberg
Hi!

How are writes handled while I'm adding a node to a cluster, i.e. while
the new node is in JOINING state?

Are they queued up as hinted handoffs, or are they being written to the
joining node?

In the former case I guess I have to make sure my max_hint_window_in_ms
is long enough for the node to become NORMAL or hints will get dropped
and I must do repair. Am I right?
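
For reference, a small sketch of where that window is configured and how to
keep an eye on the hint backlog; the path assumes a stock package install:

# cassandra.yaml: hints stop being stored for a node once it has been
# unreachable for longer than this window (the stock default is 3 hours).
grep max_hint_window_in_ms /etc/cassandra/conf/cassandra.yaml
# max_hint_window_in_ms: 10800000

# The hinted handoff backlog can be watched per node while the join runs:
nodetool tpstats | grep HintedHandoff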

Thanks,
\EF


One node misbehaving (lots of GC), ideas?

2015-04-15 Thread Erik Forsberg
Hi!

We having problems with one node (out of 56 in total) misbehaving.
Symptoms are:

* High number of full CMS old space collections during early morning
when we're doing bulkloads. Yes, bulkloads, not CQL, and only a few
thrift insertions.
* Really long stop-the-world GC events (I've seen up to 50 seconds) for
both CMS and ParNew.
* CPU usage higher during early morning hours compared to other nodes.
* The large number of Garbage Collections *seems* to correspond to doing
a lot of compactions (SizeTiered for most of our CFs, Leveled for a few
small ones)
* Node losing track of which other nodes are up and keeping that state
until restart (this I think is a bug caused by the GC behaviour, with
the stop-the-world pauses making the node not accept gossip connections from
other nodes)

This is on 2.0.13 with vnodes (256 per node).

All other nodes behave normally, with a few (2-3) full CMS old space
collections in the same 3h period in which the trouble node makes some 30.
Heap space is 8G, with NEW_SIZE set to 800M. With 6G/800M the
problem was even worse (it seems; this is a bit hard to debug as it
happens *almost* every night).

nodetool status shows that although we have a certain imbalance in the
cluster, this node is neither the most nor the least loaded. I.e. we
have between 1.6% and 2.1% in the Owns column, and the troublesome
node reports 1.7%.

All nodes are under puppet control, so configuration is the same
everywhere.

We're running NetworkTopologyStrategy with rack awareness, and here's a
deviation from recommended settings - we have slightly varying number of
nodes in the racks:

 15 cssa01
 15 cssa02
 13 cssa03
 13 cssa04

The affected node is in the cssa04 rack. Could this mean I have some
kind of hotspot situation? Why would that show up as more GC work?

I'm quite puzzled here, so I'm looking for hints on how to identify what
is causing this.

Regards,
\EF






Re: Cluster status instability

2015-04-08 Thread Erik Forsberg
To elaborate a bit on what Marcin said:

* Once a node starts to believe that a few other nodes are down, it seems
to stay that way for a very long time (hours). I'm not even sure it will
recover without a restart.
* I've tried to stop and then start gossip with nodetool on the node that
thinks several other nodes are down. Did not help.
* nodetool gossipinfo when run on an affected node claims STATUS:NORMAL for
all nodes (including the ones marked as down in status output)
* It is quite possible that the problem starts at the time of day when we
have a lot of bulkloading going on. But why does it then stay for several
hours after the load goes down?
* I have the feeling this started with our upgrade from 1.2.18 to 2.0.12
about a month ago, but I have no hard data to back that up.

Regarding region/snitch - this is not an AWS deployment, we run on our own
datacenter with GossipingPropertyFileSnitch.

Right now I have this situation with one node (04-05) thinking that there
are 4 nodes down. The rest of the cluster (56 nodes in total) thinks all
nodes are up. Load on cluster right now is minimal, there's no GC going on.
Heap usage is approximately 3.5/6Gb.

root@cssa04-05:~# nodetool status|grep DN
DN  2001:4c28:1:413:0:1:2:5   1.07 TB256 1.8%
114ff46e-57d0-40dd-87fb-3e4259e96c16  rack2
DN  2001:4c28:1:413:0:1:2:6   1.06 TB256 1.8%
b161a6f3-b940-4bba-9aa3-cfb0fc1fe759  rack2
DN  2001:4c28:1:413:0:1:2:13  896.82 GB  256 1.6%
4a488366-0db9-4887-b538-4c5048a6d756  rack2
DN  2001:4c28:1:413:0:1:3:7   1.04 TB256 1.8%
95cf2cdb-d364-4b30-9b91-df4c37f3d670  rack3

Excerpt from nodetool gossipinfo showing one node that status thinks is
down (2:5) and one that status thinks is up (3:12):

/2001:4c28:1:413:0:1:2:5
  generation:1427712750
  heartbeat:2310212
  NET_VERSION:7
  RPC_ADDRESS:0.0.0.0
  RELEASE_VERSION:2.0.13
  RACK:rack2
  LOAD:1.172524771195E12
  INTERNAL_IP:2001:4c28:1:413:0:1:2:5
  HOST_ID:114ff46e-57d0-40dd-87fb-3e4259e96c16
  DC:iceland
  SEVERITY:0.0
  STATUS:NORMAL,100493381707736523347375230104768602825
  SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda
/2001:4c28:1:413:0:1:3:12
  generation:1427714889
  heartbeat:2305710
  NET_VERSION:7
  RPC_ADDRESS:0.0.0.0
  RELEASE_VERSION:2.0.13
  RACK:rack3
  LOAD:1.047542503234E12
  INTERNAL_IP:2001:4c28:1:413:0:1:3:12
  HOST_ID:bb20ddcb-0a14-4d91-b90d-fb27536d6b00
  DC:iceland
  SEVERITY:0.0
  STATUS:NORMAL,100163259989151698942931348962560111256
  SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda

I also tried disablegossip + enablegossip on 02-05 to see if that made
04-05 mark it as up, with no success.

Please let me know what other debug information I can provide.

Regards,
\EF

On Thu, Apr 2, 2015 at 6:56 PM, daemeon reiydelle daeme...@gmail.com
wrote:

 Do you happen to be using a tool like Nagios or Ganglia that is able to
 report utilization (CPU, load, disk I/O, network)? There are plugins for
 both that will also notify you (depending on whether you enabled the
 intermediate GC logging) about what is happening.



 On Thu, Apr 2, 2015 at 8:35 AM, Jan cne...@yahoo.com wrote:

 Marcin  ;

 are all your nodes within the same Region?
 If not in the same region, what is the Snitch type that you are using?

 Jan/



   On Thursday, April 2, 2015 3:28 AM, Michal Michalski 
 michal.michal...@boxever.com wrote:


 Hey Marcin,

 Are they actually going up and down repeatedly (flapping) or just down
 and they never come back?
 There might be different reasons for flapping nodes, but to list what I
 have at the top of my head right now:

 1. Network issues. I don't think it's your case, but you can read about
 the issues some people are having when deploying C* on AWS EC2 (keyword to
 look for: phi_convict_threshold)

 2. Heavy load. Node is under heavy load because of massive number of
 reads / writes / bulkloads or e.g. unthrottled compaction etc., which may
 result in extensive GC.

 Could any of these be a problem in your case? I'd start by
 investigating the GC logs, e.g. to see how long the stop-the-world full
 GC takes (GC logs should be on by default from what I can see [1])

 [1] https://issues.apache.org/jira/browse/CASSANDRA-5319

 Michał


 Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com

 On 2 April 2015 at 11:05, Marcin Pietraszek mpietras...@opera.com
 wrote:

 Hi!

 We have a 56 node cluster with C* 2.0.13 + the CASSANDRA-9036 patch
 installed. Assume we have nodes A, B, C, D, E. On some irregular basis,
 one of those nodes starts to report that a subset of the other nodes is in
 DN state, although the C* daemon on all nodes is running:

 A$ nodetool status
 UN B
 DN C
 DN D
 UN E

 B$ nodetool status
 UN A
 UN C
 UN D
 UN E

 C$ nodetool status
 DN A
 UN B
 UN D
 UN E

 After a restart of node A, C and D report that A is in UN, and A also
 claims that the whole cluster is in UN state. Right now I don't have any
 clear steps to reproduce the situation. Do you guys have any idea
 what could be causing such behaviour? How this 

changes to metricsReporterConfigFile require a restart of cassandra?

2015-02-11 Thread Erik Forsberg
Hi!

I was pleased to find out that cassandra 2.0.x has added support for
pluggable metrics export, which even includes a graphite metrics sender.

Question: Will changes to the metricsReporterConfigFile require a
restart of cassandra to take effect?

I.e, if I want to add a new exported metric to that file, will I have to
restart my cluster?
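
For concreteness, this is roughly what such a file looks like for the
graphite reporter, as far as I understand the metrics-reporter-config
format; the host, prefix and pattern below are made-up examples:

# Hypothetical metricsReporterConfigFile contents (YAML read by the
# metrics-reporter-config library); "adding a metric" means adding a pattern.
cat > /etc/cassandra/conf/metrics-graphite.yaml <<'EOF'
graphite:
  - period: 60
    timeunit: 'SECONDS'
    prefix: 'cassandra.node1'
    hosts:
      - host: 'graphite.example.com'
        port: 2003
    predicate:
      color: 'white'
      useQualifiedName: true
      patterns:
        - '^org.apache.cassandra.metrics.ColumnFamily.+'
EOF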

Thanks,
\EF


Anonymous user in permissions system?

2015-02-05 Thread Erik Forsberg
Hi!

Is there such a thing as the anonymous/unauthenticated user in the
cassandra permissions system?

What I would like to do is to grant select, i.e. provide read-only
access, to users which have not presented a username and password.

Then grant update/insert to other users which have presented a username
and (correct) password.

Doable?
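
Not a direct answer to the anonymous part, but for comparison, the explicit
two-user variant would look roughly like this (user, password and keyspace
names are placeholders):

cat > grants.cql <<'EOF'
CREATE USER readonly WITH PASSWORD 'readonly' NOSUPERUSER;
GRANT SELECT ON KEYSPACE my_keyspace TO readonly;
CREATE USER writer WITH PASSWORD 'secret' NOSUPERUSER;
GRANT MODIFY ON KEYSPACE my_keyspace TO writer;
EOF
cqlsh -u cassandra -p cassandra -f grants.cql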

Regards,
\EF


Re: Working with legacy data via CQL

2014-11-19 Thread Erik Forsberg
On 2014-11-19 01:37, Robert Coli wrote:
 
 Thanks, I can reproduce the issue with that, and I should be able to
 look into it tomorrow.  FWIW, I believe the issue is server-side,
 not in the driver.  I may be able to suggest a workaround once I
 figure out what's going on.
 
 
 Is there a JIRA tracking this issue? I like being aware of potential
 issues with legacy tables ... :D

I created one, just for you! :-)

https://issues.apache.org/jira/browse/CASSANDRA-8339

\EF


Re: Working with legacy data via CQL

2014-11-17 Thread Erik Forsberg
On 2014-11-15 01:24, Tyler Hobbs wrote:
 What version of cassandra did you originally create the column family
 in?  Have you made any schema changes to it through cql or
 cassandra-cli, or has it always been exactly the same?

Oh that's a tough question given that the cluster has been around since
2011. So the CF was probably created in Cassandra 0.7 or 0.8 via thrift
calls from pycassa, and I don't think there have been any schema changes
to it since.

Thanks,
\EF

 
 On Wed, Nov 12, 2014 at 2:06 AM, Erik Forsberg forsb...@opera.com
 mailto:forsb...@opera.com wrote:
 
 On 2014-11-11 19:40, Alex Popescu wrote:
  On Tuesday, November 11, 2014, Erik Forsberg forsb...@opera.com 
 mailto:forsb...@opera.com
  mailto:forsb...@opera.com mailto:forsb...@opera.com wrote:
 
 
  You'll have better chances to get an answer about the Python driver on
  its own mailing
  list  
 https://groups.google.com/a/lists.datastax.com/forum/#!forum/python-driver-user
 
 As I said, this also happens when using cqlsh:
 
 cqlsh:test SELECT column1,value from Users where key =
 a6b07340-047c-4d4c-9a02-1b59eabf611c and column1 = 'date_created';
 
  column1  | value
 --+--
  date_created | '\x00\x00\x00\x00Ta\xf3\xe0'
 
 (1 rows)
 
 Failed to decode value '\x00\x00\x00\x00Ta\xf3\xe0' (for column 'value')
 as text: 'utf8' codec can't decode byte 0xf3 in position 6: unexpected
 end of data
 
 So let me rephrase: How do I work with data where the table has metadata
 that makes some columns differ from the main validation class? From
 cqlsh, or the python driver, or any driver?
 
 Thanks,
 \EF
 
 
 
 
 -- 
 Tyler Hobbs
 DataStax http://datastax.com/



Re: Working with legacy data via CQL

2014-11-17 Thread Erik Forsberg
On 2014-11-17 09:56, Erik Forsberg wrote:
 On 2014-11-15 01:24, Tyler Hobbs wrote:
 What version of cassandra did you originally create the column family
 in?  Have you made any schema changes to it through cql or
 cassandra-cli, or has it always been exactly the same?
 
 Oh that's a tough question given that the cluster has been around since
 2011. So the CF was probably created in Cassandra 0.7 or 0.8 via thrift
 calls from pycassa, and I don't think there have been any schema changes
 to it since.

Actually, I don't think it matters. I created a minimal repeatable piece
of Python code (see below). Running that against a 2.0.11 server
(creating a fresh keyspace and CF, inserting some data with
thrift/pycassa, then trying to extract the data that has a different
validation class), both the python-driver and cqlsh bail out.

cqlsh example after running the below script:

cqlsh:badcql select * from Users where column1 = 'default_account_id'
ALLOW FILTERING;

value \xf9\x8bu}!\xe9C\xbb\xa7=\xd0\x8a\xff';\xe5 (in col 'value')
can't be deserialized as text: 'utf8' codec can't decode byte 0xf9 in
position 0: invalid start byte

cqlsh:badcql select * from Users where column1 = 'date_created' ALLOW
FILTERING;

value '\x00\x00\x00\x00Ti\xe0\xbe' (in col 'value') can't be
deserialized as text: 'utf8' codec can't decode bytes in position 6-7:
unexpected end of data


So the question remains - how do I work with this data from cqlsh and /
or the python driver?

Thanks,
\EF

--repeatable example--
#!/usr/bin/env python

# Run this in a virtualenv with pycassa and cassandra-driver installed via pip
import pycassa
import cassandra
import calendar
import traceback
import time
from uuid import uuid4

keyspace = "badcql"

sysmanager = pycassa.system_manager.SystemManager("localhost")
sysmanager.create_keyspace(keyspace,
    strategy_options={'replication_factor': '1'})
sysmanager.create_column_family(keyspace, "Users",
    key_validation_class=pycassa.system_manager.LEXICAL_UUID_TYPE,
    comparator_type=pycassa.system_manager.ASCII_TYPE,
    default_validation_class=pycassa.system_manager.UTF8_TYPE)
sysmanager.create_index(keyspace, "Users", "username",
    pycassa.system_manager.UTF8_TYPE)
sysmanager.create_index(keyspace, "Users", "email",
    pycassa.system_manager.UTF8_TYPE)
sysmanager.alter_column(keyspace, "Users", "default_account_id",
    pycassa.system_manager.LEXICAL_UUID_TYPE)
sysmanager.create_index(keyspace, "Users", "active",
    pycassa.system_manager.INT_TYPE)
sysmanager.alter_column(keyspace, "Users", "date_created",
    pycassa.system_manager.LONG_TYPE)

pool = pycassa.pool.ConnectionPool(keyspace, ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, "Users")

user_uuid = uuid4()

cf.insert(user_uuid, {'username': 'test_username', 'auth_method': 'ldap',
                      'email': 't...@example.com', 'active': 1,
                      'date_created': long(calendar.timegm(time.gmtime())),
                      'default_account_id': uuid4()})

from cassandra.cluster import Cluster
cassandra_cluster = Cluster(["localhost"])
cassandra_session = cassandra_cluster.connect(keyspace)
print "username", cassandra_session.execute(
    'SELECT value from "Users" where key = %s and column1 = %s',
    (user_uuid, 'username',))
print "email", cassandra_session.execute(
    'SELECT value from "Users" where key = %s and column1 = %s',
    (user_uuid, 'email',))
try:
    print "default_account_id", cassandra_session.execute(
        'SELECT value from "Users" where key = %s and column1 = %s',
        (user_uuid, 'default_account_id',))
except Exception as e:
    print "Exception trying to get default_account_id", traceback.format_exc()
    cassandra_session = cassandra_cluster.connect(keyspace)

try:
    print "active", cassandra_session.execute(
        'SELECT value from "Users" where key = %s and column1 = %s',
        (user_uuid, 'active',))
except Exception as e:
    print "Exception trying to get active", traceback.format_exc()
    cassandra_session = cassandra_cluster.connect(keyspace)

try:
    print "date_created", cassandra_session.execute(
        'SELECT value from "Users" where key = %s and column1 = %s',
        (user_uuid, 'date_created',))
except Exception as e:
    print "Exception trying to get date_created", traceback.format_exc()
-- end of example --


Re: Working with legacy data via CQL

2014-11-12 Thread Erik Forsberg
On 2014-11-11 19:40, Alex Popescu wrote:
 On Tuesday, November 11, 2014, Erik Forsberg forsb...@opera.com
 mailto:forsb...@opera.com wrote:
 
 
 You'll have better chances to get an answer about the Python driver on
 its own mailing
 list  
 https://groups.google.com/a/lists.datastax.com/forum/#!forum/python-driver-user

As I said, this also happens when using cqlsh:

cqlsh:test SELECT column1,value from Users where key =
a6b07340-047c-4d4c-9a02-1b59eabf611c and column1 = 'date_created';

 column1  | value
--+--
 date_created | '\x00\x00\x00\x00Ta\xf3\xe0'

(1 rows)

Failed to decode value '\x00\x00\x00\x00Ta\xf3\xe0' (for column 'value')
as text: 'utf8' codec can't decode byte 0xf3 in position 6: unexpected
end of data

So let me rephrase: How do I work with data where the table has metadata
that makes some columns differ from the main validation class? From
cqlsh, or the python driver, or any driver?

Thanks,
\EF


Working with legacy data via CQL

2014-11-11 Thread Erik Forsberg
Hi!

I have some data in a table created using thrift. In cassandra-cli, the
'show schema' output for this table is:

create column family Users
  with column_type = 'Standard'
  and comparator = 'AsciiType'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'LexicalUUIDType'
  and column_metadata = [
{column_name : 'date_created',
validation_class : LongType},
{column_name : 'active',
validation_class : IntegerType,
index_name : 'Users_active_idx_1',
index_type : 0},
{column_name : 'email',
validation_class : UTF8Type,
index_name : 'Users_email_idx_1',
index_type : 0},
{column_name : 'username',
validation_class : UTF8Type,
index_name : 'Users_username_idx_1',
index_type : 0},
{column_name : 'default_account_id',
validation_class : LexicalUUIDType}];

From cqlsh, it looks like this:

[cqlsh 4.1.1 | Cassandra 2.0.11 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh:test describe table Users;

CREATE TABLE Users (
  key 'org.apache.cassandra.db.marshal.LexicalUUIDType',
  column1 ascii,
  active varint,
  date_created bigint,
  default_account_id 'org.apache.cassandra.db.marshal.LexicalUUIDType',
  email text,
  username text,
  value text,
  PRIMARY KEY ((key), column1)
) WITH COMPACT STORAGE;

CREATE INDEX Users_active_idx_12 ON Users (active);

CREATE INDEX Users_email_idx_12 ON Users (email);

CREATE INDEX Users_username_idx_12 ON Users (username);

Now, when I try to extract data from this using cqlsh or the
python-driver, I have no problems getting data for the columns which are
actually UTF8, but for those where column_metadata has been set to
something else, there's trouble. Example using the python driver:

-- snip --

In [8]: u = uuid.UUID(a6b07340-047c-4d4c-9a02-1b59eabf611c)

In [9]: sess.execute('SELECT column1,value from Users where key = %s
and column1 = %s', [u, 'username'])
Out[9]: [Row(column1='username', value=u'uc6vf')]

In [10]: sess.execute('SELECT column1,value from Users where key = %s
and column1 = %s', [u, 'date_created'])
---
UnicodeDecodeErrorTraceback (most recent call last)
ipython-input-10-d06f98a160e1 in module()
 1 sess.execute('SELECT column1,value from Users where key = %s
and column1 = %s', [u, 'date_created'])

/home/forsberg/dev/virtualenvs/ospapi/local/lib/python2.7/site-packages/cassandra/cluster.pyc
in execute(self, query, parameters, timeout, trace)
   1279 future = self.execute_async(query, parameters, trace)
   1280 try:
- 1281 result = future.result(timeout)
   1282 finally:
   1283 if trace:

/home/forsberg/dev/virtualenvs/ospapi/local/lib/python2.7/site-packages/cassandra/cluster.pyc
in result(self, timeout)
   2742 return PagedResult(self, self._final_result)
   2743 elif self._final_exception:
- 2744 raise self._final_exception
   2745 else:
   2746 raise OperationTimedOut(errors=self._errors,
last_host=self._current_host)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 6:
unexpected end of data

-- snap --

cqlsh gives me similar errors.

Can I tell the python driver to parse some column values as integers, or
is this an unsupported case?

For sure this is an ugly table, but I have data in it, and I would like
to avoid having to rewrite all my tools at once, so if I could support
it from CQL that would be great.
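
As a quick sanity check of what those bytes are (assuming, as the LongType
column metadata suggests, a big-endian 8-byte long), they decode fine by hand:

# Decode the raw value cqlsh printed for date_created (Python 2, like the
# rest of this thread):
python -c "import struct; print struct.unpack('>q', '\x00\x00\x00\x00Ta\xf3\xe0')[0]"
# -> 1415705568, i.e. a plausible unix timestamp from November 2014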

Regards,
\EF


Running out of disk at bootstrap in low-disk situation

2014-09-20 Thread Erik Forsberg
Hi!

We have unfortunately managed to put ourselves in a situation where we are
really close to full disks on our existing 27 nodes.

We are now trying to add 15 more nodes, but are running into out-of-disk-space
problems on the new nodes while they join.

We're using vnodes, on Cassandra 1.2.18 (yes, I know that's old, and I'll
upgrade as soon as I'm out of this problematic situation).

I've added all 15 nodes, with some time in between - definitely more
than the 2-minute rule. But it seems like compaction is not keeping up with
the incoming data. Or at least that's my theory.

What are the recommended settings to avoid this problem? I have now set
the compaction throughput to 0 for unlimited compaction bandwidth, hoping that
will help (will it?)

Will it help to lower the streaming throughput too? I'm unsure about the
latter since from observation it seems that compaction will not start until
it has finished streaming from a node. With 27 nodes sharing the incoming
bandwidth, all of them will take equally long to finish and then the
compaction can occur. I guess I could limit streaming bandwidth on some of
the source nodes too. Or am I completely wrong here?
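
For reference, the two throttles in question as a rough sketch (the path
assumes a stock 1.2 package install; values are just what I'd try during the
join, not recommendations):

# Unthrottle compaction on the joining node for the duration of the join:
nodetool -h localhost setcompactionthroughput 0

# Streaming bandwidth is capped per source node in cassandra.yaml:
grep stream_throughput_outbound /etc/cassandra/conf/cassandra.yaml
# lower stream_throughput_outbound_megabits_per_sec on the sources to slow
# down the inflow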

Other ideas most welcome.

Regards,
\EF


Restart joining node

2014-09-20 Thread Erik Forsberg
Hi!

On the same subject as before - due to full disks during bootstrap, my
joining nodes are stuck. What's the correct procedure here? Will a plain
restart of the node do the right thing, i.e. continue where the bootstrap
stopped, or is it better to clean the data directories before starting the
daemon again?

Regards,
\EF


Re: LeveledCompaction, streaming bulkload, and lots of small sstables

2014-08-20 Thread Erik Forsberg
On 2014-08-18 19:52, Robert Coli wrote:
 On Mon, Aug 18, 2014 at 6:21 AM, Erik Forsberg forsb...@opera.com
 mailto:forsb...@opera.com wrote:
 
 Is there some configuration knob I can tune to make this happen faster?
 I'm getting a bit confused by the description for min_sstable_size,
 bucket_high, bucket_low etc - and I'm not sure if they apply in this
 case.
 
 
 You probably don't want to use multi-threaded compaction, it is removed
 upstream.
 
 nodetool setcompactionthroughput 0
 
 Assuming you have enough IO headroom etc.

OK. I disabled multithreaded compaction and gave it a bit more throughput
to play with, but I still don't think that's the full story.

What I see is the following case:

1) My hadoop cluster is bulkloading around 1000 sstables to the
Cassandra cluster.

2) Cassandra will start compacting.

With SizeTiered, I would see multiple ongoing compactions on the CF in
question, each taking on 32 sstables and compacting to one, all of them
running at the same time.

With Leveled, I see only one compaction, taking on 32 sstables and
compacting them to one. When that finishes, it starts another one. So
it's essentially a serial process, and it takes much longer than
it does with SizeTiered. While this compaction is ongoing, read
performance is not very good.

http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2
mentions LCS is parallelized in Cassandra 1.2, but maybe that patch
doesn't cover my use case (although I realize that my use case is maybe
a bit weird)

So my question is if this is something I can tune? I'm running 1.2.18
now, but am strongly considering upgrade to 2.0.X.

Regards,
\EF




LeveledCompaction, streaming bulkload, and lots of small sstables

2014-08-18 Thread Erik Forsberg
Hi!

I'm bulkloading via streaming from Hadoop to my Cassandra cluster. This
results in a rather large set of relatively small (~1MiB) sstables as
the number of mappers that generate sstables on the hadoop cluster is high.

With SizeTieredCompactionStrategy, the cassandra cluster would quickly
compact all these small sstables into decently sized sstables.

With LeveledCompactionStrategy however, it takes a much longer time. I
have multithreaded_compaction: true, but it is only taking on 32
sstables at a time in one single compaction task, so when it starts with
~1500 sstables, it takes quite some time. I'm not running out of I/O.

Is there some configuration knob I can tune to make this happen faster?
I'm getting a bit confused by the description for min_sstable_size,
bucket_high, bucket_low etc - and I'm not sure if they apply in this case.

I'm pondering options for decreasing the number of sstables being
streamed from the hadoop side, but whether that is possible remains to be seen.

Thanks!
\EF


sstableloader and ttls

2014-08-16 Thread Erik Forsberg
Hi!

If I use sstableloader to load data to a cluster, and the source
sstables contain some columns where the TTL has expired, i.e. the
sstable has not yet been compacted - will those entries be properly
removed on the destination side?

Thanks,
\EF


Running sstableloader from live Cassandra server

2014-08-16 Thread Erik Forsberg
Hi!

I'm looking into moving some data from one Cassandra cluster to another,
both of them running Cassandra 1.2.13 (or maybe some later 1.2 version
if that helps me avoid some fatal bug). Sstableloader will probably be
the right thing for me, and given the size of my tables, I will want to
run the sstableloader on the source cluster, but at the same time, that
source cluster needs to keep running to serve data to clients.

If I understand the docs right, this means I will have to:

1. Bring up a new network interface on each of my source nodes. No
problem, I have an IPv6 /64 to choose from :-)

2. Put a cassandra.yaml in the classpath of the sstableloader that
differs from the one in /etc/cassandra/conf, i.e. the one used by the
source cluster's cassandra, with the following:

* listen_address set to my new interface.
* rpc_address set to my new interface.
* rpc_port set as on the destination cluster (i.e. 9160)
* cluster_name set as on the destination cluster.
* storage_port as on the destination cluster (i.e. 7000)

Given the above I should be able to run sstableloader on the nodes of my
source cluster, even with source cluster cassandra daemon running.

Am I right, or did I miss anything?

Thanks,
\EF


EOFException in bulkloader, then IllegalStateException

2014-01-27 Thread Erik Forsberg

Hi!

I'm bulkloading from Hadoop to Cassandra. Currently in the process of 
moving to new hardware for both Hadoop and Cassandra, and while 
testrunning bulkload, I see the following error:


Exception in thread Streaming to /2001:4c28:1:413:0:1:1:12:1 
java.lang.RuntimeException: java.io.EOFException at 
com.google.common.base.Throwables.propagate(Throwables.java:155) at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) 
at java.lang.Thread.run(Thread.java:662) Caused by: java.io.EOFException 
at java.io.DataInputStream.readInt(DataInputStream.java:375) at 
org.apache.cassandra.streaming.FileStreamTask.receiveReply(FileStreamTask.java:193) 
at 
org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:180) 
at 
org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91) 
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
... 3 more


I see no exceptions related to this on the destination node 
(2001:4c28:1:413:0:1:1:12:1).


This makes the whole map task fail with:

2014-01-27 10:46:50,878 ERROR org.apache.hadoop.security.UserGroupInformation: 
PriviledgedActionException as:forsberg (auth:SIMPLE) cause:java.io.IOException: 
Too many hosts failed: [/2001:4c28:1:413:0:1:1:12]
2014-01-27 10:46:50,878 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: Too many hosts failed: [/2001:4c28:1:413:0:1:1:12]
at 
org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:244)
at 
org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:209)
at 
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:540)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:650)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.Child.main(Child.java:260)
2014-01-27 10:46:50,880 INFO org.apache.hadoop.mapred.Task: Runnning cleanup 
for the task

The failed task was on hadoop worker node hdp01-12-4.

However, hadoop later retries this map task on a different hadoop worker node 
(hdp01-10-2), and that retry succeeds.

So that's weird, but I could live with it. Now, however, comes the real trouble 
- the hadoop job does not finish due to one task running on hdp01-12-4 being 
stuck with this:

Exception in thread Streaming to /2001:4c28:1:413:0:1:1:12:1 
java.lang.IllegalStateException: target reports current file is 
/opera/log2/hadoop/mapred/local/taskTracker/forsberg/jobcache/job_201401161243_0288/attempt_201401161243_0288_m_000473_0/work/tmp/iceland_test/Data_hourly/iceland_test-Data_hourly-ib-1-Data.db
 but is 
/opera/log6/hadoop/mapred/local/taskTracker/forsberg/jobcache/job_201401161243_0288/attempt_201401161243_0288_m_00_0/work/tmp/iceland_test/Data_hourly/iceland_test-Data_hourly-ib-1-Data.db
at 
org.apache.cassandra.streaming.StreamOutSession.validateCurrentFile(StreamOutSession.java:154)
at 
org.apache.cassandra.streaming.StreamReplyVerbHandler.doVerb(StreamReplyVerbHandler.java:45)
at 
org.apache.cassandra.streaming.FileStreamTask.receiveReply(FileStreamTask.java:199)
at 
org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:180)
at 
org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)

This just sits there forever, or at least until the hadoop task timeout kicks 
in.

So two questions here:

1) Any clues on what might cause the first EOFException? It seems to appear for 
*some* of my bulkloads. Not all, but frequent enough to be a problem. Like, 
every 10:th bulkload I do seems to have the problem.

2) The second problem I have a feeling could be related to 
https://issues.apache.org/jira/browse/CASSANDRA-4223, but with the extra quirk 
that with the bulkload case, we have *multiple java processes* creating 
streaming sessions on the same host, so streaming session IDs are not unique.

I'm thinking 2) happens because the EOFException made the streaming session in 
1) sit around on the target node without being closed.

This is on Cassandra 1.2.1. I know that's pretty old, but I would like to avoid 
upgrading 

Re: EOFException in bulkloader, then IllegalStateException

2014-01-27 Thread Erik Forsberg

On 2014-01-27 12:56, Erik Forsberg wrote:
This is on Cassandra 1.2.1. I know that's pretty old, but I would like 
to avoid upgrading until I have made this migration from old to new 
hardware. Upgrading to 1.2.13 might be an option.


Update: Exactly the same behaviour on Cassandra 1.2.13.

Thanks,
\EF


Graveyard compactions, when do they occur?

2012-03-28 Thread Erik Forsberg

Hi!

I was trying out the truncate command in cassandra-cli.

http://wiki.apache.org/cassandra/CassandraCli08 says "A snapshot of the 
data is created, which is deleted asyncronously during a 'graveyard' 
compaction."


When do graveyard compactions happen? Do I have to trigger them somehow?
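
Not an answer to when the graveyard compaction itself runs, but if the point
is reclaiming disk space, the snapshot that truncate leaves behind can also
be dropped by hand:

# Drops all snapshots on the node (a keyspace name can be given to narrow it):
nodetool -h localhost clearsnapshot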

Thanks,
\EF


On Bloom filters and Key Cache

2012-03-21 Thread Erik Forsberg

Hi!

We're currently testing Cassandra with a large number of row keys per 
node - nodetool cfstats approximates the number of keys to something like 
700M per node. This seems to have caused very large heap consumption.


After reading 
http://wiki.apache.org/cassandra/LargeDataSetConsiderations I think I've 
tracked this down to the bloom filter, and the sampled index entries.


Regarding bloom filters, have I understood correctly that they are 
stored on the heap, and that the "Bloom Filter Space Used" reported by 
'nodetool cfstats' is an approximation of the heap space used by bloom 
filters? It reports the on-disk size, but if I understand 
CASSANDRA-3497, the on-disk size is smaller than the on-heap size?


I understand that increasing bloom_filter_fp_chance will decrease the 
bloom filter size, but at the cost of worse performance when asking for 
keys that don't exist. I do have a fair amount of queries for keys that 
don't exist.
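
For what it's worth, a sketch of how I'd make that change in a 1.0/1.1-era
setup; keyspace, CF name and the 0.1 value are only illustrative:

# In cassandra-cli:
#   [default@unknown] use MyKeyspace;
#   [default@MyKeyspace] update column family Data with bloom_filter_fp_chance = 0.1;
# Existing sstables keep their old filters until they are rewritten, e.g. via:
nodetool -h localhost scrub MyKeyspace Data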


How much will increasing the key cache help, i.e. decrease bloom filter 
size but increase key cache size? Will the key cache cache negative 
results, i.e. the fact that a key didn't exist?


Regards,
\EF


sstable size increase at compaction

2012-03-21 Thread Erik Forsberg

Hi!

We're using the bulkloader to load data to Cassandra. During and after 
bulkloading, the minor compaction process seems to result in larger 
sstables being created. An example:


 INFO [CompactionExecutor:105] 2012-03-21 15:18:46,608 CompactionTask.java (line 115) Compacting [SSTableReader(path='/cassandra/OSP5/Data/OSP5-Data-hc-1755-Data.db'), (REMOVED A BUNCH OF OTHER SSTABLE PATHS), SSTableReader(path='/cassandra/OSP5/Data/OSP5-Data-hc-1749-Data.db'), SSTableReader(path='/cassandra/OSP5/Data/OSP5-Data-hc-1753-Data.db')]

 INFO [CompactionExecutor:105] 2012-03-21 15:30:04,188 CompactionTask.java (line 226) Compacted to [/cassandra/OSP5/Data/OSP5-Data-hc-3270-Data.db,].  84,214,484 to 105,498,673 (~125% of original) bytes for 2,132,056 keys at 0.148486MB/s.  Time: 677,580ms.


The sstables are compressed (DeflateCompressor with chunk size 128) on 
the Hadoop cluster before being transferred to Cassandra, and the CF has 
the same compression settings:


[default@Keyspace1] describe Data;
ColumnFamily: Data (Super)
  Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
  Default column value validator: 
org.apache.cassandra.db.marshal.LongType
  Columns sorted by: 
org.apache.cassandra.db.marshal.LongType/org.apache.cassandra.db.marshal.UTF8Type

  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  DC Local Read repair chance: 0.0
  Replicate on write: true
  Caching: KEYS_ONLY
  Bloom Filter FP chance: 0.01
  Built indexes: []
  Compaction Strategy: 
org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy

  Compression Options:
chunk_length_kb: 128
sstable_compression: 
org.apache.cassandra.io.compress.DeflateCompressor


Any clues on this?

Regards,
\EF


Re: sstable size increase at compaction

2012-03-21 Thread Erik Forsberg

On 2012-03-21 16:36, Erik Forsberg wrote:

Hi!

We're using the bulkloader to load data to Cassandra. During and after
bulkloading, the minor compaction process seems to result in larger
sstables being created. An example:


This is on Cassandra 1.1, btw.

\EF


Re: Max TTL?

2012-02-21 Thread Erik Forsberg

On 2012-02-20 21:20, aaron morton wrote:

Nothing obvious.


Samarth (working on the same project) found that his patch for CASSANDRA-3754 
was cleaned up a bit too much, which caused a negative TTL.

https://issues.apache.org/jira/browse/CASSANDRA-3754?focusedCommentId=13212395page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13212395

So problem found.

Regards,
\EF


Max TTL?

2012-02-20 Thread Erik Forsberg

Hi!

When setting a TTL on columns, is there a maximum value (other than 
MAXINT, 2**31-1) that can be used?


I have a very odd behaviour here, where I try to set the TTL to 9,622,973 
(~111 days), which works, but setting it to 11,824,305 (~137 days) does 
not - it seems the columns are deleted instantly at insertion.


This is using the BulkOutputFormat. And it could be a problem with our 
code, i.e. the code using BulkOutputFormat. So, uhm, just asking to see 
if we're hitting something obvious.


Regards,
\EF


Streaming sessions from BulkOutputFormat job being listed long after they were killed

2012-02-17 Thread Erik Forsberg

Hi!

If I run a hadoop job that uses BulkOutputFormat to write data to 
Cassandra, and that hadoop job is aborted, i.e. streaming sessions are 
not completed, it seems like the streaming sessions hang around for a 
very long time (I've observed at least 12-15h) in the output from 'nodetool 
netstats'.


To me it seems like they go away only after a restart of Cassandra.

Is this a known behaviour? Does it cause any problems, e.g. consuming 
memory, or should I just ignore it?


Regards,
\EF



Recommended configuration for good streaming performance?

2012-02-02 Thread Erik Forsberg

Hi!

We're experimenting with streaming from Hadoop to Cassandra using 
BulkOutputFormat, on the cassandra-1.1 branch.


Are there any specific settings we should tune on the Cassandra servers 
in order to get the best streaming performance?


Our Cassandra machines are 16-core (including HT cores) with 24GiB of 
RAM. They have two disks each. So far we've configured them with the 
commitlog on one disk and sstables on the other, but with streaming not 
using the commitlog (correct?) maybe it makes sense to have sstables on both 
disks, doubling available I/O?


Thoughts on number of parallel streaming clients?
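
As a sketch of the first knob I'd look at while the bulkload is running
(the 16 afterwards is just the stock default, not a recommendation):

# Let compaction keep up with the freshly streamed sstables during the load
# window (0 = unthrottled), then put the throttle back afterwards:
nodetool -h localhost setcompactionthroughput 0
nodetool -h localhost setcompactionthroughput 16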

Thanks,
\EF


Can I use BulkOutputFormat from 1.1 to load data to older Cassandra versions?

2012-01-08 Thread Erik Forsberg

Hi!

Can the new BulkOutputFormat 
(https://issues.apache.org/jira/browse/CASSANDRA-3045) be used to load 
data to servers running cassandra 0.8.7 and/or Cassandra 1.0.6?


I'm thinking of using jar files from the development version to load 
data onto a production cluster which I want to keep on a production 
version of Cassandra. Can I do that, or does BulkOutputFormat require an 
API level that is only in the development version of Cassandra?


Thanks,
\EF


Re: Multiple large disks in server - setup considerations

2011-06-07 Thread Erik Forsberg
On Tue, 31 May 2011 13:23:36 -0500
Jonathan Ellis jbel...@gmail.com wrote:

 Have you read http://wiki.apache.org/cassandra/CassandraHardware ?

I had, but it was a while ago so I guess I kind of deserved an RTFM! :-)

After re-reading it, I still want to know:

* If we disregard the performance hit caused by having the commitlog on
  the same physical device as parts of the data, are there any other
  grave effects on Cassandra's functionality with a setup like that?

* How does Cassandra handle a case where one of the disks in a striped
  RAID0 partition goes bad and is replaced? Is the only option to wipe
  everything from that node and reinit the node, or will it handle
  corrupt files? I.e, what's the recommended thing to do from an
  operations point of view when a disk dies on one of the nodes in a
  RAID0 Cassandra setup? What will cause the least risk for data loss?
  What will be the fastest way to get the node up to speed with the
  rest of the cluster?

Thanks,
\EF



 
 On Tue, May 31, 2011 at 7:47 AM, Erik Forsberg forsb...@opera.com
 wrote:
  Hi!
 
  I'm considering setting up a small (4-6 nodes) Cassandra cluster on
  machines that each have 3x2TB disks. There's no hardware RAID in the
  machine, and if there were, it could only stripe single disks
  together, not parts of disks.
 
  I'm planning RF=2 (or higher).
 
  I'm pondering what the best disk configuration is. Two alternatives:
 
  1) Make small partition on first disk for Linux installation and
  commit log. Use Linux' software RAID0 to stripe the remaining space
  on disk1
    + the two remaining disks into one large XFS partition.
 
  2) Make small partition on first disk for Linux installation and
  commit log. Mount rest of disk 1 as /var/cassandra1, then disk2
    as /var/cassandra2 and disk3 as /var/cassandra3.
 
  Is it unwise to put the commit log on the same physical disk as
  some of the data? I guess it could impact write performance, but
  maybe it's bad from a data consistency point of view?
 
  How does Cassandra handle replacement of a bad disk in the two
  alternatives? With option 1) I guess there's risk of files being
  corrupt. With option 2) they will simply be missing after replacing
  the disk with a new one.
 
  With option 2) I guess I'm limiting the size of the total amount of
  data in the largest CF at compaction to, hmm.. the free space on the
  disk with most free space, correct?
 
  Comments welcome!
 
  Thanks,
  \EF
  --
  Erik Forsberg forsb...@opera.com
  Developer, Opera Software - http://www.opera.com/
 
 
 
 


-- 
Erik Forsberg forsb...@opera.com
Developer, Opera Software - http://www.opera.com/


Multiple large disks in server - setup considerations

2011-05-31 Thread Erik Forsberg
Hi!

I'm considering setting up a small (4-6 nodes) Cassandra cluster on
machines that each have 3x2TB disks. There's no hardware RAID in the
machine, and if there were, it could only stripe single disks
together, not parts of disks. 

I'm planning RF=2 (or higher).

I'm pondering what the best disk configuration is. Two alternatives:

1) Make small partition on first disk for Linux installation and commit
   log. Use Linux' software RAID0 to stripe the remaining space on disk1
   + the two remaining disks into one large XFS partition.

2) Make small partition on first disk for Linux installation and commit
   log. Mount rest of disk 1 as /var/cassandra1, then disk2
   as /var/cassandra2 and disk3 as /var/cassandra3.

Is it unwise to put the commit log on the same physical disk as some of
the data? I guess it could impact write performance, but maybe it's bad
from a data consistency point of view? 

How does Cassandra handle replacement of a bad disk in the two
alternatives? With option 1) I guess there's risk of files being
corrupt. With option 2) they will simply be missing after replacing the
disk with a new one.

With option 2) I guess I'm limiting the size of the total amount of
data in the largest CF at compaction to, hmm.. the free space on the
disk with most free space, correct? 

Comments welcome!

Thanks,
\EF
-- 
Erik Forsberg forsb...@opera.com
Developer, Opera Software - http://www.opera.com/