Re: Strange delay in query

2012-11-08 Thread Josep Blanquer
Can it be that you have tons and tons of tombstoned columns in the middle
of these two? I've seen plenty of performance issues with wide
rows littered with column tombstones (you could check by dumping the
sstables...)

Just a thought...
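
To illustrate the suspicion, here is a toy model (not Cassandra code; the
row layout and the slice_live helper are made up for the example) of why a
two-column slice can be slow when deleted columns sit between the two live
ones: the read has to step over every tombstone before it finds the second
live column.

```python
# Toy model only: a row is a sorted list of (name, is_tombstone) pairs.
def slice_live(row, start, count):
    """Return up to `count` live column names >= start, plus cells scanned."""
    live, scanned = [], 0
    for name, is_tombstone in row:
        if name < start:
            continue
        scanned += 1
        if not is_tombstone:
            live.append(name)
            if len(live) == count:
                break
    return live, scanned

# Two live columns separated by 100,000 deleted ones, then two adjacent live ones.
row = ([(0, False)]
       + [(i, True) for i in range(1, 100_001)]
       + [(100_001, False), (100_002, False)])

print(slice_live(row, 0, 2))        # ([0, 100001], 100002)
print(slice_live(row, 100_001, 2))  # ([100001, 100002], 2)
```

Both calls return two columns, but the first one touches 100,002 cells and
the second only 2.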

Josep M.

On Thu, Nov 8, 2012 at 12:23 PM, André Cruz andre.c...@co.sapo.pt wrote:

 These are the two columns in question:

 = (super_column=13957152-234b-11e2-92bc-e0db550199f4,
  (column=attributes, value=, timestamp=1351681613263657)
  (column=blocks,
 value=A4edo5MhHvojv3Ihx_JkFMsF3ypthtBvAZkoRHsjulw06pez86OHch3K3OpmISnDjHODPoCf69bKcuAZSJj-4Q,
 timestamp=1351681613263657)
  (column=hash,
 value=8_p2QaeRaX_QwJbUWQ07ZqlNHei7ixu0MHxgu9oennfYOGfyH6EsEe_LYO8V8EC_1NPL44Gx8B7UhYV9VSb7Lg,
 timestamp=1351681613263657)
  (column=icon, value=image_jpg, timestamp=1351681613263657)
  (column=is_deleted, value=true, timestamp=1351681613263657)
  (column=is_dir, value=false, timestamp=1351681613263657)
  (column=mime_type, value=image/jpeg, timestamp=1351681613263657)
  (column=mtime, value=1351646803, timestamp=1351681613263657)
  (column=name, value=/Mobile Photos/Photo 2012-10-28 17_13_50.jpeg,
 timestamp=1351681613263657)
  (column=revision, value=13957152-234b-11e2-92bc-e0db550199f4,
 timestamp=1351681613263657)
  (column=size, value=1379001, timestamp=1351681613263657)
  (column=thumb_exists, value=true, timestamp=1351681613263657))
 = (super_column=40b7ae4e-2449-11e2-8610-e0db550199f4,
  (column=attributes, value={posix: 420}, timestamp=1351790781154800)
  (column=blocks,
 value=9UCDkHNb8-8LuKr2bv9PjKcWCT0v7FCZa0ebNSflES4-o7QD6eYschVaweCKSbR29Dq2IeGl_Cu7BVnYJYphTQ,
 timestamp=1351790781154800)
  (column=hash,
 value=kao2EV8jw_wN4EBoMkCXZWCwg3qQ0X6m9_X9JIGkEkiGKJE_JeKgkdoTAkAefXgGtyhChuhWPlWMxl_tX7VZUw,
 timestamp=1351790781154800)
  (column=icon, value=text_txt, timestamp=1351790781154800)
  (column=is_dir, value=false, timestamp=1351790781154800)
  (column=mime_type, value=text/plain, timestamp=1351790781154800)
  (column=mtime, value=1351378576, timestamp=1351790781154800)
  (column=name, value=/Documents/VIMDocument.txt,
 timestamp=1351790781154800)
  (column=revision, value=40b7ae4e-2449-11e2-8610-e0db550199f4,
 timestamp=1351790781154800)
  (column=size, value=13, timestamp=1351790781154800)
  (column=thumb_exists, value=false, timestamp=1351790781154800))


 I don't think their size is an issue here.

 André

 On Nov 8, 2012, at 6:04 PM, Andrey Ilinykh ailin...@gmail.com wrote:

 What is the size of columns? Probably those two are huge.


 On Thu, Nov 8, 2012 at 4:01 AM, André Cruz andre.c...@co.sapo.pt wrote:

 On Nov 7, 2012, at 12:15 PM, André Cruz andre.c...@co.sapo.pt wrote:

  This error also happens on my application that uses pycassa, so I don't
 think this is the same bug.

 I have narrowed it down to a slice between two consecutive columns.
 Observe this behaviour using pycassa:

 
 DISCO_CASS.col_fam_nsrev.get(uuid.UUID('3cd88d97-ffde-44ca-8ae9-5336caaebc4e'),
 column_count=2,
 column_start=uuid.UUID('13957152-234b-11e2-92bc-e0db550199f4')).keys()
 DEBUG 2012-11-08 11:55:51,170 pycassa_library.pool:30 6849
 139928791262976 Connection 52905488 (xxx:9160) was checked out from pool
 51715344
 DEBUG 2012-11-08 11:55:53,415 pycassa_library.pool:37 6849
 139928791262976 Connection 52905488 (xxx:9160) was checked in to pool
 51715344
 [UUID('13957152-234b-11e2-92bc-e0db550199f4'),
 UUID('40b7ae4e-2449-11e2-8610-e0db550199f4')]

 A two column slice took more than 2s to return. If I request the next 2
 column slice:

 
 DISCO_CASS.col_fam_nsrev.get(uuid.UUID('3cd88d97-ffde-44ca-8ae9-5336caaebc4e'),
 column_count=2,
 column_start=uuid.UUID('40b7ae4e-2449-11e2-8610-e0db550199f4')).keys()
 DEBUG 2012-11-08 11:57:32,750 pycassa_library.pool:30 6849
 139928791262976 Connection 52904912 (xxx:9160) was checked out from pool
 51715344
 DEBUG 2012-11-08 11:57:32,774 pycassa_library.pool:37 6849
 139928791262976 Connection 52904912 (xxx:9160) was checked in to pool
 51715344
 [UUID('40b7ae4e-2449-11e2-8610-e0db550199f4'),
 UUID('a364b028-2449-11e2-8882-e0db550199f4')]

 This takes 20msec... Is there a rational explanation for this different
 behaviour? Is there some threshold that I'm running into? Is there any way
 to obtain more debugging information about this problem?

 Thanks,
 André






Re: minor compaction and delete expired column-tombstones

2012-09-17 Thread Josep Blanquer
We've run into exactly the same problem recently. Some specific keys in a
couple of CFs accumulate a fair amount of column churn over time.

Pre Cassandra 1.x we scheduled full compactions often to purge them.
However, when we moved to 1.x we adopted the recommended practice of
avoiding full compactions. The problem took a while to manifest itself, but
over the course of several weeks (a few months) of not doing full compactions
the load on those services slowly increased...and although we have
everything monitored, it was not trivial to find out that it was the
accumulation of tombstones on 'some' keys, for 'some' CFs in the cluster
that was really causing long latencies and CPU spikes (high CPU is a
typical signature when having a fair amount of tombstones in the SSTables).

Is there any JIRA or enhancement to perhaps be able to detect when certain
column tombstones can be deleted in minor compactions? Might the newly
introduced SSTable min-max timestamps help? Or perhaps there are new ones
coming up that I'm not aware of?

I'm saying this because there is absolutely no way (that I know of) to find
out or monitor when Cassandra encounters many column tombstones when doing
searches. That alone could help detect these cases so one can change the
data model and/or realize that full compactions are needed. For example, a new
metric at the CF level that tracks the % of tombstones read per row (ideally a
histogram based on row size), or perhaps spitting something out in the logs (a
la the MySQL slow query log) when a wide row is read and a certain % of
tombstone columns are encountered...this alone can be a huge help in at
least detecting the latent problem.
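
A minimal sketch of the kind of check I have in mind, assuming the read
path could hand us one boolean per cell it touched (True for a tombstone).
The function names, the logger name and the thresholds are all hypothetical;
nothing like this exists in Cassandra as far as I know.

```python
import logging

log = logging.getLogger("tombstone_watch")  # hypothetical logger name

def tombstone_ratio(cells):
    """cells: iterable of booleans, True where the cell read was a tombstone."""
    cells = list(cells)
    if not cells:
        return 0.0
    return sum(cells) / len(cells)

def check_read(row_key, cells, min_cells=1000, max_ratio=0.5):
    """Log a slow-query-style warning for wide reads dominated by tombstones."""
    cells = list(cells)
    ratio = tombstone_ratio(cells)
    flagged = len(cells) >= min_cells and ratio >= max_ratio
    if flagged:
        log.warning("row %r: %d cells read, %.0f%% tombstones",
                    row_key, len(cells), ratio * 100)
    return flagged
```

For example, check_read("Index", [True] * 900 + [False] * 100) would be
flagged (1000 cells, 90% tombstones), while a 10-cell read would not.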

...what we had to do to fully debug and understand the issue was to build
some tools that scanned SSTables and provided some of those stats. In a
large cluster that is painful to do.

Anyway, just wanted to chime in on the thread to provide our input on the
matter.

Cheers,

Josep M.

On Mon, Sep 17, 2012 at 2:01 AM, Rene Kochen
rene.koc...@emea.schange.com wrote:

 Oke, thanks!

 So a column tombstone will only be removed if all row fragments are
 present in the tables being compacted.
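
 The rule as summarised above can be sketched like this. It is a simplified
 model, not Cassandra's implementation: SSTables are plain dicts from a name
 to the set of row keys they hold, and can_purge_tombstone is a made-up
 helper. The table names follow the Customers example from the original
 question.

```python
def can_purge_tombstone(row_key, compacting, all_sstables, gc_grace_expired=True):
    """SSTables are modelled as a dict: name -> set of row keys they contain."""
    if not gc_grace_expired:
        return False
    others = set(all_sstables) - set(compacting)
    # The tombstone may only go if no SSTable outside the compaction has the row.
    return not any(row_key in all_sstables[name] for name in others)

sstables = {
    "Customers_0": {"C1"}, "Customers_1": {"C1", "C2"},
    "Customers_2": {"C2"}, "Customers_3": {"C1"},
    "Customers_4": {"C1"},
}
minor = ["Customers_0", "Customers_1", "Customers_2", "Customers_3"]

print(can_purge_tombstone("C1", minor, sstables))           # False
print(can_purge_tombstone("C2", minor, sstables))           # True
print(can_purge_tombstone("C1", list(sstables), sstables))  # True
```

 C1 also lives in Customers_4, which is outside the minor compaction, so
 its tombstone survives until a compaction includes every fragment.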

 I have a row called Index which contains columns like page0,
 page1, page2, etc. Every several minutes, new columns are created
 and old ones deleted. The problem is that I now have an Index row in
 several SSTables, but the column tombstones are never deleted. And
 reading the Index row (and all its column tombstones) takes longer
 and longer.

 If I do a major compaction, all tombstones are deleted and reading the
 index row takes one millisecond again (and all the garbage-collect
 issues because of this).

 Is it not advised to use rows with many new column creates/deletes
 (because of how minor compactions work)?

 Thanks!

 Rene

 2012/9/17 aaron morton aa...@thelastpickle.com:
  Does minor compaction delete expired column-tombstones when the row is
  also present in another table which is
 
  No.
  Compaction is per Column Family.
 
  Tombstones will be expired by Minor Compaction if all fragments of the
 row
  are contained in the SSTables being compacted.
 
  Cheers
 
  -
  Aaron Morton
  Freelance Developer
  @aaronmorton
  http://www.thelastpickle.com
 
  On 15/09/2012, at 6:32 AM, Rene Kochen rene.koc...@schange.com wrote:
 
  Hi all,
 
  Does minor compaction delete expired column-tombstones when the row is
  also present in another table which is not subject to the minor
  compaction?
 
  Example:
 
  Say there are 5 SStables:
 
  - Customers_0 (10 MB)
  - Customers_1 (10 MB)
  - Customers_2 (10 MB)
  - Customers_3 (10 MB)
  - Customers_4 (30 MB)
 
  A minor compaction is triggered which will compact the similar sized
  tables 0 to 3. In these tables is a customer record with key C1 with
  an expired column tombstone. Customer C1 is also present in table 4.
  Will the minor compaction delete the column (i.e. will the tombstone
  be present in the newly created table)?
 
  Thanks,
 
  Rene
 
 



Re: CQL3 and column slices

2012-07-24 Thread Josep Blanquer
Thanks Sylvain,

 The main argument for this is pagination. Let me try to explain the use
cases, and compare it to RDBMS for better illustration:
 1- Right now, Cassandra doesn't stream the results, so large resultsets
are a royal pain in the neck to deal with. I.e., if I have a range_slice,
or even a slice query that cuts across 1 million columns...I have to
completely eat it all in the client receiving the response. That is, I'll
need to store 1 million results in the client no matter what, and that can
be quite prohibitive.
 2- In an effort to alleviate that, one can be smarter in the client and
play the pagination game...i.e., start slicing at some column and get the
next N results, then start the slice at the last column seen and get N
more...etc. That results in many more queries from the smart client, but
at least it would allow you to handle large result sets. (That's what the
need for the CQL query in my original email was about).
3- There's another important factor related to this problem in my opinion:
the LIMIT clause in Cassandra (in both CQL and Thrift) is a required
field. What I mean by required is that cassandra requires an explicit
count to operate underneath. So it is really different from RDBMS'
semantics where no LIMIT means you'll get all the results (instead of the
high, yet still bounded count of 10K or 20K max resultset rows cassandra
enforces by default)...and I cannot tell you how many problems we've had
with developers forgetting about these default counts in queries, and
realizing that some had results truncated because of that...in my mind,
LIMIT should only be used to restrict results...queries with no LIMIT
should always return all results (much like RDBMS)...otherwise the query
seems the same but it is semantically different.
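
To make the pagination game in point 2 concrete, here is a small sketch.
get_slice is only a stand-in for a server-side slice with an inclusive
start column (it works on an in-memory sorted list), and the sketch assumes
column names are unique.

```python
import bisect

def get_slice(columns, start=None, count=100):
    """Stand-in for a server-side slice over sorted column names (inclusive start)."""
    i = 0 if start is None else bisect.bisect_left(columns, start)
    return columns[i:i + count]

def paginate(columns, page_size):
    """Yield every column while holding at most page_size + 1 in the client."""
    start = None
    while True:
        # Ask for one extra column: the inclusive start repeats the last one seen.
        page = get_slice(columns, start, page_size + (0 if start is None else 1))
        if start is not None:
            page = page[1:]
        if not page:
            return
        yield from page
        start = page[-1]

print(list(paginate(list(range(10)), 3)))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Ten columns with a page size of 3 take five round trips here, which is the
"many more queries" cost mentioned above.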

So, all in all I think that the main problem/use case I'm facing is that
Cassandra cannot stream resultsets. If it did, I believe that the need for
my pagination use case would basically disappear, since it'd be the
transport/client that would throttle how many results are stored in the
client buffer at any point in time. At the same time, I believe that with a
streaming protocol you could simply change Cassandra internals to have
infinite default limits...since there would be no reason to stop
scanning (unless an explicit LIMIT clause was specified by the client).
That would give you not only the SQL-equivalent syntax, but also the
equivalent semantics of most current DBs.

I hope that makes sense. That being said, are there any plans for streaming
results? I believe that without that (and especially with the new CQL
restrictions) it becomes much more difficult to use Cassandra with wide rows
and large resultsets (which, in my mind, is one of its sweet spots). I
believe that if that doesn't happen it would a) force the clients to be
built in a much more complex and inefficient way to handle wide rows or b)
force users to use different, less efficient datamodels for their
data. Both seem bad propositions to me, as they wouldn't be taking
advantage of Cassandra's power, therefore diminishing its value.

 Cheers,

 Josep M.


On Tue, Jul 24, 2012 at 3:11 AM, Sylvain Lebresne sylv...@datastax.com wrote:

 On Tue, Jul 24, 2012 at 12:09 AM, Josep Blanquer
 blanq...@rightscale.com wrote:
  is there some way to express that in CQL3? something logically
 equivalent to
 
  SELECT *  FROM bug_test WHERE a:b:c:d:e > 1:1:1:1:2??

 No, there isn't. Not currently at least. But feel free of course to
 open a ticket/request on
 https://issues.apache.org/jira/browse/CASSANDRA.

 I note that I would be curious to know the concrete use case you have
 for such type of queries. It would also help as an argument to add
 such facilities more quickly (or at all). Typically, "we should
 support it in CQL3 because it was possible with thrift" is
 definitely an argument, but a much weaker one without concrete
 examples of why it might be useful in the first place.

 --
 Sylvain



CQL3 and column slices

2012-07-23 Thread Josep Blanquer
Hi,

 I am confused as to what is the way to specify column slices for composite
type CFs using CQL3.

I first thought that the way to do so was to use the very ugly and
unintuitive syntax of constructing the PK prefix with equalities, except
the last part of the composite type. But, now, after seeing
https://issues.apache.org/jira/browse/CASSANDRA-4372 , and realizing that
the ugly/unintuitive way to specify that has been taken away (i.e.,
fixed) ...I don't know what is the way to express it anymore.

In particular, and following the example of 4372...if you have this table
with 6 columns, 5 of them being the composite :

CREATE TABLE bug_test (a int, b int, c int, d int, e int, f text, PRIMARY
KEY (a, b, c, d, e) );
with some data in it:

SELECT * FROM bug_test;

Results:

 a | b | c | d | e | f
---+---+---+---+---+---
 1 | 1 | 1 | 1 | 1 | 1
 1 | 1 | 1 | 1 | 2 | 2
 1 | 1 | 1 | 1 | 3 | 3
 1 | 1 | 1 | 1 | 5 | 5
 1 | 1 | 1 | 2 | 1 | 1

how can I do a slice starting after 1:1:1:1:2 to the end?

I thought that the (very ugly way) was:

SELECT a, b, c, d, e, f FROM bug_test WHERE a = 1 AND b = 1 AND c = 1 AND d
= 1 AND e > 2;

(despite the fact that it felt completely wrong since these conditions need
to be considered together, not as 5 independent ones...otherwise one
realizes that the result will contain rows that don't match it, for example
that contain d=2 in this case)

is there some way to express that in CQL3? something logically equivalent
to

SELECT *  FROM bug_test WHERE a:b:c:d:e > 1:1:1:1:2??
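
To show the difference between the two readings, here is the same data as
Python tuples, which compare element by element from the left, much like a
composite comparator. This only illustrates the semantics I'm after (a
slice starting after 1:1:1:1:2); it is not CQL3 behaviour.

```python
rows = [
    (1, 1, 1, 1, 1),
    (1, 1, 1, 1, 2),
    (1, 1, 1, 1, 3),
    (1, 1, 1, 1, 5),
    (1, 1, 1, 2, 1),
]
start = (1, 1, 1, 1, 2)

# Composite slice: compare the whole key at once (lexicographic order).
composite_slice = [r for r in rows if r > start]

# Five independent conditions: a = 1 AND b = 1 AND c = 1 AND d = 1 AND e > 2.
per_column = [r for r in rows if r[:4] == (1, 1, 1, 1) and r[4] > 2]

print(composite_slice)  # [(1, 1, 1, 1, 3), (1, 1, 1, 1, 5), (1, 1, 1, 2, 1)]
print(per_column)       # [(1, 1, 1, 1, 3), (1, 1, 1, 1, 5)]
```

The row with d=2 belongs to the slice but can never satisfy the five
conditions taken one by one, which is why the two forms are not equivalent.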

Cheers,

Josep M.


Re: poor Memtable performance on column slices?

2012-01-18 Thread Josep Blanquer
Excellent Sylvain! Yes, that seems to remove the linear scan component of
slice read times.

FYI, I still see some interesting difference in some aspects though.

If I do a slice without a start (i.e., get me the first column)...it seems
to fly. GET(K, :count = 1 )
-- 4.832877  -- very fast, and actually in this case I see the reading
client being the bottleneck, not cassandra (which is at only about 20%
CPU)

If I do the same, but actually specifying the start column with the first
existing value...GET(K,:start = '144abe16-416c-11e1-9e23-2cbae9ddfe8b' ,
:count = 1 )
-- 11.084275 -- half as fast, and using twice the CPU...hovering about
50% or more. (again Cassandra is not the bottleneck, but the significant
data is that the initial seeking seems to be doubling the time/cpu)

If I do the same, starting by the middle.  GET(K,:start
= '9c13c644-416c-11e1-81dd-4ba530dc83d0' , :count = 1 )
-- 11.038187  -- as expensive as starting from the beginning

The same starting at the last one.  GET(K,:start
= '1c1b9b32-416d-11e1-83ff-dd2796c3abd7' , :count = 1 )
-- 6.489683  - Much faster than any other slice ... although not quite as
fast as not using a start column

I could see that not having to seek into whatever backing map/structure
is obviously faster...although I'm surprised that seeking to an initial
value makes reads take twice as long. Wouldn't this mostly imply following
some links/pointers in memory to start reading ordered columns? What is the
backing store used for Memtables when column slices are performed?

I am not sure why starting at the end (without reversing or anything)
yields much better performance.

 Cheers,

Josep M.

On Wed, Jan 18, 2012 at 12:57 AM, Sylvain Lebresne sylv...@datastax.com wrote:

 On Wed, Jan 18, 2012 at 2:44 AM, Josep Blanquer blanq...@rightscale.com
 wrote:
  Hi,
 
   I've been doing some tests using wide rows recently, and I've seen some
 odd
  performance problems that I'd like to understand.
 
  In particular, I've seen that the time it takes for Cassandra to perform
 a
  column slice of a single key, solely in a Memtable, seems to be very
  expensive, but most importantly proportional to the ordered position
 where
  the start column of the slice lives.
 
  In other words:
   1- if I start Cassandra fresh (with an empty ColumnFamily with TimeUUID
  comparator)
   2- I create a single Row with Key K
   3- Then add 200K TimeUUID columns to key K
   4- (and make sure nothing is flushed to SSTables...so it's all in the
  Memtable)
 
  ...I observe the following timings (seconds to perform 1000 reads) while
  performing multiget slices on it:  (pardon the pseudo-code, but you'll
 get
  the gist)
 
  a) simply a get of the first column:  GET(K,:count=1)
--  2.351226
 
  b) doing a slice get, starting from the first column:  GET(K,:start =
  '144abe16-416c-11e1-9e23-2cbae9ddfe8b' , :count = 1 )
-- 2.189224   - so with or without start doesn't seem to make much
 of
  a difference
 
  c) doing a slice get, starting from the middle of the ordered
  columns..approx starting at item number 100K:   GET(K,:start =
  '9c13c644-416c-11e1-81dd-4ba530dc83d0' , :count = 1 )
   -- 11.849326  - 5 times more expensive if the start of the slice is
 100K
  positions away
 
  d) doing a slice get, starting from the last of the ordered
 columns..approx
  position 200K:   GET(K,:start
 = '1c1b9b32-416d-11e1-83ff-dd2796c3abd7' ,
  :count = 1 )
-- 19.889741   - Almost twice as expensive as starting the slice at
  position 100K, and 10 times more expensive than starting from the first
 one
 
  This behavior leads me to believe that there's a clear Memtable column
 scan
  for the columns of the key.
  If one tries a column name read on those positions (i.e., not a slice),
 the
  performance is constant. I.e., GET(K,
  '144abe16-416c-11e1-9e23-2cbae9ddfe8b') . Retrieving the first, middle or
  last timeUUID is done in the same amount of time.
 
  Having increasingly worse performance for column slices in Memtables
 seems
  to be a bit of a problem...aren't Memtables backed by a structure that
 has
  some sort of column name indexing?...so that landing on the start column
 can
  be efficient? I'm definitely observing very high CPU utilization on those
  scans...By the way, with wide columns like this, slicing SSTables is
 quite
  a bit faster than slicing Memtables...I'm attributing that to the sampled
 index of
  the SSTables, hence that's why I'm wondering if the Memtables do not have
  such column indexing builtin and resort to linked lists of sorts
 
  Note, that the actual timings shown are not important, it's on my laptop
 and
  I have a small amount of debugging enabled...what is important is the
  difference between them.
 
  I'm using Cassandra trunk as of Dec 1st, but I believe I've done
 experiments
  with 0.8 series too, leading to the same issue.

 You may want to retry your experiments on current trunk. We did have an
 inefficiency
 in our memtable search that was fixed

Re: poor Memtable performance on column slices?

2012-01-18 Thread Josep Blanquer
On Wed, Jan 18, 2012 at 12:44 PM, Jonathan Ellis jbel...@gmail.com wrote:

 On Wed, Jan 18, 2012 at 12:31 PM, Josep Blanquer
 blanq...@rightscale.com wrote:
  If I do a slice without a start (i.e., get me the first column)...it
 seems
  to fly. GET(K, :count = 1 )

 Yep, that's a totally different code path (SimpleSliceReader instead
 of IndexedSliceReader) that we've done to optimize this common case.


Thanks Jonathan, yup, that makes sense. It was surprising to me that
avoiding the seek was that much faster..but I guess if it's a completely
different code path, there might be many other things in play.


  The same starting at the last one.  GET(K,:start
  = '1c1b9b32-416d-11e1-83ff-dd2796c3abd7' , :count = 1 )
  -- 6.489683  - Much faster than any other slice ... although not quite
 as
  fast as not using a start column

 That's not a special code path, but I'd guess that the last column is
 more likely to be still in memory instead of on disk.


Well, no need to prolong the thread, but my tests are exclusively
Memtable reads (data has not been flushed)...so there's no SSTable read involved
here...which is exactly why it felt a bit funny to have that case be
considerably faster. I just wanted to bring it up to you guys, in case you
can think of some cause and/or potential issue.

Thanks for the responses!

Josep M.


poor Memtable performance on column slices?

2012-01-17 Thread Josep Blanquer
Hi,

 I've been doing some tests using wide rows recently, and I've seen some
odd performance problems that I'd like to understand.

In particular, I've seen that the time it takes for Cassandra to perform a
column slice of a single key, solely in a Memtable, seems to be very
expensive, but most importantly proportional to the ordered position where
the start column of the slice lives.

In other words:
 1- if I start Cassandra fresh (with an empty ColumnFamily with TimeUUID
comparator)
 2- I create a single Row with Key K
 3- Then add 200K TimeUUID columns to key K
 4- (and make sure nothing is flushed to SSTables...so it's all in the
Memtable)

...I observe the following timings (seconds to perform 1000 reads) while
performing multiget slices on it:  (pardon the pseudo-code, but you'll get
the gist)

a) simply a get of the first column:  GET(K,:count=1)
  --  2.351226

b) doing a slice get, starting from the first column:  GET(K,:start =
'144abe16-416c-11e1-9e23-2cbae9ddfe8b' , :count = 1 )
  -- 2.189224   - so with or without start doesn't seem to make much of
a difference

c) doing a slice get, starting from the middle of the ordered
columns..approx starting at item number 100K:   GET(K,:start =
'9c13c644-416c-11e1-81dd-4ba530dc83d0' , :count = 1 )
 -- 11.849326  - 5 times more expensive if the start of the slice is 100K
positions away

d) doing a slice get, starting from the last of the ordered columns..approx
position 200K:   GET(K,:start = '1c1b9b32-416d-11e1-83ff-dd2796c3abd7' ,
:count = 1 )
  -- 19.889741   - Almost twice as expensive as starting the slice at
position 100K, and 10 times more expensive than starting from the first one

This behavior leads me to believe that there's a clear Memtable column scan
for the columns of the key.
If one tries a column name read on those positions (i.e., not a slice), the
performance is constant. I.e., GET(K,
'144abe16-416c-11e1-9e23-2cbae9ddfe8b') . Retrieving the first, middle or
last timeUUID is done in the same amount of time.

Having increasingly worse performance for column slices in Memtables seems
to be a bit of a problem...aren't Memtables backed by a structure that has
some sort of column name indexing?...so that landing on the start column
can be efficient? I'm definitely observing very high CPU utilization on
those scans...By the way, with wide columns like this, slicing SSTables is
quite a bit faster than slicing Memtables...I'm attributing that to the sampled
index of the SSTables, hence that's why I'm wondering if the Memtables do
not have such column indexing builtin and resort to linked lists of sorts
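
To show what I mean by scan versus index, here is a small illustration in
Python. It says nothing about the actual Memtable structure (that is my
question); integers stand in for the 200K sorted TimeUUID column names and
both helper functions are made up for the example.

```python
import bisect

def linear_seek(names, start):
    """Walk from the front; return (index, comparisons made)."""
    for i, name in enumerate(names):
        if name >= start:
            return i, i + 1
    return len(names), len(names)

def indexed_seek(names, start):
    """Binary search over the same sorted names."""
    return bisect.bisect_left(names, start)

names = list(range(200_000))  # stand-in for 200K sorted TimeUUID column names
for start in (0, 100_000, 199_999):
    idx, comparisons = linear_seek(names, start)
    assert idx == indexed_seek(names, start)
    print(start, comparisons)
```

The linear walk needs a number of comparisons proportional to the position
of the start column, which is the shape of the timings in a) to d), while a
binary search over 200K names needs about 18 steps wherever the slice starts.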

Note, that the actual timings shown are not important, it's on my laptop
and I have a small amount of debugging enabled...what is important is
the difference between them.

I'm using Cassandra trunk as of Dec 1st, but I believe I've done
experiments with 0.8 series too, leading to the same issue.

 Cheers,

Josep M.


Live migrating data from 2 separate cassandra clusters

2011-08-25 Thread Josep Blanquer
Hi,

 I am looking for an efficient way to migrate a portion of the data existing in
a Cassandra cluster to another, separate Cassandra cluster. What I need is
to solve the typical live migration problem that appears in any DB
sharding where you need to transfer ownership of certain rows from DB1 to
DB2...but in a way that clients see no (or almost no) disruption when you
actually do the cutover to DB2 for those writes.

I mean doing something as typical as:

loop (until almost no rows have been modified):
 rows = SELECT * from T where criteria matches (i.e., shard_id=1)  AND
updated_at > last_time
 last_time = now
 insert(rows) elsewhere
end
...
lock modifications to original DB
do one last SELECT to get the last few modified rows
cutover the ownership - (change and ensure the clients know that the new
home for that data is in the other DB)
unlock modifications
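
For reference, the same procedure as a small runnable simulation. It
assumes both stores are plain dicts whose rows carry an
application-maintained updated_at field, a write_lock that writers to the
old store would have to respect, and an owners map the clients consult;
all of those names are hypothetical and stand in for whatever the real
system provides.

```python
import threading

write_lock = threading.Lock()   # hypothetical: writers to the old DB must hold it

def copy_changed(source, target, shard_id, since):
    """Copy rows of one shard modified after `since`; return how many were copied."""
    changed = {k: dict(v) for k, v in source.items()
               if v["shard_id"] == shard_id and v["updated_at"] > since}
    target.update(changed)
    return len(changed)

def live_migrate(source, target, shard_id, owners, clock, threshold=0, max_passes=20):
    last_time = float("-inf")
    for _ in range(max_passes):
        now = clock()   # read before the copy; later-stamped updates go to the next pass
        copied = copy_changed(source, target, shard_id, last_time)
        last_time = now
        if copied <= threshold:     # "almost no rows have been modified"
            break
    with write_lock:                # lock modifications to the original DB
        copy_changed(source, target, shard_id, last_time)   # last few modified rows
        owners[shard_id] = "target"                          # cutover the ownership
    # leaving the `with` block unlocks modifications
```

The whole thing hinges on being able to ask for "rows changed since T",
which is exactly the part I don't see how to express through the thrift
interface.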


 So, anyway, I thought that I'd be able to apply the same principles by
passing a timestamp of sorts to the get_slices call so I could further
restrict getting only matching columns that have timestamps newer than the
one passed. Now, looking at the thrift interface I see that there is no
timestamp parameter at all...which makes me wonder how people are doing it,
and if there are any well-known practices for it. Setting up a full new
replicating DC within the same cluster doesn't work, as there are some clear
cases where you want to have completely separate cassandra rings.

Cheers,

 Josep M.


Re: Backup/Restore: Coordinating Cassandra Nodetool Snapshots with Amazon EBS Snapshots?

2011-06-23 Thread Josep Blanquer
On Thu, Jun 23, 2011 at 5:04 AM, Peter Schuller peter.schul...@infidyne.com
 wrote:

  1. Is it feasible to run directly against a Cassandra data directory
  restored from an EBS snapshot? (as opposed to nodetool snapshots restored
  from an EBS snapshot).

 Assuming EBS is not buggy, including honoring write barriers, including
 the linux guest kernel etc, then yes. EBS snapshots of a single
 volume are promised to be atomic. As such, a restore from an EBS
 snapshot should be semantically identical to recovering after a power
 outage or sudden reboot of the node.

 I make no claims as to how well EBS snapshot atomicity is actually
 tested in practice.


EBS volume atomicity is good. We've had tons of experience since EBS came
out almost 4 years ago,  to back all kinds of things, including large DBs.
One important thing to have in mind though, is that EBS snapshots are done
at the block level, not at the filesystem level. So depending on the
filesystem you have on top of the drives you might need to tell the
filesystem to "make sure this is consistent or recoverable now". For
example, if you use the log-based XFS, you might need to do xfs_freeze -f,
snapshot the disk(s), xfs_freeze -u, to make sure that the restored filesystem
data (and not only the low level disk blocks) is recoverable when you
restore them.
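
A rough sketch of that freeze / snapshot / unfreeze sequence, assuming an
XFS mount point and a take_snapshot callable that you supply to start the
EBS snapshot(s) (it is hypothetical here, as is the snapshot_frozen helper).
The only real commands used are xfs_freeze -f and xfs_freeze -u.

```python
import subprocess

def snapshot_frozen(mountpoint, take_snapshot, run=subprocess.check_call):
    """Freeze an XFS filesystem, call take_snapshot(), and always unfreeze."""
    run(["xfs_freeze", "-f", mountpoint])      # halt new writes, flush dirty data
    try:
        return take_snapshot()                 # hypothetical: starts the EBS snapshot(s)
    finally:
        run(["xfs_freeze", "-u", mountpoint])  # unfreeze even if the snapshot call failed
```

The try/finally matters: if the snapshot call blows up and the filesystem
stays frozen, every writer on the box hangs. The run parameter is only
there so the sequence can be exercised without touching a real filesystem.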

 Snapshotting volume stripes works exactly in the same way, you just have to
keep track of what position each snapshot has in the stripe, so you can
recreate the stripe back correctly.

The freezing of the filesystem might cause a quick/mini hiccup, which is
usually not noticeable unless you have very stringent requirements on the
box (or if you have a very large stripe, and/or some sort of network issue
where the calls to the amazon endpoint are very slow...and therefore you're
locking the FS a tad longer than you'd want to).

 Cheers,

Josep M.


Re: Backup/Restore: Coordinating Cassandra Nodetool Snapshots with Amazon EBS Snapshots?

2011-06-23 Thread Josep Blanquer
On Thu, Jun 23, 2011 at 7:30 AM, Peter Schuller peter.schul...@infidyne.com
 wrote:

  EBS volume atomicity is good. We've had tons of experience since EBS came
  out almost 4 years ago,  to back all kinds of things, including large
 DBs.
  One important thing to have in mind though, is that EBS snapshots are
 done
  at the block level, not at the filesystem level. So depending on the
  filesystem you have on top of the drives you might need to tell the
  filesystem to make sure this is consistent or recoverable now. For
  example, if you use the log-based XFS, you might need to do xfs_freeze,
  snapshot disc/s, xfs_unfreeze. To make sure that the restored filesystem
  data (and not only the low level disk blocks) is recoverable when you
  restore them.

 No. That is only required if you're doing multi-volume EBS snapshots
 (e.g. XFS on LVM). The entire point of an atomic snapshot is that
 atomicity gives a consistent snapshot; a modern file system which is
 already crash-consistent will be consistent in an atomic snapshot
 without additional action taken.



 That said, of course exercising those code paths regularly, rather
 than just on crashes, may mean that you have an elevated chance of
 triggering a bug that you would normally see very rarely. In that way,
 xfs_freeze might actually help probabilistically; however strictly
 speaking, discounting bugs, a crash-consistent fs will be
 snapshot consistent as well (it is logically implied).


Actually, I'm afraid that's not true (unless I'm missing something). Even if
you have only 1 drive, you still need to stop writes to the disk for the
short time it takes the low level drivers to snapshot it (i.e., marking
all blocks as clean so you can do CopyOnWrite later). I.e., you need to give
a chance to LVM, or the EBS low level 'modules' in the hypervisor ( whatever
you use underneath...), to have exclusive control of the drive for a moment.
Now, that being said, some systems (like LVM)  will do a freeze themselves,
so technically speaking you don't need to explicitly do a freeze
yourself...but that's not to say that a freeze is not required for
snapshotting.

 But this all assumes the entire stack is correct, and that e.g. an
 fsync() propagates correctly (i.e., not eaten by some LVM or mount
 option to the fs) in order to bring that consistency up to the
 application level.

 --
 / Peter Schuller



Re: Backup/Restore: Coordinating Cassandra Nodetool Snapshots with Amazon EBS Snapshots?

2011-06-23 Thread Josep Blanquer
On Thu, Jun 23, 2011 at 8:02 AM, William Oberman
ober...@civicscience.com wrote:

 I've been doing EBS snapshots for mysql for some time now, and was using a
 similar pattern as Josep (XFS with freeze, snap, unfreeze), with the extra
 complication that I was actually using 8 EBS's in RAID-0 (and the extra
 extra complication that I had to lock the MyISAM tables... glad to be moving
 away from that).  For cassandra I switched to ephemeral disks, as per
 recommendations from this forum.

 Yes, if you want to consistently snap MySQL you need to get it into a
consistent state, so you need to do the whole FLUSH TABLES WITH READ LOCK
yadda yadda, on top of the rest. Otherwise you might snapshot something that
is not correct/consistent...and it's a bit more tricky with snapshotting
slaves, since you need to know where they are in the replication
stream...etc



 One note on EBS snapshots though: the last time I checked (which was some
 time ago) I noticed degraded IO performance on the box during the
 snapshotting process even though the take snapshot command returns almost
 immediately.  My theory back then was that amazon does the
 delta/compress/store outside of the VM, but it obviously has an effect on
 resources on the box the VM runs on.  I was doing this on a mysql slave that
 no one talked to, so I didn't care/bother looking into it further.


Yes, that is correct. The underlying copy-on-write-and-ship-to-EBS/S3 does
have some performance impact on the running box. For the most part it's
never presented a problem for us or many of our customers, although you're
right, it's something you want to know about and have in mind when designing
your system (for example, snapshot slaves much more often than masters,
do masters when the traffic is low, stagger cassandra snaps...yadda
yadda).
If you think about it, this effect is not that different from using LVM
snaps on the ephemeral, and then moving the data from the snap to another
disk or a remote storage...moving those blocks would have an impact on
the original LVM volume since it's reading the same physical (ephemeral)
disk/s underneath (list of clean and dirty blocks).

One case where I could see the slightly reduced IO performance being problematic
is if your DB/storage is already at the edge of I/O capacity...but in that
case, the small overhead of a snapshot is probably the least of your
problems :) EBS slowness or malfunction can also impact the instance,
obviously, although that is not only related to snapshots, since it can
impact the actual volume regardless.

 Josep M.


Re: CFHistograms?

2011-05-07 Thread Josep Blanquer
I believe the offset values of Writes and Reads are in *micro*seconds, right?
(that page talks about *milli*seconds)

Also, are any timeouts or errors reflected in those times, or just successful
operations? If not, is there any JMX or other tool to keep track of them?

 Josep M.

On Fri, May 6, 2011 at 9:09 AM, Jonathan Ellis jbel...@gmail.com wrote:

 Those are updated at compaction time.

 On Thu, May 5, 2011 at 11:38 PM, Xaero S xaeros...@gmail.com wrote:
 
  Can someone point me to a document that explains how to interpret
  CFHistograms output? i went through
 
 http://narendrasharma.blogspot.com/2011/04/cassandra-07x-understanding-output-of.html
  which is a good beginning, but was wondering if there was anything more
  detailed. For e.g when i run CFHistograms, i see rowsize and columncount
  items in the table always 0 (which cant be right?)
 
  -Xaero
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



Re: Remove call vs. delete mutation

2011-04-12 Thread Josep Blanquer
Is there anybody else that might see a problem with just using delete
mutations instead of remove calls?

I'm thinking about changing a Cassandra client to always use delete
mutations when removing objects, that way the delete/remove call
interface can be kept the same:
1- the delete/remove client call would always support all features:
single-key/column, multi-column and slice range deletes.
2- it could be used in the same way regardless of embedding the calls
into batch mutations or removing a single column/key

 I'd like to hear some more thoughts on whether this change could cause the
Cassandra server to take a much higher CPU toll, for example because
decoding mutations is much less optimized than straight removes or something
like that...(I don't think so but...). In other words, if I do 1000
inserts or 1000 single-delete mutations, would the Cassandra server
see much of a difference?
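
To make the idea concrete, here is a rough sketch of the client-side wrapper I'm thinking of. The dataclasses are plain stand-ins whose names and fields mirror the Thrift structs as I remember them (SlicePredicate, Deletion, Mutation); they are not the generated Thrift classes, and delete_mutation / merge_mutations are made-up helper names. A real client would hand the resulting map to batch_mutate along with a consistency level. I've left slice ranges out, since whether the server accepts them inside a Deletion may depend on the version.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


# Stand-ins for the Thrift-generated structs (assumption: field names only
# mirror the Thrift interface; these are NOT the real generated classes).
@dataclass
class SlicePredicate:
    column_names: Optional[List[bytes]] = None


@dataclass
class Deletion:
    timestamp: int
    super_column: Optional[bytes] = None
    predicate: Optional[SlicePredicate] = None


@dataclass
class Mutation:
    deletion: Optional[Deletion] = None


# row key -> column family name -> list of mutations
MutationMap = Dict[bytes, Dict[str, List[Mutation]]]


def delete_mutation(
    key: bytes,
    column_family: str,
    timestamp: int,
    columns: Optional[Iterable[bytes]] = None,
    super_column: Optional[bytes] = None,
) -> MutationMap:
    """Build a mutation map holding one delete (hypothetical helper).

    No columns and no super_column means "delete the whole row"; passing
    columns restricts the delete to those column names.
    """
    names = list(columns) if columns is not None else []
    predicate = SlicePredicate(column_names=names) if names else None
    deletion = Deletion(timestamp=timestamp, super_column=super_column,
                        predicate=predicate)
    return {key: {column_family: [Mutation(deletion=deletion)]}}


def merge_mutations(*maps: MutationMap) -> MutationMap:
    """Combine several mutation maps into one, for a single batch call."""
    merged: MutationMap = {}
    for mutation_map in maps:
        for key, by_cf in mutation_map.items():
            for cf, mutations in by_cf.items():
                merged.setdefault(key, {}).setdefault(cf, []).extend(mutations)
    return merged
```

The point is that a single-column delete and a multi-column delete go through exactly the same call, and several of them can be merged into one batch without the caller changing anything.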

 Cheers,

Josep M.

On Mon, Apr 11, 2011 at 3:49 PM, aaron morton aa...@thelastpickle.com wrote:
 AFAIK both follow the same path internally.

 Aaron

 On 12 Apr 2011, at 06:47, Josep Blanquer wrote:

 All,

 From a thrift client perspective using Cassandra, there are currently
 2 options for deleting keys/columns/subcolumns:

 1- One can use the remove call: which only takes a column path so
 you can only delete 'one thing' at a time (an entire key, an entire
 supercolumn, a column or a subcolumn)
 2- A delete mutation: which is more flexible as it allows deleting a
 list of columns and even a slice range of them within a single call.

 The question I have is: is there a noticeable difference in
 performance between issuing a remove call, or a mutation with a single
 delete? In other words, why would I use the remove call if it's much
 less flexible than the mutation?

 ...or another way to put it: is the remove call just there for
 backwards compatibility and will be superseded by the delete mutations
 in the future?

 Cheers,

 Josep M.




Remove call vs. delete mutation

2011-04-11 Thread Josep Blanquer
All,

 From a thrift client perspective using Cassandra, there are currently
2 options for deleting keys/columns/subcolumns:

1- One can use the remove call: which only takes a column path so
you can only delete 'one thing' at a time (an entire key, an entire
supercolumn, a column or a subcolumn)
2- A delete mutation: which is more flexible as it allows deleting a
list of columns and even a slice range of them within a single call.

The question I have is: is there a noticeable difference in
performance between issuing a remove call, or a mutation with a single
delete? In other words, why would I use the remove call if it's much
less flexible than the mutation?

...or another way to put it: is the remove call just there for
backwards compatibility and will be superseded by the delete mutations
in the future?

 Cheers,

Josep M.