Re: cassandra-shuffle time to completion and required disk space

2013-04-28 Thread John Watson
11 nodes
1 keyspace
256 vnodes per node
upgraded 1.1.9 to 1.2.3 a week ago

These are taken just before starting shuffle (ran repair/cleanup the day
before).
During the shuffle, all reads/writes to the cluster were disabled.

nodetool status keyspace:

Load   Tokens  Owns (effective)  Host ID
80.95 GB   256 16.7% 754f9f4c-4ba7-4495-97e7-1f5b6755cb27
87.15 GB   256 16.7% 93f4400a-09d9-4ca0-b6a6-9bcca2427450
98.16 GB   256 16.7% ff821e8e-b2ca-48a9-ac3f-8234b16329ce
142.6 GB   253 100.0%            339c474f-cf19-4ada-9a47-8b10912d5eb3
77.64 GB   256 16.7% e59a02b3-8b91-4abd-990e-b3cb2a494950
194.31 GB  256 25.0% 6d726cbf-147d-426e-a735-e14928c95e45
221.94 GB  256 33.3% 83ca527c-60c5-4ea0-89a8-de53b92b99c8
87.61 GB   256 16.7% c3ea4026-551b-4a14-a346-480e8c1fe283
101.02 GB  256 16.7% df7ba879-74ad-400b-b371-91b45dcbed37
172.44 GB  256 25.0% 78192d73-be0b-4d49-a129-9bec0770efed
108.5 GB   256 16.7% 9889280a-1433-439e-bb84-6b7e7f44d761

nodetool status:

Load   Tokens  Owns   Host ID
142.6 GB   253 97.5%  339c474f-cf19-4ada-9a47-8b10912d5eb3
172.44 GB  256 0.1%   78192d73-be0b-4d49-a129-9bec0770efed
221.94 GB  256 0.4%   83ca527c-60c5-4ea0-89a8-de53b92b99c8
194.31 GB  256 0.1%   6d726cbf-147d-426e-a735-e14928c95e45
77.64 GB   256 0.3%   e59a02b3-8b91-4abd-990e-b3cb2a494950
87.15 GB   256 0.4%   93f4400a-09d9-4ca0-b6a6-9bcca2427450
98.16 GB   256 0.1%   ff821e8e-b2ca-48a9-ac3f-8234b16329ce
87.61 GB   256 0.3%   c3ea4026-551b-4a14-a346-480e8c1fe283
80.95 GB   256 0.4%   754f9f4c-4ba7-4495-97e7-1f5b6755cb27
108.5 GB   256 0.1%   9889280a-1433-439e-bb84-6b7e7f44d761
101.02 GB  256 0.3%   df7ba879-74ad-400b-b371-91b45dcbed37

Here's image of the actual disk usage during shuffle:

https://dl.dropbox.com/s/bx57j1z5c2spqo0/shuffle%20disk%20space.png

A little after 00:00 I disabled/cleared the xfers and restarted the cluster
(those drops around 00:15 are the restarts) before starting cleanup. The
disks are only 540G, and whenever cassandra runs out of disk space, bad
things seem to happen. I was just barely able to run cleanup without running
out of space after the failed shuffle.
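
(For reference, the disable/clear step above was done with the 1.2 shuffle
utility; a sketch, assuming its disable/clear sub-commands and -h/--host
option, so treat the exact invocation as approximate:)

  cassandra-shuffle -h <host> disable   # stop processing queued transfers
  cassandra-shuffle -h <host> clear     # drop the pending transfer queue
  # then restart each node and run:
  nodetool cleanup <keyspace>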

After the restart:

Load   Tokens  Owns (effective)  Host ID
131.73 GB  256 16.7% 754f9f4c-4ba7-4495-97e7-1f5b6755cb27
418.88 GB  255 16.7% 93f4400a-09d9-4ca0-b6a6-9bcca2427450
171.19 GB  255 8.5%  ff821e8e-b2ca-48a9-ac3f-8234b16329ce
142.61 GB  253 100.0%            339c474f-cf19-4ada-9a47-8b10912d5eb3
178.83 GB  257 24.9% e59a02b3-8b91-4abd-990e-b3cb2a494950
442.32 GB  257 25.0% 6d726cbf-147d-426e-a735-e14928c95e45
185.28 GB  257 16.7% c3ea4026-551b-4a14-a346-480e8c1fe283
274.47 GB  255 33.3% 83ca527c-60c5-4ea0-89a8-de53b92b99c8
210.73 GB  256 16.7% df7ba879-74ad-400b-b371-91b45dcbed37
274.49 GB  256 25.0% 78192d73-be0b-4d49-a129-9bec0770efed
106.47 GB  256 16.7% 9889280a-1433-439e-bb84-6b7e7f44d761

It's currently still running cleanup, so the output from status above is a
little inaccurate.

I have everything instrumented with Metrics and pushed into Graphite, so if
there are graphs/data from there that may help, please let me know.

Thanks,

John


On Sun, Apr 28, 2013 at 2:52 PM, aaron morton wrote:

> Can you provide some info on the number of nodes, node load, cluster load
> etc ?
>
> AFAIK shuffle was not an easy thing to test and does not get much real
> world use as only some people will run it and they (normally) use it once.
>
> Any info you can provide may help improve the process.
>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 29/04/2013, at 9:21 AM, John Watson  wrote:
>
> The amount of time/space cassandra-shuffle requires when upgrading to
> using vnodes should really be apparent in documentation (when some is made).
>
> The only semi-noticeable remark about the exorbitant amount of time is a
> bullet point in: http://wiki.apache.org/cassandra/VirtualNodes/Balance
>
> "Shuffling will entail moving a lot of data around the cluster and so has
> the potential to consume a lot of disk and network I/O, and to take a
> considerable amount of time. For this to be an online operation, the
> shuffle will need to operate on a lower priority basis to other streaming
> operations, and should be expected to take days or weeks to complete."
>
> We tried running shuffle on a QA version of our cluster and 2 things were
> brought to light:
>  - Even with no reads/writes it was going to take 20 days
>  - Each machine needed enough free disk space to potentially hold the
> entire cluster's sstables on disk
>
> Regards,
>
> John
>
>
>


Re: CQL Clarification

2013-04-28 Thread Michael Theroux
Yes, that does help.

So, in the link I provided:

http://www.datastax.com/docs/1.0/references/cql/UPDATE

It states:

You can specify these options:

Consistency level
Time-to-live (TTL)
Timestamp for the written columns.

There, "timestamp" is a link to "Working with dates and times", which mentions 
the 64-bit millisecond value.  Is that incorrect?

-Mike

On Apr 28, 2013, at 11:42 AM, Michael Theroux wrote:

> Hello,
> 
> Just wondering if I can get a quick clarification on some simple CQL.  We 
> utilize Thrift CQL queries to access our cassandra setup.  As clarified in a 
> previous question I had, when using CQL and Thrift, timestamps on the 
> cassandra column data are assigned by the server, not the client, unless "AND 
> TIMESTAMP" is utilized in the query, for example:
> 
> http://www.datastax.com/docs/1.0/references/cql/UPDATE
> 
> According to the Datastax documentation, this timestamp should be:
> 
> "Values serialized with the timestamp type are encoded as 64-bit signed 
> integers representing a number of milliseconds since the standard base time 
> known as the epoch: January 1 1970 at 00:00:00 GMT."
> 
> However, my testing showed that updates didn't work when I used a timestamp 
> of this format.  Looking at the Cassandra code, it appears that cassandra 
> will assign a timestamp of System.currentTimeMillis() * 1000 when a timestamp 
> is not specified, which would be the number of microseconds since the 
> standard base time.  In my test environment, setting the timestamp to be the 
> current time * 1000 seems to work.  It seems that if you have an older 
> installation without TIMESTAMP being specified in the CQL, or a mixed 
> environment, the timestamp should be multiplied by 1000.
> 
> Just making sure I'm reading everything properly... improperly setting the 
> timestamp could cause us some serious damage.
> 
> Thanks,
> -Mike
> 
> 



Re: Adding nodes in 1.2 with vnodes requires huge disks

2013-04-28 Thread John Watson
On Sun, Apr 28, 2013 at 2:19 PM, aaron morton wrote:

>  We're going to try running a shuffle before adding a new node again...
>> maybe that will help
>>
> I don't think it will hurt, but I doubt it will help.
>

We had to bail on shuffle since we need to add capacity ASAP and not in 20
days.


>
>>> It seems when new nodes join, they are streamed *all* sstables in the
>>> cluster.
>
> How many nodes did you join, what was the num_tokens ?
> Did you notice streaming from all nodes (in the logs) or are you saying
> this in response to the cluster load increasing ?
>
>
I was only adding 2 nodes at the time (planning to add a total of 12).
We started with a cluster of 12, but are now at 11 since 1 node entered some
weird state when one of the new nodes ran out of disk space.
num_tokens is set to 256 on all nodes.
Yes, nearly all current nodes were streaming to the new ones (which was
great until disk space became an issue.)

>>> The purple line machine, I just stopped the joining process because
>>> the main cluster was dropping mutation messages at this point on a few
>>> nodes (and it still had dozens of sstables to stream.)
>
> Which were the new nodes ?
> Can you show the output from nodetool status?
>
>
The new nodes are the purple and gray lines above all the others.

nodetool status doesn't show joining nodes. I think I saw a bug already
filed for this but I can't seem to find it.


>
> Cheers
>
> -
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 27/04/2013, at 9:35 AM, Bryan Talbot  wrote:
>
> I believe that "nodetool rebuild" is used to add a new datacenter, not
> just a new host to an existing cluster.  Is that what you ran to add the
> node?
>
> -Bryan
>
>
>
> On Fri, Apr 26, 2013 at 1:27 PM, John Watson  wrote:
>
>> Small relief we're not the only ones that had this issue.
>>
>> We're going to try running a shuffle before adding a new node again...
>> maybe that will help
>>
>> - John
>>
>>
>> On Fri, Apr 26, 2013 at 5:07 AM, Francisco Nogueira Calmon Sobral <
>> fsob...@igcorp.com.br> wrote:
>>
>>> I am using the same version and observed something similar.
>>>
>>> I've added a new node, but the instructions from Datastax did not work
>>> for me. Then I ran "nodetool rebuild" on the new node. After this command
>>> finished, it contained twice the load of the other nodes. Even when I
>>> ran "nodetool cleanup" on the older nodes, the situation was the same.
>>>
>>> The problem only seemed to disappear when "nodetool repair" was applied
>>> to all nodes.
>>>
>>> Regards,
>>> Francisco Sobral.
>>>
>>>
>>>
>>>
>>> On Apr 25, 2013, at 4:57 PM, John Watson  wrote:
>>>
>>> After finally upgrading to 1.2.3 from 1.1.9, enabling vnodes, and
>>> running upgradesstables, I figured it would be safe to start adding nodes
>>> to the cluster. Guess not?
>>>
>>> It seems when new nodes join, they are streamed *all* sstables in the
>>> cluster.
>>>
>>>
>>> https://dl.dropbox.com/s/bampemkvlfck2dt/Screen%20Shot%202013-04-25%20at%2012.35.24%20PM.png
>>>
>>> The gray line machine ran out of disk space and for some reason this
>>> cascaded into errors in the cluster about 'no host id' when trying to store
>>> hints for it (even though it hadn't joined yet).
>>> The purple line machine, I just stopped the joining process because the
>>> main cluster was dropping mutation messages at this point on a few nodes
>>> (and it still had dozens of sstables to stream.)
>>>
>>> I followed this:
>>> http://www.datastax.com/docs/1.2/operations/add_replace_nodes
>>>
>>> Is there something missing in that documentation?
>>>
>>> Thanks,
>>>
>>> John
>>>
>>>
>>>
>>
>
>


Re: cassandra-shuffle time to completion and required disk space

2013-04-28 Thread aaron morton
Can you provide some info on the number of nodes, node load, cluster load etc ?

AFAIK shuffle was not an easy thing to test and does not get much real world 
use as only some people will run it and they (normally) use it once.

Any info you can provide may help improve the process. 

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 29/04/2013, at 9:21 AM, John Watson  wrote:

> The amount of time/space cassandra-shuffle requires when upgrading to using 
> vnodes should really be apparent in documentation (when some is made).
> 
> The only semi-noticeable remark about the exorbitant amount of time is a 
> bullet point in: http://wiki.apache.org/cassandra/VirtualNodes/Balance
> 
> "Shuffling will entail moving a lot of data around the cluster and so has the 
> potential to consume a lot of disk and network I/O, and to take a 
> considerable amount of time. For this to be an online operation, the shuffle 
> will need to operate on a lower priority basis to other streaming operations, 
> and should be expected to take days or weeks to complete."
> 
> We tried running shuffle on a QA version of our cluster and 2 things were 
> brought to light:
>  - Even with no reads/writes it was going to take 20 days
>  - Each machine needed enough free disk space to potentially hold the entire 
> cluster's sstables on disk
> 
> Regards,
> 
> John



Re: setcompactionthroughput and setstreamthroughput have no effect

2013-04-28 Thread John Watson
The help command says 0 to disable:
  setcompactionthroughput  - Set the MB/s throughput cap for
compaction in the system, or 0 to disable throttling.
  setstreamthroughput   - Set the MB/s throughput cap for
streaming in the system, or 0 to disable throttling.

I also set both to 1000 and it also had no effect (just in case the
documentation was incorrect.)



On Sun, Apr 28, 2013 at 2:43 PM, Edward Capriolo wrote:

> Out of curiosity, why did you decide to set it to 0 rather than 9?
> Does any documentation anywhere say that setting it to 0 disables the feature?
> I have set streamthroughput higher and seen node join improvements. The
> features do work; however, they are probably not your limiting factor.
> Remember, for streaming you are setting megabytes per second, but network
> cards are measured in megabits per second.
>
>
> On Sun, Apr 28, 2013 at 5:28 PM, John Watson  wrote:
>
>> Running these 2 commands is a noop IO-wise:
>>   nodetool setcompactionthroughput 0
>>   nodetool setstreamthroughput 0
>>
>> If trying to recover or rebuild nodes, it would be super helpful to get
>> more than ~120mbit/s of streaming throughput (per session, or ~500mbit
>> total) and more than ~5% IO utilization on an (8) 15k disk RAID10 (per CF).
>>
>> Even enabling multithreaded_compaction gives marginal improvements (1
>> additional thread doesn't help all that much and was only measurable in CPU
>> usage).
>>
>> I understand that these processes should take lower priority to servicing
>> reads and writes. However, in emergencies it would be a nice feature to
>> have a switch to recover a cluster ASAP.
>>
>> Thanks,
>>
>> John
>>
>
>


Re: question about internode_compression

2013-04-28 Thread aaron morton
It uses Snappy Compression with the default block size. 

There may be a case for allowing configuration, for example so the 
LZ4Compressor can be used. Feel free to raise a ticket at 
https://issues.apache.org/jira/browse/CASSANDRA
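
For context, a sketch of the relevant cassandra.yaml setting in 1.2 (values
as documented there):

  # internode_compression controls whether traffic between nodes is compressed.
  #   all  - compress all traffic
  #   dc   - compress traffic between datacenters only
  #   none - no compression
  internode_compression: all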

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 29/04/2013, at 8:39 AM, John Sanda  wrote:

> When internode_compression is enabled, will the compression algorithm used be 
> the same as whatever I am using for sstable_compression?
> 
> 
> - John



Re: setcompactionthroughput and setstreamthroughput have no effect

2013-04-28 Thread Edward Capriolo
Out of curiosity, why did you decide to set it to 0 rather than 9? Does
any documentation anywhere say that setting it to 0 disables the feature? I
have set streamthroughput higher and seen node join improvements. The
features do work; however, they are probably not your limiting factor.
Remember, for streaming you are setting megabytes per second, but network
cards are measured in megabits per second.


On Sun, Apr 28, 2013 at 5:28 PM, John Watson  wrote:

> Running these 2 commands is a noop IO-wise:
>   nodetool setcompactionthroughput 0
>   nodetool setstreamthroughput 0
>
> If trying to recover or rebuild nodes, it would be super helpful to get
> more than ~120mbit/s of streaming throughput (per session, or ~500mbit
> total) and more than ~5% IO utilization on an (8) 15k disk RAID10 (per CF).
>
> Even enabling multithreaded_compaction gives marginal improvements (1
> additional thread doesn't help all that much and was only measurable in CPU
> usage).
>
> I understand that these processes should take lower priority to servicing
> reads and writes. However, in emergencies it would be a nice feature to
> have a switch to recover a cluster ASAP.
>
> Thanks,
>
> John
>


Re: CQL Clarification

2013-04-28 Thread aaron morton
I think this is some confusion between the two different usages of timestamp. 

The timestamp stored with the column value (not a column of timestamp type) is 
stored at microsecond scale; it's just a 64-bit int and we do not use it as a 
time value. Each mutation in a single request will have a different timestamp, 
as per 
https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/service/QueryState.java#L48
 

A column of type timestamp is internally stored as a DateType, which is 
milliseconds past the epoch: 
https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/db/marshal/DateType.java
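
For illustration, a minimal CQL 2 sketch of the distinction (table and values
are hypothetical). The write timestamp supplied via USING ... AND TIMESTAMP is
in microseconds, e.g. System.currentTimeMillis() * 1000 in Java:

  -- explicit write timestamp, in microseconds since the epoch:
  UPDATE users USING CONSISTENCY QUORUM AND TIMESTAMP 1367107200000000
    SET status = 2 WHERE KEY = 'user1';

A column declared *of type* timestamp, by contrast, holds a millisecond value.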

Does that help ? 

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 29/04/2013, at 3:42 AM, Michael Theroux  wrote:

> Hello,
> 
> Just wondering if I can get a quick clarification on some simple CQL.  We 
> utilize Thrift CQL queries to access our cassandra setup.  As clarified in a 
> previous question I had, when using CQL and Thrift, timestamps on the 
> cassandra column data are assigned by the server, not the client, unless "AND 
> TIMESTAMP" is utilized in the query, for example:
> 
> http://www.datastax.com/docs/1.0/references/cql/UPDATE
> 
> According to the Datastax documentation, this timestamp should be:
> 
> "Values serialized with the timestamp type are encoded as 64-bit signed 
> integers representing a number of milliseconds since the standard base time 
> known as the epoch: January 1 1970 at 00:00:00 GMT."
> 
> However, my testing showed that updates didn't work when I used a timestamp 
> of this format.  Looking at the Cassandra code, it appears that cassandra 
> will assign a timestamp of System.currentTimeMillis() * 1000 when a timestamp 
> is not specified, which would be the number of microseconds since the 
> standard base time.  In my test environment, setting the timestamp to be the 
> current time * 1000 seems to work.  It seems that if you have an older 
> installation without TIMESTAMP being specified in the CQL, or a mixed 
> environment, the timestamp should be multiplied by 1000.
> 
> Just making sure I'm reading everything properly... improperly setting the 
> timestamp could cause us some serious damage.
> 
> Thanks,
> -Mike
> 
> 



setcompactionthroughput and setstreamthroughput have no effect

2013-04-28 Thread John Watson
Running these 2 commands is a noop IO-wise:
  nodetool setcompactionthroughput 0
  nodetool setstreamthroughput 0
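
(For reference, these commands override the following cassandra.yaml settings
at runtime; a sketch, with what I believe are the 1.2 defaults:)

  compaction_throughput_mb_per_sec: 16              # megabytes/s; 0 disables throttling
  stream_throughput_outbound_megabits_per_sec: 200  # note: megabits/s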

If trying to recover or rebuild nodes, it would be super helpful to get
more than ~120mbit/s of streaming throughput (per session, or ~500mbit
total) and more than ~5% IO utilization on an (8) 15k disk RAID10 (per CF).

Even enabling multithreaded_compaction gives marginal improvements (1
additional thread doesn't help all that much and was only measurable in CPU
usage).

I understand that these processes should take lower priority to servicing
reads and writes. However, in emergencies it would be a nice feature to
have a switch to recover a cluster ASAP.

Thanks,

John


Re: Really odd issue (AWS related?)

2013-04-28 Thread Alex Major
Hi Mike,

We had issues with the ephemeral drives when we first got started, although
we never got to the bottom of it, so I can't help much with troubleshooting,
unfortunately. Contrary to a lot of the comments on the mailing list, we've
actually had a lot more success with EBS drives (PIOPS!). I'd definitely
suggest trying 4 EBS drives striped (RAID 0) with PIOPS.

You could be having a noisy neighbour problem. I don't believe that
m1.large or m1.xlarge instances get all of the actual hardware;
virtualisation on EC2 still sucks at isolating resources.

We've also had more success with Ubuntu on EC2; not so much with our
Cassandra nodes, but some of our other services didn't run as well on Amazon
Linux AMIs.

Alex



On Sun, Apr 28, 2013 at 7:12 PM, Michael Theroux wrote:

> I forgot to mention,
>
> When things go really bad, I'm seeing I/O waits in the 80->95% range.  I
> restarted cassandra once when a node is in this situation, and it took 45
> minutes to start (primarily reading SSTables).  Typically, a node would
> start in about 5 minutes.
>
> Thanks,
> -Mike
>
> On Apr 28, 2013, at 12:37 PM, Michael Theroux wrote:
>
> Hello,
>
> We've done some additional monitoring, and I think we have more
> information.  We've been collecting vmstat information every minute,
> attempting to catch a node with issues.
>
> So, it appears that the cassandra node runs fine.  Then suddenly, without
> any correlation to any event that I can identify, the I/O wait time goes
> way up, and stays up indefinitely.  Even non-cassandra I/O activities
> (such as snapshots and backups) start causing large I/O wait times when
> they typically would not.  Previous to an issue, we would typically see I/O
> wait times of 3-4% with very few blocked processes on I/O.  Once this issue
> manifests itself, I/O wait times for the same activities jump to 30-40%
> with many blocked processes.  The I/O wait times do go back down when there
> is literally no activity.
>
> -  Updating the node to the latest Amazon Linux patches and rebooting the
> instance doesn't correct the issue.
> -  Backing up the node, and replacing the instance does correct the issue.
>  I/O wait times return to normal.
>
> One relatively recent change we've made is we upgraded to m1.xlarge
> instances, which have 4 ephemeral drives available.  We create a logical
> volume from the 4 drives with the idea that we should be able to get
> increased I/O throughput.  When we ran m1.large instances, we had the same
> setup, although it was only using 2 ephemeral drives.  We chose to use LVM
> vs. mdadm because we were having issues getting mdadm to create the raid
> volume reliably on restart (and research showed that this was a common
> problem).  LVM just worked (and had worked for months before this upgrade).
>
> For reference, this is the script we used to create the logical volume:
>
> vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
> lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
> blockdev --setra 65536 /dev/mnt_vg/mnt_lv
> sleep 2
> mkfs.xfs /dev/mnt_vg/mnt_lv
> sleep 3
> mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
> sleep 3
>
> Another tidbit... thus far (and this may be only a coincidence), we've only
> had to replace DB nodes within a single availability zone within us-east.
> Other availability zones, in the same region, have yet to show an issue.
>
> It looks like I'm going to need to replace a third DB node today.  Any
> advice would be appreciated.
>
> Thanks,
> -Mike
>
>
> On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:
>
> Thanks.
>
> We weren't monitoring this value when the issue occurred, and this
> particular issue has not appeared for a couple of days (knock on wood).
>  Will keep an eye out though,
>
> -Mike
>
> On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:
>
> top command? st : time stolen from this vm by the hypervisor
>
> jason
>
>
> On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux wrote:
>
>> Sorry, Not sure what CPU steal is :)
>>
>> I have AWS console with detailed monitoring enabled... things seem to
>> track close to the minute, so I can see the CPU load go to 0... then jump
>> at about the minute Cassandra reports the dropped messages,
>>
>> -Mike
>>
>> On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
>>
>> The messages appear right after the node "wakes up".
>>
>> Are you tracking CPU steal ?
>>
>>-
>> Aaron Morton
>> Freelance Cassandra Consultant
>> New Zealand
>>
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 25/04/2013, at 4:15 AM, Robert Coli  wrote:
>>
>> On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux 
>> wrote:
>>
>> Another related question.  Once we see messages being dropped on one
>> node, our cassandra client appears to see this, reporting errors.  We use
>> LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would see
>> an error?  If only one node reports an error, shouldn't the consistency
>> level prevent the client from seeing an issue?
>>
>>
>> If the client is talking to a broken/degraded coordinator node, RF/CL
>> are unable to protect it from RPCTimeout. If it is unable to
>> coordinate the request in a timely fashion, your clients will get
>> errors.
>>
>> =Rob

cassandra-shuffle time to completion and required disk space

2013-04-28 Thread John Watson
The amount of time/space cassandra-shuffle requires when upgrading to using
vnodes should really be apparent in documentation (when some is made).

The only semi-noticeable remark about the exorbitant amount of time is a
bullet point in: http://wiki.apache.org/cassandra/VirtualNodes/Balance

"Shuffling will entail moving a lot of data around the cluster and so has
the potential to consume a lot of disk and network I/O, and to take a
considerable amount of time. For this to be an online operation, the
shuffle will need to operate on a lower priority basis to other streaming
operations, and should be expected to take days or weeks to complete."

We tried running shuffle on a QA version of our cluster and 2 things were
brought to light:
 - Even with no reads/writes it was going to take 20 days
 - Each machine needed enough free disk space to potentially hold the entire
cluster's sstables on disk

Regards,

John


Re: Adding nodes in 1.2 with vnodes requires huge disks

2013-04-28 Thread aaron morton
> We're going to try running a shuffle before adding a new node again... maybe 
> that will help
I don't think it will hurt, but I doubt it will help. 


>> It seems when new nodes join, they are streamed *all* sstables in the 
>> cluster.

> 

How many nodes did you join, what was the num_tokens ? 
Did you notice streaming from all nodes (in the logs) or are you saying this in 
response to the cluster load increasing ? 

>> The purple line machine, I just stopped the joining process because the main 
>> cluster was dropping mutation messages at this point on a few nodes (and it 
>> still had dozens of sstables to stream.)
Which were the new nodes ?
Can you show the output from nodetool status?


Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 27/04/2013, at 9:35 AM, Bryan Talbot  wrote:

> I believe that "nodetool rebuild" is used to add a new datacenter, not just a 
> new host to an existing cluster.  Is that what you ran to add the node?
> 
> -Bryan
> 
> 
> 
> On Fri, Apr 26, 2013 at 1:27 PM, John Watson  wrote:
> Small relief we're not the only ones that had this issue.
> 
> We're going to try running a shuffle before adding a new node again... maybe 
> that will help
> 
> - John
> 
> 
> On Fri, Apr 26, 2013 at 5:07 AM, Francisco Nogueira Calmon Sobral 
>  wrote:
> I am using the same version and observed something similar.
> 
> I've added a new node, but the instructions from Datastax did not work for 
> me. Then I ran "nodetool rebuild" on the new node. After this command 
> finished, it contained twice the load of the other nodes. Even when I ran 
> "nodetool cleanup" on the older nodes, the situation was the same.
> 
> The problem only seemed to disappear when "nodetool repair" was applied to 
> all nodes.
> 
> Regards,
> Francisco Sobral.
> 
> 
> 
> 
> On Apr 25, 2013, at 4:57 PM, John Watson  wrote:
> 
>> After finally upgrading to 1.2.3 from 1.1.9, enabling vnodes, and running 
>> upgradesstables, I figured it would be safe to start adding nodes to the 
>> cluster. Guess not?
>> 
>> It seems when new nodes join, they are streamed *all* sstables in the 
>> cluster.
>> 
>> https://dl.dropbox.com/s/bampemkvlfck2dt/Screen%20Shot%202013-04-25%20at%2012.35.24%20PM.png
>> 
>> The gray line machine ran out of disk space and for some reason this cascaded 
>> into errors in the cluster about 'no host id' when trying to store hints for 
>> it (even though it hadn't joined yet).
>> The purple line machine, I just stopped the joining process because the main 
>> cluster was dropping mutation messages at this point on a few nodes (and it 
>> still had dozens of sstables to stream.)
>> 
>> I followed this: 
>> http://www.datastax.com/docs/1.2/operations/add_replace_nodes
>> 
>> Is there something missing in that documentation?
>> 
>> Thanks,
>> 
>> John
> 
> 
> 



Re: cost estimate about some Cassandra patchs

2013-04-28 Thread aaron morton
> Does anyone know enough of the inner working of Cassandra to tell me how much 
> work is needed to patch Cassandra to enable such communication 
> vectorization/batch ?
>  
Assuming you mean "have the coordinator send multiple row read/write requests 
in a single message to replicas"

Pretty sure this has been raised as a ticket before but I cannot find one now. 

It would be a significant change and I'm not sure how big the benefit is. To 
send the messages, the coordinator places them in a queue; there is little delay 
in sending. It then waits on them asynchronously. So there may be some saving on 
networking, but from the coordinator's point of view I think the impact is minimal. 

What is your use case?

Cheers


-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 27/04/2013, at 4:04 AM, DE VITO Dominique  
wrote:

> Hi,
>  
> We have created a new partitioner that groups some rows with **different** row 
> keys on the same replicas.
>  
> But neither batch_mutate nor multiget_slice is able to take 
> advantage of this partitioner-defined placement to vectorize/batch 
> communications between the coordinator and the replicas.
>  
> Does anyone know enough of the inner working of Cassandra to tell me how much 
> work is needed to patch Cassandra to enable such communication 
> vectorization/batch ?
>  
> Thanks.
>  
> Regards,
> Dominique
>  
>  



question about internode_compression

2013-04-28 Thread John Sanda
When internode_compression is enabled, will the compression algorithm used
be the same as whatever I am using for sstable_compression?


- John


Re: Is Cassandra oversized for this kind of use case?

2013-04-28 Thread aaron morton
Sounds like something C* would be good at. 

I would do some searching on time series data in Cassandra, such as 
http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra and 
definitely consider storing data at the smallest level of granularity. 
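
For example, a minimal CQL 3 sketch of the pattern (all names hypothetical):
one partition per machine with readings clustered by time, so a status can be
stored at whatever granularity it changes:

  CREATE TABLE machine_status (
    machine_id  text,
    reported_at timestamp,
    status      int,
    PRIMARY KEY (machine_id, reported_at)
  );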

On the analytics side there is good news and not-so-good news. First, the good 
news: reads do not block writes, as they do in a traditional RDBMS (without 
MVCC) running with a transaction isolation of Repeatable Read or higher. 

The not-so-good news: it's not as easy to support the wide range of 
analytical queries that you are used to with SQL using the standard Thrift/CQL 
API. If you need very flexible analysis I recommend looking into Hive / Pig 
with Hadoop; DataStax Enterprise is a commercial product but free for 
development, and a great way to learn without having to worry about the setup: 
http://www.datastax.com/

You may also be interested in http://www.pentaho.com/ or 
http://www.karmasphere.com/

Hope that helps. 

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 27/04/2013, at 5:26 AM, "Hiller, Dean"  wrote:

> I would at least start with 3 cheap nodes with RF=3 and start with CL=TWO on 
> writes and reads, most likely, to get your feet wet.  Don't buy very expensive 
> computers like a lot do getting into the game for the first time… every time I 
> walk into a new gig, they seem to think they need to spend 6-10k per node.  I 
> think this kind of scenario sounds fine for Cassandra.  When you say 
> virtualize, I believe you mean "use VMs"… many use Amazon VMs and there is 
> stuff to configure if you are on Amazon specifically for this.
> 
> If you are on your own VMs, you do need to worry about whether two nodes end 
> up on the same hardware stealing resources from each other, or whether 
> hardware fails as well.  I.e., the idea in noSQL is you typically have 3 
> copies of all data, so if one node goes down, you are still live with CL=TWO.
> 
> Also, plan on doing ~300GB per node typically depending on how it works out 
> in testing.
> 
> Later,
> Dean
> 
> From: Marc Teufel <teufel.m...@googlemail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Friday, April 26, 2013 10:59 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Is Cassandra oversized for this kind of use case?
> Subject: Re: Is Cassandra oversized for this kind of use case?
> 
> Okay, one billion rows of data is a lot; compared to that I am far, far away - 
> does that mean I can stay with Oracle? Maybe.
> But you're right when you say it's not only about big data but also about your 
> needs.
> 
> So storing the data is one part; doing analytical analysis is the second. I 
> do a lot of calculations and queries to generate management criteria about 
> how production is actually going, how production went over the last 
> week, month, years and so on. Saving at a 5 minute rhythm is only a 
> compromise to reduce the amount of data - maybe in the future the use case 
> will change and be about storing the status of each machine as soon as it 
> changes. This will of course increase the amount of data and the complexity 
> of my queries again. And sure, I show "live" data today... 5 minute old live 
> data... but if I tell the CEO that I am also able to work with real live 
> data, I am sure this is what he wants to get  ;-)
> 
> Can you recommend me to use Cassandra for this kind of scenario or is this 
> oversized ?
> 
> Does it makes sense to start with 2 Nodes ?
> 
> Can i virtualize these two Nodes ?
> 
> 
> Thx a lot for your assistance.
> 
> Marc
> 
> 
> 
> 
> 2013/4/26 Hiller, Dean <dean.hil...@nrel.gov>
> Well, it depends more on what you will do with the data.  I know I was on a 
> Sybase (RDBMS) with 1 billion rows, but it was getting close to not being able 
> to handle more (constraints had to be turned off, all sorts of 
> optimizations done, expert consultants brought in, and everything).
> 
> BUT there are other use cases where noSQL is great for (ie. It is not just 
> great for big data type systems).  It is great for really high write 
> throughput as you can add more nodes and handle more writes/second than an 
> RDBMS very easily yet you may be doing so many deletes that the system 
> constantly stays at a small data set.
> 
> You may want to analyze the data constantly or near real time involving huge 
> amounts of reads / second in which case noSQL can be better as well.
> 
> Ie. Nosql is not just for big data.  I know with PlayOrm for cassandra, we 
> have handled many different use cases out there.
> 
> Later,
> Dean
> 
> From: Marc Teufel <teufel.m...@googlemail.com>
> Reply-To: "user@cassandra.apache.org"

Re: Deletes, null values

2013-04-28 Thread aaron morton
What's your table definition ? 

>> select '1228#16857','1228#16866','1228#16875','1237#16544','1237#16553'
>> from myCF where key = 'all';

The output looks correct to me. CQL tables return values, including null, for 
all of the selected columns.

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 27/04/2013, at 12:48 AM, Sorin Manolache  wrote:

> On 2013-04-26 11:55, Alain RODRIGUEZ wrote:
>> Of course:
>> 
>> From CQL 2 (cqlsh -2):
>> 
>> delete '183#16684','183#16714','183#16717' from myCF where key = 'all';
>> 
>> And selecting this data as follow gives me the result above:
>> 
>> select '1228#16857','1228#16866','1228#16875','1237#16544','1237#16553'
>> from myCF where key = 'all';
>> 
>> From thrift (phpCassa client):
>> 
>> $pool = new
>> ConnectionPool('myKeyspace',array('192.168.100.201'),6,0,3,3);
>> $my_cf= new ColumnFamily($pool, 'myCF', true, true,
>> ConsistencyLevel::QUORUM, ConsistencyLevel::QUORUM);
>> $my_cf->remove('all', array('1228#16857','1228#16866','1228#16875'));
>> 
> 
> I see. I'm sorry, I know nothing about phpCassa. I use batch_mutation with 
> deletions and it works. But I guess phpCassa must use the same thrift 
> primitives.
> 
> Sorin
> 
> 
>> 
>> 
>> 2013/4/25 Sorin Manolache mailto:sor...@gmail.com>>
>> 
>>On 2013-04-25 11:48, Alain RODRIGUEZ wrote:
>> 
>>Hi, I tried to delete some columns using cql2 as well as thrift on
>>C*1.2.2 and instead of being unreachable, deleted columns have a
>>null value.
>> 
>>I am using no value in this CF, the only information I use is the
>>existence of the column. So when I select all the column for a
>>given key
>>I have the following returned:
>> 
>> 1228#16857 | 1228#16866 | 1228#16875 | 1237#16544 | 1237#16553
>>------------+------------+------------+------------+-----------
>>       null |       null |       null |            |
>> 
>> 
>>This is quite annoying since my app thinks that I have 5 columns
>>there
>>when I should have 2 only.
>> 
>>I first thought that this was a visible marker of tombstones but
>>they
>>didn't vanish after a major compaction.
>> 
>>How can I get rid of these null/ghost columns and why does it
>>happen ?
>> 
>> 
>>I do something similar but I don't see null values. Could you please
>>post the code where you delete the columns?
>> 
>>Sorin
>> 
>> 
> 



Re: Slow retrieval using secondary indexes

2013-04-28 Thread aaron morton
Try the request tracing in 1.2 
(http://www.datastax.com/dev/blog/tracing-in-cassandra-1-2); it may point to 
the difference. 
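
For example, from cqlsh in 1.2 (a sketch; assumes the CF is visible to cqlsh
and uses the case-sensitive names from the query below):

  TRACING ON;
  SELECT * FROM "Users" WHERE "mahoutUserid" = 30127944399716352;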

> In our model the secondary index is also unique, as the primary key is. Is it 
> better, in this case, to create another CF mapping the secondary index to the 
> key?
IMHO if you have a request that is frequently used as part of a hot code path 
it is still a good idea to support that with a custom CF. 
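
For example, a sketch of such a mapping CF in CQL 3 (names assumed from the
query in question); writes update both CFs, and the hot lookup becomes a
primary-key get:

  CREATE TABLE users_by_mahout_userid (
    mahout_userid bigint PRIMARY KEY,
    user_key      text
  );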

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 27/04/2013, at 12:27 AM, Francisco Nogueira Calmon Sobral 
 wrote:

> Hi all!
> 
> We are using Cassandra 1.2.1 with an 8 node cluster running at Amazon. We 
> started with 6 nodes and added the other 2 later. When performing some reads 
> in Cassandra, we observed a large difference between gets using the primary 
> key and gets using secondary indexes:
> 
> 
> [default@Sessions] get Users where mahoutUserid = 30127944399716352;
> ---
> RowKey: STQ0TTNII2LS211YYJI4GEV80M1SE8
> => (column=mahoutUserid, value=30127944399716352, timestamp=1366820944696000)
> 
> 1 Row Returned.
> Elapsed time: 3508 msec(s).
> 
> [default@Sessions] get Users['STQ0TTNII2LS211YYJI4GEV80M1SE8'];
> => (column=mahoutUserid, value=30127944399716352, timestamp=1366820944696000)
> Returned 1 results.
> 
> Elapsed time: 3.06 msec(s).
> 
> 
> In our model the secondary index is also unique, as the primary key is. Is it 
> better, in this case, to create another CF mapping the secondary index to the 
> key?
> 
> Best regards,
> Francisco Sobral.



Re: Really odd issue (AWS related?)

2013-04-28 Thread Michael Theroux
I forgot to mention,

When things go really bad, I'm seeing I/O waits in the 80->95% range.  I 
restarted cassandra once when a node is in this situation, and it took 45 
minutes to start (primarily reading SSTables).  Typically, a node would start 
in about 5 minutes.

Thanks,
-Mike
 
On Apr 28, 2013, at 12:37 PM, Michael Theroux wrote:

> Hello,
> 
> We've done some additional monitoring, and I think we have more information. 
> We've been collecting vmstat information every minute, attempting to catch a 
> node with issues.
> 
> So, it appears that the cassandra node runs fine.  Then suddenly, without 
> any correlation to any event that I can identify, the I/O wait time goes way 
> up, and stays up indefinitely.  Even non-cassandra I/O activities (such as 
> snapshots and backups) start causing large I/O wait times when they typically 
> would not.  Previous to an issue, we would typically see I/O wait times of 
> 3-4% with very few blocked processes on I/O.  Once this issue manifests 
> itself, I/O wait times for the same activities jump to 30-40% with many 
> blocked processes.  The I/O wait times do go back down when there is 
> literally no activity.
> 
> -  Updating the node to the latest Amazon Linux patches and rebooting the 
> instance doesn't correct the issue.
> -  Backing up the node, and replacing the instance does correct the issue.  
> I/O wait times return to normal.
> 
> One relatively recent change we've made is we upgraded to m1.xlarge instances, 
> which have 4 ephemeral drives available.  We create a logical volume from the 
> 4 drives with the idea that we should be able to get increased I/O 
> throughput.  When we ran m1.large instances, we had the same setup, although 
> it was only using 2 ephemeral drives.  We chose to use LVM vs. mdadm because 
> we were having issues getting mdadm to create the raid volume reliably on 
> restart (and research showed that this was a common problem).  LVM just 
> worked (and had worked for months before this upgrade).
> 
> For reference, this is the script we used to create the logical volume:
> 
> vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
> lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
> blockdev --setra 65536 /dev/mnt_vg/mnt_lv
> sleep 2
> mkfs.xfs /dev/mnt_vg/mnt_lv
> sleep 3
> mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
> sleep 3
> 
> Another tidbit... thus far (and this may be only a coincidence), we've only 
> had to replace DB nodes within a single availability zone within us-east. 
> Other availability zones, in the same region, have yet to show an issue.
> 
> It looks like I'm going to need to replace a third DB node today.  Any advice 
> would be appreciated.
> 
> Thanks,
> -Mike
> 
> 
> On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:
> 
>> Thanks.
>> 
>> We weren't monitoring this value when the issue occurred, and this 
>> particular issue has not appeared for a couple of days (knock on wood).  
>> Will keep an eye out though,
>> 
>> -Mike
>> 
>> On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:
>> 
>>> top command? st : time stolen from this vm by the hypervisor
>>> 
>>> jason
>>> 
>>> 
>>> On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux  
>>> wrote:
>>> Sorry, Not sure what CPU steal is :)
>>> 
>>> I have AWS console with detailed monitoring enabled... things seem to track 
>>> close to the minute, so I can see the CPU load go to 0... then jump at 
>>> about the minute Cassandra reports the dropped messages,
>>> 
>>> -Mike
>>> 
>>> On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
>>> 
> The messages appear right after the node "wakes up".
 Are you tracking CPU steal ? 
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 25/04/2013, at 4:15 AM, Robert Coli  wrote:
 
> On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux  
> wrote:
>> Another related question.  Once we see messages being dropped on one 
>> node, our cassandra client appears to see this, reporting errors.  We 
>> use LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients 
>> would see an error?  If only one node reports an error, shouldn't the 
>> consistency level prevent the client from seeing an issue?
> 
> If the client is talking to a broken/degraded coordinator node, RF/CL
> are unable to protect it from RPCTimeout. If it is unable to
> coordinate the request in a timely fashion, your clients will get
> errors.
> 
> =Rob
 
>>> 
>>> 
>> 
> 



Re: Many creation/inserts in parallel

2013-04-28 Thread aaron morton
> At first many CF are being created in parallel (about 1000 CF).
> 
> 
Can you explain this in a bit more detail ? By in parallel do you mean multiple 
threads creating CF's at the same time ?

I would also recommend taking a second look at your data model, you probably do 
not want to create so many CF's. 

>  During tests we're receiving some exceptions from driver, e.g.:
> 
> 

The CF you are trying to read / write from does not exist. Check if the table 
exists using cqlsh / cassandra-cli. 

Check your code to make sure it was created. 
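
For example (a sketch; the keyspace name is assumed):

  -- in cqlsh:
  USE my_keyspace;
  DESCRIBE COLUMNFAMILY table_78_9;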

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 26/04/2013, at 10:49 PM, Sasha Yanushkevich  wrote:

> Hi All
> 
> We are testing Cassandra 1.2.3 (3 nodes with RF:2) with the FluentCassandra 
> driver. At first, many CFs are created in parallel (about 1000 CFs). After 
> creation is done, many insertions of small amounts of data into the DB 
> follow. During the tests we're receiving some exceptions from the driver, e.g.:
> 
> FluentCassandra.Operations.CassandraOperationException: unconfigured 
> columnfamily table_78_9
> and
> FluentCassandra.Operations.CassandraOperationException: Connection to 
> Cassandra has timed out
> 
> Though in Cassandra's logs there are no exceptions.
> 
> What should we do to handle these exceptions?
> 
> -- 
> Best regards,
> Alexander



Re: CQL indexing

2013-04-28 Thread aaron morton
This discussion belongs on the user list, also please only email one list at a 
time. 

The article discusses improvements in secondary indexes in 1.2 
http://www.datastax.com/dev/blog/improving-secondary-index-write-performance-in-1-2

If you have some more specific questions let us know. 

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 26/04/2013, at 7:01 PM, Sri Ramya  wrote:

> HI
> 
> In CQL, to perform a query based on columns you have to create an index on
> that column. What exactly happens when we create an index on a column?
> What might the index column family contain?



Re: 1.2.3 and 1.2.4 memory usage growth on idle cluster

2013-04-28 Thread aaron morton
> INFO 11:10:56,273 GC for ParNew: 1039 ms for 1 collections, 6631277912 used; 
> max is 10630070272
It depends on the settings. It looks like you are using non-default JVM 
settings. 

I'd recommend restoring the default JVM settings as a start. 
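
For example, in conf/cassandra-env.sh, commenting out any explicit sizing lets
Cassandra compute its defaults (a sketch; the values shown are placeholders):

  #MAX_HEAP_SIZE="10G"
  #HEAP_NEWSIZE="1600M"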

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 25/04/2013, at 9:30 PM, Igor  wrote:

> Hello
> 
> Has anybody seen memory problems on an idle cluster?
> I have an 8-node ring with cassandra 1.2.3 which has never been used and has 
> stayed idle for several weeks. Yesterday, when I decided to upgrade it to 
> 1.2.4, I found a lot of messages like
> 
> INFO 11:10:56,273 GC for ParNew: 1039 ms for 1 collections, 6631277912 used; 
> max is 10630070272
> INFO 11:10:56,273 Pool NameActive   Pending Blocked
> INFO 11:10:56,275 ReadStage 0 0 0
> INFO 11:10:56,276 RequestResponseStage  0 0 0
> INFO 11:10:56,276 ReadRepairStage   0 0 0
> INFO 11:10:56,277 MutationStage 0 0 0
> INFO 11:10:56,277 ReplicateOnWriteStage 0 0 0
> INFO 11:10:56,278 GossipStage   0 0 0
> INFO 11:10:56,278 AntiEntropyStage  0 0 0
> INFO 11:10:56,278 MigrationStage0 0 0
> INFO 11:10:56,279 MemtablePostFlusher   0 0 0
> INFO 11:10:56,279 FlushWriter   0 0 0
> INFO 11:10:56,280 MiscStage 0 0 0
> INFO 11:10:56,280 commitlog_archiver0 0 0
> INFO 11:10:56,280 InternalResponseStage 0 0 0
> INFO 11:10:56,281 HintedHandoff 0 0 0
> INFO 11:10:56,281 CompactionManager 0 0
> INFO 11:10:56,281 MessagingService   n/a   0,0
> INFO 11:10:56,281 Cache Type Size Capacity   
> KeysToSave
>Provider
> INFO 11:10:56,281 KeyCache   7368   104857600   all
> 
> INFO 11:10:56,281 RowCache   0   0   all   org.apache.cassandra.cache.SerializingCacheProvider
> INFO 11:10:56,281 ColumnFamilyMemtable ops,data
> INFO 11:10:56,281 system.local 4,52
> INFO 11:10:56,281 system.peers  30,6093
> INFO 11:10:56,282 system.batchlog   0,0
> INFO 11:10:56,282 system.NodeIdInfo 0,0
> INFO 11:10:56,282 system.LocationInfo   0,0
> INFO 11:10:56,282 system.Schema 0,0
> INFO 11:10:56,282 system.Migrations 0,0
> INFO 11:10:56,282 system.schema_keyspaces   0,0
> INFO 11:10:56,282 system.schema_columns 0,0
> INFO 11:10:56,282 system.schema_columnfamilies 0,0
> INFO 11:10:56,282 system.IndexInfo  0,0
> INFO 11:10:56,282 system.range_xfers0,0
> INFO 11:10:56,282 system.peer_events0,0
> INFO 11:10:56,283 system.hints  0,0
> INFO 11:10:56,283 system.HintsColumnFamily  0,0
> INFO 11:10:56,283 system_auth.users 0,0
> INFO 11:10:56,283 system_traces.sessions0,0
> INFO 11:10:56,283 system_traces.events  0,0
> INFO 11:11:21,205 GC for ParNew: 1035 ms for 1 collections, 6633037168 used; 
> max is 10630070272
> 
> So you can see there is no activity at all. And what I can see from the Java 
> heap graph is that it constantly grows. I plan to use this ring in prod, but 
> this strange behaviour confuses me.
> 



Re: Secondary Index on table with a lot of data crashes Cassandra

2013-04-28 Thread aaron morton
> What are we doing wrong? Can it be that Cassandra is actually trying to read 
> all the CF data rather than just the keys! (actually, it doesn't need to go 
> to the users CF at all - all the data it needs is in the index CF)
>  
Data is not stored as a BTree; that's the RDBMS approach. We hit the in-memory 
bloom filter, then perhaps the -index.db and finally the -data.db. While in 
this edge case it may be possible to serve your query just from the -index.db, 
there is no optimisation in place for that. 

>  
> Select user_name from users where status = 2; 
>  
> Always crashes.
>  
What is the error ? 

> 2. understand if there is something in this use case which indicates that we 
> are not using Cassandra the way it is meant. 
Just like an RDBMS database, things are fastest when you use the primary key, a 
bit slower when you use a non-primary index, and slowest when you do not use 
an index. 

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 25/04/2013, at 8:32 PM, moshe.kr...@barclays.com wrote:

> IMHO: user_name is not a column, it is the row key. Therefore, according 
> to http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ , the row does 
> not contain a relevant column index, which causes the iterator to read each 
> column (including value) of each row.
>  
> I believe that instead of referring to user_name as if it were a column, you 
> need to refer to it via the reserved word “KEY”, e.g.:
>  
> Select KEY from users where status = 2; 
>  
> Always glad to share a theory with a friend….
>  
>  
> From: Tamar Rosen [mailto:ta...@correlor.com] 
> Sent: Thursday, April 25, 2013 11:04 AM
> To: user@cassandra.apache.org
> Subject: Secondary Index on table with a lot of data crashes Cassandra
>  
> Hi,
>  
> We have a case of a reproducible crash, probably due to out of memory, but I 
> don't understand why. 
>  
> The installation is currently single node. 
>  
> We have a column family with approx 5 rows. 
>  
> In cql, the CF definition is:
>  
>  
> CREATE TABLE users (
>   user_name text PRIMARY KEY,
>   big_json text,
>   status int
> );
>  
> Each big_json can have 500K or more of data.
>  
> There is also a secondary index on the status column. 
> Status can have various values, over 90% of all rows have status = 2. 
>  
>  
> Calling:
>  
> Select user_name from users limit 8;
>  
> Is pretty fast
>  
>  
>  
> Calling:
>  
> Select user_name from users where status = 1; 
> is slower, even though much less data is returned.
>  
> Calling:
>  
> Select user_name from users where status = 2; 
>  
> Always crashes.
>  
>  
> What are we doing wrong? Can it be that Cassandra is actually trying to read 
> all the CF data rather than just the keys! (actually, it doesn't need to go 
> to the users CF at all - all the data it needs is in the index CF)
>  
>  
> Also, in the code I am doing the same using Astyanax index query with 
> pagination, and the behavior is the same. 
> 
> 
> Please help me:
>  
> 1. solve the immediate issue
>  
> 2. understand if there is something in this use case which indicates that we 
> are not using Cassandra the way it is meant. 
>  
> 
> 
> Thanks,
>  
> 
> 
> Tamar Rosen
>  
> Correlor.com
>  
> 
> 
>  
> 



Re: Really odd issue (AWS related?)

2013-04-28 Thread Michael Theroux
Hello,

We've done some additional monitoring, and I think we have more information. 
We've been collecting vmstat information every minute, attempting to catch a 
node with issues.
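
(For reference, a minimal sketch of how such per-minute collection can be done
via cron; the log path is illustrative:)

  # append one 1-second vmstat sample per minute
  * * * * * vmstat 1 2 | tail -n 1 >> /var/log/vmstat.log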

So, it appears that the cassandra node runs fine.  Then suddenly, without any 
correlation to any event that I can identify, the I/O wait time goes way up, 
and stays up indefinitely.  Even non-cassandra I/O activities (such as 
snapshots and backups) start causing large I/O wait times when they typically 
would not.  Previous to an issue, we would typically see I/O wait times of 
3-4% with very few blocked processes on I/O.  Once this issue manifests 
itself, I/O wait times for the same activities jump to 30-40% with many 
blocked processes.  The I/O wait times do go back down when there is 
literally no activity.

-  Updating the node to the latest Amazon Linux patches and rebooting the 
instance doesn't correct the issue.
-  Backing up the node, and replacing the instance does correct the issue.  I/O 
wait times return to normal.

One relatively recent change we've made is we upgraded to m1.xlarge instances, 
which have 4 ephemeral drives available.  We create a logical volume from the 
4 drives with the idea that we should be able to get increased I/O throughput. 
When we ran m1.large instances, we had the same setup, although it was only 
using 2 ephemeral drives.  We chose to use LVM vs. mdadm because we were 
having issues getting mdadm to create the raid volume reliably on restart 
(and research showed that this was a common problem).  LVM just worked (and 
had worked for months before this upgrade).

For reference, this is the script we used to create the logical volume:

vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde    # volume group across all 4 ephemeral drives
lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K        # 4-way striped LV, 256K stripe size
blockdev --setra 65536 /dev/mnt_vg/mnt_lv              # readahead: 65536 512-byte sectors (32MB)
sleep 2
mkfs.xfs /dev/mnt_vg/mnt_lv                            # XFS on the logical volume
sleep 3
mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data   # mount without atime updates
sleep 3

Another tidbit... thus far (and this may be only a coincidence), we've only had 
to replace DB nodes within a single availability zone within us-east.  Other 
availability zones, in the same region, have yet to show an issue.

It looks like I'm going to need to replace a third DB node today.  Any advice 
would be appreciated.

Thanks,
-Mike


On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote:

> Thanks.
> 
> We weren't monitoring this value when the issue occurred, and this particular 
> issue has not appeared for a couple of days (knock on wood).  Will keep an 
> eye out though,
> 
> -Mike
> 
> On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:
> 
>> top command? st : time stolen from this vm by the hypervisor
>> 
>> jason
>> 
>> 
>> On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux  wrote:
>> Sorry, Not sure what CPU steal is :)
>> 
>> I have AWS console with detailed monitoring enabled... things seem to track 
>> close to the minute, so I can see the CPU load go to 0... then jump at about 
>> the minute Cassandra reports the dropped messages,
>> 
>> -Mike
>> 
>> On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
>> 
 The messages appear right after the node "wakes up".
>>> Are you tracking CPU steal ? 
>>> 
>>> -
>>> Aaron Morton
>>> Freelance Cassandra Consultant
>>> New Zealand
>>> 
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>> 
>>> On 25/04/2013, at 4:15 AM, Robert Coli  wrote:
>>> 
 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux  
 wrote:
> Another related question.  Once we see messages being dropped on one 
> node, our cassandra client appears to see this, reporting errors.  We use 
> LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would 
> see an error?  If only one node reports an error, shouldn't the 
> consistency level prevent the client from seeing an issue?
 
 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to protect it from RPCTimeout. If it is unable to
 coordinate the request in a timely fashion, your clients will get
 errors.
 
 =Rob
>>> 
>> 
>> 
> 



CQL Clarification

2013-04-28 Thread Michael Theroux
Hello,

Just wondering if I can get a quick clarification on some simple CQL.  We 
utilize Thrift CQL queries to access our cassandra setup.  As clarified in a 
previous question I had, when using CQL and Thrift, timestamps on the cassandra 
column data are assigned by the server, not the client, unless "AND TIMESTAMP" 
is utilized in the query, for example:

http://www.datastax.com/docs/1.0/references/cql/UPDATE

According to the Datastax documentation, this timestamp should be:

"Values serialized with the timestamp type are encoded as 64-bit signed 
integers representing a number of milliseconds since the standard base time 
known as the epoch: January 1 1970 at 00:00:00 GMT."

However, my testing showed that updates didn't work when I used a timestamp of 
this format.  Looking at the Cassandra code, it appears that cassandra will 
assign a timestamp of System.currentTimeMillis() * 1000 when a timestamp is not 
specified, which would be the number of microseconds since the standard base 
time.  In my test environment, setting the timestamp to be the current time * 
1000 seems to work.  It seems that if you have an older installation without 
TIMESTAMP being specified in the CQL, or a mixed environment, the timestamp 
should be multiplied by 1000.

Just making sure I'm reading everything properly... improperly setting the 
timestamp could cause us some serious damage.

Thanks,
-Mike