CQL indexing

2013-04-26 Thread Sri Ramya
Hi,

In CQL, to perform a query based on a column, you have to create an index on
that column. What exactly happens when we create an index on a column?
What might the index column family contain?
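
For concreteness, this is the kind of thing I mean (a hypothetical table and column):

CREATE INDEX user_state_idx ON users (state);
SELECT * FROM users WHERE state = 'TX';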


Re: Unable to drop secondary index

2013-04-26 Thread Michal Michalski

On 26.04.2013 03:45, aaron morton wrote:

You can drop the hints via JMX and stopping the node and deleting the SSTables.


Thanks for the advice :-) It's +/- what I did. I paused hints delivery 
first and then upgraded the whole cluster to a C* build with the CASSANDRA-5179 patch 
applied, removing the SSTables before the restart, so it's fine now :-) Now 
I'm leaving for 3 weeks and when I'm back I'll have to revisit the 
schema problem - you can't get bored with Cassandra! ;-)


M.




Re: Really odd issue (AWS related?)

2013-04-26 Thread Jason Wee
top command? st : time stolen from this vm by the hypervisor
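
For illustration, the st value shows up at the end of top's CPU summary line, e.g. (numbers made up):

Cpu(s):  2.3%us,  0.7%sy,  0.0%ni, 95.8%id,  0.1%wa,  0.0%hi,  0.0%si,  1.1%st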

jason


On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote:

 Sorry, Not sure what CPU steal is :)

 I have AWS console with detailed monitoring enabled... things seem to
 track close to the minute, so I can see the CPU load go to 0... then jump
 at about the minute Cassandra reports the dropped messages,

 -Mike

 On Apr 25, 2013, at 9:50 PM, aaron morton wrote:

 The messages appear right after the node wakes up.

 Are you tracking CPU steal ?

 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com
 wrote:

 Another related question.  Once we see messages being dropped on one node,
 our cassandra client appears to see this, reporting errors.  We use
 LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would see
 an error?  If only one node reports an error, shouldn't the consistency
 level prevent the client from seeing an issue?


 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to protect it from RPCTimeout. If it is unable to
 coordinate the request in a timely fashion, your clients will get
 errors.

 =Rob






Re: Deletes, null values

2013-04-26 Thread Alain RODRIGUEZ
Of course:

From CQL 2 (cqlsh -2):

delete '183#16684','183#16714','183#16717' from myCF where key = 'all';

And selecting this data as follows gives me the result above:

select '1228#16857','1228#16866','1228#16875','1237#16544','1237#16553'
from myCF where key = 'all';

From thrift (phpCassa client):

$pool = new ConnectionPool('myKeyspace', array('192.168.100.201'), 6, 0,
3, 3);
$my_cf = new ColumnFamily($pool, 'myCF', true, true,
ConsistencyLevel::QUORUM, ConsistencyLevel::QUORUM);
$my_cf->remove('all', array('1228#16857','1228#16866','1228#16875'));



2013/4/25 Sorin Manolache sor...@gmail.com

 On 2013-04-25 11:48, Alain RODRIGUEZ wrote:

 Hi, I tried to delete some columns using cql2 as well as thrift on
 C*1.2.2 and instead of being unreachable, deleted columns have a null
 value.

 I am using no value in this CF, the only information I use is the
 existence of the column. So when I select all the column for a given key
 I have the following returned:

   1228#16857 | 1228#16866 | 1228#16875 | 1237#16544 | 1237#16553
 --------------+------------+------------+------------+------------
          null |       null |       null |            |


 This is quite annoying since my app thinks that I have 5 columns there
 when I should have 2 only.

 I first thought that this was a visible marker of tombstones but they
 didn't vanish after a major compaction.

 How can I get rid of these null/ghost columns and why does it happen ?


 I do something similar but I don't see null values. Could you please post
 the code where you delete the columns?

 Sorin




Re: How to change existing cluster to multi-center

2013-04-26 Thread Alain RODRIGUEZ
I just asked this exact same question, though maybe after reading a bit
more of the documentation than you did. You may want to read this thread:
http://grokbase.com/t/cassandra/user/134j85av4x/ec2snitch-to-ec2multiregionsnitch

You may also want to read some of the docs. DataStax explains things quite well and
updates the documentation regularly.

Hope this will help.
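
For reference, once the snitch question is sorted out (e.g. Ec2MultiRegionSnitch), the core of the change looks roughly like this (data center names are placeholders):

ALTER KEYSPACE myKeyspace WITH replication =
  {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'EC2DC': 1};

and then run nodetool rebuild DC1 on each new EC2 node, which streams its replica over from the existing data center.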


2013/4/25 Daning Wang dan...@netseer.com

 Hi All,

 We have an 8-node cluster (replication factor 3), with about 50G of data on each
 node. We need to change the cluster to a multi-center environment (adding EC2);
 the data needs to have one replica on EC2.

 Here is the plan:

 - Change the cluster config to multi-center.
 - Add 2 or 3 nodes in the other center, which is EC2.
 - Change the replication factor so data is synced to the other center.

 We have not done the test yet; is this doable? The main concern is that
 since the connection to EC2 is slow, it will take a long time to stream the
 data (should be more than 100G) at the beginning.

 If anybody has done this before, please shed some light.

 Thanks in advance,

 Daning






Re: Deletes, null values

2013-04-26 Thread Alain RODRIGUEZ
I copied the wrong query:

In CQL 2 it was:

delete '1228#16857','1228#16866','1228#16875' from myCF where key = 'all';

Sorry about the mistake.


2013/4/26 Alain RODRIGUEZ arodr...@gmail.com

 Of course:

 From CQL 2 (cqlsh -2):

 delete '183#16684','183#16714','183#16717' from myCF where key = 'all';

 And selecting this data as follow gives me the result above:

 select '1228#16857','1228#16866','1228#16875','1237#16544','1237#16553'
 from myCF where key = 'all';

 From thrift (phpCassa client):

 $pool = new ConnectionPool('myKeyspace', array('192.168.100.201'), 6, 0,
 3, 3);
 $my_cf = new ColumnFamily($pool, 'myCF', true, true,
 ConsistencyLevel::QUORUM, ConsistencyLevel::QUORUM);
 $my_cf->remove('all', array('1228#16857','1228#16866','1228#16875'));



 2013/4/25 Sorin Manolache sor...@gmail.com

 On 2013-04-25 11:48, Alain RODRIGUEZ wrote:

 Hi, I tried to delete some columns using cql2 as well as thrift on
 C*1.2.2 and instead of being unreachable, deleted columns have a null
 value.

 I am using no value in this CF, the only information I use is the
 existence of the column. So when I select all the column for a given key
 I have the following returned:

   1228#16857 | 1228#16866 | 1228#16875 | 1237#16544 | 1237#16553
 --------------+------------+------------+------------+------------
          null |       null |       null |            |


 This is quite annoying since my app thinks that I have 5 columns there
 when I should have 2 only.

 I first thought that this was a visible marker of tombstones but they
 didn't vanish after a major compaction.

 How can I get rid of these null/ghost columns and why does it happen ?


 I do something similar but I don't see null values. Could you please post
 the code where you delete the columns?

 Sorin





Re: vnodes and load balancing - 1.2.4

2013-04-26 Thread Sam Overton
Some extra information you could provide which will help debug this: the
logs from those 3 nodes which have no data, and the output of nodetool ring.

Before seeing those I can only guess, but my guess would be that in the
logs on those 3 nodes you will see this: "Calculating new tokens", and this:
"Split previous range (blah, blah] into", followed by a long list of tokens.

If that is the case then it means you accidentally started those three
nodes with the default configuration (single token) and then subsequently
changed num_tokens, and then joined them into the cluster. What happens
when you do this is that the node thinks it used to be responsible for a
single range and is being migrated to vnodes, so it splits its single range
(now a very small part of the keyspace) into 256 smaller ranges, and ends
up with just a tiny portion of the ring assigned to it.

To fix this you'll need to decommission those 3 nodes, remove all data from
them, then bootstrap them back in with the correct configuration from the
start.
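
For reference, the relevant setting and the recovery steps look roughly like this (the data paths below are the defaults and may differ on your install):

# cassandra.yaml on the new node, set BEFORE it is ever started:
num_tokens: 256
# initial_token:    <-- leave unset / commented out

# recovery for a node that joined with a single token by mistake:
nodetool decommission      # run on the affected node, wait for it to finish
# stop cassandra, then wipe its local state so it bootstraps fresh:
rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
# restart with num_tokens: 256 in place; the node will bootstrap back in with 256 vnodes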

Sam



On 26 April 2013 06:07, David McNelis dmcne...@gmail.com wrote:

 So, I had 7 nodes that I set up using vnodes, 256 tokens each, no problem.

 I added two 512 token nodes, no problem, things seemed to balance.

 The next 3 nodes I added, all at 256 tokens, have a cumulative
 load of 116MB (whereas the other nodes are at ~100GB and ~200GB for 256 and
 512 tokens respectively).

 Anyone else seen this in 1.2.4?

 The nodes seem to join the cluster ok, and I have num_tokens set and have
 tried both an empty initial_token and a commented out initial token, with
 no change.

 I see nothing streaming with netstats either, though these nodes were
 added days apart.  At first I thought I must have a hot key or something,
 but that doesn't seem to be the case, since the node I thought that one was
 on has evened out over the past couple of days with no new nodes added.

 I really *DON'T* want to deal with another shuffle... but what options do
 I have, since vnodes are supposed to make balancing the cluster unnecessary?  (which, at
 the moment, seems like a load of bullshit).




-- 
Sam Overton
Acunu | http://www.acunu.com | @acunu


Many creation/inserts in parallel

2013-04-26 Thread Sasha Yanushkevich
Hi All

We are testing Cassandra 1.2.3 (3 nodes with RF:2) with the
FluentCassandra driver. First, many CFs (about 1000) are created in parallel.
After creation is done, many insertions of
small amounts of data into the DB follow. During the tests we're receiving some
exceptions from the driver, e.g.:

FluentCassandra.Operations.CassandraOperationException: unconfigured
columnfamily table_78_9
and
FluentCassandra.Operations.CassandraOperationException: Connection to
Cassandra has timed out

Though in Cassandra's logs there are no exceptions.

What should we do to handle these exceptions?
-- 
Best regards,
Alexander


Re: Adding nodes in 1.2 with vnodes requires huge disks

2013-04-26 Thread Francisco Nogueira Calmon Sobral
I am using the same version and observed something similar.

I added a new node, but the instructions from DataStax did not work for me. 
Then I ran nodetool rebuild on the new node. After this command finished, the node 
contained twice the load of the other nodes. Even after I ran nodetool 
cleanup on the older nodes, the situation was the same.

The problem only seemed to disappear when nodetool repair was applied to all 
nodes.
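
Roughly, the sequence was (all plain nodetool commands):

nodetool rebuild      # on the new node
nodetool cleanup      # afterwards, on each of the older nodes
nodetool repair       # on every node - this is what finally evened out the load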

Regards,
Francisco Sobral.




On Apr 25, 2013, at 4:57 PM, John Watson j...@disqus.com wrote:

 After finally upgrading to 1.2.3 from 1.1.9, enabling vnodes, and running 
 upgradesstables, I figured it would be safe to start adding nodes to the 
 cluster. Guess not?
 
 It seems when new nodes join, they are streamed *all* sstables in the cluster.
 
 https://dl.dropbox.com/s/bampemkvlfck2dt/Screen%20Shot%202013-04-25%20at%2012.35.24%20PM.png
 
 The gray line machine ran out of disk space and for some reason this cascaded 
 into errors in the cluster about 'no host id' when trying to store hints for 
 it (even though it hadn't joined yet).
 The purple line machine I just stopped joining, because the main 
 cluster was dropping mutation messages at this point on a few nodes (and it 
 still had dozens of sstables left to stream).
 
 I followed this: http://www.datastax.com/docs/1.2/operations/add_replace_nodes
 
 Is there something missing in that documentation?
 
 Thanks,
 
 John



Slow retrieval using secondary indexes

2013-04-26 Thread Francisco Nogueira Calmon Sobral
Hi all!

We are using Cassandra 1.2.1 with an 8-node cluster running at Amazon. We 
started with 6 nodes and added the other 2 later. When performing some reads in 
Cassandra, we observed a large difference between gets using the primary key and 
gets using secondary indexes:


[default@Sessions] get Users where mahoutUserid = 30127944399716352;
---
RowKey: STQ0TTNII2LS211YYJI4GEV80M1SE8
= (column=mahoutUserid, value=30127944399716352, timestamp=1366820944696000)

1 Row Returned.
Elapsed time: 3508 msec(s).

[default@Sessions] get Users['STQ0TTNII2LS211YYJI4GEV80M1SE8'];
= (column=mahoutUserid, value=30127944399716352, timestamp=1366820944696000)
Returned 1 results.

Elapsed time: 3.06 msec(s).


In our model the secondary index is also unique, just like the primary key. Is it 
better, in this case, to create another CF mapping the indexed value to the 
key?
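
If I were to go that route, I imagine the extra CF would look roughly like this in CQL 3 (names invented):

CREATE TABLE users_by_mahout_id (
    mahoutUserid bigint PRIMARY KEY,
    user_key text
);

SELECT user_key FROM users_by_mahout_id WHERE mahoutUserid = 30127944399716352;

i.e. write both rows whenever a user is created, resolve the mahoutUserid to the row key with this table, then do the real get on Users by key.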

Best regards,
Francisco Sobral.

Re: Deletes, null values

2013-04-26 Thread Sorin Manolache

On 2013-04-26 11:55, Alain RODRIGUEZ wrote:

Of course:

 From CQL 2 (cqlsh -2):

delete '183#16684','183#16714','183#16717' from myCF where key = 'all';

And selecting this data as follow gives me the result above:

select '1228#16857','1228#16866','1228#16875','1237#16544','1237#16553'
from myCF where key = 'all';

 From thrift (phpCassa client):

$pool = new
ConnectionPool('myKeyspace',array('192.168.100.201'),6,0,3,3);
$my_cf = new ColumnFamily($pool, 'myCF', true, true,
ConsistencyLevel::QUORUM, ConsistencyLevel::QUORUM);
$my_cf->remove('all', array('1228#16857','1228#16866','1228#16875'));



I see. I'm sorry, I know nothing about phpCassa. I use batch_mutation 
with deletions and it works. But I guess phpCassa must use the same 
thrift primitives.


Sorin





2013/4/25 Sorin Manolache sor...@gmail.com mailto:sor...@gmail.com

On 2013-04-25 11:48, Alain RODRIGUEZ wrote:

Hi, I tried to delete some columns using cql2 as well as thrift on
C*1.2.2 and instead of being unreachable, deleted columns have a
null value.

I am using no value in this CF, the only information I use is the
existence of the column. So when I select all the column for a
given key
I have the following returned:

   1228#16857 | 1228#16866 | 1228#16875 | 1237#16544 | 1237#16553
 --------------+------------+------------+------------+------------
          null |       null |       null |            |


This is quite annoying since my app thinks that I have 5 columns
there
when I should have 2 only.

I first thought that this was a visible marker of tombstones but
they
didn't vanish after a major compaction.

How can I get rid of these null/ghost columns and why does it
happen ?


I do something similar but I don't see null values. Could you please
post the code where you delete the columns?

Sorin






lastest PlayOrm released for cassandra and mongodb

2013-04-26 Thread Hiller, Dean
PlayOrm now supports mongodb and cassandra with a query language that is 
portable across both systems as well.

https://github.com/deanhiller/playorm

Later,
Dean


Re: Really odd issue (AWS related?)

2013-04-26 Thread Michael Theroux
Thanks.

We weren't monitoring this value when the issue occurred, and this particular 
issue has not appeared for a couple of days (knock on wood).  Will keep an eye 
out though,

-Mike

On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:

 top command? st : time stolen from this vm by the hypervisor
 
 jason
 
 
 On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote:
 Sorry, Not sure what CPU steal is :)
 
 I have AWS console with detailed monitoring enabled... things seem to track 
 close to the minute, so I can see the CPU load go to 0... then jump at about 
 the minute Cassandra reports the dropped messages,
 
 -Mike
 
 On Apr 25, 2013, at 9:50 PM, aaron morton wrote:
 
 The messages appear right after the node wakes up.
 Are you tracking CPU steal ? 
 
 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote:
 
 On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com 
 wrote:
 Another related question.  Once we see messages being dropped on one node, 
 our cassandra client appears to see this, reporting errors.  We use 
 LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would 
 see an error?  If only one node reports an error, shouldn't the 
 consistency level prevent the client from seeing an issue?
 
 If the client is talking to a broken/degraded coordinator node, RF/CL
 are unable to protect it from RPCTimeout. If it is unable to
 coordinate the request in a timely fashion, your clients will get
 errors.
 
 =Rob
 
 
 



Is Cassandra oversized for this kind of use case?

2013-04-26 Thread Marc Teufel
I hope the Cassandra Community can help me finding a decision.

The project I am working on is located in an industrial plant where
machines are connected to a server and every 5 minutes I get data from the
machines about their status. We are talking about a production site with 100+
machines, so the data amount is very high:

Per machine, one row every 5th minute,
which means 12 rows per hour, roundabout 120 rows per day per machine - with 100+
machines that's 12.000+ rows per day,
multiplied by 20 working days it's 240.000 rows per month and 2.880.000 rows per year. I
have to hold
the last 3 years and I must be able to do analytics on this data. In the
end I deal with roundabout 10 million rows (12 columns holding text and numbers
in each row).
Okay, it's kind of big data but not really "big data", isn't it - still, for me it's
a lot of data to handle anyway.
Currently I am holding all this data in an Oracle database, but doing
analytics on so many rows
is not the good and modern way, I think. As the company is successful it
will grow, which means more machines and again more data to handle...

So I thought maybe Big Data technologies are a possible solution for me to
store my data.

Meanwhile I know Apache Hadoop is not the right tool for this kind of thing
because it does not scale down. But maybe Cassandra? This is my question to
you: do you think Cassandra is the right store for this kind of data?

I am thinking about 2 nodes. Maybe virtual.

Let me know what you think. And if Cassandra is not the right tool, please
tell me, and if you know any alternatives, please tell me about them. Maybe I am
already doing the right thing by storing that much data in an Oracle database, and
maybe one of you is doing the same - if so, please let me know as well.

Thank you very much.


Web: http://www.teufel.net


Re: Is Cassandra oversized for this kind of use case?

2013-04-26 Thread Hiller, Dean
Well, it depends more on what you will do with the data.  I know I was on a 
Sybase (RDBMS) system with 1 billion rows, but it was getting close to not being able to 
handle more (constraints had to be turned off, all sorts of optimizations were done, 
expert consultants were brought in, and everything).

BUT there are other use cases that noSQL is great for (i.e. it is not just 
great for big-data-type systems).  It is great for really high write throughput, 
since you can add more nodes and handle more writes/second than an RDBMS very 
easily, even if you are doing so many deletes that the system constantly stays at 
a small data set.

You may want to analyze the data constantly or in near real time, involving huge 
amounts of reads / second, in which case noSQL can be better as well.

Ie. NoSQL is not just for big data.  I know with PlayOrm for Cassandra, we have 
handled many different use cases out there.

Later,
Dean

From: Marc Teufel teufel.m...@googlemail.com
Reply-To: user@cassandra.apache.org
Date: Friday, April 26, 2013 8:17 AM
To: user@cassandra.apache.org
Subject: Is Cassandra oversized for this kind of use case?

I hope the Cassandra Community can help me finding a decision.

The project i am working on actually is located in industrial plant, machines 
are connected to a server an every 5 minutes i get data from the machines about 
its status. We are talking about a production with 100+ machines, so the data 
amount is very high:

Per Machine every 5th minute one row,
means 12 rows per hour, means roundabout 120 rows per day = 1200+ rows per day
multiplied by 20 its 240.000 rows per month and 2.880.000 rows per year. I have 
to hold
the last 3 years and i must be able to do analytics on this data. in the end i 
deal with roundabout 10 Mio Rows (12 columns holding text and numbers each row)
Okay, its kind of big data is not really  big data isn'it  but for me its a 
lot data to handle anyway.
Actually i am holding all these data in a oracle database but doing analytics 
on so many rows
 is not the good and modern way i think. as the company is successfull they 
will grew, means more machines, again more data to handle...

So i thought maybe Big Data technologies are a possible solution for me to 
store my data.

Meanwhile i know Apache Hadoop is not the right tool for this kind of thing 
because it scales not down.But maybe Cassandra ? This is my question to you, do 
you think cassandra is the right store for this kind of data?

I am thinking about 2 Nodes. Maybe virtual.

Let me know what you think. And if Cassandra is not the right tool please tell 
me and if you know any please tell me alternatives. Maybe i am already doing 
the right thing with storing that much data in oracle database and maybe one of 
you is doing the same - if so please let me also know.

Thank you very much.


Web: http://www.teufel.net


Re: Performance / limitations of WHERE ... IN queries

2013-04-26 Thread Thierry Templier

Thanks very much, Aaron, for your answer!

Thierry
You are effectively doing a multiget. Getting more than one row at a 
time is normally faster, but there will be a drop-off point where the 
improvements slow down. Run some tests.


Also consider that each row you request creates RF number of commands 
spread around the thread pools for that row. If one client requests 
100's or 1000's then this can delay other client requests.
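
For illustration (table and column names made up):

-- effectively a multiget: the coordinator issues one read per key in the list
SELECT col1, col2 FROM my_table WHERE key IN ('k1', 'k2', 'k3');

-- with RF=3, an IN list of 100 keys fans out into roughly 300 replica-level reads,
-- so very large lists are worth splitting into smaller chunks on the client side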


Cheers




CQL update and TTL

2013-04-26 Thread Shahryar Sedghi
Apparently when I update a column using CQL that already has a TTL, it
resets the TTL to null, so although there was already a TTL on all the columns that
I inserted as part of a composite column set, the specific column that I
updated will not expire while the others are getting expired. Is this how
it is expected to work, or is it a bug?

Thanks in advance

Shahryar


Re: CQL update and TTL

2013-04-26 Thread Alain RODRIGUEZ
This seems to be the correct behavior. An update refreshes the TTL, as it
does in memcache for example. Yet, what I do not know is whether this
behavior can be changed somehow to keep the initial TTL; this might be
useful in some use cases.

Alain


2013/4/26 Shahryar Sedghi shsed...@gmail.com

 Apparently when I update a column using CQL that already has a TTL, it
 resets the TTL to null, so if there was already a TTL for all columns that
 I inserted part of a composite column set, this specific column that I
 updated will not expire while the others are are getting expired. Is it how
 it is expected to work or it is a bug?

 Thanks in advance

 Shahryar





Re: CQL update and TTL

2013-04-26 Thread Shahryar Sedghi
The issue is, I can get the original TTL using a select and use it for
the update; however, since the TTL cannot be dynamic (using ?), it will exhaust
the prepared statement cache, because I have tons of updates like this and
every one will have a different signature due to the changing TTL. I am using
1.2.3 now.

Thanks


On Fri, Apr 26, 2013 at 11:35 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 This seems to be the correct behavior. An update refreshes the TTL, as it
 does in memcache for example. Yet, what I do not know is whether this
 behavior can be changed somehow to let the initial TTL, this might be
 useful on some use cases.

 Alain


 2013/4/26 Shahryar Sedghi shsed...@gmail.com

 Apparently when I update a column using CQL that already has a TTL, it
 resets the TTL to null, so if there was already a TTL for all columns that
 I inserted part of a composite column set, this specific column that I
 updated will not expire while the others are are getting expired. Is it how
 it is expected to work or it is a bug?

 Thanks in advance

 Shahryar






-- 
Life is what happens while you are making other plans. ~ John Lennon


Re: CQL update and TTL

2013-04-26 Thread Sylvain Lebresne
This is indeed intended. That behavior is largely dictated by how the
storage engine works, and the fact that an update does no read internally
in particular.

Yet, what I do not know is whether this behavior can be changed somehow to
 let the initial TTL,


There's nothing like that supported, no. You have to read the value first
to get its TTL and then insert whatever update you want with the TTL you've
just fetched. And since we don't have a good way to do it much more
efficiently server side, we prefer not doing it. That way the
performance impact is very explicit.
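
For reference, a sketch of that read-then-write in CQL (table and column names made up):

-- 1. read the remaining TTL of the column
SELECT TTL(my_col) FROM my_table WHERE key = 'k1';

-- 2. write the update back, re-applying whatever was left (say 3600 seconds)
UPDATE my_table USING TTL 3600 SET my_col = 'new value' WHERE key = 'k1';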

--
Sylvain




 Alain


 2013/4/26 Shahryar Sedghi shsed...@gmail.com

 Apparently when I update a column using CQL that already has a TTL, it
 resets the TTL to null, so if there was already a TTL for all columns that
 I inserted part of a composite column set, this specific column that I
 updated will not expire while the others are are getting expired. Is it how
 it is expected to work or it is a bug?

 Thanks in advance

 Shahryar






Re: CQL update and TTL

2013-04-26 Thread Alain RODRIGUEZ
That is more or less what I was guessing, thanks for the clarification.


2013/4/26 Sylvain Lebresne sylv...@datastax.com

 This is indeed intended. That behavior is largely dictated by how the
 storage engine works, and the fact that an update does no read internally
 in particular.

 Yet, what I do not know is whether this behavior can be changed somehow to
 let the initial TTL,


 There's nothing like that supported, no. You have to read the value first
 to get his TTL and then insert whatever update you want with the TTL you've
 just fetch. And since we don't have a good way to do it much more
 efficiently than server side, we prefer not doing it. That way the
 performance impact is very explicit.

 --
 Sylvain




 Alain


 2013/4/26 Shahryar Sedghi shsed...@gmail.com

 Apparently when I update a column using CQL that already has a TTL, it
 resets the TTL to null, so if there was already a TTL for all columns that
 I inserted part of a composite column set, this specific column that I
 updated will not expire while the others are are getting expired. Is it how
 it is expected to work or it is a bug?

 Thanks in advance

 Shahryar







Re: CQL update and TTL

2013-04-26 Thread Sylvain Lebresne
 is there a way to either make TTL dynamic  (using ?)


Not at this time. There is
https://issues.apache.org/jira/browse/CASSANDRA-4450 open for that, but
that's not done yet.


 tell the engine not to cache the Prepared statement. I am using the new
 CQL Java Driver.


In that case, just don't use a prepared statement. Use a normal, non
prepared query. Yes, normal statements will be slightly slower, but if you
really have to update a column while preserving its TTL, as said above you
will have to do a read followed by a write, so the whole thing won't be
excessively efficient and hence I doubt not using prepared statements will
be the blocking part performance wise.
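
To illustrate the point about the statement cache (table and column names made up): these are two distinct prepared statements even though only the TTL differs, because the TTL is part of the statement text rather than a bind variable:

UPDATE my_table USING TTL 300 SET my_col = ? WHERE key = ?;
UPDATE my_table USING TTL 301 SET my_col = ? WHERE key = ?;

Until CASSANDRA-4450 lands, the workaround is what's described above: build the statement text per request and execute it unprepared.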

--
Sylvain





 Shahryar


 On Fri, Apr 26, 2013 at 11:42 AM, Sylvain Lebresne 
 sylv...@datastax.com wrote:

 This is indeed intended. That behavior is largely dictated by how the
 storage engine works, and the fact that an update does no read internally
 in particular.

 Yet, what I do not know is whether this behavior can be changed somehow
 to let the initial TTL,


 There's nothing like that supported, no. You have to read the value first
 to get his TTL and then insert whatever update you want with the TTL you've
 just fetch. And since we don't have a good way to do it much more
 efficiently than server side, we prefer not doing it. That way the
 performance impact is very explicit.

 --
 Sylvain




 Alain


 2013/4/26 Shahryar Sedghi shsed...@gmail.com

 Apparently when I update a column using CQL that already has a TTL, it
 resets the TTL to null, so if there was already a TTL for all columns that
 I inserted part of a composite column set, this specific column that I
 updated will not expire while the others are are getting expired. Is it how
 it is expected to work or it is a bug?

 Thanks in advance

 Shahryar







 --
 Life is what happens while you are making other plans. ~ John Lennon



cost estimate about some Cassandra patchs

2013-04-26 Thread DE VITO Dominique
Hi,

We have created a new partitioner that groups some rows with **different** row 
keys on the same replicas.

But neither batch_mutate nor multiget_slice is able to take 
advantage of this partitioner-defined placement to vectorize/batch 
communications between the coordinator and the replicas.

Does anyone know enough of the inner workings of Cassandra to tell me how much 
work is needed to patch Cassandra to enable such communication 
vectorization/batching?

Thanks.

Regards,
Dominique




Re: Is Cassandra oversized for this kind of use case?

2013-04-26 Thread Marc Teufel
Okay, one billion rows of data is a lot; compared to that I am far, far away
- so I can stay with Oracle? Maybe.
But you're right when you say it's not only about big data but also about
your needs.

So storing the data is one part, doing analytical analysis is the second. I
do a lot of calculations and queries to generate management criteria about
how production is going right now, how production went over the last
week, month, years and so on. Saving in a 5-minute rhythm is only a
compromise to reduce the amount of data - maybe in the future the use case
will change and will be about storing the status of each machine as soon as it
changes. This will of course increase the amount of data and the complexity
of my queries again. And sure, I show "Live Data" today... 5-minute-old Live
Data... but if I tell the CEO that I am also able to work with real live
data, I am sure this is what he wants to get ;-)

Can you recommend Cassandra for this kind of scenario, or is it
oversized?

Does it make sense to start with 2 nodes?

Can I virtualize these two nodes?


Thx a lot for your assistance.

Marc




2013/4/26 Hiller, Dean dean.hil...@nrel.gov

 Well, it depends more on what you will do with the data.  I know I was on
 a sybase(RDBMS) with 1 billion rows but it was getting close to not being
 able to handle more (constraints had to be turned off and all sorts of
 optimizations done and expert consultants brought in and everything).

 BUT there are other use cases where noSQL is great for (ie. It is not just
 great for big data type systems).  It is great for really high write
 throughput as you can add more nodes and handle more writes/second than an
 RDBMS very easily yet you may be doing so many deletes that the system
 constantly stays at a small data set.

 You may want to analyze the data constantly or near real time involving
 huge amounts of reads / second in which case noSQL can be better as well.

 Ie. Nosql is not just for big data.  I know with PlayOrm for cassandra, we
 have handled many different use cases out there.

 Later,
 Dean

 From: Marc Teufel teufel.m...@googlemail.com
 Reply-To: user@cassandra.apache.org
 Date: Friday, April 26, 2013 8:17 AM
 To: user@cassandra.apache.org
 Subject: Is Cassandra oversized for this kind of use case?

 I hope the Cassandra Community can help me finding a decision.

 The project i am working on actually is located in industrial plant,
 machines are connected to a server an every 5 minutes i get data from the
 machines about its status. We are talking about a production with 100+
 machines, so the data amount is very high:

 Per Machine every 5th minute one row,
 means 12 rows per hour, means roundabout 120 rows per day = 1200+ rows per
 day
 multiplied by 20 its 240.000 rows per month and 2.880.000 rows per year. I
 have to hold
 the last 3 years and i must be able to do analytics on this data. in the
 end i deal with roundabout 10 Mio Rows (12 columns holding text and numbers
 each row)
 Okay, its kind of big data is not really  big data isn'it  but for me
 its a lot data to handle anyway.
 Actually i am holding all these data in a oracle database but doing
 analytics on so many rows
  is not the good and modern way i think. as the company is successfull
 they will grew, means more machines, again more data to handle...

 So i thought maybe Big Data technologies are a possible solution for me to
 store my data.

 Meanwhile i know Apache Hadoop is not the right tool for this kind of
 thing because it scales not down.But maybe Cassandra ? This is my question
 to you, do you think cassandra is the right store for this kind of data?

 I am thinking about 2 Nodes. Maybe virtual.

 Let me know what you think. And if Cassandra is not the right tool please
 tell me and if you know any please tell me alternatives. Maybe i am already
 doing the right thing with storing that much data in oracle database and
 maybe one of you is doing the same - if so please let me also know.

 Thank you very much.


 Web: http://www.teufel.net




-- 
Mail: teufel.m...@gmail.com
Web: http://www.teufel.net


Re: Is Cassandra oversized for this kind of use case?

2013-04-26 Thread Hiller, Dean
I would at least start with 3 cheap nodes with RF=3 and CL=TWO on 
writes and reads, most likely just getting your feet wet.  Don't buy very expensive 
computers like a lot of people do when getting into the game for the first time… every time I 
walk into a new gig, they seem to think they need to spend 6-10k per node.  I 
think this kind of scenario sounds fine for Cassandra.  When you say 
virtualize, I believe you mean use VMs… many use Amazon VMs and there is 
stuff to configure if you are on Amazon specifically for this.

If you are on your own VMs, you do need to worry about two nodes ending up on 
the same hardware stealing resources from each other, or about hardware failures as 
well.  I.e. the idea in noSQL is that you typically have 3 copies of all data, so if 
one node goes down you are still live with CL=TWO.

Also, plan on doing ~300GB per node typically, depending on how it works out in 
testing.
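
For what it's worth, a first cut at a table for those 5-minute status rows could look roughly like this in CQL 3 (column names are invented - substitute your 12 real ones):

CREATE TABLE machine_status (
    machine_id  text,
    ts          timestamp,
    status      text,
    temperature double,
    PRIMARY KEY (machine_id, ts)
);

-- typical query: one machine's history over a time window
SELECT * FROM machine_status
 WHERE machine_id = 'press-17'
   AND ts >= '2013-01-01' AND ts < '2013-02-01';

Writes can also carry USING TTL if you only ever need the last 3 years kept around.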

Later,
Dean

From: Marc Teufel teufel.m...@googlemail.com
Reply-To: user@cassandra.apache.org
Date: Friday, April 26, 2013 10:59 AM
To: user@cassandra.apache.org
Subject: Re: Is Cassandra oversized for this kind of use case?

Okay one billion rows of data is a lot, compared to that i am far far away - 
means i can stay with Oracle? Maybe.
But you're right when you say its not only about big data but also about your 
need.

So storing the data is one part, doing analytical analysis is the second. I do 
a lot of calculations and queries to generate management criteria about how the 
production is going on actually, how the production went the last week, month, 
years and so on. Saving in a 5 minute rhythm is only a compromise to reduce the 
amount of data - maybe in the future the usecase will change an is about to 
store status of each machine as soon as it changes. This will of course 
increase the amount of data and the complexity of my queries again. And sure I 
show Live Data today... 5 Minute old Live Data... but if i tell the CEO that 
i am also able to work with real live data, i am sure this is what he wants to 
get  ;-)

Can you recommend me to use Cassandra for this kind of scenario or is this 
oversized ?

Does it makes sense to start with 2 Nodes ?

Can i virtualize these two Nodes ?


Thx a lot for your assistance.

Marc




2013/4/26 Hiller, Dean dean.hil...@nrel.gov
Well, it depends more on what you will do with the data.  I know I was on a 
sybase(RDBMS) with 1 billion rows but it was getting close to not being able to 
handle more (constraints had to be turned off and all sorts of optimizations 
done and expert consultants brought in and everything).

BUT there are other use cases where noSQL is great for (ie. It is not just 
great for big data type systems).  It is great for really high write throughput 
as you can add more nodes and handle more writes/second than an RDBMS very 
easily yet you may be doing so many deletes that the system constantly stays at 
a small data set.

You may want to analyze the data constantly or near real time involving huge 
amounts of reads / second in which case noSQL can be better as well.

Ie. Nosql is not just for big data.  I know with PlayOrm for cassandra, we have 
handled many different use cases out there.

Later,
Dean

From: Marc Teufel teufel.m...@googlemail.com
Reply-To: user@cassandra.apache.org
Date: Friday, April 26, 2013 8:17 AM
To: user@cassandra.apache.org
Subject: Is Cassandra oversized for this kind of use case?

I hope the Cassandra Community can help me finding a decision.

The project i am working on actually is located in industrial plant, machines 
are connected to a server an every 5 minutes i get data from the machines about 
its status. We are talking about a production with 100+ machines, so the data 
amount is very high:

Per Machine every 5th minute one row,
means 12 rows per hour, means roundabout 120 rows per day = 1200+ rows per day
multiplied by 20 its 240.000 rows per month and 2.880.000 rows per year. I have 
to hold
the last 3 years and i must be able to do analytics on this data. in the end i 
deal with roundabout 10 Mio Rows (12 columns holding text and numbers each row)
Okay, its kind of big data is not really  big data isn'it  but for me its a 
lot data 

Re: vnodes and load balancing - 1.2.4

2013-04-26 Thread Robert Coli
On Fri, Apr 26, 2013 at 3:48 AM, Sam Overton s...@acunu.com wrote:
 If that is the case then it means you accidentally started those three nodes
 with the default configuration (single-token) and then subsequently changed
 (num_tokens) and then joined them into the cluster.

This would seem to be another reason why the debian package
auto-starting cassandra could be hazardous?

=Rob


Re: Adding nodes in 1.2 with vnodes requires huge disks

2013-04-26 Thread John Watson
Small relief we're not the only ones that had this issue.

We're going to try running a shuffle before adding a new node again...
maybe that will help

- John


On Fri, Apr 26, 2013 at 5:07 AM, Francisco Nogueira Calmon Sobral 
fsob...@igcorp.com.br wrote:

 I am using the same version and observed something similar.

 I've added a new node, but the instructions from Datastax did not work for
 me. Then I ran nodetool rebuild on the new node. After finished this
 command, it contained two times the load of the other nodes. Even when I
 ran nodetool cleanup on the older nodes, the situation was the same.

 The problem only seemed to disappear when nodetool repair was applied to
 all nodes.

 Regards,
 Francisco Sobral.




 On Apr 25, 2013, at 4:57 PM, John Watson j...@disqus.com wrote:

 After finally upgrading to 1.2.3 from 1.1.9, enabling vnodes, and running
 upgradesstables, I figured it would be safe to start adding nodes to the
 cluster. Guess not?

 It seems when new nodes join, they are streamed *all* sstables in the
 cluster.


 https://dl.dropbox.com/s/bampemkvlfck2dt/Screen%20Shot%202013-04-25%20at%2012.35.24%20PM.png

 The gray line machine ran out of disk space and for some reason this cascaded
 into errors in the cluster about 'no host id' when trying to store hints
 for it (even though it hadn't joined yet).
 The purple line machine, I just stopped the joining process because the
 main cluster was dropping mutation messages at this point on a few nodes
 (and it still had dozens of sstables to stream.)

 I followed this:
 http://www.datastax.com/docs/1.2/operations/add_replace_nodes

 Is there something missing in that documentation?

 Thanks,

 John





Re: Adding nodes in 1.2 with vnodes requires huge disks

2013-04-26 Thread Bryan Talbot
I believe that nodetool rebuild is used to add a new datacenter, not just
a new host to an existing cluster.  Is that what you ran to add the node?

-Bryan



On Fri, Apr 26, 2013 at 1:27 PM, John Watson j...@disqus.com wrote:

 Small relief we're not the only ones that had this issue.

 We're going to try running a shuffle before adding a new node again...
 maybe that will help

 - John


 On Fri, Apr 26, 2013 at 5:07 AM, Francisco Nogueira Calmon Sobral 
 fsob...@igcorp.com.br wrote:

 I am using the same version and observed something similar.

 I've added a new node, but the instructions from Datastax did not work
 for me. Then I ran nodetool rebuild on the new node. After finished this
 command, it contained two times the load of the other nodes. Even when I
 ran nodetool cleanup on the older nodes, the situation was the same.

 The problem only seemed to disappear when nodetool repair was applied
 to all nodes.

 Regards,
 Francisco Sobral.




 On Apr 25, 2013, at 4:57 PM, John Watson j...@disqus.com wrote:

 After finally upgrading to 1.2.3 from 1.1.9, enabling vnodes, and running
 upgradesstables, I figured it would be safe to start adding nodes to the
 cluster. Guess not?

 It seems when new nodes join, they are streamed *all* sstables in the
 cluster.


 https://dl.dropbox.com/s/bampemkvlfck2dt/Screen%20Shot%202013-04-25%20at%2012.35.24%20PM.png

 The gray line machine ran out of disk space and for some reason this cascaded
 into errors in the cluster about 'no host id' when trying to store hints
 for it (even though it hadn't joined yet).
 The purple line machine, I just stopped the joining process because the
 main cluster was dropping mutation messages at this point on a few nodes
 (and it still had dozens of sstables to stream.)

 I followed this:
 http://www.datastax.com/docs/1.2/operations/add_replace_nodes

 Is there something missing in that documentation?

 Thanks,

 John