Cassandra implement in two different data-center

2012-08-30 Thread Adeel Akbar

Dear All,

I am going to implement Apache Cassandra in two different data centers
with 2 nodes in each ring.  I also need to set a replication factor of 2 within
each data center, and data should be replicated between the rings in both
data centers. Please help me or point me to any document that explains how to
implement this model.

--


Thanks & Regards

Adeel Akbar



How to set LeveledCompactionStrategy for an existing table

2012-08-30 Thread Jean-Armel Luce
Hello,

I am using Cassandra 1.1.1 and CQL3.
I have a cluster with 1 node (test environment).
Could you tell me how to set the compaction strategy to LeveledCompactionStrategy
for an existing table ?

I have a table pns_credentials

jal@jal-VirtualBox:~/cassandra/apache-cassandra-1.1.1/bin$ ./cqlsh -3
Connected to Test Cluster at localhost:9160.
[cqlsh 2.2.0 | Cassandra 1.1.1 | CQL spec 3.0.0 | Thrift protocol 19.32.0]
Use HELP for help.
cqlsh> use test1;
cqlsh:test1> describe table pns_credentials;

CREATE TABLE pns_credentials (
  ise text PRIMARY KEY,
  isnew int,
  ts timestamp,
  mergestatus int,
  infranetaccount text,
  user_level int,
  msisdn bigint,
  mergeusertype int
) WITH
  comment='' AND
  comparator=text AND
  read_repair_chance=0.10 AND
  gc_grace_seconds=864000 AND
  default_validation=text AND
  min_compaction_threshold=4 AND
  max_compaction_threshold=32 AND
  replicate_on_write='true' AND
  compaction_strategy_class='SizeTieredCompactionStrategy' AND
  compression_parameters:sstable_compression='SnappyCompressor';

I want to set LeveledCompactionStrategy for this table, so I execute
the following ALTER TABLE:

cqlsh:test1> alter table pns_credentials
         ... WITH compaction_strategy_class='LeveledCompactionStrategy'
         ... AND compaction_strategy_options:sstable_size_in_mb=10;

In the Cassandra logs, I see some information:
 INFO 10:23:52,532 Enqueuing flush of
Memtable-schema_columnfamilies@965212657(1391/1738 serialized/live bytes,
20 ops)
 INFO 10:23:52,533 Writing Memtable-schema_columnfamilies@965212657(1391/1738
serialized/live bytes, 20 ops)
 INFO 10:23:52,629 Completed flushing
/var/lib/cassandra/data/system/schema_columnfamilies/system-schema_columnfamilies-hd-94-Data.db
(1442 bytes) for commitlog position ReplayPosition(segmentId=3556583843054,
position=1987)


However, when I look at the description of the table, it still uses
SizeTieredCompactionStrategy:
cqlsh:test1> describe table pns_credentials ;

CREATE TABLE pns_credentials (
  ise text PRIMARY KEY,
  isnew int,
  ts timestamp,
  mergestatus int,
  infranetaccount text,
  user_level int,
  msisdn bigint,
  mergeusertype int
) WITH
  comment='' AND
  comparator=text AND
  read_repair_chance=0.10 AND
  gc_grace_seconds=864000 AND
  default_validation=text AND
  min_compaction_threshold=4 AND
  max_compaction_threshold=32 AND
  replicate_on_write='true' AND
  compaction_strategy_class='SizeTieredCompactionStrategy' AND
  compression_parameters:sstable_compression='SnappyCompressor';

In the schema_columnfamilies table (in the system keyspace), the table
pns_credentials is still using SizeTieredCompactionStrategy:
cqlsh:test1> use system;
cqlsh:system> select * from schema_columnfamilies ;
...
 test1 |   pns_credentials |   null | KEYS_ONLY
|[] | |
org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
|  {}
|
org.apache.cassandra.db.marshal.UTF8Type |
{sstable_compression:org.apache.cassandra.io.compress.SnappyCompressor}
|  org.apache.cassandra.db.marshal.UTF8Type |   864000 |
1029 |   ise | org.apache.cassandra.db.marshal.UTF8Type
|0 |   32
|4 |0.1 |   True
|  null | Standard |null
...


I stopped/started the Cassandra node, but the table still uses
SizeTieredCompactionStrategy.

I also tried using cassandra-cli, but the alter is still unsuccessful.

Is there anything I am missing?


Thanks.

Jean-Armel


Store a timeline with unique properties

2012-08-30 Thread Morgan Segalis
Hi everyone,

I'm trying to use Cassandra to store a timeline, but with values that must
be unique (replaced when they recur). (So not really a timeline, but I didn't
find a better word for it.)

Let me give you an example:

- A user has a list of friends
- Friends can change their nickname, status, profile picture, etc.

At the beginning, the CF will look like this for user1:

lte = latest-timestamp-entry, i.e. the timestamp of the entry (-1, -2, -3
mean that the timestamps are older)

user1 row : | lte                 | lte -1              | lte -2              | lte -3              | lte -4              |
values :    | user2-name-change   | user3-pic-change    | user4-status-change | user2-pic-change    | user2-status-change |

If, for example, user2 changes its picture, the row should look like this:

user1 row : | lte                 | lte -1              | lte -2              | lte -3              | lte -4              |
values :    | user2-pic-change    | user2-name-change   | user3-pic-change    | user4-status-change | user2-status-change |

Notice that user2-pic-change, which was at (lte -3) in the first
representation, has moved to (lte) in the second representation.

That way, when user1 connects again, it can retrieve only the changes that
occurred since the last time he connected.

e.g.: if user1's last connection date is between lte -2 and lte -3,
then he will only be notified that:

- user2 has changed his picture
- user2 has changed his name
- user3 has changed his picture

I would not keep the old data, since the timeline is saved locally on the
client, not on the server.
I would really like to avoid searching every column to find
user2-pic-change, which can be slow, especially if the user has many friends.

Is there a simple way to do that with Cassandra, or am I bound to create
another CF, with the column name holding the action (e.g. user2-pic-change)
and the column value holding the timestamp when it occurred?

Thanks,

Morgan.



Re: Store a timeline with unique properties

2012-08-30 Thread Morgan Segalis
Sorry, the diagram did not keep the right tabulation for some people...
Here's a version using spaces instead of tabs.

user1 row : | lte                 | lte -1              | lte -2              | lte -3              | lte -4              |
values :    | user2-name-change   | user3-pic-change    | user4-status-change | user2-pic-change    | user2-status-change |

If, for example, user2 changes its picture, the row should look like this:

user1 row : | lte                 | lte -1              | lte -2              | lte -3              | lte -4              |
values :    | user2-pic-change    | user2-name-change   | user3-pic-change    | user4-status-change | user2-status-change |




Re: performance is drastically degraded after 0.7.8 --> 1.0.11 upgrade

2012-08-30 Thread Илья Шипицин
We are running a somewhat queue-like workload with aggressive write-read patterns.
I was looking for a way to script/capture queries from a live Cassandra
installation, but I didn't find any.

Is there something like a thrift proxy or another query logging/scripting
engine?

2012/8/30 aaron morton aa...@thelastpickle.com

 in terms of our high-rate write load cassandra1.0.11 is about 3 (three!!)
 times slower than cassandra-0.7.8

 We've not had any reports of a performance drop off. All tests so far have
 shown improvements in both read and write performance.

 I agree, such digests save some network IO, but they seem to be very bad
 in terms of CPU and disk IO.

 The sha1 is created so we can diagnose corruptions in the -Data component
 of the SSTables. They are not used to save network IO.
 It is calculated while streaming the Memtable to disk, so it has no impact on
 disk IO. While not the fastest algorithm, I would assume its CPU overhead in
 this case is minimal.

  there's already relatively small Bloom filter file, which can be used for
 saving network traffic instead of sha1 digest.

 Bloom filters are used to test if a row key may exist in an SSTable.

 any explanation ?

 If you can provide some more information on your use case we may be able
 to help.

 Cheers


 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 30/08/2012, at 5:18 AM, Илья Шипицин chipits...@gmail.com wrote:

 in terms of our high-rate write load, cassandra 1.0.11 is about 3 (three!!)
 times slower than cassandra 0.7.8.
 After some investigation I noticed files with a sha1 extension
 (which are missing in Cassandra 0.7.8).

 In the maybeWriteDigest() function I see no option for switching sha1 digests
 off.

 I agree, such digests save some network IO, but they seem to be very bad
 in terms of CPU and disk IO.
 Why use one more digest (which has to be calculated)? There is already a
 relatively small Bloom filter file, which could be used for saving network
 traffic instead of the sha1 digest.

 any explanation ?

 Ilya Shipitsin





Re: performance is drastically degraded after 0.7.8 --> 1.0.11 upgrade

2012-08-30 Thread Edward Capriolo
If you move from 0.7.X to 0.8.X or 1.0.X you have to rebuild sstables as
soon as possible. If you have large bloom filters you can hit a bug
where the bloom filters will not work properly.






Re: Spring - cassandra

2012-08-30 Thread Radim Kolar




You looking for the author of Spring Data Cassandra?
https://github.com/boneill42/spring-data-cassandra

If so, I guess that is me. =)
Did you get in touch with the Spring guys? They have Cassandra support on
their Spring Data todo list. They might have some todo or feature list
they want to implement for Cassandra; I am willing to code something to
make official Spring Cassandra support happen faster.


Re: Spring - cassandra

2012-08-30 Thread Brian O'Neill

Yes.  I'm in contact with Oliver Gierke and Erez Mazor of Spring Data.

We are working on two fronts:
1) Spring Data support via JPA (using Kundera underneath)
- Initial attempt here:
http://brianoneill.blogspot.com/2012/07/spring-data-w-cassandra-using-jpa.html
- Most recently (an hour ago): The issues w/ MetaModel are fixed, now
waiting on an enhancement to the EntityManager to fully support type
queries.

For this one, we're in a holding pattern until Kundera is fully JPA
compliant.

2) Spring Data support via Astyanax
- The project I'm working on (below) should mimic Spring Data MongoDB's
approach and capabilities, allowing people to use Spring Data with
Cassandra without the constraints of JPA.  I'd love some help working on
the project.  Once we have it functional we should be able to push it to
Spring (with Oliver's help).

Go ahead and fork.  Feel free to email me directly so we don't spam this
list.
(or setup a googlegroup just in case others want to contribute)

-brian


---
Brian O'Neill
Lead Architect, Software Development
Apache Cassandra MVP
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 http://www.twitter.com/boneill42 •
healthmarketscience.com











Re: Cassandra implement in two different data-center

2012-08-30 Thread Aaron Turner
On Thu, Aug 30, 2012 at 1:14 AM, Adeel Akbar
adeel.ak...@panasiangroup.com wrote:
 Dear All,

 I am going to implement Apache Cassandra in two different data-center with 2
 nodes in each ring.  I also need to set replica 2 factor in same data
 center. Over the data center data should be replicates between both data
 center rings. Please help me or provide any document which help to implement
 this model.

http://www.datastax.com/docs/1.1/initialize/cluster_init_multi_dc

has good info on building a multi-DC cluster.

That said, 2 nodes per DC means you can't use LOCAL_QUORUM/QUORUM for
reads & writes.  I would strongly suggest 3 nodes per DC if you care
about consistent reads.  Generally speaking, 3 nodes per DC is
considered the recommended minimum number of nodes for a production
system.
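
For reference, this is roughly what a keyspace with two replicas in each of
two data centers looks like in CQL 3 on Cassandra 1.1 (a minimal sketch; the
keyspace name and the data center names DC1/DC2 are placeholders and must
match what your snitch reports):

    CREATE KEYSPACE my_keyspace
      WITH strategy_class = 'NetworkTopologyStrategy'
      AND strategy_options:DC1 = 2
      AND strategy_options:DC2 = 2;

NetworkTopologyStrategy then keeps 2 replicas of every row in each data
center, which gives you cross-DC replication on top of the per-DC
replication factor.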



-- 
Aaron Turner
http://synfin.net/ Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
-- Benjamin Franklin
carpe diem quam minimum credula postero


Cassandra - cqlsh

2012-08-30 Thread Morantus, James (PCLN-NW)
Hello all,

This is my first setup of Cassandra and I'm having some issues running the 
cqlsh tool.
Have any of you come across this error before? If so, please help.

/bin/cqlsh -h localhost -p 9160
No appropriate python interpreter found. 

Thanks
James


adding node to cluster

2012-08-30 Thread Casey Deccio
All,

I'm adding a new node to an existing cluster that uses
ByteOrderedPartitioner.  The documentation says that if I don't configure a
token, then one will be automatically generated to take load from an
existing node.  What I'm finding is that when I add a new node, (super)
column lookups begin failing (not sure if it was the row lookup failing or
the supercolumn lookup failing), and I'm not sure why.  I assumed that
while the existing node is transitioning data to the new node the affected
rows and (super) columns would still be found in the right place.  Any idea
why these lookups might be failing?  When I decommissioned the new
node, the lookups began working again.  Any help is appreciated.

Regards,
Casey


Re: How to set LeveledCompactionStrategy for an existing table

2012-08-30 Thread feedly team
In cassandra-cli, I did something like:

update column family xyz with
compaction_strategy='LeveledCompactionStrategy'



Re: Cassandra - cqlsh

2012-08-30 Thread Tyler Hobbs
What OS are you using?




-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: adding node to cluster

2012-08-30 Thread Rob Coli
On Thu, Aug 30, 2012 at 10:18 AM, Casey Deccio ca...@deccio.net wrote:
 I'm adding a new node to an existing cluster that uses
 ByteOrderedPartitioner.  The documentation says that if I don't configure a
 token, then one will be automatically generated to take load from an
 existing node.
 What I'm finding is that when I add a new node, (super)
 column lookups begin failing (not sure if it was the row lookup failing or
 the supercolumn lookup failing), and I'm not sure why.

1) You almost never actually want BOP.
2) You never want Cassandra to pick a token for you. IMO and the
opinion of many others, the fact that it does this is a bug. Specify a
token with initial_token.
3) You never want to use Supercolumns. The project does not support
them but currently has no plan to deprecate them. Use composite row
keys (see the sketch after this list).
4) Unless your existing cluster consists of one node, you almost never
want to add only a single new node to a cluster. In general you want
to double it.
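
A hedged illustration of point 3: with CQL 3 composite keys, the
row/supercolumn/column hierarchy of a supercolumn family maps onto a single
table; every name below is hypothetical:

    -- old layout: row_key -> supercolumn_name -> column_name = value
    CREATE TABLE remodelled_cf (
        row_key text,
        supercolumn_name text,
        column_name text,
        value text,
        PRIMARY KEY (row_key, supercolumn_name, column_name)
    );

    -- read what used to be a single supercolumn
    SELECT column_name, value FROM remodelled_cf
     WHERE row_key = 'some-row' AND supercolumn_name = 'some-supercolumn';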

In summary, you are Doing It just about as Wrong as possible... but on
to your actual question ... ! :)

In what way are the lookups failing? Is there an exception?

=Rob

-- 
=Robert Coli
AIM&GTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


RE: Cassandra - cqlsh

2012-08-30 Thread Morantus, James (PCLN-NW)
Red Hat Enterprise Linux Server release 5.8 (Tikanga)

Linux nw-mydb-s05 2.6.18-308.8.2.el5 #1 SMP Tue May 29 11:54:17 EDT 2012 x86_64 
x86_64 x86_64 GNU/Linux

Thanks




Re: Cassandra - cqlsh

2012-08-30 Thread Tyler Hobbs
RHEL 5 only ships with Python 2.4, which is pretty ancient and below what
cqlsh will accept.  You can install Python 2.6 with EPEL enabled:
http://blog.nexcess.net/2011/02/25/python-2-6-for-centos-5/





-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: Why Cassandra secondary indexes are so slow on just 350k rows?

2012-08-30 Thread Tyler Hobbs
pycassa already breaks up the query into smaller chunks, but you should try
playing with the buffer_size kwarg for get_indexed_slices, perhaps lowering
it to ~300, as Aaron suggests:
http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.get_indexed_slices

On Wed, Aug 29, 2012 at 11:40 PM, aaron morton aa...@thelastpickle.com wrote:

  from 12 to 20 seconds (!!!) to find 5000 rows.

 More is not always better.

 Cassandra must materialise the full 5000 rows and send them all over the
 wire to be materialised on the other side. Try asking for a few hundred at
 a time and see how it goes.

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 29/08/2012, at 6:46 PM, Robin Verlangen ro...@us2.nl wrote:

 @Edward: I think you should consider a queue for exporting the new rows.
 Just store the rowkey in a queue (you might want to consider looking at
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Distributed-work-queues-td5226248.html
  )
 and process that row every couple of minutes. Then manually delete columns
 from that queue-row.

 With kind regards,

 Robin Verlangen
 Software engineer

 W http://www.robinverlangen.nl
 E ro...@us2.nl




 2012/8/29 Robin Verlangen ro...@us2.nl

 What this means is that eventually you will have 1 row in the secondary
 index table with 350K columns

 Is this really true? I would have expected that Cassandra used internal
 index sharding/bucketing?




 2012/8/29 Dave Brosius dbros...@mebigfatguy.com

 If I understand you correctly, you are only ever querying for the rows
 where is_exported = false, and turning them into trues. What this means is
 that eventually you will have 1 row in the secondary index table with 350K
 columns that you will never look at.

 It seems to me that perhaps you should just hold your own manual
 index CF that points to non-exported rows, and just delete those columns
 when they are exported.
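
A minimal CQL 3 sketch of such a manual index CF, with invented names (the
poller reads the pending keys and deletes each entry once the corresponding
row has been exported):

    CREATE TABLE rows_to_export (
        bucket text,      -- a constant such as 'pending'
        row_key text,
        PRIMARY KEY (bucket, row_key)
    );

    -- the poller reads the pending keys...
    SELECT row_key FROM rows_to_export WHERE bucket = 'pending' LIMIT 5000;

    -- ...and removes each one after exporting it
    DELETE FROM rows_to_export WHERE bucket = 'pending' AND row_key = 'rowKey1';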



 On 08/28/2012 05:23 PM, Edward Kibardin wrote:

 I have a column family with a secondary index. The indexed field is
 basically a binary flag, but I'm using a string for it. The field is called
 is_exported and can be 'true' or 'false'. After a request, all loaded
 rows are updated with is_exported = 'false'.

 I'm polling this column family every ten minutes and exporting new rows
 as they appear.

 But here is the problem: I'm seeing that the time for this query grows pretty
 linearly with the amount of data in the column family, and currently it takes
 from 12 to 20 seconds (!!!) to find 5000 rows. From my understanding, an
 indexed request should not depend on the number of rows in the CF but on the
 number of rows per one index value (cardinality), as it's just another hidden CF like:

 true : rowKey1 rowKey2 rowKey3 ...
 false: rowKey1 rowKey2 rowKey3 ...

 I'm using Pycassa to query the data; here is the code I'm using:

     column_family = pycassa.ColumnFamily(cassandra_pool,
         column_family_name, read_consistency_level=2)
     is_exported_expr = create_index_expression('is_exported', 'false')
     clause = create_index_clause([is_exported_expr], count=5000)
     column_family.get_indexed_slices(clause)

 Am I doing something wrong? I expect this operation to work MUCH
 faster.

 Any ideas or suggestions?

 Some config info:
  - Cassandra 1.1.0
  - RandomPartitioner
  - I have 2 nodes and replication_factor = 2 (each server has a full
 data copy)
  - Using AWS EC2, large instances
  - Software raid0 on ephemeral drives

 Thanks in advance!








-- 
Tyler Hobbs
DataStax http://datastax.com/


RE: Cassandra - cqlsh

2012-08-30 Thread Morantus, James (PCLN-NW)
Ah... Thanks



Re: How to set LeveledCompactionStrategy for an existing table

2012-08-30 Thread Jean-Armel Luce
I tried cassandra-cli as you suggested, still without success.

[default@unknown] use test1;
Authenticated to keyspace: test1
[default@test1] UPDATE COLUMN FAMILY pns_credentials with
compaction_strategy='LeveledCompactionStrategy';
8ed12919-ef2b-327f-8f57-4c2de26c9d51
Waiting for schema agreement...
... schemas agree across the cluster

And then, when I check the compaction strategy, it is still
SizeTieredCompactionStrategy
[default@test1] describe pns_credentials;
ColumnFamily: pns_credentials
  Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
  Default column value validator:
org.apache.cassandra.db.marshal.UTF8Type
  Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 0.1
  DC Local Read repair chance: 0.0
  Replicate on write: true
  Caching: KEYS_ONLY
  Bloom Filter FP chance: default
  Built indexes: []
  Column Metadata:
Column Name: isnew
  Validation Class: org.apache.cassandra.db.marshal.Int32Type
Column Name: ts
  Validation Class: org.apache.cassandra.db.marshal.DateType
Column Name: mergestatus
  Validation Class: org.apache.cassandra.db.marshal.Int32Type
Column Name: infranetaccount
  Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Column Name: user_level
  Validation Class: org.apache.cassandra.db.marshal.Int32Type
Column Name: msisdn
  Validation Class: org.apache.cassandra.db.marshal.LongType
Column Name: mergeusertype
  Validation Class: org.apache.cassandra.db.marshal.Int32Type
  Compaction Strategy:
org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
  Compression Options:
sstable_compression:
org.apache.cassandra.io.compress.SnappyCompressor



I also tried to create a new table with LeveledCompactionStrategy (using
cqlsh), and when I check the compaction strategy,
SizeTieredCompactionStrategy is set for this table.

cqlsh:test1> CREATE TABLE pns_credentials3 (
 ...   ise text PRIMARY KEY,
 ...   isnew int,
 ...   ts timestamp,
 ...   mergestatus int,
 ...   infranetaccount text,
 ...   user_level int,
 ...   msisdn bigint,
 ...   mergeusertype int
 ... ) WITH
 ...   comment='' AND
 ...   read_repair_chance=0.10 AND
 ...   gc_grace_seconds=864000 AND
 ...   compaction_strategy_class='LeveledCompactionStrategy' AND
 ...
compression_parameters:sstable_compression='SnappyCompressor';
cqlsh:test1> describe table pns_credentials3

CREATE TABLE pns_credentials3 (
  ise text PRIMARY KEY,
  isnew int,
  ts timestamp,
  mergestatus int,
  infranetaccount text,
  user_level int,
  msisdn bigint,
  mergeusertype int
) WITH
  comment='' AND
  comparator=text AND
  read_repair_chance=0.10 AND
  gc_grace_seconds=864000 AND
  default_validation=text AND
  min_compaction_threshold=4 AND
  max_compaction_threshold=32 AND
  replicate_on_write='true' AND
  compaction_strategy_class='SizeTieredCompactionStrategy' AND
  compression_parameters:sstable_compression='SnappyCompressor';

Maybe something is wrong with my server.
Any idea?

Thanks.
Jean-Armel



Re: Why Cassandra secondary indexes are so slow on just 350k rows?

2012-08-30 Thread Edward Kibardin
Thanks guys for the answers...

The main issue here seems to be not the secondary index, but the speed of
searching for random keys in the column family.
I've done an experiment and queried the same 5000 rows not using the index
but providing a list of keys to Pycassa... the speed was the same.

However, using SuperColumns I can get the same 5000 rows (SuperColumns) in
about 1-2 seconds... That's understandable, as the columns are stored sequentially.

So here is the question: is it normal for Cassandra in general to take 20
seconds to fetch 5000 rows, or is something just wrong with my instance?

Ed



Re: Why Cassandra secondary indexes are so slow on just 350k rows?

2012-08-30 Thread Hiller, Dean
It seems to me you may want to revisit the design a bit (though I'm not 100% sure,
as I am not sure I understand the entire context), as I could see having partitions
and a few clients that poll each partition, so you can scale basically without
limit.  If you are doing all this polling from one machine, it just won't scale
very well.

playOrm does this for you, but the basic pattern you can do yourself without
playOrm would be:

Row 1
Row 2
Row 3
Row 4

Index row for partition 1 - val.row1, val.row4
Index row for partition 2 - val.row2, val.row3
…

Now each server is responsible for polling / scanning its partitions' index
rows above.  If you have 2 servers and 2 partitions, each one would column scan
its index rows and then look up the actual rows.  If it is unbalanced, like
5 servers and 28 partitions, you can use the hash code of the partition and the
number of servers to figure out whether a server owns that partition for
polling.
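
A rough CQL 3 rendering of that pattern (names and partition count are made
up for illustration):

    -- one wide row per partition, holding the keys of rows awaiting processing
    CREATE TABLE partitioned_index (
        partition_id int,     -- e.g. hash(row_key) % number_of_partitions
        row_key text,
        PRIMARY KEY (partition_id, row_key)
    );

    -- each poller scans only the partitions it owns
    SELECT row_key FROM partitioned_index WHERE partition_id = 3;

    -- after processing a row, remove it from its partition's index row
    DELETE FROM partitioned_index WHERE partition_id = 3 AND row_key = 'row42';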

All of this is automatic in playOrm with S-JQL (Scalable-JQL – one minor change 
to SQL to make it scalable).

Later,
Dean



From: Edward Kibardin infa...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Thursday, August 30, 2012 2:14 PM
To: user@cassandra.apache.org
Subject: Re: Why Cassandra secondary indexes are so slow on just 350k rows?

it should not depend on number of rows in CF but from number of rows per one
index value


Re: Store a timeline with unique properties

2012-08-30 Thread aaron morton
Consider trying…

UserTimeline CF

row_key: user_id
column_names: timestamp, other_user_id, action
column_values: action details

To get the changes between two times, specify the start and end timestamps and
do not include the other components of the column name,

e.g. from <1234, NULL, NULL> to <6789, NULL, NULL>
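
In CQL 3 terms, that layout corresponds roughly to the sketch below (a hedged
example; table and column names are illustrative, not from the original mail):

    CREATE TABLE user_timeline (
        user_id text,
        ts timestamp,
        other_user_id text,
        action text,
        details text,
        PRIMARY KEY (user_id, ts, other_user_id, action)
    );

    -- everything that happened to user1's friends between two timestamps
    SELECT ts, other_user_id, action, details
      FROM user_timeline
     WHERE user_id = 'user1'
       AND ts > '2012-08-29 00:00:00'
       AND ts <= '2012-08-30 13:00:00';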

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com



Re: performance is drastically degraded after 0.7.8 --> 1.0.11 upgrade

2012-08-30 Thread aaron morton
 we are running somewhat queue-like with aggressive write-read patterns.
We'll need some more details…

How much data ?
How many machines ?
What is the machine spec ?
How many clients ?
Is there an example of a slow request ? 
How are you measuring that it's slow ? 
Is there anything unusual in the log ? 

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com



Re: How to set LeveledCompactionStrategy for an existing table

2012-08-30 Thread aaron morton
Looks like a bug. 

Can you please create a ticket on 
https://issues.apache.org/jira/browse/CASSANDRA and update the email thread ?

Can you include this: CFPropDefs.applyToCFMetadata() does not set the 
compaction class on CFM

Thanks


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com


Re: performance is drastically degraded after 0.7.8 --> 1.0.11 upgrade

2012-08-30 Thread Илья Шипицин
We are using functional tests (~500 tests each time).
It is hard to tell which query is slower; it is slower in general.

Same hardware: 1 node, 32 GB RAM, 8 GB heap, default Cassandra settings.
As we are talking about functional tests, we recreate the keyspace just before
the tests are run.

I do not know how to record the queries (there are a lot of them); if you are
interested, I can set up a special test stand for you.




Re: Memory Usage of a connection

2012-08-30 Thread rohit bhatia
PS: everything above is in bytes, not bits.

On Fri, Aug 31, 2012 at 11:03 AM, rohit bhatia rohit2...@gmail.com wrote:

 I was wondering how much memory an established connection uses in
 Cassandra's heap space.

 We are noticing extremely frequent young generation garbage collections
 (3.2 GB young generation, a ParNew GC every 2 seconds) at a traffic of
 20,000 qps for 8 nodes.
 We do connection pooling, but with 1 connection per 6 requests with
 phpcassa.
 So, essentially every node has on average 500 connections
 created/destroyed every second.
 Could these 500 connections/second cause (on average) 2600 MB of memory usage
 per 2 seconds, i.e. ~1300 MB/second, or around 2-3 MB per connection?

 Is this value expected? (Our write requests are simple counter increments
 and cannot take up 500 KB per request as the calculation suggests; they should
 rather take up only a few hundred bytes.)
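
Restating the arithmetic implied above, using only the figures quoted in
this message (20,000 qps across 8 nodes is 2,500 requests/second per node):

\[
\frac{2600\ \mathrm{MB}}{2\ \mathrm{s}} \approx 1300\ \mathrm{MB/s},\qquad
\frac{1300\ \mathrm{MB/s}}{500\ \mathrm{conn/s}} \approx 2.6\ \mathrm{MB\ per\ connection},\qquad
\frac{1300\ \mathrm{MB/s}}{2500\ \mathrm{req/s}} \approx 520\ \mathrm{KB\ per\ request}
\]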

 Thanks
 Rohit



Re: adding node to cluster

2012-08-30 Thread Casey Deccio
On Thu, Aug 30, 2012 at 11:21 AM, Rob Coli rc...@palominodb.com wrote:

 On Thu, Aug 30, 2012 at 10:18 AM, Casey Deccio ca...@deccio.net wrote:
  I'm adding a new node to an existing cluster that uses
  ByteOrderedPartitioner.  The documentation says that if I don't
 configure a
  token, then one will be automatically generated to take load from an
  existing node.
  What I'm finding is that when I add a new node, (super)
  column lookups begin failing (not sure if it was the row lookup failing
 or
  the supercolumn lookup failing), and I'm not sure why.

 1) You almost never actually want BOP.
 2) You never want Cassandra to pick a token for you. IMO and the
 opinion of many others, the fact that it does this is a bug. Specify a
 token with initial_token.
 3) You never want to use Supercolumns. The project does not support
 them but currently has no plan to deprecate them. Use composite row
 keys.
 4) Unless your existing cluster consists of one node, you almost never
 want to add only a single new node to a cluster. In general you want
 to double it.

 In summary, you are Doing It just about as Wrong as possible... but on
 to your actual question ... ! :)


Well, at least I'm consistent :)  Thanks for the hints.  Unfortunately,
when I first brought up my system--with the goal of getting it up
quickly--I thought BOP and Supercolumns were the way to go.  Plus, the
small cluster of nodes I was using was on a hodgepodge of hardware.  I've
since had a chance to think somewhat about redesigning and rearchitecting,
but it seems like there's no easy way to convert it properly.  Step one
was to migrate everything over to a single dedicated node on reasonable
hardware, so I could begin the process, which brought me to the issue I
initially posted about.  But the problem is that this is a live system, so
data loss is an issue I'd like to avoid.


 In what way are the lookups failing? Is there an exception?


No exception--just failing in that the data should be there, but isn't.

Casey


Re: Memory Usage of a connection

2012-08-30 Thread Peter Schuller
 Could these 500 connections/second cause (on average) 2600Mb memory usage
 per 2 second ~ 1300Mb/second.
 or For 1 connection around 2-3Mb.

In terms of garbage generated, it's much less about the number of
connections than about what you're doing with them. Are you, for
example, requesting large amounts of data? Large or many columns (or
both), etc. Essentially all working data that your request touches
is allocated on the heap and contributes to the allocation rate and ParNew
frequency.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)