Re: size tiered compaction - improvement

2012-04-18 Thread Radim Kolar



Any compaction pass over A will first convert the TTL data into tombstones.

Then, any subsequent pass that includes A *and all other sstables
containing rows with the same key* will drop the tombstones.
that's why I proposed attaching the TTL to the entire CF. Tombstones would not
be needed.
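
For context, the expiry being discussed here is Cassandra's per-column TTL, which is set on each write; a minimal pycassa sketch, with hypothetical keyspace, column family and value names:

import pycassa

# Hypothetical keyspace and column family, for illustration only.
pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
events = pycassa.ColumnFamily(pool, 'Events')

# The TTL is per column and given in seconds; once it expires, the column is
# treated as a tombstone, which a later compaction can eventually drop.
events.insert('some-row-key', {'payload': 'some value'}, ttl=86400)

A per-CF TTL, as proposed above, would remove the need to pass this on every write.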


blob fields, binary or hex?

2012-04-18 Thread mdione.ext

  We're building a database to store the avatars for our users in three sizes.
Thing is, we planned to use the blob field with a BytesType validator, but if
we try to inject the binary data as read from the image file, we get a "cannot
parse as hex bytes" error. The same happens if we convert the binary data to
its base64 representation. So far the only solution we have found is to convert
the binary data to a 'string of hex representations of each byte', meaning that
each binary byte is actually stored as two 'ASCII bytes'; or, alternatively, to
convert to base64 and store it in an ASCII column.
Did we miss something?

--
Marcos Dione
SysAdmin
Astek Sud-Est
pour FT/TGPF/OPF/PORTAIL/DOP/HEBEX @ Marco Polo
04 97 12 62 45 - mdione@orange.com






Re: blob fields, binary or hex?

2012-04-18 Thread Erik Forkalsud

On 04/18/2012 03:02 AM, mdione@orange.com wrote:

   We're building a database to store the avatars for our users in three sizes.
Thing is, we planned to use the blob field with a BytesType validator, but if we
try to inject the binary data as read from the image file, we get a "cannot
parse as hex bytes" error.


Which client are you using?  With Hector or straight thrift, you should
be able to store byte[] directly.
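
To illustrate the same point from Python, a minimal pycassa sketch (keyspace, column family, row key and file name are all assumptions); the client ships raw bytes over Thrift, so no hex or base64 round-trip is needed for a BytesType column:

import pycassa

# Hypothetical names, for illustration only.
pool = pycassa.ConnectionPool('Avatars', ['localhost:9160'])
avatars = pycassa.ColumnFamily(pool, 'UserAvatars')

# Read the image and store the raw bytes directly as the column value.
with open('avatar_small.png', 'rb') as f:
    avatars.insert('user-1234', {'small': f.read()})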



- Erik -


RE: size tiered compaction - improvement

2012-04-18 Thread Viktor Jevdokimov
Our use case requires Column TTL, not CF TTL, because it is variable, not 
constant.


Best regards/ Pagarbiai

Viktor Jevdokimov
Senior Developer

Email: viktor.jevdoki...@adform.com
Phone: +370 5 212 3063
Fax: +370 5 261 0453

J. Jasinskio 16C,
LT-01112 Vilnius,
Lithuania



-Original Message-
From: Radim Kolar [mailto:h...@filez.com]
Sent: Wednesday, April 18, 2012 12:57
To: user@cassandra.apache.org
Subject: Re: size tiered compaction - improvement


 Any compaction pass over A will first convert the TTL data into tombstones.

 Then, any subsequent pass that includes A *and all other sstables
 containing rows with the same key* will drop the tombstones.
that's why I proposed attaching the TTL to the entire CF. Tombstones would not be needed.


RE: blob fields, binary or hex?

2012-04-18 Thread mdione.ext
De : Erik Forkalsud [mailto:eforkals...@cj.com]
 Which client are you using?  With Hector or straight thrift, you should
 be able to store byte[] directly.

  So far, cassandra-cli only, but we're also testing phpcassa with CQL 
support[1].

--
[1] https://github.com/thobbs/phpcassa
--
Marcos Dione
SysAdmin
Astek Sud-Est
pour FT/TGPF/OPF/PORTAIL/DOP/HEBEX @ Marco Polo
04 97 12 62 45 - mdione@orange.com




Re: swap grows

2012-04-18 Thread ruslan usifov
Thanks for the link. But for me the question about free memory is still open.
In our cluster we have 200 IOPS at peak, but we still have about 3 GB of free
memory on each server (the cluster has 6 nodes, so there are 3*6 = 18 GB of
unused memory). I think the OS should fill all memory with the page cache of
SSTables (we do backups through DirectIO), but it doesn't do that and I don't
understand why. I can't find any sysctl that can tune page cache thresholds or
ratios.

Any suggestions?

2012/4/18 Jonathan Ellis jbel...@gmail.com

 what-is-the-linux-kernel-parameter-vm-swappiness:
 http://www.linuxvox.com/2009/10/what-is-the-linux-kernel-parameter-vm-swappiness


Re: size tiered compaction - improvement

2012-04-18 Thread Igor
For my use case it would be nice to have a per-CF TTL (to protect myself
from application bugs and from storage leaks due to a missed TTL), but it seems
you can't avoid tombstones even in this case, and not if you change the CF TTL
at runtime either.


On 04/18/2012 03:06 PM, Viktor Jevdokimov wrote:

Our use case requires Column TTL, not CF TTL, because it is variable, not 
constant.


Best regards/ Pagarbiai

Viktor Jevdokimov
Senior Developer

Email: viktor.jevdoki...@adform.com
Phone: +370 5 212 3063
Fax: +370 5 261 0453

J. Jasinskio 16C,
LT-01112 Vilnius,
Lithuania



-Original Message-
From: Radim Kolar [mailto:h...@filez.com]
Sent: Wednesday, April 18, 2012 12:57
To: user@cassandra.apache.org
Subject: Re: size tiered compaction - improvement



Any compaction pass over A will first convert the TTL data into tombstones.

Then, any subsequent pass that includes A *and all other sstables
containing rows with the same key* will drop the tombstones.

that's why I proposed attaching the TTL to the entire CF. Tombstones would not be needed.




Re: Counter column family

2012-04-18 Thread Tamar Fraenkel
My problem was the result of a Hector bug, see
http://groups.google.com/group/hector-users/browse_thread/thread/8359538ed387564e

So please ignore the question.
Thanks,

*Tamar Fraenkel *
Senior Software Engineer, TOK Media


ta...@tok-media.com
Tel:   +972 2 6409736
Mob:  +972 54 8356490
Fax:   +972 2 5612956





On Tue, Apr 17, 2012 at 6:59 PM, Tamar Fraenkel ta...@tok-media.com wrote:

 Hi!
 I want to understand how incrementing of counter works.


- I have a 3 node ring,
- I use FailoverPolicy.FAIL_FAST,
- RF is 2,

 I have the following counter column family
 ColumnFamily: tk_counters
   Key Validation Class: org.apache.cassandra.db.marshal.CompositeType(
 org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.
 UUIDType)
   Default column value validator: org.apache.cassandra.db.marshal.
 CounterColumnType
   Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
   Row cache size / save period in seconds / keys to save : 0.0/0/all
   Row Cache Provider: org.apache.cassandra.cache.
 SerializingCacheProvider
   Key cache size / save period in seconds: 0.0/14400
   GC grace seconds: 864000
   Compaction min/max thresholds: 4/32
   Read repair chance: 1.0
   Replicate on write: true
   Bloom Filter FP chance: default
   Built indexes: []
   Compaction Strategy:
 org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy

 My CL for this column family is Write=2, Read=1.

 When I increment a counter (using a Hector mutator) and execute returns
 without errors, what is the status of the nodes at that stage?
 Can execute return before the nodes are really updated, so that a read
 done immediately after the increment would still read the previous
 values?
 Thanks,

 *Tamar Fraenkel *
 Senior Software Engineer, TOK Media


 ta...@tok-media.com
 Tel:   +972 2 6409736
 Mob:  +972 54 8356490
 Fax:   +972 2 5612956





Re: size tiered compaction - improvement

2012-04-18 Thread Jonathan Ellis
It's not that simple, unless you have an append-only workload.  (See
discussion on
https://issues.apache.org/jira/browse/CASSANDRA-3974.)

On Wed, Apr 18, 2012 at 4:57 AM, Radim Kolar h...@filez.com wrote:

 Any compaction pass over A will first convert the TTL data into
 tombstones.

 Then, any subsequent pass that includes A *and all other sstables
 containing rows with the same key* will drop the tombstones.

 that's why I proposed attaching the TTL to the entire CF. Tombstones would not
  be needed.



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: blob fields, binary or hex?

2012-04-18 Thread phuduc nguyen
How are you passing a blob or binary stream to the CLI? It sounds like
you're passing in a representation of a binary stream as ASCII/UTF-8, which
will create the problems you describe.


Regards,
Duc


On 4/18/12 6:08 AM, mdione@orange.com mdione@orange.com wrote:

 De : Erik Forkalsud [mailto:eforkals...@cj.com]
 Which client are you using?  With Hector or straight thrift, you should
 be able to store byte[] directly.
 
   So far, cassandra-cli only, but we're also testing phpcassa with CQL
 support[1].
 
 --
 [1] https://github.com/thobbs/phpcassa
 --
 Marcos Dione
 SysAdmin
 Astek Sud-Est
 pour FT/TGPF/OPF/PORTAIL/DOP/HEBEX @ Marco Polo
 04 97 12 62 45 - mdione@orange.com
 



Single Vs. Multiple Keyspaces

2012-04-18 Thread Trevor Francis
We are launching a data-intensive application that will store upwards of 50
million 150-byte records per day per user. We have identified Cassandra as our
database technology and Flume as the tool we will use to seed the data from log
files into the database.

Each user is given their own server instance, but the schema of the data for
each user will be the same.

We will be performing realtime analysis on this information as part of our
application, and we were considering the advantages/disadvantages of all users
sharing the same keyspace. All data will be treated the same as far as
replication factor goes, and the only difference is that we won't be displaying
one user's info to another user. Users will be compartmentalized, and one
user's data will not affect or ever be compared against another user's.

Conceptualize it this way: each user has their own Apache server, that server
spits out 50 million records per day, and each user will only be analyzing the
data for their particular server, not anyone else's. The log formats are
exactly the same.

My experience lies in relational databases, not key-value stores like
Cassandra. So, in the MySQL world we would put each user in their own database
to avoid locking contention and to make queries faster.

If we don't put the info into different keyspaces, I assume we will have to add
an additional field to our records to identify the user that owns that
particular record. How does a single large keyspace affect query speed, etc.?



Trevor Francis




Re: Resident size growth

2012-04-18 Thread Rob Coli
On Tue, Apr 10, 2012 at 8:40 AM, ruslan usifov ruslan.usi...@gmail.com wrote:
 mmap doesn't depend on jna

FWIW, this confusion is as a result of the use of *mlockall*, which is
used to prevent mmapped files from being swapped, which does depend on
JNA.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb


RE: size tiered compaction - improvement

2012-04-18 Thread Bryce Godfrey
Per-CF or per-row TTL would be very useful for me too, with our time-series
data.

-Original Message-
From: Igor [mailto:i...@4friends.od.ua] 
Sent: Wednesday, April 18, 2012 6:06 AM
To: user@cassandra.apache.org
Subject: Re: size tiered compaction - improvement

For my use case it would be nice to have a per-CF TTL (to protect myself from
application bugs and from storage leaks due to a missed TTL), but it seems you
can't avoid tombstones even in this case, and not if you change the CF TTL at
runtime either.

On 04/18/2012 03:06 PM, Viktor Jevdokimov wrote:
 Our use case requires Column TTL, not CF TTL, because it is variable, not 
 constant.


 Best regards/ Pagarbiai

 Viktor Jevdokimov
 Senior Developer

 Email: viktor.jevdoki...@adform.com
 Phone: +370 5 212 3063
 Fax: +370 5 261 0453

 J. Jasinskio 16C,
 LT-01112 Vilnius,
 Lithuania



 -Original Message-
 From: Radim Kolar [mailto:h...@filez.com]
 Sent: Wednesday, April 18, 2012 12:57
 To: user@cassandra.apache.org
 Subject: Re: size tiered compaction - improvement


 Any compaction pass over A will first convert the TTL data into tombstones.

 Then, any subsequent pass that includes A *and all other sstables 
 containing rows with the same key* will drop the tombstones.
 that's why I proposed attaching the TTL to the entire CF. Tombstones would not
  be needed.



Re: Resident size growth

2012-04-18 Thread Jonathan Ellis
On Wed, Apr 18, 2012 at 12:44 PM, Rob Coli rc...@palominodb.com wrote:
 On Tue, Apr 10, 2012 at 8:40 AM, ruslan usifov ruslan.usi...@gmail.com 
 wrote:
 mmap doesn't depend on jna

 FWIW, this confusion is as a result of the use of *mlockall*, which is
 used to prevent mmapped files from being swapped, which does depend on
 JNA.

mlockall does depend on JNA, but we only lock the JVM itself in
memory.  The OS is free to page data files in and out as needed.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Column Family per User

2012-04-18 Thread Trevor Francis
Our application has users that can write upwards of 50 million records per
day. However, they all write the same format of records (20 fields, i.e.
columns). Should I put each user in their own column family, even though the
column family schema will be the same for every user?

Would this help with dimensioning, if each user is querying their keyspace and 
only their keyspace?


Trevor Francis




Re: Column Family per User

2012-04-18 Thread Janne Jalkanen

Each CF takes a fair chunk of memory regardless of how much data it has, so
this is probably not a good idea if you have lots of users. Also, using a
single CF means that compression is likely to work better (more redundant data).

However, Cassandra distributes the load across different nodes based on the row
key, and the writes scale roughly linearly with the number of nodes. So as long
as you can make sure that no single row gets overly burdened by writes, write
volume shouldn't be an issue (50 million writes/day to a single row would
always go to the same nodes; that is on the order of 600 writes/second/node,
which shouldn't really pose a problem, IMHO). The main problem is that if a
single row gets lots of columns it'll start to slow down at some point, and
your row caches become less useful, as they cache the entire row.

Keep your rows suitably sized and you should be fine. To partition the data,
you can either distribute it to a few CFs based on use, or use some other
distribution method (like user:1234:00, where the 00 is the hour of the day).

(There's a great article by Aaron Morton on how wide rows impact performance at 
http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/, but as always, 
running your own tests to determine the optimal setup is recommended.)

/Janne

On Apr 18, 2012, at 21:20 , Trevor Francis wrote:

 Our application has users that can write in upwards of 50 million records per 
 day. However, they all write the same format of records (20 fields…columns). 
 Should I put each user in their own column family, even though the column 
 family schema will be the same per user?
 
 Would this help with dimensioning, if each user is querying their keyspace 
 and only their keyspace?
 
 
 Trevor Francis
 
 



Re: Column Family per User

2012-04-18 Thread Trevor Francis
Janne,


Of course, I am new to the Cassandra world, so it is taking some getting used
to, understanding how everything translates from my MySQL head.

We are building an enterprise application that will ingest log information and
provide metrics and trending based upon the data contained in the logs. The
application is transactional in nature, such that a record will be written to a
log and our system will need to query that record and assign two values to it,
in addition to using the information to develop trending metrics.

The logs are being fed into Cassandra by Flume.

Each of our users will be assigned their own piece of hardware that generates
these log events, some of which can peak at up to 2500 transactions per second
for a couple of hours. The log entries are around 150 bytes each and contain
around 20 different pieces of information. Neither we nor our users are
interested in generating any queries across the entire database. Users are only
concerned with the data that their particular piece of hardware generates.

Should I just set up a single column family with 20 columns, the first of which
being the row key, and make the row key the username of that user?

We would also need probably 2 more columns to store Value A and Value B
assigned to that particular record.

Our metrics will be something like this: for this particular user, during this
particular timeframe, what is the average of field X? We would then store that
value, which lets us generate historical trending over the course of a week. We
will do this every 15 minutes.

Any suggestions on where I should head to start my journey into Cassandra for
my particular application?


Trevor Francis


On Apr 18, 2012, at 2:14 PM, Janne Jalkanen wrote:

 
 Each CF takes a fair chunk of memory regardless of how much data it has, so 
 this is probably not a good idea, if you have lots of users. Also using a 
 single CF means that compression is likely to work better (more redundant 
 data).
 
 However, Cassandra distributes the load across different nodes based on the
 row key, and the writes scale roughly linearly with the number of nodes. So as
 long as you can make sure that no single row gets overly burdened by writes,
 write volume shouldn't be an issue (50 million writes/day to a single row
 would always go to the same nodes - this is in the order of 600
 writes/second/node, which shouldn't really pose a problem, IMHO). The main
 problem is that if a single row gets lots of columns it'll start to slow down
 at some point, and your row caches become less useful, as they cache the
 entire row.

 Keep your rows suitably sized and you should be fine. To partition the data,
 you can either distribute it to a few CFs based on use or use some other
 distribution method (like user:1234:00 where the 00 is the hour-of-the-day).
 
 (There's a great article by Aaron Morton on how wide rows impact performance 
 at http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/, but as always, 
 running your own tests to determine the optimal setup is recommended.)
 
 /Janne
 
 On Apr 18, 2012, at 21:20 , Trevor Francis wrote:
 
 Our application has users that can write in upwards of 50 million records 
 per day. However, they all write the same format of records (20 
 fields…columns). Should I put each user in their own column family, even 
 though the column family schema will be the same per user?
 
 Would this help with dimensioning, if each user is querying their keyspace 
 and only their keyspace?
 
 
 Trevor Francis
 
 
 



Re: Column Family per User

2012-04-18 Thread Dave Brosius
Your design should be around how you want to query. If you are only querying
by user, then having the user as part of the row key makes sense. To manage row
size, you should think of a row as being a bucket of time. Cassandra supports a
large (but not without bounds) row size. To manage row size you might say that
this row is for user fred for the month of April, or if that's too much,
perhaps the row is for user fred for the day 4/18/12. To do this you can use
composite keys to hold both pieces of information in the key: (user, bucketpos).

The nice thing is that once the time period has come and gone, that row is
complete, and you can perform background jobs against that row and store
summary information for that time period.

- Original Message -
From: Trevor Francis trevor.fran...@tgrahamcapital.com
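
As an illustration of the (user, bucketpos) idea, a minimal sketch that derives a bucketed row key from an event time, assuming 15-minute buckets and a key built from the user plus the bucket's start time (none of this is prescribed by the thread):

import datetime

def bucket_row_key(user, when, minutes=15):
    # The bucket is the start of the time window that 'when' falls into,
    # so every event in the same window lands in the same row.
    start = when.replace(minute=(when.minute // minutes) * minutes,
                         second=0, microsecond=0)
    return (user, start.strftime('%Y-%m-%dT%H:%M'))

# ('fred', '2012-04-18T12:15') for an event at 12:22:23
print(bucket_row_key('fred', datetime.datetime(2012, 4, 18, 12, 22, 23)))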

Re: Column Family per User

2012-04-18 Thread Trevor Francis
I am trying to grasp this concept, so let me try a scenario.

Let's say I have 5 data points being captured in the log file. Here would be a
typical table schema in MySQL:

Id, Username, Time, Wind, Rain, Sunshine

Select * from table; would reveal:

1, george, 2012-04-12T12:22:23.293, 55, 45, 10
2, george, 2012-04-12T12:22:24.293, 45, 25, 25
3, george, 2012-04-12T12:22:25.293, 35, 15, 11
4, george, 2012-04-12T12:22:26.293, 55, 65, 16
5, george, 2012-04-12T12:22:27.293, 12, 5, 22

And it would just continue from there adding rows as log files are imported.

A select * from table where sunshine=16 would yield:

4, george, 2012-04-12T12:22:26.293, 55, 65, 16

  
Now, you are saying that in Cassandra, instead of having a bunch of rows
containing ordered information (which is what I would have), I would have a
single row with multiple columns:

George | 2012-04-12T12:22:23.293, Wind=55 | 2012-04-12T12:22:23.293, Rain=45 |
2012-04-12T12:22:23.293, Sunshine=10 | ...continued...

So George would be the row and the columns would be the actual data. The data
would be oriented horizontally, vs. vertically (MySQL).

So for instance, log generation on our application isn't linear, as it peaks at
certain times of the day. A user generating 2500 transactions per second at
peak would typically generate 60M log entries per day. Multiply that by 20 data
pieces and you are looking at 1.2B columns in a given day for that user.
Assuming we batch the writes every minute, can a node handle this sort of load?

Also, can we rotate the row every day? Would it make more sense to rotate 
hourly? At peak, hourly rotation would decrease the row size to 180M data 
points vs. 1.2B.

At max, we may only have 500 users on our platform. That means that if we did
hourly row rotation, that would be 12,000 rows per day, with a maximum row size
of 180M columns.


Am I grasping this concept properly?

Trevor Francis
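
A quick back-of-the-envelope check of the numbers above, assuming the 180M figure means one full hour at the 2500 TPS peak:

entries_per_day = 60000000          # stated log entries per day per user
fields = 20                         # data points per entry
print(entries_per_day * fields)     # 1200000000 columns per day (1.2B)

peak_tps = 2500
print(peak_tps * 3600 * fields)     # 180000000 columns in one peak hour (180M)

users = 500
buckets_per_day = 24                # hourly row rotation
print(users * buckets_per_day)      # 12000 rows per day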


On Apr 18, 2012, at 3:06 PM, Dave Brosius wrote:

 
 Your design should be around how you want to query. If you are only querying 
 by user, then having a user as part of the row key makes sense. To manage row 
 size, you should think of a row as being a bucket of time. Cassandra supports 
 a large (but not without bounds) row size. To manage row size you might say 
 that this row is for user fred for the month of april, or if that's too much 
 perhaps the row is for user fred for the day 4/18/12. To do this you can use 
 composite keys to hold both pieces of information in the key. (user, 
 bucketpos)
 
 The nice thing is that once the time period has come and gone, that row is 
 complete, and you can perform background jobs against that row and store 
 summary information for that time period.
 
 
 - Original Message -
 From: Trevor Francis trevor.fran...@tgrahamcapital.com 
 Sent: Wed, April 18, 2012 15:48
 Subject: Re: Column Family per User
 
 Janne,
  
  
 Of course, I am new to the Cassandra world, so it is taking some getting used 
 to understand how everything translates into my MYSQL head.
  
 We are building an enterprise application that will ingest log information 
 and provide metrics and trending based upon the data contained in the logs. 
 The application is transactional in nature such that a record will be written 
 to a log and our system will need to query that record and assign two values 
 to it in addition to using the information to develop trending metrics. 
  
 The logs are being fed into cassandra by Flume.
  
 Each of our users will be assigned their own piece of hardware that generates 
 these log events, some of which can peak at up to 2500 transactions per 
 second for a couple of hours. The log entries are around 150-bytes each and 
 contain around 20 different pieces of information. Neither us, nor our users 
 are interested in generating any queries across the entire database. Users 
 are only concerned with the data that their particular piece of hardware 
 generates. 
  
 Should I just setup a single column family with 20 columns, the first of 
 which being the row key and make the row key the username of that user?
  
 We would also need probably 2 more columns to store Value A and Value B 
 assigned to that particular record.
  
 Our metrics will be be something like this: For this particular user, during 
 this particular timeframe, what is the average of field X? And then store 
 that value, which we can generate historical trending over the course a week. 
 We will do this every 15 minutes. 
  
 Any suggestions on where I should head to start my journey into Cassandra for 
 my particular application?
  
 
 Trevor Francis
  
 
 On Apr 18, 2012, at 2:14 PM, Janne Jalkanen wrote:
 
  
 Each CF takes a fair chunk of memory regardless of how much data it has, so 
 this is probably not a good idea, if you have lots of users. Also using a 
 single CF means that compression is likely to work better (more redundant 
 data).
  
 However, Cassandra distributes the load across different nodes 

Re: Column Family per User

2012-04-18 Thread Dave Brosius
Yes, in this Cassandra model time wouldn't be a column value, it would be part
of the column name. Depending on how you want to access your data (give me all
data points for time X) and how many separate datapoints you have for time X,
you might consider packing all the data for a time into one column through
composite columns:

column name: 2012-04-12T12:22:23.293/55/45/10

(where / is a human-readable representation of the composite separator). In
this case there wouldn't actually be a value; the data is just encoded in the
column name. Obviously if you are storing dozens of separate datapoints for a
timestamp then this gets out of hand quickly, and perhaps you need to go back
to column names in time/fieldname format with a real value. The advantage,
though, of the composite key is that you eliminate all that constant blather
about 'Wind', 'Rain', 'Sunshine' in your data and only hold real data (granted,
compression will probably help here, but not having it at all is even better).

As for row size, obviously that takes some experimentation on your part. You
can bucket a row to be any time frame you want. If you feel that 15 minutes is
the correct length of time given the amount of data you will write, then use 15
minutes. If it's 1 hour, use 1 hour. The only thing you have to figure out is a
'bucket time' definition that you understand; likely it's the timestamp of when
that time period starts.

As for 'rotating the row', perhaps it's just semantics, but there really is no
such concept. You are at some point in time, and you want to write some data to
the database. The steps are:

1) get the user
2) get the timestamp of the current bucket based on 'now'
3) build a composite key
4) insert the data with that key

Whether that row existed before or is a new row has no bearing on your client
code.

- Original Message -
From: Trevor Francis trevor.fran...@tgrahamcapital.com
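
Putting those four steps together, a rough pycassa sketch; the keyspace and column family names, the 15-minute bucket, and the assumption that both the row key and the comparator are CompositeType(UTF8Type, UTF8Type) are illustrative choices, not something the thread prescribes:

import datetime
import pycassa

# Hypothetical keyspace/CF; assumes key_validation_class and comparator
# are both CompositeType(UTF8Type, UTF8Type).
pool = pycassa.ConnectionPool('Logs', ['localhost:9160'])
data = pycassa.ColumnFamily(pool, 'Data')

def write_point(user, now, field, value, bucket_minutes=15):
    # 1) the user is given; 2) timestamp of the current bucket based on 'now'
    bucket = now.replace(minute=(now.minute // bucket_minutes) * bucket_minutes,
                         second=0, microsecond=0)
    # 3) build the composite row key
    row_key = (user, bucket.strftime('%Y-%m-%dT%H:%M'))
    # 4) insert with that key; the column name carries the event time plus the
    #    field name, and the value is the measurement itself
    data.insert(row_key, {(now.isoformat(), field): str(value)})

write_point('george', datetime.datetime.utcnow(), 'wind', 55)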


Re: Column Family per User

2012-04-18 Thread Trevor Francis
Regarding "rotating", I was thinking about the concept of logrotate, where you
write to a file for a specific period of time, then you create a new file and
write to it for the next period. So yes, it closes a row and opens another row.

Since I will be generating analytics every 15 minutes, it would make sense to
me to bucket a row every 15 minutes. Since I would only have at most 500 users,
this doesn't strike me as too many rows in a given day (48,000). Potential
downsides to doing this?

Since I am analyzing 20 separate data points for a given log entry, it would
make sense that querying based upon a specific metric (wind, rain, sunshine)
would be easier if the data was separated. However, couldn't we build composite
columns for time and value where all that would be left is the data?

So composite row key would be:

george
2012-04-12T12:20

And Columns would be: 

12:22:23.293/Wind

12:22:23.293/Rain

12:22:23.293/Sunshine

Data would be:
55
45
10


Or the columns could be 12:22:23.293

Data:
Wind/55/45/35


Or something like that. Am I headed in the right direction?


Trevor Francis


On Apr 18, 2012, at 3:10 PM, Janne Jalkanen wrote:

 
 Hi!
 
 A simple model to do this would be
 
 * ColumnFamily Data
   * key: userid
   * column: Composite( timestamp, entrytype ) = value
 
 For example, userid janne would have columns 
(2012-04-12T12:22:23.293,speed) = 24;
(2012-04-12T12:22:23.293,temperature) = 12.4
(2012-04-12T12:22:23.293,direction) = 356;
(2012-04-12T12:22:23.295,speed) = 24.1;
(2012-04-12T12:22:23.295,temperature) = 12.3
(2012-04-12T12:22:23.295,direction) = 352;
 
 Note that Cassandra does not require you to know which columns you're going
 to put in it (unlike MySQL). You can declare types ahead of time if you know
 what they are, but if you need to start adding a new column, just start
 writing it and Cassandra should do the right thing.

 However, there are a few points which you might want to consider:
 * Using ISO dates for timestamps has a minor problem: if two events occur
 during the same millisecond, they'll overwrite each other. This is why most
 time series in C* use TimeUUIDs, which contain a millisecond timestamp + a
 random component.
 (http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/)
 * This will generate timestamp x entrytype columns. So for 2500 entries/second
 and 20 columns this means about 2500*20 = 50,000 writes per second (granted
 that you will most probably batch the writes). You will need to performance
 test your cluster to see if this schema is right for you. If not, you might
 want to see how you can distribute the keys differently, e.g. by bucketing
 the data somehow. However, I recommend that you build a first shot of your
 app structure, then load test it until it breaks; that should give you a
 pretty good understanding of what exactly Cassandra is doing.
 
 To do then analytics multiple options are possible; a popular one is to run 
 MapReduce queries using a tool like Apache Pig on regular intervals. DataStax 
 has good documentation and you probably want to take a look at their offering 
 as well, since they have pretty good Hadoop/MapReduce support for Cassandra.
 
 CLI syntax to try with:
 
 create keyspace DataTest with 
 placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and 
 strategy_options = {replication_factor:1};
 use DataTest;
 create column family Data with key_validation_class=UTF8Type and 
 comparator='CompositeType(UUIDType,UTF8Type)';
 
 Then start writing using your fav client.
 
 /Janne
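
For illustration, a minimal pycassa sketch of writing into the Data column family created above; the keyspace and CF names match the CLI snippet, while the host, row key and values are assumptions:

import uuid
import pycassa

pool = pycassa.ConnectionPool('DataTest', ['localhost:9160'])
data = pycassa.ColumnFamily(pool, 'Data')

ts = uuid.uuid1()  # time-based UUID, so same-millisecond events don't collide
data.insert('janne', {
    (ts, 'speed'): '24.1',        # composite column name: (UUID, entry type)
    (ts, 'temperature'): '12.3',  # values are plain strings (default BytesType)
    (ts, 'direction'): '352',
})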
 
 On Apr 18, 2012, at 22:36 , Trevor Francis wrote:
 
 Janne,
 
 
 Of course, I am new to the Cassandra world, so it is taking some getting 
 used to understand how everything translates into my MYSQL head.
 
 We are building an enterprise application that will ingest log information 
 and provide metrics and trending based upon the data contained in the logs. 
 The application is transactional in nature such that a record will be 
 written to a log and our system will need to query that record and assign 
 two values to it in addition to using the information to develop trending 
 metrics. 
 
 The logs are being fed into cassandra by Flume.
 
 Each of our users will be assigned their own piece of hardware that 
 generates these log events, some of which can peak at up to 2500 
 transactions per second for a couple of hours. The log entries are around 
 150-bytes each and contain around 20 different pieces of information. 
 Neither us, nor our users are interested in generating any queries across 
 the entire database. Users are only concerned with the data that their 
 particular piece of hardware generates. 
 
 Should I just setup a single column family with 20 columns, the first of 
 which being the row key and make the row key the username of that user?
 
 We would also need probably 2 more columns to store Value A and Value B 
 assigned to that particular record.
 
 

Re: Column Family per User

2012-04-18 Thread Dave Brosius
It seems to me you are on the right track. Finding the right balance of # rows
vs. row width is the part that will take the most experimentation.

- Original Message -
From: Trevor Francis trevor.fran...@tgrahamcapital.com

DataStax Opscenter 2.0 question

2012-04-18 Thread Jay Parashar
I am having trouble running OpsCenter. It starts without any error, but the GUI
stays on the index page and just shows "Loading OpsCenter...". Firebug shows an
error: "this._onClusterSave.bind is not a function".

I have the logging level set to DEBUG and it shows no errors (pasted below).
This is the only change I have made to the opscenterd.conf file.

 

Cassandra 1.0.8 included in DSE 2.0 is running fine. Would appreciate any
help. Previously, I had no problems configuring/running the OpsCenter 1.4.1.


 

2012-04-18 23:40:32+0200 []  INFO: twistd 10.2.0 (/usr/bin/python2.6 2.6.6)
starting up.

2012-04-18 23:40:32+0200 []  INFO: reactor class:
twisted.internet.selectreactor.SelectReactor.

2012-04-18 23:40:32+0200 []  INFO: Logging level set to 'debug'

2012-04-18 23:40:32+0200 []  INFO: OpsCenterdService startService

2012-04-18 23:40:32+0200 []  INFO: OpsCenter version: 2.0

2012-04-18 23:40:32+0200 []  INFO: Compatible agent version: 2.7

2012-04-18 23:40:32+0200 [] DEBUG: Main config options:

2012-04-18 23:40:32+0200 [] DEBUG: agents : [('ssl_certfile',
'./ssl/opscenter.pem'), ('tmp_dir', './tmp'), ('path_to_installscript',
'./bin/install_agent.sh'), ('path_to_sudowrap', './bin/sudo_with_pass.py'),
('agent_keyfile', './ssl/agentKeyStore'), ('path_to_deb', 'NONE'),
('ssl_keyfile', './ssl/opscenter.key'), ('agent_certfile',
'./ssl/agentKeyStore.pem'), ('path_to_rpm', 'NONE')], webserver :
[('interface', '127.0.0.1'), ('staticdir', './content'), ('port', ''),
('log_path', './log/http.log')], stat_reporter : [('ssl_key',
'./ssl/stats.pem')], logging : [('log_path', './log/opscenterd.log'),
('level', 'DEBUG')], authentication : [('passwd_file', './passwds')], 

2012-04-18 23:40:32+0200 [] DEBUG: Loading all per-cluster config files

2012-04-18 23:40:32+0200 [] DEBUG: Done loading all per-cluster config files

2012-04-18 23:40:32+0200 []  INFO: No clusters are configured yet, checking
to see if a config migration is needed

2012-04-18 23:40:32+0200 []  INFO: Main config does not appear to include a
cluster configuration, skipping migration

2012-04-18 23:40:32+0200 [] DEBUG: Loading all per-cluster config files

2012-04-18 23:40:32+0200 [] DEBUG: Done loading all per-cluster config files

2012-04-18 23:40:32+0200 []  INFO: No clusters are configured

2012-04-18 23:40:32+0200 []  INFO: HTTP BASIC authentication disabled

2012-04-18 23:40:32+0200 []  INFO: SSL enabled

2012-04-18 23:40:32+0200 []  INFO: opscenterd.WebServer.OpsCenterdWebServer
starting on 

2012-04-18 23:40:32+0200 []  INFO: Starting factory
opscenterd.WebServer.OpsCenterdWebServer instance at 0x9e35c2c

2012-04-18 23:40:32+0200 []  INFO: morbid.morbid.StompFactory starting on
61619

2012-04-18 23:40:32+0200 []  INFO: Starting factory
morbid.morbid.StompFactory instance at 0x9e96a4c

2012-04-18 23:40:32+0200 []  INFO: Configuring agent communication with ssl
support enabled.

2012-04-18 23:40:32+0200 []  INFO: morbid.morbid.StompFactory starting on
61620

2012-04-18 23:40:32+0200 []  INFO: OS Version: Linux version
2.6.35-22-generic (buildd@rothera) (gcc version 4.4.5 (Ubuntu/Linaro
4.4.4-14ubuntu4) ) #33-Ubuntu SMP Sun Sep 19 20:34:50 UTC 2010

2012-04-18 23:40:32+0200 []  INFO: CPU Info: ['2197.558', '2197.558']

2012-04-18 23:40:33+0200 []  INFO: Mem Info: 2966MB

2012-04-18 23:40:33+0200 []  INFO: Package Manager: apt

2012-04-18 23:41:32+0200 [] DEBUG: Average opscenterd CPU usage: 0.23%,
memory usage: 16 MB

2012-04-18 23:42:32+0200 [] DEBUG: Average opscenterd CPU usage: 0.07%,
memory usage: 16 MB

 

 

Thanks

Jay



Re: DataStax Opscenter 2.0 question

2012-04-18 Thread Nick Bailey
What version of Firefox? Someone has reported a similar issue with
Firefox 3.6. Can you try with Chrome, or perhaps a more recent version
of Firefox (assuming you are also on an older version)?

On Wed, Apr 18, 2012 at 4:51 PM, Jay Parashar jparas...@itscape.com wrote:
 I am having trouble in running the OpsCenter. It starts without any error
 but the GUI stays in the index page and just shows “Loading OpsCenter…”.
 Firebug shows an error “this._onClusterSave.bind is not a function”.

 I have the log turned on DEBUG it shows no error (pasted below). This is the
 only change I have made to the “opscenterd.conf” file.



 Cassandra 1.0.8 included in DSE 2.0 is running fine. Would appreciate any
 help. Previously, I had no problems configuring/running the OpsCenter 1.4.1.



 2012-04-18 23:40:32+0200 []  INFO: twistd 10.2.0 (/usr/bin/python2.6 2.6.6)
 starting up.

 2012-04-18 23:40:32+0200 []  INFO: reactor class:
 twisted.internet.selectreactor.SelectReactor.

 2012-04-18 23:40:32+0200 []  INFO: Logging level set to 'debug'

 2012-04-18 23:40:32+0200 []  INFO: OpsCenterdService startService

 2012-04-18 23:40:32+0200 []  INFO: OpsCenter version: 2.0

 2012-04-18 23:40:32+0200 []  INFO: Compatible agent version: 2.7

 2012-04-18 23:40:32+0200 [] DEBUG: Main config options:

 2012-04-18 23:40:32+0200 [] DEBUG: agents : [('ssl_certfile',
 './ssl/opscenter.pem'), ('tmp_dir', './tmp'), ('path_to_installscript',
 './bin/install_agent.sh'), ('path_to_sudowrap', './bin/sudo_with_pass.py'),
 ('agent_keyfile', './ssl/agentKeyStore'), ('path_to_deb', 'NONE'),
 ('ssl_keyfile', './ssl/opscenter.key'), ('agent_certfile',
 './ssl/agentKeyStore.pem'), ('path_to_rpm', 'NONE')], webserver :
 [('interface', '127.0.0.1'), ('staticdir', './content'), ('port', ''),
 ('log_path', './log/http.log')], stat_reporter : [('ssl_key',
 './ssl/stats.pem')], logging : [('log_path', './log/opscenterd.log'),
 ('level', 'DEBUG')], authentication : [('passwd_file', './passwds')],

 2012-04-18 23:40:32+0200 [] DEBUG: Loading all per-cluster config files

 2012-04-18 23:40:32+0200 [] DEBUG: Done loading all per-cluster config files

 2012-04-18 23:40:32+0200 []  INFO: No clusters are configured yet, checking
 to see if a config migration is needed

 2012-04-18 23:40:32+0200 []  INFO: Main config does not appear to include a
 cluster configuration, skipping migration

 2012-04-18 23:40:32+0200 [] DEBUG: Loading all per-cluster config files

 2012-04-18 23:40:32+0200 [] DEBUG: Done loading all per-cluster config files

 2012-04-18 23:40:32+0200 []  INFO: No clusters are configured

 2012-04-18 23:40:32+0200 []  INFO: HTTP BASIC authentication disabled

 2012-04-18 23:40:32+0200 []  INFO: SSL enabled

 2012-04-18 23:40:32+0200 []  INFO: opscenterd.WebServer.OpsCenterdWebServer
 starting on 

 2012-04-18 23:40:32+0200 []  INFO: Starting factory
 opscenterd.WebServer.OpsCenterdWebServer instance at 0x9e35c2c

 2012-04-18 23:40:32+0200 []  INFO: morbid.morbid.StompFactory starting on
 61619

 2012-04-18 23:40:32+0200 []  INFO: Starting factory
 morbid.morbid.StompFactory instance at 0x9e96a4c

 2012-04-18 23:40:32+0200 []  INFO: Configuring agent communication with ssl
 support enabled.

 2012-04-18 23:40:32+0200 []  INFO: morbid.morbid.StompFactory starting on
 61620

 2012-04-18 23:40:32+0200 []  INFO: OS Version: Linux version
 2.6.35-22-generic (buildd@rothera) (gcc version 4.4.5 (Ubuntu/Linaro
 4.4.4-14ubuntu4) ) #33-Ubuntu SMP Sun Sep 19 20:34:50 UTC 2010

 2012-04-18 23:40:32+0200 []  INFO: CPU Info: ['2197.558', '2197.558']

 2012-04-18 23:40:33+0200 []  INFO: Mem Info: 2966MB

 2012-04-18 23:40:33+0200 []  INFO: Package Manager: apt

 2012-04-18 23:41:32+0200 [] DEBUG: Average opscenterd CPU usage: 0.23%,
 memory usage: 16 MB

 2012-04-18 23:42:32+0200 [] DEBUG: Average opscenterd CPU usage: 0.07%,
 memory usage: 16 MB





 Thanks

 Jay


Cassandra read optimization

2012-04-18 Thread Dan Feldman
Hi all,

I'm trying to optimize moving data from Cassandra to HDFS using either Ruby
or Python client. Right now, I'm playing around on my staging server, an 8
GB single node machine. My data in Cassandra (1.0.8) consist of 2 rows (for
now) with ~150k super columns each (I know, I know - super columns are
bad). Every super column has ~25 columns totaling ~800 bytes per super
column.

I should also mention that currently the database is static - there are no
writes/updates, only reads.

Anyways, in my python/ruby scripts, I'm taking slices of 5000 supercolumns
long from a single row.  It takes 13 seconds with ruby and 8 seconds with
pycassa to get a single slice. Or, in other words, it's currently reading
at speeds of less than 500 kB per second. The speed seems to be linear with
the length of a slice (i.e. 6 seconds for 2500 scs for ruby). If I run
nodetool cfstats while my script is running, it tells me that my read
latency on the column family is ~300ms.

I assume that this is not normal and thus was wondering what parameters I
could tweak to improve the performance.

Thanks,
Dan F.


Re: Cassandra read optimization

2012-04-18 Thread Aaron Turner
On Wed, Apr 18, 2012 at 5:00 PM, Dan Feldman hriunde...@gmail.com wrote:
 Hi all,

 I'm trying to optimize moving data from Cassandra to HDFS using either Ruby
 or Python client. Right now, I'm playing around on my staging server, an 8
 GB single node machine. My data in Cassandra (1.0.8) consist of 2 rows (for
 now) with ~150k super columns each (I know, I know - super columns are bad).
 Every super column has ~25 columns totaling ~800 bytes per super column.

 I should also mention that currently the database is static - there are no
 writes/updates, only reads.

 Anyways, in my python/ruby scripts, I'm taking slices of 5000 supercolumns
 long from a single row.  It takes 13 seconds with ruby and 8 seconds with
 pycassa to get a single slice. Or, in other words, it's currently reading at
 speeds of less than 500 kB per second. The speed seems to be linear with the
 length of a slice (i.e. 6 seconds for 2500 scs for ruby). If I run nodetool
 cfstats while my script is running, it tells me that my read latency on the
 column family is ~300ms.

 I assume that this is not normal and thus was wondering what parameters I
 could tweak to improve the performance.


Is your client multi-threaded?  The single-threaded performance of
Cassandra isn't at all impressive, and it really is designed for
dealing with a lot of simultaneous requests.


-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
carpe diem quam minimum credula postero


Re: RMI/JMX errors, weird

2012-04-18 Thread Maxim Potekhin
Server log below. Mind you that all the nodes are still up -- even 
though reported as dead in this log.

What's going on here?

Thanks!

 INFO [GossipTasks:1] 2012-04-18 22:18:26,487 Gossiper.java (line 719) 
InetAddress /130.199.185.193 is now dead.
 INFO [ScheduledTasks:1] 2012-04-18 22:18:26,487 StatusLogger.java 
(line 50) Pool NameActive   Pending   Blocked
ERROR [GossipTasks:1] 2012-04-18 22:18:26,488 AntiEntropyService.java 
(line 722) Problem during repair session manual-repair-1b3453b6-28b5-4abd-84ce-0326b5468064, endpoint /130.199.185.193 died
ERROR [RMI TCP Connection(22)-130.199.185.194] 2012-04-18 22:18:26,488 
StorageService.java (line 1607) Repair session org.apache.cassandra.service.AntiEntropyService$RepairSession@4cc9e2bc failed.
java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
java.io.IOException: Problem during repair session manual-repair-43545b22-ffe8-4243-8a98-509bbfec9872, endpoint /130.199.185.195 died
at 
java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)

at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at 
org.apache.cassandra.service.StorageService.forceTableRepair(StorageService.java:1603)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

at java.lang.reflect.Method.invoke(Method.java:597)
at 
com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
at 
com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
at 
com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
at 
com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
at 
com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
at 
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
at 
com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
at 
javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
at 
javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
at 
javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
at 
javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
at 
javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)

at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

at java.lang.reflect.Method.invoke(Method.java:597)
at 
sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)

at sun.rmi.transport.Transport$1.run(Transport.java:159)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
at 
sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: java.io.IOException: Problem 
during repair session manual-repair-43545b22-ffe8-4243-8a98-509bbfec9872, endpoint /130.199.185.195 died
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at 
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

at java.util.concurrent.FutureTask.run(FutureTask.java:138)
... 3 more
Caused by: java.io.IOException: Problem during repair session 
manual-repair-43545b22-ffe8-4243-8a98-509bbfec9872, endpoint /130.199.185.195 died
at 
org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:723)
at 
org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:760)
at 
org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:165)
at 
org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:538)

at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57)
at 
org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157)
at 

Lexer error at char '\u201C'

2012-04-18 Thread Trevor Francis
Trying to add an agent config through the master web server to point to a 
collector node, getting:

FAILED  config [10.38.20.197, tailDir(/var/log/acc/, .*, true, 0), 
agentDFOSink(“10.38.20.203”,35853)]  Attempted to write an invalid 
sink/source: Lexer error at char '\u201C' at line 1 char 13


Any ideas?

Trevor Francis




Re: Lexer error at char '\u201C'

2012-04-18 Thread Tyler Hobbs
This... looks like Flume.  Are you sure you've got the right mailing list?

On Wed, Apr 18, 2012 at 11:04 PM, Trevor Francis 
trevor.fran...@tgrahamcapital.com wrote:

 Trying to add an agent config through the master web server to point to a
 collector node, getting:

 FAILEDconfig [10.38.20.197, tailDir(/var/log/acc/, .*, true, 0),
 agentDFOSink(“10.38.20.203”,35853)]Attempted to write an invalid
 sink/source: Lexer error at char '\u201C' at line 1 char 13


 Any ideas?

 Trevor Francis





-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: Lexer error at char '\u201C'

2012-04-18 Thread Trevor Francis
...slaps himself.

Oh, you guys at DataStax are great. I have deployed a small Cassandra cluster
using your community edition. I'm actually currently working on making Flume
use Cassandra as a sink... unsuccessfully. However, I did just get this Flume
error fixed.

Are you aware of any Cassandra sinks that actually work for Flume?


Trevor Francis


On Apr 18, 2012, at 11:12 PM, Tyler Hobbs wrote:

 This... looks like Flume.  Are you sure you've got the right mailing list?
 
 On Wed, Apr 18, 2012 at 11:04 PM, Trevor Francis 
 trevor.fran...@tgrahamcapital.com wrote:
 Trying to add an agent config through the master web server to point to a 
 collector node, getting:
 
 FAILED config [10.38.20.197, tailDir(/var/log/acc/, .*, true, 0), 
 agentDFOSink(“10.38.20.203”,35853)]  Attempted to write an invalid 
 sink/source: Lexer error at char '\u201C' at line 1 char 13
 
 
 Any ideas?
 
 Trevor Francis
 
 
 
 
 
 -- 
 Tyler Hobbs
 DataStax
 



Re: Lexer error at char '\u201C'

2012-04-18 Thread Nick Bailey
https://github.com/thobbs/flume-cassandra-plugin

I think that is fairly up to date, right Tyler?

On Wed, Apr 18, 2012 at 11:18 PM, Trevor Francis 
trevor.fran...@tgrahamcapital.com wrote:

 …..slaps himself.

 Oh you guys at Datastax are great. I have deployed a small Cassandra
 cluster using your community edition. Actually currently working on making
 Flume use cassandra as a sink…..unsuccessfully. However, I did just get
 this Flume error fixed.

 Are you aware of any cassandra sinks that actually work for Flume?


  Trevor Francis


 On Apr 18, 2012, at 11:12 PM, Tyler Hobbs wrote:

 This... looks like Flume.  Are you sure you've got the right mailing list?

 On Wed, Apr 18, 2012 at 11:04 PM, Trevor Francis 
 trevor.fran...@tgrahamcapital.com wrote:

 Trying to add an agent config through the master web server to point to a
 collector node, getting:

 FAILED config [10.38.20.197, tailDir(/var/log/acc/, .*, true, 0),
 agentDFOSink(“10.38.20.203”,35853)] Attempted to write an invalid
 sink/source: Lexer error at char '\u201C' at line 1 char 13


 Any ideas?

  Trevor Francis





 --
 Tyler Hobbs
 DataStax http://datastax.com/





Re: Lexer error at char '\u201C'

2012-04-18 Thread Tyler Hobbs
Yup, you beat me to the punch by a minute.

On Wed, Apr 18, 2012 at 11:39 PM, Nick Bailey n...@datastax.com wrote:

 https://github.com/thobbs/flume-cassandra-plugin

 I think that is fairly up to date, right Tyler?


 On Wed, Apr 18, 2012 at 11:18 PM, Trevor Francis 
 trevor.fran...@tgrahamcapital.com wrote:

 …..slaps himself.

 Oh you guys at Datastax are great. I have deployed a small Cassandra
 cluster using your community edition. Actually currently working on making
 Flume use cassandra as a sink…..unsuccessfully. However, I did just get
 this Flume error fixed.

 Are you aware of any cassandra sinks that actually work for Flume?


  Trevor Francis


 On Apr 18, 2012, at 11:12 PM, Tyler Hobbs wrote:

 This... looks like Flume.  Are you sure you've got the right mailing list?

 On Wed, Apr 18, 2012 at 11:04 PM, Trevor Francis 
 trevor.fran...@tgrahamcapital.com wrote:

 Trying to add an agent config through the master web server to point to
 a collector node, getting:

 FAILED config [10.38.20.197, tailDir(/var/log/acc/, .*, true, 0),
 agentDFOSink(“10.38.20.203”,35853)] Attempted to write an invalid
 sink/source: Lexer error at char '\u201C' at line 1 char 13


 Any ideas?

  Trevor Francis





 --
 Tyler Hobbs
 DataStax http://datastax.com/






-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: Cassandra read optimization

2012-04-18 Thread Tyler Hobbs
I tested this out with a small pycassa script:
https://gist.github.com/2418598

On my not-very-impressive laptop, I can read 5000 of the super columns in 3
seconds (cold) or 1.5 (warm).  Reading in batches of 1000 super columns at
a time gives much better performance; I definitely recommend going with a
smaller batch size.

Make sure that the timeout on your ConnectionPool isn't too low to handle a
big request in pycassa.  If you turn on logging (as it is in the script I
linked), you should be able to see if the request is timing out a couple of
times before it succeeds.

It might also be good to make sure that you've got JNA in place and your
heap size is sufficient.
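
For what it's worth, a rough sketch of paging through one wide super-column row in fixed-size batches with pycassa, along the lines suggested above; the keyspace, column family, row key and batch size are assumptions, and the row is assumed to exist:

import pycassa

# Hypothetical names; a generous pool timeout helps with large slices.
pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'], timeout=30)
cf = pycassa.ColumnFamily(pool, 'SuperCF')

def iter_super_columns(row_key, batch=1000):
    start = ''
    while True:
        chunk = cf.get(row_key, column_start=start, column_count=batch)
        items = list(chunk.items())
        if start:
            items = items[1:]  # first item repeats the previous batch's last one
        for name, subcolumns in items:
            yield name, subcolumns
        if len(chunk) < batch:
            break
        start = list(chunk.keys())[-1]  # resume from the last super column seen

for name, subcolumns in iter_super_columns('row1'):
    pass  # hand each super column off to the HDFS writer here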

On Wed, Apr 18, 2012 at 8:59 PM, Aaron Turner synfina...@gmail.com wrote:

 On Wed, Apr 18, 2012 at 5:00 PM, Dan Feldman hriunde...@gmail.com wrote:
  Hi all,
 
  I'm trying to optimize moving data from Cassandra to HDFS using either
 Ruby
  or Python client. Right now, I'm playing around on my staging server, an
 8
  GB single node machine. My data in Cassandra (1.0.8) consist of 2 rows
 (for
  now) with ~150k super columns each (I know, I know - super columns are
 bad).
  Every super column has ~25 columns totaling ~800 bytes per super column.
 
  I should also mention that currently the database is static - there are
 no
  writes/updates, only reads.
 
  Anyways, in my python/ruby scripts, I'm taking slices of 5000
 supercolumns
  long from a single row.  It takes 13 seconds with ruby and 8 seconds with
  pycassa to get a single slice. Or, in other words, it's currently
 reading at
  speeds of less than 500 kB per second. The speed seems to be linear with
 the
  length of a slice (i.e. 6 seconds for 2500 scs for ruby). If I run
 nodetool
  cfstats while my script is running, it tells me that my read latency on
 the
  column family is ~300ms.
 
  I assume that this is not normal and thus was wondering what parameters I
  could tweak to improve the performance.
 

 Is your client mult-threaded?  The single threaded performance of
 Cassandra isn't at all impressive and it really is designed for
 dealing with a lot of simultaneous requests.


 --
 Aaron Turner
 http://synfin.net/ Twitter: @synfinatic
 http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix 
 Windows
 Those who would give up essential Liberty, to purchase a little temporary
 Safety, deserve neither Liberty nor Safety.
 -- Benjamin Franklin
 carpe diem quam minimum credula postero




-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: Lexer error at char '\u201C'

2012-04-18 Thread Trevor Francis
it pukes:

ERROR com.cloudera.flume.conf.SinkFactoryImpl: Could not find class 
org.apache.cassandra.plugins.flume.sink.LogsandraSyslogSink for plugin loading

I followed the README to a T:

<property>
  <name>flume.plugin.classes</name>
  <value>org.apache.cassandra.plugins.flume.sink.SimpleCassandraSink,org.apache.cassandra.plugins.flume.sink.LogsandraSyslogSink</value>
  <description>Comma separated list of plugin classes</description>
</property>



Weird: 

Trevor Francis


On Apr 18, 2012, at 11:39 PM, Nick Bailey wrote:

 https://github.com/thobbs/flume-cassandra-plugin
 
 I think that is fairly up to date, right Tyler?
 
 On Wed, Apr 18, 2012 at 11:18 PM, Trevor Francis 
 trevor.fran...@tgrahamcapital.com wrote:
 …..slaps himself.
 
 Oh you guys at Datastax are great. I have deployed a small Cassandra cluster 
 using your community edition. Actually currently working on making Flume use 
 cassandra as a sink…..unsuccessfully. However, I did just get this Flume 
 error fixed.
 
 Are you aware of any cassandra sinks that actually work for Flume?
 
 
 Trevor Francis
 
 
 On Apr 18, 2012, at 11:12 PM, Tyler Hobbs wrote:
 
 This... looks like Flume.  Are you sure you've got the right mailing list?
 
 On Wed, Apr 18, 2012 at 11:04 PM, Trevor Francis 
 trevor.fran...@tgrahamcapital.com wrote:
 Trying to add an agent config through the master web server to point to a 
 collector node, getting:
 
 FAILEDconfig [10.38.20.197, tailDir(/var/log/acc/, .*, true, 0), 
 agentDFOSink(“10.38.20.203”,35853)]  Attempted to write an invalid 
 sink/source: Lexer error at char '\u201C' at line 1 char 13
 
 
 Any ideas?
 
 Trevor Francis
 
 
 
 
 
 -- 
 Tyler Hobbs
 DataStax