Re: size tiered compaction - improvement
Any compaction pass over A will first convert the TTL data into tombstones. Then, any subsequent pass that includes A *and all other sstables containing rows with the same key* will drop the tombstones. That's why I proposed attaching the TTL to the entire CF: tombstones would not be needed.
blob fields, binary or hex?
We're building a database to store the avatars for our users in three sizes. The thing is, we planned to use the blob field with a BytesType validator, but if we try to inject the binary data as read from the image file, we get a 'cannot parse as hex bytes' error. The same happens if we convert the binary data to its base64 representation. So far the only solutions we've found are to convert the binary data to a string of hex representations of each byte, meaning that each binary byte is actually stored as two ASCII bytes, or, on the other hand, to convert to base64 and store it in an ASCII column. Did we miss something? -- Marcos Dione SysAdmin Astek Sud-Est pour FT/TGPF/OPF/PORTAIL/DOP/HEBEX @ Marco Polo 04 97 12 62 45 - mdione@orange.com
Re: blob fields, binary or hex?
On 04/18/2012 03:02 AM, mdione@orange.com wrote: We're building a database to store the avatars for our users in three sizes. The thing is, we planned to use the blob field with a BytesType validator, but if we try to inject the binary data as read from the image file, we get a 'cannot parse as hex bytes' error. Which client are you using? With Hector or straight Thrift, you should be able to store byte[] directly. - Erik -
RE: size tiered compaction - improvement
Our use case requires Column TTL, not CF TTL, because it is variable, not constant. Best regards / Pagarbiai Viktor Jevdokimov Senior Developer Email: viktor.jevdoki...@adform.com Phone: +370 5 212 3063 Fax: +370 5 261 0453 J. Jasinskio 16C, LT-01112 Vilnius, Lithuania -----Original Message----- From: Radim Kolar [mailto:h...@filez.com] Sent: Wednesday, April 18, 2012 12:57 To: user@cassandra.apache.org Subject: Re: size tiered compaction - improvement Any compaction pass over A will first convert the TTL data into tombstones. Then, any subsequent pass that includes A *and all other sstables containing rows with the same key* will drop the tombstones. That's why I proposed attaching the TTL to the entire CF: tombstones would not be needed.
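For reference, a per-column TTL of the kind Viktor describes is set by the client at write time, not in the schema. A minimal pycassa sketch of the variable-TTL case (the keyspace, column family, and key names here are made up for illustration):

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('Keyspace1', ['localhost:9160'])
events = ColumnFamily(pool, 'Events')

# The TTL is chosen per insert, so different columns in the same row
# can expire at different times; no CF-wide setting is involved.
events.insert('some-row-key', {'short_lived': 'x'}, ttl=3600)       # expires in 1 hour
events.insert('some-row-key', {'long_lived': 'y'}, ttl=30 * 86400)  # expires in 30 days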
RE: blob fields, binary or hex?
From: Erik Forkalsud [mailto:eforkals...@cj.com] Which client are you using? With Hector or straight Thrift, you should be able to store byte[] directly. So far, cassandra-cli only, but we're also testing phpcassa with CQL support [1]. -- [1] https://github.com/thobbs/phpcassa -- Marcos Dione SysAdmin Astek Sud-Est pour FT/TGPF/OPF/PORTAIL/DOP/HEBEX @ Marco Polo 04 97 12 62 45 - mdione@orange.com
Re: swap grows
Thanks for the link. But for me the question about free memory remains. In our cluster we have 200 IOPS in peaks, but still have about 3 GB of free memory on each server (the cluster has 6 nodes, so there are 3*6 = 18 GB of unused memory). I would expect the OS to fill all memory with page cache for the SSTables (we do backups through Direct I/O), but it doesn't do that and I don't understand why. I can't find any sysctl that can tune page cache thresholds or ratios. Any suggestions? 2012/4/18 Jonathan Ellis jbel...@gmail.com: what-is-the-linux-kernel-parameter-vm-swappiness http://www.linuxvox.com/2009/10/what-is-the-linux-kernel-parameter-vm-swappiness
Re: size tiered compaction - improvement
For my use case it would be nice to have a per-CF TTL (to protect myself from application bugs and from storage leaks due to a missed TTL), but it seems you can't avoid tombstones even in this case, for example if you change the CF TTL at runtime. On 04/18/2012 03:06 PM, Viktor Jevdokimov wrote: Our use case requires Column TTL, not CF TTL, because it is variable, not constant. Best regards / Pagarbiai Viktor Jevdokimov Senior Developer Email: viktor.jevdoki...@adform.com Phone: +370 5 212 3063 Fax: +370 5 261 0453 J. Jasinskio 16C, LT-01112 Vilnius, Lithuania -----Original Message----- From: Radim Kolar [mailto:h...@filez.com] Sent: Wednesday, April 18, 2012 12:57 To: user@cassandra.apache.org Subject: Re: size tiered compaction - improvement Any compaction pass over A will first convert the TTL data into tombstones. Then, any subsequent pass that includes A *and all other sstables containing rows with the same key* will drop the tombstones. That's why I proposed attaching the TTL to the entire CF: tombstones would not be needed.
Re: Counter column family
My problem was the result of a Hector bug, see http://groups.google.com/group/hector-users/browse_thread/thread/8359538ed387564e So please ignore the question. Thanks, *Tamar Fraenkel* Senior Software Engineer, TOK Media ta...@tok-media.com Tel: +972 2 6409736 Mob: +972 54 8356490 Fax: +972 2 5612956 On Tue, Apr 17, 2012 at 6:59 PM, Tamar Fraenkel ta...@tok-media.com wrote: Hi! I want to understand how incrementing a counter works. - I have a 3 node ring, - I use FailoverPolicy.FAIL_FAST, - RF is 2. I have the following counter column family: ColumnFamily: tk_counters Key Validation Class: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UUIDType) Default column value validator: org.apache.cassandra.db.marshal.CounterColumnType Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type Row cache size / save period in seconds / keys to save: 0.0/0/all Row Cache Provider: org.apache.cassandra.cache.SerializingCacheProvider Key cache size / save period in seconds: 0.0/14400 GC grace seconds: 864000 Compaction min/max thresholds: 4/32 Read repair chance: 1.0 Replicate on write: true Bloom Filter FP chance: default Built indexes: [] Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy My CL for this column family is Write=2, Read=1. When I increment a counter (using a Hector mutator) and execute returns without errors, what is the status of the nodes at that stage? Can execute return before the nodes are really updated, so that a read done immediately after the increment would still see the previous values? Thanks, *Tamar Fraenkel* Senior Software Engineer, TOK Media ta...@tok-media.com Tel: +972 2 6409736 Mob: +972 54 8356490 Fax: +972 2 5612956
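For what it's worth, the general rule is that a write acknowledged at consistency level W and a read at level R overlap whenever W + R > RF; with RF=2, writes at TWO and reads at ONE, a read issued after a successful increment should see the new count, because the call only returns once both replicas have acknowledged. A rough sketch of the same increment in pycassa terms rather than Hector (the keyspace and column names are illustrative):

import uuid
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
from pycassa import ConsistencyLevel

pool = ConnectionPool('tok', ['localhost:9160'])
counters = ColumnFamily(pool, 'tk_counters',
                        write_consistency_level=ConsistencyLevel.TWO,
                        read_consistency_level=ConsistencyLevel.ONE)

row_key = ('some-type', uuid.uuid4())  # matches the (UTF8, UUID) composite row key
counters.add(row_key, 'views', 1)      # returns only after two replicas have acked
print(counters.get(row_key)['views'])  # a read at ONE then sees the increment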
Re: size tiered compaction - improvement
It's not that simple, unless you have an append-only workload. (See discussion on https://issues.apache.org/jira/browse/CASSANDRA-3974.) On Wed, Apr 18, 2012 at 4:57 AM, Radim Kolar h...@filez.com wrote: Any compaction pass over A will first convert the TTL data into tombstones. Then, any subsequent pass that includes A *and all other sstables containing rows with the same key* will drop the tombstones. That's why I proposed attaching the TTL to the entire CF: tombstones would not be needed. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: blob fields, binary or hex?
How are you passing a blob or binary stream to the CLI? It sounds like you're passing in a representation of a binary stream as ASCII/UTF-8, which will create the problems you describe. Regards, Duc On 4/18/12 6:08 AM, mdione@orange.com mdione@orange.com wrote: From: Erik Forkalsud [mailto:eforkals...@cj.com] Which client are you using? With Hector or straight Thrift, you should be able to store byte[] directly. So far, cassandra-cli only, but we're also testing phpcassa with CQL support [1]. -- [1] https://github.com/thobbs/phpcassa -- Marcos Dione SysAdmin Astek Sud-Est pour FT/TGPF/OPF/PORTAIL/DOP/HEBEX @ Marco Polo 04 97 12 62 45 - mdione@orange.com
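For what it's worth, a minimal pycassa sketch of the byte[] path Erik describes (the keyspace, column family, and file names are placeholders): the Thrift-based clients take raw byte strings as values, so no hex or base64 round trip is needed; the 'cannot parse as hex bytes' error appears when a string literal has to be parsed as a BytesType value, as in the CLI.

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('Avatars', ['localhost:9160'])
avatars = ColumnFamily(pool, 'UserAvatars')

with open('avatar_small.png', 'rb') as f:
    raw = f.read()

# With a BytesType validator the value is written as-is: one stored byte
# per byte of the image, no hex or base64 encoding.
avatars.insert('mdione', {'small': raw})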
Single Vs. Multiple Keyspaces
We are launching a data-intensive application that will store upwards of 50 million 150-byte records per day per user. We have identified Cassandra as our database technology and Flume as what we will use to seed the data from log files into the database. Each user is given their own server instance, but the schema of the data for each user will be the same. We will be performing realtime analysis on this information as part of our application, and we were considering the advantages/disadvantages of all users sharing the same keyspace. All data will be treated the same as far as replication factor, and the only difference is we won't be displaying one user's info to another user. They will be compartmentalized, and one user's data will not affect or ever be compared against another user's. Conceptualize this as each user having their own Apache server: that server spits out 50 million records per day, and each user will only be analyzing the data for their particular server, not anyone else's. The log formats are exactly the same. My experience lies in relational databases, not key-value stores like Cassandra. So, in the MySQL world we would put each user in their own database to avoid locking contention and to make queries faster. If we don't put info into different keyspaces, I assume we will have to add an additional field to our records to identify the user that owns that particular record. How does a single large keyspace affect query speed, etc.? Trevor Francis
Re: Resident size growth
On Tue, Apr 10, 2012 at 8:40 AM, ruslan usifov ruslan.usi...@gmail.com wrote: mmap doesn't depend on jna FWIW, this confusion is a result of the use of *mlockall*, which is used to prevent mmapped files from being swapped, and which does depend on JNA. =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
RE: size tiered compaction - improvement
A per-CF or per-row TTL would be very useful for me as well, with our time series data. -----Original Message----- From: Igor [mailto:i...@4friends.od.ua] Sent: Wednesday, April 18, 2012 6:06 AM To: user@cassandra.apache.org Subject: Re: size tiered compaction - improvement For my use case it would be nice to have a per-CF TTL (to protect myself from application bugs and from storage leaks due to a missed TTL), but it seems you can't avoid tombstones even in this case, for example if you change the CF TTL at runtime. On 04/18/2012 03:06 PM, Viktor Jevdokimov wrote: Our use case requires Column TTL, not CF TTL, because it is variable, not constant. Best regards / Pagarbiai Viktor Jevdokimov Senior Developer Email: viktor.jevdoki...@adform.com Phone: +370 5 212 3063 Fax: +370 5 261 0453 J. Jasinskio 16C, LT-01112 Vilnius, Lithuania -----Original Message----- From: Radim Kolar [mailto:h...@filez.com] Sent: Wednesday, April 18, 2012 12:57 To: user@cassandra.apache.org Subject: Re: size tiered compaction - improvement Any compaction pass over A will first convert the TTL data into tombstones. Then, any subsequent pass that includes A *and all other sstables containing rows with the same key* will drop the tombstones. That's why I proposed attaching the TTL to the entire CF: tombstones would not be needed.
Re: Resident size growth
On Wed, Apr 18, 2012 at 12:44 PM, Rob Coli rc...@palominodb.com wrote: On Tue, Apr 10, 2012 at 8:40 AM, ruslan usifov ruslan.usi...@gmail.com wrote: mmap doesn't depend on jna FWIW, this confusion is as a result of the use of *mlockall*, which is used to prevent mmapped files from being swapped, which does depend on JNA. mlockall does depend on JNA, but we only lock the JVM itself in memory. The OS is free to page data files in and out as needed. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Column Family per User
Our application has users that can write upwards of 50 million records per day. However, they all write the same format of records (20 fields…columns). Should I put each user in their own column family, even though the column family schema will be the same per user? Would this help with dimensioning, if each user is querying their keyspace and only their keyspace? Trevor Francis
Re: Column Family per User
Each CF takes a fair chunk of memory regardless of how much data it has, so this is probably not a good idea if you have lots of users. Also, using a single CF means that compression is likely to work better (more redundant data). However, Cassandra distributes the load across different nodes based on the row key, and the writes scale roughly linearly with the number of nodes. So make sure that no single row gets overly burdened by writes (50 million writes/day to a single row would always go to the same nodes; that is on the order of 600 writes/second/node, which shouldn't really pose a problem, IMHO). The main problem is that if a single row gets lots of columns it'll start to slow down at some point, and your row caches become less useful, as they cache the entire row. Keep your rows suitably sized and you should be fine. To partition the data, you can either distribute it to a few CFs based on use, or use some other distribution method (like user:1234:00, where the 00 is the hour of the day). (There's a great article by Aaron Morton on how wide rows impact performance at http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/, but as always, running your own tests to determine the optimal setup is recommended.) /Janne On Apr 18, 2012, at 21:20 , Trevor Francis wrote: Our application has users that can write upwards of 50 million records per day. However, they all write the same format of records (20 fields…columns). Should I put each user in their own column family, even though the column family schema will be the same per user? Would this help with dimensioning, if each user is querying their keyspace and only their keyspace? Trevor Francis
Re: Column Family per User
Janne, Of course, I am new to the Cassandra world, so it is taking some getting used to, understanding how everything translates into my MySQL head. We are building an enterprise application that will ingest log information and provide metrics and trending based upon the data contained in the logs. The application is transactional in nature, such that a record will be written to a log and our system will need to query that record and assign two values to it, in addition to using the information to develop trending metrics. The logs are being fed into Cassandra by Flume. Each of our users will be assigned their own piece of hardware that generates these log events, some of which can peak at up to 2500 transactions per second for a couple of hours. The log entries are around 150 bytes each and contain around 20 different pieces of information. Neither we nor our users are interested in generating any queries across the entire database. Users are only concerned with the data that their particular piece of hardware generates. Should I just set up a single column family with 20 columns, the first of which being the row key, and make the row key the username of that user? We would also need probably 2 more columns to store Value A and Value B assigned to that particular record. Our metrics will be something like this: for this particular user, during this particular timeframe, what is the average of field X? And then store that value, from which we can generate historical trending over the course of a week. We will do this every 15 minutes. Any suggestions on where I should head to start my journey into Cassandra for my particular application? Trevor Francis On Apr 18, 2012, at 2:14 PM, Janne Jalkanen wrote: Each CF takes a fair chunk of memory regardless of how much data it has, so this is probably not a good idea, if you have lots of users. Also using a single CF means that compression is likely to work better (more redundant data). However, Cassandra distributes the load across different nodes based on the row key, and the writes scale roughly linearly according to the number of nodes. So if you can make sure that no single row gets overly burdened by writes (50 million writes/day to a single row would always go to the same nodes - this is in the order of 600 writes/second/node, which shouldn't really pose a problem, IMHO). The main problem is that if a single row gets lots of columns it'll start to slow down at some point, and your row caches become less useful, as they cache the entire row. Keep your rows suitably sized and you should be fine. To partition the data, you can either distribute it to a few CFs based on use or use some other distribution method (like user:1234:00 where the 00 is the hour-of-the-day. (There's a great article by Aaron Morton on how wide rows impact performance at http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/, but as always, running your own tests to determine the optimal setup is recommended.) /Janne On Apr 18, 2012, at 21:20 , Trevor Francis wrote: Our application has users that can write in upwards of 50 million records per day. However, they all write the same format of records (20 fields…columns). Should I put each user in their own column family, even though the column family schema will be the same per user? Would this help with dimensioning, if each user is querying their keyspace and only their keyspace? Trevor Francis
Re: Column Family per User
Your design should be around how you want to query. If you are only querying by user, then having the user as part of the row key makes sense. To manage row size, you should think of a row as being a bucket of time. Cassandra supports a large (but not unbounded) row size. To manage row size you might say that this row is for user fred for the month of April, or if that's too much, perhaps the row is for user fred for the day 4/18/12. To do this you can use composite keys to hold both pieces of information in the key: (user, bucketpos). The nice thing is that once the time period has come and gone, that row is complete, and you can perform background jobs against that row and store summary information for that time period. ----- Original Message ----- From: Trevor Francis <trevor.fran...@tgrahamcapital.com>
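To make the (user, bucketpos) idea concrete, here is a rough Python sketch of building such a bucketed row key; the 15-minute bucket size and the key format are only illustrative, nothing here is prescribed by Cassandra itself:

from datetime import datetime

def bucket_key(user, ts, bucket_minutes=15):
    # Round the timestamp down to the start of its bucket; the
    # (user, bucket start) pair becomes the composite row key.
    minute = (ts.minute // bucket_minutes) * bucket_minutes
    start = ts.replace(minute=minute, second=0, microsecond=0)
    return (user, start.strftime('%Y-%m-%dT%H:%M'))

print(bucket_key('fred', datetime(2012, 4, 18, 12, 22, 23)))
# -> ('fred', '2012-04-18T12:15')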
Re: Column Family per User
I am trying to grasp this concept…so let me try a scenario. Let's say I have 5 data points being captured in the log file. Here would be a typical table schema in MySQL: Id, Username, Time, Wind, Rain, Sunshine. Select * from table; would reveal: 1, george, 2012-04-12T12:22:23.293, 55, 45, 10 2, george, 2012-04-12T12:22:24.293, 45, 25, 25 3, george, 2012-04-12T12:22:25.293, 35, 15, 11 4, george, 2012-04-12T12:22:26.293, 55, 65, 16 5, george, 2012-04-12T12:22:27.293, 12, 5, 22 And it would just continue from there, adding rows as log files are imported. A select * from table where sunshine=16 would yield: 4, george, 2012-04-12T12:22:26.293, 55, 65, 16 Now, you are saying that in Cassandra, instead of having a bunch of rows containing ordered information (which is what I would have), I would have a single row with multiple columns: George | 2012-04-12T12:22:23.293, wind=55 | 2012-04-12T12:22:23.293, Rain=45 | 2012-04-12T12:22:23.293, Sunshine=10 | …continued… So George would be the row and the columns would be the actual data. The data would be oriented horizontally, vs. vertically (MySQL). So for instance, log generation on our application isn't linear, as it peaks at certain times of the day. A user generating at peak 2500 would typically generate 60M log entries per day. Multiply that times 20 data pieces and you are looking at 1.2B columns in a given day for that user. Assuming we batch the writes every minute, can a node handle this sort of load? Also, can we rotate the row every day? Would it make more sense to rotate hourly? At peak, hourly rotation would decrease the row size to 180M data points vs. 1.2B. At max, we may only have 500 users on our platform. That means that if we did hourly row rotation, that would be 12,000 rows per day…with the maximum column size of 180M columns. Am I grasping this concept properly? Trevor Francis On Apr 18, 2012, at 3:06 PM, Dave Brosius wrote: Your design should be around how you want to query. If you are only querying by user, then having a user as part of the row key makes sense. To manage row size, you should think of a row as being a bucket of time. Cassandra supports a large (but not without bounds) row size. To manage row size you might say that this row is for user fred for the month of april, or if that's too much perhaps the row is for user fred for the day 4/18/12. To do this you can use composite keys to hold both pieces of information in the key. (user, bucketpos) The nice thing is that once the time period has come and gone, that row is complete, and you can perform background jobs against that row and store summary information for that time period. ----- Original Message ----- From: Trevor Francis trevor.fran...@tgrahamcapital.com Sent: Wed, April 18, 2012 15:48 Subject: Re: Column Family per User Janne, Of course, I am new to the Cassandra world, so it is taking some getting used to understand how everything translates into my MYSQL head. We are building an enterprise application that will ingest log information and provide metrics and trending based upon the data contained in the logs. The application is transactional in nature such that a record will be written to a log and our system will need to query that record and assign two values to it in addition to using the information to develop trending metrics. The logs are being fed into cassandra by Flume. Each of our users will be assigned their own piece of hardware that generates these log events, some of which can peak at up to 2500 transactions per second for a couple of hours. The log entries are around 150-bytes each and contain around 20 different pieces of information. Neither us, nor our users are interested in generating any queries across the entire database. Users are only concerned with the data that their particular piece of hardware generates. Should I just setup a single column family with 20 columns, the first of which being the row key and make the row key the username of that user? We would also need probably 2 more columns to store Value A and Value B assigned to that particular record. Our metrics will be something like this: For this particular user, during this particular timeframe, what is the average of field X? And then store that value, which we can generate historical trending over the course of a week. We will do this every 15 minutes. Any suggestions on where I should head to start my journey into Cassandra for my particular application? Trevor Francis On Apr 18, 2012, at 2:14 PM, Janne Jalkanen wrote: Each CF takes a fair chunk of memory regardless of how much data it has, so this is probably not a good idea, if you have lots of users. Also using a single CF means that compression is likely to work better (more redundant data). However, Cassandra distributes the load across different nodes
Re: Column Family per User
Yes, in this Cassandra model, time wouldn't be a column value, it would be part of the column name. Depending on how you want to access your data (give me all data points for time X) and how many separate datapoints you have for time X, you might consider packing all the data for a time into one column through composite columns: column name: 2012-04-12T12:22:23.293/55/45/10 (where / is a human-readable representation of the composite separator). In this case there wouldn't actually be a value; the data is just encoded in the column name. Obviously if you are storing dozens of separate datapoints for a timestamp then this gets out of hand quickly, and perhaps you need to go back to column names with time/fieldname format with a real value. The advantage, though, of the composite key is that you eliminate all that constant blather about 'Wind', 'Rain', 'Sunshine' in your data and only hold real data (granted, compression will probably help here, but not having it at all is even better). As for row size, obviously that takes some experimentation on your part. You can bucket a row to be any time frame you want. If you feel that 15 minutes is the correct length of time given the amount of data you will write, then use 15 minutes. If it's 1 hour, use 1 hour. The only thing you have to figure out is a 'bucket time' definition that you understand; likely it's the timestamp of when that time period starts. As for 'rotating the row', perhaps it's just semantics, but there really is no such concept. You are at some point in time, and you want to write some data to the database. The steps are: 1) get the user, 2) get the timestamp of the current bucket based on 'now', 3) build a composite key, 4) insert the data with that key. Whether that row existed before or is a new row has no bearing on your client code. ----- Original Message ----- From: Trevor Francis <trevor.fran...@tgrahamcapital.com>
Re: Column Family per User
Regarding rotating, I was thinking about the concept of logrotate, where you write to a file for a specific period of time, then create a new file and write to it after a specific period of time. So yes, it closes a row and opens another row. Since I will be generating analytics every 15 minutes, it would make sense to me to bucket a row every 15 minutes. Since I would only have at most 500 users, this doesn't strike me as too many rows in a given day (48,000). Potential downsides to doing this? Since I am analyzing 20 separate data points for a given log entry, it would make sense that querying based upon a specific metric (wind, rain, sunshine) would be easier if the data was separated. However, couldn't we build composite columns for time and value, where all that would be left is data? So the composite row key would be: george 2012-04-12T12:20 And the columns would be: 12:22:23.293/Wind 12:22:23.293/Rain 12:22:23.293/Sunshine Data would be: 55 45 10 Or the columns could be 12:22:23.293 Data: Wind/55/45/35 Or something like that…Am I headed in the right direction? Trevor Francis On Apr 18, 2012, at 3:10 PM, Janne Jalkanen wrote: Hi! A simple model to do this would be * ColumnFamily Data * key: userid * column: Composite( timestamp, entrytype ) = value For example, userid janne would have columns (2012-04-12T12:22:23.293,speed) = 24; (2012-04-12T12:22:23.293,temperature) = 12.4 (2012-04-12T12:22:23.293,direction) = 356; (2012-04-12T12:22:23.295,speed) = 24.1; (2012-04-12T12:22:23.295,temperature) = 12.3 (2012-04-12T12:22:23.295,direction) = 352; Note that Cassandra does not require you to know which columns you're going to put in it (unlike MySQL). You can declare types ahead if you know what they are, but if you need to start adding a new column, just start writing it and Cassandra should do the right thing. However, there are a few points which you might want to consider: * Using ISO dates for timestamps has a minor problem: if two events occur during the same millisecond, they'll overwrite each other. This is why most time series in C* use TimeUUIDs, which contain a millisecond timestamp + a random component. (http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/) * This will generate timestamp*entrytype columns. So for 2500 entries/second and 20 columns this means about 2500*20 = 50,000 writes per second (granted, you will most probably batch the writes). You will need to performance test your cluster to see if this schema is right for you. If not, you might want to try and see how you can distribute the keys differently, e.g. by bucketing the data somehow. However, I recommend that you build a first shot of your app structure, then load test it until it breaks, and that should give you a pretty good understanding of what exactly Cassandra is doing. To do the analytics, multiple options are possible; a popular one is to run MapReduce queries using a tool like Apache Pig at regular intervals. DataStax has good documentation and you probably want to take a look at their offering as well, since they have pretty good Hadoop/MapReduce support for Cassandra. CLI syntax to try with: create keyspace DataTest with placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1}; use DataTest; create column family Data with key_validation_class=UTF8Type and comparator='CompositeType(UUIDType,UTF8Type)'; Then start writing using your favorite client.
/Janne On Apr 18, 2012, at 22:36 , Trevor Francis wrote: Janne, Of course, I am new to the Cassandra world, so it is taking some getting used to understand how everything translates into my MYSQL head. We are building an enterprise application that will ingest log information and provide metrics and trending based upon the data contained in the logs. The application is transactional in nature such that a record will be written to a log and our system will need to query that record and assign two values to it in addition to using the information to develop trending metrics. The logs are being fed into cassandra by Flume. Each of our users will be assigned their own piece of hardware that generates these log events, some of which can peak at up to 2500 transactions per second for a couple of hours. The log entries are around 150-bytes each and contain around 20 different pieces of information. Neither us, nor our users are interested in generating any queries across the entire database. Users are only concerned with the data that their particular piece of hardware generates. Should I just setup a single column family with 20 columns, the first of which being the row key and make the row key the username of that user? We would also need probably 2 more columns to store Value A and Value B assigned to that particular record.
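Following the CLI schema Janne gives above, a rough pycassa sketch of what the writes could look like (the host, values, and row key are placeholders, and a standard-library version-1 UUID stands in for the TimeUUID):

import uuid
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('DataTest', ['localhost:9160'])
data = ColumnFamily(pool, 'Data')

# One event: the composite column name is (TimeUUID, entry type), and
# every reading taken at the same instant shares the same TimeUUID.
ts = uuid.uuid1()
data.insert('janne', {
    (ts, 'speed'): '24',
    (ts, 'temperature'): '12.4',
    (ts, 'direction'): '356',
})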
Re: Column Family per User
It seems to me you are on the right track. Finding the right balance of # rows vs row width is the part that will take the most experimentation. ----- Original Message ----- From: Trevor Francis <trevor.fran...@tgrahamcapital.com>
DataStax Opscenter 2.0 question
I am having trouble in running the OpsCenter. It starts without any error but the GUI stays in the index page and just shows Loading OpsCenter.. Firebug shows an error this._onClusterSave.bind is not a function. I have the log turned on DEBUG it shows no error (pasted below). This is the only change I have made to the opscenterd.conf file. Cassandra 1.0.8 included in DSE 2.0 is running fine. Would appreciate any help. Previously, I had no problems configuring/running the OpsCenter 1.4.1. 2012-04-18 23:40:32+0200 [] INFO: twistd 10.2.0 (/usr/bin/python2.6 2.6.6) starting up. 2012-04-18 23:40:32+0200 [] INFO: reactor class: twisted.internet.selectreactor.SelectReactor. 2012-04-18 23:40:32+0200 [] INFO: Logging level set to 'debug' 2012-04-18 23:40:32+0200 [] INFO: OpsCenterdService startService 2012-04-18 23:40:32+0200 [] INFO: OpsCenter version: 2.0 2012-04-18 23:40:32+0200 [] INFO: Compatible agent version: 2.7 2012-04-18 23:40:32+0200 [] DEBUG: Main config options: 2012-04-18 23:40:32+0200 [] DEBUG: agents : [('ssl_certfile', './ssl/opscenter.pem'), ('tmp_dir', './tmp'), ('path_to_installscript', './bin/install_agent.sh'), ('path_to_sudowrap', './bin/sudo_with_pass.py'), ('agent_keyfile', './ssl/agentKeyStore'), ('path_to_deb', 'NONE'), ('ssl_keyfile', './ssl/opscenter.key'), ('agent_certfile', './ssl/agentKeyStore.pem'), ('path_to_rpm', 'NONE')], webserver : [('interface', '127.0.0.1'), ('staticdir', './content'), ('port', ''), ('log_path', './log/http.log')], stat_reporter : [('ssl_key', './ssl/stats.pem')], logging : [('log_path', './log/opscenterd.log'), ('level', 'DEBUG')], authentication : [('passwd_file', './passwds')], 2012-04-18 23:40:32+0200 [] DEBUG: Loading all per-cluster config files 2012-04-18 23:40:32+0200 [] DEBUG: Done loading all per-cluster config files 2012-04-18 23:40:32+0200 [] INFO: No clusters are configured yet, checking to see if a config migration is needed 2012-04-18 23:40:32+0200 [] INFO: Main config does not appear to include a cluster configuration, skipping migration 2012-04-18 23:40:32+0200 [] DEBUG: Loading all per-cluster config files 2012-04-18 23:40:32+0200 [] DEBUG: Done loading all per-cluster config files 2012-04-18 23:40:32+0200 [] INFO: No clusters are configured 2012-04-18 23:40:32+0200 [] INFO: HTTP BASIC authentication disabled 2012-04-18 23:40:32+0200 [] INFO: SSL enabled 2012-04-18 23:40:32+0200 [] INFO: opscenterd.WebServer.OpsCenterdWebServer starting on 2012-04-18 23:40:32+0200 [] INFO: Starting factory opscenterd.WebServer.OpsCenterdWebServer instance at 0x9e35c2c 2012-04-18 23:40:32+0200 [] INFO: morbid.morbid.StompFactory starting on 61619 2012-04-18 23:40:32+0200 [] INFO: Starting factory morbid.morbid.StompFactory instance at 0x9e96a4c 2012-04-18 23:40:32+0200 [] INFO: Configuring agent communication with ssl support enabled. 2012-04-18 23:40:32+0200 [] INFO: morbid.morbid.StompFactory starting on 61620 2012-04-18 23:40:32+0200 [] INFO: OS Version: Linux version 2.6.35-22-generic (buildd@rothera) (gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu4) ) #33-Ubuntu SMP Sun Sep 19 20:34:50 UTC 2010 2012-04-18 23:40:32+0200 [] INFO: CPU Info: ['2197.558', '2197.558'] 2012-04-18 23:40:33+0200 [] INFO: Mem Info: 2966MB 2012-04-18 23:40:33+0200 [] INFO: Package Manager: apt 2012-04-18 23:41:32+0200 [] DEBUG: Average opscenterd CPU usage: 0.23%, memory usage: 16 MB 2012-04-18 23:42:32+0200 [] DEBUG: Average opscenterd CPU usage: 0.07%, memory usage: 16 MB Thanks Jay
Re: DataStax Opscenter 2.0 question
What version of firefox? Someone has reported a similar issue with firefox 3.6? Can you try with chrome or perhaps a more recent version of firefox (assuming you are also on an older version)? On Wed, Apr 18, 2012 at 4:51 PM, Jay Parashar jparas...@itscape.com wrote: I am having trouble in running the OpsCenter. It starts without any error but the GUI stays in the index page and just shows “Loading OpsCenter…”. Firebug shows an error “this._onClusterSave.bind is not a function”. I have the log turned on DEBUG it shows no error (pasted below). This is the only change I have made to the “opscenterd.conf” file. Cassandra 1.0.8 included in DSE 2.0 is running fine. Would appreciate any help. Previously, I had no problems configuring/running the OpsCenter 1.4.1. 2012-04-18 23:40:32+0200 [] INFO: twistd 10.2.0 (/usr/bin/python2.6 2.6.6) starting up. 2012-04-18 23:40:32+0200 [] INFO: reactor class: twisted.internet.selectreactor.SelectReactor. 2012-04-18 23:40:32+0200 [] INFO: Logging level set to 'debug' 2012-04-18 23:40:32+0200 [] INFO: OpsCenterdService startService 2012-04-18 23:40:32+0200 [] INFO: OpsCenter version: 2.0 2012-04-18 23:40:32+0200 [] INFO: Compatible agent version: 2.7 2012-04-18 23:40:32+0200 [] DEBUG: Main config options: 2012-04-18 23:40:32+0200 [] DEBUG: agents : [('ssl_certfile', './ssl/opscenter.pem'), ('tmp_dir', './tmp'), ('path_to_installscript', './bin/install_agent.sh'), ('path_to_sudowrap', './bin/sudo_with_pass.py'), ('agent_keyfile', './ssl/agentKeyStore'), ('path_to_deb', 'NONE'), ('ssl_keyfile', './ssl/opscenter.key'), ('agent_certfile', './ssl/agentKeyStore.pem'), ('path_to_rpm', 'NONE')], webserver : [('interface', '127.0.0.1'), ('staticdir', './content'), ('port', ''), ('log_path', './log/http.log')], stat_reporter : [('ssl_key', './ssl/stats.pem')], logging : [('log_path', './log/opscenterd.log'), ('level', 'DEBUG')], authentication : [('passwd_file', './passwds')], 2012-04-18 23:40:32+0200 [] DEBUG: Loading all per-cluster config files 2012-04-18 23:40:32+0200 [] DEBUG: Done loading all per-cluster config files 2012-04-18 23:40:32+0200 [] INFO: No clusters are configured yet, checking to see if a config migration is needed 2012-04-18 23:40:32+0200 [] INFO: Main config does not appear to include a cluster configuration, skipping migration 2012-04-18 23:40:32+0200 [] DEBUG: Loading all per-cluster config files 2012-04-18 23:40:32+0200 [] DEBUG: Done loading all per-cluster config files 2012-04-18 23:40:32+0200 [] INFO: No clusters are configured 2012-04-18 23:40:32+0200 [] INFO: HTTP BASIC authentication disabled 2012-04-18 23:40:32+0200 [] INFO: SSL enabled 2012-04-18 23:40:32+0200 [] INFO: opscenterd.WebServer.OpsCenterdWebServer starting on 2012-04-18 23:40:32+0200 [] INFO: Starting factory opscenterd.WebServer.OpsCenterdWebServer instance at 0x9e35c2c 2012-04-18 23:40:32+0200 [] INFO: morbid.morbid.StompFactory starting on 61619 2012-04-18 23:40:32+0200 [] INFO: Starting factory morbid.morbid.StompFactory instance at 0x9e96a4c 2012-04-18 23:40:32+0200 [] INFO: Configuring agent communication with ssl support enabled. 
2012-04-18 23:40:32+0200 [] INFO: morbid.morbid.StompFactory starting on 61620 2012-04-18 23:40:32+0200 [] INFO: OS Version: Linux version 2.6.35-22-generic (buildd@rothera) (gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu4) ) #33-Ubuntu SMP Sun Sep 19 20:34:50 UTC 2010 2012-04-18 23:40:32+0200 [] INFO: CPU Info: ['2197.558', '2197.558'] 2012-04-18 23:40:33+0200 [] INFO: Mem Info: 2966MB 2012-04-18 23:40:33+0200 [] INFO: Package Manager: apt 2012-04-18 23:41:32+0200 [] DEBUG: Average opscenterd CPU usage: 0.23%, memory usage: 16 MB 2012-04-18 23:42:32+0200 [] DEBUG: Average opscenterd CPU usage: 0.07%, memory usage: 16 MB Thanks Jay
Cassandra read optimization
Hi all, I'm trying to optimize moving data from Cassandra to HDFS using either the Ruby or the Python client. Right now, I'm playing around on my staging server, an 8 GB single-node machine. My data in Cassandra (1.0.8) consists of 2 rows (for now) with ~150k super columns each (I know, I know - super columns are bad). Every super column has ~25 columns totaling ~800 bytes per super column. I should also mention that currently the database is static - there are no writes/updates, only reads. Anyway, in my Python/Ruby scripts, I'm taking slices 5000 super columns long from a single row. It takes 13 seconds with Ruby and 8 seconds with pycassa to get a single slice. Or, in other words, it's currently reading at speeds of less than 500 kB per second. The speed seems to be linear with the length of a slice (i.e. 6 seconds for 2500 super columns with Ruby). If I run nodetool cfstats while my script is running, it tells me that my read latency on the column family is ~300 ms. I assume that this is not normal and thus was wondering what parameters I could tweak to improve the performance. Thanks, Dan F.
Re: Cassandra read optimization
On Wed, Apr 18, 2012 at 5:00 PM, Dan Feldman hriunde...@gmail.com wrote: Hi all, I'm trying to optimize moving data from Cassandra to HDFS using either Ruby or Python client. Right now, I'm playing around on my staging server, an 8 GB single node machine. My data in Cassandra (1.0.8) consist of 2 rows (for now) with ~150k super columns each (I know, I know - super columns are bad). Every super column has ~25 columns totaling ~800 bytes per super column. I should also mention that currently the database is static - there are no writes/updates, only reads. Anyways, in my python/ruby scripts, I'm taking slices of 5000 supercolumns long from a single row. It takes 13 seconds with ruby and 8 seconds with pycassa to get a single slice. Or, in other words, it's currently reading at speeds of less than 500 kB per second. The speed seems to be linear with the length of a slice (i.e. 6 seconds for 2500 scs for ruby). If I run nodetool cfstats while my script is running, it tells me that my read latency on the column family is ~300ms. I assume that this is not normal and thus was wondering what parameters I could tweak to improve the performance. Is your client multi-threaded? The single-threaded performance of Cassandra isn't at all impressive, and it really is designed for dealing with a lot of simultaneous requests. -- Aaron Turner http://synfin.net/ Twitter: @synfinatic http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. -- Benjamin Franklin carpe diem quam minimum credula postero
Re: RMI/JMX errors, weird
Server log below. Mind you that all the nodes are still up -- even though reported as dead in this log. What's going on here? Thanks! INFO [GossipTasks:1] 2012-04-18 22:18:26,487 Gossiper.java (line 719) InetAddress /130.199.185.193 is now dead. INFO [ScheduledTasks:1] 2012-04-18 22:18:26,487 StatusLogger.java (line 50) Pool NameActive Pending Blocked ERROR [GossipTasks:1] 2012-04-18 22:18:26,488 AntiEntropyService.java (line 722) Problem during repair session manual-repair-1b3453b 6-28b5-4abd-84ce-0326b5468064, endpoint /130.199.185.193 died ERROR [RMI TCP Connection(22)-130.199.185.194] 2012-04-18 22:18:26,488 StorageService.java (line 1607) Repair session org.apache.cas sandra.service.AntiEntropyService$RepairSession@4cc9e2bc failed. java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.IOException: Problem during repair session manual-repai r-43545b22-ffe8-4243-8a98-509bbfec9872, endpoint /130.199.185.195 died at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at org.apache.cassandra.service.StorageService.forceTableRepair(StorageService.java:1603) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27) at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208) at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120) at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836) at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761) at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427) at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72) at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265) at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360) at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788) at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303) at sun.rmi.transport.Transport$1.run(Transport.java:159) at java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.Transport.serviceCall(Transport.java:155) at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.lang.RuntimeException: java.io.IOException: Problem during repair session 
manual-repair-43545b22-ffe8-4243-8a98-509b bfec9872, endpoint /130.199.185.195 died at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) ... 3 more Caused by: java.io.IOException: Problem during repair session manual-repair-43545b22-ffe8-4243-8a98-509bbfec9872, endpoint /130.199. 185.195 died at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:723) at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:760) at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:165) at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:538) at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57) at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157) at
Lexer error at char '\u201C'
Trying to add an agent config through the master web server to point to a collector node, getting: FAILED config [10.38.20.197, tailDir(/var/log/acc/, .*, true, 0), agentDFOSink(“10.38.20.203”,35853)] Attempted to write an invalid sink/source: Lexer error at char '\u201C' at line 1 char 13 Any ideas? Trevor Francis
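U+201C is the left curly double quotation mark, so the config has almost certainly been smart-quoted by whatever editor or document it was pasted from. A guess from the error alone, not something confirmed in the thread: replacing the curly quotes with plain ASCII quotes should satisfy the lexer, e.g.

agentDFOSink("10.38.20.203", 35853)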
Re: Lexer error at char '\u201C'
This... looks like Flume. Are you sure you've got the right mailing list? On Wed, Apr 18, 2012 at 11:04 PM, Trevor Francis trevor.fran...@tgrahamcapital.com wrote: Trying to add an agent config through the master web server to point to a collector node, getting: FAILEDconfig [10.38.20.197, tailDir(/var/log/acc/, .*, true, 0), agentDFOSink(“10.38.20.203”,35853)]Attempted to write an invalid sink/source: Lexer error at char '\u201C' at line 1 char 13 Any ideas? Trevor Francis -- Tyler Hobbs DataStax http://datastax.com/
Re: Lexer error at char '\u201C'
…..slaps himself. Oh you guys at Datastax are great. I have deployed a small Cassandra cluster using your community edition. Actually currently working on making Flume use cassandra as a sink…..unsuccessfully. However, I did just get this Flume error fixed. Are you aware of any cassandra sinks that actually work for Flume? Trevor Francis On Apr 18, 2012, at 11:12 PM, Tyler Hobbs wrote: This... looks like Flume. Are you sure you've got the right mailing list? On Wed, Apr 18, 2012 at 11:04 PM, Trevor Francis trevor.fran...@tgrahamcapital.com wrote: Trying to add an agent config through the master web server to point to a collector node, getting: FAILED config [10.38.20.197, tailDir(/var/log/acc/, .*, true, 0), agentDFOSink(“10.38.20.203”,35853)] Attempted to write an invalid sink/source: Lexer error at char '\u201C' at line 1 char 13 Any ideas? Trevor Francis -- Tyler Hobbs DataStax
Re: Lexer error at char '\u201C'
https://github.com/thobbs/flume-cassandra-plugin I think that is fairly up to date, right Tyler? On Wed, Apr 18, 2012 at 11:18 PM, Trevor Francis trevor.fran...@tgrahamcapital.com wrote: …..slaps himself. Oh you guys at Datastax are great. I have deployed a small Cassandra cluster using your community edition. Actually currently working on making Flume use cassandra as a sink…..unsuccessfully. However, I did just get this Flume error fixed. Are you aware of any cassandra sinks that actually work for Flume? Trevor Francis On Apr 18, 2012, at 11:12 PM, Tyler Hobbs wrote: This... looks like Flume. Are you sure you've got the right mailing list? On Wed, Apr 18, 2012 at 11:04 PM, Trevor Francis trevor.fran...@tgrahamcapital.com wrote: Trying to add an agent config through the master web server to point to a collector node, getting: FAILED config [10.38.20.197, tailDir(/var/log/acc/, .*, true, 0), agentDFOSink(“10.38.20.203”,35853)] Attempted to write an invalid sink/source: Lexer error at char '\u201C' at line 1 char 13 Any ideas? Trevor Francis -- Tyler Hobbs DataStax http://datastax.com/
Re: Lexer error at char '\u201C'
Yup, you beat me to the punch by a minute. On Wed, Apr 18, 2012 at 11:39 PM, Nick Bailey n...@datastax.com wrote: https://github.com/thobbs/flume-cassandra-plugin I think that is fairly up to date, right Tyler? On Wed, Apr 18, 2012 at 11:18 PM, Trevor Francis trevor.fran...@tgrahamcapital.com wrote: …..slaps himself. Oh you guys at Datastax are great. I have deployed a small Cassandra cluster using your community edition. Actually currently working on making Flume use cassandra as a sink…..unsuccessfully. However, I did just get this Flume error fixed. Are you aware of any cassandra sinks that actually work for Flume? Trevor Francis On Apr 18, 2012, at 11:12 PM, Tyler Hobbs wrote: This... looks like Flume. Are you sure you've got the right mailing list? On Wed, Apr 18, 2012 at 11:04 PM, Trevor Francis trevor.fran...@tgrahamcapital.com wrote: Trying to add an agent config through the master web server to point to a collector node, getting: FAILED config [10.38.20.197, tailDir(/var/log/acc/, .*, true, 0), agentDFOSink(“10.38.20.203”,35853)] Attempted to write an invalid sink/source: Lexer error at char '\u201C' at line 1 char 13 Any ideas? Trevor Francis -- Tyler Hobbs DataStax http://datastax.com/ -- Tyler Hobbs DataStax http://datastax.com/
Re: Cassandra read optimization
I tested this out with a small pycassa script: https://gist.github.com/2418598 On my not-very-impressive laptop, I can read 5000 of the super columns in 3 seconds (cold) or 1.5 (warm). Reading in batches of 1000 super columns at a time gives much better performance; I definitely recommend going with a smaller batch size. Make sure that the timeout on your ConnectionPool isn't too low to handle a big request in pycassa. If you turn on logging (as it is in the script I linked), you should be able to see if the request is timing out a couple of times before it succeeds. It might also be good to make sure that you've got JNA in place and your heap size is sufficient. On Wed, Apr 18, 2012 at 8:59 PM, Aaron Turner synfina...@gmail.com wrote: On Wed, Apr 18, 2012 at 5:00 PM, Dan Feldman hriunde...@gmail.com wrote: Hi all, I'm trying to optimize moving data from Cassandra to HDFS using either Ruby or Python client. Right now, I'm playing around on my staging server, an 8 GB single node machine. My data in Cassandra (1.0.8) consist of 2 rows (for now) with ~150k super columns each (I know, I know - super columns are bad). Every super column has ~25 columns totaling ~800 bytes per super column. I should also mention that currently the database is static - there are no writes/updates, only reads. Anyways, in my python/ruby scripts, I'm taking slices of 5000 supercolumns long from a single row. It takes 13 seconds with ruby and 8 seconds with pycassa to get a single slice. Or, in other words, it's currently reading at speeds of less than 500 kB per second. The speed seems to be linear with the length of a slice (i.e. 6 seconds for 2500 scs for ruby). If I run nodetool cfstats while my script is running, it tells me that my read latency on the column family is ~300ms. I assume that this is not normal and thus was wondering what parameters I could tweak to improve the performance. Is your client mult-threaded? The single threaded performance of Cassandra isn't at all impressive and it really is designed for dealing with a lot of simultaneous requests. -- Aaron Turner http://synfin.net/ Twitter: @synfinatic http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix Windows Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. -- Benjamin Franklin carpe diem quam minimum credula postero -- Tyler Hobbs DataStax http://datastax.com/
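In the same spirit as Tyler's gist, a rough sketch of paging through one wide row in smaller slices with pycassa (the keyspace, column family, and 1000-column batch size are placeholders; super column names come back in comparator order, so the last name seen can seed the next slice):

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('MyKeyspace', ['localhost:9160'], timeout=10)
cf = ColumnFamily(pool, 'MySuperCF')

def iter_row(key, batch=1000):
    # Page through the row in small slices rather than asking for
    # thousands of super columns in a single call.
    start = ''
    first_pass = True
    while True:
        chunk = cf.get(key, column_start=start, column_count=batch)
        names = list(chunk.keys())
        for name in (names if first_pass else names[1:]):  # skip the repeated start column
            yield name, chunk[name]
        if len(names) < batch:
            break
        start = names[-1]
        first_pass = False

for sc_name, sc_value in iter_row('row1'):
    pass  # ship each super column off to HDFS here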
Re: Lexer error at char '\u201C'
It pukes: ERROR com.cloudera.flume.conf.SinkFactoryImpl: Could not find class org.apache.cassandra.plugins.flume.sink.LogsandraSyslogSink for plugin loading I followed the README to a T: <property> <name>flume.plugin.classes</name> <value>org.apache.cassandra.plugins.flume.sink.SimpleCassandraSink,org.apache.cassandra.plugins.flume.sink.LogsandraSyslogSink</value> <description>Comma separated list of plugin classes</description> </property> Weird. Trevor Francis On Apr 18, 2012, at 11:39 PM, Nick Bailey wrote: https://github.com/thobbs/flume-cassandra-plugin I think that is fairly up to date, right Tyler? On Wed, Apr 18, 2012 at 11:18 PM, Trevor Francis trevor.fran...@tgrahamcapital.com wrote: …..slaps himself. Oh you guys at Datastax are great. I have deployed a small Cassandra cluster using your community edition. Actually currently working on making Flume use cassandra as a sink…..unsuccessfully. However, I did just get this Flume error fixed. Are you aware of any cassandra sinks that actually work for Flume? Trevor Francis On Apr 18, 2012, at 11:12 PM, Tyler Hobbs wrote: This... looks like Flume. Are you sure you've got the right mailing list? On Wed, Apr 18, 2012 at 11:04 PM, Trevor Francis trevor.fran...@tgrahamcapital.com wrote: Trying to add an agent config through the master web server to point to a collector node, getting: FAILED config [10.38.20.197, tailDir(/var/log/acc/, .*, true, 0), agentDFOSink(“10.38.20.203”,35853)] Attempted to write an invalid sink/source: Lexer error at char '\u201C' at line 1 char 13 Any ideas? Trevor Francis -- Tyler Hobbs DataStax