RE: Seed gossip version error
Hi Amlan,
We have the same problem with Cassandra 2.1.5. I have no lead to follow yet. Did you find the root cause of this problem? Thanks.
Regards, Dominique
From: Amlan Roy [mailto:amlan@cleartrip.com]
Sent: Wednesday, July 1, 2015 12:46
To: user@cassandra.apache.org
Subject: Seed gossip version error
Hi,
I have a cluster running version 2.1.7. Two of the machines went down and they are not rejoining the cluster even after a restart. I see the following WARN message in system.log on all the nodes:
system.log:WARN [MessagingService-Outgoing-cassandra2.cleartrip.com/172.18.3.32] 2015-07-01 13:00:41,878 OutboundTcpConnection.java:414 - Seed gossip version is -2147483648; will not connect with that version
Please let me know if you have faced the same problem.
Regards, Amlan
Re: Schema questions for data structures with recently-modified access patterns
The time series doesn’t provide the access pattern I’m looking for. It gives no way to query recently modified documents.
On Jul 21, 2015, at 9:13 AM, Carlos Alonso i...@mrcalonso.com wrote:
Hi Robert,
What about modelling it as a time series?
CREATE TABLE document ( docId UUID, doc TEXT, last_modified TIMESTAMP, PRIMARY KEY (docId, last_modified) ) WITH CLUSTERING ORDER BY (last_modified DESC);
This way, the latest modification will always be the first record in the row, so accessing it should be as easy as:
SELECT * FROM document WHERE docId = <the docId> LIMIT 1;
And if you run into disk space issues due to very long rows, you can always expire old ones using a TTL or a batch job. Tombstones will never be a problem in this case because, due to the specified clustering order, the latest modification will always be the first record in the row.
Hope it helps.
Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso
On 21 July 2015 at 05:59, Robert Wille rwi...@fold3.com wrote:
Data structures that have a recently-modified access pattern seem to be a poor fit for Cassandra. I’m wondering if any of you smart guys can provide suggestions.
For the sake of discussion, let’s assume I have the following tables:
CREATE TABLE document ( docId UUID, doc TEXT, last_modified TIMEUUID, PRIMARY KEY ((docid)) );
CREATE TABLE doc_by_last_modified ( date TEXT, last_modified TIMEUUID, docId UUID, PRIMARY KEY ((date), last_modified) );
When I update a document, I retrieve its last_modified time, delete the current record from doc_by_last_modified, and add a new one. Unfortunately, if you’d like each document to appear at most once in the doc_by_last_modified table, then this doesn’t work so well. Documents can get into the doc_by_last_modified table multiple times if there is concurrent access, or if there is a consistency issue.
Any thoughts out there on how to efficiently provide recently-modified access to a table? This problem exists for many types of data structures, not just recently-modified ones. Any ordered data structure that can be dynamically reordered suffers from the same problems. As I’ve been doing schema design, this pattern keeps recurring. A nice way to address this problem would have lots of applications.
Thanks in advance for your thoughts.
Robert
Re: Schema questions for data structures with recently-modified access patterns
Hi Robert,
What about modelling it as a time series?
CREATE TABLE document ( docId UUID, doc TEXT, last_modified TIMESTAMP, PRIMARY KEY (docId, last_modified) ) WITH CLUSTERING ORDER BY (last_modified DESC);
This way, the latest modification will always be the first record in the row, so accessing it should be as easy as:
SELECT * FROM document WHERE docId = <the docId> LIMIT 1;
And if you run into disk space issues due to very long rows, you can always expire old ones using a TTL or a batch job. Tombstones will never be a problem in this case because, due to the specified clustering order, the latest modification will always be the first record in the row.
Hope it helps.
Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso
On 21 July 2015 at 05:59, Robert Wille rwi...@fold3.com wrote:
Data structures that have a recently-modified access pattern seem to be a poor fit for Cassandra. I’m wondering if any of you smart guys can provide suggestions.
For the sake of discussion, let’s assume I have the following tables:
CREATE TABLE document ( docId UUID, doc TEXT, last_modified TIMEUUID, PRIMARY KEY ((docid)) );
CREATE TABLE doc_by_last_modified ( date TEXT, last_modified TIMEUUID, docId UUID, PRIMARY KEY ((date), last_modified) );
When I update a document, I retrieve its last_modified time, delete the current record from doc_by_last_modified, and add a new one. Unfortunately, if you’d like each document to appear at most once in the doc_by_last_modified table, then this doesn’t work so well. Documents can get into the doc_by_last_modified table multiple times if there is concurrent access, or if there is a consistency issue.
Any thoughts out there on how to efficiently provide recently-modified access to a table? This problem exists for many types of data structures, not just recently-modified ones. Any ordered data structure that can be dynamically reordered suffers from the same problems. As I’ve been doing schema design, this pattern keeps recurring. A nice way to address this problem would have lots of applications.
Thanks in advance for your thoughts.
Robert
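A minimal sketch of the time-series layout Carlos describes, written out as complete CQL. The UUID, the timestamp, the document text, and the 30-day TTL are illustrative assumptions, not values from the thread.

```cql
-- Versioned documents: one partition per docId, newest version first.
CREATE TABLE document (
    docId         UUID,
    last_modified TIMESTAMP,
    doc           TEXT,
    PRIMARY KEY (docId, last_modified)
) WITH CLUSTERING ORDER BY (last_modified DESC);

-- Write a new version that expires after 30 days
-- (2592000 seconds; the TTL value is an assumption).
INSERT INTO document (docId, last_modified, doc)
VALUES (123e4567-e89b-12d3-a456-426655440000,
        '2015-07-21 09:13:00+0000',
        'new contents')
USING TTL 2592000;

-- Latest version of a single document: the first row of its partition.
SELECT doc, last_modified
FROM document
WHERE docId = 123e4567-e89b-12d3-a456-426655440000
LIMIT 1;
```

With a TTL on every version, a document that is not modified for longer than the TTL would disappear entirely, so the value needs to suit the application. This layout answers "latest version of one document"; it does not by itself answer "most recently modified documents across the table".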
Re: Schema questions for data structures with recently-modified access patterns
Keep the original document base table, but then the query table should have (last_modified, docId) as its PK, with last_modified descending, so that a query can get the n most recently modified documents. Yes, you still need to manually delete the old entry for the document in the query table if duplicates are a problem for you. Yeah, a TTL would be good if you don't care about documents modified a month or a week ago.
-- Jack Krupansky
On Tue, Jul 21, 2015 at 11:13 AM, Carlos Alonso i...@mrcalonso.com wrote:
Hi Robert,
What about modelling it as a time series?
CREATE TABLE document ( docId UUID, doc TEXT, last_modified TIMESTAMP, PRIMARY KEY (docId, last_modified) ) WITH CLUSTERING ORDER BY (last_modified DESC);
This way, the latest modification will always be the first record in the row, so accessing it should be as easy as:
SELECT * FROM document WHERE docId = <the docId> LIMIT 1;
And if you run into disk space issues due to very long rows, you can always expire old ones using a TTL or a batch job. Tombstones will never be a problem in this case because, due to the specified clustering order, the latest modification will always be the first record in the row.
Hope it helps.
Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso
On 21 July 2015 at 05:59, Robert Wille rwi...@fold3.com wrote:
Data structures that have a recently-modified access pattern seem to be a poor fit for Cassandra. I’m wondering if any of you smart guys can provide suggestions.
For the sake of discussion, let’s assume I have the following tables:
CREATE TABLE document ( docId UUID, doc TEXT, last_modified TIMEUUID, PRIMARY KEY ((docid)) );
CREATE TABLE doc_by_last_modified ( date TEXT, last_modified TIMEUUID, docId UUID, PRIMARY KEY ((date), last_modified) );
When I update a document, I retrieve its last_modified time, delete the current record from doc_by_last_modified, and add a new one. Unfortunately, if you’d like each document to appear at most once in the doc_by_last_modified table, then this doesn’t work so well. Documents can get into the doc_by_last_modified table multiple times if there is concurrent access, or if there is a consistency issue.
Any thoughts out there on how to efficiently provide recently-modified access to a table? This problem exists for many types of data structures, not just recently-modified ones. Any ordered data structure that can be dynamically reordered suffers from the same problems. As I’ve been doing schema design, this pattern keeps recurring. A nice way to address this problem would have lots of applications.
Thanks in advance for your thoughts.
Robert
RE: Seed gossip version error
Thanks for your reply. Yes, I am sure all nodes are running the same version.
On second thought, I think my gossip problem is due to intense GC activity, which leaves the node unable to even complete a gossip handshake!
Regards, Dominique
From: Carlos Rolo [mailto:r...@pythian.com]
Sent: Tuesday, July 21, 2015 18:33
To: user@cassandra.apache.org
Subject: Re: Seed gossip version error
That error should only occur when you have a mismatch between the seed version and the new node version. Are you sure all your nodes are running the same version?
Regards,
Carlos Juzarte Rolo
Cassandra Consultant
Pythian - Love your data
rolo@pythian | Twitter: cjrolo | Linkedin: http://linkedin.com/in/carlosjuzarterolo
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
http://www.pythian.com/
On Tue, Jul 21, 2015 at 5:37 PM, DE VITO Dominique dominique.dev...@thalesgroup.com wrote:
Hi Amlan,
We have the same problem with Cassandra 2.1.5. I have no lead to follow yet. Did you find the root cause of this problem? Thanks.
Regards, Dominique
From: Amlan Roy [mailto:amlan@cleartrip.com]
Sent: Wednesday, July 1, 2015 12:46
To: user@cassandra.apache.org
Subject: Seed gossip version error
Hi,
I have a cluster running version 2.1.7. Two of the machines went down and they are not rejoining the cluster even after a restart. I see the following WARN message in system.log on all the nodes:
system.log:WARN [MessagingService-Outgoing-cassandra2.cleartrip.com/172.18.3.32] 2015-07-01 13:00:41,878 OutboundTcpConnection.java:414 - Seed gossip version is -2147483648; will not connect with that version
Please let me know if you have faced the same problem.
Regards, Amlan
Re: Seed gossip version error
That error should only occur when you have a mismatch between the seed version and the new node version. Are you sure all your nodes are running the same version?
Regards,
Carlos Juzarte Rolo
Cassandra Consultant
Pythian - Love your data
rolo@pythian | Twitter: cjrolo | Linkedin: http://linkedin.com/in/carlosjuzarterolo
Mobile: +31 6 159 61 814 | Tel: +1 613 565 8696 x1649
www.pythian.com
On Tue, Jul 21, 2015 at 5:37 PM, DE VITO Dominique dominique.dev...@thalesgroup.com wrote:
Hi Amlan,
We have the same problem with Cassandra 2.1.5. I have no lead to follow yet. Did you find the root cause of this problem? Thanks.
Regards, Dominique
From: Amlan Roy [mailto:amlan@cleartrip.com]
Sent: Wednesday, July 1, 2015 12:46
To: user@cassandra.apache.org
Subject: Seed gossip version error
Hi,
I have a cluster running version 2.1.7. Two of the machines went down and they are not rejoining the cluster even after a restart. I see the following WARN message in system.log on all the nodes:
system.log:WARN [MessagingService-Outgoing-cassandra2.cleartrip.com/172.18.3.32] 2015-07-01 13:00:41,878 OutboundTcpConnection.java:414 - Seed gossip version is -2147483648; will not connect with that version
Please let me know if you have faced the same problem.
Regards, Amlan
Re: Schema questions for data structures with recently-modified access patterns
If last_modified is a clustering column, the table needs a partitioning column, which is what date is for (although I should have named it day, and I also forgot to add the ORDER BY ... DESC clause). This is essentially what I came up with. I still don't like how easy it is to get duplicates.
On Jul 21, 2015, at 9:31 AM, Jack Krupansky jack.krupan...@gmail.com wrote:
Keep the original document base table, but then the query table should have (last_modified, docId) as its PK, with last_modified descending, so that a query can get the n most recently modified documents. Yes, you still need to manually delete the old entry for the document in the query table if duplicates are a problem for you. Yeah, a TTL would be good if you don't care about documents modified a month or a week ago.
-- Jack Krupansky
On Tue, Jul 21, 2015 at 11:13 AM, Carlos Alonso i...@mrcalonso.com wrote:
Hi Robert,
What about modelling it as a time series?
CREATE TABLE document ( docId UUID, doc TEXT, last_modified TIMESTAMP, PRIMARY KEY (docId, last_modified) ) WITH CLUSTERING ORDER BY (last_modified DESC);
This way, the latest modification will always be the first record in the row, so accessing it should be as easy as:
SELECT * FROM document WHERE docId = <the docId> LIMIT 1;
And if you run into disk space issues due to very long rows, you can always expire old ones using a TTL or a batch job. Tombstones will never be a problem in this case because, due to the specified clustering order, the latest modification will always be the first record in the row.
Hope it helps.
Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso
On 21 July 2015 at 05:59, Robert Wille rwi...@fold3.com wrote:
Data structures that have a recently-modified access pattern seem to be a poor fit for Cassandra. I’m wondering if any of you smart guys can provide suggestions.
For the sake of discussion, let’s assume I have the following tables:
CREATE TABLE document ( docId UUID, doc TEXT, last_modified TIMEUUID, PRIMARY KEY ((docid)) );
CREATE TABLE doc_by_last_modified ( date TEXT, last_modified TIMEUUID, docId UUID, PRIMARY KEY ((date), last_modified) );
When I update a document, I retrieve its last_modified time, delete the current record from doc_by_last_modified, and add a new one. Unfortunately, if you’d like each document to appear at most once in the doc_by_last_modified table, then this doesn’t work so well. Documents can get into the doc_by_last_modified table multiple times if there is concurrent access, or if there is a consistency issue.
Any thoughts out there on how to efficiently provide recently-modified access to a table? This problem exists for many types of data structures, not just recently-modified ones. Any ordered data structure that can be dynamically reordered suffers from the same problems. As I’ve been doing schema design, this pattern keeps recurring. A nice way to address this problem would have lots of applications.
Thanks in advance for your thoughts.
Robert
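Putting Jack's and Robert's remarks together, the query table might look like the sketch below: day as the partition key, last_modified descending, and docId as an extra clustering column. This is one reading of the thread rather than a schema either poster wrote out, and all literal values are hypothetical.

```cql
-- Query table: one partition per day, most recently modified entry first.
CREATE TABLE doc_by_last_modified (
    day           TEXT,       -- e.g. '2015-07-21'
    last_modified TIMEUUID,
    docId         UUID,
    PRIMARY KEY ((day), last_modified, docId)
) WITH CLUSTERING ORDER BY (last_modified DESC, docId ASC);

-- On update, remove the document's previous entry. Its day and TIMEUUID
-- are read from the base table first; the values here are hypothetical.
DELETE FROM doc_by_last_modified
WHERE day = '2015-07-20'
  AND last_modified = 5b6962dd-3f90-11e5-a151-feff819cdc9f
  AND docId = 123e4567-e89b-12d3-a456-426655440000;

-- Then add the new entry under the current day.
INSERT INTO doc_by_last_modified (day, last_modified, docId)
VALUES ('2015-07-21', now(), 123e4567-e89b-12d3-a456-426655440000);

-- The ten most recently modified documents for one day.
SELECT docId, last_modified
FROM doc_by_last_modified
WHERE day = '2015-07-21'
LIMIT 10;
```

The delete and the insert are separate writes, so two clients updating the same document at the same moment can each leave an entry behind. That is the duplicate problem Robert describes, and this sketch does not remove it; one option is to de-duplicate by docId when reading.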
howto do sql query like in a relational database
Hi,
I have a simple (perhaps stupid) question. If I want to *search* data in Cassandra, how could I find, in a text field, all records which start with 'Cas'? (In SQL I would do select * from table where field like 'Cas%'.) I know that this is not directly possible.
- But how is it possible?
- Does nobody have the need to search text fragments, and if not, is there a small example to explain *why* this is not needed?
As far as I understand, databases are great for *searching* data. Concerning numerical data in Cassandra, I can use = and all those operators. Is Cassandra intended to be used mostly for numerical data?
I did not catch the point up to now, sorry.
Anton
RE: howto do sql query like in a relational database
"Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store" (http://wiki.apache.org/cassandra/). It is intended for searching by key. It has more querying options, but it really shines when querying by key.
Not all databases offer the same functionality. Both a knife and a fork are eating utensils, but you wouldn't want to cut a tomato with a fork.
There are text-indexing databases out there that might suit your needs better. Try Elasticsearch.
-----Original Message-----
From: anton [mailto:anto...@gmx.de]
Sent: Tuesday, July 21, 2015 7:54 PM
To: user@cassandra.apache.org
Subject: howto do sql query like in a relational database
Hi,
I have a simple (perhaps stupid) question. If I want to *search* data in Cassandra, how could I find, in a text field, all records which start with 'Cas'? (In SQL I would do select * from table where field like 'Cas%'.) I know that this is not directly possible.
- But how is it possible?
- Does nobody have the need to search text fragments, and if not, is there a small example to explain *why* this is not needed?
As far as I understand, databases are great for *searching* data. Concerning numerical data in Cassandra, I can use = and all those operators. Is Cassandra intended to be used mostly for numerical data?
I did not catch the point up to now, sorry.
Anton
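For the specific case of a prefix match, one workaround that stays within key-based access is a range query on a text clustering column. The sketch below is not something proposed in the thread; the table, the bucket column, and the choice of bucketing by first letter are all hypothetical.

```cql
-- Hypothetical lookup table: names are sorted within each bucket partition.
CREATE TABLE names_by_bucket (
    bucket TEXT,   -- assumed partitioning scheme, e.g. first letter of name
    name   TEXT,
    PRIMARY KEY ((bucket), name)
);

-- Rough equivalent of: name LIKE 'Cas%'
-- Every string starting with 'Cas' sorts at or after 'Cas' and before 'Cat'.
SELECT name
FROM names_by_bucket
WHERE bucket = 'C'
  AND name >= 'Cas'
  AND name <  'Cat';
```

This only covers prefixes, is case-sensitive, and requires knowing the partition key up front. Matching text in the middle of a field still calls for a text index such as the Elasticsearch suggestion above.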
Re: Can't connect to Cassandra server
Hi Erick,
In cassandra-env.sh, system_memory_in_mb was set to 2 GB. I changed it to 16 GB, but I still get the same issue. Below are my complete system.log after changing cassandra-env.sh, and the new cassandra-env.sh.
https://gist.githubusercontent.com/cdwijayarathna/5e7e69c62ac09b45490b/raw/f73f043a6cd68eb5e7f93cf597ec514df7ac61ae/log
https://gist.github.com/cdwijayarathna/2665814a9bd3c47ba650
I can't find any output.log in my Cassandra installation.
Thanks
On Tue, Jul 21, 2015 at 4:31 AM, Erick Ramirez er...@ramirez.com.au wrote:
Chamila,
As you can see from the netstat/lsof output, there is nothing listening on port 9042 because Cassandra has not started yet. This is the reason you are unable to connect via cqlsh. You first need to work out why Cassandra has not started.
With regard to the JVM, Oded is referring to the max heap size and new heap size you have configured. The suspicion is that your max heap size is set too low, which is apparent from the heap pressure and GC pattern in the log you provided.
Please provide gists of the following so we can assist:
- updated system.log
- copy of output.log
- cassandra-env.sh
Cheers, Erick
Erick Ramirez | About Me: about.me/erickramirezonline
-- Chamila Dilshan Wijayarathna, Software Engineer Mobile:(+94)788193620 WSO2 Inc., http://wso2.com/
Re: DateTieredCompactionStrategy DTCS sometimes stop dropping SSTables
On Mon, Jul 20, 2015 at 6:20 PM, Christophe Schmitz christo...@instaclustr.com wrote: I am running a 6 node cluster on 2.1.7 ... Sounds similar to : https://issues.apache.org/jira/browse/CASSANDRA-9577 or maybe https://issues.apache.org/jira/browse/CASSANDRA-9056 or https://issues.apache.org/jira/browse/CASSANDRA-8243 The latter two should both be fixed by 2.1.7... =Rob
Re: High CPU load
Yup... it seems like it's GC's fault. GC logs:
2015-07-21T14:19:54.336+: 2876133.270: Total time for which application threads were stopped: 0.0832030 seconds
2015-07-21T14:19:55.739+: 2876134.673: Total time for which application threads were stopped: 0.0806960 seconds
2015-07-21T14:19:57.149+: 2876136.083: Total time for which application threads were stopped: 0.0806890 seconds
2015-07-21T14:19:58.550+: 2876137.484: Total time for which application threads were stopped: 0.0821070 seconds
2015-07-21T14:19:59.941+: 2876138.875: Total time for which application threads were stopped: 0.0802640 seconds
2015-07-21T14:20:01.340+: 2876140.274: Total time for which application threads were stopped: 0.0835670 seconds
2015-07-21T14:20:02.744+: 2876141.678: Total time for which application threads were stopped: 0.0842440 seconds
2015-07-21T14:20:04.143+: 2876143.077: Total time for which application threads were stopped: 0.0841630 seconds
2015-07-21T14:20:05.541+: 2876144.475: Total time for which application threads were stopped: 0.0839850 seconds
Heap after GC invocations=2273737 (full 101):
 par new generation total 1474560K, used 106131K [0x0005fae0, 0x00065ee0, 0x00065ee0)
  eden space 1310720K, 0% used [0x0005fae0, 0x0005fae0, 0x00064ae0)
  from space 163840K, 64% used [0x00064ae0, 0x0006515a4ee0, 0x000654e0)
  to space 163840K, 0% used [0x000654e0, 0x000654e0, 0x00065ee0)
 concurrent mark-sweep generation total 6750208K, used 1316691K [0x00065ee0, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 49336K, used 29520K [0x0007fae0, 0x0007fde2e000, 0x0008)
}
2015-07-21T14:12:05.683+: 2875664.617: Total time for which application threads were stopped: 0.0830280 seconds
{Heap before GC invocations=2273737 (full 101):
 par new generation total 1474560K, used 1416851K [0x0005fae0, 0x00065ee0, 0x00065ee0)
  eden space 1310720K, 100% used [0x0005fae0, 0x00064ae0, 0x00064ae0)
  from space 163840K, 64% used [0x00064ae0, 0x0006515a4ee0, 0x000654e0)
  to space 163840K, 0% used [0x000654e0, 0x000654e0, 0x00065ee0)
 concurrent mark-sweep generation total 6750208K, used 1316691K [0x00065ee0, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 49336K, used 29520K [0x0007fae0, 0x0007fde2e000, 0x0008)
It seems like eden space is constantly being filled by something which is later removed by GC: the log shows a pause of roughly 80 ms about every 1.4 seconds.
On Mon, Jul 20, 2015 at 9:18 AM, Jason Wee peich...@gmail.com wrote:
just a guess, gc?
On Mon, Jul 20, 2015 at 3:15 PM, Marcin Pietraszek mpietras...@opera.com wrote:
Hello!
I've noticed strange CPU utilisation patterns on machines in our cluster. After a C* daemon restart it behaves normally, but a few weeks after a restart CPU usage starts to rise. Currently on one of the nodes (screenshots attached) CPU load is ~4. Shortly before a restart, load rises to ~15 (our Cassandra machines have 16 CPUs).
In that cluster we're bulkloading from a Hadoop cluster with 1400 reducers (200 parallel bulkloading tasks). After such a session of heavy bulkloading the number of pending compactions is quite high, but the cluster is able to clear them before the next bulkloading session. We're also tracking the number of pending compactions, and most of the time it's 0.
Our machines have a few gigs of free memory, ~7GB (17GB used), and it seems like we aren't IO bound.
Screenshots from our Zabbix with CPU utilisation graphs:
http://i60.tinypic.com/xas8q8.jpg
http://i58.tinypic.com/24pifcy.jpg
Do you guys know what could be causing such high load?
-- mp
Re: Schema questions for data structures with recently-modified access patterns
I'm relatively new to data modeling in Cassandra, but perhaps instead of date and last_modified in your primary key for doc_by_last_modified, just use the docId. This way, you can update the last_modified and date fields against the docId. That removes the duplicate issue and obviates the need to delete the current row and add a new one; you'd simply be updating (upserting?) by the docId.
Regards, Victor
On Mon, Jul 20, 2015 at 11:59 PM, Robert Wille rwi...@fold3.com wrote:
Data structures that have a recently-modified access pattern seem to be a poor fit for Cassandra. I’m wondering if any of you smart guys can provide suggestions.
For the sake of discussion, let’s assume I have the following tables:
CREATE TABLE document ( docId UUID, doc TEXT, last_modified TIMEUUID, PRIMARY KEY ((docid)) );
CREATE TABLE doc_by_last_modified ( date TEXT, last_modified TIMEUUID, docId UUID, PRIMARY KEY ((date), last_modified) );
When I update a document, I retrieve its last_modified time, delete the current record from doc_by_last_modified, and add a new one. Unfortunately, if you’d like each document to appear at most once in the doc_by_last_modified table, then this doesn’t work so well. Documents can get into the doc_by_last_modified table multiple times if there is concurrent access, or if there is a consistency issue.
Any thoughts out there on how to efficiently provide recently-modified access to a table? This problem exists for many types of data structures, not just recently-modified ones. Any ordered data structure that can be dynamically reordered suffers from the same problems. As I’ve been doing schema design, this pattern keeps recurring. A nice way to address this problem would have lots of applications.
Thanks in advance for your thoughts.
Robert
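Victor's suggestion, keying the table by docId alone, could be sketched as follows. The column set mirrors Robert's table; the literal values are hypothetical.

```cql
-- One row per document: a second write to the same docId overwrites the first.
CREATE TABLE doc_by_last_modified (
    docId         UUID,
    date          TEXT,
    last_modified TIMEUUID,
    PRIMARY KEY ((docId))
);

-- Upsert: creates the row if it is missing, otherwise replaces its columns.
UPDATE doc_by_last_modified
SET date = '2015-07-21',
    last_modified = now()
WHERE docId = 123e4567-e89b-12d3-a456-426655440000;

-- Lookup by document works as expected.
SELECT date, last_modified
FROM doc_by_last_modified
WHERE docId = 123e4567-e89b-12d3-a456-426655440000;
```

This does guarantee at most one row per document. The trade-off is that rows are no longer stored in last_modified order, so the table cannot directly serve a "most recently modified documents" query, which is the access pattern Robert is after.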