Re: Read performance in map data type
Hello Shrikar, We are still facing read latency issue, here is the histogram http://pastebin.com/yEvMuHYh On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav apoorva.gau...@myntra.comwrote: Hello Shrikar, Yes primary key is (studentID, subjectID). I had dropped the test table, recreating and populating it post which will share the cfhistogram. In such case is there any practical limit on the rows I should fetch, for e.g. should I do select * form marks_table where studentID = ? limit 500; instead of doing select * form marks_table where studentID = ?; On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak shrika...@gmail.comwrote: Hi Apoorva, I assume this is the table with studentId and subjectId as primary keys and not other like like marks in that. create table marks_table(studentId int, subjectId int, marks int, PRIMARY KEY(studentId,subjectId)); Also could you give the cfhistogram stats? nodetool cfhistograms your keyspace marks_table; Thanks, Shrikar On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Hello All, We've a schema which can be modeled as (studentID, subjectID, marks) where combination of studentID and subjectID is unique. Number of studentID can go up to 100 million and for each studentID we can have up to 10k subjectIDs. We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We are using a four node cluster, each having 24 cores and 32GB memory. I'm sure that the machines are not underperformant as on same test bed we've consistently received 5ms response times for ~1b documents when queried via primary key. I've tried three approaches, all of which result in significant deterioration (500 ms response time) in read query performance once number of subjectIDs goes past ~100 for a studentID. Approaches are :- 1. model as (studentID int PRIMARY KEY, subjectID_marks_map mapint, int) and query by subjectID 2. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) and query as select * from marks_table where studentID = ? 3. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) and query as select * from marks_table where studentID = ? and subjectID in (?, ?, ??) number of subjectIDs in query being ~1K. What can be the bottlenecks. Is it better if we model as (studentID int, subjct_marks_json text) and query by studentID. -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva
Re: Read performance in map data type
At the client side we are getting a latency of ~350ms, we are using datastax driver 2.0.0 and have kept the fetch size as 500. And these are coming while reading rows having ~200 columns. On Thu, Apr 3, 2014 at 12:45 PM, Shrikar archak shrika...@gmail.com wrote: Hi Apoorva, As per the cfhistogram there are some rows which have more than 75k columns and around 150k reads hit 2 SStables. Are you sure that you are seeing more than 500ms latency? The cfhistogram should the worst read performance was around 51ms which looks reasonable with many reads hitting 2 sstables. Thanks, Shrikar On Wed, Apr 2, 2014 at 11:30 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Hello Shrikar, We are still facing read latency issue, here is the histogram http://pastebin.com/yEvMuHYh On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Hello Shrikar, Yes primary key is (studentID, subjectID). I had dropped the test table, recreating and populating it post which will share the cfhistogram. In such case is there any practical limit on the rows I should fetch, for e.g. should I do select * form marks_table where studentID = ? limit 500; instead of doing select * form marks_table where studentID = ?; On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak shrika...@gmail.comwrote: Hi Apoorva, I assume this is the table with studentId and subjectId as primary keys and not other like like marks in that. create table marks_table(studentId int, subjectId int, marks int, PRIMARY KEY(studentId,subjectId)); Also could you give the cfhistogram stats? nodetool cfhistograms your keyspace marks_table; Thanks, Shrikar On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Hello All, We've a schema which can be modeled as (studentID, subjectID, marks) where combination of studentID and subjectID is unique. Number of studentID can go up to 100 million and for each studentID we can have up to 10k subjectIDs. We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We are using a four node cluster, each having 24 cores and 32GB memory. I'm sure that the machines are not underperformant as on same test bed we've consistently received 5ms response times for ~1b documents when queried via primary key. I've tried three approaches, all of which result in significant deterioration (500 ms response time) in read query performance once number of subjectIDs goes past ~100 for a studentID. Approaches are :- 1. model as (studentID int PRIMARY KEY, subjectID_marks_map mapint, int) and query by subjectID 2. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) and query as select * from marks_table where studentID = ? 3. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) and query as select * from marks_table where studentID = ? and subjectID in (?, ?, ??) number of subjectIDs in query being ~1K. What can be the bottlenecks. Is it better if we model as (studentID int, subjct_marks_json text) and query by studentID. -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva
Re: Read performance in map data type
client side socket limit : 64K client side maximum connection per host : 8 read consistency level : Quorum On Thu, Apr 3, 2014 at 12:59 PM, Shrikar archak shrika...@gmail.com wrote: How about the client side socket limits? Cassandra client side maximum connection per host and read consistency level? ~Shrikar On Thu, Apr 3, 2014 at 12:20 AM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: At the client side we are getting a latency of ~350ms, we are using datastax driver 2.0.0 and have kept the fetch size as 500. And these are coming while reading rows having ~200 columns. On Thu, Apr 3, 2014 at 12:45 PM, Shrikar archak shrika...@gmail.comwrote: Hi Apoorva, As per the cfhistogram there are some rows which have more than 75k columns and around 150k reads hit 2 SStables. Are you sure that you are seeing more than 500ms latency? The cfhistogram should the worst read performance was around 51ms which looks reasonable with many reads hitting 2 sstables. Thanks, Shrikar On Wed, Apr 2, 2014 at 11:30 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Hello Shrikar, We are still facing read latency issue, here is the histogram http://pastebin.com/yEvMuHYh On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Hello Shrikar, Yes primary key is (studentID, subjectID). I had dropped the test table, recreating and populating it post which will share the cfhistogram. In such case is there any practical limit on the rows I should fetch, for e.g. should I do select * form marks_table where studentID = ? limit 500; instead of doing select * form marks_table where studentID = ?; On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak shrika...@gmail.comwrote: Hi Apoorva, I assume this is the table with studentId and subjectId as primary keys and not other like like marks in that. create table marks_table(studentId int, subjectId int, marks int, PRIMARY KEY(studentId,subjectId)); Also could you give the cfhistogram stats? nodetool cfhistograms your keyspace marks_table; Thanks, Shrikar On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Hello All, We've a schema which can be modeled as (studentID, subjectID, marks) where combination of studentID and subjectID is unique. Number of studentID can go up to 100 million and for each studentID we can have up to 10k subjectIDs. We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We are using a four node cluster, each having 24 cores and 32GB memory. I'm sure that the machines are not underperformant as on same test bed we've consistently received 5ms response times for ~1b documents when queried via primary key. I've tried three approaches, all of which result in significant deterioration (500 ms response time) in read query performance once number of subjectIDs goes past ~100 for a studentID. Approaches are :- 1. model as (studentID int PRIMARY KEY, subjectID_marks_map mapint, int) and query by subjectID 2. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) and query as select * from marks_table where studentID = ? 3. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) and query as select * from marks_table where studentID = ? and subjectID in (?, ?, ??) number of subjectIDs in query being ~1K. What can be the bottlenecks. Is it better if we model as (studentID int, subjct_marks_json text) and query by studentID. -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva
Re: Read performance in map data type
I've observed that reducing fetch size results in better latency (isn't that obvious :-)), tried from fetch size varying from 100 to 1, seeing a lot of errors for 1. Haven't tried modifying the number of columns. Let me start a new thread focused on fetch size. On Wed, Apr 2, 2014 at 9:53 AM, Sourabh Agrawal iitr.sour...@gmail.comwrote: From the doc : The fetch size controls how much resulting rows will be retrieved simultaneously. So, I guess it does not depend on the number of columns as such. As all the columns for a key reside on the same node, I think it wouldn't matter much whatever be the number of columns as long as we have enough memory in the app. Default value is 5000. (com.datastax.driver.core.QueryOptions) We use it with the default value. I have never profiled cassandra for read load. If you profile it for different fetch sizes, please share the results :) On Wed, Apr 2, 2014 at 8:45 AM, Apoorva Gaurav apoorva.gau...@myntra.comwrote: Thanks Sourabh, I've modelled my table as studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) as primarily I'll be querying using studentID and sometime using studentID and subjectID. I've tried driver 2.0.0 and its giving good results. Also using its auto paging feature. Any idea what should be a typical value for fetch size. And does the fetch size depends on how many columns are there in the CQL table for e.g. should fetch size in a table like studentID int, subjectID int, marks1 int, marks2 int, marks3 int marksN int PRIMARY KEY(studentID, subjectID) be less than fetch size in studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) On Wed, Apr 2, 2014 at 2:20 AM, Robert Coli rc...@eventbrite.com wrote: On Mon, Mar 31, 2014 at 9:13 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Thanks Robert, Is there a workaround, as in our test setups we keep dropping and recreating tables. Use unique keyspace (or table) names for each test? That's the approach they're taking in 5202... =Rob -- Thanks Regards, Apoorva -- Sourabh Agrawal Bangalore +91 9945657973 -- Thanks Regards, Apoorva
optimum fetch size in datastax driver
Hello All, We have a schema which can be modelled as *(studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID)*. There can be ~1M studentIDs and for each studentID there can be ~10K subjectIDs. The queries can be using studentID and studentID-subjectID We have a 3 node (each having 24 cores) apache cassandra 2.0.4 cluster and are using datastax driver 2.0.0 to interact with it using its automatic paging feature. I've tried various fetch sizes varying from 100 to 10K and observed that read latency increases with fetch size (which looks obvious). At around 10K there are a lot of errors. Want to understand :- - Is there a rule of thumb for deciding on the optimum fetch size ( *com.datastax.driver.core.Statement.setFetchSize()* ). - Does cassandra keeps the entire result in cache and only returns the rows corresponding to the fetch size or it treats subsequent as new queries ( *com.datastax.driver.core.**ResultSet.fetchMoreResults() *) - Whether the optimum fetch size depends on number of columns in CQL table for e.g. should fetch size in a table like ***studentID int, subjectID int, marks1 int, marks2 int, marks3 int marksN int PRIMARY KEY(studentID, subjectID)* be less than fetch size in *studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID)* -- Thanks Regards, Apoorva
Dead node appearing in datastax driver
Hello All, We had a 4 node cassandra 2.0.4 cluster ( lets call them host1, host2, host3 and host4), out of which we've removed one node (host4) using nodetool removenode command. Now using nodetool status or nodetool ring we no longer see host4. It's also not appearing in Datastax opscenter. But its intermittently appearing in Metadata.getAllHosts() while connecting using datastax driver 1.0.4. Couple of questions :- -How is it appearing. -Can this have impact on read / write performance of client. Code which we are using to connect is public void connect() { PoolingOptions poolingOptions = new PoolingOptions(); cluster = Cluster.builder() .addContactPoints(inetAddresses.toArray(new String[]{})) .withLoadBalancingPolicy(new RoundRobinPolicy()) .withPoolingOptions(poolingOptions) .withPort(port) .withCredentials(username, password) .build(); Metadata metadata = cluster.getMetadata(); System.out.printf(Connected to cluster: %s\n, metadata.getClusterName()); for (Host host : metadata.getAllHosts()) { System.out.printf(Datacenter: %s; Host: %s; Rack: %s\n, host.getDatacenter(), host.getAddress(), host.getRack()); } } -- Thanks Regards, Apoorva
Re: Dead node appearing in datastax driver
Hello Sylvian, Queried system.peers on three live nodes and host4 is appearing on two of these. On Tue, Apr 1, 2014 at 5:06 PM, Sylvain Lebresne sylv...@datastax.comwrote: On Tue, Apr 1, 2014 at 12:50 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Hello All, We had a 4 node cassandra 2.0.4 cluster ( lets call them host1, host2, host3 and host4), out of which we've removed one node (host4) using nodetool removenode command. Now using nodetool status or nodetool ring we no longer see host4. It's also not appearing in Datastax opscenter. But its intermittently appearing in Metadata.getAllHosts() while connecting using datastax driver 1.0.4. Couple of questions :- -How is it appearing. Not sure. Can you try querying the peers system table on each of your nodes (with cqlsh: SELECT * FROM system.peers) and see if the host4 is still mentioned somewhere? -Can this have impact on read / write performance of client. No. If the host doesn't exists, the driver might try to reconnect to it at times, but since it won't be able to, it won't try to use it for reads and writes. That does mean you might have a reconnection task running with some regularity, but 1) it's not on the write/read path of queries and 2) provided you've left the default reconnection policy, this will happen once every 10 minutes and will be pretty cheap so that it will consume an completely negligible amount of ressources. That doesn't mean I'm not interested tracking down why that happens in the first place though. -- Sylvain Code which we are using to connect is public void connect() { PoolingOptions poolingOptions = new PoolingOptions(); cluster = Cluster.builder() .addContactPoints(inetAddresses.toArray(new String[]{})) .withLoadBalancingPolicy(new RoundRobinPolicy()) .withPoolingOptions(poolingOptions) .withPort(port) .withCredentials(username, password) .build(); Metadata metadata = cluster.getMetadata(); System.out.printf(Connected to cluster: %s\n, metadata.getClusterName()); for (Host host : metadata.getAllHosts()) { System.out.printf(Datacenter: %s; Host: %s; Rack: %s\n, host.getDatacenter(), host.getAddress(), host.getRack()); } } -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva
Re: Dead node appearing in datastax driver
Did that and I actually see a significant reduction in write latency. On Tue, Apr 1, 2014 at 5:35 PM, Sylvain Lebresne sylv...@datastax.comwrote: On Tue, Apr 1, 2014 at 1:49 PM, Apoorva Gaurav apoorva.gau...@myntra.comwrote: Hello Sylvian, Queried system.peers on three live nodes and host4 is appearing on two of these. That's why the driver thinks they are still there. You're most probably running into https://issues.apache.org/jira/browse/CASSANDRA-6053 since you are on C* 2.0.4. As said, this is relatively harmless, but you should think about upgrading to 2.0.6 to fix it in the future (you could manually remove the bad entries in System.peers in the meantime if you want, they are really just leftover that shouldn't be here). -- Sylvain On Tue, Apr 1, 2014 at 5:06 PM, Sylvain Lebresne sylv...@datastax.comwrote: On Tue, Apr 1, 2014 at 12:50 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Hello All, We had a 4 node cassandra 2.0.4 cluster ( lets call them host1, host2, host3 and host4), out of which we've removed one node (host4) using nodetool removenode command. Now using nodetool status or nodetool ring we no longer see host4. It's also not appearing in Datastax opscenter. But its intermittently appearing in Metadata.getAllHosts() while connecting using datastax driver 1.0.4. Couple of questions :- -How is it appearing. Not sure. Can you try querying the peers system table on each of your nodes (with cqlsh: SELECT * FROM system.peers) and see if the host4 is still mentioned somewhere? -Can this have impact on read / write performance of client. No. If the host doesn't exists, the driver might try to reconnect to it at times, but since it won't be able to, it won't try to use it for reads and writes. That does mean you might have a reconnection task running with some regularity, but 1) it's not on the write/read path of queries and 2) provided you've left the default reconnection policy, this will happen once every 10 minutes and will be pretty cheap so that it will consume an completely negligible amount of ressources. That doesn't mean I'm not interested tracking down why that happens in the first place though. -- Sylvain Code which we are using to connect is public void connect() { PoolingOptions poolingOptions = new PoolingOptions(); cluster = Cluster.builder() .addContactPoints(inetAddresses.toArray(newString[]{})) .withLoadBalancingPolicy(new RoundRobinPolicy()) .withPoolingOptions(poolingOptions) .withPort(port) .withCredentials(username, password) .build(); Metadata metadata = cluster.getMetadata(); System.out.printf(Connected to cluster: %s\n, metadata.getClusterName()); for (Host host : metadata.getAllHosts()) { System.out.printf(Datacenter: %s; Host: %s; Rack: %s\n, host.getDatacenter(), host.getAddress(), host.getRack()); } } -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva
Re: Dead node appearing in datastax driver
I manually removed entries from System.peers. The improvements can well be coincidental as various other apps were also running on the same test bed. On Tue, Apr 1, 2014 at 8:43 PM, Sylvain Lebresne sylv...@datastax.comwrote: What does Did that mean? Does that means I upgraded to 2.0.6, or does that mean I manually removed entries from System.peers. If the latter, I'd need more info on what you did exactly, what your peers table looked like before and how they look like now: there is no reason deleting the peers entries for hosts that at not part of the cluster anymore would have anything to do with write latency (but if say you've removed wrong entries, that might have make the driver think some live host had been removed and if the drivers has less nodes to use to dispatch queries, that might impact latency I suppose -- at least that's the only related thing I can think of). -- Sylvain On Tue, Apr 1, 2014 at 2:44 PM, Apoorva Gaurav apoorva.gau...@myntra.comwrote: Did that and I actually see a significant reduction in write latency. On Tue, Apr 1, 2014 at 5:35 PM, Sylvain Lebresne sylv...@datastax.comwrote: On Tue, Apr 1, 2014 at 1:49 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Hello Sylvian, Queried system.peers on three live nodes and host4 is appearing on two of these. That's why the driver thinks they are still there. You're most probably running into https://issues.apache.org/jira/browse/CASSANDRA-6053 since you are on C* 2.0.4. As said, this is relatively harmless, but you should think about upgrading to 2.0.6 to fix it in the future (you could manually remove the bad entries in System.peers in the meantime if you want, they are really just leftover that shouldn't be here). -- Sylvain On Tue, Apr 1, 2014 at 5:06 PM, Sylvain Lebresne sylv...@datastax.comwrote: On Tue, Apr 1, 2014 at 12:50 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Hello All, We had a 4 node cassandra 2.0.4 cluster ( lets call them host1, host2, host3 and host4), out of which we've removed one node (host4) using nodetool removenode command. Now using nodetool status or nodetool ring we no longer see host4. It's also not appearing in Datastax opscenter. But its intermittently appearing in Metadata.getAllHosts() while connecting using datastax driver 1.0.4. Couple of questions :- -How is it appearing. Not sure. Can you try querying the peers system table on each of your nodes (with cqlsh: SELECT * FROM system.peers) and see if the host4 is still mentioned somewhere? -Can this have impact on read / write performance of client. No. If the host doesn't exists, the driver might try to reconnect to it at times, but since it won't be able to, it won't try to use it for reads and writes. That does mean you might have a reconnection task running with some regularity, but 1) it's not on the write/read path of queries and 2) provided you've left the default reconnection policy, this will happen once every 10 minutes and will be pretty cheap so that it will consume an completely negligible amount of ressources. That doesn't mean I'm not interested tracking down why that happens in the first place though. -- Sylvain Code which we are using to connect is public void connect() { PoolingOptions poolingOptions = new PoolingOptions(); cluster = Cluster.builder() .addContactPoints(inetAddresses.toArray(newString[]{})) .withLoadBalancingPolicy(new RoundRobinPolicy()) .withPoolingOptions(poolingOptions) .withPort(port) .withCredentials(username, password) .build(); Metadata metadata = cluster.getMetadata(); System.out.printf(Connected to cluster: %s\n, metadata.getClusterName()); for (Host host : metadata.getAllHosts()) { System.out.printf(Datacenter: %s; Host: %s; Rack: %s\n, host.getDatacenter(), host.getAddress(), host.getRack()); } } -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva
Re: Read performance in map data type
Thanks Sourabh, I've modelled my table as studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) as primarily I'll be querying using studentID and sometime using studentID and subjectID. I've tried driver 2.0.0 and its giving good results. Also using its auto paging feature. Any idea what should be a typical value for fetch size. And does the fetch size depends on how many columns are there in the CQL table for e.g. should fetch size in a table like studentID int, subjectID int, marks1 int, marks2 int, marks3 int marksN int PRIMARY KEY(studentID, subjectID) be less than fetch size in studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) On Wed, Apr 2, 2014 at 2:20 AM, Robert Coli rc...@eventbrite.com wrote: On Mon, Mar 31, 2014 at 9:13 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Thanks Robert, Is there a workaround, as in our test setups we keep dropping and recreating tables. Use unique keyspace (or table) names for each test? That's the approach they're taking in 5202... =Rob -- Thanks Regards, Apoorva
Re: Read performance in map data type
Thanks Robert, Is there a workaround, as in our test setups we keep dropping and recreating tables. On Mon, Mar 31, 2014 at 11:51 PM, Robert Coli rc...@eventbrite.com wrote: On Fri, Mar 28, 2014 at 7:41 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Yes primary key is (studentID, subjectID). I had dropped the test table, recreating and populating it post which will share the cfhistogram. In such case is there any practical limit on the rows I should fetch, for e.g. should I do Until this bug is fixed upstream, dropping and recreating a table may create unexpected behavior. https://issues.apache.org/jira/browse/CASSANDRA-5202 =Rob -- Thanks Regards, Apoorva
Re: Read performance in map data type
Hello Sourabh, I'd prefer to do query like select * from marks_table where studentID = ? and subjectID in (?, ?, ??) but if its costly then can happily delegate the responsibility to the application layer. Haven't tried 2.x java driver for this specific issue but tried it once earlier and found the performance slower than 1.x; isn't so? On Sat, Mar 29, 2014 at 3:30 PM, Sourabh Agrawal iitr.sour...@gmail.comwrote: Hi Apoorva, Do you always query on studentID only or do you need to query on both studentID and subjectID? Also, I think using the latest driver (2.x) can make querying large number of rows efficient. http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0 On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Hello Shrikar, Yes primary key is (studentID, subjectID). I had dropped the test table, recreating and populating it post which will share the cfhistogram. In such case is there any practical limit on the rows I should fetch, for e.g. should I do select * form marks_table where studentID = ? limit 500; instead of doing select * form marks_table where studentID = ?; On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak shrika...@gmail.comwrote: Hi Apoorva, I assume this is the table with studentId and subjectId as primary keys and not other like like marks in that. create table marks_table(studentId int, subjectId int, marks int, PRIMARY KEY(studentId,subjectId)); Also could you give the cfhistogram stats? nodetool cfhistograms your keyspace marks_table; Thanks, Shrikar On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Hello All, We've a schema which can be modeled as (studentID, subjectID, marks) where combination of studentID and subjectID is unique. Number of studentID can go up to 100 million and for each studentID we can have up to 10k subjectIDs. We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We are using a four node cluster, each having 24 cores and 32GB memory. I'm sure that the machines are not underperformant as on same test bed we've consistently received 5ms response times for ~1b documents when queried via primary key. I've tried three approaches, all of which result in significant deterioration (500 ms response time) in read query performance once number of subjectIDs goes past ~100 for a studentID. Approaches are :- 1. model as (studentID int PRIMARY KEY, subjectID_marks_map mapint, int) and query by subjectID 2. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) and query as select * from marks_table where studentID = ? 3. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) and query as select * from marks_table where studentID = ? and subjectID in (?, ?, ??) number of subjectIDs in query being ~1K. What can be the bottlenecks. Is it better if we model as (studentID int, subjct_marks_json text) and query by studentID. -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva -- Sourabh Agrawal Bangalore +91 9945657973 -- Thanks Regards, Apoorva
Read performance in map data type
Hello All, We've a schema which can be modeled as (studentID, subjectID, marks) where combination of studentID and subjectID is unique. Number of studentID can go up to 100 million and for each studentID we can have up to 10k subjectIDs. We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We are using a four node cluster, each having 24 cores and 32GB memory. I'm sure that the machines are not underperformant as on same test bed we've consistently received 5ms response times for ~1b documents when queried via primary key. I've tried three approaches, all of which result in significant deterioration (500 ms response time) in read query performance once number of subjectIDs goes past ~100 for a studentID. Approaches are :- 1. model as (studentID int PRIMARY KEY, subjectID_marks_map mapint, int) and query by subjectID 2. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) and query as select * from marks_table where studentID = ? 3. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) and query as select * from marks_table where studentID = ? and subjectID in (?, ?, ??) number of subjectIDs in query being ~1K. What can be the bottlenecks. Is it better if we model as (studentID int, subjct_marks_json text) and query by studentID. -- Thanks Regards, Apoorva
Re: Read performance in map data type
Hello Shrikar, Yes primary key is (studentID, subjectID). I had dropped the test table, recreating and populating it post which will share the cfhistogram. In such case is there any practical limit on the rows I should fetch, for e.g. should I do select * form marks_table where studentID = ? limit 500; instead of doing select * form marks_table where studentID = ?; On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak shrika...@gmail.com wrote: Hi Apoorva, I assume this is the table with studentId and subjectId as primary keys and not other like like marks in that. create table marks_table(studentId int, subjectId int, marks int, PRIMARY KEY(studentId,subjectId)); Also could you give the cfhistogram stats? nodetool cfhistograms your keyspace marks_table; Thanks, Shrikar On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote: Hello All, We've a schema which can be modeled as (studentID, subjectID, marks) where combination of studentID and subjectID is unique. Number of studentID can go up to 100 million and for each studentID we can have up to 10k subjectIDs. We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We are using a four node cluster, each having 24 cores and 32GB memory. I'm sure that the machines are not underperformant as on same test bed we've consistently received 5ms response times for ~1b documents when queried via primary key. I've tried three approaches, all of which result in significant deterioration (500 ms response time) in read query performance once number of subjectIDs goes past ~100 for a studentID. Approaches are :- 1. model as (studentID int PRIMARY KEY, subjectID_marks_map mapint, int) and query by subjectID 2. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) and query as select * from marks_table where studentID = ? 3. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID) and query as select * from marks_table where studentID = ? and subjectID in (?, ?, ??) number of subjectIDs in query being ~1K. What can be the bottlenecks. Is it better if we model as (studentID int, subjct_marks_json text) and query by studentID. -- Thanks Regards, Apoorva -- Thanks Regards, Apoorva