Re: Read performance in map data type
http://www.datastax.com/documentation/developer/java-driver/2.0/java-driver/tracing_t.html

On Fri, Apr 4, 2014 at 11:34 AM, Apoorva Gaurav apoorva.gau...@myntra.com wrote:

> On Fri, Apr 4, 2014 at 9:37 PM, Tyler Hobbs ty...@datastax.com wrote:
>
>> On Fri, Apr 4, 2014 at 12:41 AM, Apoorva Gaurav apoorva.gau...@myntra.com wrote:
>>
>>> If we store the same data as JSON using the text data type, i.e. (studentID int, subjectMarksJson text), we are getting a latency of ~10ms from the same client for even bigger. I understand that JSON is not the preferred storage for Cassandra and that we will lose the flexibility which a proper tabular approach provides. But such a huge jump in read latency is a killer. I'm pastebin-ing the histogram for JSON storage as well: http://pastebin.com/RiW6hMb2.
>>
>> Can you trace the slow query and paste the results?
>
> How can I enable that?

-- Tyler Hobbs DataStax http://datastax.com/
Re: Read performance in map data type
Hello Shrikar,

We are still facing the read latency issue; here is the histogram: http://pastebin.com/yEvMuHYh

On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav apoorva.gau...@myntra.com wrote:

> Hello Shrikar,
>
> Yes, the primary key is (studentID, subjectID). I had dropped the test table; I am recreating and populating it, after which I will share the cfhistogram. In such a case, is there any practical limit on the rows I should fetch? For example, should I do
>
> select * from marks_table where studentID = ? limit 500;
>
> instead of doing
>
> select * from marks_table where studentID = ?;
>
> On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak shrika...@gmail.com wrote:
>
>> Hi Apoorva,
>>
>> I assume this is the table with studentId and subjectId as the primary key, and not others like marks in that.
>>
>> create table marks_table(studentId int, subjectId int, marks int, PRIMARY KEY(studentId,subjectId));
>>
>> Also, could you give the cfhistogram stats?
>>
>> nodetool cfhistograms <your keyspace> marks_table;
>>
>> Thanks, Shrikar

-- Thanks & Regards, Apoorva
Re: Read performance in map data type
Hi Apoorva,

As per the cfhistogram, there are some rows which have more than 75k columns, and around 150k reads hit 2 SSTables. Are you sure that you are seeing more than 500ms latency? The cfhistogram shows that the worst read latency was around 51ms, which looks reasonable with many reads hitting 2 SSTables.

Thanks, Shrikar
Re: Read performance in map data type
At the client side we are getting a latency of ~350ms. We are using DataStax driver 2.0.0 and have kept the fetch size at 500. These latencies occur while reading rows having ~200 columns.

-- Thanks & Regards, Apoorva
Re: Read performance in map data type
How about the client-side socket limits, the Cassandra client-side maximum connections per host, and the read consistency level?

~Shrikar
Re: Read performance in map data type
client side socket limit : 64K
client side maximum connection per host : 8
read consistency level : Quorum

-- Thanks & Regards, Apoorva
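As background for the Quorum read consistency level reported above: the coordinator waits for a majority of the replicas to answer before returning, so a read is only as fast as the replicas it has to wait for. A minimal sketch of the arithmetic follows. The replication factor of 3 is an assumption, since the thread never states it.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM read or write."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


# Assumed replication factor; the thread does not say what the keyspace uses.
ASSUMED_RF = 3
replicas_needed = quorum(ASSUMED_RF)
print(replicas_needed)
```

With a replication factor of 1 the same formula yields 1, in which case Quorum would add no extra replica waits.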
Re: Read performance in map data type
On Thu, Apr 3, 2014 at 12:20 AM, Apoorva Gaurav apoorva.gau...@myntra.com wrote:

> At the client side we are getting a latency of ~350ms, we are using datastax driver 2.0.0 and have kept the fetch size as 500. And these are coming while reading rows having ~200 columns.

And you're sure that the 300ms between what Cassandra reports and what your app reports are not just network/serialization time? What do you believe the latency should be?

=Rob
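The gap Rob refers to can be worked out from the two figures reported in this thread: ~350 ms measured at the client and ~51 ms as the worst read latency in the cfhistogram. The sketch below does that subtraction and shows one way to time a call from the client side. `run_query` is a hypothetical stand-in for the real driver call, not part of any driver API.

```python
import time


def client_overhead_ms(client_ms: float, server_ms: float) -> float:
    """Latency that the server-side read time does not account for."""
    return client_ms - server_ms


def timed_ms(call):
    """Run call() and return (result, elapsed milliseconds) as seen by the client."""
    start = time.perf_counter()
    result = call()
    return result, (time.perf_counter() - start) * 1000.0


# Figures from the thread: ~350 ms at the client, ~51 ms worst case in cfhistograms.
gap = client_overhead_ms(350.0, 51.0)
print(gap)  # 299.0


def run_query():
    # Hypothetical stand-in for the real query; returns ~200 (subject_id, marks) rows.
    return [(subject_id, 0) for subject_id in range(200)]


rows, elapsed = timed_ms(run_query)
```

Timing the driver call alone, separately from any application-level deserialization, would help show where the remaining ~300 ms is spent.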
Re: Read performance in map data type
I've observed that reducing the fetch size results in better latency (isn't that obvious :-)). I tried fetch sizes varying from 100 to 1 and am seeing a lot of errors for 1. I haven't tried modifying the number of columns. Let me start a new thread focused on fetch size.

-- Thanks & Regards, Apoorva
Re: Read performance in map data type
On Mon, Mar 31, 2014 at 9:13 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote:

> Thanks Robert, Is there a workaround, as in our test setups we keep dropping and recreating tables.

Use unique keyspace (or table) names for each test? That's the approach they're taking in 5202...

=Rob
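One way a test harness might follow the suggestion of unique table names is sketched below. The helper name `unique_table_name` and the 12-character suffix are illustrative choices, not something taken from the thread or from CASSANDRA-5202.

```python
import re
import uuid


def unique_table_name(prefix: str = "marks_table") -> str:
    """Return a fresh table name for one test run, e.g. marks_table_3f9c1a2b4d5e."""
    suffix = uuid.uuid4().hex[:12]
    return f"{prefix}_{suffix}"


name = unique_table_name()
# Only lowercase letters, digits and underscores, so the name needs no quoting in CQL.
assert re.fullmatch(r"[a-z][a-z0-9_]*", name)
print(
    f"create table {name}(studentId int, subjectId int, marks int, "
    "PRIMARY KEY(studentId, subjectId));"
)
```

Each test run then creates its own table instead of dropping and recreating one with the same name.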
Re: Read performance in map data type
Thanks Sourabh,

I've modelled my table as

studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID)

as primarily I'll be querying using studentID and sometimes using studentID and subjectID. I've tried driver 2.0.0 and it's giving good results. I'm also using its auto paging feature. Any idea what a typical value for fetch size should be? And does the fetch size depend on how many columns there are in the CQL table? For example, should the fetch size in a table like

studentID int, subjectID int, marks1 int, marks2 int, marks3 int, ..., marksN int, PRIMARY KEY(studentID, subjectID)

be less than the fetch size in

studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID)

-- Thanks & Regards, Apoorva
Re: Read performance in map data type
From the doc: The fetch size controls how much resulting rows will be retrieved simultaneously.

So, I guess it does not depend on the number of columns as such. As all the columns for a key reside on the same node, I think the number of columns wouldn't matter much as long as we have enough memory in the app. The default value is 5000 (com.datastax.driver.core.QueryOptions). We use it with the default value. I have never profiled Cassandra for read load. If you profile it for different fetch sizes, please share the results :)

-- Sourabh Agrawal Bangalore +91 9945657973
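To make the fetch-size trade-off concrete, here is a rough sketch of how many pages a read of one studentID needs. It assumes each page costs one request to the cluster and ignores row size; the row counts and fetch sizes are the ones mentioned in this thread (up to 10k subjectIDs, fetch size 500, default 5000).

```python
import math


def round_trips(rows_in_result: int, fetch_size: int) -> int:
    """Pages needed to read rows_in_result rows, at least one request even if empty."""
    if fetch_size < 1:
        raise ValueError("fetch size must be at least 1")
    return max(1, math.ceil(rows_in_result / fetch_size))


print(round_trips(10_000, 500))   # 20 pages for the largest partition at fetch size 500
print(round_trips(10_000, 5000))  # 2 pages at the driver default of 5000
print(round_trips(200, 500))      # 1 page for a ~200-row read
```

A smaller fetch size lowers the time to the first page and the memory held per page, while increasing the number of requests needed to drain a large partition.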
Re: Read performance in map data type
On Fri, Mar 28, 2014 at 7:41 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote:

> Yes primary key is (studentID, subjectID). I had dropped the test table, recreating and populating it post which will share the cfhistogram. In such case is there any practical limit on the rows I should fetch, for e.g. should I do

Until this bug is fixed upstream, dropping and recreating a table may create unexpected behavior.

https://issues.apache.org/jira/browse/CASSANDRA-5202

=Rob
Re: Read performance in map data type
Thanks Robert,

Is there a workaround? In our test setups we keep dropping and recreating tables.

-- Thanks & Regards, Apoorva
Re: Read performance in map data type
Hi,

I don't think there is a problem with the driver. Regarding the schema, you may want to choose between wide rows and skinny rows.

http://stackoverflow.com/questions/19039123/cassandra-wide-vs-skinny-rows-for-large-columns
http://thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html

When you do

studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID)

you are partitioning data by studentID (wide row pattern). So there will be one wide row for each studentID, and all reads for a studentID will go to a single node (or its replicas).

http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_compound_keys_c.html

If you do

studentID int, subjectID int, marks int, PRIMARY KEY((studentID, subjectID))

you are partitioning data by a composite value of both columns. So read requests will be distributed, but now you cannot query with only studentID.

http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/create_table_r.html#reference_ds_v3f_vfk_xj__compositPart

*You should not use a map because it has various restrictions*:

http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddlWhenCollections.html

-- Sourabh Agrawal Bangalore +91 9945657973
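The difference between the two primary key definitions discussed above can be illustrated with a toy placement model. This is only a sketch: it hashes the partition key with MD5 and takes it modulo the four nodes mentioned in the thread, which is an assumption made for illustration and not how Cassandra's partitioner or replication actually assigns data.

```python
import hashlib

NODES = 4  # four-node cluster, as described in the thread


def node_for(partition_key: tuple) -> int:
    """Toy stand-in for the partitioner: map a partition key to a node index."""
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES


def nodes_wide_row(student_id: int, subject_ids) -> set:
    # PRIMARY KEY(studentID, subjectID): the partition key is studentID alone.
    return {node_for((student_id,)) for _ in subject_ids}


def nodes_composite(student_id: int, subject_ids) -> set:
    # PRIMARY KEY((studentID, subjectID)): both columns form the partition key.
    return {node_for((student_id, subject_id)) for subject_id in subject_ids}


subjects = range(100)
print(len(nodes_wide_row(42, subjects)))   # always 1: one partition, one owner
print(len(nodes_composite(42, subjects)))  # rows of one student spread over nodes
```

In the first layout a read for one studentID touches a single partition; in the second, the same data is scattered and each (studentID, subjectID) pair has to be looked up by its full key.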
Re: Read performance in map data type
Hi Apoorva,

Do you always query on studentID only, or do you need to query on both studentID and subjectID? Also, I think using the latest driver (2.x) can make querying a large number of rows efficient.

http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0

-- Sourabh Agrawal Bangalore +91 9945657973
Re: Read performance in map data type
Hello Sourabh,

I'd prefer to do a query like

select * from marks_table where studentID = ? and subjectID in (?, ?, ...?)

but if it's costly then I can happily delegate the responsibility to the application layer. I haven't tried the 2.x Java driver for this specific issue, but I tried it once earlier and found the performance slower than 1.x; isn't that so?

-- Thanks & Regards, Apoorva
Read performance in map data type
Hello All,

We have a schema which can be modeled as (studentID, subjectID, marks), where the combination of studentID and subjectID is unique. The number of studentIDs can go up to 100 million, and for each studentID we can have up to 10k subjectIDs.

We are using Apache Cassandra 2.0.4 and DataStax Java driver 1.0.4 on a four-node cluster, each node having 24 cores and 32GB of memory. I'm sure the machines are not underperforming, as on the same test bed we've consistently received 5ms response times for ~1b documents when queried via primary key.

I've tried three approaches, all of which result in significant deterioration (500 ms response time) in read query performance once the number of subjectIDs goes past ~100 for a studentID. The approaches are:

1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int, int>) and query by subjectID
2. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID)) and query as select * from marks_table where studentID = ?
3. model as (studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID)) and query as select * from marks_table where studentID = ? and subjectID in (?, ?, ??), the number of subjectIDs in the query being ~1K.

What could be the bottlenecks? Would it be better to model as (studentID int, subjct_marks_json text) and query by studentID?

-- Thanks & Regards, Apoorva
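For reference, the three approaches and the JSON alternative described above can be written out as CQL roughly as follows. This is only a sketch reconstructed from the message: marks_table and the column names come from the thread, while the table names marks_by_student_map and marks_json are hypothetical, and the IN list is shortened to three bind markers for readability.

```cql
-- Approach 1: one row per student, all marks held in a map collection.
-- (Table name is hypothetical; the thread does not give one.)
CREATE TABLE marks_by_student_map (
    studentID int PRIMARY KEY,
    subjectID_marks_map map<int, int>
);

-- Approaches 2 and 3: one CQL row per (studentID, subjectID) pair.
-- studentID is the partition key, subjectID the clustering column.
CREATE TABLE marks_table (
    studentID int,
    subjectID int,
    marks int,
    PRIMARY KEY (studentID, subjectID)
);

-- Approach 2: read the whole partition for one student.
SELECT * FROM marks_table WHERE studentID = ?;

-- Approach 3: read selected subjects only (~1K bind markers in the thread).
SELECT * FROM marks_table WHERE studentID = ? AND subjectID IN (?, ?, ?);

-- JSON alternative raised at the end of the message: one text blob per student.
-- (Table name is hypothetical.)
CREATE TABLE marks_json (
    studentID int PRIMARY KEY,
    subjct_marks_json text
);
```

The bind markers (?) assume the statements are prepared through the driver, as the queries in the message suggest.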
Re: Read performance in map data type
Hi Apoorva,

I assume this is the table with studentId and subjectId as the primary key columns, and no others (like marks) among them:

    create table marks_table(studentId int, subjectId int, marks int, PRIMARY KEY(studentId,subjectId));

Also, could you give the cfhistograms stats?

    nodetool cfhistograms <your keyspace> marks_table

Thanks,
Shrikar
Re: Read performance in map data type
Hello Shrikar,

Yes, the primary key is (studentID, subjectID). I had dropped the test table; I am recreating and populating it, after which I will share the cfhistograms output.

In such a case, is there any practical limit on the rows I should fetch? For example, should I do

    select * from marks_table where studentID = ? limit 500;

instead of

    select * from marks_table where studentID = ?;

-- Thanks & Regards, Apoorva
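The thread does not settle what a practical limit is. If a bounded read is wanted, one possible pattern is to fetch a fixed-size slice and continue from the last subjectID seen. The sketch below assumes the marks_table schema quoted above, where subjectID is the clustering column and rows within a partition come back ordered by it; the page size of 500 is simply the figure from the question, not a recommendation.

```cql
-- First slice: at most 500 rows for the student, ordered by subjectID.
SELECT * FROM marks_table WHERE studentID = ? LIMIT 500;

-- Following slices: bind the last subjectID returned by the previous slice
-- as the second parameter, and stop once a slice returns fewer than 500 rows.
SELECT * FROM marks_table WHERE studentID = ? AND subjectID > ? LIMIT 500;
```

Each statement touches only one partition, so the application can decide after every slice whether it needs to keep reading.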