Re: Read performance in map data type

2014-04-04 Thread Tyler Hobbs
http://www.datastax.com/documentation/developer/java-driver/2.0/java-driver/tracing_t.html


On Fri, Apr 4, 2014 at 11:34 AM, Apoorva Gaurav
apoorva.gau...@myntra.com wrote:



 On Fri, Apr 4, 2014 at 9:37 PM, Tyler Hobbs ty...@datastax.com wrote:


 On Fri, Apr 4, 2014 at 12:41 AM, Apoorva Gaurav 
 apoorva.gau...@myntra.com wrote:

 If we store the same data as JSON using the text data type, i.e. (studentID
 int, subjectMarksJson text), we are getting a latency of ~10ms from the same
 client, even for bigger data. I understand that JSON is not the preferred
 storage format for Cassandra and that we will lose much of the flexibility
 which a proper tabular approach provides. But such a huge jump in read
 latency is a killer. I'm pastebin-ing the histogram for JSON storage as well:
 http://pastebin.com/RiW6hMb2.
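
For reference, the JSON-as-text layout described above can be sketched roughly as follows. This is only an illustration in Python, not the code used in the test: the sample marks and the helper names encode_marks and decode_marks are made up, and only the column name subjectMarksJson comes from the message.

```python
import json

# Hypothetical in-memory view of one student's marks: subjectID -> marks.
marks_by_subject = {101: 87, 102: 64, 103: 92}


def encode_marks(marks):
    """Serialize the map to the text that would sit in subjectMarksJson."""
    # JSON object keys must be strings, so the int subject IDs are converted.
    return json.dumps({str(k): v for k, v in marks.items()}, sort_keys=True)


def decode_marks(blob):
    """Rebuild the int -> int map on the client after reading the text column."""
    return {int(k): v for k, v in json.loads(blob).items()}


blob = encode_marks(marks_by_subject)
restored = decode_marks(blob)
```

With this layout the server sees a single opaque text value per student, so any lookup or update of one subject has to read and rewrite the whole blob on the client.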


 Can you trace the slow query and paste the results?

 How can I enable that?




 --
 Tyler Hobbs
 DataStax http://datastax.com/




 --
 Thanks & Regards,
 Apoorva




-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: Read performance in map data type

2014-04-03 Thread Apoorva Gaurav
Hello Shrikar,

We are still facing the read latency issue; here is the histogram:
http://pastebin.com/yEvMuHYh


On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav
apoorva.gau...@myntra.com wrote:

 Hello Shrikar,

 Yes, the primary key is (studentID, subjectID). I had dropped the test table;
 I am recreating and populating it, after which I will share the cfhistograms
 output. In that case, is there any practical limit on the rows I should
 fetch? For example, should I do
select * from marks_table where studentID = ? limit 500;
 instead of doing
select * from marks_table where studentID = ?;


 On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak shrika...@gmail.com wrote:

 Hi Apoorva,

 I assume this is the table with studentId and subjectId as the primary key,
 and that no other column, such as marks, is part of it.

 create table marks_table(studentId int, subjectId int, marks int, PRIMARY
 KEY(studentId,subjectId));

 Also, could you give the cfhistograms stats?

 nodetool cfhistograms <your keyspace> marks_table;



 Thanks,
 Shrikar


 On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav 
 apoorva.gau...@myntra.com wrote:

 Hello All,

 We have a schema which can be modeled as (studentID, subjectID, marks),
 where the combination of studentID and subjectID is unique. The number of
 studentIDs can go up to 100 million, and for each studentID we can have up
 to 10k subjectIDs.

 We are using Apache Cassandra 2.0.4 and DataStax Java driver 1.0.4. We
 are using a four-node cluster, each node having 24 cores and 32GB memory. I'm
 sure that the machines are not underperformant, as on the same test bed we've
 consistently received 5ms response times for ~1b documents when queried
 via primary key.

 I've tried three approaches, all of which result in significant
 deterioration (500 ms response time) in read query performance once the
 number of subjectIDs goes past ~100 for a studentID. The approaches are:

 1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int,
 int>) and query by subjectID

 2. model as (studentID int, subjectID int, marks int, PRIMARY
 KEY(studentID, subjectID)) and query as select * from marks_table where
 studentID = ?

 3. model as (studentID int, subjectID int, marks int, PRIMARY
 KEY(studentID, subjectID)) and query as select * from marks_table where
 studentID = ? and subjectID in (?, ?, ??), the number of subjectIDs in
 the query being ~1K.

 What can be the bottlenecks? Is it better if we model as (studentID int,
 subjct_marks_json text) and query by studentID?

 --
 Thanks & Regards,
 Apoorva


 --
 Thanks & Regards,
 Apoorva


-- 
Thanks & Regards,
Apoorva


Re: Read performance in map data type

2014-04-03 Thread Shrikar archak
Hi Apoorva,
As per the cfhistograms output, there are some rows which have more than 75k
columns, and around 150k reads hit 2 SSTables.

Are you sure that you are seeing more than 500ms latency? The cfhistograms
output shows that the worst read performance was around 51ms,
which looks reasonable with many reads hitting 2 SSTables.

Thanks,
Shrikar





Re: Read performance in map data type

2014-04-03 Thread Apoorva Gaurav
At the client side we are getting a latency of ~350ms. We are using
DataStax driver 2.0.0 and have kept the fetch size at 500. These latencies
occur while reading rows having ~200 columns.



-- 
Thanks & Regards,
Apoorva


Re: Read performance in map data type

2014-04-03 Thread Shrikar archak
What about the client-side socket limits, the Cassandra client's maximum
connections per host, and the read consistency level?

~Shrikar





Re: Read performance in map data type

2014-04-03 Thread Apoorva Gaurav
client side socket limit : 64K
client side maximum connection per host : 8
read consistency level : Quorum
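
As a side note on the Quorum setting: a quorum read waits for a majority of the replicas of a partition, so its latency follows the slower of the replicas it has to hear from. A minimal sketch of the arithmetic in Python follows; the replication factor is not stated anywhere in this thread, so the values below are purely illustrative.

```python
def quorum(replication_factor):
    """Replicas that must answer a QUORUM read: a strict majority."""
    return replication_factor // 2 + 1


# Illustrative replication factors only; the thread does not say which is in use.
for rf in (1, 2, 3, 5):
    print(f"RF={rf}: QUORUM waits for {quorum(rf)} replica(s)")
```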







-- 
Thanks & Regards,
Apoorva


Re: Read performance in map data type

2014-04-03 Thread Robert Coli
On Thu, Apr 3, 2014 at 12:20 AM, Apoorva Gaurav
apoorva.gau...@myntra.com wrote:

 At the client side we are getting a latency of ~350ms, we are using
 datastax driver 2.0.0 and have kept the fetch size as 500. And these are
 coming while reading rows having ~200 columns.


And you're sure that the 300ms between what Cassandra reports and what your
app reports is not just network/serialization time?

What do you believe the latency should be?

=Rob


Re: Read performance in map data type

2014-04-02 Thread Apoorva Gaurav
I've observed that reducing the fetch size results in better latency (isn't
that obvious :-)). I tried fetch sizes varying from 100 to 1 and am seeing
a lot of errors for 1. I haven't tried modifying the number of columns.

Let me start a new thread focused on fetch size.





-- 
Thanks & Regards,
Apoorva


Re: Read performance in map data type

2014-04-01 Thread Robert Coli
On Mon, Mar 31, 2014 at 9:13 PM, Apoorva Gaurav
apoorva.gau...@myntra.com wrote:

 Thanks Robert, Is there a workaround, as in our test setups we keep
 dropping and recreating tables.


Use unique keyspace (or table) names for each test? That's the approach
they're taking in 5202...
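
A minimal sketch of that idea in Python, assuming the test harness builds its own CQL strings; the helper name unique_name and the prefixes are hypothetical and not taken from the ticket.

```python
import uuid


def unique_name(prefix):
    """Return an identifier such as 'marks_test_3f2a9c0d1b4e', new on every call."""
    # Twelve hex characters keep the name short while making a clash between
    # test runs very unlikely; the result contains only lowercase letters,
    # digits and underscores.
    return f"{prefix}_{uuid.uuid4().hex[:12]}"


keyspace = unique_name("marks_test")
table = unique_name("marks_table")
create_stmt = (
    f"create table {keyspace}.{table} "
    "(studentID int, subjectID int, marks int, PRIMARY KEY(studentID, subjectID));"
)
```

Each test run then creates and drops names that were never used before, so it never recreates a table that was dropped earlier.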

=Rob


Re: Read performance in map data type

2014-04-01 Thread Apoorva Gaurav
Thanks Sourabh,

I've modelled my table as (studentID int, subjectID int, marks int, PRIMARY
KEY(studentID, subjectID)), as I'll primarily be querying by studentID
and sometimes by studentID and subjectID.

I've tried driver 2.0.0 and it's giving good results. I'm also using its
auto-paging feature. Any idea what a typical value for fetch size should be?
And does the fetch size depend on how many columns there are in the CQL
table? For example, should the fetch size for a table like (studentID int,
subjectID int, marks1 int, marks2 int, marks3 int, ..., marksN int, PRIMARY
KEY(studentID, subjectID)) be less than the fetch size for (studentID int,
subjectID int, marks int, PRIMARY KEY(studentID, subjectID))?






-- 
Thanks & Regards,
Apoorva


Re: Read performance in map data type

2014-04-01 Thread Sourabh Agrawal
From the doc: "The fetch size controls how much resulting rows will be
retrieved simultaneously."
So, I guess it does not depend on the number of columns as such. As all the
columns for a key reside on the same node, I think the number of columns
wouldn't matter much as long as we have enough memory in the app.

Default value is 5000. (com.datastax.driver.core.QueryOptions)
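
To make the trade-off concrete, here is a rough sketch in Python of how the fetch size translates into pages for one large partition. It only models the arithmetic and does not call the driver; the helper names pages_needed and paged are made up for illustration, and the 10k row count is taken from the upper bound on subjectIDs mentioned earlier in the thread.

```python
import math


def pages_needed(total_rows, fetch_size):
    """Number of pages needed to drain a result set at a given fetch size."""
    if fetch_size <= 0:
        raise ValueError("fetch_size must be positive")
    return math.ceil(total_rows / fetch_size)


def paged(rows, fetch_size):
    """Yield rows in chunks of at most fetch_size, one chunk per page."""
    for start in range(0, len(rows), fetch_size):
        yield rows[start:start + fetch_size]


rows = list(range(10_000))  # stand-in for one student with 10k subjectIDs
for size in (100, 500, 5000):
    print(size, pages_needed(len(rows), size))
```

A smaller fetch size returns the first page sooner but needs more pages to read the whole partition; a larger one does the opposite and holds more rows in client memory at once.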

We use it with the default value. I have never profiled cassandra for read
load. If you profile it for different fetch sizes, please share the results
:)






-- 
Sourabh Agrawal
Bangalore
+91 9945657973


Re: Read performance in map data type

2014-03-31 Thread Robert Coli
On Fri, Mar 28, 2014 at 7:41 PM, Apoorva Gaurav
apoorva.gau...@myntra.com wrote:

 Yes primary key is (studentID, subjectID). I had dropped the test table,
 recreating and populating it post which will share the cfhistogram. In such
 case is there any practical limit on the rows I should fetch, for e.g.
 should I do


Until this bug is fixed upstream, dropping and recreating a table may
create unexpected behavior.

https://issues.apache.org/jira/browse/CASSANDRA-5202

=Rob


Re: Read performance in map data type

2014-03-31 Thread Apoorva Gaurav
Thanks Robert. Is there a workaround? In our test setups we keep
dropping and recreating tables.







-- 
Thanks & Regards,
Apoorva


Re: Read performance in map data type

2014-03-30 Thread Sourabh Agrawal
Hi,

I don't think there is a problem with the driver.

Regarding the schema, you may want to choose between wide rows and skinny
rows.
http://stackoverflow.com/questions/19039123/cassandra-wide-vs-skinny-rows-for-large-columns
http://thelastpickle.com/blog/2013/01/11/primary-keys-in-cql.html

When you do (studentID int, subjectID int, marks int, PRIMARY
KEY(studentID, subjectID)), you are partitioning data by studentID (wide
row pattern). So there will be one row for each studentID, and all reads
for a studentID will go to a single node.
http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_compound_keys_c.html

If you do (studentID int, subjectID int, marks int, PRIMARY KEY((studentID,
subjectID))), you are partitioning data by a composite value of both
columns. So read requests will be distributed, but now you cannot query
with only studentID.
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/create_table_r.html#reference_ds_v3f_vfk_xj__compositPart
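
The difference between the two key layouts can be illustrated with a toy placement model in Python. This is a sketch under loud assumptions: the md5-modulo placement stands in for Cassandra's partitioner and token ranges, replication is ignored, and node_for is a made-up helper. Only the four-node cluster size and the two PRIMARY KEY shapes come from the thread.

```python
import hashlib

NODES = 4  # the thread mentions a four-node cluster


def node_for(partition_key):
    """Toy placement: hash the partition key and map it to a node index."""
    digest = hashlib.md5(repr(partition_key).encode()).hexdigest()
    return int(digest, 16) % NODES


student_id = 42
subject_ids = range(1000)

# PRIMARY KEY(studentID, subjectID): the partition key is studentID alone,
# so every subject of this student hashes to the same place.
wide_row_nodes = {node_for((student_id,)) for _ in subject_ids}

# PRIMARY KEY((studentID, subjectID)): the partition key is the pair,
# so the same student's subjects are spread over the cluster.
composite_nodes = {node_for((student_id, s)) for s in subject_ids}

print(len(wide_row_nodes), len(composite_nodes))
```

In the second layout a query by studentID alone has no single partition to go to, which is why it is no longer allowed.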

*You should not use a map because it has various restrictions* :
http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddlWhenCollections.html






-- 
Sourabh Agrawal
Bangalore
+91 9945657973


Re: Read performance in map data type

2014-03-29 Thread Sourabh Agrawal
Hi Apoorva,

Do you always query on studentID only or do you need to query on both
studentID and subjectID?

Also, I think using the latest driver (2.x) can make querying a large number
of rows more efficient.
http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0







-- 
Sourabh Agrawal
Bangalore
+91 9945657973


Re: Read performance in map data type

2014-03-29 Thread Apoorva Gaurav
Hello Sourabh,

I'd prefer to run a query like select * from marks_table where studentID = ?
and subjectID in (?, ?, ..., ?), but if it's costly then I can happily
delegate that responsibility to the application layer.

I haven't tried the 2.x Java driver for this specific issue, but I tried it
once earlier and found it slower than 1.x. Is that not the case?
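
A minimal sketch of what delegating the subjectID filter to the application
layer might look like, assuming the rows for one studentID have already been
read into a map of subjectID to marks. The names MarksFilter and
filterBySubject are hypothetical and not part of any driver API; reading the
rows from Cassandra is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: keeps only the requested subjects from a student's full result.
class MarksFilter {
    // allMarks: subjectID -> marks, as returned by
    //           "select * from marks_table where studentID = ?"
    // wanted:   the subjectIDs the caller actually needs
    static Map<Integer, Integer> filterBySubject(Map<Integer, Integer> allMarks,
                                                 Set<Integer> wanted) {
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> row : allMarks.entrySet()) {
            if (wanted.contains(row.getKey())) {
                result.put(row.getKey(), row.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for one student's rows.
        Map<Integer, Integer> allMarks = new LinkedHashMap<>();
        allMarks.put(101, 78);
        allMarks.put(102, 91);
        allMarks.put(103, 64);
        Set<Integer> wanted = new HashSet<>(Arrays.asList(101, 103));
        System.out.println(filterBySubject(allMarks, wanted));
    }
}
```

The trade-off is that the whole partition for the studentID crosses the
network before filtering, so this only makes sense if that read is cheap
enough.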


On Sat, Mar 29, 2014 at 3:30 PM, Sourabh Agrawal iitr.sour...@gmail.com wrote:

 Hi Apoorva,

 Do you always query on studentID only, or do you need to query on both
 studentID and subjectID?

 Also, I think using the latest driver (2.x) can make querying a large number
 of rows more efficient.
 http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0






-- 
Thanks & Regards,
Apoorva


Read performance in map data type

2014-03-28 Thread Apoorva Gaurav
Hello All,

We have a schema which can be modeled as (studentID, subjectID, marks), where
the combination of studentID and subjectID is unique. The number of
studentIDs can go up to 100 million, and for each studentID we can have up to
10k subjectIDs.

We are using Apache Cassandra 2.0.4 and the DataStax Java driver 1.0.4 on a
four-node cluster, each node having 24 cores and 32GB memory. I'm sure that
the machines are not underpowered, as on the same test bed we've consistently
received 5ms response times for ~1b documents when queried via primary key.

I've tried three approaches, all of which result in significant
deterioration (500 ms response time) in read query performance once the
number of subjectIDs goes past ~100 for a studentID. The approaches are:

1. model as (studentID int PRIMARY KEY, subjectID_marks_map map<int, int>)
and query by subjectID

2. model as (studentID int, subjectID int, marks int, PRIMARY
KEY(studentID, subjectID)) and query as select * from marks_table where
studentID = ?

3. model as (studentID int, subjectID int, marks int, PRIMARY
KEY(studentID, subjectID)) and query as select * from marks_table where
studentID = ? and subjectID in (?, ?, ..., ?), the number of subjectIDs in
the query being ~1K.

What could the bottlenecks be? Would it be better to model this as
(studentID int, subject_marks_json text) and query by studentID?
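
For approach 3 above, the statement needs one bind marker per subjectID. The
sketch below shows one way to generate that CQL text and, as an untested
idea, to split a ~1K list of subjectIDs into smaller groups so that each
query carries a shorter IN list. InQueryBuilder, buildInQuery and chunk are
hypothetical names; whether shorter IN lists reduce latency on this cluster
is an assumption, not something measured in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for approach 3; it only builds strings and lists,
// it does not talk to Cassandra.
class InQueryBuilder {
    // Builds the CQL text with n bind markers in the IN clause.
    static String buildInQuery(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("need at least one subjectID");
        }
        String markers = String.join(", ", Collections.nCopies(n, "?"));
        return "select * from marks_table where studentID = ? and subjectID in ("
                + markers + ")";
    }

    // Splits a long subjectID list into consecutive chunks of at most chunkSize.
    static List<List<Integer>> chunk(List<Integer> subjectIds, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<Integer>> chunks = new ArrayList<>();
        for (int i = 0; i < subjectIds.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, subjectIds.size());
            chunks.add(new ArrayList<>(subjectIds.subList(i, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(buildInQuery(3));
    }
}
```

Each chunk would be bound to a statement built with buildInQuery(chunk.size())
and the results merged on the client.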

-- 
Thanks & Regards,
Apoorva


Re: Read performance in map data type

2014-03-28 Thread Shrikar archak
Hi Apoorva,

I assume this is the table with studentId and subjectId as the primary key
columns, and nothing else (such as marks) in the key:

create table marks_table(studentId int, subjectId int, marks int, PRIMARY
KEY(studentId,subjectId));

Also, could you share the cfhistograms stats?

nodetool cfhistograms <your keyspace> marks_table;



Thanks,
Shrikar




Re: Read performance in map data type

2014-03-28 Thread Apoorva Gaurav
Hello Shrikar,

Yes, the primary key is (studentID, subjectID). I had dropped the test table
and am recreating and populating it, after which I will share the
cfhistograms output. In that case, is there any practical limit on the rows
I should fetch? E.g., should I do
   select * from marks_table where studentID = ? limit 500;
instead of doing
   select * from marks_table where studentID = ?;
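
One way to bound the rows per request, sketched under assumptions: read the
partition in pages by adding a range on the clustering column, i.e.
select * from marks_table where studentID = ? and subjectID > ? limit 500,
passing the last subjectID of the previous page. The Java below only
simulates that loop against an in-memory TreeMap standing in for one
student's partition; PartitionPager, fetchPage and countPages are
hypothetical names, and no driver calls are shown.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class PartitionPager {
    // Stand-in for:
    //   select * from marks_table where studentID = ? and subjectID > ? limit ?
    // lastSubjectId == null means "start from the beginning of the partition".
    static List<Map.Entry<Integer, Integer>> fetchPage(
            NavigableMap<Integer, Integer> partition, Integer lastSubjectId, int limit) {
        NavigableMap<Integer, Integer> rest =
                lastSubjectId == null ? partition : partition.tailMap(lastSubjectId, false);
        List<Map.Entry<Integer, Integer>> page = new ArrayList<>();
        for (Map.Entry<Integer, Integer> row : rest.entrySet()) {
            if (page.size() == limit) {
                break;
            }
            page.add(new AbstractMap.SimpleImmutableEntry<Integer, Integer>(row));
        }
        return page;
    }

    // Walks the whole partition page by page and returns how many non-empty pages were read.
    static int countPages(NavigableMap<Integer, Integer> partition, int limit) {
        int pages = 0;
        Integer last = null;
        while (true) {
            List<Map.Entry<Integer, Integer>> page = fetchPage(partition, last, limit);
            if (page.isEmpty()) {
                break;
            }
            pages++;
            last = page.get(page.size() - 1).getKey();
            if (page.size() < limit) {
                break; // a short page means the partition is exhausted
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Made-up partition: 1200 subjects for one student.
        NavigableMap<Integer, Integer> partition = new TreeMap<>();
        for (int subjectId = 1; subjectId <= 1200; subjectId++) {
            partition.put(subjectId, subjectId % 100);
        }
        System.out.println("pages read: " + countPages(partition, 500));
    }
}
```

A plain limit 500 without the subjectID range would return only the first
500 rows of the partition, so the range condition is what lets later pages
be fetched.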


-- 
Thanks & Regards,
Apoorva