Re: Read performance in map data type

2014-04-03 Thread Apoorva Gaurav
Hello Shrikar,

We are still facing read latency issue, here is the histogram
http://pastebin.com/yEvMuHYh


On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav
apoorva.gau...@myntra.comwrote:

 Hello Shrikar,

 Yes primary key is (studentID, subjectID). I had dropped the test table,
 recreating and populating it post which will share the cfhistogram. In such
 case is there any practical limit on the rows I should fetch, for e.g.
 should I do
select * form marks_table where studentID = ? limit 500;
 instead of doing
select * form marks_table where studentID = ?;


 On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak shrika...@gmail.comwrote:

 Hi Apoorva,

 I assume this is the table with studentId and subjectId  as primary keys
 and not other like like marks in that.

 create table marks_table(studentId int, subjectId int, marks int, PRIMARY
 KEY(studentId,subjectId));

 Also could you give the cfhistogram stats?

 nodetool cfhistograms your keyspace marks_table;



 Thanks,
 Shrikar


 On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav 
 apoorva.gau...@myntra.com wrote:

 Hello All,

 We've a schema which can be modeled as (studentID, subjectID, marks)
 where combination of studentID and subjectID is unique. Number of studentID
 can go up to 100 million and for each studentID we can have up to  10k
 subjectIDs.

 We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We
 are using a four node cluster, each having 24 cores and 32GB memory. I'm
 sure that the machines are not underperformant as on same test bed we've
 consistently received 5ms response times for ~1b documents when queried
 via primary key.

 I've tried three approaches, all of which result in significant
 deterioration (500 ms response time) in read query performance once number
 of subjectIDs goes past ~100 for a studentID. Approaches are :-

 1. model as (studentID int PRIMARY KEY, subjectID_marks_map mapint,
 int) and query by subjectID

 2. model as (studentID int, subjectID int, marks int, PRIMARY
 KEY(studentID, subjectID) and query as select * from marks_table where
 studentID = ?

 3. model as (studentID int, subjectID int, marks int, PRIMARY
 KEY(studentID, subjectID) and query as select * from marks_table where
 studentID = ? and subjectID in (?, ?, ??)  number of subjectIDs in
 query being ~1K.

 What can be the bottlenecks. Is it better if we model as (studentID int,
 subjct_marks_json text) and query by studentID.

 --
 Thanks  Regards,
 Apoorva





 --
 Thanks  Regards,
 Apoorva




-- 
Thanks  Regards,
Apoorva


Re: Read performance in map data type

2014-04-03 Thread Apoorva Gaurav
At the client side we are getting a latency of ~350ms, we are using
datastax driver 2.0.0 and have kept the fetch size as 500. And these are
coming while reading rows having ~200 columns.


On Thu, Apr 3, 2014 at 12:45 PM, Shrikar archak shrika...@gmail.com wrote:

 Hi Apoorva,
 As per the cfhistogram there are some rows which have more than 75k
 columns and around 150k reads hit 2 SStables.

 Are you sure that you are seeing more than 500ms latency?  The cfhistogram
 should the worst read performance was around 51ms
 which looks reasonable with many reads hitting 2 sstables.

 Thanks,
 Shrikar


 On Wed, Apr 2, 2014 at 11:30 PM, Apoorva Gaurav apoorva.gau...@myntra.com
  wrote:

 Hello Shrikar,

 We are still facing read latency issue, here is the histogram
 http://pastebin.com/yEvMuHYh


 On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav 
 apoorva.gau...@myntra.com wrote:

 Hello Shrikar,

 Yes primary key is (studentID, subjectID). I had dropped the test table,
 recreating and populating it post which will share the cfhistogram. In such
 case is there any practical limit on the rows I should fetch, for e.g.
 should I do
select * form marks_table where studentID = ? limit 500;
 instead of doing
select * form marks_table where studentID = ?;


 On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak shrika...@gmail.comwrote:

 Hi Apoorva,

 I assume this is the table with studentId and subjectId  as primary
 keys and not other like like marks in that.

 create table marks_table(studentId int, subjectId int, marks int,
 PRIMARY KEY(studentId,subjectId));

 Also could you give the cfhistogram stats?

 nodetool cfhistograms your keyspace marks_table;



 Thanks,
 Shrikar


 On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav 
 apoorva.gau...@myntra.com wrote:

 Hello All,

 We've a schema which can be modeled as (studentID, subjectID, marks)
 where combination of studentID and subjectID is unique. Number of 
 studentID
 can go up to 100 million and for each studentID we can have up to  10k
 subjectIDs.

 We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We
 are using a four node cluster, each having 24 cores and 32GB memory. I'm
 sure that the machines are not underperformant as on same test bed we've
 consistently received 5ms response times for ~1b documents when queried
 via primary key.

 I've tried three approaches, all of which result in significant
 deterioration (500 ms response time) in read query performance once 
 number
 of subjectIDs goes past ~100 for a studentID. Approaches are :-

 1. model as (studentID int PRIMARY KEY, subjectID_marks_map mapint,
 int) and query by subjectID

 2. model as (studentID int, subjectID int, marks int, PRIMARY
 KEY(studentID, subjectID) and query as select * from marks_table where
 studentID = ?

 3. model as (studentID int, subjectID int, marks int, PRIMARY
 KEY(studentID, subjectID) and query as select * from marks_table where
 studentID = ? and subjectID in (?, ?, ??)  number of subjectIDs in
 query being ~1K.

 What can be the bottlenecks. Is it better if we model as (studentID
 int, subjct_marks_json text) and query by studentID.

 --
 Thanks  Regards,
 Apoorva





 --
 Thanks  Regards,
 Apoorva




 --
 Thanks  Regards,
 Apoorva





-- 
Thanks  Regards,
Apoorva


Re: Read performance in map data type

2014-04-03 Thread Apoorva Gaurav
client side socket limit : 64K
client side maximum connection per host : 8
read consistency level : Quorum


On Thu, Apr 3, 2014 at 12:59 PM, Shrikar archak shrika...@gmail.com wrote:

 How about the client side socket limits? Cassandra client side maximum
 connection per host and read consistency level?

 ~Shrikar


 On Thu, Apr 3, 2014 at 12:20 AM, Apoorva Gaurav apoorva.gau...@myntra.com
  wrote:

 At the client side we are getting a latency of ~350ms, we are using
 datastax driver 2.0.0 and have kept the fetch size as 500. And these are
 coming while reading rows having ~200 columns.


 On Thu, Apr 3, 2014 at 12:45 PM, Shrikar archak shrika...@gmail.comwrote:

 Hi Apoorva,
 As per the cfhistogram there are some rows which have more than 75k
 columns and around 150k reads hit 2 SStables.

 Are you sure that you are seeing more than 500ms latency?  The
 cfhistogram should the worst read performance was around 51ms
 which looks reasonable with many reads hitting 2 sstables.

 Thanks,
 Shrikar


 On Wed, Apr 2, 2014 at 11:30 PM, Apoorva Gaurav 
 apoorva.gau...@myntra.com wrote:

 Hello Shrikar,

 We are still facing read latency issue, here is the histogram
 http://pastebin.com/yEvMuHYh


 On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav 
 apoorva.gau...@myntra.com wrote:

 Hello Shrikar,

 Yes primary key is (studentID, subjectID). I had dropped the test
 table, recreating and populating it post which will share the cfhistogram.
 In such case is there any practical limit on the rows I should fetch, for
 e.g.
 should I do
select * form marks_table where studentID = ? limit 500;
 instead of doing
select * form marks_table where studentID = ?;


 On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak 
 shrika...@gmail.comwrote:

 Hi Apoorva,

 I assume this is the table with studentId and subjectId  as primary
 keys and not other like like marks in that.

 create table marks_table(studentId int, subjectId int, marks int,
 PRIMARY KEY(studentId,subjectId));

 Also could you give the cfhistogram stats?

 nodetool cfhistograms your keyspace marks_table;



 Thanks,
 Shrikar


 On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav 
 apoorva.gau...@myntra.com wrote:

 Hello All,

 We've a schema which can be modeled as (studentID, subjectID, marks)
 where combination of studentID and subjectID is unique. Number of 
 studentID
 can go up to 100 million and for each studentID we can have up to  10k
 subjectIDs.

 We are using apahce cassandra 2.0.4 and datastax java driver
 1.0.4. We are using a four node cluster, each having 24 cores and 32GB
 memory. I'm sure that the machines are not underperformant as on same 
 test
 bed we've consistently received 5ms response times for ~1b documents 
 when
 queried via primary key.

 I've tried three approaches, all of which result in significant
 deterioration (500 ms response time) in read query performance once 
 number
 of subjectIDs goes past ~100 for a studentID. Approaches are :-

 1. model as (studentID int PRIMARY KEY, subjectID_marks_map mapint,
 int) and query by subjectID

 2. model as (studentID int, subjectID int, marks int, PRIMARY
 KEY(studentID, subjectID) and query as select * from marks_table where
 studentID = ?

 3. model as (studentID int, subjectID int, marks int, PRIMARY
 KEY(studentID, subjectID) and query as select * from marks_table where
 studentID = ? and subjectID in (?, ?, ??)  number of subjectIDs in
 query being ~1K.

 What can be the bottlenecks. Is it better if we model as (studentID
 int, subjct_marks_json text) and query by studentID.

 --
 Thanks  Regards,
 Apoorva





 --
 Thanks  Regards,
 Apoorva




 --
 Thanks  Regards,
 Apoorva





 --
 Thanks  Regards,
 Apoorva





-- 
Thanks  Regards,
Apoorva


Re: Read performance in map data type

2014-04-02 Thread Apoorva Gaurav
I've observed that reducing fetch size results in better latency (isn't
that obvious :-)), tried from fetch size varying from 100 to 1, seeing
a lot of errors for 1. Haven't tried modifying the number of columns.

Let me start a new thread focused on fetch size.


On Wed, Apr 2, 2014 at 9:53 AM, Sourabh Agrawal iitr.sour...@gmail.comwrote:

 From the doc : The fetch size controls how much resulting rows will be
 retrieved simultaneously.
 So, I guess it does not depend on the number of columns as such. As all
 the columns for a key reside on the same node, I think it wouldn't matter
 much whatever be the number of columns as long as we have enough memory in
 the app.

 Default value is 5000. (com.datastax.driver.core.QueryOptions)

 We use it with the default value. I have never profiled cassandra for read
 load. If you profile it for different fetch sizes, please share the results
 :)


 On Wed, Apr 2, 2014 at 8:45 AM, Apoorva Gaurav 
 apoorva.gau...@myntra.comwrote:

 Thanks Sourabh,

 I've modelled my table as studentID int, subjectID int, marks int,
 PRIMARY KEY(studentID, subjectID) as primarily I'll be querying using
 studentID and sometime using studentID and subjectID.

 I've tried driver 2.0.0 and its giving good results. Also using its auto
 paging feature. Any idea what should be a typical value for fetch size. And
 does the fetch size depends on how many columns are there in the CQL table
 for e.g. should fetch size in a table like studentID int, subjectID
 int, marks1 int, marks2 int, marks3 int marksN int PRIMARY
 KEY(studentID, subjectID) be less than fetch size in studentID int,
 subjectID int, marks int, PRIMARY KEY(studentID, subjectID)


 On Wed, Apr 2, 2014 at 2:20 AM, Robert Coli rc...@eventbrite.com wrote:

  On Mon, Mar 31, 2014 at 9:13 PM, Apoorva Gaurav 
 apoorva.gau...@myntra.com wrote:

 Thanks Robert, Is there a workaround, as in our test setups we keep
 dropping and recreating tables.


 Use unique keyspace (or table) names for each test? That's the approach
 they're taking in 5202...

 =Rob




 --
 Thanks  Regards,
 Apoorva




 --
 Sourabh Agrawal
 Bangalore
 +91 9945657973




-- 
Thanks  Regards,
Apoorva


optimum fetch size in datastax driver

2014-04-02 Thread Apoorva Gaurav
Hello All,

We have a schema which can be modelled as *(studentID int, subjectID int,
marks int, PRIMARY KEY(studentID, subjectID)*. There can be ~1M studentIDs
and for each studentID there can be ~10K subjectIDs. The queries can be
using studentID and studentID-subjectID We have a 3 node (each having 24
cores) apache cassandra 2.0.4 cluster and are using datastax driver 2.0.0
to interact with it using its automatic paging feature. I've tried
various fetch
sizes varying from 100 to 10K and observed that read latency increases with
fetch size (which looks obvious). At around 10K there are a lot of errors.
Want to understand :-

   - Is there a rule of thumb for deciding on the optimum fetch size (
   *com.datastax.driver.core.Statement.setFetchSize()* ).
   - Does cassandra keeps the entire result in cache and only returns the
   rows corresponding to the fetch size or it treats subsequent as new queries
   ( *com.datastax.driver.core.**ResultSet.fetchMoreResults() *)
   - Whether the optimum fetch size depends on number of columns in CQL
   table for e.g. should fetch size in a table like ***studentID int,
   subjectID int, marks1 int, marks2 int, marks3 int marksN int PRIMARY
   KEY(studentID, subjectID)* be less than fetch size in *studentID int,
   subjectID int, marks int, PRIMARY KEY(studentID, subjectID)*


-- 
Thanks  Regards,
Apoorva


Dead node appearing in datastax driver

2014-04-01 Thread Apoorva Gaurav
Hello All,

We had a 4 node cassandra 2.0.4 cluster  ( lets call them host1, host2,
host3 and host4), out of which we've removed one node (host4) using
nodetool removenode command. Now using nodetool status or nodetool ring we
no longer see host4. It's also not appearing in Datastax opscenter. But its
intermittently appearing in Metadata.getAllHosts() while connecting using
datastax driver 1.0.4.

Couple of questions :-
-How is it appearing.
-Can this have impact on read / write performance of client.

Code which we are using to connect is

 public void connect() {

PoolingOptions poolingOptions = new PoolingOptions();

cluster = Cluster.builder()

.addContactPoints(inetAddresses.toArray(new String[]{}))

.withLoadBalancingPolicy(new RoundRobinPolicy())

.withPoolingOptions(poolingOptions)

.withPort(port)

.withCredentials(username, password)

.build();

Metadata metadata = cluster.getMetadata();

System.out.printf(Connected to cluster: %s\n,
metadata.getClusterName());

for (Host host : metadata.getAllHosts()) {

System.out.printf(Datacenter: %s; Host: %s; Rack: %s\n,
host.getDatacenter(), host.getAddress(), host.getRack());

}

}



-- 
Thanks  Regards,
Apoorva


Re: Dead node appearing in datastax driver

2014-04-01 Thread Apoorva Gaurav
Hello Sylvian,

Queried system.peers on three live nodes and host4 is appearing on two of
these.


On Tue, Apr 1, 2014 at 5:06 PM, Sylvain Lebresne sylv...@datastax.comwrote:

 On Tue, Apr 1, 2014 at 12:50 PM, Apoorva Gaurav apoorva.gau...@myntra.com
  wrote:

 Hello All,

 We had a 4 node cassandra 2.0.4 cluster  ( lets call them host1, host2,
 host3 and host4), out of which we've removed one node (host4) using
 nodetool removenode command. Now using nodetool status or nodetool ring we
 no longer see host4. It's also not appearing in Datastax opscenter. But its
 intermittently appearing in Metadata.getAllHosts() while connecting using
 datastax driver 1.0.4.

 Couple of questions :-
 -How is it appearing.


 Not sure. Can you try querying the peers system table on each of your
 nodes (with cqlsh: SELECT * FROM system.peers) and see if the host4 is
 still mentioned somewhere?


 -Can this have impact on read / write performance of client.


 No. If the host doesn't exists, the driver might try to reconnect to it at
 times, but since it won't be able to, it won't try to use it for reads and
 writes. That does mean you might have a reconnection task running with some
 regularity, but 1) it's not on the write/read path of queries and 2)
 provided you've left the default reconnection policy, this will happen once
 every 10 minutes and will be pretty cheap so that it will consume an
 completely negligible amount of ressources. That doesn't mean I'm not
 interested tracking down why that happens in the first place though.

 --
 Sylvain




 Code which we are using to connect is

  public void connect() {

 PoolingOptions poolingOptions = new PoolingOptions();

 cluster = Cluster.builder()

 .addContactPoints(inetAddresses.toArray(new String[]{}))

 .withLoadBalancingPolicy(new RoundRobinPolicy())

 .withPoolingOptions(poolingOptions)

 .withPort(port)

 .withCredentials(username, password)

 .build();

 Metadata metadata = cluster.getMetadata();

 System.out.printf(Connected to cluster: %s\n,
 metadata.getClusterName());

 for (Host host : metadata.getAllHosts()) {

 System.out.printf(Datacenter: %s; Host: %s; Rack: %s\n,
 host.getDatacenter(), host.getAddress(), host.getRack());

 }

 }



 --
 Thanks  Regards,
 Apoorva





-- 
Thanks  Regards,
Apoorva


Re: Dead node appearing in datastax driver

2014-04-01 Thread Apoorva Gaurav
Did that and I actually see a significant reduction in write latency.


On Tue, Apr 1, 2014 at 5:35 PM, Sylvain Lebresne sylv...@datastax.comwrote:

 On Tue, Apr 1, 2014 at 1:49 PM, Apoorva Gaurav 
 apoorva.gau...@myntra.comwrote:

 Hello Sylvian,

 Queried system.peers on three live nodes and host4 is appearing on two of
 these.


 That's why the driver thinks they are still there. You're most probably
 running into https://issues.apache.org/jira/browse/CASSANDRA-6053 since
 you are on C* 2.0.4. As said, this is relatively harmless, but you should
 think about upgrading to 2.0.6 to fix it in the future (you could manually
 remove the bad entries in System.peers in the meantime if you want, they
 are really just leftover that shouldn't be here).

 --
 Sylvain



 On Tue, Apr 1, 2014 at 5:06 PM, Sylvain Lebresne sylv...@datastax.comwrote:

 On Tue, Apr 1, 2014 at 12:50 PM, Apoorva Gaurav 
 apoorva.gau...@myntra.com wrote:

 Hello All,

 We had a 4 node cassandra 2.0.4 cluster  ( lets call them host1, host2,
 host3 and host4), out of which we've removed one node (host4) using
 nodetool removenode command. Now using nodetool status or nodetool ring we
 no longer see host4. It's also not appearing in Datastax opscenter. But its
 intermittently appearing in Metadata.getAllHosts() while connecting using
 datastax driver 1.0.4.

 Couple of questions :-
 -How is it appearing.


 Not sure. Can you try querying the peers system table on each of your
 nodes (with cqlsh: SELECT * FROM system.peers) and see if the host4 is
 still mentioned somewhere?


 -Can this have impact on read / write performance of client.


 No. If the host doesn't exists, the driver might try to reconnect to it
 at times, but since it won't be able to, it won't try to use it for reads
 and writes. That does mean you might have a reconnection task running with
 some regularity, but 1) it's not on the write/read path of queries and 2)
 provided you've left the default reconnection policy, this will happen once
 every 10 minutes and will be pretty cheap so that it will consume an
 completely negligible amount of ressources. That doesn't mean I'm not
 interested tracking down why that happens in the first place though.

 --
 Sylvain




 Code which we are using to connect is

  public void connect() {

 PoolingOptions poolingOptions = new PoolingOptions();

 cluster = Cluster.builder()

 .addContactPoints(inetAddresses.toArray(newString[]{}))

 .withLoadBalancingPolicy(new RoundRobinPolicy())

 .withPoolingOptions(poolingOptions)

 .withPort(port)

 .withCredentials(username, password)

 .build();

 Metadata metadata = cluster.getMetadata();

 System.out.printf(Connected to cluster: %s\n,
 metadata.getClusterName());

 for (Host host : metadata.getAllHosts()) {

 System.out.printf(Datacenter: %s; Host: %s; Rack: %s\n,
 host.getDatacenter(), host.getAddress(), host.getRack());

 }

 }



 --
 Thanks  Regards,
 Apoorva





 --
 Thanks  Regards,
 Apoorva





-- 
Thanks  Regards,
Apoorva


Re: Dead node appearing in datastax driver

2014-04-01 Thread Apoorva Gaurav
I manually removed entries from System.peers.

The improvements can well be coincidental as various other apps were also
running on the same test bed.


On Tue, Apr 1, 2014 at 8:43 PM, Sylvain Lebresne sylv...@datastax.comwrote:

 What does Did that mean? Does that means I upgraded to 2.0.6, or does
 that mean I manually removed entries from System.peers. If the latter,
 I'd need more info on what you did exactly, what your peers table looked
 like before and how they look like now: there is no reason deleting the
 peers entries for hosts that at not part of the cluster anymore would have
 anything to do with write latency (but if say you've removed wrong entries,
 that might have make the driver think some live host had been removed and
 if the drivers has less nodes to use to dispatch queries, that might impact
 latency I suppose -- at least that's the only related thing I can think of).

 --
 Sylvain


 On Tue, Apr 1, 2014 at 2:44 PM, Apoorva Gaurav 
 apoorva.gau...@myntra.comwrote:

 Did that and I actually see a significant reduction in write latency.


 On Tue, Apr 1, 2014 at 5:35 PM, Sylvain Lebresne sylv...@datastax.comwrote:

 On Tue, Apr 1, 2014 at 1:49 PM, Apoorva Gaurav 
 apoorva.gau...@myntra.com wrote:

 Hello Sylvian,

 Queried system.peers on three live nodes and host4 is appearing on two
 of these.


 That's why the driver thinks they are still there. You're most probably
 running into https://issues.apache.org/jira/browse/CASSANDRA-6053 since
 you are on C* 2.0.4. As said, this is relatively harmless, but you should
 think about upgrading to 2.0.6 to fix it in the future (you could manually
 remove the bad entries in System.peers in the meantime if you want, they
 are really just leftover that shouldn't be here).

 --
 Sylvain



 On Tue, Apr 1, 2014 at 5:06 PM, Sylvain Lebresne 
 sylv...@datastax.comwrote:

 On Tue, Apr 1, 2014 at 12:50 PM, Apoorva Gaurav 
 apoorva.gau...@myntra.com wrote:

 Hello All,

 We had a 4 node cassandra 2.0.4 cluster  ( lets call them host1,
 host2, host3 and host4), out of which we've removed one node (host4) 
 using
 nodetool removenode command. Now using nodetool status or nodetool ring 
 we
 no longer see host4. It's also not appearing in Datastax opscenter. But 
 its
 intermittently appearing in Metadata.getAllHosts() while connecting using
 datastax driver 1.0.4.

 Couple of questions :-
 -How is it appearing.


 Not sure. Can you try querying the peers system table on each of your
 nodes (with cqlsh: SELECT * FROM system.peers) and see if the host4 is
 still mentioned somewhere?


 -Can this have impact on read / write performance of client.


 No. If the host doesn't exists, the driver might try to reconnect to
 it at times, but since it won't be able to, it won't try to use it for
 reads and writes. That does mean you might have a reconnection task 
 running
 with some regularity, but 1) it's not on the write/read path of queries 
 and
 2) provided you've left the default reconnection policy, this will happen
 once every 10 minutes and will be pretty cheap so that it will consume an
 completely negligible amount of ressources. That doesn't mean I'm not
 interested tracking down why that happens in the first place though.

 --
 Sylvain




 Code which we are using to connect is

  public void connect() {

 PoolingOptions poolingOptions = new PoolingOptions();

 cluster = Cluster.builder()

 .addContactPoints(inetAddresses.toArray(newString[]{}))

 .withLoadBalancingPolicy(new RoundRobinPolicy())

 .withPoolingOptions(poolingOptions)

 .withPort(port)

 .withCredentials(username, password)

 .build();

 Metadata metadata = cluster.getMetadata();

 System.out.printf(Connected to cluster: %s\n,
 metadata.getClusterName());

 for (Host host : metadata.getAllHosts()) {

 System.out.printf(Datacenter: %s; Host: %s; Rack: %s\n,
 host.getDatacenter(), host.getAddress(), host.getRack());

 }

 }



 --
 Thanks  Regards,
 Apoorva





 --
 Thanks  Regards,
 Apoorva





 --
 Thanks  Regards,
 Apoorva





-- 
Thanks  Regards,
Apoorva


Re: Read performance in map data type

2014-04-01 Thread Apoorva Gaurav
Thanks Sourabh,

I've modelled my table as studentID int, subjectID int, marks int, PRIMARY
KEY(studentID, subjectID) as primarily I'll be querying using studentID
and sometime using studentID and subjectID.

I've tried driver 2.0.0 and its giving good results. Also using its auto
paging feature. Any idea what should be a typical value for fetch size. And
does the fetch size depends on how many columns are there in the CQL table
for e.g. should fetch size in a table like studentID int, subjectID int,
marks1 int, marks2 int, marks3 int marksN int PRIMARY KEY(studentID,
subjectID) be less than fetch size in studentID int, subjectID int, marks
int, PRIMARY KEY(studentID, subjectID)


On Wed, Apr 2, 2014 at 2:20 AM, Robert Coli rc...@eventbrite.com wrote:

  On Mon, Mar 31, 2014 at 9:13 PM, Apoorva Gaurav 
 apoorva.gau...@myntra.com wrote:

 Thanks Robert, Is there a workaround, as in our test setups we keep
 dropping and recreating tables.


 Use unique keyspace (or table) names for each test? That's the approach
 they're taking in 5202...

 =Rob




-- 
Thanks  Regards,
Apoorva


Re: Read performance in map data type

2014-03-31 Thread Apoorva Gaurav
Thanks Robert, Is there a workaround, as in our test setups we keep
dropping and recreating tables.


On Mon, Mar 31, 2014 at 11:51 PM, Robert Coli rc...@eventbrite.com wrote:

 On Fri, Mar 28, 2014 at 7:41 PM, Apoorva Gaurav apoorva.gau...@myntra.com
  wrote:

 Yes primary key is (studentID, subjectID). I had dropped the test table,
 recreating and populating it post which will share the cfhistogram. In such
 case is there any practical limit on the rows I should fetch, for e.g.
 should I do


 Until this bug is fixed upstream, dropping and recreating a table may
 create unexpected behavior.

 https://issues.apache.org/jira/browse/CASSANDRA-5202

 =Rob





-- 
Thanks  Regards,
Apoorva


Re: Read performance in map data type

2014-03-29 Thread Apoorva Gaurav
Hello Sourabh,

I'd prefer to do query like select * from marks_table where studentID = ?
and subjectID in (?, ?, ??) but if its costly then can happily delegate
the responsibility to the application layer.

Haven't tried 2.x java driver for this specific issue but tried it once
earlier and found the performance slower than 1.x; isn't so?


On Sat, Mar 29, 2014 at 3:30 PM, Sourabh Agrawal iitr.sour...@gmail.comwrote:

 Hi Apoorva,

 Do you always query on studentID only or do you need to query on both
 studentID and subjectID?

 Also, I think using the latest driver (2.x) can make querying large number
 of rows efficient.
 http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0




 On Sat, Mar 29, 2014 at 8:11 AM, Apoorva Gaurav apoorva.gau...@myntra.com
  wrote:

 Hello Shrikar,

 Yes primary key is (studentID, subjectID). I had dropped the test table,
 recreating and populating it post which will share the cfhistogram. In such
 case is there any practical limit on the rows I should fetch, for e.g.
 should I do
select * form marks_table where studentID = ? limit 500;
 instead of doing
select * form marks_table where studentID = ?;


 On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak shrika...@gmail.comwrote:

 Hi Apoorva,

 I assume this is the table with studentId and subjectId  as primary keys
 and not other like like marks in that.

 create table marks_table(studentId int, subjectId int, marks int,
 PRIMARY KEY(studentId,subjectId));

 Also could you give the cfhistogram stats?

 nodetool cfhistograms your keyspace marks_table;



 Thanks,
 Shrikar


 On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav 
 apoorva.gau...@myntra.com wrote:

 Hello All,

 We've a schema which can be modeled as (studentID, subjectID, marks)
 where combination of studentID and subjectID is unique. Number of studentID
 can go up to 100 million and for each studentID we can have up to  10k
 subjectIDs.

 We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We
 are using a four node cluster, each having 24 cores and 32GB memory. I'm
 sure that the machines are not underperformant as on same test bed we've
 consistently received 5ms response times for ~1b documents when queried
 via primary key.

 I've tried three approaches, all of which result in significant
 deterioration (500 ms response time) in read query performance once number
 of subjectIDs goes past ~100 for a studentID. Approaches are :-

 1. model as (studentID int PRIMARY KEY, subjectID_marks_map mapint,
 int) and query by subjectID

 2. model as (studentID int, subjectID int, marks int, PRIMARY
 KEY(studentID, subjectID) and query as select * from marks_table where
 studentID = ?

 3. model as (studentID int, subjectID int, marks int, PRIMARY
 KEY(studentID, subjectID) and query as select * from marks_table where
 studentID = ? and subjectID in (?, ?, ??)  number of subjectIDs in
 query being ~1K.

 What can be the bottlenecks. Is it better if we model as (studentID
 int, subjct_marks_json text) and query by studentID.

 --
 Thanks  Regards,
 Apoorva





 --
 Thanks  Regards,
 Apoorva




 --
 Sourabh Agrawal
 Bangalore
 +91 9945657973




-- 
Thanks  Regards,
Apoorva


Read performance in map data type

2014-03-28 Thread Apoorva Gaurav
Hello All,

We've a schema which can be modeled as (studentID, subjectID, marks) where
combination of studentID and subjectID is unique. Number of studentID can
go up to 100 million and for each studentID we can have up to  10k
subjectIDs.

We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We are
using a four node cluster, each having 24 cores and 32GB memory. I'm sure
that the machines are not underperformant as on same test bed we've
consistently received 5ms response times for ~1b documents when queried
via primary key.

I've tried three approaches, all of which result in significant
deterioration (500 ms response time) in read query performance once number
of subjectIDs goes past ~100 for a studentID. Approaches are :-

1. model as (studentID int PRIMARY KEY, subjectID_marks_map mapint, int)
and query by subjectID

2. model as (studentID int, subjectID int, marks int, PRIMARY
KEY(studentID, subjectID) and query as select * from marks_table where
studentID = ?

3. model as (studentID int, subjectID int, marks int, PRIMARY
KEY(studentID, subjectID) and query as select * from marks_table where
studentID = ? and subjectID in (?, ?, ??)  number of subjectIDs in
query being ~1K.

What can be the bottlenecks. Is it better if we model as (studentID int,
subjct_marks_json text) and query by studentID.

-- 
Thanks  Regards,
Apoorva


Re: Read performance in map data type

2014-03-28 Thread Apoorva Gaurav
Hello Shrikar,

Yes primary key is (studentID, subjectID). I had dropped the test table,
recreating and populating it post which will share the cfhistogram. In such
case is there any practical limit on the rows I should fetch, for e.g.
should I do
   select * form marks_table where studentID = ? limit 500;
instead of doing
   select * form marks_table where studentID = ?;


On Sat, Mar 29, 2014 at 5:20 AM, Shrikar archak shrika...@gmail.com wrote:

 Hi Apoorva,

 I assume this is the table with studentId and subjectId  as primary keys
 and not other like like marks in that.

 create table marks_table(studentId int, subjectId int, marks int, PRIMARY
 KEY(studentId,subjectId));

 Also could you give the cfhistogram stats?

 nodetool cfhistograms your keyspace marks_table;



 Thanks,
 Shrikar


 On Fri, Mar 28, 2014 at 3:53 PM, Apoorva Gaurav apoorva.gau...@myntra.com
  wrote:

 Hello All,

 We've a schema which can be modeled as (studentID, subjectID, marks)
 where combination of studentID and subjectID is unique. Number of studentID
 can go up to 100 million and for each studentID we can have up to  10k
 subjectIDs.

 We are using apahce cassandra 2.0.4 and datastax java driver 1.0.4. We
 are using a four node cluster, each having 24 cores and 32GB memory. I'm
 sure that the machines are not underperformant as on same test bed we've
 consistently received 5ms response times for ~1b documents when queried
 via primary key.

 I've tried three approaches, all of which result in significant
 deterioration (500 ms response time) in read query performance once number
 of subjectIDs goes past ~100 for a studentID. Approaches are :-

 1. model as (studentID int PRIMARY KEY, subjectID_marks_map mapint,
 int) and query by subjectID

 2. model as (studentID int, subjectID int, marks int, PRIMARY
 KEY(studentID, subjectID) and query as select * from marks_table where
 studentID = ?

 3. model as (studentID int, subjectID int, marks int, PRIMARY
 KEY(studentID, subjectID) and query as select * from marks_table where
 studentID = ? and subjectID in (?, ?, ??)  number of subjectIDs in
 query being ~1K.

 What can be the bottlenecks. Is it better if we model as (studentID int,
 subjct_marks_json text) and query by studentID.

 --
 Thanks  Regards,
 Apoorva





-- 
Thanks  Regards,
Apoorva