[jira] [Commented] (CASSANDRA-7519) Further stress improvements to generate more realistic workloads

2014-09-07 Thread Ryan McGuire (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124994#comment-14124994
 ] 

Ryan McGuire commented on CASSANDRA-7519:
-

I'm getting compilation errors from this on cassandra-2.1 HEAD:

{code}
stress-build:
[mkdir] Created dir: /home/ryan/git/datastax/cassandra/build/classes/stress
[javac] Compiling 102 source files to /home/ryan/git/datastax/cassandra/build/classes/stress
[javac] /home/ryan/git/datastax/cassandra/tools/stress/src/org/apache/cassandra/stress/settings/SettingsCommandPreDefinedMixed.java:60: error: constructor SampledOpDistributionFactory in class SampledOpDistributionFactory<T> cannot be applied to given types;
[javac] return new SampledOpDistributionFactory<Command>(ratios, clustering)
[javac]^
[javac]   required: List<Pair<Command,Double>>,DistributionFactory
[javac]   found: Map<Command,Double>,DistributionFactory
[javac]   reason: actual argument Map<Command,Double> cannot be converted to List<Pair<Command,Double>> by method invocation conversion
[javac]   where T is a type-variable:
[javac] T extends Object declared in class SampledOpDistributionFactory
[javac] /home/ryan/git/datastax/cassandra/tools/stress/src/org/apache/cassandra/stress/settings/SettingsCommandPreDefinedMixed.java:61: error: constructor SampledOpDistributionFactory in class SampledOpDistributionFactory<T> cannot be applied to given types;
[javac] {
[javac] ^
[javac]   required: List<Pair<Command,Double>>,DistributionFactory
[javac]   found: no arguments
[javac]   reason: actual and formal argument lists differ in length
[javac]   where T is a type-variable:
[javac] T extends Object declared in class SampledOpDistributionFactory
[javac] /home/ryan/git/datastax/cassandra/tools/stress/src/org/apache/cassandra/stress/settings/SettingsCommandUser.java:67: error: constructor SampledOpDistributionFactory in class SampledOpDistributionFactory<T> cannot be applied to given types;
[javac] return new SampledOpDistributionFactory<String>(ratios, clustering)
[javac]^
[javac]   required: List<Pair<String,Double>>,DistributionFactory
[javac]   found: Map<String,Double>,DistributionFactory
[javac]   reason: actual argument Map<String,Double> cannot be converted to List<Pair<String,Double>> by method invocation conversion
[javac]   where T is a type-variable:
[javac] T extends Object declared in class SampledOpDistributionFactory
[javac] /home/ryan/git/datastax/cassandra/tools/stress/src/org/apache/cassandra/stress/settings/SettingsCommandUser.java:68: error: constructor SampledOpDistributionFactory in class SampledOpDistributionFactory<T> cannot be applied to given types;
[javac] {
[javac] ^
[javac]   required: List<Pair<String,Double>>,DistributionFactory
[javac]   found: no arguments
[javac]   reason: actual and formal argument lists differ in length
[javac]   where T is a type-variable:
[javac] T extends Object declared in class SampledOpDistributionFactory
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 4 errors

BUILD FAILED
/home/ryan/git/datastax/cassandra/build.xml:735: Compile failed; see the compiler error output for details.
{code}
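
The errors above say the constructor now expects a list of (key, ratio) pairs while the callers still pass a map. As a rough illustration of the kind of adaptation involved, and not the actual fix that was committed, a map can be flattened into a list of pairs as sketched below. The class RatioAdapter and the method toPairs are hypothetical names, and java.util.Map.Entry stands in for Cassandra's own Pair type so that the snippet is self-contained.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Cassandra source tree.
final class RatioAdapter
{
    // Flattens a map of key -> ratio into a list of (key, ratio) pairs,
    // preserving the iteration order of the map.
    static <T> List<Map.Entry<T, Double>> toPairs(Map<T, Double> ratios)
    {
        List<Map.Entry<T, Double>> pairs = new ArrayList<>(ratios.size());
        for (Map.Entry<T, Double> e : ratios.entrySet())
            pairs.add(new SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        return pairs;
    }
}
```

The second pair of errors ("found: no arguments") comes from the anonymous subclass body on the following source line, so it would presumably disappear once the constructor arguments match.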


[jira] [Commented] (CASSANDRA-7519) Further stress improvements to generate more realistic workloads

2014-09-07 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125124#comment-14125124
 ] 

Benedict commented on CASSANDRA-7519:
-

There were no merge conflicts from 2.1.0 to 2.1, so I missed this. I've 
ninja-fixed it.

I haven't resolved the ticket because, whilst it's been committed to 2.1.0, 
it's unclear if it will be pushed back to 2.1.1 since the RC is up for vote. So 
the fixVersion is unknown.

 Further stress improvements to generate more realistic workloads
 

 Key: CASSANDRA-7519
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7519
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Benedict
Assignee: Benedict
Priority: Minor
  Labels: tools
 Fix For: 2.1.1


 We generally believe that the most common workload is for reads to 
 exponentially prefer most recently written data. However as stress currently 
 behaves we have two id generation modes: sequential and random (although 
 random can be distributed). I propose introducing a new mode which is 
 somewhat like sequential, except we essentially 'look back' from the current 
 id by some amount defined by a distribution. I may possibly make the position 
 only increment as it's first written to also, so that this mode can be run 
 from a clean slate with a mixed workload. This should allow us to generate 
 workloads that are more representative.
 At the same time, I will introduce a timestamp value generator for primary 
 key columns that is strictly ascending, i.e. has some random component but is 
 based off of the actual system time (or some shared monotonically increasing 
 state) so that we can again generate a more realistic workload. This may be 
 challenging to tie in with the new procedurally generated partitions, but I'm 
 sure it can be done without too much difficulty.
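
A minimal sketch of the proposed 'look back' id mode, assuming an exponential distribution for the look-back amount. The class LookBackIdGenerator, its methods, and the meanLookBack parameter are hypothetical and are not taken from the stress tool.

```java
import java.util.Random;

// Hypothetical sketch; class and parameter names are illustrative only.
final class LookBackIdGenerator
{
    private final Random random;
    private final double meanLookBack;
    private long current; // highest id written so far; 0 means nothing written

    LookBackIdGenerator(long seed, double meanLookBack)
    {
        this.random = new Random(seed);
        this.meanLookBack = meanLookBack;
    }

    // The position only advances on writes, so a mixed workload can be
    // started from a clean slate.
    long nextWriteId()
    {
        return ++current;
    }

    // Reads look back from the current position by an exponentially
    // distributed offset, so recently written ids are chosen most often.
    long nextReadId()
    {
        if (current == 0)
            throw new IllegalStateException("nothing has been written yet");
        // Inverse-CDF sampling of an exponential with the given mean;
        // nextDouble() is in [0, 1), so the argument to log is in (0, 1].
        double offset = -meanLookBack * Math.log(1.0 - random.nextDouble());
        // Clamp so that a long look-back never goes below the first id.
        return Math.max(1, current - (long) offset);
    }
}
```

With a mean look-back of 10, for example, most reads would land within a few dozen ids of the newest write, while older ids would still be touched occasionally.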



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7519) Further stress improvements to generate more realistic workloads

2014-09-05 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122877#comment-14122877
 ] 

T Jake Luciani commented on CASSANDRA-7519:
---

+1


[jira] [Commented] (CASSANDRA-7519) Further stress improvements to generate more realistic workloads

2014-09-04 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122314#comment-14122314
 ] 

Benedict commented on CASSANDRA-7519:
-

Rebased and made the agreed tweaks as two follow-up commits. If you can +1, 
I'll commit this to 2.1.0 in time for the release.


[jira] [Commented] (CASSANDRA-7519) Further stress improvements to generate more realistic workloads

2014-08-23 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108219#comment-14108219
 ] 

T Jake Luciani commented on CASSANDRA-7519:
---

Ran some tests and tweaked the schema from the blog post, and things look 
better. I do have some further questions/suggestions besides the better names.

- What is the point of batchcount?  The point of a batch is to group the 
inserts into a single statement for the server, so why would you send multiple 
of these sequentially? Even though it's possible I can't think of a realistic 
workload that would use it.

- I think it would be helpful to output some information on the partition sizes 
and batch sizes for inserts to give people a sense of what their selected 
values will do, like:

{code}
Global:
  Partitions: Min of X, Max of Y  
  Rows per partition:  Min of X,  Max of Y 

Per Batch:
  Partitions: Min of X, Max of Y
  Rows per partition: Min of X, Max of Y
{code}
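
One possible way to produce lines in the format suggested above, assuming the sizes have already been sampled into an array. SizeSummary and describe are hypothetical names, and nothing here reflects how stress actually reports these figures.

```java
import java.util.LongSummaryStatistics;
import java.util.stream.LongStream;

// Hypothetical helper for illustration only.
final class SizeSummary
{
    // Formats the minimum and maximum of the sampled sizes as one summary
    // line. An empty sample array is not handled specially in this sketch.
    static String describe(String label, long[] samples)
    {
        LongSummaryStatistics stats = LongStream.of(samples).summaryStatistics();
        return String.format("  %s: Min of %d, Max of %d", label, stats.getMin(), stats.getMax());
    }
}
```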




[jira] [Commented] (CASSANDRA-7519) Further stress improvements to generate more realistic workloads

2014-08-23 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108226#comment-14108226
 ] 

Benedict commented on CASSANDRA-7519:
-

bq. What is the point of batchcount? The point of a batch is to group the 
inserts into a single statement for the server, so why would you send multiple 
of these sequentially? Even though it's possible I can't think of a realistic 
workload that would use it.

The idea was to support benchmarking many inserts into a very wide row. However 
now that we support the revisit mechanism, this does seem superfluous. There is 
one slight potential reason to include it, which is that currently batches only 
support 5000 statements. Currently stress automatically splits into batches of 
this size, but perhaps we should error out at the start if it's possible to 
generate a batch larger than this, and support this option to permit users to 
split it up. Or, alternatively, we should drop this option and forbid batches 
larger than 5000 in size only if using LOGGED batches. Any of the above seem 
reasonable to me.
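
The automatic splitting mentioned above can be sketched roughly as follows. The 5000 figure is taken from this comment and treated as an assumption; BatchSplitter and split are hypothetical names rather than anything in the stress code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting one logical update into bounded batches.
final class BatchSplitter
{
    // Limit quoted in the thread; an assumption, not a verified constant.
    static final int MAX_STATEMENTS_PER_BATCH = 5000;

    // Splits the statements into consecutive batches of at most maxPerBatch
    // entries each, preserving the original order.
    static <T> List<List<T>> split(List<T> statements, int maxPerBatch)
    {
        if (maxPerBatch <= 0)
            throw new IllegalArgumentException("maxPerBatch must be positive");
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < statements.size(); start += maxPerBatch)
        {
            int end = Math.min(start + maxPerBatch, statements.size());
            batches.add(new ArrayList<>(statements.subList(start, end)));
        }
        return batches;
    }
}
```

The alternative floated above, erroring out at startup, would amount to checking the largest possible batch size against the same limit before running anything.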

bq. I think it would be helpful to output some information on the partition 
sizes and batch sizes for inserts to give people a sense of what their selected 
values will do,

That does sound sensible, yes. I'll add that.

It seems that it might be worthwhile including an estimate of the size of data 
we've sent in the main stress output as well, as with a lot of randomly (esp. 
exponentially) generated data it could vary dramatically, so the current data 
might not be as useful as it first appears.

Separately, I think we should make a minor tweak and base the stderr 
calculation on partition count, not operation count.


[jira] [Commented] (CASSANDRA-7519) Further stress improvements to generate more realistic workloads

2014-08-17 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100233#comment-14100233
 ] 

T Jake Luciani commented on CASSANDRA-7519:
---

I'm not very keen on the new labels you've chosen for the insert section of the 
yaml file. They should be more verbose.

Batch size - number of unique partitions to update in a single operation 
  This should mention partitions in it, no? partitions_per_batch maybe?

Batch count - number of batches we aim to split the update up into
  Does this mean the number of batches to split an operation of N partitions 
into? If so, then perhaps batch_split_count?

I plan to run some test workloads to double check the logic, but first cut of 
the code looked good.  I left a couple comments on the github branch



[jira] [Commented] (CASSANDRA-7519) Further stress improvements to generate more realistic workloads

2014-08-17 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100237#comment-14100237
 ] 

Benedict commented on CASSANDRA-7519:
-

bq. I plan to run some test workloads to double check the logic, but first cut 
of the code looked good. I left a couple comments on the github branch

Thanks!

bq. I'm not very keen on the new labels you've chosen for the insert section of 
the yaml file. They should be more verbose

Nomenclature is always tricky, and I'm certainly not fixed on these names. 
Making them more verbose means making the command line correspondingly more 
verbose to keep the two in sync, which I'm not super keen on, but I'm not too 
fussed about it either.

bq. partitions_per_batch maybe?

Perhaps partitions_per_operation? per_batch implies we might change the 
number of partitions between batches, whereas we work with the same partitions 
for the duration of an 'operation' (the n= declared on the command line).

bq. batch_split_count

batches_per_operation?

