[jira] [Comment Edited] (CASSANDRA-12490) Add sequence distribution type to cassandra stress

2016-10-16 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581311#comment-15581311
 ] 

Stefania edited comment on CASSANDRA-12490 at 10/17/16 5:53 AM:


bq. It still generates min..max as nextWithWrap() (which is called by the other 
variants of next()) returns start + (next % totalCount) so it starts again from 
min if it goes past max.

You're correct it does wrap, but it won't necessarily start at min, so the 
documentation should clearly state this. I would also add a call to 
{{setSeed()}} in one of the unit tests if the patch goes ahead. 
{{inverseCumProb()}} is correct because of the wrapping.
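
To make the wrapping behaviour concrete, here is a minimal Java sketch of a sequence generator built around the quoted formula, start + (next % totalCount). It is not the class from the attached patch: the class name {{SequenceSketch}} is hypothetical, and the idea that {{setSeed()}} repositions the internal counter is an assumption made for illustration. Under that assumption the first value after seeding need not be min, while every later wrap does return to min.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch for illustration; not the class from the attached patch.
final class SequenceSketch {
    private final long start;       // min of the range
    private final long totalCount;  // number of values in min..max
    private final AtomicLong next = new AtomicLong(); // position in the sequence

    SequenceSketch(long min, long max) {
        if (max < min) throw new IllegalArgumentException("max must be >= min");
        this.start = min;
        this.totalCount = max - min + 1;
    }

    // Assumption: seeding repositions the counter rather than resetting it to zero.
    // A non-negative seed is assumed, since % keeps the sign of the dividend in Java.
    void setSeed(long seed) {
        next.set(seed);
    }

    // The formula quoted in the comment: start + (next % totalCount).
    long nextWithWrap() {
        return start + (next.getAndIncrement() % totalCount);
    }

    public static void main(String[] args) {
        SequenceSketch seq = new SequenceSketch(1, 5);
        seq.setSeed(3);
        // Prints: 4 5 1 2 3 4
        for (int i = 0; i < 6; i++) System.out.print(seq.nextWithWrap() + " ");
        System.out.println();
    }
}
```

With a range of 1..5 and a seed of 3, the sequence starts at 4, reaches max, and only then restarts from min, which is the behaviour the documentation would need to spell out.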


was (Author: stefania):
bq. It still generates min..max as nextWithWrap() (which is called by the other 
variants of next()) returns start + (next % totalCount) so it starts again from 
min if it goes past max.

You're correct it does wrap, but it won't necessarily start at min, so the 
documentation should clearly state this. I would also add a call to 
{{setSeed()}} in one of the unit tests if the patch goes ahead. 
{{inverseCumProb()}} is instead correct because of the wrapping.

> Add sequence distribution type to cassandra stress
> --
>
> Key: CASSANDRA-12490
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12490
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Tools
>Reporter: Ben Slater
>Assignee: Ben Slater
>Priority: Minor
> Fix For: 3.10
>
> Attachments: 12490-trunk.patch, 12490.yaml, 12490update-trunk.patch, 
> cqlstress-seq-example.yaml
>
>
> When using the write command, cassandra stress sequentially generates seeds. 
> This ensures generated values don't overlap (unless the sequence wraps), 
> providing a more predictable number of inserted records (and generating a base 
> set of data without wasted writes).
> When using a yaml stress spec there is no sequenced distribution available. 
> I think it would be useful to have this for doing an initial load of data for 
> testing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-12490) Add sequence distribution type to cassandra stress

2016-10-13 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571614#comment-15571614
 ] 

Benedict edited comment on CASSANDRA-12490 at 10/13/16 11:09 AM:

_partition_ keys are a distinct beast, and if your population distribution for 
these is tiny then yes you will get overwrites, and I'm really not sure there's 
anything we can _reliably_ do about that.  Mostly I've been talking about 
behaviour _within_ a partition (except when pointing out some breakages).

The command line "-pop" property specifies the population of unique partition 
_seeds_.  These have to be translated into the partition key population 
distribution(s) first, which then between them uniquely produce the partition's 
contents (different seeds hitting the same PK will produce the same entire 
partition).  The problem is that the size of the unique seed set could be 
gigantic (we let n be billions in size, and it is often necessary to run with 
datasets this large), so enumerating all of these unique seeds and determining 
their value in the partition key column population distributions would be 
prohibitively expensive.  So we just accept that users should sensibly ensure 
their partition key population distribution is large enough to accommodate 
enough random samples to fulfil their seed population.
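
The collision effect described above can be sketched in a few lines of Java. This is a generic illustration, not cassandra-stress code: {{SeedCollisionSketch}}, {{keyForSeed}} and {{distinctKeys}} are hypothetical names, and seeding a {{java.util.Random}} per partition seed is an assumed stand-in for however stress actually translates a seed into a partition key. The only point is that a key population of 50 can never yield more than 50 distinct partitions, however many seeds are drawn, and usually yields fewer.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch for illustration; not how cassandra-stress derives keys.
final class SeedCollisionSketch {
    // Stand-in for "translate a seed into a partition key": the seed drives an
    // RNG, which draws one value uniformly from min..max. The same seed always
    // yields the same key, and different seeds may collide on one key.
    static long keyForSeed(long seed, int min, int max) {
        Random rng = new Random(seed);
        return min + rng.nextInt(max - min + 1);
    }

    // Counts how many distinct keys seeds 1..seedCount map to.
    static int distinctKeys(long seedCount, int min, int max) {
        Set<Long> keys = new HashSet<>();
        for (long seed = 1; seed <= seedCount; seed++) {
            keys.add(keyForSeed(seed, min, max));
        }
        return keys.size();
    }

    public static void main(String[] args) {
        // Mirrors a key population of uniform(1..50) at several seed counts.
        for (long n : new long[] {50, 500, 5000}) {
            System.out.println("seeds=" + n + " distinct keys=" + distinctKeys(n, 1, 50));
        }
    }
}
```

Every seed that collides with an earlier one on the same key rewrites the same partition, so the number of rows written is the distinct-key count rather than the seed count.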

Now, for small populations we *could* mode-switch.  But I'm not sure it ever 
makes sense to so materially constrain your partition key population 
distribution.  It might even make sense for stress to forbid constraining this 
distribution too much, as it has essentially no impact on the behaviour profile 
of the cluster.

If you want to visit a single partition many times, there are better ways to do 
that, i.e., specifying that the seed population is small but that you want to 
run many operations.  This will give you an identically constrained population, 
without any risk of weirdness, as well as permitting the same yaml to be used 
for different scales of test.  Ideally you want each visit, in such a scenario, 
to use the more advanced features of stress anyway (such as partial visitation 
of the whole generated (presumably huge) partition, or incremental visitation).


was (Author: benedict):
_partition_ keys are a distinct beast, and if your population distribution for 
these is tiny then yes you will get overwrites, and I'm really not sure there's 
anything we can _reliably_ do about that.  Mostly I've been talking about 
behaviour _within_ a partition (except when pointing out some breakages).

The command line "-pop" property specifies the population of unique partition 
_seeds_.  These have to be translated into the partition key population 
distribution(s) first, which then between them identify the partition's 
contents.  The problem is that the size of the unique seed set could be 
gigantic (we let n be billions in size, and it is often necessary to run with 
datasets this large), so enumerating all of these unique seeds and determining 
their value in the partition key column population distributions would be 
prohibitively expensive.  So we just accept that users should sensibly ensure 
their partition key population distribution is large enough to accommodate 
enough random samples to fulfil their seed population.

Now, for small populations we *could* mode-switch.  But I'm not sure it ever 
makes sense to so materially constrain your partition key population 
distribution.  It might even make sense for stress to forbid constraining this 
distribution too much, as it has essentially no impact on the behaviour profile 
of the cluster.

If you want to visit a single partition many times, there are better ways to do 
that, i.e., specifying that the seed population is small but that you want to 
run many operations.  This will give you an identically constrained population, 
without any risk of weirdness, as well as permitting the same yaml to be used 
for different scales of test.  Ideally you want each visit, in such a scenario, 
to use the more advanced features of stress anyway (such as partial visitation 
of the whole generated (presumably huge) partition, or incremental visitation).


[jira] [Comment Edited] (CASSANDRA-12490) Add sequence distribution type to cassandra stress

2016-10-13 Thread Ben Slater (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571589#comment-15571589
 ] 

Ben Slater edited comment on CASSANDRA-12490 at 10/13/16 10:57 AM:

OK, so I did some more investigation this evening to try to better understand 
this and found a few interesting things. I suspect there is at least one bug 
here but I'll be interested to see what you think.

I set up a simple spec to test what was going on:
{code}
table: test4
table_definition: |
  CREATE TABLE test4 (
    pk text,
    val text,
    PRIMARY KEY (pk)
  )
columnspec:
  - name: pk
    size: fixed(32)
    population: uniform(1..50)
{code}

When I run this with `ops(insert=1) n=50` the end result is 1 row added to the 
table. When I run it with n=500 I get 3 rows. Some other observations:
a) tracing this through it seems that's because the small number of seed values 
from the population (due to the small n=) results in a very low variation in 
values being returned from `delegate.sample()` in 
`DistributionBoundApache.next()`. They were (all 37