[
https://issues.apache.org/jira/browse/CASSANDRA-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571614#comment-15571614
]
Benedict edited comment on CASSANDRA-12490 at 10/13/16 11:09 AM:
-
_Partition_ keys are a distinct beast, and if your population distribution for
these is tiny then yes, you will get overwrites, and I'm really not sure there's
anything we can _reliably_ do about that. Mostly I've been talking about
behaviour _within_ a partition (except when pointing out some breakages).
The command line "-pop" property specifies the population of unique partition
_seeds_. These have to be translated into the partition key population
distribution(s) first, which then between them uniquely produce the partition's
contents (different seeds hitting the same PK will produce the same entire
partition). The problem is that the size of the unique seed set could be
gigantic (we let n be billions in size, and it is often necessary to run with
datasets this large), so enumerating all of these unique seeds and determining
their value in the partition key column population distributions would be
prohibitively expensive. So we just accept that users should sensibly ensure
their partition key population distribution is large enough to accommodate
enough random samples to fulfil their seed population.
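
To make the overwrite risk concrete, here is a minimal toy model in Python. It is not how stress actually derives keys: the function names, the use of Python's `random` module, and the uniform key distribution are all assumptions made for illustration. Each seed deterministically samples one key from a population of `key_population` values, so two seeds that land on the same key would describe the same partition.

```python
import random


def partition_key_for_seed(seed: int, key_population: int) -> int:
    """Hypothetical stand-in: derive a key deterministically from a seed
    by drawing once from a uniform key population of the given size."""
    rng = random.Random(seed)
    return rng.randrange(key_population)


def distinct_keys(num_seeds: int, key_population: int) -> int:
    """Count how many distinct keys seeds 1..num_seeds actually hit."""
    return len({partition_key_for_seed(s, key_population)
                for s in range(1, num_seeds + 1)})


def expected_distinct_keys(num_seeds: int, key_population: int) -> float:
    """Expected distinct keys when num_seeds uniform samples are drawn
    from key_population values: K * (1 - (1 - 1/K)^n)."""
    return key_population * (1.0 - (1.0 - 1.0 / key_population) ** num_seeds)


if __name__ == "__main__":
    for key_population in (10, 1_000, 1_000_000):
        seen = distinct_keys(1_000, key_population)
        expected = expected_distinct_keys(1_000, key_population)
        print(f"keys={key_population}: distinct={seen}, expected~{expected:.1f}")
```

Under these assumptions, 1,000 seeds against a key population of 10 can never yield more than 10 partitions, while a key population far larger than the seed population makes collisions rare without enumerating any seeds up front.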
Now, for small populations we *could* mode-switch. But I'm not sure it ever
makes sense to so materially constrain your partition key population
distribution. It might even make sense for stress to forbid constraining this
distribution too much, as it has essentially no impact on the behaviour profile
of the cluster.
If you want to visit a single partition many times, there are better ways to do
that, e.g. specifying that the seed population is small but that you want to
run many operations. This will give you an identically constrained population,
without any risk of weirdness, as well as permitting the same yaml to be used
for different scales of test. Ideally you want each visit, in such a scenario,
to use the more advanced features of stress anyway (such as partial visitation
of the whole generated (presumably huge) partition, or incremental visitation).
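
The alternative of a small seed population combined with a large operation count can be sketched the same way. This is again a toy model in Python and not stress itself: `simulate_visits` and its parameters are hypothetical names, and the uniform choice of seed per operation is an assumption.

```python
import random
from collections import Counter


def simulate_visits(num_ops: int, seed_population: int, rng_seed: int = 0) -> Counter:
    """Hypothetical model: each operation picks one seed uniformly from
    1..seed_population; returns how many times each seed was visited."""
    rng = random.Random(rng_seed)
    return Counter(rng.randint(1, seed_population) for _ in range(num_ops))


if __name__ == "__main__":
    visits = simulate_visits(num_ops=100_000, seed_population=5)
    for seed, count in sorted(visits.items()):
        print(f"seed {seed}: visited {count} times")
```

In this model the number of partitions is bounded by the seed population alone, so the partition key distribution can stay large and the same profile can be reused at other scales by changing only the seed population and the operation count.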
> Add sequence distribution type to cassandra stress
> --
>
> Key: CASSANDRA-12490
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12490
> Project: Cassandra
> Issue Type: Improvement
> Components: Tools
> Reporter: Ben Slater
> Assignee: Ben Slater
> Priority: Minor
> Fix For: 3.10
>
> Attachments: 12490-trunk.patch, 12490.yaml, cqlstress-seq-example.yaml
>
>
> When using the write command, cassandra stress sequentially generates seeds.
> This ensures generated values don't overlap (unless the sequence wraps)
> providing more