Re: cassandra-stress HexStrings generator

2018-12-14 Thread Saleil Bhat (BLOOMBERG/ 731 LEX)
Main issue is resolved. The test I was using to determine normality was too 
sensitive to discretization, so it was yielding a negative result even though 
the data looked pretty normal on visual inspection.  As far as I can tell, the 
profile mode uses the Strings generator for text columns, so HexStrings was 
not on the code path I was exercising.
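
To illustrate how discretization alone can trip a normality test, here is a 
minimal Python sketch. The thread does not say which test was used, so a 
Kolmogorov-Smirnov distance against a fitted normal stands in for it; the 
mean, standard deviation, sample size and the 1.36/sqrt(n) yardstick are all 
assumptions chosen for the illustration.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def ks_distance(samples):
    """Largest gap between the empirical CDF and a normal fitted to the sample."""
    xs = sorted(samples)
    n = len(xs)
    fitted = NormalDist(fmean(xs), pstdev(xs))
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Compare against the empirical CDF just below and just above this sample.
        gap = max(gap, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return gap


rng = random.Random(42)
n = 20000
continuous = [rng.gauss(50.0, 5.0) for _ in range(n)]
discrete = [float(round(x)) for x in continuous]  # same draws, snapped to integers

# Textbook 5% KS critical value, used here only as a rough yardstick.
threshold = 1.36 / math.sqrt(n)
d_continuous = ks_distance(continuous)
d_discrete = ks_distance(discrete)
print(f"threshold={threshold:.4f} continuous={d_continuous:.4f} discrete={d_discrete:.4f}")
```

The rounded sample comes from exactly the same draws, yet its distance lands 
above the yardstick because every integer value creates a step in the 
empirical CDF. A histogram of either sample looks bell-shaped.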

The only (minor) remaining concern is that the Strings generator emits some 
control characters in its output. I presume this behavior is unintended and 
that the characters should be restricted to printable ASCII.
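
For what it is worth, the restriction I have in mind looks roughly like the 
Python sketch below. It is not the cassandra-stress code; the function name, 
the seed parameter and the choice of 0x20..0x7E as the printable range are my 
own assumptions for the illustration.

```python
import random

FIRST_PRINTABLE = 0x20  # space
LAST_PRINTABLE = 0x7E   # tilde


def printable_string(seed, length):
    """Build a repeatable string whose characters all lie in 0x20..0x7E."""
    rng = random.Random(seed)
    return "".join(
        chr(rng.randint(FIRST_PRINTABLE, LAST_PRINTABLE)) for _ in range(length)
    )


value = printable_string(seed=7, length=32)
print(repr(value))
```

Drawing each character directly from the printable range means no control 
character can appear, and the same seed always yields the same string.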

Thanks, 
-Saleil





Re: cassandra-stress HexStrings generator

2018-12-13 Thread Benedict Elliott Smith
I’m honestly not sure.  The code has changed since I last worked on it, which 
was years ago.  I suspect the profile mode has entirely supplanted the prior 
modes, and that these older modes supported the HexStrings generator.

Perhaps somebody else can help answer this question.




-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: cassandra-stress HexStrings generator

2018-12-13 Thread Saleil Bhat (BLOOMBERG/ 731 LEX)
Ah ok thanks. This brings up another question: how did the HexStrings generator 
code path even get called? 



When I saw these results, I was using the following test table: 
  CREATE TABLE testtable (
      partition_key text,
      clustering_column text,
      value text,
      PRIMARY KEY (partition_key, clustering_column)
  );


From StressProfile.java, any column of type TEXT should use the Strings 
generator. 
However, my data looks suspiciously like the HexStrings generator 
was being used instead. 


First, the generated strings included control characters like SUB (\x1A), BEL 
(\x07), etc. However, the Strings generator code looks like it restricts the 
characters to the printable range. 
Second, the result I documented previously (that the characters are normally 
distributed, but the strings are not), matches the implementation of 
HexStrings. 
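
For reference, the first observation can be checked with a small scan like 
the Python sketch below. The helper name and the sample value are made up for 
illustration; in practice the values would be read back from the test table.

```python
def find_control_chars(value):
    """Return (index, escaped code) for each ASCII control character in value."""
    hits = []
    for index, ch in enumerate(value):
        code = ord(ch)
        # ASCII control characters are 0x00..0x1F plus DEL (0x7F).
        if code < 0x20 or code == 0x7F:
            hits.append((index, f"\\x{code:02X}"))
    return hits


sample = "ab\x1Acd\x07"  # made-up value containing SUB and BEL
print(find_control_chars(sample))
```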



Do you know why this might be the case?

Thanks, 
-Saleil 






Re: cassandra-stress HexStrings generator

2018-12-12 Thread Benedict Elliott Smith
Yes, I’m pretty sure you understood correctly (I wrote most of this, but it’s 
been a long time so I cannot remember much for certain).  

It should be implemented like the Strings generator.  It looks like both 
HexStrings and HexBytes are incorrect, and have been for a long time.


> On 12 Dec 2018, at 22:27, Saleil Bhat (BLOOMBERG/ 731 LEX) wrote:
> 
> Hi, 
> 
> I have a question about the behavior of the HexStrings value generator in the 
> cassandra-stress tool, particularly concerning its population/identity 
> distribution.  
> 
> 
> Per the discussion in JIRA item CASSANDRA-6146 concerning the stress YAML 
> profile, the population field in a columnspec “represents the total unique 
> population distribution of that column across rows.”
> 
> 
> I interpreted this to mean that if I specify some distribution 'F' for a 
> column, then the probability of occurrence for each potential value of that 
> column is given by 'F'. 
> 
> So, for example, if I provided the following columnspec for a text column: 
>     name: fake_column
>     size: fixed(32)
>     population: gaussian(1..100)
> and then generated a large amount of data according to this specification, 
> I would expect there to be 100 distinct values for ‘fake_column’, and that a 
> histogram of the frequency of occurrence of each value would be roughly 
> bell-shaped. 
> 
> 
> 
> However, the current implementation of the HexStrings generator deviates from 
> this expectation. In the current implementation, each CHARACTER in the string 
> is drawn from F, rather than the string as a whole. Therefore, if you plot 
> the histogram of frequency of occurrence for each character, you get a 
> bell-shaped curve, but the distribution of the occurrences of whole strings 
> (the actual columns) is something else. 
> 
> 
> My question is, is this the desired behavior for string columns? Was my 
> expectation/interpretation incorrect? If so, can anyone give some insight as 
> to why strings are designed to behave this way and what the use case is for 
> this behavior? 
> 
> Thanks, 
> -Saleil 
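
The difference described in the question quoted above can be sketched in 
Python as follows. This is an illustration, not the cassandra-stress 
implementation: the helper names, the hex alphabet, and the assumption that 
the bell curve is centred on the range with a standard deviation of one sixth 
of its width are all hypothetical.

```python
import random
from collections import Counter

ALPHABET = "0123456789abcdef"


def bounded_gauss(rng, lo, hi):
    """Draw an integer in [lo, hi] from a bell curve centred on the range."""
    mean = (lo + hi) / 2
    stdev = (hi - lo) / 6  # assumed spread: the range covers about 3 sigma each side
    while True:
        x = round(rng.gauss(mean, stdev))
        if lo <= x <= hi:
            return x


def per_string(rng, length, lo=1, hi=100):
    """One draw selects the whole value; the draw seeds a deterministic string."""
    seed = bounded_gauss(rng, lo, hi)
    char_rng = random.Random(seed)
    return "".join(char_rng.choice(ALPHABET) for _ in range(length))


def per_character(rng, length):
    """Every character is its own draw from the distribution."""
    return "".join(ALPHABET[bounded_gauss(rng, 0, 15)] for _ in range(length))


rng = random.Random(1)
samples = 5000
whole_counts = Counter(per_string(rng, 32) for _ in range(samples))
char_counts = Counter(per_character(rng, 32) for _ in range(samples))
print("distinct values, one draw per string:   ", len(whole_counts))
print("distinct values, one draw per character:", len(char_counts))
```

With one draw per string there can be at most 100 distinct values, and their 
frequencies follow the bell curve. With one draw per character, each 
character is bell-shaped but almost every 32-character string is unique, so 
the column as a whole has no such population.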

