what happens to pagination if some data is inserted before it gets resumed?

2019-10-31 Thread jagernicolas
Hi,

What would happen if, between the moment I save a paging state and the moment I resume it, some data has been added to the database?

For example, say I run a query that returns 100 results, paged 10 rows at a time. I get my first page, i.e., my first 10 elements.
Then, say that before I ask for the second page, some data is added to what was my first page, some to the second, and so on.

What will I see when I resume the pagination? Will I get the results as if nothing had been added to the database, or will I see on my second page some results that were pushed out of the first page?
In my case, we are using the Python driver; our code uses the same functions as the following example:

class Items(Model):
    id = columns.Text(primary_key=True)
    data = columns.Bytes()

query = Items.objects.all().limit(10)

first_page = list(query)
last = first_page[-1]
next_page = list(query.filter(pk__token__gt=cqlengine.Token(last.pk)))

source: https://docs.datastax.com/en/developer/python-driver/3.20/cqlengine/queryset/

Is there another way to store the pagination than storing the token? (I'm showing the example and asking because I have the feeling there are two ways to use the Python driver: one using functions like filter, and another where we send a query as we would have written it in cqlsh.)
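
For example, I believe the second style would look something like the sketch below, using the driver's paging_state on a plain CQL statement (the contact point, keyspace, and table names here are placeholders, not our real ones):

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])         # placeholder contact point
session = cluster.connect('mykeyspace')  # placeholder keyspace

statement = SimpleStatement("SELECT id, data FROM items", fetch_size=10)

# First page: execute, read the rows, and save the opaque paging state.
result = session.execute(statement)
first_page = result.current_rows
saved_state = result.paging_state  # bytes; can be persisted and reused later

# Later: resume from the saved state instead of recomputing a token.
result = session.execute(statement, paging_state=saved_state)
second_page = result.current_rows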

regards,
Nicolas Jäger


oversized partition detection? monitoring partition growth?

2019-10-31 Thread jagernicolas
Hi,
how can I detect a partition that reaches 100MB? Is it possible to log the size of every partition once per day?
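
I know Cassandra warns at compaction time once a partition goes over compaction_large_partition_warning_threshold_mb (100MB by default), and that nodetool tablestats reports "Compacted partition maximum bytes" per table. For a daily job, I was thinking of something like the sketch below, polling system.size_estimates with the Python driver; I am aware these are per-token-range estimates, not exact per-partition sizes:

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])  # placeholder contact point
session = cluster.connect()

WARN_BYTES = 100 * 1024 * 1024  # the 100MB threshold in question

# system.size_estimates holds per-token-range estimates that Cassandra
# refreshes periodically; enough to spot tables trending toward large
# partitions, not to pinpoint a single oversized partition.
rows = session.execute(
    "SELECT keyspace_name, table_name, mean_partition_size "
    "FROM system.size_estimates")
for row in rows:
    if row.mean_partition_size and row.mean_partition_size > WARN_BYTES:
        print("large partitions likely in %s.%s (mean %d bytes)"
              % (row.keyspace_name, row.table_name, row.mean_partition_size))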

regards,
Nicolas Jäger


Re: Sizing a cluster

2019-10-01 Thread jagernicolas
Hi Léo, thanks for the links,

> Is that the size of the uncompressed data or the data once it has been
> inserted and compressed by cassandra?
The 0.5MB is the size of the data we send, before Cassandra does any compression.

> Looking at the cassandra compression,
> http://cassandra.apache.org/doc/latest/operating/compression.html,
> and testing different parameters on a test cluster might be interesting
> before you do the sizing of the final production cluster,
We are in the dev phase and have two small clusters. I haven't yet taken compression into account. For compaction, I roughly assumed that we need 50% of extra space per node (that extra space is not in the calculation I did in my last email).
1 October 2019 08:58, "Léo FERLIN SUTTON" <lfer...@mailjet.com.invalid> wrote:
Hi!
I'm not an expert, but don't forget that Cassandra needs space to do its compactions.
Take a look at the worst-case scenarios in this DataStax grid:
https://docs.datastax.com/en/dse-planning/doc/planning/capacityPlanning.html#capacityPlanning__disk

> The size of a picture + data is about 0.5MB  
Is that the size of the uncompressed data or the data once it has been inserted and compressed by cassandra?
Looking at the cassandra compression, http://cassandra.apache.org/doc/latest/operating/compression.html, and testing different parameters on a test cluster might be interesting before you do the sizing of the final production cluster.
Regards,
Leo
On Tue, Oct 1, 2019 at 1:40 PM <jagernico...@legtux.org> wrote:
Hi,
We want to use Cassandra to store camera detections. The size of a picture + data is about 0.5MB. We are starting with 5 devices, but we are targeting 50 devices for next year and could go up to 1000. To summarize everything:
*  Number of sources: 5 - 50 - 1000 (src)  
*  Frequency of data: 1Hz (f)  
*  Estimated size of data: 0.5MB (s)  
*  Replication factor: 3 (RF)  
I calculated the size per year (replicas included),
* src * f * 60 * 60 * 24 * 365 * s * RF
which gives me,
* 5 sources = 0.24 PB per year 
* 50 sources = 2.4 PB per year 
* 1000 sources = 47.3 PB per year 
So if I respect the 2TB-per-node rule, I get about 120 nodes in the simplest case (5 sources). Am I right?

regards,
Nicolas Jäger


Sizing a cluster

2019-10-01 Thread jagernicolas
Hi,
We want to use Cassandra to store camera detections. The size of a picture + data is about 0.5MB. We are starting with 5 devices, but we are targeting 50 devices for next year and could go up to 1000. To summarize everything:
*  Number of sources: 5 - 50 - 1000 (src)  
*  Frequency of data: 1Hz (f)  
*  Estimated size of data: 0.5MB (s)  
*  Replication factor: 3 (RF)  
I calculated the size per year (replicas included),
* src * f * 60 * 60 * 24 * 365 * s * RF
which gives me,
* 5 sources = 0.24 PB per year 
* 50 sources = 2.4 PB per year 
* 1000 sources = 47.3 PB per year 
So if I respect the 2TB-per-node rule, I get about 120 nodes in the simplest case (5 sources). Am I right?
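
For what it's worth, the same arithmetic in Python (just a sketch of the calculation above, with the 2TB rule applied at the end):

SECONDS_PER_YEAR = 60 * 60 * 24 * 365
s = 0.5   # MB per detection (picture + data)
f = 1     # detections per second per source
RF = 3    # replication factor

for src in (5, 50, 1000):
    total_mb = src * f * SECONDS_PER_YEAR * s * RF
    total_pb = total_mb / 1e9  # MB -> PB
    nodes = total_mb / 2e6     # 2TB of data per node
    print("%4d sources: %5.2f PB/year, ~%d nodes" % (src, total_pb, nodes))

# prints 0.24, 2.37 and 47.30 PB/year, and ~118 nodes for 5 sources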

regards,
Nicolas Jäger


stressing our cluster

2019-09-03 Thread jagernicolas
Hi, I'm testing Cassandra and have some questions:

I'm currently using an image blob column to store images converted to base64. 
Those images are under 1MB. For now I have only one source of images and it 
works without problem. But problems come with the stress-test tool. First, in my 
test I have defined, 

columnspec:
  - name: image
    size: fixed(681356)

the size 681356 is the number of base64 characters in the largest picture from my source.
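
As a sanity check of that number (a one-line sketch; base64 encodes 3 raw bytes as 4 characters):

b64_chars = 681356              # the fixed() size in the stress profile
raw_bytes = b64_chars * 3 // 4  # invert the 4/3 base64 expansion, ignoring padding
print(raw_bytes)                # 511017, i.e. the largest image is ~0.5MB raw

It also means the base64 text is about a third larger than the raw bytes would be.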

when I start the stress test, I get the following error most of the 
time: 

com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra 
timeout during write query at consistency LOCAL_ONE (1 replica were required 
but only 0 acknowledged the write) 

(the timeout is set to its default value of two seconds) 

The cluster is made of three nodes, all of which are VMs in the cloud. The VM 
used to run the stress test is also the seed node. A write test using dd 
showed that we have a low write speed (I talked with the person in charge of 
spinning up the VMs; he confirmed that speed matches the plan we have): 

$ dd if=/dev/zero of=1024M bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 31.3265 s, 34.3 MB/s

First, I want to be sure I understand the error correctly. To me the message, 

com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra 
timeout during write query at consistency LOCAL_ONE (1 replica were required 
but only 0 acknowledged the write) 

means that we are bottlenecking the write operations (memtable to 
sstable), right? 
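
To put numbers on that suspicion, a rough sketch that ignores compression, the commitlog, and compaction rewrites:

disk_mb_s = 34.3           # sequential write speed measured with dd above
row_mb = 681356 / 1e6      # one stress row is dominated by the image column
print(disk_mb_s / row_mb)  # ~50 rows/s per node, as a generous upper bound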

Second, I wonder if having only 3 nodes can be a problem. In my 
understanding of Apache Cassandra, if my partition key is defined to split 
data evenly across the cluster, more nodes means fewer write operations 
(memtable and sstable) performed per node in the same amount of time. Moreover, 
considering the case of the stress test, I wonder if having only 3 nodes 
overloads each node the way a hot spot would. 

Third, the write speed of the disk matters, right? So changing the plan 
we have for our VMs to something with a better write speed should help 
solve the current issue, right? 

About the stress test, what exactly is the meaning of threads? Is each 
thread inserting data asynchronously or synchronously (waiting for an ACK from 
Cassandra)? Should I consider each thread as a source? I ask because the stress 
test never goes over 64 threads (IIRC).

Actually, I don't know for sure which parameter(s) define the number of sources 
during the stress test. To me, the stress test should show the limits of the 
infrastructure and data model under the pressure of an increasing number of 
sources and operations per second. 

Do you have any comments or advice that could help me? 

regards, 

Nicolas Jäger