Re: High read rate on hard-disk

Octavian Rinciog Thu, 18 Jan 2018 03:54:25 -0800

Hi Alain,

Thank you for your response.


> - Other than the 'lock', Counters perform an implicit read before the write
> operation.

From what I know, there is a counter cache [1] that is used to read the
old values of the counters. According to [2], it is used only for UPDATE
requests.
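A toy model may make the implicit read easier to see. The Python sketch below is a hypothetical, heavily simplified model (the class `ToyCounterStore` and its methods are invented for illustration and are not Cassandra's code). It treats each counter UPDATE as a read-modify-write under a per-key lock: the old value comes from a small LRU counter cache when possible, and only a cache miss costs a read from "disk". Writes land in an in-memory memtable that is flushed in one pass, which is how many increments can turn into few disk writes.

```python
import threading
from collections import OrderedDict


class ToyCounterStore:
    """Hypothetical, heavily simplified model of a counter UPDATE path.

    Not Cassandra's implementation: it only illustrates read-before-write.
    To apply an increment, the current value is needed; it comes from a
    small LRU "counter cache" when possible and otherwise from "disk".
    The per-key lock only mirrors the lock mentioned in the thread; the
    shared dicts are not otherwise synchronized, so treat this as a
    single-threaded illustration.
    """

    def __init__(self, cache_capacity: int) -> None:
        self.cache_capacity = cache_capacity
        self.counter_cache = OrderedDict()  # key -> value, least recently used first
        self.memtable = {}                  # buffered writes, flushed later
        self.sstables = {}                  # stand-in for data already on disk
        self.disk_reads = 0                 # reads caused by cache misses
        self._locks = {}                    # one lock per counter key
        self._locks_guard = threading.Lock()

    def _lock_for(self, key: str) -> threading.Lock:
        with self._locks_guard:
            return self._locks.setdefault(key, threading.Lock())

    def _current_value(self, key: str) -> int:
        if key in self.counter_cache:
            self.counter_cache.move_to_end(key)  # cache hit: no disk access
            return self.counter_cache[key]
        if key in self.memtable:
            # Toy shortcut: the memtable holds the latest full value in memory.
            return self.memtable[key]
        self.disk_reads += 1  # cache miss: read the old value from "disk"
        return self.sstables.get(key, 0)

    def update(self, key: str, delta: int) -> int:
        """Read the old value, add delta, and write the result to memory."""
        with self._lock_for(key):
            new_value = self._current_value(key) + delta
            self.memtable[key] = new_value  # no synchronous disk write
            self.counter_cache[key] = new_value
            self.counter_cache.move_to_end(key)
            if len(self.counter_cache) > self.cache_capacity:
                self.counter_cache.popitem(last=False)  # evict the LRU entry
            return new_value

    def flush(self) -> int:
        """Write the whole memtable to "disk" in one pass; return cells written."""
        written = len(self.memtable)
        self.sstables.update(self.memtable)
        self.memtable.clear()
        return written


if __name__ == "__main__":
    store = ToyCounterStore(cache_capacity=1)
    store.update("a", 1)  # miss: disk read 1
    store.update("b", 1)  # miss: disk read 2, evicts "a" from the cache
    store.flush()
    store.update("a", 1)  # miss again: disk read 3
    store.update("a", 1)  # hit: no disk read
    print(store.disk_reads, store.memtable["a"])  # 3 3
```

In this model the number of disk reads tracks cache misses, not the number of UPDATEs, which is one way to read the 0.807 hit rate reported further down.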


> I would say what you are seeing is expected with this use case. Also, I have
> never seen a use case where using RF = 1 is a good idea (except maybe for some
> testing). Be aware this data is weak and can easily be lost (if it's a
> deliberate choice, ignore my comment). On the bright side, you have no
> entropy / consistency issues or need for repairs with RF = 1 :D.

Yes, RF=1 is indeed our choice (mainly because we didn't manage to scale
the counter writes very well, and we assumed that we can afford to lose
some data).
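As a rough check of Alain's point that this read load is expected, the numbers from the original message (quoted below) can be combined in a back-of-the-envelope calculation. The assumptions are not from the thread: a 30-day month, a uniform request rate, and about one disk read per counter cache miss (a real miss may touch several SSTables, or be served from the OS page cache). The cache counters are cumulative and may not cover the same month, so only their ratio is used.

```python
# Back-of-the-envelope check using only numbers quoted in this thread.
# Assumptions (NOT from the thread): a 30-day month, a uniform request rate,
# and about one disk read per counter cache miss.
SECONDS_PER_MONTH = 30 * 24 * 3600

reads_per_month = 3_771_000          # SELECTs ("Read Count")
updates_per_month = 3_401_236_000    # counter UPDATEs ("Write Count")
cache_hits = 2_799_913_189           # from the Counter Cache line
cache_requests = 3_469_459_479
observed_reads_per_s = 263.65        # r/s column of "iostat -x"

select_update_ratio = reads_per_month / updates_per_month
select_per_s = reads_per_month / SECONDS_PER_MONTH
updates_per_s = updates_per_month / SECONDS_PER_MONTH
miss_ratio = 1 - cache_hits / cache_requests  # only the ratio is meaningful here
predicted_disk_reads_per_s = updates_per_s * miss_ratio

print(f"SELECT/UPDATE ratio:      {select_update_ratio:.4f}")
print(f"SELECTs per second:       {select_per_s:.1f}")
print(f"UPDATEs per second:       {updates_per_s:.1f}")
print(f"counter cache miss ratio: {miss_ratio:.3f}")
print(f"predicted disk reads/s:   {predicted_disk_reads_per_s:.0f}")
print(f"observed disk reads/s:    {observed_reads_per_s:.0f}")
```

With these inputs it works out to roughly 1.5 SELECTs/s against about 1,312 UPDATEs/s and a miss ratio of about 0.193, giving about 253 predicted reads/s versus 263.65 r/s from iostat. The closeness may be partly coincidental, but it is consistent with the reads coming from counter cache misses rather than from SELECT queries.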


[1] https://apache.googlesource.com/cassandra/+/refs/heads/trunk/src/java/org/apache/cassandra/db/CounterMutation.java#193
[2] https://issues.apache.org/jira/browse/CASSANDRA-12500?focusedCommentId=15464023&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15464023


2018-01-18 12:51 GMT+02:00 Alain RODRIGUEZ <arodr...@gmail.com>:
> Hello Octavian,
>
>>
>> I have a counter table (RF=1)
>>
>> The SELECT vs. UPDATE request ratio is about 0.001 (Read Count: 3771000,
>> Write Count: 3401236000, in one month).
>
>
>> The problem is that our disk read rate is always near 30 MB/s, while our
>> write rate is only around 500 KB/s.
>
>
> I did not read all your numbers, but here are the internal details you could
> be missing:
>
> - Other than the 'lock', Counters perform an implicit read before the write
> operation. To increment, you need to know the past value. This was true the
> last time I used them, and I believe it is still the case today, with no
> real workaround.
> - Writes do not hit the disk synchronously. Instead, they are stored in the
> Memtable and flushed only once, sequentially and efficiently. Compaction
> then merges partitions afterwards, asynchronously.
>
> I would say what you are seeing is expected with this use case. Also, I have
> never seen a use case where using RF = 1 is a good idea (except maybe for some
> testing). Be aware this data is weak and can easily be lost (if it's a
> deliberate choice, ignore my comment). On the bright side, you have no
> entropy / consistency issues or need for repairs with RF = 1 :D.
>
> C*heers,
> -----------------------
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2018-01-17 17:40 GMT+00:00 Octavian Rinciog <octavian.rinc...@gmail.com>:
>>
>> Hello!
>>
>> I am using Cassandra 3.10 on Ubuntu 14.04, and I have a counter
>> table (RF=1) with the following schema:
>>
>> CREATE TABLE edges (
>>     src_id text,
>>     src_type text,
>>     source text,
>>     weight counter,
>>     PRIMARY KEY ((src_id, src_type), source)
>> ) WITH compaction = {
>>     'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
>>     'max_threshold': '32',
>>     'min_threshold': '4'
>> };
>>
>> The SELECT vs. UPDATE request ratio is about 0.001 (Read Count: 3771000,
>> Write Count: 3401236000, in one month).
>>
>> We have Counter Cache enabled:
>>
>> Counter Cache          : entries 1018782, size 256 MiB, capacity 256 MiB, 2799913189 hits, 3469459479 requests, 0.807 recent hit rate, 7200 save period in seconds
>>
>> The problem is that our disk read rate is always near 30 MB/s, while our
>> write rate is only around 500 KB/s.
>>
>> One example of the output of "iostat -x" is:
>>
>> Device:    rrqm/s   wrqm/s      r/s      w/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
>> sdb          0.06     1.04   263.65     2.04 28832.42   572.53   146.07     0.36     1.35     0.74    81.16     1.27    33.81
>>
>> Also, with iotop we saw that there are about 8 threads, each reading at
>> around 3 MB/s.
>>
>> Total DISK READ :      22.73 M/s | Total DISK WRITE :     494.35 K/s
>> Actual DISK READ:      22.62 M/s | Actual DISK WRITE:     528.57 K/s
>>   TID  PRIO  USER    DISK READ>  DISK WRITE  SWAPIN      IO    COMMAND
>> 14793 be/4 cassandra 3.061 M/s    0.0010 B/s  0.00 % 93.27 % java -Dcassandra.fd_max_interval_ms=400
>>
>> The output of strace on one of these threads is:
>>
>> strace -cp 14793
>> Process 14793 attached
>> ^CProcess 14793 detached
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  99.85   32.118518          57    567288    256251 futex
>>   0.15    0.048822           3     15339           write
>>   0.00    0.000000           0         1           rt_sigreturn
>> ------ ----------- ----------- --------- --------- ----------------
>> 100.00   32.167340                582628    256251 total
>>
>>
>> Although iotop shows that this thread is reading at 3 MB/s, there is no
>> read syscall in the strace output.
>>
>> I want to ask whether the futex calls are actually responsible for the read
>> rate, and how we can debug this problem further.
>>
>> By the way, there are no compaction tasks and no SELECT queries in
>> progress.
>>
>> Also, I know that a lock is acquired for each update [1].
>>
>> Thank you,
>>
>>
>> [1] https://apache.googlesource.com/cassandra/+/refs/heads/trunk/src/java/org/apache/cassandra/db/CounterMutation.java#121
>> --
>> Octavian Rinciog
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>
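The question above about the missing read syscalls has one possible explanation, offered here as an assumption rather than something confirmed in this thread: Cassandra 3.x can memory-map SSTable files (see the `disk_access_mode` option in cassandra.yaml), and reads through a mapping are satisfied by page faults that the kernel resolves from disk, not by read() calls, so `strace -c` would not count them. The Python sketch below shows that mechanism on an ordinary file; the helper name `read_via_mmap` is hypothetical.

```python
import mmap
import os
import tempfile


def read_via_mmap(path: str, offset: int, length: int) -> bytes:
    """Read bytes from a file through a memory mapping (hypothetical helper).

    After mmap(), touching the mapped pages triggers page faults that the
    kernel satisfies from disk or the page cache; no read() syscall is
    issued for the data itself.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"counter-sstable-bytes" * 1000)
        path = tmp.name
    try:
        print(read_via_mmap(path, 0, 7))  # b'counter'
    finally:
        os.remove(path)
```

Running such a script under `strace -c` should show mmap calls but no read() for the mapped data, which would match what iotop and strace report for the Cassandra threads if the SSTables are in fact mapped.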



-- 
Octavian Rinciog

