from:"Benjamin Roth"

Re: Cassandra seems slow when having many read operations

2017-07-22 Thread benjamin roth

Chunk size:
For us it made a 20x difference in read io. But it depends a lot on the use
case.
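
To make the chunk size effect concrete: a compressed SSTable is read in whole chunks, so when the page cache misses, even a small row costs at least one full chunk of disk I/O. The sketch below is only back-of-the-envelope arithmetic. The 1 KB row size is an assumed example value (not a number from this thread) and the helper name is made up for illustration.

```python
import math


def chunk_read_amplification(row_bytes: int, chunk_kb: int) -> float:
    """Bytes read from disk per byte of row data on a page-cache miss.

    Assumes whole chunks are read and the row starts at a chunk boundary
    (best case; a row straddling a boundary touches one more chunk).
    """
    chunk_bytes = chunk_kb * 1024
    chunks_touched = math.ceil(row_bytes / chunk_bytes)
    return chunks_touched * chunk_bytes / row_bytes


for chunk_kb in (64, 16, 4):
    amp = chunk_read_amplification(row_bytes=1024, chunk_kb=chunk_kb)
    print(f"{chunk_kb:>2} KB chunks: {amp:.0f}x read amplification for a 1 KB row")
```

With larger rows or a well-working page cache the gap shrinks, which fits the "depends a lot on the use case" remark. Also keep in mind that existing SSTables keep their old chunk size until they are rewritten.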

On 22.07.2017 08:32, "Fay Hou [Storage Service]" <
fay...@coupang.com> wrote:

> Hey Felipe:
>
> When you say you increased memory from 16GB to 24GB, I think you mean you
> increased the heap to 24GB. Do you use CMS or G1GC?
> Did you change any other parameters?
> As for the chunk size, we found that changing it from 64kb to 16kb didn't
> make a difference in an environment with a low key cache hit rate.
>
>
>
> On Fri, Jul 21, 2017 at 9:27 PM, benjamin roth <brs...@gmail.com> wrote:
>
>> Apart from all that you can try to reduce the compression chunk size from
>> the default 64kb to 16kb or even down to 4kb. This can help a lot if your
>> read io on disk is very high and the page cache is not efficient.
>>
>> On 21.07.2017 23:03, "Petrus Gomes" <petru...@gmail.com> wrote:
>>
>>> Thanks a lot for sharing the result.
>>>
>>> Good luck.
>>> ;-)
>>> Take care.
>>> Petrus Silva
>>>
>>> On Fri, Jul 21, 2017 at 12:19 PM, Felipe Esteves <
>>> felipe.este...@b2wdigital.com> wrote:
>>>
>>>> Hi, Petrus,
>>>>
>>>> Seems we've solved the problem, but it wasn't related to repairing the
>>>> cluster or to disk latency.
>>>> I've increased the memory available for Cassandra from 16GB to 24GB and
>>>> the performance was much improved!
>>>> The main symptom we've observed in Opscenter was a
>>>> significant decrease in the total compactions graph.
>>>>
>>>> Felipe Esteves
>>>>
>>>> Tecnologia
>>>>
>>>> felipe.este...@b2wdigital.com <seu.em...@b2wdigital.com>
>>>>
>>>>
>>>>
>>>> 2017-07-15 3:23 GMT-03:00 Petrus Gomes <petru...@gmail.com>:
>>>>
>>>>> Hi Felipe,
>>>>>
>>>>> Yes, try it and let us know how it goes.
>>>>>
>>>>> Thanks,
>>>>> Petrus Silva.
>>>>>
>>>>> On Fri, Jul 14, 2017 at 11:37 AM, Felipe Esteves <
>>>>> felipe.este...@b2wdigital.com> wrote:
>>>>>
>>>>>> Hi Petrus, thanks for the feedback.
>>>>>>
>>>>>> I couldn't find the percent repaired in nodetool info. The C* version is
>>>>>> 2.1.8, maybe it's something newer than that?
>>>>>>
>>>>>> I'm analyzing this thread about num_token.
>>>>>>
>>>>>> Compaction is "compaction_throughput_mb_per_sec: 16", I don't get
>>>>>> pending compactions in Opscenter.
>>>>>>
>>>>>> One point I've noticed is that Opscenter shows "OS: Disk Latency" max
>>>>>> with high values when the problem occurs, but this isn't reflected when
>>>>>> monitoring the servers directly; in those tools the IO and latency of
>>>>>> the disks seem ok.
>>>>>> But it seems to me that "read repair attempted" is a bit high, maybe that
>>>>>> explains the latency in reads. I will try to run a repair on the cluster
>>>>>> to see how it goes.
>>>>>>
>>>>>> Felipe Esteves
>>>>>>
>>>>>> Tecnologia
>>>>>>
>>>>>> felipe.este...@b2wdigital.com <seu.em...@b2wdigital.com>
>>>>>>
>>>>>> Tel.: (21) 3504-7162 ramal 57162
>>>>>>
>>>>>> Skype: felipe2esteves
>>>>>>
>>>>>> 2017-07-13 15:02 GMT-03:00 Petrus Gomes <petru...@gmail.com>:
>>>>>>
>>>>>>> What is your Percent Repaired when you run "nodetool info"?
>>>>>>>
>>>>>>> Search for the topic:
>>>>>>> "reduced num_token = improved performance ??"
>>>>>>> People were discussing that there.
>>>>>>>
>>>>>>> How is your compaction configured?
>>>>>>>
>>>>>>> Could you run the same process from the command line to get a measurement?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Petrus Silva
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 13, 2017 at 7:49 AM, Felipe Esteves <
>>>>>>> felipe

Re: Cassandra seems slow when having many read operations

2017-07-21 Thread benjamin roth

Apart from all that you can try to reduce the compression chunk size from
the default 64kb to 16kb or even down to 4kb. This can help a lot if your
read io on disk is very high and the page cache is not efficient.

On 21.07.2017 23:03, "Petrus Gomes" wrote:

> Thanks a lot for sharing the result.
>
> Good luck.
> ;-)
> Take care.
> Petrus Silva
>
> On Fri, Jul 21, 2017 at 12:19 PM, Felipe Esteves <
> felipe.este...@b2wdigital.com> wrote:
>
>> Hi, Petrus,
>>
>> Seems we've solved the problem, but it wasn't related to repairing the
>> cluster or to disk latency.
>> I've increased the memory available for Cassandra from 16GB to 24GB and
>> the performance was much improved!
>> The main symptom we've observed in Opscenter was a significant decrease
>> in the total compactions graph.
>>
>> Felipe Esteves
>>
>> Tecnologia
>>
>> felipe.este...@b2wdigital.com 
>>
>>
>>
>> 2017-07-15 3:23 GMT-03:00 Petrus Gomes :
>>
>>> Hi Felipe,
>>>
>>> Yes, try it and let us know how it goes.
>>>
>>> Thanks,
>>> Petrus Silva.
>>>
>>> On Fri, Jul 14, 2017 at 11:37 AM, Felipe Esteves <
>>> felipe.este...@b2wdigital.com> wrote:
>>>
>>>> Hi Petrus, thanks for the feedback.
>>>>
>>>> I couldn't find the percent repaired in nodetool info. The C* version is
>>>> 2.1.8, maybe it's something newer than that?
>>>>
>>>> I'm analyzing this thread about num_token.
>>>>
>>>> Compaction is "compaction_throughput_mb_per_sec: 16", I don't get
>>>> pending compactions in Opscenter.
>>>>
>>>> One point I've noticed is that Opscenter shows "OS: Disk Latency" max
>>>> with high values when the problem occurs, but this isn't reflected when
>>>> monitoring the servers directly; in those tools the IO and latency of
>>>> the disks seem ok.
>>>> But it seems to me that "read repair attempted" is a bit high, maybe that
>>>> explains the latency in reads. I will try to run a repair on the cluster
>>>> to see how it goes.
>>>>
>>>> Felipe Esteves
>>>>
>>>> Tecnologia
>>>>
>>>> felipe.este...@b2wdigital.com
>>>>
>>>> Tel.: (21) 3504-7162 ramal 57162
>>>>
>>>> Skype: felipe2esteves
>>>>
>>>> 2017-07-13 15:02 GMT-03:00 Petrus Gomes :
>>>>
>>>>> What is your Percent Repaired when you run "nodetool info"?
>>>>>
>>>>> Search for the topic:
>>>>> "reduced num_token = improved performance ??"
>>>>> People were discussing that there.
>>>>>
>>>>> How is your compaction configured?
>>>>>
>>>>> Could you run the same process from the command line to get a measurement?
>>>>>
>>>>> Thanks,
>>>>> Petrus Silva
>>>>>
>>>>> On Thu, Jul 13, 2017 at 7:49 AM, Felipe Esteves <
>>>>> felipe.este...@b2wdigital.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a Cassandra 2.1 cluster running on AWS that receives high read
>>>>>> loads, jumping from 100k requests to 400k requests, for example. Then it
>>>>>> normalizes and later comes another period of high throughput.
>>>>>>
>>>>>> To the application, it appears that Cassandra is slow. However, CPU
>>>>>> and disk use are ok in every instance, and the row cache is enabled with
>>>>>> an almost 100% hit rate.
>>>>>>
>>>>>> The logs from the Cassandra instances don't have any errors, tombstone
>>>>>> messages or anything like that. It's mostly compactions and
>>>>>> G1GC operations.
>>>>>>
>>>>>> Any hints on where to investigate more?
>>>>>>
>>>>>>
>>>>>> Felipe Esteves
>>>>>>

Re: Corrupted commit log prevents Cassandra start

2017-07-07 Thread benjamin roth

Hi Hannu,

I remember there have been discussions about this in the past. Most
probably there is already a JIRA for this.
I roughly remember the consensus being:
- The default behaviour should remain
- It should be configurable to the needs and preferences of the DBA
- It should at least spit out errors in the logs

... of course it would be even better to fix the underlying issue so that
commit logs don't get corrupted in the first place, but I remember that this
is not so easy due to some "architectural implications" of Cassandra. IIRC Ed
Capriolo posted something related to that some months ago.

For a quick fix, I'd recommend:
- Delete the affected log file
- Start the node
- Run a full-range (not -pr) repair on that node
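
A small sketch of the first step, with one deliberate change: it moves the affected segment aside instead of deleting it, so the file stays available for a bug report. The function name and the directory layout are hypothetical; take the real commit log location from `commitlog_directory` in cassandra.yaml, and only do this while the node is stopped.

```python
import shutil
from pathlib import Path


def quarantine_commitlog(segment: Path, quarantine_dir: Path) -> Path:
    """Move one commit log segment out of the commit log directory.

    Meant to be run while the node is stopped. Returns the new location.
    """
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    target = quarantine_dir / segment.name
    if target.exists():
        # Never clobber an earlier quarantined copy.
        raise FileExistsError(f"{target} already exists, refusing to overwrite")
    shutil.move(str(segment), str(target))
    return target
```

For the error quoted below, the segment would be CommitLog-6-1498576271195.log. After moving it, start the node and run the repair as described above, because the mutations in that segment are missing on this node.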

2017-07-07 10:57 GMT+02:00 Hannu Kröger :

> Hello,
>
> We had a test server crash for some reason (probably not related to
> Cassandra), and now when trying to start Cassandra, it gives the following error:
>
> ERROR [main] 2017-07-06 09:29:56,140 JVMStabilityInspector.java:82 -
> Exiting due to error while processing commit log during initialization.
> org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException:
> Mutation checksum failure at 24240116 in Next section at 24239690 in
> CommitLog-6-1498576271195.log
> at org.apache.cassandra.db.commitlog.CommitLogReader.readSection(CommitLogReader.java:332) [apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:201) [apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:84) [apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:140) [apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:177) [apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:158) [apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:326) [apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:601) [apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:735) [apache-cassandra-3.10.jar:3.10]
>
> Shouldn’t Cassandra tolerate this situation?
>
> Of course we can delete commit logs and life goes on. But isn’t this a bug
> or something?
>
> Hannu
>
>

Re: Is it possible to repair a single partition.

2017-06-27 Thread benjamin roth

Then the partition is too big, or too many sstables contain data for that
partition, so the query times out. You can run a manual compaction on that
table. That helped me several times.

+ I hope you are not trying to read that partition at once. Please use
paging to query large partitions.
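
On the `nodetool getendpoints` error quoted further down: the "Non-hex characters" message shows the key being parsed as hex, and the rejected string starts with the CQL-style "0x" prefix. The sketch below builds the key argument under two assumptions that should be checked against your version: the blob component must be plain hex without "0x", and the components of a composite partition key are joined with ':'. The helper name and the text value "some-text" are placeholders.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def getendpoints_key(blob_literal: str, text_part: str) -> str:
    """Build the key argument for `nodetool getendpoints` for a (blob, text) key.

    Assumption: blob as bare hex (no "0x"), components joined with ':'.
    """
    has_prefix = blob_literal.lower().startswith("0x")
    hex_part = blob_literal[2:] if has_prefix else blob_literal
    if not hex_part or any(c not in HEX_DIGITS for c in hex_part):
        raise ValueError(f"non-hex characters in {hex_part!r}")
    if len(hex_part) % 2 != 0:
        raise ValueError("hex string must have an even number of digits")
    if ":" in text_part:
        # Escaping of ':' inside a component is not handled in this sketch.
        raise ValueError("':' inside a key component is not supported here")
    return f"{hex_part}:{text_part}"


print(getendpoints_key("0xbbdcbf21ffb72115599ca915634fcb85", "some-text"))
```

The printed string would be passed as the last argument, after the keyspace and table names.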

2017-06-27 21:25 GMT+02:00 Pranay akula:

> Hi Jonathan,
> I tried running the query at consistency ALL multiple times; every time I
> run it I get the same output: the coordinator is timing out. I can see
> it wasn't helping much, and there were multiple read repair drops. Is there
> any other way to get that partition fixed?
>
>
> Thanks
> Pranay.
>
>
> On Tue, Jun 27, 2017 at 3:13 PM, Jonathan Haddad 
> wrote:
>
>> Query it at consistency ALL and let read repair do its thing.
>> On Tue, Jun 27, 2017 at 11:48 AM Pranay akula 
>> wrote:
>>
>>> I have a CF with a composite partition key; the partition key consists of
>>> blob and text data types.
>>> The select query against this particular partition is timing out, so to
>>> debug it further I ran nodetool getendpoints and got the error below:
>>>
>>> error: Non-hex characters in 0xbbdcbf21ffb72115599ca915634fcb85
>>> -- StackTrace --
>>> java.lang.NumberFormatException: Non-hex characters in
>>> 0xbbdcbf21ffb72115599ca915634fcb85
>>>
>>>
>>> So how can I get the endpoints for that particular partition, and the
>>> token as well, so I can run repairs on that token range?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Thanks
>>> Pranay.
>>>
>>
>

Re: Cassandra Cluster issues

2017-05-08 Thread benjamin roth

Hm, that question is like "My car does not start - what's the problem?".
You have to monitor, monitor, monitor, monitor. I'd strongly advise you to
graph as many metrics as you can. Read them from the JMX interface,
write them to a TSDB and visualize them, e.g. with Grafana.
Then read logs, trace your queries, and check all the system metrics like CPU
consumption, disk IO, network IO, memory usage and Java GC pauses.

Then you will be able to find the bottleneck.

2017-05-08 15:15 GMT+02:00 Mehdi Bada :

> Dear Cassandra Users,
>
> I have had some issues for a few days with the following cluster:
>
> - 5 nodes
> - Cassandra 3.7
> - 2 seed nodes
> - 1 keyspace with RF=2, 300 GB per node, WRITE_LEVEL=ONE, READ_LEVEL=ONE
> - 1 enormous table (90% of the keyspace)
> - TTL on each inserted row
>
> The cluster is write oriented. All machines are consuming between 5 - 10 %
> of the CPU and 45 % of RAM.
>
> The cluster has been very slow since the last repair, and not all writes
> have been done... I don't know how to start debugging my cluster.
>
> Do you have any ideas?
>
>
> Many thanks in advance
>
> Regards
> Mehdi Bada

Re: TRUNCATE on a disk almost full - possible?

2017-04-21 Thread benjamin roth

Truncate needs no space. It just creates a hard link of all affected
SSTables under the corresponding -SNAPSHOT dir (at least with default
settings) and then removes the SSTables.
Also this operation should be rather fast as it is mostly a file-deletion
process with some metadata updates.
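
The hard-link behaviour is easy to see outside Cassandra. The sketch below uses a 1 MiB temporary file as a stand-in for an SSTable (file and directory names are made up): the link shares the inode with the original, so it needs no extra data space, and the data stays reachable through the link after the original name is removed.

```python
import os
import tempfile


def snapshot_by_hard_link(data_file: str, snapshot_dir: str) -> str:
    """Hard-link data_file into snapshot_dir and return the link path."""
    os.makedirs(snapshot_dir, exist_ok=True)
    link_path = os.path.join(snapshot_dir, os.path.basename(data_file))
    os.link(data_file, link_path)
    return link_path


with tempfile.TemporaryDirectory() as tmp:
    sstable = os.path.join(tmp, "example-Data.db")  # stand-in for an SSTable
    with open(sstable, "wb") as f:
        f.write(b"x" * 1024 * 1024)

    link = snapshot_by_hard_link(sstable, os.path.join(tmp, "snapshots", "truncated"))
    same_inode = os.stat(sstable).st_ino == os.stat(link).st_ino
    link_count = os.stat(link).st_nlink

    os.remove(sstable)  # the live file goes away, the snapshot link remains
    still_readable = os.path.getsize(link) == 1024 * 1024

print(f"same inode: {same_inode}, links: {link_count}, readable: {still_readable}")
```

The flip side is that the disk space is only released once the snapshot itself is removed, for example with `nodetool clearsnapshot` or by deleting the snapshot directory.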

2017-04-21 11:21 GMT+02:00 Kunal Gangakhedkar :

> Hi all,
>
> We have a CF that's grown too large - it's not getting actively used in
> the app right now.
> The on-disk size of the . directory is ~407GB and I have only
> ~40GB free left on the disk.
>
> I understand that if I trigger a TRUNCATE on this CF, cassandra will try
> to take snapshot.
> My question:
> Is the ~40GB enough to safely truncate this table?
>
> I will manually remove the . directory once the truncate is
> completed.
>
> Also, while browsing through earlier msgs regarding truncate, I noticed
> that it's possible to get an OperationTimedOut exception. Does that stop
> the truncate operation?
>
> Is there any other safe way to clean up the CF?
>
> Thanks,
> Kunal
>

Re: WriteTimeoutException with LWT after few milliseconds

2017-04-19 Thread benjamin roth

Thanks, Jeff!

As soon as I have some spare time I will try to reproduce and open a Jira
for it.

2017-04-19 16:27 GMT+02:00 Jeff Jirsa <jji...@apache.org>:

>
>
> On 2017-04-13 05:13 (-0700), benjamin roth <brs...@gmail.com> wrote:
> > I found out that when the WTEs occur, another process was already
> > inserting the same primary key, because I found duplicates in some places
> > that perfectly match the WTE logs.
> >
> > Does anybody know why this throws a WTE instead of returning
> > [applied] = false?
> > This is quite confusing!
> >
>
> Certainly seems wrong. May want to open a JIRA, especially if it's
> reproducible. Should mention what version and client you're using.
>
>

Re: Counter performance

2017-04-17 Thread benjamin roth

> activity | timestamp | source | source_elapsed | client
> ---------+-----------+--------+----------------+-------
> Execute CQL3 query | 2017-04-17 18:31:49.622000 | cassandra-01 | 0 | cassandra-01
> Parsing select counter_value from counter table limit 10; [SharedPool-Worker-4] | 2017-04-17 18:31:49.622000 | cassandra-01 | 142 | cassandra-01
> Preparing statement [SharedPool-Worker-4] | 2017-04-17 18:31:49.623000 | cassandra-01 | 217 | cassandra-01
> RANGE_SLICE message received from /cassandra-01 [MessagingService-Incoming-/cassandra-01] | 2017-04-17 18:31:49.623000 | cassandra-05 | 18 | cassandra-01
> Computing ranges to query [SharedPool-Worker-4] | 2017-04-17 18:31:49.623000 | cassandra-01 | 335 | cassandra-01
> Executing seq scan across 2 sstables for (min(-9223372036854775808), max(-9173699490866503541)] [SharedPool-Worker-2] | 2017-04-17 18:31:49.623000 | cassandra-05 | 141 | cassandra-01
> Submitting range requests on 2561 ranges with a concurrency of 1 (861.45 rows per range expected) [SharedPool-Worker-4] | 2017-04-17 18:31:49.623001 | cassandra-01 | 1060 | cassandra-01
> Enqueuing request to /cassandra-05 [SharedPool-Worker-4] | 2017-04-17 18:31:49.623001 | cassandra-01 | 1134 | cassandra-01
> Submitted 1 concurrent range requests [SharedPool-Worker-4] | 2017-04-17 18:31:49.624000 | cassandra-01 | 1225 | cassandra-01
> Sending RANGE_SLICE message to /cassandra-05 [MessagingService-Outgoing-/cassandra-05] | 2017-04-17 18:31:49.624000 | cassandra-01 | 1257 | cassandra-01
> Read 10 live and 0 tombstone cells [SharedPool-Worker-2] | 2017-04-17 18:31:49.627000 | cassandra-05 | 3350 | cassandra-01
> Enqueuing response to /cassandra-01 [SharedPool-Worker-2] | 2017-04-17 18:31:49.627000 | cassandra-05 | 3394 | cassandra-01
> Sending REQUEST_RESPONSE message to /cassandra-01 [MessagingService-Outgoing-/cassandra-01] | 2017-04-17 18:31:49.627000 | cassandra-05 | 3453 | cassandra-01
> REQUEST_RESPONSE message received from /cassandra-05 [MessagingService-Incoming-/cassandra-05] | 2017-04-17 18:31:49.628000 | cassandra-01 | 5250 | cassandra-01
> Processing response from /cassandra-05 [SharedPool-Worker-6] | 2017-04-17 18:31:49.628000 | cassandra-01 | 5319 | cassandra-01
> Request complete | 2017-04-17 18:31:49.628595 | cassandra-01 | 6595 | cassandra-01
>
>
>
>
> *From:* benjamin roth [mailto:brs...@gmail.com]
> *Sent:* Monday, April 17, 2017 6:17 PM
> *To:* user@cassandra.apache.org
> *Subject:* RE: Counter performance
>
>
>
> Just run some queries on counter tables. Some on regular tables. Look at
> traces and then compare. You don't need to do anything with application
> code. You can also set trace probability on a table level and then analyze
> the queries.
>
>
>
> On 17.04.2017 17:07, "Eren Yilmaz" <eren.yil...@sebit.com.tr> wrote:
>
> I can’t add tracing using driver – Usergrid code is way too complex. When
> I look at logging the slow queries on the C* side, it says the feature is
> added in version 3.10 (https://issues.apache.org/
> jira/browse/CASSANDRA-12403), and we use 3.7. Any other ways to log slow
> queries in this version? Or, what do we expect with this log output?
>
>
>
> *From:* benjamin roth [mailto:brs...@gmail.com]
> *Sent:* Monday, April 17, 2017 5:44 PM
> *To:* user@cassandra.apache.org
> *Subject:* RE: Counter performance
>
>
>
> You could enable a slow query log and then trace single queries couldn't
> you?
>
>
>
> On 17.04.2017 16:31, "Eren Yilmaz" <eren.yil...@sebit.com.tr> wrote:
>
> I can’t trace selects on the application tables unfortu

RE: Counter performance

2017-04-17 Thread benjamin roth

Just run some queries on counter tables. Some on regular tables. Look at
traces and then compare. You don't need to do anything with application
code. You can also set trace probability on a table level and then analyze
the queries.
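
One way to compare traces is to look, per node, for the largest jump in source_elapsed between consecutive trace events, since that is where the time goes. The sketch below does this for a handful of rows; the sample values are copied from the trace Eren posted later in this thread (activity names shortened), and the function name is made up for illustration.

```python
from collections import defaultdict


def largest_gaps(events):
    """For each source node, return (gap_us, previous_activity, activity)
    for the largest jump in source_elapsed between consecutive events."""
    by_source = defaultdict(list)
    for activity, source, elapsed_us in events:
        by_source[source].append((elapsed_us, activity))
    result = {}
    for source, rows in by_source.items():
        rows.sort()  # order events by elapsed time on that node
        gaps = [
            (cur[0] - prev[0], prev[1], cur[1])
            for prev, cur in zip(rows, rows[1:])
        ]
        if gaps:
            result[source] = max(gaps)
    return result


# (activity, source, source_elapsed in microseconds)
events = [
    ("Sending RANGE_SLICE message", "cassandra-01", 1257),
    ("REQUEST_RESPONSE message received", "cassandra-01", 5250),
    ("Processing response", "cassandra-01", 5319),
    ("Request complete", "cassandra-01", 6595),
    ("RANGE_SLICE message received", "cassandra-05", 18),
    ("Executing seq scan across 2 sstables", "cassandra-05", 141),
    ("Read 10 live and 0 tombstone cells", "cassandra-05", 3350),
    ("Enqueuing response", "cassandra-05", 3394),
]

gaps = largest_gaps(events)
for source in sorted(gaps):
    gap_us, before, after = gaps[source]
    print(f"{source}: {gap_us} us between '{before}' and '{after}'")
```

Running the same comparison on a trace from a regular table shows whether the counter reads spend their time in a different step.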

On 17.04.2017 17:07, "Eren Yilmaz" <eren.yil...@sebit.com.tr> wrote:

> I can’t add tracing using driver – Usergrid code is way too complex. When
> I look at logging the slow queries on the C* side, it says the feature is
> added in version 3.10 (https://issues.apache.org/
> jira/browse/CASSANDRA-12403), and we use 3.7. Any other ways to log slow
> queries in this version? Or, what do we expect with this log output?
>
>
>
> *From:* benjamin roth [mailto:brs...@gmail.com]
> *Sent:* Monday, April 17, 2017 5:44 PM
> *To:* user@cassandra.apache.org
> *Subject:* RE: Counter performance
>
>
>
> You could enable a slow query log and then trace single queries couldn't
> you?
>
>
>
> On 17.04.2017 16:31, "Eren Yilmaz" <eren.yil...@sebit.com.tr> wrote:
>
> I can’t trace selects on the application tables unfortunately. The
> application is Usergrid, and it stores the data in binary. We have little
> control over Usergrid-created data.
>
>
>
> *From:* benjamin roth [mailto:brs...@gmail.com]
> *Sent:* Monday, April 17, 2017 4:12 PM
>
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: Counter performance
>
>
>
> Do you see difference when tracing the selects?
>
>
>
> 2017-04-17 13:36 GMT+02:00 Eren Yilmaz <eren.yil...@sebit.com.tr>:
>
> Application tables use LeveledCompactionStrategy. At first, counter tables
> were created by default SizeTieredCompactionStrategy, but we changed them
> to LeveledCompactionStrategy then.
>
>
>
> compaction = { 'class' : 
> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy',
> 'sstable_size_in_mb' : 512 }
>
>
>
> *From:* benjamin roth [mailto:brs...@gmail.com]
> *Sent:* Monday, April 17, 2017 12:12 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Counter performance
>
>
>
> Do you have a different compaction strategy on the counter tables?
>
>
>
> 2017-04-17 10:07 GMT+02:00 Eren Yilmaz <eren.yil...@sebit.com.tr>:
>
> We are using Cassandra (3.7) counter tables in our application, and there
> are about 10 counter tables. The counter tables are in a separate keyspace
> with RF=3 (total 10 nodes). The tables are read-heavy, for each web request
> to the application, we read at least 20 counter values. The counter reads
> are very slow comparing to the other application data reads from cassandra,
> and sometimes the reads put extra heavy CPU load on some nodes.
>
>
>
> Are there any tips, or best practices for increasing the performance of
> counter tables?
>
>
>
>
>
>
>

RE: Counter performance

2017-04-17 Thread benjamin roth

You could enable a slow query log and then trace single queries, couldn't
you?

On 17.04.2017 16:31, "Eren Yilmaz" <eren.yil...@sebit.com.tr> wrote:

I can’t trace selects on the application tables unfortunately. The
application is Usergrid, and it stores the data in binary. We have little
control over Usergrid-created data.



*From:* benjamin roth [mailto:brs...@gmail.com]
*Sent:* Monday, April 17, 2017 4:12 PM

*To:* user@cassandra.apache.org
*Subject:* Re: Counter performance



Do you see difference when tracing the selects?



2017-04-17 13:36 GMT+02:00 Eren Yilmaz <eren.yil...@sebit.com.tr>:

Application tables use LeveledCompactionStrategy. At first, counter tables
were created by default SizeTieredCompactionStrategy, but we changed them
to LeveledCompactionStrategy then.



compaction = { 'class' :
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy',
'sstable_size_in_mb' : 512 }



*From:* benjamin roth [mailto:brs...@gmail.com]
*Sent:* Monday, April 17, 2017 12:12 PM
*To:* user@cassandra.apache.org
*Subject:* Re: Counter performance



Do you have a different compaction strategy on the counter tables?



2017-04-17 10:07 GMT+02:00 Eren Yilmaz <eren.yil...@sebit.com.tr>:

We are using Cassandra (3.7) counter tables in our application, and there
are about 10 counter tables. The counter tables are in a separate keyspace
with RF=3 (total 10 nodes). The tables are read-heavy, for each web request
to the application, we read at least 20 counter values. The counter reads
are very slow comparing to the other application data reads from cassandra,
and sometimes the reads put extra heavy CPU load on some nodes.



Are there any tips, or best practices for increasing the performance of
counter tables?

Re: Counter performance

2017-04-17 Thread benjamin roth

Do you see a difference when tracing the selects?

2017-04-17 13:36 GMT+02:00 Eren Yilmaz <eren.yil...@sebit.com.tr>:

> Application tables use LeveledCompactionStrategy. At first, counter tables
> were created by default SizeTieredCompactionStrategy, but we changed them
> to LeveledCompactionStrategy then.
>
>
>
> compaction = { 'class' : 
> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy',
> 'sstable_size_in_mb' : 512 }
>
>
>
> *From:* benjamin roth [mailto:brs...@gmail.com]
> *Sent:* Monday, April 17, 2017 12:12 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Counter performance
>
>
>
> Do you have a different compaction strategy on the counter tables?
>
>
>
> 2017-04-17 10:07 GMT+02:00 Eren Yilmaz <eren.yil...@sebit.com.tr>:
>
> We are using Cassandra (3.7) counter tables in our application, and there
> are about 10 counter tables. The counter tables are in a separate keyspace
> with RF=3 (total 10 nodes). The tables are read-heavy, for each web request
> to the application, we read at least 20 counter values. The counter reads
> are very slow comparing to the other application data reads from cassandra,
> and sometimes the reads put extra heavy CPU load on some nodes.
>
>
>
> Are there any tips, or best practices for increasing the performance of
> counter tables?
>
>
>

Re: Counter performance

2017-04-17 Thread benjamin roth

Do you have a different compaction strategy on the counter tables?

2017-04-17 10:07 GMT+02:00 Eren Yilmaz :

> We are using Cassandra (3.7) counter tables in our application, and there
> are about 10 counter tables. The counter tables are in a separate keyspace
> with RF=3 (total 10 nodes). The tables are read-heavy, for each web request
> to the application, we read at least 20 counter values. The counter reads
> are very slow comparing to the other application data reads from cassandra,
> and sometimes the reads put extra heavy CPU load on some nodes.
>
>
>
> Are there any tips, or best practices for increasing the performance of
> counter tables?
>

Re: hanging validation compaction

2017-04-13 Thread benjamin roth

You should be able to find that out by scrubbing the corresponding table(s)
and seeing which one hangs.
I guess the debug log tells you which sstable is being scrubbed.

2017-04-13 15:07 GMT+02:00 Roland Otta <roland.o...@willhaben.at>:

> i made a copy and also have the permission to upload sstables for that
> particular column_family
>
> is it possible to track down which sstable of that cf is affected or
> should i upload all of them?
>
>
> br,
> roland
>
>
> On Thu, 2017-04-13 at 13:57 +0200, benjamin roth wrote:
>
> I think that's a good reproduction case for the issue - you should copy the
> sstable away for further testing. Are you allowed to upload the broken
> sstable to JIRA?
>
> 2017-04-13 13:15 GMT+02:00 Roland Otta <roland.o...@willhaben.at>:
>
> sorry .. i have to correct myself .. the problem still persists.
>
> tried nodetool scrub now for the table ... but scrub is also stuck at the
> same percentage
>
> id                                   compaction type keyspace table    completed total     unit  progress
> 380e4980-2037-11e7-a9a4-a5f3eec2d826 Validation      bds      ad_event 805955242 841258085 bytes 95.80%
> fb17b8b0-2039-11e7-a9a4-a5f3eec2d826 Scrub           bds      ad_event 805961728 841258085 bytes 95.80%
> Active compaction remaining time :   0h00m00s
>
> according to the thread dump its the same issue
>
> Stack trace:
> com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda$65/60401277.accept(Unknown Source)
> com.github.benmanes.caffeine.cache.BoundedBuffer$RingBuffer.drainTo(BoundedBuffer.java:104)
> com.github.benmanes.caffeine.cache.StripedBuffer.drainTo(StripedBuffer.java:160)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.drainReadBuffer(BoundedLocalCache.java:964)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:918)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:903)
> com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:2680)
> com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:875)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.afterRead(BoundedLocalCache.java:748)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:1783)
> com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:97)
> com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:66)
> org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:235)
> org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:213)
> org.apache.cassandra.io.util.LimitingRebufferer.rebuffer(LimitingRebufferer.java:54)
> org.apache.cassandra.io.util.RandomAccessReader.reBufferAt(RandomAccessReader.java:65)
> org.apache.cassandra.io.util.RandomAccessReader.reBuffer(RandomAccessReader.java:59)
> org.apache.cassandra.io.util.RebufferingInputStream.read(RebufferingInputStream.java:88)
> org.apache.cassandra.io.util.RebufferingInputStream.readFully(RebufferingInputStream.java:66)
> org.apache.cassandra.io.util.RebufferingInputStream.readFully(RebufferingInputStream.java:60)
> org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:402)
> org.apache.cassandra.db.marshal.AbstractType.readValue(AbstractType.java:420)
> org.apache.cassandra.db.rows.Cell$Serializer.deserialize(Cell.java:245)
> org.apache.cassandra.db.rows.UnfilteredSerializer.readSimpleColumn(UnfilteredSerializer.java:610)
> org.apache.cassandra.db.rows.UnfilteredSerializer.lambda$deserializeRowBody$1(UnfilteredSerializer.java:575)
> org.apache.cassandra.db.rows.UnfilteredSerializer$$Lambda$85/168219100.accept(Unknown Source)
> org.apache.cassandra.utils.btree.BTree.applyForwards(BTree.java:1222)
> org.apache.cassandra.utils.btree.BTree.apply(BTree.java:1177)
> org.apache.cassandra.db.Columns.apply(Columns.java:377)
> org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeRowBody(UnfilteredSerializer.java:571)
> org.apache.cassandra.db.rows.UnfilteredSerializer.deserialize(UnfilteredSerializer.java:440)
> org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:95)
> org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:73)
> org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractItera

Re: WriteTimeoutException with LWT after few milliseconds

2017-04-13 Thread benjamin roth

I found out that when the WTEs occur, another process was already
inserting the same primary key, because I found duplicates in some places
that perfectly match the WTE logs.

Does anybody know why this throws a WTE instead of returning
[applied] = false?
This is quite confusing!

2017-04-12 17:41 GMT+02:00 Carlos Rolo <r...@pythian.com>:

> You can try to use TRACING to debug the situation, but for a LWT to fail
> so fast, the most probable cause is what you stated: "It is possible that
> there are concurrent inserts on the same PK - actually thats the reason why
> I use LWTs." AKA, someone inserted first.
>
> Regards,
>
> Carlos Juzarte Rolo
> Cassandra Consultant / Datastax Certified Architect / Cassandra MVP
>
> Pythian - Love your data
>
> rolo@pythian | Twitter: @cjrolo | Skype: cjr2k3 | Linkedin:
> *linkedin.com/in/carlosjuzarterolo
> <http://linkedin.com/in/carlosjuzarterolo>*
> Mobile: +351 918 918 100 <+351%20918%20918%20100>
> www.pythian.com
>
> On Wed, Apr 12, 2017 at 3:51 PM, Roland Otta <roland.o...@willhaben.at>
> wrote:
>
>> sorry .. ignore my comment ...
>>
>> i missed your comment that the record is in the table ...
>>
>> On Wed, 2017-04-12 at 16:48 +0200, Roland Otta wrote:
>>
>> Hi Benjamin,
>>
>> it's unlikely that I can assist you .. but nevertheless ... I'll give it a
>> try ;-)
>>
>> What's your consistency level for the insert?
>> What if one or more nodes are marked down and the proper consistency can't
>> be achieved?
>> Of course the error message does not indicate that problem (as it says
>> it's a timeout)... but in that case you would get an instant error for
>> inserts, wouldn't you?
>>
>> br,
>> roland
>>
>>
>>
>> On Wed, 2017-04-12 at 15:09 +0200, benjamin roth wrote:
>>
>> Hi folks,
>>
>> Can someone explain why that occurs?
>>
>> Write timeout after 0.006s
>> Query: 'INSERT INTO log_moment_import ("source", "reference", "user_id",
>> "moment_id", "date", "finished") VALUES (3, '1305821272790495', 65675537,
>> 0, '2017-04-12 13:00:51', NULL) IF NOT EXISTS
>> Primary key and parition key is source + reference
>> Message: Operation timed out - received only 1 responses.
>>
>> This appears every now and then in the log. When I check the for the
>> record in the table, it is there.
>> I could explain that, if the WTE occured after the configured write
>> timeout but it happens withing a few milliseconds.
>> Is this caused by lock contention? It is possible that there are
>> concurrent inserts on the same PK - actually thats the reason why I use
>> LWTs.
>>
>> Thanks!
>>
>>
>
> --
>
>
>
>

Re: hanging validation compaction

2017-04-13 Thread benjamin roth

rator.hasNext(
> AbstractIterator.java:47)
> org.apache.cassandra.db.transform.BaseRows.hasNext(BaseRows.java:133)
> org.apache.cassandra.db.ColumnIndex.buildRowIndex(ColumnIndex.java:110)
> org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(
> BigTableWriter.java:173)
> org.apache.cassandra.io.sstable.SSTableRewriter.
> append(SSTableRewriter.java:135)
> org.apache.cassandra.io.sstable.SSTableRewriter.tryAppend(SSTableRewriter.
> java:156)
> org.apache.cassandra.db.compaction.Scrubber.tryAppend(Scrubber.java:319)
> org.apache.cassandra.db.compaction.Scrubber.scrub(Scrubber.java:214)
> org.apache.cassandra.db.compaction.CompactionManager.
> scrubOne(CompactionManager.java:966)
> org.apache.cassandra.db.compaction.CompactionManager.
> access$300(CompactionManager.java:85)
> org.apache.cassandra.db.compaction.CompactionManager$
> 3.execute(CompactionManager.java:368)
> org.apache.cassandra.db.compaction.CompactionManager$
> 2.call(CompactionManager.java:311)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> org.apache.cassandra.concurrent.NamedThreadFactory.
> lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
> org.apache.cassandra.concurrent.NamedThreadFactory$
> $Lambda$5/899929247.run(Unknown Source)
> java.lang.Thread.run(Thread.java:745)
>
>
> br,
> roland
>
>
> On Thu, 2017-04-13 at 10:04 +0000, Roland Otta wrote:
>
> i did 2 restarts before which did not help
>
> after that i have set for testing purposes file_cache_size_in_mb: 0 and
> buffer_pool_use_heap_if_exhausted: false and restarted again
>
> after that it worked ... but it also could be that it just worked by
> accident after the last restart and is not related to my config changes
>
> On Thu, 2017-04-13 at 11:58 +0200, benjamin roth wrote:
>
> If you restart the server, does the same validation complete successfully?
> If not, have you tried scrubbing the affected SSTables?
>
> 2017-04-13 11:43 GMT+02:00 Roland Otta <roland.o...@willhaben.at>:
>
> thank you guys ... i will
>
> i just wanted to make sure that i am not doing something completely wrong
> before opening an issue
>
> br,
> roland
>
>
> On Thu, 2017-04-13 at 21:35 +1200, Nate McCall wrote:
>
> Not sure what is going on there either. Roland - can you open an issue
> with the information above:
> https://issues.apache.org/jira/browse/CASSANDRA
>
> On Thu, Apr 13, 2017 at 7:49 PM, benjamin roth <brs...@gmail.com> wrote:
>
> What I can tell you from that trace - given that this is the correct
> thread and it really hangs there:
>
> The validation is stuck while reading from an SSTable.
> Unfortunately, I am no Caffeine expert. It looks like the read goes through
> the cache, and after the read Caffeine tries to drain its buffers and gets
> stuck there. I can't see the reason from that stack trace.
> Someone would have to dig deeper into Caffeine to find the root cause.
>
> 2017-04-13 9:27 GMT+02:00 Roland Otta <roland.o...@willhaben.at>:
>
> i had a closer look at the validation executor thread (i hope thats what
> you meant)
>
> it seems the thread is always repeating stuff in
> org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:235)
>
> here is the full stack trace ...
>
> i am sorry .. but i have no clue whats happening there ..
>
> com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda$64/2098345091.accept(Unknown Source)
> com.github.benmanes.caffeine.cache.BoundedBuffer$RingBuffer.drainTo(BoundedBuffer.java:104)
> com.github.benmanes.caffeine.cache.StripedBuffer.drainTo(StripedBuffer.java:160)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.drainReadBuffer(BoundedLocalCache.java:964)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:918)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:903)
> com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:2680)
> com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:875)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.afterRead(BoundedLocalCache.java:748)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLoc

Re: hanging validation compaction

2017-04-13 Thread benjamin roth

What if you run it again with cache enabled?

On 13.04.2017 12:04, "Roland Otta" <roland.o...@willhaben.at> wrote:

> i did 2 restarts before which did not help
>
> after that i have set for testing purposes file_cache_size_in_mb: 0 and
> buffer_pool_use_heap_if_exhausted: false and restarted again
>
> after that it worked ... but it also could be that it just worked by
> accident after the last restart and is not related to my config changes
>
> On Thu, 2017-04-13 at 11:58 +0200, benjamin roth wrote:
>
> If you restart the server, does the same validation complete successfully?
> If not, have you tried scrubbing the affected SSTables?
>
> 2017-04-13 11:43 GMT+02:00 Roland Otta <roland.o...@willhaben.at>:
>
> thank you guys ... i will
>
> i just wanted to make sure that i am not doing something completely wrong
> before opening an issue
>
> br,
> roland
>
>
> On Thu, 2017-04-13 at 21:35 +1200, Nate McCall wrote:
>
> Not sure what is going on there either. Roland - can you open an issue
> with the information above:
> https://issues.apache.org/jira/browse/CASSANDRA
>
> On Thu, Apr 13, 2017 at 7:49 PM, benjamin roth <brs...@gmail.com> wrote:
>
> What I can tell you from that trace - given that this is the correct
> thread and it really hangs there:
>
> The validation is stuck while reading from an SSTable.
> Unfortunately, I am no Caffeine expert. It looks like the read goes through
> the cache, and after the read Caffeine tries to drain its buffers and gets
> stuck there. I can't see the reason from that stack trace.
> Someone would have to dig deeper into Caffeine to find the root cause.
>
> 2017-04-13 9:27 GMT+02:00 Roland Otta <roland.o...@willhaben.at>:
>
> i had a closer look at the validation executor thread (i hope thats what
> you meant)
>
> it seems the thread is always repeating stuff in
> org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:235)
>
> here is the full stack trace ...
>
> i am sorry .. but i have no clue whats happening there ..
>
> com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda$64/2098345091.accept(Unknown Source)
> com.github.benmanes.caffeine.cache.BoundedBuffer$RingBuffer.drainTo(BoundedBuffer.java:104)
> com.github.benmanes.caffeine.cache.StripedBuffer.drainTo(StripedBuffer.java:160)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.drainReadBuffer(BoundedLocalCache.java:964)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:918)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:903)
> com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:2680)
> com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:875)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.afterRead(BoundedLocalCache.java:748)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:1783)
> com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:97)
> com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:66)
> org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:235)
> org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:213)
> org.apache.cassandra.io.util.RandomAccessReader.reBufferAt(RandomAccessReader.java:65)
> org.apache.cassandra.io.util.RandomAccessReader.reBuffer(RandomAccessReader.java:59)
> org.apache.cassandra.io.util.RebufferingInputStream.read(RebufferingInputStream.java:88)
> org.apache.cassandra.io.util.RebufferingInputStream.readFully(RebufferingInputStream.java:66)
> org.apache.cassandra.io.util.RebufferingInputStream.readFully(RebufferingInputStream.java:60)
> org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:402)
> org.apache.cassandra.db.marshal.AbstractType.readValue(AbstractType.java:420)
> org.apache.cassandra.db.rows.Cell$Serializer.deserialize(Cell.java:245)
> org.apache.cassandra.db.rows.UnfilteredSerializer.readSimpleColumn(UnfilteredSerializer.java:610)
> org.apache.cassandra.db.rows.UnfilteredSerializer.lambda$deserializeRowBody$1(UnfilteredSerializer.java:575)
> org.apache.cassandra.db.rows.UnfilteredSerializer$$Lambda$84/898489541.accept(Unknown Source)
> org.apache.cassandra.utils.btree.BTree.applyForwards(BTree.java:1222)
> org.apache.cassandra.utils.btree.BTree.apply(BTree.java:1

Re: hanging validation compaction

2017-04-13 Thread benjamin roth

If you restart the server, does the same validation complete successfully?
If not, have you tried scrubbing the affected SSTables?

2017-04-13 11:43 GMT+02:00 Roland Otta <roland.o...@willhaben.at>:

> thank you guys ... i will
>
> i just wanted to make sure that i am not doing something completely wrong
> before opening an issue
>
> br,
> roland
>
>
> On Thu, 2017-04-13 at 21:35 +1200, Nate McCall wrote:
>
> Not sure what is going on there either. Roland - can you open an issue
> with the information above:
> https://issues.apache.org/jira/browse/CASSANDRA
>
> On Thu, Apr 13, 2017 at 7:49 PM, benjamin roth <brs...@gmail.com> wrote:
>
> What I can tell you from that trace - given that this is the correct
> thread and it really hangs there:
>
> The validation is stuck while reading from an SSTable.
> Unfortunately, I am no Caffeine expert. It looks like the read goes through
> the cache, and after the read Caffeine tries to drain its buffers and gets
> stuck there. I can't see the reason from that stack trace.
> Someone would have to dig deeper into Caffeine to find the root cause.
>
> 2017-04-13 9:27 GMT+02:00 Roland Otta <roland.o...@willhaben.at>:
>
> i had a closer look at the validation executor thread (i hope thats what
> you meant)
>
> it seems the thread is always repeating stuff in
> org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:235)
>
> here is the full stack trace ...
>
> i am sorry .. but i have no clue whats happening there ..
>
> com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda$64/2098345091.accept(Unknown Source)
> com.github.benmanes.caffeine.cache.BoundedBuffer$RingBuffer.drainTo(BoundedBuffer.java:104)
> com.github.benmanes.caffeine.cache.StripedBuffer.drainTo(StripedBuffer.java:160)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.drainReadBuffer(BoundedLocalCache.java:964)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.maintenance(BoundedLocalCache.java:918)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:903)
> com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run(BoundedLocalCache.java:2680)
> com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:875)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.afterRead(BoundedLocalCache.java:748)
> com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:1783)
> com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:97)
> com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:66)
> org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:235)
> org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:213)
> org.apache.cassandra.io.util.RandomAccessReader.reBufferAt(RandomAccessReader.java:65)
> org.apache.cassandra.io.util.RandomAccessReader.reBuffer(RandomAccessReader.java:59)
> org.apache.cassandra.io.util.RebufferingInputStream.read(RebufferingInputStream.java:88)
> org.apache.cassandra.io.util.RebufferingInputStream.readFully(RebufferingInputStream.java:66)
> org.apache.cassandra.io.util.RebufferingInputStream.readFully(RebufferingInputStream.java:60)
> org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:402)
> org.apache.cassandra.db.marshal.AbstractType.readValue(AbstractType.java:420)
> org.apache.cassandra.db.rows.Cell$Serializer.deserialize(Cell.java:245)
> org.apache.cassandra.db.rows.UnfilteredSerializer.readSimpleColumn(UnfilteredSerializer.java:610)
> org.apache.cassandra.db.rows.UnfilteredSerializer.lambda$deserializeRowBody$1(UnfilteredSerializer.java:575)
> org.apache.cassandra.db.rows.UnfilteredSerializer$$Lambda$84/898489541.accept(Unknown Source)
> org.apache.cassandra.utils.btree.BTree.applyForwards(BTree.java:1222)
> org.apache.cassandra.utils.btree.BTree.apply(BTree.java:1177)
> org.apache.cassandra.db.Columns.apply(Columns.java:377)
> org.apache.cassandra.db.rows.UnfilteredSerializer.deserializeRowBody(UnfilteredSerializer.java:571)
> org.apache.cassandra.db.rows.UnfilteredSerializer.deserialize(UnfilteredSerializer.java:440)
> org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:95)
> org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:73)
> org.apache.cassandra.utils.AbstractIterator.hasNext(A

Re: force processing of pending hinted handoffs

2017-04-13 Thread benjamin roth

I have also run into this situation once or twice, but never managed to get
the old hints processed. I just deleted them and ran a repair.

On 13.04.2017 10:35, "Roland Otta" <roland.o...@willhaben.at> wrote:

> unfortunately it does not.
>
> i guess this is intended for resuming hinted handoff handling in case it
> hase been paused with the pausehandoff before.
> i have tested it (resuming .. pausing & resuming) but it has no effect on
> those old hints
>
> On Thu, 2017-04-13 at 10:27 +0200, benjamin roth wrote:
>
> There is a nodetool command to resume hints. Maybe that helps?
>
> On 13.04.2017 09:42, "Roland Otta" <roland.o...@willhaben.at> wrote:
>
> oh ... the operation is deprecated according to the docs ...
>
>
> On Thu, 2017-04-13 at 07:40 +, Roland Otta wrote:
> > i figured out that there is an mbean
> > org.apache.cassandra.db.type=HintedHandoffManager with the operation
> > scheduleHintDelivery
> >
> > i guess thats what i would need in that case. at least the docs let
> > me
> > think so http://javadox.com/org.apache.cassandra/cassandra-all/3.0.0/
> > or
> > g/apache/cassandra/db/HintedHandOffManagerMBean.html
> >
> > but everytime i try invoking that operation i get an
> > UnsupportedOperationException (tried it with hostname, ip and host-id
> > as parameters - everytime the same exception)
> >
> >
> >
> > On Tue, 2017-04-11 at 07:40 +, Roland Otta wrote:
> > > hi,
> > >
> > > sometimes we have the problem that we have hinted handoffs (for
> > > example
> > > because auf network problems between 2 DCs) that do not get
> > > processed
> > > even if the connection problem between the dcs recovers. Some of
> > > the
> > > files stay in the hints directory until we restart the node that
> > > contains the hints.
> > >
> > > after the restart of cassandra we can see the proper messages for
> > > the
> > > hints handling
> > >
> > > Apr 11 09:28:56 bigd006 cassandra: INFO  07:28:56 Deleted hint file
> > > c429ad19-ee9f-4b5a-abcd-1da1516d1003-1491895717182-1.hints
> > > Apr 11 09:28:56 bigd006 cassandra: INFO  07:28:56 Finished hinted
> > > handoff of file c429ad19-ee9f-4b5a-abcd-1da1516d1003-1491895717182-
> > > 1.hints to endpoint c429ad19-ee9f-4b5a-abcd-1da1516d1003
> > >
> > > is there a way (for example via jmx) to force a node to process
> > > outstanding hints instead of restarting the node?
> > > does anyone know whats the cause for not retrying to process those
> > > hints automatically?
> > >
> > > br,
> > > roland
> > >
>
>

Re: force processing of pending hinted handoffs

2017-04-13 Thread benjamin roth

There is a nodetool command to resume hint delivery (nodetool resumehandoff).
Maybe that helps?

On 13.04.2017 09:42, "Roland Otta" wrote:

> oh ... the operation is deprecated according to the docs ...
>
>
> On Thu, 2017-04-13 at 07:40 +, Roland Otta wrote:
> > i figured out that there is an mbean
> > org.apache.cassandra.db.type=HintedHandoffManager with the operation
> > scheduleHintDelivery
> >
> > i guess thats what i would need in that case. at least the docs let
> > me
> > think so http://javadox.com/org.apache.cassandra/cassandra-all/3.0.0/
> > or
> > g/apache/cassandra/db/HintedHandOffManagerMBean.html
> >
> > but everytime i try invoking that operation i get an
> > UnsupportedOperationException (tried it with hostname, ip and host-id
> > as parameters - everytime the same exception)
> >
> >
> >
> > On Tue, 2017-04-11 at 07:40 +, Roland Otta wrote:
> > > hi,
> > >
> > > sometimes we have the problem that we have hinted handoffs (for
> > > example
> > > because auf network problems between 2 DCs) that do not get
> > > processed
> > > even if the connection problem between the dcs recovers. Some of
> > > the
> > > files stay in the hints directory until we restart the node that
> > > contains the hints.
> > >
> > > after the restart of cassandra we can see the proper messages for
> > > the
> > > hints handling
> > >
> > > Apr 11 09:28:56 bigd006 cassandra: INFO  07:28:56 Deleted hint file
> > > c429ad19-ee9f-4b5a-abcd-1da1516d1003-1491895717182-1.hints
> > > Apr 11 09:28:56 bigd006 cassandra: INFO  07:28:56 Finished hinted
> > > handoff of file c429ad19-ee9f-4b5a-abcd-1da1516d1003-1491895717182-
> > > 1.hints to endpoint c429ad19-ee9f-4b5a-abcd-1da1516d1003
> > >
> > > is there a way (for example via jmx) to force a node to process
> > > outstanding hints instead of restarting the node?
> > > does anyone know whats the cause for not retrying to process those
> > > hints automatically?
> > >
> > > br,
> > > roland
> > >

Re: hanging validation compaction

2017-04-13 Thread benjamin roth

MergeIterator$ManyToOne.
> computeNext(MergeIterator.java:155)
> org.apache.cassandra.utils.AbstractIterator.hasNext(
> AbstractIterator.java:47)
> org.apache.cassandra.db.rows.UnfilteredRowIterators$
> UnfilteredRowMergeIterator.computeNext(UnfilteredRowIterators.java:500)
> org.apache.cassandra.db.rows.UnfilteredRowIterators$
> UnfilteredRowMergeIterator.computeNext(UnfilteredRowIterators.java:360)
> org.apache.cassandra.utils.AbstractIterator.hasNext(
> AbstractIterator.java:47)
> org.apache.cassandra.db.transform.BaseRows.hasNext(BaseRows.java:133)
> org.apache.cassandra.db.rows.UnfilteredRowIterators.digest(
> UnfilteredRowIterators.java:178)
> org.apache.cassandra.repair.Validator.rowHash(Validator.java:221)
> org.apache.cassandra.repair.Validator.add(Validator.java:160)
> org.apache.cassandra.db.compaction.CompactionManager.
> doValidationCompaction(CompactionManager.java:1364)
> org.apache.cassandra.db.compaction.CompactionManager.
> access$700(CompactionManager.java:85)
> org.apache.cassandra.db.compaction.CompactionManager$
> 13.call(CompactionManager.java:933)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> org.apache.cassandra.concurrent.NamedThreadFactory.
> lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
> org.apache.cassandra.concurrent.NamedThreadFactory$
> $Lambda$5/1371495133.run(Unknown Source)
> java.lang.Thread.run(Thread.java:745)
>
> On Thu, 2017-04-13 at 08:47 +0200, benjamin roth wrote:
>
> You should connect to the node with JConsole and see where the compaction
> thread is stuck
>
> 2017-04-13 8:34 GMT+02:00 Roland Otta <roland.o...@willhaben.at>:
>
> hi,
>
> we have the following issue on our 3.10 development cluster.
>
> we are doing regular repairs with thelastpickle's fork of creaper.
> sometimes the repair (it is a full repair in that case) hangs because
> of a stuck validation compaction
>
> nodetool compactionstats gives me
> a1bb45c0-1fc6-11e7-81de-0fb0b3f5a345 Validation  bds  ad_event
> 805955242 841258085 bytes 95.80%
> we have here no more progress for hours
>
> nodetool tpstats shows
> ValidationExecutor  1  1  16186  0  0
>
> i checked the logs on the affected node and could not find any
> suspicious errors.
>
> anyone that already had this issue and knows how to cope with that?
>
> a restart of the node helps to finish the repair ... but i am not sure
> whether that somehow breaks the full repair
>
> bg,
> roland
>
>
>

Re: hanging validation compaction

2017-04-13 Thread benjamin roth

You should connect to the node with JConsole and see where the compaction
thread is stuck
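
If JConsole is not at hand, a plain thread dump shows the same thing. The following is only a rough sketch: it assumes the dump was taken with `jstack <pid>` and is available as a string, and that the thread of interest has "ValidationExecutor" in its name (the pool name that nodetool tpstats reports). The function name `top_frames` and the sample dump are made up for illustration.

```python
import re
from typing import Dict, List

# Hypothetical, shortened sample in the layout that jstack prints.
SAMPLE_DUMP = '''"ValidationExecutor:1" #201 daemon prio=1 tid=0x1 nid=0x2 runnable
   java.lang.Thread.State: RUNNABLE
\tat com.github.benmanes.caffeine.cache.BoundedBuffer$RingBuffer.drainTo(BoundedBuffer.java:104)
\tat com.github.benmanes.caffeine.cache.StripedBuffer.drainTo(StripedBuffer.java:160)

"CompactionExecutor:7" #150 daemon prio=1 tid=0x3 nid=0x4 waiting on condition
   java.lang.Thread.State: WAITING (parking)
\tat sun.misc.Unsafe.park(Native Method)
'''


def top_frames(dump: str, name_part: str, depth: int = 5) -> Dict[str, List[str]]:
    """Return the first `depth` stack frames of every thread whose name contains `name_part`."""
    threads: Dict[str, List[str]] = {}
    current = None  # name of the matching thread we are currently inside, if any
    for line in dump.splitlines():
        header = re.match(r'^"([^"]+)"', line)
        if header:
            # A quoted name at the start of a line begins a new thread section.
            name = header.group(1)
            current = name if name_part in name else None
            if current is not None:
                threads[current] = []
        elif current is not None and line.strip().startswith("at "):
            if len(threads[current]) < depth:
                threads[current].append(line.strip()[3:])
    return threads


if __name__ == "__main__":
    for name, frames in top_frames(SAMPLE_DUMP, "ValidationExecutor").items():
        print(name)
        for frame in frames:
            print("   ", frame)
```

Taking two or three dumps a few seconds apart and comparing the top frames helps tell a thread that is really stuck from one that is just slow.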

2017-04-13 8:34 GMT+02:00 Roland Otta :

> hi,
>
> we have the following issue on our 3.10 development cluster.
>
> we are doing regular repairs with thelastpickle's fork of creaper.
> sometimes the repair (it is a full repair in that case) hangs because
> of a stuck validation compaction
>
> nodetool compactionstats gives me
> a1bb45c0-1fc6-11e7-81de-0fb0b3f5a345 Validation  bds  ad_event
> 805955242 841258085 bytes 95.80%
> we have here no more progress for hours
>
> nodetool tpstats shows
> ValidationExecutor  1  1  16186  0  0
>
> i checked the logs on the affected node and could not find any
> suspicious errors.
>
> anyone that already had this issue and knows how to cope with that?
>
> a restart of the node helps to finish the repair ... but i am not sure
> whether that somehow breaks the full repair
>
> bg,
> roland
>

Re: WriteTimeoutException with LWT after few milliseconds

2017-04-12 Thread benjamin roth

Hi Roland,

LWTs implicitly set the consistency level to SERIAL, which requires at least
a QUORUM of replicas.
No, no node is or was down. If that happened, the query would fail with "Could
not achieve consistency level QUORUM ..."
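
For reference, a quorum is floor(RF / 2) + 1 replicas. The sketch below only does that arithmetic; the function names are made up for illustration, and the replication factor of the table in question is not stated in this thread, so the values in the example are assumptions.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for QUORUM: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def can_reach_quorum(replication_factor: int, live_replicas: int) -> bool:
    """True if enough replicas are alive to satisfy QUORUM."""
    return live_replicas >= quorum(replication_factor)


if __name__ == "__main__":
    # Assumed values, for illustration only.
    for rf in (1, 2, 3, 5):
        print(f"RF={rf}: quorum={quorum(rf)}")
```

With RF=3, for example, two replicas have to answer, so a message like "received only 1 responses" means the quorum was not reached even though all nodes were up.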

2017-04-12 16:48 GMT+02:00 Roland Otta <roland.o...@willhaben.at>:

> Hi Benjamin,
>
> its unlikely that i can assist you .. but nevertheless ... i give it a try
> ;-)
>
> whats your consistency level for the insert?
> what if one ore more nodes are marked down and proper consistency cant be
> achieved?
> of course the error message does not indicate that problem (as it says its
> a timeout)... but in that case you would get an instant error for inserts.
> wouldn't you?
>
> br,
> roland
>
>
>
> On Wed, 2017-04-12 at 15:09 +0200, benjamin roth wrote:
>
> Hi folks,
>
> Can someone explain why that occurs?
>
> Write timeout after 0.006s
> Query: 'INSERT INTO log_moment_import ("source", "reference", "user_id",
> "moment_id", "date", "finished") VALUES (3, '1305821272790495', 65675537,
> 0, '2017-04-12 13:00:51', NULL) IF NOT EXISTS
> Primary key and parition key is source + reference
> Message: Operation timed out - received only 1 responses.
>
> This appears every now and then in the log. When I check the for the
> record in the table, it is there.
> I could explain that, if the WTE occured after the configured write
> timeout but it happens withing a few milliseconds.
> Is this caused by lock contention? It is possible that there are
> concurrent inserts on the same PK - actually thats the reason why I use
> LWTs.
>
> Thanks!
>
>

WriteTimeoutException with LWT after few milliseconds

2017-04-12 Thread benjamin roth

Hi folks,

Can someone explain why that occurs?

Write timeout after 0.006s
Query: 'INSERT INTO log_moment_import ("source", "reference", "user_id",
"moment_id", "date", "finished") VALUES (3, '1305821272790495', 65675537,
0, '2017-04-12 13:00:51', NULL) IF NOT EXISTS'
The primary key and the partition key are both source + reference.
Message: Operation timed out - received only 1 responses.

This appears every now and then in the log. When I check for the record
in the table, it is there.
I could understand it if the WTE occurred after the configured write timeout,
but it happens within a few milliseconds.
Is this caused by lock contention? It is possible that there are concurrent
inserts on the same PK; that is actually the reason why I use LWTs.

Thanks!

Re: Multiple nodes decommission

2017-04-11 Thread benjamin roth

I did not test it, but I'd bet that parallel decommissions will lead to
inconsistencies.
Each decommission results in range movements and range reassignments, which
become effective after a successful decommission.
If you start several decommissions at once, I guess the calculated
reassignments are invalid for at least one node after the first node
has finished the decommission process.

I hope someone will correct me if I am wrong.
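
To make the concern concrete, here is a toy model. It is not Cassandra's real range calculation: it assumes one token per node, RF=1, and that a leaving node simply hands its range to its clockwise successor. All names are made up for illustration. It only shows why handover plans computed from the same ring snapshot can point at a node that is itself leaving, and it does not prove that Cassandra behaves this way.

```python
from typing import Dict, List


def plan_handover(ring: List[str], leaving: str) -> str:
    """Toy rule: the clockwise successor takes over the leaving node's range."""
    i = ring.index(leaving)
    return ring[(i + 1) % len(ring)]


def stale_targets(ring: List[str], leaving_nodes: List[str]) -> Dict[str, str]:
    """Plans made from one shared snapshot whose target node is leaving too."""
    plans = {node: plan_handover(ring, node) for node in leaving_nodes}
    return {node: target for node, target in plans.items() if target in leaving_nodes}


def sequential_plans(ring: List[str], leaving_nodes: List[str]) -> Dict[str, str]:
    """One-by-one: each plan is made against the ring as it is after the previous node left."""
    current = list(ring)
    plans: Dict[str, str] = {}
    for node in leaving_nodes:
        plans[node] = plan_handover(current, node)
        current.remove(node)
    return plans


if __name__ == "__main__":
    ring = ["a", "b", "c", "d"]
    print("parallel, stale plans:", stale_targets(ring, ["a", "b"]))
    print("one by one:", sequential_plans(ring, ["a", "b"]))
```

In this model, decommissioning a and b together leaves a's plan pointing at b, which is going away as well, while doing it one by one always plans against the current ring.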

2017-04-11 18:43 GMT+02:00 Jacob Shadix :

> Are you using vnodes? I typically do one-by-one as the decommission will
> create additional load/network activity streaming data to the other nodes
> as the token ranges are reassigned.
>
> -- Jacob Shadix
>
> On Sat, Apr 8, 2017 at 10:55 AM, Vlad  wrote:
>
>> Hi,
>>
>> how multiple nodes should be decommissioned by "nodetool decommission"-
>> one by one or in parallel ?
>>
>> Thanks.
>>
>
>

Re: Node always dieing

2017-04-06 Thread benjamin roth

e_manager=CassandraRoleManager; roles_cache_max_entries=1000;
> roles_update_interval_in_ms=-1; roles_validity_in_ms=2000;
> row_cache_class_name=org.apache.cassandra.cache.OHCProvider;
> row_cache_keys_to_save=2147483647; row_cache_save_period=0;
> row_cache_size_in_mb=0; rpc_address=10.100.100.213; rpc_interface=null;
> rpc_interface_prefer_ipv6=false; rpc_keepalive=true;
> rpc_listen_backlog=50; rpc_max_threads=2147483647; rpc_min_threads=16;
> rpc_port=9160; rpc_recv_buff_size_in_bytes=null;
> rpc_send_buff_size_in_bytes=null; rpc_server_type=sync;
> saved_caches_directory=/mnt/cassandra/saved_caches;
> seed_provider=org.apache.cassandra.locator.SimpleSeedProvider{seeds=10.100.100.19,
> 10.100.100.85, 10.100.100.185, 10.100.100.161, 10.100.100.52,
> 10.100.1000.213}; server_encryption_options=;
> slow_query_log_timeout_in_ms=600; snapshot_before_compaction=false;
> ssl_storage_port=7001; sstable_preemptive_open_interval_in_mb=50;
> start_native_transport=true; start_rpc=false; storage_port=7000;
> stream_throughput_outbound_megabits_per_sec=200;
> streaming_keep_alive_period_in_secs=300; 
> streaming_socket_timeout_in_ms=8640;
> thrift_framed_transport_size_in_mb=15; thrift_max_message_length_in_mb=16;
> thrift_prepared_statements_cache_size_mb=null;
> tombstone_failure_threshold=10; tombstone_warn_threshold=1000;
> tracetype_query_ttl=86400; tracetype_repair_ttl=604800;
> transparent_data_encryption_options=org.apache.cassandra.config.
> TransparentDataEncryptionOptions@38c5cc4c; trickle_fsync=false;
> trickle_fsync_interval_in_kb=10240; truncate_request_timeout_in_ms=600;
> unlogged_batch_across_partitions_warn_threshold=10;
> user_defined_function_fail_timeout=1500; 
> user_defined_function_warn_timeout=500;
> user_function_timeout_policy=die; windows_timer_interval=1;
> write_request_timeout_in_ms=600]
> Thanks
>
>
> On 04/06/2017 11:30 AM, benjamin roth wrote:
>
> Have you checked the effective limits of a running CS process?
> Is CS run as Cassandra? Just to rule out missing file perms.
>
>
> On 06.04.2017 12:24, "Cogumelos Maravilha" <
> cogumelosmaravi...@sapo.pt> wrote:
>
> From cassandra.yaml:
> hints_directory: /mnt/cassandra/hints
> data_file_directories:
> - /mnt/cassandra/data
> commitlog_directory: /mnt/cassandra/commitlog
> saved_caches_directory: /mnt/cassandra/saved_caches
>
> drwxr-xr-x   3 cassandra cassandra   23 Apr  5 16:03 mnt/
>
> drwxr-xr-x 6 cassandra cassandra  68 Apr  5 16:17 ./
> drwxr-xr-x 3 cassandra cassandra  23 Apr  5 16:03 ../
> drwxr-xr-x 2 cassandra cassandra  80 Apr  6 10:07 commitlog/
> drwxr-xr-x 8 cassandra cassandra 124 Apr  5 16:17 data/
> drwxr-xr-x 2 cassandra cassandra  72 Apr  5 16:20 hints/
> drwxr-xr-x 2 cassandra cassandra  49 Apr  5 20:17 saved_caches/
>
> cassand+  2267 1 99 10:18 ?00:02:56 java
> -Xloggc:/var/log/cassandra/gc.log -ea -XX:+UseThreadPriorities
> -XX:Threa...
>
> /dev/mapper/um_vg-xfs_lv  885G   27G  858G   4% /mnt
>
> On /etc/security/limits.conf
>
> *   -   memlock  unlimited
> *   -  nofile  10
> *   -  nproc  32768
> *   -  as   unlimited
>
> On /etc/security/limits.d/cassandra.conf
>
> cassandra  -  memlock  unlimited
> cassandra  -  nofile   10
> cassandra  -  as   unlimited
> cassandra  -  nproc32768
>
> On /etc/sysctl.conf
>
> vm.max_map_count = 1048575
>
> On /etc/systcl.d/cassanda.conf
>
> vm.max_map_count = 1048575
> net.ipv4.tcp_keepalive_time=600
> On /etc/pam.d/su
> ...
> session    required   pam_limits.so
> ...
>
> Distro is the current Ubuntu LTS.
> Thanks
>
>
>
> On 04/06/2017 10:39 AM, benjamin roth wrote:
>
> Cassandra cannot write an SSTable to disk. Are you sure the disk/volume
> where SSTables reside (normally /var/lib/cassandra/data) is writeable for
> the CS user and has enough free space?
> The CDC warning also implies that.
> The other warnings indicate you are probably not running CS as root and
> you did not set an appropriate limit for max open files. Running out of
> open files can also be a reason for the IO error.
>
> 2017-04-06 11:34 GMT+02:00 Cogumelos Maravilha <cogumelosmaravi...@sapo.pt
> >:
>
>> Hi list,
>>
>> I'm using C* 3.10 in a 6 nodes cluster RF=2. All instances type
>> i3.xlarge (AWS) with 32GB, 2 cores and SSD LVM XFS formated 885G. I have
>> one node that is always dieing and I don't understand why. Can anyone
>> give me some hints please. All nodes using the same configuration.
>>
>> Thanks in advance.
>>
>> INFO  [IndexSummaryManager:1] 2017-04-06 05:22:18,352
>> IndexSummar

Re: Node always dieing

2017-04-06 Thread benjamin roth

Have you checked the effective limits of the running CS process?
Is CS running as the cassandra user? Just to rule out missing file permissions.
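
The effective limits can differ from what limits.conf says, because they depend on how the process was started. On Linux they can be read from /proc/<pid>/limits. The sketch below parses that file. The function names and the sample text are made up for illustration, and the PID has to be looked up separately (2267 in the ps output quoted below).

```python
import re
from typing import Dict, Tuple

# Hypothetical excerpt in the layout of /proc/<pid>/limits.
SAMPLE_LIMITS = """Limit                     Soft Limit           Hard Limit           Units
Max open files            1024                 4096                 files
Max processes             32768                32768                processes
Max locked memory         unlimited            unlimited            bytes
"""


def parse_limits(text: str) -> Dict[str, Tuple[str, str]]:
    """Map each limit name to its (soft, hard) values, kept as strings."""
    limits: Dict[str, Tuple[str, str]] = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces.
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) >= 3:
            limits[parts[0]] = (parts[1], parts[2])
    return limits


def read_limits(pid: int) -> Dict[str, Tuple[str, str]]:
    """Read the effective limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_limits(handle.read())


if __name__ == "__main__":
    soft, hard = parse_limits(SAMPLE_LIMITS)["Max open files"]
    print(f"Max open files: soft={soft} hard={hard}")
```

If "Max open files" shows a small default such as 1024 for the Cassandra PID, the settings in limits.conf are not reaching the process.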


On 06.04.2017 12:24, "Cogumelos Maravilha" <
cogumelosmaravi...@sapo.pt> wrote:

From cassandra.yaml:
hints_directory: /mnt/cassandra/hints
data_file_directories:
- /mnt/cassandra/data
commitlog_directory: /mnt/cassandra/commitlog
saved_caches_directory: /mnt/cassandra/saved_caches

drwxr-xr-x   3 cassandra cassandra   23 Apr  5 16:03 mnt/

drwxr-xr-x 6 cassandra cassandra  68 Apr  5 16:17 ./
drwxr-xr-x 3 cassandra cassandra  23 Apr  5 16:03 ../
drwxr-xr-x 2 cassandra cassandra  80 Apr  6 10:07 commitlog/
drwxr-xr-x 8 cassandra cassandra 124 Apr  5 16:17 data/
drwxr-xr-x 2 cassandra cassandra  72 Apr  5 16:20 hints/
drwxr-xr-x 2 cassandra cassandra  49 Apr  5 20:17 saved_caches/

cassand+  2267     1 99 10:18 ?        00:02:56 java
-Xloggc:/var/log/cassandra/gc.log -ea -XX:+UseThreadPriorities -XX:Threa...

/dev/mapper/um_vg-xfs_lv  885G   27G  858G   4% /mnt

On /etc/security/limits.conf

*   -   memlock  unlimited
*   -  nofile  10
*   -  nproc  32768
*   -  as   unlimited

On /etc/security/limits.d/cassandra.conf

cassandra  -  memlock  unlimited
cassandra  -  nofile   10
cassandra  -  as   unlimited
cassandra  -  nproc    32768

On /etc/sysctl.conf

vm.max_map_count = 1048575

On /etc/sysctl.d/cassandra.conf

vm.max_map_count = 1048575
net.ipv4.tcp_keepalive_time=600
On /etc/pam.d/su
...
session    required   pam_limits.so
...

Distro is the current Ubuntu LTS.
Thanks



On 04/06/2017 10:39 AM, benjamin roth wrote:

Cassandra cannot write an SSTable to disk. Are you sure the disk/volume
where SSTables reside (normally /var/lib/cassandra/data) is writeable for
the CS user and has enough free space?
The CDC warning also implies that.
The other warnings indicate you are probably not running CS as root and you
did not set an appropriate limit for max open files. Running out of open
files can also be a reason for the IO error.

2017-04-06 11:34 GMT+02:00 Cogumelos Maravilha <cogumelosmaravi...@sapo.pt>:

> Hi list,
>
> I'm using C* 3.10 in a 6-node cluster with RF=2. All instances are type
> i3.xlarge (AWS) with 32GB, 2 cores and an 885G SSD on LVM, formatted as
> XFS. I have one node that keeps dying and I don't understand why. Can
> anyone give me some hints, please? All nodes use the same configuration.
>
> Thanks in advance.
>
> INFO  [IndexSummaryManager:1] 2017-04-06 05:22:18,352 IndexSummaryRedistribution.java:75 - Redistributing index summaries
> ERROR [MemtablePostFlush:22] 2017-04-06 06:00:26,800 CassandraDaemon.java:229 - Exception in thread Thread[MemtablePostFlush:22,5,main]
> org.apache.cassandra.io.FSWriteError: java.io.IOException: Input/output error
>     at org.apache.cassandra.io.util.SequentialWriter.syncDataOnlyInternal(SequentialWriter.java:173) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.util.SequentialWriter.syncInternal(SequentialWriter.java:185) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.compress.CompressedSequentialWriter.access$100(CompressedSequentialWriter.java:38) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.compress.CompressedSequentialWriter$TransactionalProxy.doPrepare(CompressedSequentialWriter.java:307) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:173) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.util.SequentialWriter.prepareToCommit(SequentialWriter.java:358) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.sstable.format.big.BigTableWriter$TransactionalProxy.doPrepare(BigTableWriter.java:367) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:173) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.sstable.format.SSTableWriter.prepareToCommit(SSTableWriter.java:281) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.prepareToCommit(SimpleSSTableMultiWriter.java:101) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.db.ColumnFamilyStore$Flush.flushMemtable(ColumnFamilyStore.java:1153) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.db.ColumnFamilyStore$Flush.run(ColumnFamilyStore.java:1086) ~[apache-cassandra-3.10.jar:3.10]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_121]
> at
>

Re: Node always dieing

2017-04-06 Thread benjamin roth

Cassandra cannot write an SSTable to disk. Are you sure the disk/volume
where SSTables reside (normally /var/lib/cassandra/data) is writeable for
the CS user and has enough free space?
The CDC warning also implies that.
The other warnings indicate you are probably not running CS as root and you
did not set an appropriate limit for max open files. Running out of open
files can also be a reason for the IO error.
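
The two conditions above (writable by the Cassandra user, enough free space) can be checked with a few lines of Python. This is only a sketch under assumptions: the function name is made up, it has to be run as the same user Cassandra runs as, and it will not detect a failing device or a filesystem that returns I/O errors only on sync.

```python
import os
import shutil


def check_data_dir(path, min_free_bytes=0):
    """Return a list of problems found for a data directory (empty if none)."""
    if not os.path.isdir(path):
        return [f"{path} is not a directory"]
    problems = []
    # Creating SSTables needs write and search (execute) permission on the directory.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by uid {os.getuid()}")
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(f"{path} has only {free} bytes free")
    return problems
```

With the layout from this thread, one would call it for each of /mnt/cassandra/data, /mnt/cassandra/commitlog, /mnt/cassandra/hints and /mnt/cassandra/saved_caches, with a min_free_bytes value that suits the node.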

2017-04-06 11:34 GMT+02:00 Cogumelos Maravilha :

> Hi list,
>
> I'm using C* 3.10 in a 6-node cluster with RF=2. All instances are type
> i3.xlarge (AWS) with 32GB, 2 cores and an 885G SSD on LVM, formatted as
> XFS. I have one node that keeps dying and I don't understand why. Can
> anyone give me some hints, please? All nodes use the same configuration.
>
> Thanks in advance.
>
> INFO  [IndexSummaryManager:1] 2017-04-06 05:22:18,352 IndexSummaryRedistribution.java:75 - Redistributing index summaries
> ERROR [MemtablePostFlush:22] 2017-04-06 06:00:26,800 CassandraDaemon.java:229 - Exception in thread Thread[MemtablePostFlush:22,5,main]
> org.apache.cassandra.io.FSWriteError: java.io.IOException: Input/output error
>     at org.apache.cassandra.io.util.SequentialWriter.syncDataOnlyInternal(SequentialWriter.java:173) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.util.SequentialWriter.syncInternal(SequentialWriter.java:185) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.compress.CompressedSequentialWriter.access$100(CompressedSequentialWriter.java:38) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.compress.CompressedSequentialWriter$TransactionalProxy.doPrepare(CompressedSequentialWriter.java:307) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:173) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.util.SequentialWriter.prepareToCommit(SequentialWriter.java:358) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.sstable.format.big.BigTableWriter$TransactionalProxy.doPrepare(BigTableWriter.java:367) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:173) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.sstable.format.SSTableWriter.prepareToCommit(SSTableWriter.java:281) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.prepareToCommit(SimpleSSTableMultiWriter.java:101) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.db.ColumnFamilyStore$Flush.flushMemtable(ColumnFamilyStore.java:1153) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.db.ColumnFamilyStore$Flush.run(ColumnFamilyStore.java:1086) ~[apache-cassandra-3.10.jar:3.10]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_121]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
>     at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.10.jar:3.10]
>     at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_121]
> Caused by: java.io.IOException: Input/output error
>     at sun.nio.ch.FileDispatcherImpl.force0(Native Method) ~[na:1.8.0_121]
>     at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76) ~[na:1.8.0_121]
>     at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:388) ~[na:1.8.0_121]
>     at org.apache.cassandra.utils.SyncUtil.force(SyncUtil.java:158) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.util.SequentialWriter.syncDataOnlyInternal(SequentialWriter.java:169) ~[apache-cassandra-3.10.jar:3.10]
>     ... 15 common frames omitted
> INFO  [IndexSummaryManager:1] 2017-04-06 06:22:18,366 IndexSummaryRedistribution.java:75 - Redistributing index summaries
> ERROR [MemtablePostFlush:31] 2017-04-06 06:39:19,525 CassandraDaemon.java:229 - Exception in thread Thread[MemtablePostFlush:31,5,main]
> org.apache.cassandra.io.FSWriteError: java.io.IOException: Input/output error
>     at org.apache.cassandra.io.util.SequentialWriter.syncDataOnlyInternal(SequentialWriter.java:173) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.util.SequentialWriter.syncInternal(SequentialWriter.java:185) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.compress.CompressedSequentialWriter.access$100(CompressedSequentialWriter.java:38) ~[apache-cassandra-3.10.jar:3.10]
>     at org.apache.cassandra.io.compress.CompressedSequentialWriter$TransactionalProxy.doPrepare(CompressedSequentialWriter.java:307) ~[apache-cassandra-3.10.jar:3.10]
> at
>

Re: nodes are always out of sync

2017-04-02 Thread benjamin roth

Btw.: I created an issue for that some months ago
https://issues.apache.org/jira/browse/CASSANDRA-12991

2017-04-01 22:25 GMT+02:00 Roland Otta <roland.o...@willhaben.at>:

> thank you both chris and benjamin for taking time to clarify that.
>
>
> On Sat, 2017-04-01 at 21:17 +0200, benjamin roth wrote:
>
> TL;DR: there are race conditions in a repair and they are not trivial to
> fix, so we live with them. They don't really hurt: the worst case is that
> ranges are repaired that don't actually need a repair.
>
> On 01.04.2017 21:14, "Chris Lohfink" <clohfin...@gmail.com> wrote:
>
> Repair cannot instantly build a perfect view of the data across your 3
> nodes at one exact point in time. When a piece of data is written, there is
> a delay before it is applied on every node, even if it's just 500ms. So if
> the request to read the data and build the Merkle tree finishes on node1 at
> 12:01 while node2 finishes at 12:02, that delta of a minute or so (even if
> it is only a few seconds, or if you use snapshot repairs) means the
> partition/range hashes in the Merkle trees can differ. On a moving data set
> it's almost impossible to have the cluster perfectly in sync for a repair.
> I wouldn't worry about that log message. If you are worried about
> consistency between your reads and writes, use EACH_QUORUM or LOCAL_QUORUM
> for both.
>
> Chris
>
> On Thu, Mar 30, 2017 at 1:22 AM, Roland Otta <roland.o...@willhaben.at>
> wrote:
>
> hi,
>
> we see the following behaviour in our environment:
>
> cluster consists of 6 nodes (cassandra version 3.0.7). keyspace has a
> replication factor 3.
> clients are writing data to the keyspace with consistency one.
>
> we are doing parallel, incremental repairs with cassandra reaper.
>
> even if a repair just finished and we are starting a new one
> immediately, we can see the following entries in our logs:
>
> INFO  [RepairJobTask:1] 2017-03-30 10:14:00,782 SyncTask.java:73 -
> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.188
> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:2] 2017-03-30 10:14:00,782 SyncTask.java:73 -
> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.188
> and /192.168.0.189 have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:4] 2017-03-30 10:14:00,782 SyncTask.java:73 -
> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:2] 2017-03-30 10:14:03,997 SyncTask.java:73 -
> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.26
> and /192.168.0.189 have 2 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:1] 2017-03-30 10:14:03,997 SyncTask.java:73 -
> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.26
> and /192.168.0.191 have 2 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:4] 2017-03-30 10:14:03,997 SyncTask.java:73 -
> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
> and /192.168.0.191 have 2 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:1] 2017-03-30 10:14:05,375 SyncTask.java:73 -
> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:2] 2017-03-30 10:14:05,375 SyncTask.java:73 -
> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
> and /192.168.0.190 have 1 range(s) out of sync for ad_event_history
> INFO  [RepairJobTask:4] 2017-03-30 10:14:05,375 SyncTask.java:73 -
> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.190
> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
>
> we can't see any hints on the systems ... so we thought everything is
> running smoothly with the writes.
>
> do we have to be concerned about the nodes always being out of sync or
> is this a normal behaviour in a write intensive table (as the tables
> will never be 100% in sync for the latest inserts)?
>
> bg,
> roland
>
>
>
>
>
>

Re: nodes are always out of sync

2017-04-01 Thread benjamin roth

TL;DR: there are race conditions in a repair and they are not trivial to
fix, so we live with them. They don't really hurt: the worst case is that
ranges are repaired that don't actually need a repair.
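
To make the race concrete, here is a toy model in Python. It is not how Cassandra builds Merkle trees; it is only a sketch, with made-up names, of two replicas hashing the same range while a write is still in flight between them.

```python
import hashlib


def range_hash(rows):
    """Hash a replica's view of one token range (stand-in for a Merkle tree leaf)."""
    h = hashlib.sha256()
    for key in sorted(rows):
        h.update(f"{key}={rows[key]};".encode())
    return h.hexdigest()


def view_at(writes, replica, t):
    """Rows that a replica has applied by time t."""
    return {k: v for k, v, applied in writes if applied[replica] <= t}


def out_of_sync(writes, t_node1, t_node2):
    """True if the two replicas' range hashes differ when taken at these times."""
    return (range_hash(view_at(writes, "node1", t_node1))
            != range_hash(view_at(writes, "node2", t_node2)))


# Each write: (key, value, time at which each replica applied it).
# The second write reaches node2 half a second after node1.
writes = [
    ("a", 1, {"node1": 1.0, "node2": 1.0}),
    ("b", 2, {"node1": 10.0, "node2": 10.5}),
]
```

Here out_of_sync(writes, 10.2, 10.2) is True although both replicas end up with identical data, and out_of_sync(writes, 11.0, 11.0) is False. On a table that receives writes continuously there is always some write in that window, which is why the "out of sync" log line keeps appearing.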

On 01.04.2017 21:14, "Chris Lohfink" wrote:

> Repair cannot instantly build a perfect view of the data across your 3
> nodes at one exact point in time. When a piece of data is written, there is
> a delay before it is applied on every node, even if it's just 500ms. So if
> the request to read the data and build the Merkle tree finishes on node1 at
> 12:01 while node2 finishes at 12:02, that delta of a minute or so (even if
> it is only a few seconds, or if you use snapshot repairs) means the
> partition/range hashes in the Merkle trees can differ. On a moving data set
> it's almost impossible to have the cluster perfectly in sync for a repair.
> I wouldn't worry about that log message. If you are worried about
> consistency between your reads and writes, use EACH_QUORUM or LOCAL_QUORUM
> for both.
>
> Chris
>
> On Thu, Mar 30, 2017 at 1:22 AM, Roland Otta 
> wrote:
>
>> hi,
>>
>> we see the following behaviour in our environment:
>>
>> cluster consists of 6 nodes (cassandra version 3.0.7). keyspace has a
>> replication factor 3.
>> clients are writing data to the keyspace with consistency one.
>>
>> we are doing parallel, incremental repairs with cassandra reaper.
>>
>> even if a repair just finished and we are starting a new one
>> immediately, we can see the following entries in our logs:
>>
>> INFO  [RepairJobTask:1] 2017-03-30 10:14:00,782 SyncTask.java:73 -
>> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.188
>> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:2] 2017-03-30 10:14:00,782 SyncTask.java:73 -
>> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.188
>> and /192.168.0.189 have 1 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:4] 2017-03-30 10:14:00,782 SyncTask.java:73 -
>> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
>> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:2] 2017-03-30 10:14:03,997 SyncTask.java:73 -
>> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.26
>> and /192.168.0.189 have 2 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:1] 2017-03-30 10:14:03,997 SyncTask.java:73 -
>> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.26
>> and /192.168.0.191 have 2 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:4] 2017-03-30 10:14:03,997 SyncTask.java:73 -
>> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
>> and /192.168.0.191 have 2 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:1] 2017-03-30 10:14:05,375 SyncTask.java:73 -
>> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
>> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:2] 2017-03-30 10:14:05,375 SyncTask.java:73 -
>> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
>> and /192.168.0.190 have 1 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:4] 2017-03-30 10:14:05,375 SyncTask.java:73 -
>> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.190
>> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
>>
>> we can't see any hints on the systems ... so we thought everything is
>> running smoothly with the writes.
>>
>> do we have to be concerned about the nodes always being out of sync or
>> is this a normal behaviour in a write intensive table (as the tables
>> will never be 100% in sync for the latest inserts)?
>>
>> bg,
>> roland
>>
>>
>>
>

Re: nodes are always out of sync

2017-04-01 Thread benjamin roth

I think your way of communicating needs work. No one forces you to answer
questions.

On 01.04.2017 21:09, "daemeon reiydelle" wrote:

> What you are doing is, correctly, going to result in this IF there is
> substantial backlog, network, disk or other pressure.
>
> What do you think will happen when you write with a replication factor
> greater than consistency level of write? Perhaps your mental model of how
> C* works needs work?
>
>
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
>
> On Sat, Apr 1, 2017 at 11:09 AM, Vladimir Yudovin 
> wrote:
>
>> Hi,
>>
>> did you try to read data with consistency ALL immediately after write
>> with consistency ONE? Does it succeed?
>>
>> Best regards, Vladimir Yudovin,
>> *Winguzone  - Cloud Cassandra Hosting*
>>
>>
>> On Thu, 30 Mar 2017 04:22:28 -0400 Roland Otta wrote:
>>
>> hi,
>>
>> we see the following behaviour in our environment:
>>
>> cluster consists of 6 nodes (cassandra version 3.0.7). keyspace has a
>> replication factor 3.
>> clients are writing data to the keyspace with consistency one.
>>
>> we are doing parallel, incremental repairs with cassandra reaper.
>>
>> even if a repair just finished and we are starting a new one
>> immediately, we can see the following entries in our logs:
>>
>> INFO  [RepairJobTask:1] 2017-03-30 10:14:00,782 SyncTask.java:73 -
>> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.188
>> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:2] 2017-03-30 10:14:00,782 SyncTask.java:73 -
>> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.188
>> and /192.168.0.189 have 1 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:4] 2017-03-30 10:14:00,782 SyncTask.java:73 -
>> [repair #d0f651f6-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
>> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:2] 2017-03-30 10:14:03,997 SyncTask.java:73 -
>> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.26
>> and /192.168.0.189 have 2 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:1] 2017-03-30 10:14:03,997 SyncTask.java:73 -
>> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.26
>> and /192.168.0.191 have 2 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:4] 2017-03-30 10:14:03,997 SyncTask.java:73 -
>> [repair #d0fa70a1-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
>> and /192.168.0.191 have 2 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:1] 2017-03-30 10:14:05,375 SyncTask.java:73 -
>> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
>> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:2] 2017-03-30 10:14:05,375 SyncTask.java:73 -
>> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.189
>> and /192.168.0.190 have 1 range(s) out of sync for ad_event_history
>> INFO  [RepairJobTask:4] 2017-03-30 10:14:05,375 SyncTask.java:73 -
>> [repair #d0fbd033-1520-11e7-a443-d9f5b942818e] Endpoints /192.168.0.190
>> and /192.168.0.191 have 1 range(s) out of sync for ad_event_history
>>
>> we can't see any hints on the systems ... so we thought everything is
>> running smoothly with the writes.
>>
>> do we have to be concerned about the nodes always being out of sync or
>> is this a normal behaviour in a write intensive table (as the tables
>> will never be 100% in sync for the latest inserts)?
>>
>> bg,
>> roland
>>
>>
>>
>>
>

Re: spikes in blocked native transport requests

2017-03-20 Thread benjamin roth

Did you check for stop-the-world (STW) GC pauses?
You can do that with 'nodetool gcstats', by looking at gc.log, or by
observing the GC-related JMX metrics.
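
For the gc.log route, a small script can pull out the pause times. This is a minimal sketch that assumes a Java 8 style gc.log containing "Total time for which application threads were stopped" lines, which the JVM writes when -XX:+PrintGCApplicationStoppedTime is set; whether your jvm.options enables that flag is an assumption to verify. The function names and the 0.2 s threshold are arbitrary choices for illustration.

```python
import re

STOPPED = re.compile(
    r"Total time for which application threads were stopped: "
    r"([0-9]+[.,][0-9]+) seconds"
)


def stw_pauses(lines):
    """Yield stop-the-world pause durations (in seconds) found in gc.log lines."""
    for line in lines:
        match = STOPPED.search(line)
        if match:
            # Some locales print a decimal comma instead of a point.
            yield float(match.group(1).replace(",", "."))


def summarize(lines, threshold=0.2):
    """Count pauses, report the longest one and how many exceed the threshold."""
    pauses = list(stw_pauses(lines))
    return {
        "count": len(pauses),
        "max": max(pauses, default=0.0),
        "over_threshold": sum(1 for p in pauses if p > threshold),
    }
```

Opening the log file and passing the file object to summarize() is enough, since it only iterates over lines. If the spikes in blocked native transport requests line up with long pauses, GC is a likely suspect.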

2017-03-20 8:52 GMT+01:00 Roland Otta :

> we have a datacenter which is currently used exclusively for spark batch
> jobs.
>
> in case batch jobs are running against that environment we can see very
> high peaks in blocked native transport requests (up to 10k / minute).
>
> i am concerned because i guess that will slow other queries (in case
> other applications are going to use that dc as well).
>
> i already tried increasing native_transport_max_threads +
> concurrent_reads without success.
>
> during the jobs i can't find any resource limitations on my hardware
> (iops, disk usage, cpu, ... is fine).
>
> am i missing something? any suggestions how to cope with that?
>
> br//
> roland
>
>
>

Re: Running cassandra

2017-03-19 Thread benjamin roth

You're welcome!

2017-03-19 18:41 GMT+01:00 Long Quanzheng <prc...@gmail.com>:

> You are RIGHT!
> It's working after I remove the env variable GREP_OPTIONS.
>
> Thanks!
>
> 2017-03-19 10:08 GMT-07:00 benjamin roth <brs...@gmail.com>:
>
>> I once had the same problem. In my case it was the coloured output of
>> grep that injected ansi codes into the CS startup command.
>>
>> On 19.03.2017 18:07, "Long Quanzheng" <prc...@gmail.com> wrote:
>>
>>> Hi
>>> It still doesn't work.
>>>
>>> The real problem is this error:
>>>
>>> Error: Could not find or load main class -ea
>>>
>>> Thanks
>>> Long
>>>
>>> On Sun, Mar 19, 2017 at 3:16 AM Vinci <vi...@protonmail.com> wrote:
>>>
>>>> You need to have a log directory to be able to run cassandra.
>>>>
>>>> mkdir logs
>>>>
>>>> then start the cassandra process.
>>>>
>>>>  Original Message 
>>>> Subject: Running cassandra
>>>> Local Time: 19 March 2017 11:31 AM
>>>> UTC Time: 19 March 2017 06:01
>>>> From: prc...@gmail.com
>>>> To: user@cassandra.apache.org <user@cassandra.apache.org>
>>>>
>>>>
>>>> Hi
>>>> I am trying to get started with Cassandra by following this doc:
>>>> http://cassandra.apache.org/doc/latest/getting_started/installing.html#prerequisites
>>>>
>>>> But I always get the error:
>>>>
>>>> qlong@~/ws/cas/apache-cassandra-3.10 $ ./bin/cassandra -f
>>>> Java HotSpot(TM) 64-Bit Server VM warning: Cannot open file
>>>> ./bin/../logs/gc.log due to No such file or directory
>>>>
>>>> Error: Could not find or load main class -ea
>>>> qlong@~/ws/cas/apache-cassandra-3.10 $ ./bin/cassandra
>>>> Java HotSpot(TM) 64-Bit Server VM warning: Cannot open file
>>>> ./bin/../logs/gc.log due to No such file or directory
>>>>
>>>> qlong@~/ws/cas/apache-cassandra-3.10 $ Error: Could not find or load
>>>> main class -ea
>>>>
>>>> Did I miss something?
>>>>
>>>> My java is 1.8:
>>>> qlong@~ $ java -version
>>>> java version "1.8.0_121"
>>>> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
>>>> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
>>>>
>>>> Thanks for any help,
>>>> Long
>>>>
>>>>
>>>>
>

Re: Running cassandra

2017-03-19 Thread benjamin roth

I once had the same problem. In my case it was the coloured output of grep,
which injected ANSI escape codes into the CS startup command.
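
A quick way to confirm this kind of problem is to look for ANSI escape sequences in the command line that the startup script builds. The sketch below uses made-up helper names, and it assumes the colour was forced via GREP_OPTIONS containing --color=always; other settings can have the same effect.

```python
import os
import re

# CSI sequences such as "\x1b[01;31m" (colour) or "\x1b[K" (erase line).
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def has_ansi(text):
    """True if the text contains ANSI escape sequences."""
    return ANSI_ESCAPE.search(text) is not None


def strip_ansi(text):
    """Return the text with ANSI escape sequences removed."""
    return ANSI_ESCAPE.sub("", text)


def grep_color_forced(environ=os.environ):
    """True if GREP_OPTIONS forces coloured output even when piped."""
    opts = environ.get("GREP_OPTIONS", "")
    return "--color=always" in opts or "--colour=always" in opts
```

When grep colours a match inside a pipeline, the JVM receives an argument wrapped in escape bytes instead of a plain option, which would explain an error like "Could not find or load main class -ea". Unsetting GREP_OPTIONS before starting Cassandra avoids it.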

On 19.03.2017 18:07, "Long Quanzheng" wrote:

> Hi
> It still doesn't work.
>
> The real problem is this error:
>
> Error: Could not find or load main class -ea
>
> Thanks
> Long
>
> On Sun, Mar 19, 2017 at 3:16 AM Vinci  wrote:
>
>> You need to have a log directory to be able to run cassandra.
>>
>> mkdir logs
>>
>> then start the cassandra process.
>>
>>  Original Message 
>> Subject: Running cassandra
>> Local Time: 19 March 2017 11:31 AM
>> UTC Time: 19 March 2017 06:01
>> From: prc...@gmail.com
>> To: user@cassandra.apache.org 
>>
>>
>> Hi
>> I am trying to get started with Cassandra by following this doc:
>> http://cassandra.apache.org/doc/latest/getting_started/installing.html#prerequisites
>>
>> But I always get the error:
>>
>> qlong@~/ws/cas/apache-cassandra-3.10 $ ./bin/cassandra -f
>> Java HotSpot(TM) 64-Bit Server VM warning: Cannot open file
>> ./bin/../logs/gc.log due to No such file or directory
>>
>> Error: Could not find or load main class -ea
>> qlong@~/ws/cas/apache-cassandra-3.10 $ ./bin/cassandra
>> Java HotSpot(TM) 64-Bit Server VM warning: Cannot open file
>> ./bin/../logs/gc.log due to No such file or directory
>>
>> qlong@~/ws/cas/apache-cassandra-3.10 $ Error: Could not find or load
>> main class -ea
>>
>> Did I miss something?
>>
>> My java is 1.8:
>> qlong@~ $ java -version
>> java version "1.8.0_121"
>> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
>>
>> Thanks for any help,
>> Long
>>
>>
>>

Re: repair performance

2017-03-17 Thread benjamin roth

The fork from thelastpickle is compatible with 3.0+. I'd recommend giving it
a try instead of plain nodetool.

2017-03-17 22:30 GMT+01:00 Roland Otta <roland.o...@willhaben.at>:

> forgot to mention the version we are using:
>
> we are using 3.0.7 - so i guess we should have incremental repairs by
> default.
> it also prints out incremental:true when starting a repair
> INFO  [Thread-7281] 2017-03-17 09:40:32,059 RepairRunnable.java:125 -
> Starting repair command #7, repairing keyspace xxx with repair options
> (parallelism: parallel, primary range: false, incremental: true, job
> threads: 1, ColumnFamilies: [], dataCenters: [ProdDC2], hosts: [], # of
> ranges: 1758)
>
> 3.0.7 is also the reason why we are not using reaper ... as far as i could
> figure out it's not compatible with 3.0+
>
>
>
> On Fri, 2017-03-17 at 22:13 +0100, benjamin roth wrote:
>
> It depends a lot ...
>
> - Repairs can be very slow, yes! (And unreliable, due to timeouts,
> outages, whatever)
> - You can use incremental repairs to speed things up for regular repairs
> - You can use "reaper" to schedule repairs and run them sliced, automated,
> failsafe
>
> The time a repair actually takes may vary a lot depending on how much data
> has to be streamed and how inconsistent your cluster is.
>
> 50mbit/s is really a bit low! The actual performance depends on many
> factors such as your CPU, RAM, HD/SSD, concurrency settings and the load
> on the "old nodes" of the cluster.
> This is quite an individual problem that you have to track down for your
> own setup.
>
> 2017-03-17 22:07 GMT+01:00 Roland Otta <roland.o...@willhaben.at>:
>
> hello,
>
> we are quite inexperienced with cassandra at the moment and are playing
> around with a new cluster we built up for getting familiar with
> cassandra and its possibilities.
>
> while getting familiar with that topic we noticed that repairs in
> our cluster take a long time. To get an idea of our current setup here
> are some numbers:
>
> our cluster currently consists of 4 nodes (replication factor 3).
> these nodes are all on dedicated physical hardware in our own
> datacenter. all of the nodes have
>
> 32 cores @ 2.9 GHz
> 64 GB ram
> 2 ssds (raid0) 900 GB each for data
> 1 separate hdd for OS + commitlogs
>
> current dataset:
> approx 530 GB per node
> 21 tables (biggest one has more than 200 GB / node)
>
>
> i already tried setting compactionthroughput + streamingthroughput to
> unlimited for testing purposes ... but that did not change anything.
>
> when checking system resources i cannot see any bottleneck (cpus are
> pretty idle and we have no iowaits).
>
> when issuing a repair via
>
> nodetool repair -local on a node the repair takes longer than a day.
> is this normal or could we normally expect a faster repair?
>
> i also noticed that initializing new nodes in the datacenter was
> really slow (approx 50 mbit/s). also here i expected a much better
> performance - could those 2 problems be somehow related?
>
> br//
> roland
>
>
>

Re: repair performance

2017-03-17 Thread benjamin roth

It depends a lot ...

- Repairs can be very slow, yes! (And unreliable, due to timeouts, outages,
whatever)
- You can use incremental repairs to speed things up for regular repairs
- You can use "reaper" to schedule repairs and run them sliced, automated,
failsafe

The time a repair actually takes may vary a lot depending on how much data
has to be streamed and how inconsistent your cluster is.

50mbit/s is really a bit low! The actual performance depends on many
factors such as your CPU, RAM, HD/SSD, concurrency settings and the load on
the "old nodes" of the cluster.
This is quite an individual problem that you have to track down for your
own setup.
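
A back-of-the-envelope calculation puts the numbers from the question (approx 530 GB per node, approx 50 mbit/s) in perspective. The helper below is plain arithmetic with decimal units assumed; it is not a model of Cassandra streaming, and the function name is made up.

```python
def transfer_hours(data_gb, rate_mbit_per_s):
    """Hours needed to move data_gb gigabytes at rate_mbit_per_s megabits per second."""
    bits = data_gb * 1e9 * 8                    # GB -> bits
    seconds = bits / (rate_mbit_per_s * 1e6)    # Mbit/s -> bit/s
    return seconds / 3600
```

transfer_hours(530, 50) comes to roughly 23.6 hours for one node's worth of data, while the same amount at 1000 mbit/s would take about 1.2 hours. So at 50 mbit/s, bootstrapping a node or a repair that has to stream a large share of the data ends up in the "about a day" range purely because of throughput.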

2017-03-17 22:07 GMT+01:00 Roland Otta :

> hello,
>
> we are quite inexperienced with cassandra at the moment and are playing
> around with a new cluster we built up for getting familiar with
> cassandra and its possibilities.
>
> while getting familiar with that topic we noticed that repairs in
> our cluster take a long time. To get an idea of our current setup here
> are some numbers:
>
> our cluster currently consists of 4 nodes (replication factor 3).
> these nodes are all on dedicated physical hardware in our own
> datacenter. all of the nodes have
>
> 32 cores @ 2.9 GHz
> 64 GB ram
> 2 ssds (raid0) 900 GB each for data
> 1 separate hdd for OS + commitlogs
>
> current dataset:
> approx 530 GB per node
> 21 tables (biggest one has more than 200 GB / node)
>
>
> i already tried setting compactionthroughput + streamingthroughput to
> unlimited for testing purposes ... but that did not change anything.
>
> when checking system resources i cannot see any bottleneck (cpus are
> pretty idle and we have no iowaits).
>
> when issuing a repair via
>
> nodetool repair -local on a node the repair takes longer than a day.
> is this normal or could we normally expect a faster repair?
>
> i also noticed that initializing new nodes in the datacenter was
> really slow (approx 50 mbit/s). also here i expected a much better
> performance - could those 2 problems be somehow related?
>
> br//
> roland

Re: scylladb

2017-03-13 Thread benjamin roth

@Dor,Jeff:

I think Jeff pointed out an important fact: You cannot stop CS, swap
binaries and start Scylla. To be honest that was AFAIR the only "Oooh :(" I
had when reading the Scylla "marketing material".

If that worked it would be very valuable from both Scylla's and a user's
point of view. As a user I would love to give Scylla a try as soon as it
provides all the features my application requires. But the hurdle is quite
high. I have to create a separate Scylla cluster, I have to migrate a lot
of data, and I somehow have to make my application use (r+w) both CS +
Scylla at the same time so that I don't risk data loss or a dead end if
something goes wrong. And still: I would not be able to compare CS + Scylla
for my workload entirely fairly, as the conditions have changed: new
hardware, maybe a partial dataset, probably only "test traffic".

However, if I was able to just replace a single node in an existing cluster
I'd have:
1. A super-low hurdle to give it a try: no risk, no effort
2. A fair comparison, by comparing the new node against an equally equipped
old node in the same cluster with the same workload
3. An easy decision on whether to continue or not

That would be totally awesome!


2017-03-12 23:16 GMT+01:00 Kant Kodali :

> I don't think the ScyllaDB guys started this conversation in the first place
> to suggest or promote a "drop-in replacement". It was something that was
> brought up by one of the Cassandra users and the ScyllaDB guys just clarified
> it. They are gracious enough to share the internals in detail.
>
> Honestly, I find it weird when I see questions like whether a question
> belongs on a mailing list or not, especially in this case. If one doesn't
> like it they can simply not follow the thread. I am not sure what the
> harm is here.
>
>
>
> On Sun, Mar 12, 2017 at 2:29 PM, James Carman 
> wrote:
>
>> Well, looking back, it appears this thread is from 2015, so apparently
>> everyone is okay with it.
>>
>> Promoting a value-add product that makes using Cassandra easier/more
>> efficient/etc would be cool, but coming to the Cassandra mailing list to
>> promote a "drop-in replacement" (use us, not Cassandra) isn't cool, IMHO.
>>
>>
>> On Sun, Mar 12, 2017 at 5:04 PM Kant Kodali  wrote:
>>
>> yes.
>>
>> On Sun, Mar 12, 2017 at 2:01 PM, James Carman wrote:
>>
>> Does all of this Scylla talk really even belong on the Cassandra user
>> mailing list in the first place?
>>
>>
>>
>>
>> On Sun, Mar 12, 2017 at 4:07 PM Jeff Jirsa  wrote:
>>
>>
>>
>> On 2017-03-11 22:33 (-0700), Dor Laor  wrote:
>> > On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa  wrote:
>> > > On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
>> > > > Cassandra vs Scylla is a valid comparison because they both are
>> > > compatible. Scylla is a drop-in replacement for Cassandra.
>> > >
>> > > No, they aren't, and no, it isn't
>> > >
>> >
>> > Jeff is angry with us for some reason. I don't know why; it's natural
>> > that when a new opponent appears there are objections, and the burden
>> > of proof lies on us.
>>
>> I'm not angry. When I'm angry I send emails with paragraphs of
>> expletives. It doesn't happen very often.
>>
>> This is an open source ASF project, it's not about fighting for market
>> share against startups who find it necessary to inflate their level of
>> compatibility to sell support contracts, it's about providing software that
>> people can use (with a license that makes it easy to use). I don't work for
>> a company that makes money selling Cassandra based solutions and you're not
>> an opponent.
>>
>> >
>> > Scylla IS a drop in replacement for C*. We support the same CQL (from
>> > version 1.7 it's cql 3.3.1, protocol v4), the same SStable format
>> (based on
>> > 2.1.8).
>>
>> Scylla doesn't even run on all of the supported operating systems, let
>> alone have feature parity or network level compatibility (which you'd
>> probably need if you REALLY want to be drop-in
>> stop-one-cassandra-node-swap-binaries-start-it-up compatible, which is
>> what your site used to claim, but obviously isn't supported). You support a
>> subset of one query language and can read and write one sstable format. You
>> do it with great supporting tech and a great engineering team, but you're
>> not compatible, and if I were your cofounder I'd ask you to focus on the
>> tech strengths and not your drop-in compatibility, so engineers who care
>> about facts don't grow to resent your public lies.
>>
>> I've used a lot of databases in my life, but I don't know that I've ever
>> had someone call me angry because I pointed out that database A wasn't
>> compatible with database B, but I guess I'll chalk it up to 2017 and the
>> year of fake news / alternative facts.
>>
>> Hugs and kisses,
>> - Jeff
>>
>>
>>
>

Re: scylladb

2017-03-11 Thread benjamin roth

There is no reason to be angry. This is progress. This is the circle of
life.

It happens anywhere at any time.

On 12.03.2017 at 07:34, "Dor Laor" wrote:

> On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa  wrote:
>
>>
>>
>> On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
>> > Cassandra vs Scylla is a valid comparison because they both are
>> compatible. Scylla is a drop-in replacement for Cassandra.
>>
>> No, they aren't, and no, it isn't
>>
>
> Jeff is angry with us for some reason. I don't know why; it's natural that
> when a new opponent appears there are objections, and the burden of proof
> lies on us. We go to great lengths to provide it and we don't just throw
> comments around without backing.
>
> Scylla IS a drop in replacement for C*. We support the same CQL (from
> version 1.7 it's cql 3.3.1, protocol v4), the same SStable format (based on
> 2.1.8). In the 1.7 release we support cql uploader
> from 3.x. We will support the SStable format of 3.x natively in 3 months'
> time. Soon all of the feature set will be implemented. We have always been
> using this page (not 100% up to date, we'll update it this week):
> http://www.scylladb.com/technology/status/
>
> We add a jmx-proxy daemon in java in order to make the transition as
> smooth as possible. Almost all the nodetool commands just work, for sure
> all the important ones.
> Btw: we have a RESTapi and Prometheus formats, much better than the hairy
> jmx one.
>
> Spark, Kairosdb, Presto and probably Titan (we add Thrift just for legacy
> users and we don't intend
> to decommission an api).
>
> Regarding benchmarks, if someone finds a flaw in them, we'll do the best
> to fix it.
> Let's ignore them and just hear what our users have to say:
> http://www.scylladb.com/users/
>
>
>

Re: scylladb

2017-03-11 Thread benjamin roth

Why?

On 12.03.2017 at 07:02, "Jeff Jirsa" wrote:

>
>
> On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
> > Cassandra vs Scylla is a valid comparison because they both are
> compatible. Scylla is a drop-in replacement for Cassandra.
>
> No, they aren't, and no, it isn't
>
>
>
>
>

Re: scylladb

2017-03-11 Thread benjamin roth

Thanks a lot for your detailed explanation!
I am very curious about the future development of ScyllaDB! Especially
about MVs and LWT!

On 11.03.2017 at 02:05, "Dor Laor" wrote:

> On Fri, Mar 10, 2017 at 4:45 PM, Kant Kodali  wrote:
>
>> http://performanceterracotta.blogspot.com/2012/09/numa-java.html
>> http://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html
>> http://openjdk.java.net/jeps/163
>>
>>
> Java can exploit NUMA but not as efficiently as can be done in C++.
> Andrea Arcangeli is the engineer behind Linux transparent huge pages (THP);
> he reported to me and the idea belongs to Avi. We did it for KVM's sake but
> it was designed for any long-running process like Cassandra.
> However, the entire software stack should be aware of it. If you get a huge
> page (2MB) but keep only 1KB in it, you waste lots of memory. On top of this,
> threads need to touch their data structures and these need to be well
> aligned, otherwise the memory page will bounce between the different cores.
> With Cassandra it gets more complicated since there is heap and off-heap
> data.
>
> Do programmers really track their data alignment? I doubt it.
> Do users run C* with the JVM NUMA options and the right Linux THP options?
> Again, I doubt it.
>
> Scylla, on the other hand, is designed for NUMA. We have 2-level sharding.
> The inner shards are transparent to the user and are per-core (hyper
> thread). Such a shard accesses RAM only within its NUMA node. Memory
> is bound to each thread/NUMA node. We have our own malloc allocator built
> for this scheme.
>
>
>
>> If scyllaDB has efficient Secondary indexes, LWT and MV's then that is
>> something. I would be glad to see how they perform.
>>
>>
> MV will be in 1.8; we haven't measured performance yet. We did measure our
> counter implementation and it looks promising (4X better throughput and 4X
> better latency on an 8-core machine).
> The not-yet-written LWT will kick-a** since our fully async engine is
> ideal for the larger number of round trips that LWT needs.
>
> This is with the Linux TCP stack; once we use our DPDK one, performance
> will improve further ;)
>
>
>
>>
>> On Fri, Mar 10, 2017 at 10:45 AM, Dor Laor  wrote:
>>
>>> Scylla isn't just about performance, either.
>>>
>>> First, a disclaimer: I am a Scylla co-founder. I respect open source a
>>> lot, so you guys are welcome to shush me out of this thread. I only
>>> participate to provide value if I can (this is a thread about Scylla and
>>> our users are on our mailing list).
>>>
>>> Scylla is all about what Cassandra is plus:
>>>  - Efficient hardware utilization (scale-up, performance)
>>>  - Low tail latency
>>>  - Auto/dynamic tuning (no JVM tuning, we tune the OS ourselves, we have
>>>    a CPU scheduler, an I/O userspace scheduler and more to come).
>>>  - SLA between compaction, repair, streaming and your r/w operations
>>>
>>> We started with a great foundation (C*) and wish to improve almost every
>>> aspect of it.
>>> Admittedly, we're way behind C* in terms of adoption. One needs to start
>>> somewhere.
>>> However, users such as AppNexus run Scylla in production with 47
>>> physical nodes across 5 datacenters, and their VP estimates that C* would
>>> have at least doubled the size. So this is equivalent to a 100-node C*
>>> cluster. Since we have the same gossip, murmur3 hash and CQL, nothing
>>> stops us from scaling to 1,000 nodes. Another user (Mogujie) runs 10s of
>>> TBs per node(!) in production.
>>>
>>> Also, since we try to compare Scylla and C* in a fair way, we invested a
>>> great deal of time in running C*. I can say it's not simple at all.
>>> Lastly, in a couple of months we'll reach parity in functionality with
>>> C* (counters are in 1.7 as experimental, in 1.8 counters will be stable and
>>> we'll have MV as experimental, LWT will be in the summer). We hope to
>>> collaborate with the C* community on the development of future
>>> features.
>>>
>>> Dor
>>>
>>>
>>> On Fri, Mar 10, 2017 at 10:19 AM, Jacques-Henri Berthemet <
>>> jacques-henri.berthe...@genesys.com> wrote:
>>>
>>>> Cassandra is not about pure performance, there are many other DBs that
>>>> are much faster than Cassandra. Cassandra strength is all about
>>>> scalability, performance increases in a linear way as you add more nodes.
>>>> During Cassandra summit 2014 Apple said they have a 10k node cluster. The
>>>> usual limiting factor is your disk write speed and latency, I don’t see how
>>>> C++ changes anything in this regard unless you can cache all your data in
>>>> memory.
>>>>
>>>> I’d be curious to know how ScyllaDB performs with a 100+ nodes cluster
>>>> with PBs of data compared to Cassandra.
>>>>
>>>> *--*
>>>>
>>>> *Jacques-Henri Berthemet*
>>>>
>>>> *From:* Rakesh Kumar [mailto:rakeshkumar...@outlook.com]
>>>> *Sent:* vendredi 10 mars 2017 09:58
>>>>
>>>> *To:*

Re: Can I do point in time recover using nodetool

2017-03-08 Thread benjamin roth

I remember a very similar question on the list some months ago.
The short answer is that there is no short answer. I'd recommend you search
the mailing list archive for "backup" or "recover".

2017-03-08 10:17 GMT+01:00 Bhardwaj, Rahul :

> Hi All,
>
>
>
> Is there any possibility of restoring cassandra snapshots to point in time
> without using opscenter ?
>
>
>
>
>
>
>
>
>
> *Thanks and Regards*
>
> *Rahul Bhardwaj*
>
>
>

Re: Limit on number of keyspaces/tables

2017-03-05 Thread benjamin roth

Why do you think 1 table consumes 1M?

On 05.03.2017 at 20:36, "Vladimir Yudovin" wrote:

> Hi,
>
> There is no such hard limit, but each table consumes at least 1M of memory,
> so 1000 tables take at least 1G.
>
> Best regards, Vladimir Yudovin,
> *Winguzone  - Cloud Cassandra Hosting*
>
>
> On Sun, 05 Mar 2017 05:57:48 -0500 *Lata Kannan* wrote:
>
> Hi
>
> I just wanted to check if there is any known limit to the number of
> keyspaces one can create in a Cassandra cluster? Alternatively is there
> a max on the number of tables that can be created in a cluster?
>
>
> --
> Thanks
> --lata
>
>
>

Re: Limit on number of keyspaces/tables

2017-03-05 Thread benjamin roth

No, seriously.

On 05.03.2017 at 2:54 PM, "Rakesh Kumar" wrote:

> > I ask back: what's your intention
>
> Maybe documenting the limitations of Cassandra to show Oracle is better
> :-)
>
> On 05.03.2017 at 11:58, "Lata Kannan" wrote:
>
>

Re: Limit on number of keyspaces/tables

2017-03-05 Thread benjamin roth

Let me ask back: what's your intention?

On 05.03.2017 at 11:58, "Lata Kannan" wrote:

> Hi
>
> I just wanted to check if there is any known limit to the number of
> keyspaces one can create in a Cassandra cluster? Alternatively is there a
> max on the number of tables that can be created in a cluster?
>
>
> --
> Thanks
> --lata
>
>

Rebuild / removenode with MV is inconsistent

2017-03-01 Thread benjamin roth

Hi there,

Today I came up with the following thesis:

A rebuild / removenode may break the base-table <> MV contract.
I'd even claim that a rebuild / removenode requires rebuilding all MVs to
guarantee MV consistency.

Reason:
A node can have base tables with MVs. This is no problem. If these are
streamed during rebuild/removenode, the underlying MVs are updated via the
write path and the consistency contract will be fulfilled.
BUT a node may also contain ranges for MVs whose base table resides on a
different node. When these are streamed from another node, then for
example the base table on node A suddenly has the replica from the base table
of node B, and this is not consistent any more.

Re: Non-zero nodes are marked as down after restarting cassandra process

2017-03-01 Thread benjamin roth

You should always drain nodes before stopping the daemon whenever possible.
This avoids commitlog replay on startup, which can take a while. But
according to your description, commitlog replay does not seem to be the cause.
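
The drain-then-stop order can be scripted so that nobody forgets it. This is only a minimal sketch: it assumes `nodetool` is on the PATH and that the daemon is managed via `service cassandra`, as in the commands quoted further down. The function name `stop_node` and the `dry_run` flag are made up for illustration.

```python
import subprocess

# Commands as they appear in this thread; adjust the service name to your setup.
SHUTDOWN_STEPS = [
    ["nodetool", "drain"],             # flush memtables, stop accepting connections
    ["service", "cassandra", "stop"],  # only then stop the daemon
]

def stop_node(dry_run=True):
    """Run the shutdown steps in order and return the commands that were handled.

    With dry_run=True nothing is executed, the commands are only listed.
    """
    handled = []
    for cmd in SHUTDOWN_STEPS:
        if not dry_run:
            # check=True raises CalledProcessError, so a failed drain
            # prevents the stop step from running.
            subprocess.run(cmd, check=True)
        handled.append(" ".join(cmd))
    return handled

if __name__ == "__main__":
    for line in stop_node(dry_run=True):
        print(line)
```

Aborting on a failed drain is a deliberate choice here: if the drain did not complete, stopping the daemon anyway would bring back the commitlog replay you wanted to avoid.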

I once had a similar effect. Some nodes appeared down to some nodes
and up to others. At that time the cluster had overall stability problems
due to some bugs. Since those bugs were fixed, I haven't seen this effect
any more.

If that happens again to you, you could check your logs or "nodetool
tpstats" for dropped messages, watch out for suspicious network-related
logs and the load of your nodes in general.

2017-03-01 17:36 GMT+01:00 Ben Dalling :

> Hi Andrew,
>
> We were having problems with gossip TCP connections being held open and
> changed our SOP for stopping cassandra to being:
>
> nodetool disablegossip
> nodetool drain
> service cassandra stop
>
> This seemed to close down the gossip cleanly (the nodetool drain is
> advised as well) and meant that the node rejoined the cluster fine after
> issuing "service cassandra start".
>
> *Ben*
>
> On 1 March 2017 at 16:29, Andrew Jorgensen 
> wrote:
>
>> Hello,
>>
>> I have a cassandra cluster running on cassandra 3.0.3 and am seeing some
>> strange behavior that I cannot explain when restarting cassandra nodes. The
>> cluster is currently setup in a single datacenter and consists of 55 nodes.
>> I am currently in the process of restarting nodes in the cluster but have
>> noticed that after restarting the cassandra process with `service cassandra
>> start; service cassandra stop` when the node comes back and I run `nodetool
>> status` there is usually a non-zero number of nodes in the rest of the
>> cluster that are marked as DN. If I go to another node in the cluster,
>> from its perspective all nodes, including the restarted one, are marked as UN.
>> It seems to take ~15 to 20 minutes before the restarted node is updated to
>> show all nodes as UN. During the 15 minutes, writes and reads to the
>> cluster appear to be degraded and do not recover unless I stop the
>> cassandra process again or wait for all nodes to be marked as UN. The
>> cluster also has 3 seed nodes which during this process are up and
>> available the whole time.
>>
>> I have also tried doing `gossipinfo` on the restarted node and according
>> to the output all nodes have a status of NORMAL. Has anyone seen this
>> before and is there anything I can do to fix/reduce the impact of running a
>> restart on a cassandra node?
>>
>> Thanks,
>> Andrew Jorgensen
>> @ajorgensen
>>
>
>

Re: Resources for fire drills

2017-03-01 Thread benjamin roth

@Doc:
http://cassandra.apache.org/doc/latest/ is built from the git repo. So you
can add documentation in doc/source and submit a patch.
I personally think that is not the very best place or way to build a
knowledge DB, but that's what we have.


2017-03-01 13:39 GMT+01:00 Malte Pickhan :

> Hi,
>
> really cool that this discussion gets attention.
>
> You are right my question was quite open.
>
> For me it would already be helpful to compile a list, like the one Ben
> started, of scenarios that can happen to a cluster
> and what actions/strategies you have to take to resolve the incident
> without losing data and while keeping a healthy cluster.
>
> Ideally we would add some kind of rating of how hard the scenario is to
> resolve, so that teams can go through a kind of learning curve.
>
> For the beginning I think it would already be sufficient to document the
> steps how you can get a cluster into the situation which has been described
> in the scenario.
>
> Hope it’s a bit clearer now what I mean.
>
> Is there some kind of community space where we could start a document for
> this purpose?
>
> Best,
>
> Malte
>
> > On 1 Mar 2017, at 13:33, Stefan Podkowinski  wrote:
> >
> > I've been thinking about this for a while, but haven't found a practical
> > solution yet, although the term "fire drill" leaves a lot of room for
> > interpretation. The most basic requirements I'd have for these kind of
> > trainings would start with automated cluster provisioning for each
> > scenario (either for teams or individuals) and provisioning of test data
> > for the cluster, with optionally some kind of load generator constantly
> > running in the background. I started to work on some Ansible scripts
> > that would do that on AWS a couple of months ago, but it turned out to
> > be a lot of work with all the details you have to take care of. So I'd
> > be happy to hear about any existing resources on that as well!
> >
> >
> > On 01.03.2017 10:59, Malte Pickhan wrote:
> >> Hi Cassandra users,
> >>
> >> I am looking for some resources/guides for firedrill scenarios with
> apache cassandra.
> >>
> >> Do you know anything like that?
> >>
> >> Best,
> >>
> >> Malte
> >>
>
>

Re: Resources for fire drills

2017-03-01 Thread benjamin roth

But if you want to do fire-drills you only have to break things on purpose.

Examples:
- Cut off a commitlog file at a random position and restart CS
- Overwrite some bytes in an SSTable and read all data from it
- Delete some files in /var/lib/cassandra and try to restore them from
backups or a different server
- Shut down more servers than your RF settings allow
- Do all this while your system is under load
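
The first two items in the list above can be done with a few lines of Python. This is a rough sketch for a disposable test cluster only, never for production data. The helper names are made up, and which file you point them at (a commitlog segment, an SSTable data file) is up to you.

```python
import os
import random

def truncate_at_random(path, rng=random):
    """Cut a file off at a random position, e.g. a commitlog segment."""
    size = os.path.getsize(path)
    new_size = rng.randrange(size) if size > 0 else 0
    with open(path, "r+b") as f:
        f.truncate(new_size)
    return new_size

def overwrite_bytes(path, count=8, rng=random):
    """Overwrite `count` consecutive bytes at a random offset with random data.

    Returns the offset that was damaged, or None for an empty file.
    """
    size = os.path.getsize(path)
    if size == 0:
        return None
    count = min(count, size)
    offset = rng.randrange(size - count + 1)
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(bytes(rng.randrange(256) for _ in range(count)))
    return offset
```

Passing a seeded `random.Random(...)` as `rng` makes a drill reproducible, which helps when a team wants to repeat the same incident later.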

Btw.: Restoring from backups is NOT trivial and can lead to unwanted
resurrected data.

2017-03-01 11:06 GMT+01:00 Malte Pickhan <malte.pick...@zalando.de>:

> Yeah, that's the point.
>
> What I mean is some overview of basic scenarios for firedrills, so that
> you can exercise them with your team.
>
> Best
>
>
> On 1 Mar 2017, at 11:01, benjamin roth <brs...@gmail.com> wrote:
>
> Could you specify it a little bit? There are really a lot of things that
> can go wrong.
>
> 2017-03-01 10:59 GMT+01:00 Malte Pickhan <malte.pick...@zalando.de>:
>
>> Hi Cassandra users,
>>
>> I am looking for some resources/guides for firedrill scenarios with
>> apache cassandra.
>>
>> Do you know anything like that?
>>
>> Best,
>>
>> Malte
>
>
>
>

Re: Resources for fire drills

2017-03-01 Thread benjamin roth

As far as I know there is no such resource, at least not officially. IMHO
things like this can be improved a lot within the CS community.

I just proposed on the dev-list to move the official docs out of the repo
into an easier-to-maintain place like a wiki or something similar.
This could help the community to tackle these issues in a better way -
faster (editing + deployment), easier access, better tools.

2017-03-01 11:06 GMT+01:00 Malte Pickhan <malte.pick...@zalando.de>:

> Yeah, that's the point.
>
> What I mean is some overview of basic scenarios for firedrills, so that
> you can exercise them with your team.
>
> Best
>
>
> On 1 Mar 2017, at 11:01, benjamin roth <brs...@gmail.com> wrote:
>
> Could you specify it a little bit? There are really a lot of things that
> can go wrong.
>
> 2017-03-01 10:59 GMT+01:00 Malte Pickhan <malte.pick...@zalando.de>:
>
>> Hi Cassandra users,
>>
>> I am looking for some resources/guides for firedrill scenarios with
>> apache cassandra.
>>
>> Do you know anything like that?
>>
>> Best,
>>
>> Malte
>
>
>
>

Re: Resources for fire drills

2017-03-01 Thread benjamin roth

Could you specify it a little bit? There are really a lot of things that
can go wrong.

2017-03-01 10:59 GMT+01:00 Malte Pickhan :

> Hi Cassandra users,
>
> I am looking for some resources/guides for firedrill scenarios with apache
> cassandra.
>
> Do you know anything like that?
>
> Best,
>
> Malte

Re: Is periodic manual repair necessary?

2017-02-28 Thread benjamin roth

Hi Jayesh,

Your statements are mostly right, except:
Yes, compactions do purge tombstones but that *does not avoid resurrection*.
A resurrection takes place in this situation:

Node A:
Key A is written
Key A is deleted

Node B:
Key A is written
(Deletion never happens, for example because of a dropped mutation)

Then after gc_grace_seconds:
Node A:
Compaction removes both write and tombstone, so data is completely gone

Node B:
Still contains Key A

Then you do a repair
Node A:
Receives Key A from Node B

Got it?
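
The sequence above can be replayed with a toy model. This is only a sketch of the idea, not how Cassandra stores data: the `Node` class, the `GC_GRACE` value and the abstract integer timestamps are all made up for illustration.

```python
GC_GRACE = 10  # stand-in for gc_grace_seconds, in abstract time units

class Node:
    def __init__(self):
        # key -> (timestamp, value); a value of None represents a tombstone
        self.cells = {}

    def write(self, key, value, ts):
        cur = self.cells.get(key)
        if cur is None or ts > cur[0]:
            self.cells[key] = (ts, value)

    def delete(self, key, ts):
        self.write(key, None, ts)

    def compact(self, now):
        # Purge tombstones older than GC_GRACE. The shadowed write is
        # already gone because the tombstone replaced it in this model.
        for key, (ts, value) in list(self.cells.items()):
            if value is None and now - ts > GC_GRACE:
                del self.cells[key]

    def read(self, key):
        cell = self.cells.get(key)
        return None if cell is None else cell[1]

def repair(a, b):
    # Exchange cells in both directions; the newest timestamp wins.
    for src, dst in ((a, b), (b, a)):
        for key, (ts, value) in list(src.cells.items()):
            dst.write(key, value, ts)

def scenario(repair_before_gc_grace):
    a, b = Node(), Node()
    a.write("A", "v", ts=1)
    b.write("A", "v", ts=1)
    a.delete("A", ts=2)          # node B never sees the delete
    if repair_before_gc_grace:
        repair(a, b)             # the tombstone reaches B in time
    a.compact(now=20)
    b.compact(now=20)
    repair(a, b)
    return a.read("A"), b.read("A")

if __name__ == "__main__":
    print(scenario(repair_before_gc_grace=False))  # key A is back on node A
    print(scenario(repair_before_gc_grace=True))   # key A stays deleted
```

The only difference between the two runs is whether a repair happened before the tombstone was purged, which is the reason for repairing within gc_grace_seconds.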

But I was thinking a bit about your situation. If you NEVER do deletes and
have ONLY TTLs, this could change the game. The difference? If you have only
TTLs, the delete information and the write information always reside on
the same node and never exist alone, so the write-delete pair should
always be consistent. As far as I can see there will be no resurrections
then.
BUT: Please don't pin me down on it. *I have neither tested it nor read
the source code to prove it in theory.*

Maybe some other guys have some more thoughts or information on this.

By the way:
CS itself is not fragile. Distributed systems are. It's like the old
saying: Things that can go wrong will go wrong. Network fails, hardware
fails, software fails. You can have timeouts, dropped messages (timeouts
help a cluster/node to survive high pressure situations), a crashed daemon.
Yes things go wrong. All the time. Even on a 1 node system (like MySQL)
ensuring absolute consistency is not so easy and requires many safety nets
like unbuffered IO and battery backed HD controllers which can harm
performance a lot.

You could also create a perfectly consistent distributed system like CS but
it would be slow and not partition tolerant or not highly available.

2017-02-28 16:06 GMT+01:00 Thakrar, Jayesh <jthak...@conversantmedia.com>:

> Thanks - getting a better picture of things.
>
>
>
> So "entropy" is tendency of a C* datastore to be inconsistent due to
> writes/updates not taking place across ALL nodes that carry replica of a
> row (can happen if nodes are down for maintenance)
>
> It can also happen due to node crashes/restarts that can result in loss of
> uncommitted data.
>
> This can result in either stale data or ghost data (column/row
> re-appearing after a delete).
>
> So there are the "anti-entropy" processes in place to help with this
>
> - hinted handoff
>
> - read repair (can happen while performing a consistent read OR also async
> as driven/configured by *_read_repair_chance AFTER consistent read)
>
> - commit logs
>
> - explicit/manual repair via command
>
> - compaction (compaction is indirect mechanism to purge tombstone, thereby
> ensuring that stale data will NOT resurrect)
>
>
>
> So for an application where you have only timeseries data or where data is
> always inserted, I would like to know the need for manual repair?
>
>
>
> I see/hear advice that there should always be a periodic (mostly weekly)
> manual/explicit repair in a C* system - and that's what I am trying to
> understand.
>
> Repair is a real expensive process and would like to justify the need to
> expend resources (when and how much) for it.
>
>
>
> Among other things, this advice also gives an impression to people not
> familiar with C* (e.g. me) that it is too fragile and needs substantial
> manual intervention.
>
>
>
> Appreciate all the feedback and details that you have been sharing.
>
>
>
> *From: *Edward Capriolo <edlinuxg...@gmail.com>
> *Date: *Monday, February 27, 2017 at 8:00 PM
> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Cc: *Benjamin Roth <benjamin.r...@jaumo.com>
> *Subject: *Re: Is periodic manual repair necessary?
>
>
>
> There are 4 anti entropy systems in cassandra.
>
>
>
> Hinted handoff
>
> Read repair
>
> Commit logs
>
> Repair command
>
>
>
> All are basically best effort.
>
>
>
> Commit logs get corrupt and only flush periodically.
>
>
>
> Bits rot on disk and while crossing networks
>
>
>
> Read repair is async and only happens randomly
>
>
>
> Hinted handoff stops after some time and is not guaranteed.
> On Monday, February 27, 2017, Thakrar, Jayesh <
> jthak...@conversantmedia.com> wrote:
>
> Thanks Roth and Oskar for your quick responses.
>
>
>
> This is a single datacenter, multi-rack setup.
>
>
>
> > A TTL is technically similar to a delete - in the end both create
> tombstones.
>
> >If you want to eliminate the possibility of resurrected deleted data, you
> should run repairs.
>
> So why do I need to worry about data resurrection?
>
> Because, the TTL for the data is specified at the row

unsubscribe

2017-02-28 Thread Benjamin Roth

-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: Is periodic manual repair necessary?

2017-02-27 Thread Benjamin Roth

A TTL is technically similar to a delete - in the end both create
tombstones.
If you want to eliminate the possibility of resurrected deleted data, you
should run repairs.

If you can guarantee 100% that data is read-repaired within
gc_grace_seconds after the data has been TTL'ed, you won't need an extra
repair.

2017-02-27 18:29 GMT+01:00 Oskar Kjellin <oskar.kjel...@gmail.com>:

> Are you running multi dc?
>
> Sent from my iPad
>
> On 27 Feb 2017 at 16:08, Thakrar, Jayesh <jthak...@conversantmedia.com>
> wrote:
>
> Suppose I have an application, where there are no deletes, only 5-10% of
> rows being occasionally updated (and that too only once) and a lot of reads.
>
>
>
> Furthermore, I have replication = 3 and both read and write are configured
> for local_quorum.
>
>
>
> Occasionally, servers do go into maintenance.
>
>
>
> I understand when the maintenance is longer than the period for
> hinted_handoffs to be preserved, they are lost and servers may have stale
> data.
>
> But I do expect it to be rectified on reads. If the stale data is not read
> again, I don’t care for it to be corrected as then the data will be
> automatically purged because of TTL.
>
>
>
> In such a situation, do I need to have a periodic (weekly?) manual/batch
> read_repair process?
>
>
>
> Thanks,
>
> Jayesh Thakrar
>
>



Re: Which compaction strategy when modeling a dumb set

2017-02-27 Thread Benjamin Roth

This is not a queue pattern and I'd recommend LCS for better read
performance.

2017-02-27 16:06 GMT+01:00 Rakesh Kumar <rakeshkumar...@outlook.com>:

> Do you update this table when an event is processed?  If yes, it is
> considered a good practice for Cassandra.  I read somewhere that using
> Cassandra as a queuing table is anti pattern.
> 
> From: Vincent Rischmann <m...@vrischmann.me>
> Sent: Friday, February 24, 2017 06:24
> To: user@cassandra.apache.org
> Subject: Which compaction strategy when modeling a dumb set
>
> Hello,
>
> I'm using a table like this:
>
>CREATE TABLE myset (id uuid PRIMARY KEY)
>
> which is basically a set I use for deduplication, id is a unique id for an
> event, when I process the event I insert the id, and before processing I
> check if it has already been processed for deduplication.
>
> It works well enough, but I'm wondering which compaction strategy I should
> use. I expect maybe 1% or less of events will end up duplicated (thus not
> generating an insert), so the workload will probably be 50% writes 50% read.
>
> Is LCS a good strategy here or should I stick with STCS ?
>




Re: Understanding of proliferation of sstables during a repair

2017-02-26 Thread Benjamin Roth

Too many open files. The limit is 100k by default and we had >40k sstables.
Normally there are around 500-1000.

On 27.02.2017 at 02:40, "Seth Edwards" <s...@pubnub.com> wrote:

> This makes a lot more sense. What does TMOF stand for?
>
> On Sun, Feb 26, 2017 at 1:01 PM, Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>
>> Hi Seth,
>>
>> Repairs can create a lot of tiny SSTables. I also encountered the
>> creation of so many sstables that the node died because of TMOF. At that
>> time the affected nodes were REALLY inconsistent.
>>
>> One reason can be immense inconsistencies spread over many
>> partition(-ranges) with a lot of subrange repairs that trigger a lot of
>> independent streams. Each stream results in a single SSTable that can be
>> very small. No matter how small it is, it has to be compacted and can cause
>> a compaction impact that is a lot bigger than expected from a tiny little
>> table.
>>
>> Also consider that there is a theoretical race condition that can cause
>> repairs even though data is not inconsistent, due to mutations that are
>> still "in flight" during merkle tree calculation.
>>
>> 2017-02-26 20:41 GMT+01:00 Seth Edwards <s...@pubnub.com>:
>>
>>> Hello,
>>>
>>> We just ran a repair on a keyspace using TWCS and a mixture of TTLs.
>>> This caused a large proliferation of sstables and compactions. There is
>>> likely a lot of entropy in this keyspace. I am trying to better understand
>>> why this is.
>>>
>>> I've also read that you may not want to run repairs on short TTL data
>>> and rely upon other anti-entropy mechanisms to achieve consistency instead.
>>> Is this generally true?
>>>
>>>
>>> Thanks!
>>>
>>
>>
>>
>> --
>> Benjamin Roth
>> Prokurist
>>
>> Jaumo GmbH · www.jaumo.com
>> Wehrstraße 46 · 73035 Göppingen · Germany
>> Phone +49 7161 304880-6 <+49%207161%203048806> · Fax +49 7161 304880-1
>> <+49%207161%203048801>
>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>
>
>

Re: Understanding of proliferation of sstables during a repair

2017-02-26 Thread Benjamin Roth

Hi Seth,

Repairs can create a lot of tiny SSTables. I also encountered the creation
of so many sstables that the node died because of TMOF. At that time the
affected nodes were REALLY inconsistent.

One reason can be immense inconsistencies spread over many
partition(-ranges) with a lot of subrange repairs that trigger a lot of
independent streams. Each stream results in a single SSTable that can be
very small. No matter how small it is, it has to be compacted and can cause
a compaction impact that is a lot bigger than expected from a tiny little
table.

Also consider that there is a theoretical race condition that can cause
repairs even though data is not inconsistent, due to mutations that are
still "in flight" during merkle tree calculation.
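
The race condition can be illustrated with a toy comparison of per-range hashes. This is a sketch under loud assumptions: the hash buckets below only stand in for the leaves of a Merkle tree, the bucketing by SHA-256 is not Cassandra's partitioner, and all names are invented.

```python
import hashlib

def range_hashes(rows, num_ranges=4):
    """Hash the rows of each range; a stand-in for Merkle tree leaves."""
    buckets = [hashlib.sha256() for _ in range(num_ranges)]
    for key in sorted(rows):
        # Assign each key to a range by hashing it (illustrative only).
        idx = int(hashlib.sha256(key.encode()).hexdigest(), 16) % num_ranges
        buckets[idx].update(f"{key}={rows[key]};".encode())
    return [b.hexdigest() for b in buckets]

def mismatched_ranges(h1, h2):
    """Indexes of ranges whose hashes differ and would be streamed."""
    return [i for i, (x, y) in enumerate(zip(h1, h2)) if x != y]

def demo():
    replica_a = {"k1": "v1", "k2": "v2"}
    replica_b = dict(replica_a)
    replica_a["k3"] = "v3"               # mutation already applied on A
    hashes_a = range_hashes(replica_a)
    hashes_b = range_hashes(replica_b)   # B is hashed before the mutation lands
    replica_b["k3"] = "v3"               # mutation arrives on B a moment later
    return mismatched_ranges(hashes_a, hashes_b), replica_a == replica_b

if __name__ == "__main__":
    diff, same_data = demo()
    print("ranges flagged for streaming:", diff)
    print("replicas identical afterwards:", same_data)
```

One range is flagged although both replicas end up with identical data, so the repair streams (and later compacts) an SSTable that was not needed.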

2017-02-26 20:41 GMT+01:00 Seth Edwards <s...@pubnub.com>:

> Hello,
>
> We just ran a repair on a keyspace using TWCS and a mixture of TTLs. This
> caused a large proliferation of sstables and compactions. There is likely a
> lot of entropy in this keyspace. I am trying to better understand why this
> is.
>
> I've also read that you may not want to run repairs on short TTL data and
> rely upon other anti-entropy mechanisms to achieve consistency instead. Is
> this generally true?
>
>
> Thanks!
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: High disk io read load

2017-02-24 Thread Benjamin Roth

It was only the schema change.

2017-02-24 19:18 GMT+01:00 kurt greaves <k...@instaclustr.com>:

> How many CFs are we talking about here? Also, did the script also kick off
> the scrubs or was this purely from changing the schemas?
> 
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: Does C* coordinator writes to replicas in same order or different order?

2017-02-21 Thread Benjamin Roth

For eventual consistency, it does not matter whether replication is sync or
async. LWW always works as long as clocks are synchronized.
That's a design pattern of CS and of EC databases in general. Every write has
a timestamp and, no matter when it arrives, the last write will win, even if
an "earlier" write arrives late due to network latency or an unavailable
server that receives a hint after 1 hour.
Doing replication synchronously will kill all the benefits you get from CS's
design:
- low latency
- partition tolerance
- high availability

Doing sync replication would also not guarantee a state, as another client
could "interfere" with your write. So you still have no "linearizability".
Only LWT provides this.
You cannot rely on ordering in CS, no matter how replication works. You can
only rely on it "eventually", but there is never a point in time at which you
can tell with 100% certainty that your system is completely consistent.
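
To illustrate the LWW point, here is a minimal Python sketch. It is not
Cassandra's actual code: the Write class, the tie-break on the value and the
sample timestamps are assumptions made up for the example. It only shows that
when every write carries its own timestamp, the arrival order at a replica
does not change the final state.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Write:
    """A single cell write with a timestamp assigned before replication."""
    value: str
    timestamp: int  # e.g. microseconds since epoch


def apply_lww(current, incoming):
    """Keep whichever write has the higher timestamp (last write wins).

    Ties are broken by comparing values so the result stays deterministic;
    this tie-break is a simplification for the sketch.
    """
    if current is None:
        return incoming
    return max(current, incoming, key=lambda w: (w.timestamp, w.value))


def replay(writes):
    """Apply writes in the given arrival order and return the survivor."""
    state = None
    for w in writes:
        state = apply_lww(state, w)
    return state


writes = [Write("a", 100), Write("b", 200), Write("c", 150)]

# Try every possible arrival order, e.g. a write delayed by a late hint.
results = {replay(order) for order in permutations(writes)}
print(results)  # one element only: the write with timestamp 200
```

Each replica can therefore apply the writes in whatever order they arrive
and still converge, which is why no ordering between replicas is needed for
LWW itself.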

Maybe what you could do, if you are talking about "orders" and that pointer
thing you mentioned earlier: try something similar to what MVs do.
Create a trigger, operate on your local dataset, read the order based on PK
(locally) and update "the pointer" on every write (also locally). If you
then store your pointer with the last known timestamp of your base data,
you also have LWW on your pointer, so the last pointer also wins when
reading with > CL_ONE.
But that will probably harm your write performance.

2017-02-21 10:36 GMT+01:00 Kant Kodali <k...@peernova.com>:

> @Benjamin I am more looking for how C* replication works underneath. There
> are few things here that I would need some clarification.
>
> 1. Does C* uses sync replication or async replication? If it is async
> replication how can one get performance especially when there is an
> ordering constraint among requests to comply with LWW.  Also below is a
> statement from C* website so how can one choose between sync or async
> replication? any configuration parameter that needs to be passed in?
>
> "Choose between synchronous or asynchronous replication for each update."
>
> http://cassandra.apache.org/
>
> 2. Is it Guaranteed that C* coordinator writes data in the same order to
> all the replicas (either sync or async)?
>
> Thanks,
> kant
>
> On Tue, Feb 21, 2017 at 1:23 AM, Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>
>> To me that sounds like a completely different design pattern and a
>> different use case.
>> CS was not designed to guarantee order. It was built to be linearly
>> scalable, highly concurrent and eventually consistent.
>> To me it sounds like an ACID DB would better serve what you are asking for.
>>
>> 2017-02-21 10:17 GMT+01:00 Kant Kodali <k...@peernova.com>:
>>
>>> Agreed that async performs better than sync in general but the catch
>>> here to me is the "order".
>>>
>>> The whole point of async is to do out of order processing by which I
>>> mean say if a request 1 comes in at time t1 and a request 2 comes in at
>>> time t2 where t1 < t2 and say now that t1 is taking longer to process than
>>> t2 in which case request 2 should get a response first and subsequently a
>>> response for request 1. This is where I would imagine all the benefits of
>>> async come in but the moment you introduce order by saying for Last Write
>>> Wins all the async requests should be processed in order I would imagine
>>> all the benefits of async are lost.
>>>
>>> Let's see if anyone can comment about how it works inside C*.
>>>
>>> Thanks!
>>>
>>>
>>>
>>> On Mon, Feb 20, 2017 at 10:54 PM, Dor Laor <d...@scylladb.com> wrote:
>>>
>>>> Could be. Let's stay tuned to see if someone else pick it up.
>>>> Anyway, if it's synchronous, you'll have a large penalty for latency.
>>>>
>>>> On Mon, Feb 20, 2017 at 10:11 PM, Kant Kodali <k...@peernova.com>
>>>> wrote:
>>>>
>>>>> Thanks again for the response! if they mean it between client and
>>>>> server I am not sure why they would use the word "replication" in the
>>>>> statement below since there is no replication between client and server(
>>>>> coordinator).
>>>>>
>>>>> "Choose between synchronous or asynchronous replication for each
>>>>>> update."
>>>>>>
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Feb 20, 2017, at 5:30 PM, Dor Laor <d...@scylladb.com> wrote:
>>>>>
>>>>> I think they mean the client to server and not among the

Re: Does C* coordinator writes to replicas in same order or different order?

2017-02-21 Thread Benjamin Roth

> timestamp right?  What I am really looking for is that if I send write
>>>>> request concurrently for record 1 and record 2 are they guaranteed to be
>>>>> inserted in the same order across replicas? (Whatever order coordinator 
>>>>> may
>>>>> choose is fine but I want the same order across all replicas and with 
>>>>> async
>>>>> replication I am not sure how that is possible ? for example,  if a 
>>>>> request
>>>>> arrives with timestamp t1 and another request arrives with a timestamp t2
>>>>> where t1 < t2...with async replication what if one replica chooses to
>>>>> execute t2 first and then t1 simply because t1 is slow while another
>>>>> replica choose to execute t1 first and then t2..how would that work?  )*
>>>>>
>>>>>>
>>>>>> Note that C* each node can be a coordinator (one per request) and its
>>>>>> the desired case in order to load balance the incoming requests. Once
>>>>>> again,
>>>>>> timestamps determine the order among the requests.
>>>>>>
>>>>>> Cheers,
>>>>>> Dor
>>>>>>
>>>>>> On Mon, Feb 20, 2017 at 4:12 PM, Kant Kodali <k...@peernova.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> when C* coordinator writes to replicas does it write it in same
>>>>>>> order or
>>>>>>> different order? other words, Does the replication happen
>>>>>>> synchronously or
>>>>>>> asynchrnoulsy ? Also does this depend sync or async client? What
>>>>>>> happens in
>>>>>>> the case of concurrent writes to a coordinator ?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> kant
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: High disk io read load

2017-02-20 Thread Benjamin Roth

Hah! Found the problem!

After setting read_ahead to 0 and the compression chunk size to 4kb on all
CFs, the situation was PERFECT (nearly, please see below)! I have scrubbed
some CFs but not the whole dataset yet. I knew it was not a lack of RAM.
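
For reference, a rough Python sketch of how such a chunk size change could be
scripted. It only builds the CQL text and assumes the Cassandra 3.x
compression syntax (chunk_length_in_kb) with LZ4Compressor; the keyspace and
table names are hypothetical. Existing SSTables keep their old chunk size
until they are rewritten (e.g. by scrub or compaction).

```python
def alter_chunk_size_statements(tables, chunk_kb=4, compressor="LZ4Compressor"):
    """Build one ALTER TABLE statement per (keyspace, table) pair."""
    # Cassandra requires the chunk length to be a power of two.
    if chunk_kb <= 0 or chunk_kb & (chunk_kb - 1):
        raise ValueError("chunk_length_in_kb must be a power of two")
    template = (
        "ALTER TABLE {ks}.{tbl} WITH compression = "
        "{{'class': '{cls}', 'chunk_length_in_kb': {kb}}};"
    )
    return [
        template.format(ks=ks, tbl=tbl, cls=compressor, kb=chunk_kb)
        for ks, tbl in tables
    ]


# Hypothetical table list; replace with your own keyspaces and tables.
tables = [("my_ks", "users"), ("my_ks", "messages")]
for statement in alter_chunk_size_statements(tables):
    print(statement)
```

Given what happened with the schema changes described below, it is probably
wise to apply such statements one at a time with a pause in between rather
than firing them all within a short period.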

Some stats:
- Latency of a quite large CF: https://cl.ly/1r3e0W0S393L
- Disk throughput: https://cl.ly/2a0Z250S1M3c
- Dstat: https://gist.github.com/brstgt/c92bbd46ab76283e534b853b88ad3b26
- This shows that the request distribution remained the same, so no
dyn-snitch magic: https://cl.ly/3E0t1T1z2c0J

Btw. I stumbled across this one:
https://groups.google.com/forum/#!topic/scylladb-dev/j_qXSP-6-gY
Maybe we should also think about lowering default chunk length.

*Unfortunately schema changes had a disturbing effect:*
- I changed the chunk size with a script, so there were a lot of schema
changes in a small period.
- After all tables were changed, one of the seed hosts (cas1) went TOTALLY
crazy.
- Latency on this host was 10x of all other hosts.
- There were more ParNew GCs.
- Load was very high (up to 80, 100% CPU)
- Whole system was unstable due to unpredictable latencies and
backpressures (https://cl.ly/1m022g2W1Q3d)
- Even SELECT * FROM system_schema.table etc appeared as slow query in the
logs
- It was the 1st server in the connect host list for the PHP client
- CS restart didn't help. Reboot did not help (cold page cache made it
probably worse).
- All other nodes were totally ok.
- Stopping CS on cas1 helped to keep the system stable. Brought down
latency again, but was no solution.

=> Only replacing the node (with a newer, faster node) in the connect-host
list helped that situation.

Any ideas why changing schemas and/or chunk size could have such an effect?
For some time the situation was really critical.


2017-02-20 10:48 GMT+01:00 Bhuvan Rawal :

> Hi Benjamin,
>
> Yes, Read ahead of 8 would imply more IO count from disk but it should not
> cause more data read off the disk as is happening in your case.
>
> One probable reason for high disk io would be because the 512 vnode has
> less page to RAM ratio of 22% (100G buff /437G data) as compared to 46%
> (100G/237G). And as your avg record size is in bytes for every disk io you
> are fetching complete 64K block to get a row.
>
> Perhaps you can balance the node by adding equivalent RAM ?
>
> Regards,
> Bhuvan
>

Re: Count(*) is not working

2017-02-20 Thread Benjamin Roth

+1 I also encountered timeouts many, many times (using DS DevCenter).
Roughly, this occurred when count(*) exceeded 1,000,000 rows.
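
One workaround that is sometimes used for large tables is to count per token
subrange and add up the partial results on the client. Below is a small
Python sketch that only generates the CQL text. It assumes the default
Murmur3Partitioner (tokens from -2^63 to 2^63-1); the keyspace, table and key
names are hypothetical, and the number of splits is something you would have
to tune.

```python
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def token_subranges(splits):
    """Split the full Murmur3 token range into contiguous (start, end] ranges."""
    if splits < 1:
        raise ValueError("splits must be >= 1")
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // splits for i in range(splits)]
    bounds.append(MAX_TOKEN)
    return list(zip(bounds[:-1], bounds[1:]))


def count_queries(keyspace, table, pk, splits):
    """Build one count(*) query per token subrange."""
    queries = []
    for i, (start, end) in enumerate(token_subranges(splits)):
        # The first range uses >= so that the lowest token is not skipped.
        lower = ">=" if i == 0 else ">"
        queries.append(
            f"SELECT count(*) FROM {keyspace}.{table} "
            f"WHERE token({pk}) {lower} {start} AND token({pk}) <= {end};"
        )
    return queries


# Hypothetical names; each query scans only a slice of the ring.
for query in count_queries("my_ks", "my_table", "id", 4):
    print(query)
```

Each query touches only a fraction of the data, so it is less likely to hit
the read timeout. Tombstone-heavy ranges can still be slow.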

2017-02-20 14:42 GMT+01:00 Edward Capriolo <edlinuxg...@gmail.com>:

> Seems worth it to file a bug since some here are under the impression it
> almost always works and others are under the impression it almost never
> works.
>
> On Friday, February 17, 2017, kurt greaves <k...@instaclustr.com> wrote:
>
>> really... well that's good to know. it still almost never works though. i
>> guess every time I've seen it it must have timed out due to tombstones.
>>
>> On 17 Feb. 2017 22:06, "Sylvain Lebresne" <sylv...@datastax.com> wrote:
>>
>> On Fri, Feb 17, 2017 at 11:54 AM, kurt greaves <k...@instaclustr.com>
>> wrote:
>>
>>> if you want a reliable count, you should use spark. performing a count
>>> (*) will inevitably fail unless you make your server read timeouts and
>>> tombstone fail thresholds ridiculous
>>>
>>
>> That's just not true. count(*) is paged internally so while it is not
>> particular fast, it shouldn't require bumping neither the read timeout nor
>> the tombstone fail threshold in any way to work.
>>
>> In that case, it seems the partition does have many tombstones (more than
>> live rows) and so the tombstone threshold is doing its job of warning about
>> it.
>>
>>
>>>
>>> On 17 Feb. 2017 04:34, "Jan" <j...@dafuer.de> wrote:
>>>
>>>> Hi,
>>>>
>>>> could you post the output of nodetool cfstats for the table?
>>>>
>>>> Cheers,
>>>>
>>>> Jan
>>>>
>>>> Am 16.02.2017 um 17:00 schrieb Selvam Raman:
>>>>
>>>> I am not getting count as result. Where i keep on getting n number of
>>>> results below.
>>>>
>>>> Read 100 live rows and 1423 tombstone cells for query SELECT * FROM
>>>> keysace.table WHERE token(id) > token(test:ODP0144-0883E-022R-002/047-052)
>>>> LIMIT 100 (see tombstone_warn_threshold)
>>>>
>>>> On Thu, Feb 16, 2017 at 12:37 PM, Jan Kesten <j...@dafuer.de> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> do you got a result finally?
>>>>>
>>>>> Those messages are simply warnings telling you that c* had to read
>>>>> many tombstones while processing your query - rows that are deleted but
>>>>> not garbage collected/compacted. This warning gives you some explanation
>>>>> why things might be much slower than expected because per 100 rows that
>>>>> count c* had to read about 15 times rows that were deleted already.
>>>>>
>>>>> Apart from that, count(*) is almost always slow - and there is a
>>>>> default limit of 10.000 rows in a result.
>>>>>
>>>>> Do you really need the actual live count? To get a idea you can always
>>>>> look at nodetool cfstats (but those numbers also contain deleted rows).
>>>>>
>>>>>
>>>>> Am 16.02.2017 um 13:18 schrieb Selvam Raman:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I want to know the total records count in table.
>>>>>
>>>>> I fired the below query:
>>>>>select count(*) from tablename;
>>>>>
>>>>> and i have got the below output
>>>>>
>>>>> Read 100 live rows and 1423 tombstone cells for query SELECT * FROM
>>>>> keysace.table WHERE token(id) > token(test:ODP0144-0883E-022R-002/047-052)
>>>>> LIMIT 100 (see tombstone_warn_threshold)
>>>>>
>>>>> Read 100 live rows and 1435 tombstone cells for query SELECT * FROM
>>>>> keysace.table WHERE token(id) > token(test:2565-AMK-2) LIMIT 100 (see
>>>>> tombstone_warn_threshold)
>>>>>
>>>>> Read 96 live rows and 1385 tombstone cells for query SELECT * FROM
>>>>> keysace.table WHERE token(id) > token(test:-2220-UV033/04) LIMIT 100 (see
>>>>> tombstone_warn_threshold).
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Can you please help me to get the total count of the table.
>>>>>
>>>>> --
>>>>> Selvam Raman
>>>>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Selvam Raman
>>>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>>>
>>>>
>>>>
>>
>>
>
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: Cassandra blob vs base64 text

2017-02-20 Thread Benjamin Roth

You could save space by storing your data (base64-)decoded as blobs.
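
A quick way to see the size difference, as a small Python sketch. The
3000-byte random payload is just a stand-in for real data: base64 turns every
3 bytes into 4 characters, so the text form is about a third larger before
any SSTable compression is applied.

```python
import base64
import os

raw = os.urandom(3000)           # stand-in for a binary payload (blob column)
encoded = base64.b64encode(raw)  # what a text column would hold instead

print("raw bytes:    ", len(raw))
print("base64 chars: ", len(encoded))
print("overhead:     ", len(encoded) / len(raw))

# Decoding gives back the original bytes, so nothing is lost by storing blobs.
assert base64.b64decode(encoded) == raw
```

On-disk compression may reduce the gap somewhat, but the larger text values
still have to be sent over the network and held in memory.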

2017-02-20 13:38 GMT+01:00 Oskar Kjellin :

> We currently have some cases where we store base64 as a text field instead
> of a blob (running version 2.0.17).
> I would like to move these to blob but wondering what benefits and
> optimizations there are? The possible ones I can think of is (but there's
> probably more):
>
> * blob is stored as off heap ByteBuffers?
> * blob won't be decompressed server side?
>
> Are there any other reasons to switch to blobs? Or are we not going to see
> any difference?
>
> Thanks!
>

Re: High disk io read load

2017-02-19 Thread Benjamin Roth

This is the output of sar:
https://gist.github.com/anonymous/9545fb69fbb28a20dc99b2ea5e14f4cd

It seems to me that there is not enough page cache to handle all data in a
reasonable way.
As pointed out yesterday, the read rate with an empty page cache is ~800MB/s.
That's really (!!!) a lot for 4-5MB/s of network output.

I stumbled across the compression chunk size, which I had always left at the
default of 64kb (https://cl.ly/2w0V3U1q1I1Y). I guess setting a read ahead of
8kb is totally pointless if CS reads 64kb even when it only has to fetch a
single row, right? Are there recommendations for that setting?
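
A back-of-the-envelope calculation in Python for the numbers above. The disk
and network rates are the ones from this mail; the 200-byte row size is only
an assumed example value, and the model ignores the compression ratio,
caching and index reads.

```python
def amplification(disk_mb_s, network_mb_s):
    """Ratio of data read from disk to data sent over the network."""
    return disk_mb_s / network_mb_s


def chunk_overhead(chunk_kb, row_bytes):
    """How many times more bytes are read than needed when a whole
    compression chunk has to be fetched for a single row."""
    return chunk_kb * 1024 / row_bytes


# Observed: ~800 MB/s disk reads for 4-5 MB/s of network output.
print("disk/network ratio:", amplification(800, 5), "to", amplification(800, 4))

ROW_BYTES = 200  # assumption, not a measured value
for chunk_kb in (64, 16, 4):
    print(f"{chunk_kb:>2} kb chunk: {chunk_overhead(chunk_kb, ROW_BYTES):.1f}x the row size")
```

With small rows and a cold page cache, the chunk size rather than the read
ahead sets the lower bound for bytes read per row, which would fit the
observation.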

2017-02-19 19:15 GMT+01:00 Bhuvan Rawal <bhu1ra...@gmail.com>:

> Hi Edward,
>
> This could have been a valid case here but if hotspots indeed existed then
> along with really high disk io , the node should have been doing
> proportionate high network io as well. -  higher queries per second as well.
>
> But from the output shared by Benjamin that doesnt appear to be the case
> and things look balanced.
>
> Regards,
>
> On Sun, Feb 19, 2017 at 7:47 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>>
>>
>> On Sat, Feb 18, 2017 at 3:35 PM, Benjamin Roth <benjamin.r...@jaumo.com>
>> wrote:
>>
>>> We are talking about a read IO increase of over 2000% with 512 tokens
>>> compared to 256 tokens. 100% increase would be linear which would be
>>> perfect. 200% would even okay, taking the RAM/Load ratio for caching into
>>> account. But > 20x the read IO is really incredible.
>>> The nodes are configured with puppet, they share the same roles and no
>>> manual "optimizations" are applied. So I can't imagine, a different
>>> configuration is responsible for it.
>>>
>>> 2017-02-18 21:28 GMT+01:00 Benjamin Roth <benjamin.r...@jaumo.com>:
>>>
>>>> This is status of the largest KS of these both nodes:
>>>> UN  10.23.71.10  437.91 GiB  512  49.1%
>>>> 2679c3fa-347e-4845-bfc1-c4d0bc906576  RAC1
>>>> UN  10.23.71.9   246.99 GiB  256  28.3%
>>>> 2804ef8a-26c8-4d21-9e12-01e8b6644c2f  RAC1
>>>>
>>>> So roughly as expected.
>>>>
>>>> 2017-02-17 23:07 GMT+01:00 kurt greaves <k...@instaclustr.com>:
>>>>
>>>>> what's the Owns % for the relevant keyspace from nodetool status?
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Benjamin Roth
>>>> Prokurist
>>>>
>>>> Jaumo GmbH · www.jaumo.com
>>>> Wehrstraße 46 · 73035 Göppingen · Germany
>>>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>>>
>>>
>>>
>>>
>>> --
>>> Benjamin Roth
>>> Prokurist
>>>
>>> Jaumo GmbH · www.jaumo.com
>>> Wehrstraße 46 · 73035 Göppingen · Germany
>>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>>
>>
>> When I read articles like this:
>>
>> http://www.doanduyhai.com/blog/?p=1930
>>
>> And see the word hot-spot.
>>
>> "Another performance consideration worth mentioning is hot-spot. Similar
>> to manual denormalization, if your view partition key is chosen poorly,
>> you’ll end up with hot spots in your cluster. A simple example with our
>> *user* table is to create a materialized view user_by_gender"
>>
>> It leads me to ask a question back: What can you say about hotspots in
>> your data? Even if your nodes had the identical number of tokens, this
>> author seems to be suggesting that you could still have hotspots. Maybe the
>> issue is you have a hotspot, 2x hotspots, or your application has a
>> hotspot that would be present even with perfect token balancing.
>>
>>
>


-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: High disk io read load

2017-02-18 Thread Benjamin Roth

Just for the record, that's what dstat looks like while CS is starting:

root@cas10:~# dstat -lrnv 10
---load-avg--- --io/total- -net/total- ---procs--- ------memory-usage----- ---paging-- -dsk/total- ---system-- ----total-cpu-usage----
 1m   5m  15m | read  writ| recv  send|run blk new| used  buff  cach  free|  in   out | read  writ| int   csw |usr sys idl wai hiq siq
0.69 0.18 0.06| 228  24.3 |   0     0 |0.0   0  24|17.8G 3204k  458M  108G|   0     0 |5257k  417k|  17k 3319 |  2   1  97   0   0   0
0.96 0.26 0.09| 591  27.9 | 522k  476k|4.1   0  69|18.3G 3204k  906M  107G|   0     0 |  45M  287k|  22k 6943 |  7   1  92   0   0   0
13.2 2.83 0.92|2187  28.7 |1311k  839k|5.3  90  18|18.9G 3204k 9008M 98.1G|   0     0 | 791M 8346k|  49k   25k| 17   1  36  46   0   0
30.6 6.91 2.27|2188  67.0 |4200k 3610k|8.8 106  27|19.5G 3204k 17.9G 88.4G|   0     0 | 927M 8396k| 116k  119k| 24   2  17  57   0   0
43.6 10.5 3.49|2136  24.3 |4371k 3708k|6.3 108 1.0|19.5G 3204k 26.7G 79.6G|   0     0 | 893M   13M| 117k  159k| 15   1  17  66   0   0
56.9 14.4 4.84|2152  32.5 |3937k 3767k| 11  83 5.0|19.5G 3204k 35.5G 70.7G|   0     0 | 894M   14M| 126k  160k| 16   1  16  65   0   0
63.2 17.1 5.83|2135  44.1 |4601k 4185k|6.9  99  35|19.6G 3204k 44.3G 61.9G|   0     0 | 879M   15M| 133k  168k| 19   2  19  60   0   0
64.6 18.9 6.54|2174  42.2 |4393k 3522k|8.4  93 2.2|20.0G 3204k 52.7G 53.0G|   0     0 | 897M   14M| 138k  160k| 14   2  15  69   0   0

The IO shoots up (791M) as soon as CS has started up and accepts requests.
I also diffed the sysctl output of both machines. No significant
differences. Only CPU-related values, random values and some hashes differ.

2017-02-18 21:49 GMT+01:00 Benjamin Roth <benjamin.r...@jaumo.com>:

> 256 tokens:
>
> root@cas9:/sys/block/dm-0# blockdev --report
> RO    RA   SSZ   BSZ   StartSec            Size   Device
> rw   256   512  4096          0        67108864   /dev/ram0
> rw   256   512  4096          0        67108864   /dev/ram1
> rw   256   512  4096          0        67108864   /dev/ram2
> rw   256   512  4096          0        67108864   /dev/ram3
> rw   256   512  4096          0        67108864   /dev/ram4
> rw   256   512  4096          0        67108864   /dev/ram5
> rw   256   512  4096          0        67108864   /dev/ram6
> rw   256   512  4096          0        67108864   /dev/ram7
> rw   256   512  4096          0        67108864   /dev/ram8
> rw   256   512  4096          0        67108864   /dev/ram9
> rw   256   512  4096          0        67108864   /dev/ram10
> rw   256   512  4096          0        67108864   /dev/ram11
> rw   256   512  4096          0        67108864   /dev/ram12
> rw   256   512  4096          0        67108864   /dev/ram13
> rw   256   512  4096          0        67108864   /dev/ram14
> rw   256   512  4096          0        67108864   /dev/ram15
> rw    16   512  4096          0    800166076416   /dev/sda
> rw    16   512  4096       2048    800164151296   /dev/sda1
> rw    16   512  4096          0    644245094400   /dev/dm-0
> rw    16   512  4096          0      2046820352   /dev/dm-1
> rw    16   512  4096          0      1023410176   /dev/dm-2
> rw    16   512  4096          0    800166076416   /dev/sdb
>
> 512 tokens:
> root@cas10:/sys/block# blockdev --report
> RO    RA   SSZ   BSZ   StartSec            Size   Device
> rw   256   512  4096          0        67108864   /dev/ram0
> rw   256   512  4096          0        67108864   /dev/ram1
> rw   256   512  4096          0        67108864   /dev/ram2
> rw   256   512  4096          0        67108864   /dev/ram3
> rw   256   512  4096          0        67108864   /dev/ram4
> rw   256   512  4096          0        67108864   /dev/ram5
> rw   256   512  4096          0        67108864   /dev/ram6
> rw   256   512  4096          0        67108864   /dev/ram7
> rw   256   512  4096          0        67108864   /dev/ram8
> rw   256   512  4096          0        67108864   /dev/ram9
> rw   256   512  4096          0        67108864   /dev/ram10
> rw   256   512  4096          0        67108864   /dev/ram11
> rw   256   512  4096          0        67108864   /dev/ram12
> rw   256   512  4096          0        67108864   /dev/ram13
> rw   256   512  4096          0        67108864   /dev/ram14
> rw   256   512  4096          0        67108864   /dev/ram15
> rw    16   512  4096          0    800166076416   /dev/sda
> rw    16   512  4096       2048    800164151296   /dev/sda1
> rw    16   512  4096          0    800166076416   /dev/sdb
> rw    16   512  4096       2048    800165027840   /dev/sdb1
> rw    16   512  4096          0   1073741824000   /dev/dm-0
> rw    16   512  4096          0      2046820352   /dev/

Re: High disk io read load

2017-02-18 Thread Benjamin Roth

256 tokens:

root@cas9:/sys/block/dm-0# blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0        67108864   /dev/ram0
rw   256   512  4096          0        67108864   /dev/ram1
rw   256   512  4096          0        67108864   /dev/ram2
rw   256   512  4096          0        67108864   /dev/ram3
rw   256   512  4096          0        67108864   /dev/ram4
rw   256   512  4096          0        67108864   /dev/ram5
rw   256   512  4096          0        67108864   /dev/ram6
rw   256   512  4096          0        67108864   /dev/ram7
rw   256   512  4096          0        67108864   /dev/ram8
rw   256   512  4096          0        67108864   /dev/ram9
rw   256   512  4096          0        67108864   /dev/ram10
rw   256   512  4096          0        67108864   /dev/ram11
rw   256   512  4096          0        67108864   /dev/ram12
rw   256   512  4096          0        67108864   /dev/ram13
rw   256   512  4096          0        67108864   /dev/ram14
rw   256   512  4096          0        67108864   /dev/ram15
rw    16   512  4096          0    800166076416   /dev/sda
rw    16   512  4096       2048    800164151296   /dev/sda1
rw    16   512  4096          0    644245094400   /dev/dm-0
rw    16   512  4096          0      2046820352   /dev/dm-1
rw    16   512  4096          0      1023410176   /dev/dm-2
rw    16   512  4096          0    800166076416   /dev/sdb

512 tokens:
root@cas10:/sys/block# blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0        67108864   /dev/ram0
rw   256   512  4096          0        67108864   /dev/ram1
rw   256   512  4096          0        67108864   /dev/ram2
rw   256   512  4096          0        67108864   /dev/ram3
rw   256   512  4096          0        67108864   /dev/ram4
rw   256   512  4096          0        67108864   /dev/ram5
rw   256   512  4096          0        67108864   /dev/ram6
rw   256   512  4096          0        67108864   /dev/ram7
rw   256   512  4096          0        67108864   /dev/ram8
rw   256   512  4096          0        67108864   /dev/ram9
rw   256   512  4096          0        67108864   /dev/ram10
rw   256   512  4096          0        67108864   /dev/ram11
rw   256   512  4096          0        67108864   /dev/ram12
rw   256   512  4096          0        67108864   /dev/ram13
rw   256   512  4096          0        67108864   /dev/ram14
rw   256   512  4096          0        67108864   /dev/ram15
rw    16   512  4096          0    800166076416   /dev/sda
rw    16   512  4096       2048    800164151296   /dev/sda1
rw    16   512  4096          0    800166076416   /dev/sdb
rw    16   512  4096       2048    800165027840   /dev/sdb1
rw    16   512  4096          0   1073741824000   /dev/dm-0
rw    16   512  4096          0      2046820352   /dev/dm-1
rw    16   512  4096          0      1023410176   /dev/dm-2

2017-02-18 21:41 GMT+01:00 Bhuvan Rawal <bhu1ra...@gmail.com>:

> Hi Ben,
>
> If its same on both machines then something else could be the issue. We
> faced high disk io due to misconfigured read ahead which resulted in high
> amount of disk io for comparatively insignificant network transfer.
>
> Can you post output of blockdev --report for a normal node and 512 token
> node.
>
> Regards,
>
> On Sun, Feb 19, 2017 at 2:07 AM, Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>
>> cat /sys/block/sda/queue/read_ahead_kb
>> => 8
>>
>> On all CS nodes. Is that what you mean?
>>
>> 2017-02-18 21:32 GMT+01:00 Bhuvan Rawal <bhu1ra...@gmail.com>:
>>
>>> Hi Benjamin,
>>>
>>> What is the disk read ahead on both nodes?
>>>
>>> Regards,
>>> Bhuvan
>>>
>>> On Sun, Feb 19, 2017 at 1:58 AM, Benjamin Roth <benjamin.r...@jaumo.com>
>>> wrote:
>>>
>>>> This is status of the largest KS of these both nodes:
>>>> UN  10.23.71.10  437.91 GiB  512  49.1%
>>>> 2679c3fa-347e-4845-bfc1-c4d0bc906576  RAC1
>>>> UN  10.23.71.9   246.99 GiB  256  28.3%
>>>> 2804ef8a-26c8-4d21-9e12-01e8b6644c2f  RAC1
>>>>
>>>> So roughly as expected.
>>>>
>>>> 2017-02-17 23:07 GMT+01:00 kurt greaves <k...@instaclustr.com>:
>>>>
>>>>> what's the Owns % for the relevant keyspace from nodetool status?
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Benjamin Roth
>>>> Prokurist
>>>>
>>>> Jaumo GmbH · www.jaumo.com
>>>> Wehrstraße 46 · 73035 Göppingen · Germany
>>>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>>>> AG Ulm ·

Re: High disk io read load

2017-02-18 Thread Benjamin Roth

cat /sys/block/sda/queue/read_ahead_kb
=> 8

On all CS nodes. Is that what you mean?
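
To compare this setting across all block devices of a node in one go, a
small Python sketch like the following could be used. It assumes the usual
Linux sysfs layout (/sys/block/<device>/queue/read_ahead_kb); the function
name is made up for the example.

```python
from pathlib import Path


def read_ahead_settings(sys_block="/sys/block"):
    """Return {device: read_ahead_kb} for every block device that has the setting."""
    root = Path(sys_block)
    settings = {}
    if not root.is_dir():
        return settings  # not a Linux host, or sysfs is not mounted
    for dev in sorted(root.iterdir()):
        setting_file = dev / "queue" / "read_ahead_kb"
        if setting_file.is_file():
            settings[dev.name] = int(setting_file.read_text().strip())
    return settings


if __name__ == "__main__":
    for device, kb in read_ahead_settings().items():
        print(f"{device}: {kb} KB")
```

Note that blockdev --report shows read ahead (RA) in 512-byte sectors, so an
RA of 16 there is the same as read_ahead_kb = 8.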

2017-02-18 21:32 GMT+01:00 Bhuvan Rawal <bhu1ra...@gmail.com>:

> Hi Benjamin,
>
> What is the disk read ahead on both nodes?
>
> Regards,
> Bhuvan
>
> On Sun, Feb 19, 2017 at 1:58 AM, Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>
>> This is status of the largest KS of these both nodes:
>> UN  10.23.71.10  437.91 GiB  512  49.1%
>> 2679c3fa-347e-4845-bfc1-c4d0bc906576  RAC1
>> UN  10.23.71.9   246.99 GiB  256  28.3%
>> 2804ef8a-26c8-4d21-9e12-01e8b6644c2f  RAC1
>>
>> So roughly as expected.
>>
>> 2017-02-17 23:07 GMT+01:00 kurt greaves <k...@instaclustr.com>:
>>
>>> what's the Owns % for the relevant keyspace from nodetool status?
>>>
>>
>>
>>
>> --
>> Benjamin Roth
>> Prokurist
>>
>> Jaumo GmbH · www.jaumo.com
>> Wehrstraße 46 · 73035 Göppingen · Germany
>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>
>
>


-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: High disk io read load

2017-02-18 Thread Benjamin Roth

We are talking about a read IO increase of over 2000% with 512 tokens
compared to 256 tokens. A 100% increase would be linear, which would be
perfect. 200% would even be okay, taking the RAM/load ratio for caching into
account. But > 20x the read IO is really incredible.
The nodes are configured with Puppet; they share the same roles and no
manual "optimizations" are applied. So I can't imagine that a different
configuration is responsible for it.

2017-02-18 21:28 GMT+01:00 Benjamin Roth <benjamin.r...@jaumo.com>:

> This is status of the largest KS of these both nodes:
> UN  10.23.71.10  437.91 GiB  512  49.1%
> 2679c3fa-347e-4845-bfc1-c4d0bc906576  RAC1
> UN  10.23.71.9   246.99 GiB  256  28.3%
> 2804ef8a-26c8-4d21-9e12-01e8b6644c2f  RAC1
>
> So roughly as expected.
>
> 2017-02-17 23:07 GMT+01:00 kurt greaves <k...@instaclustr.com>:
>
>> what's the Owns % for the relevant keyspace from nodetool status?
>>
>
>
>
> --
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: High disk io read load

2017-02-18 Thread Benjamin Roth

This is the status of the largest KS on both of these nodes:
UN  10.23.71.10  437.91 GiB  512  49.1%  2679c3fa-347e-4845-bfc1-c4d0bc906576  RAC1
UN  10.23.71.9   246.99 GiB  256  28.3%  2804ef8a-26c8-4d21-9e12-01e8b6644c2f  RAC1

So roughly as expected.

2017-02-17 23:07 GMT+01:00 kurt greaves <k...@instaclustr.com>:

> what's the Owns % for the relevant keyspace from nodetool status?
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: sasi index question (read timeout on many selects)

2017-02-17 Thread Benjamin Roth

Btw:

> They break incremental repair if you use CDC:
> https://issues.apache.org/jira/browse/CASSANDRA-12888

Not only when using CDC! You shouldn't use incremental repairs with MVs.
Never (right now).

2017-02-16 17:42 GMT+01:00 Jonathan Haddad <j...@jonhaddad.com>:

> My advice to avoid them is based on the issues that have been filed in
> Jira.  Benjamin Roth is one of the only people talking about his MV usage,
> and has filed a few JIRAs discussing their problems when bootstrapping new
> nodes, as well as issues repairing.
>
> https://issues.apache.org/jira/browse/CASSANDRA-12730?jql=project%20%3D%20CASSANDRA%20and%20reporter%20%3D%20brstgt%20and%20text%20~%20%22materialized%22
>
> They also can't be altered:
> https://issues.apache.org/jira/browse/CASSANDRA-9736
>
> They may be less performant than managing the data yourself:
> https://issues.apache.org/jira/browse/CASSANDRA-10295,
> https://issues.apache.org/jira/browse/CASSANDRA-10307
>
> They're not as flexible as your own tables:
> https://issues.apache.org/jira/browse/CASSANDRA-9928,
> https://issues.apache.org/jira/browse/CASSANDRA-11194,
> https://issues.apache.org/jira/browse/CASSANDRA-12463
>
> They break incremental repair if you use CDC:
> https://issues.apache.org/jira/browse/CASSANDRA-12888
>
> I don't know why DataStax advises using them.  Perhaps ask them?
>
> Jon
>
> On Thu, Feb 16, 2017 at 7:57 AM Micha <mich...@fantasymail.de> wrote:
>
>>
>>
>> On 16.02.2017 16:33, Jonathan Haddad wrote:
>> >
>> > Regarding MVs, do not use the ones that shipped with 3.x.  They're not
>> > ready for production.  Manage it yourself by using a second table and
>> > inserting a second record there.
>> >
>>
>> Out of interest... there is a slight discrepance between the advice not
>> to use mv and the docu about the feature on the datastax side. Or do I
>> have to use another cassandra version (instead of 3.9)?
>>
>>


-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: High disk io read load

2017-02-17 Thread Benjamin Roth

Hi Nate,

See the dstat results here:
https://gist.github.com/brstgt/216c662b525a9c5b653bbcd8da5b3fcb
Network volume does not correspond to disk IO, not even close.

@heterogeneous vnode count:
I did this to test how load behaves on a new server class we ordered for
CS. The new nodes have much faster CPUs than our older nodes. If not by
assigning more tokens to new nodes, how else would you recommend giving
more weight + load to newer and usually faster servers?

2017-02-16 23:21 GMT+01:00 Nate McCall <n...@thelastpickle.com>:

>
> - Node A has 512 tokens and Node B 256. So it has double the load (data).
>> - Node A also has 2 SSDs, Node B only 1 SSD (according to load)
>>
>
> I very rarely see heterogeneous vnode counts in the same cluster. I would
> almost guarantee you are the only one doing this with MVs as well.
>
> That said, since you have different IO hardware, are you sure the system
> configurations (eg. block size, read ahead, etc) are the same on both
> machines? Is dstat showing a similar order of magnitude of network traffic
> in vs. IO for what you would expect?
>
>
> --
> -
> Nate McCall
> Wellington, NZ
> @zznate
>
> CTO
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>




Re: sasi index question (read timeout on many selects)

2017-02-16 Thread Benjamin Roth

No matter what has to be indexed here, the preferable way is most probably
denormalization instead of another index.
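
As a rough illustration of what denormalization means here, the sketch below builds the two writes for one logical record: one into a base table and one into a second table keyed by the column you would otherwise index. The table names (ks.users_by_id, ks.users_by_email) and columns are hypothetical and do not come from this thread. The function only returns CQL strings with bind parameters; executing them through a driver is left out.

```python
def denormalized_inserts(user_id, email, name):
    """Return (cql, params) pairs that write one logical record to two tables.

    Both table names are hypothetical examples, not taken from the thread.
    """
    base = (
        "INSERT INTO ks.users_by_id (user_id, email, name) VALUES (?, ?, ?)",
        (user_id, email, name),
    )
    # Second table: same data, but partitioned by the lookup column.
    lookup = (
        "INSERT INTO ks.users_by_email (email, user_id, name) VALUES (?, ?, ?)",
        (email, user_id, name),
    )
    return [base, lookup]


statements = denormalized_inserts(1, "ann@example.com", "Ann")
for cql, params in statements:
    print(cql, params)
```

The application then has to keep both tables in sync on every write and delete, which is the price paid for avoiding the index.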

2017-02-16 15:09 GMT+01:00 DuyHai Doan <doanduy...@gmail.com>:

> [image: Inline image 1]
>
> On Thu, Feb 16, 2017 at 3:08 PM, Micha <mich...@fantasymail.de> wrote:
>
>>
>>
>> On 16.02.2017 14:30, DuyHai Doan wrote:
>> > Why indexing BLOB data ? It does not make any sense
>>
>> My partition key is a secure hash sum,  I don't index a blob.
>>
>>
>>
>>
>>
>



Re: High disk io read load

2017-02-15 Thread Benjamin Roth

Erm sorry, forgot to mention. In this case "cas10" is Node A with 512
tokens and "cas9" Node B with 256 tokens.

2017-02-16 6:38 GMT+01:00 Benjamin Roth <benjamin.r...@jaumo.com>:

> It doesn't really look like that:
> https://cl.ly/2c3Z1u2k0u2I
>
> Thats the ReadLatency.count metric aggregated by host which represents the
> actual read operations, correct?
>
> 2017-02-15 23:01 GMT+01:00 Edward Capriolo <edlinuxg...@gmail.com>:
>
>> I think it has more than double the load. It is double the data. More
>> read repair chances. More load can swing it's way during node failures etc.
>>
>> On Wednesday, February 15, 2017, Benjamin Roth <benjamin.r...@jaumo.com>
>> wrote:
>>
>>> Hi there,
>>>
>>> Following situation in cluster with 10 nodes:
>>> Node A's disk read IO is ~20 times higher than the read load of node B.
>>> The nodes are exactly the same except:
>>> - Node A has 512 tokens and Node B 256. So it has double the load (data).
>>> - Node A also has 2 SSDs, Node B only 1 SSD (according to load)
>>>
>>> Node A has roughly 460GB, Node B 260GB total disk usage.
>>> Both nodes have 128GB RAM and 40 cores.
>>>
>>> Of course I assumed that Node A does more reads because cache / load
>>> ratio is worse but a factor of 20 makes me very sceptic.
>>>
>>> Of course Node A has a much higher and less predictable latency due to
>>> the wait states.
>>>
>>> Has anybody experienced similar situations?
>>> Any hints how to analyze or optimize this - I mean 128GB cache for 460GB
>>> payload is not that few. I am pretty sure that not the whole dataset of
>>> 460GB is "hot".
>>>
>>>
>>
>>
>> --
>> Sorry this was sent from mobile. Will do less grammar and spell check
>> than usual.
>>
>
>
>
>




Re: High disk io read load

2017-02-15 Thread Benjamin Roth

It doesn't really look like that:
https://cl.ly/2c3Z1u2k0u2I

That's the ReadLatency.count metric aggregated by host, which represents the
actual read operations, correct?

2017-02-15 23:01 GMT+01:00 Edward Capriolo <edlinuxg...@gmail.com>:

> I think it has more than double the load. It is double the data. More read
> repair chances. More load can swing it's way during node failures etc.
>
> On Wednesday, February 15, 2017, Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>
>> Hi there,
>>
>> Following situation in cluster with 10 nodes:
>> Node A's disk read IO is ~20 times higher than the read load of node B.
>> The nodes are exactly the same except:
>> - Node A has 512 tokens and Node B 256. So it has double the load (data).
>> - Node A also has 2 SSDs, Node B only 1 SSD (according to load)
>>
>> Node A has roughly 460GB, Node B 260GB total disk usage.
>> Both nodes have 128GB RAM and 40 cores.
>>
>> Of course I assumed that Node A does more reads because cache / load
>> ratio is worse but a factor of 20 makes me very sceptic.
>>
>> Of course Node A has a much higher and less predictable latency due to
>> the wait states.
>>
>> Has anybody experienced similar situations?
>> Any hints how to analyze or optimize this - I mean 128GB cache for 460GB
>> payload is not that few. I am pretty sure that not the whole dataset of
>> 460GB is "hot".
>>
>>
>
>
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.
>




High disk io read load

2017-02-15 Thread Benjamin Roth

Hi there,

We have the following situation in a cluster with 10 nodes:
Node A's disk read IO is ~20 times higher than the read load of node B.
The nodes are exactly the same except:
- Node A has 512 tokens and Node B 256. So it has double the load (data).
- Node A also has 2 SSDs, Node B only 1 SSD (according to load)

Node A has roughly 460GB, Node B 260GB total disk usage.
Both nodes have 128GB RAM and 40 cores.

Of course I assumed that Node A does more reads because its cache / load
ratio is worse, but a factor of 20 makes me very sceptical.

Of course Node A has a much higher and less predictable latency due to the
wait states.

Has anybody experienced similar situations?
Any hints on how to analyze or optimize this? I mean, a 128GB cache for a
460GB payload is not that small. I am pretty sure that not the whole dataset
of 460GB is "hot".
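
To put numbers on the cache / load ratio argument, here is a small back-of-the-envelope calculation using the figures above. It treats all 128GB of RAM as page cache, which is an optimistic simplification (heap and other processes take their share), so the ratios are upper bounds rather than measurements.

```python
def cache_to_data_ratio(ram_gb, data_gb):
    """Fraction of the on-disk data that could fit in RAM at best."""
    return ram_gb / data_gb


node_a = cache_to_data_ratio(128, 460)  # Node A: 512 tokens, ~460GB on disk
node_b = cache_to_data_ratio(128, 260)  # Node B: 256 tokens, ~260GB on disk

print(f"Node A: {node_a:.2f}, Node B: {node_b:.2f}, B/A: {node_b / node_a:.2f}")
# prints "Node A: 0.28, Node B: 0.49, B/A: 1.77"
```

The ratio differs by a factor of less than 2 between the nodes. Cache miss rates do not scale linearly with this ratio, so this does not rule out a large difference in disk reads, but it shows why a factor of 20 looks surprising.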


Re: How does cassandra achieve Linearizability?

2017-02-10 Thread Benjamin Roth

ith a GPS modules
>>>> is not terribly complex. Low latency and jitter on servers you manage.
>>>> 140ms is a long way away network-wise, and I would suggest that was a
>>>> poor choice of upstream (probably stratum 2 or 3) source.
>>>>
>>>> As Jonathan mentioned, there's no guarantee from Cassandra, but if you
>>>> need as close as you can get, you'll probably need to do it yourself.
>>>>
>>>> (I run several stratum 2 ntpd servers for pool.ntp.org)
>>>>
>>>> --
>>>> Kind regards,
>>>> Michael
>>>>
>>>> On 02/09/2017 06:47 PM, Kant Kodali wrote:
>>>> > Hi Justin,
>>>> >
>>>> > There are bunch of issues w.r.t to synchronization of clocks when we
>>>> > used ntpd. Also the time it took to sync the clocks was approx 140ms
>>>> > (don't quote me on it though because it is reported by our devops :)
>>>> >
>>>> > We have multiple clients (for example, a bunch of microservices are
>>>> > reading from Cassandra). I am not sure how one can achieve
>>>> > linearizability by setting timestamps on the clients, since there is
>>>> > no total ordering across multiple clients.
>>>> >
>>>> > Thanks!
>>>> >
>>>> >
>>>> > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron <
>>>> jus...@instaclustr.com
>>>> > <mailto:jus...@instaclustr.com>> wrote:
>>>> >
>>>> > Hi Kant,
>>>> >
>>>> > Clock synchronization is important - you should ensure that ntpd is
>>>> > properly configured on all nodes. If your particular use case is
>>>> > especially sensitive to out-of-order mutations it is possible to set
>>>> > timestamps on the client side using the drivers.
>>>> > https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
>>>> >
>>>> > We use our own NTP cluster to reduce clock drift as much as
>>>> > possible, but public NTP servers are good enough for most uses.
>>>> > https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
>>>> >
>>>> > Cheers,
>>>> > Justin
>>>> >
>>>> > On Thu, 9 Feb 2017 at 16:09 Kant Kodali <k...@peernova.com
>>>> > <mailto:k...@peernova.com>> wrote:
>>>> >
>>>> > How does Cassandra achieve linearizability with “last write
>>>> > wins” (conflict resolution based on time-of-day clocks)?
>>>> >
>>>> > Relying on synchronized clocks is almost certainly
>>>> > non-linearizable, because clock timestamps cannot be guaranteed
>>>> > to be consistent with actual event ordering due to clock skew,
>>>> > isn't it?
>>>> >
>>>> > Thanks!
>>>> >
>>>> > --
>>>> >
>>>> > Justin Cameron
>>>> >
>>>> > Senior Software Engineer | Instaclustr
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>>
>>>>
>>>
>>>
>>
>
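
The clock-skew concern in the quoted question can be illustrated with a toy simulation of last-write-wins. This is not Cassandra code: the reconciliation is reduced to "highest timestamp wins", tie-breaking is ignored, and the 140 ms offset is borrowed from the thread purely as an illustrative number.

```python
def client_timestamp(real_time_ms, skew_ms):
    """Timestamp a client attaches to a write, given its clock offset."""
    return real_time_ms + skew_ms


def last_write_wins(writes):
    """Return the value carrying the highest timestamp (simplified LWW)."""
    return max(writes, key=lambda w: w["ts_ms"])["value"]


writes = [
    # Client A writes at real time 1000 ms with an accurate clock.
    {"value": "first", "ts_ms": client_timestamp(1000, 0)},
    # Client B writes 50 ms later, but its clock runs 140 ms behind.
    {"value": "second", "ts_ms": client_timestamp(1050, -140)},
]

winner = last_write_wins(writes)
print(winner)  # prints "first"
```

The write that happened later in real time loses because its timestamp is lower, which is exactly why timestamp-based conflict resolution alone cannot give linearizability.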



Re: cassandra user request log

2017-02-10 Thread Benjamin Roth

If you want to audit write operations only, you could maybe use CDC (change
data capture). This is a fairly new feature in 3.x (I think it was introduced
in 3.8).

2017-02-10 10:10 GMT+01:00 vincent gromakowski <
vincent.gromakow...@gmail.com>:

> tx
>
> 2017-02-10 10:01 GMT+01:00 Benjamin Roth <benjamin.r...@jaumo.com>:
>
>> you could write a custom trigger that logs access to specific CFs. But be
>> aware that this may have a big performance impact.
>>
>> 2017-02-10 9:58 GMT+01:00 vincent gromakowski <
>> vincent.gromakow...@gmail.com>:
>>
>>> GDPR compliancy...we need to trace user activity on personal data. Maybe
>>> there is another way ?
>>>
>>> 2017-02-10 9:46 GMT+01:00 Benjamin Roth <benjamin.r...@jaumo.com>:
>>>
>>>> On a cluster with just a little bit load, that would cause zillions of
>>>> petabytes of logs (just roughly ;)). I don't think this is viable.
>>>> There are many many JMX metrics on an aggregated level. But none per
>>>> authed used.
>>>> What exactly do you want to find out? Is it for debugging purposes?
>>>>
>>>>
>>>> 2017-02-10 9:42 GMT+01:00 vincent gromakowski <
>>>> vincent.gromakow...@gmail.com>:
>>>>
>>>>> Hi all,
>>>>> Is there any way to trace user activity at the server level to see
>>>>> which user is accessing which data ? Do you thin it would be simple to
>>>>> implement ?
>>>>> Tx
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>



Re: cassandra user request log

2017-02-10 Thread Benjamin Roth

You could write a custom trigger that logs access to specific CFs. But be
aware that this may have a big performance impact.

2017-02-10 9:58 GMT+01:00 vincent gromakowski <vincent.gromakow...@gmail.com>:

> GDPR compliancy...we need to trace user activity on personal data. Maybe
> there is another way ?
>
> 2017-02-10 9:46 GMT+01:00 Benjamin Roth <benjamin.r...@jaumo.com>:
>
>> On a cluster with just a little bit load, that would cause zillions of
>> petabytes of logs (just roughly ;)). I don't think this is viable.
>> There are many many JMX metrics on an aggregated level. But none per
>> authed used.
>> What exactly do you want to find out? Is it for debugging purposes?
>>
>>
>> 2017-02-10 9:42 GMT+01:00 vincent gromakowski <
>> vincent.gromakow...@gmail.com>:
>>
>>> Hi all,
>>> Is there any way to trace user activity at the server level to see which
>>> user is accessing which data ? Do you thin it would be simple to implement ?
>>> Tx
>>>
>>
>>
>>
>>
>
>



Re: cassandra user request log

2017-02-10 Thread Benjamin Roth

On a cluster with even a little bit of load, that would cause zillions of
petabytes of logs (just roughly ;)). I don't think this is viable.
There are many, many JMX metrics on an aggregated level, but none per
authenticated user.
What exactly do you want to find out? Is it for debugging purposes?


2017-02-10 9:42 GMT+01:00 vincent gromakowski <vincent.gromakow...@gmail.com>:

> Hi all,
> Is there any way to trace user activity at the server level to see which
> user is accessing which data ? Do you thin it would be simple to implement ?
> Tx
>




Re: DELETE/SELECT with multi-column PK and IN

2017-02-09 Thread Benjamin Roth

Ok now I REALLY got it :)
Thanks Sylvain!

2017-02-09 11:42 GMT+01:00 Sylvain Lebresne <sylv...@datastax.com>:

> On Thu, Feb 9, 2017 at 10:52 AM, Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>
>> Ok got it.
>>
>> But it's interesting that this is supported:
>> DELETE/SELECT FROM ks.cf WHERE (pk1) IN ((1), (2), (3));
>>
>> This is technically mostly the same (Token awareness,
>> coordination/routing, read performance, ...), right?
>>
>
> It is. That's what I meant by "there is something to be said for the
> consistency of the CQL language in general". In other words, look for no
> externally logical reason for this being unsupported, it's unsupported
> simply due to how the CQL code evolved. But as I said, we didn't fix that
> inconsistency because we're all busy and it's not really that important in
> practice. The project of course welcome any contributions though :)
>
>
>>
>> 2017-02-09 10:43 GMT+01:00 Sylvain Lebresne <sylv...@datastax.com>:
>>
>>> This is a statement on multiple partitions and there is really no
>>> optimization the code internally does on that. In fact, I strongly advise
>>> you to not use a batch but rather simply do a for loop client side and send
>>> statement individually. That way, your driver will be able to use proper
>>> token-awareness for each request (while if you send a batch, one
>>> coordinator will be picked up and will have to forward most statement,
>>> doing more network hops at the end of the day). The only case where using a
>>> batch is indeed legit is if you care about all the statement being atomic,
>>> but in that case it's a logged batch you want.
>>>
>>> That's btw more or less why we never bothered implementing that: it's
>>> totally doable technically, but it's not really such a good idea
>>> performance wise in practice most of the time, and you can easily work it
>>> around with a batch if you need atomicity.
>>>
>>> Which is not saying it will never be and shouldn't be supported btw,
>>> there is something to be said for the consistency of the CQL language in
>>> general. But it's why no-one took time to do it so far.
>>>
>>> On Thu, Feb 9, 2017 at 10:36 AM, Benjamin Roth <benjamin.r...@jaumo.com>
>>> wrote:
>>>
>>>> Yes, thats the workaround - I'll try that.
>>>>
>>>> Would you agree it would be better for internal optimizations to
>>>> process this within a single statement?
>>>>
>>>> 2017-02-09 10:32 GMT+01:00 Ben Slater <ben.sla...@instaclustr.com>:
>>>>
>>>>> Yep, that makes it clear. I think an unlogged batch of prepared
>>>>> statements with one statement per PK tuple would be roughly equivalent? 
>>>>> And
>>>>> probably no more complex to generate in the client?
>>>>>
>>>>> On Thu, 9 Feb 2017 at 20:22 Benjamin Roth <benjamin.r...@jaumo.com>
>>>>> wrote:
>>>>>
>>>>>> Maybe that makes it clear:
>>>>>>
>>>>>> DELETE FROM ks.cf WHERE (partitionkey1, partitionkey2) IN ((1, 2),
>>>>>> (1, 3), (2, 3), (3, 4));
>>>>>>
>>>>>> If want to delete or select a bunch of records identified by their
>>>>>> multi-partitionkey tuples.
>>>>>>
>>>>>> 2017-02-09 10:18 GMT+01:00 Ben Slater <ben.sla...@instaclustr.com>:
>>>>>>
>>>>>> Are you looking this to be equivalent to (PK1=1 AND PK2=2) or are you
>>>>>> looking for (PK1 IN (1,2) AND PK2 IN (1,2)) or something else?
>>>>>>
>>>>>> Cheers
>>>>>> Ben
>>>>>>
>>>>>> On Thu, 9 Feb 2017 at 20:09 Benjamin Roth <benjamin.r...@jaumo.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Guys,
>>>>>>
>>>>>> CQL says this is not allowed:
>>>>>>
>>>>>> DELETE FROM ks.cf WHERE (pk1, pk2) IN ((1, 2));
>>>>>>
>>>>>> 1. Is there a reason for it? There shouldn't be a performance
>>>>>> penalty, it is a PK lookup, the same thing works with a single pk column
>>>>>> 2. Is there a known workaround for it?
>>>>>>
>>>>>> It would be much of a help to have it for daily business, IMHO it's a
>>>>>> waste of resources to run multiple queries just to fe

Re: DELETE/SELECT with multi-column PK and IN

2017-02-09 Thread Benjamin Roth

This doesn't really belong to this topic, but I have also experienced what
Ben describes.
I was migrating (and still am) tons of data from MySQL to CS. I measured
several approaches (async parallel, prepared statements, sync with unlogged
batches) and it turned out that batches were really fast and produced fewer
problems with cluster overloading with MVs.
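
For the batched variant, the client-side part is mostly a matter of cutting the rows into small groups. The helper below is a minimal sketch of that step only. The batch size of 2 in the example is arbitrary, and actually sending each group as an unlogged batch through a driver is left out.

```python
def chunked(items, batch_size):
    """Split a list into consecutive groups of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
batches = chunked(rows, 2)
print(batches)
# prints "[[(1, 'a'), (2, 'b')], [(3, 'c'), (4, 'd')], [(5, 'e')]]"
```

Which batch size works well depends on the cluster and the data model, so it is something to measure rather than assume.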

2017-02-09 11:28 GMT+01:00 Ben Slater <ben.sla...@instaclustr.com>:

> That’s a very good point from Sylvain that I forgot/missed. That said,
> we’ve seen plenty of scenarios where overall system throughput is improved
> through unlogged batches. One of my colleagues did quite a bit of
> benchmarking on this topic for his talk at last year’s C* summit:
> http://www.slideshare.net/DataStax/microbatching-highperformance-writes-adam-zegelin-instaclustr-cassandra-summit-2016
>
> On Thu, 9 Feb 2017 at 20:52 Benjamin Roth <benjamin.r...@jaumo.com> wrote:
>
>> Ok got it.
>>
>> But it's interesting that this is supported:
>> DELETE/SELECT FROM ks.cf WHERE (pk1) IN ((1), (2), (3));
>>
>> This is technically mostly the same (Token awareness,
>> coordination/routing, read performance, ...), right?
>>
>> 2017-02-09 10:43 GMT+01:00 Sylvain Lebresne <sylv...@datastax.com>:
>>
>> This is a statement on multiple partitions and there is really no
>> optimization the code internally does on that. In fact, I strongly advise
>> you to not use a batch but rather simply do a for loop client side and send
>> statement individually. That way, your driver will be able to use proper
>> token-awareness for each request (while if you send a batch, one
>> coordinator will be picked up and will have to forward most statement,
>> doing more network hops at the end of the day). The only case where using a
>> batch is indeed legit is if you care about all the statement being atomic,
>> but in that case it's a logged batch you want.
>>
>> That's btw more or less why we never bothered implementing that: it's
>> totally doable technically, but it's not really such a good idea
>> performance wise in practice most of the time, and you can easily work it
>> around with a batch if you need atomicity.
>>
>> Which is not saying it will never be and shouldn't be supported btw,
>> there is something to be said for the consistency of the CQL language in
>> general. But it's why no-one took time to do it so far.
>>
>> On Thu, Feb 9, 2017 at 10:36 AM, Benjamin Roth <benjamin.r...@jaumo.com>
>> wrote:
>>
>> Yes, thats the workaround - I'll try that.
>>
>> Would you agree it would be better for internal optimizations to process
>> this within a single statement?
>>
>> 2017-02-09 10:32 GMT+01:00 Ben Slater <ben.sla...@instaclustr.com>:
>>
>> Yep, that makes it clear. I think an unlogged batch of prepared
>> statements with one statement per PK tuple would be roughly equivalent? And
>> probably no more complex to generate in the client?
>>
>> On Thu, 9 Feb 2017 at 20:22 Benjamin Roth <benjamin.r...@jaumo.com>
>> wrote:
>>
>> Maybe that makes it clear:
>>
>> DELETE FROM ks.cf WHERE (partitionkey1, partitionkey2) IN ((1, 2), (1,
>> 3), (2, 3), (3, 4));
>>
>> If want to delete or select a bunch of records identified by their
>> multi-partitionkey tuples.
>>
>> 2017-02-09 10:18 GMT+01:00 Ben Slater <ben.sla...@instaclustr.com>:
>>
>> Are you looking this to be equivalent to (PK1=1 AND PK2=2) or are you
>> looking for (PK1 IN (1,2) AND PK2 IN (1,2)) or something else?
>>
>> Cheers
>> Ben
>>
>> On Thu, 9 Feb 2017 at 20:09 Benjamin Roth <benjamin.r...@jaumo.com>
>> wrote:
>>
>> Hi Guys,
>>
>> CQL says this is not allowed:
>>
>> DELETE FROM ks.cf WHERE (pk1, pk2) IN ((1, 2));
>>
>> 1. Is there a reason for it? There shouldn't be a performance penalty, it
>> is a PK lookup, the same thing works with a single pk column
>> 2. Is there a known workaround for it?
>>
>> It would be much of a help to have it for daily business, IMHO it's a
>> waste of resources to run multiple queries just to fetch a bunch of records
>> by a PK.
>>
>> Thanks in advance for any reply
>>
>>
>> --
>> ————
>> Ben Slater
>> Chief Pro

Re: DELETE/SELECT with multi-column PK and IN

2017-02-09 Thread Benjamin Roth

Ok got it.

But it's interesting that this is supported:
DELETE/SELECT FROM ks.cf WHERE (pk1) IN ((1), (2), (3));

This is technically mostly the same (Token awareness, coordination/routing,
read performance, ...), right?

2017-02-09 10:43 GMT+01:00 Sylvain Lebresne <sylv...@datastax.com>:

> This is a statement on multiple partitions and there is really no
> optimization the code internally does on that. In fact, I strongly advise
> you to not use a batch but rather simply do a for loop client side and send
> statement individually. That way, your driver will be able to use proper
> token-awareness for each request (while if you send a batch, one
> coordinator will be picked up and will have to forward most statement,
> doing more network hops at the end of the day). The only case where using a
> batch is indeed legit is if you care about all the statement being atomic,
> but in that case it's a logged batch you want.
>
> That's btw more or less why we never bothered implementing that: it's
> totally doable technically, but it's not really such a good idea
> performance wise in practice most of the time, and you can easily work it
> around with a batch if you need atomicity.
>
> Which is not saying it will never be and shouldn't be supported btw, there
> is something to be said for the consistency of the CQL language in general.
> But it's why no-one took time to do it so far.
>
> On Thu, Feb 9, 2017 at 10:36 AM, Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>
>> Yes, thats the workaround - I'll try that.
>>
>> Would you agree it would be better for internal optimizations to process
>> this within a single statement?
>>
>> 2017-02-09 10:32 GMT+01:00 Ben Slater <ben.sla...@instaclustr.com>:
>>
>>> Yep, that makes it clear. I think an unlogged batch of prepared
>>> statements with one statement per PK tuple would be roughly equivalent? And
>>> probably no more complex to generate in the client?
>>>
>>> On Thu, 9 Feb 2017 at 20:22 Benjamin Roth <benjamin.r...@jaumo.com>
>>> wrote:
>>>
>>>> Maybe that makes it clear:
>>>>
>>>> DELETE FROM ks.cf WHERE (partitionkey1, partitionkey2) IN ((1, 2), (1,
>>>> 3), (2, 3), (3, 4));
>>>>
>>>> If want to delete or select a bunch of records identified by their
>>>> multi-partitionkey tuples.
>>>>
>>>> 2017-02-09 10:18 GMT+01:00 Ben Slater <ben.sla...@instaclustr.com>:
>>>>
>>>> Are you looking this to be equivalent to (PK1=1 AND PK2=2) or are you
>>>> looking for (PK1 IN (1,2) AND PK2 IN (1,2)) or something else?
>>>>
>>>> Cheers
>>>> Ben
>>>>
>>>> On Thu, 9 Feb 2017 at 20:09 Benjamin Roth <benjamin.r...@jaumo.com>
>>>> wrote:
>>>>
>>>> Hi Guys,
>>>>
>>>> CQL says this is not allowed:
>>>>
>>>> DELETE FROM ks.cf WHERE (pk1, pk2) IN ((1, 2));
>>>>
>>>> 1. Is there a reason for it? There shouldn't be a performance penalty,
>>>> it is a PK lookup, the same thing works with a single pk column
>>>> 2. Is there a known workaround for it?
>>>>
>>>> It would be much of a help to have it for daily business, IMHO it's a
>>>> waste of resources to run multiple queries just to fetch a bunch of records
>>>> by a PK.
>>>>
>>>> Thanks in advance for any reply
>>>>
>>>>
>>>> --
>>>> 
>>>> Ben Slater
>>>> Chief Product Officer
>>>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>>>> +61 437 929 798 <+61%20437%20929%20798>
>>>>
>>>>
>>>>
>>>>
>>>>
>>> --
>>> 
>>> Ben Slater
>>> Chief Product Officer
>>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>>> +61 437 929 798 <+61%20437%20929%20798>
>>>
>>
>>
>>
>>
>
>



Re: DELETE/SELECT with multi-column PK and IN

2017-02-09 Thread Benjamin Roth

Yes, that's the workaround - I'll try that.

Would you agree it would be better for internal optimizations to process
this within a single statement?
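
The workaround (one statement per partition-key tuple) can be sketched on the client side as below. The CQL text reuses the ks.cf, pk1 and pk2 names from the example in this thread. The helper name is made up for illustration, and no driver is involved: the function only pairs the statement text with its bind values.

```python
DELETE_CQL = "DELETE FROM ks.cf WHERE pk1 = ? AND pk2 = ?"


def per_tuple_statements(pk_tuples):
    """Expand a list of (pk1, pk2) tuples into one bound statement each."""
    return [(DELETE_CQL, tuple(pk)) for pk in pk_tuples]


statements = per_tuple_statements([(1, 2), (1, 3), (2, 3), (3, 4)])
for cql, params in statements:
    print(cql, params)
```

With a real driver one would typically prepare DELETE_CQL once and then either execute each bound statement individually or add them to an unlogged batch. Both options are discussed in this thread.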

2017-02-09 10:32 GMT+01:00 Ben Slater <ben.sla...@instaclustr.com>:

> Yep, that makes it clear. I think an unlogged batch of prepared statements
> with one statement per PK tuple would be roughly equivalent? And probably
> no more complex to generate in the client?
>
> On Thu, 9 Feb 2017 at 20:22 Benjamin Roth <benjamin.r...@jaumo.com> wrote:
>
>> Maybe that makes it clear:
>>
>> DELETE FROM ks.cf WHERE (partitionkey1, partitionkey2) IN ((1, 2), (1,
>> 3), (2, 3), (3, 4));
>>
>> If want to delete or select a bunch of records identified by their
>> multi-partitionkey tuples.
>>
>> 2017-02-09 10:18 GMT+01:00 Ben Slater <ben.sla...@instaclustr.com>:
>>
>> Are you looking this to be equivalent to (PK1=1 AND PK2=2) or are you
>> looking for (PK1 IN (1,2) AND PK2 IN (1,2)) or something else?
>>
>> Cheers
>> Ben
>>
>> On Thu, 9 Feb 2017 at 20:09 Benjamin Roth <benjamin.r...@jaumo.com>
>> wrote:
>>
>> Hi Guys,
>>
>> CQL says this is not allowed:
>>
>> DELETE FROM ks.cf WHERE (pk1, pk2) IN ((1, 2));
>>
>> 1. Is there a reason for it? There shouldn't be a performance penalty, it
>> is a PK lookup, the same thing works with a single pk column
>> 2. Is there a known workaround for it?
>>
>> It would be much of a help to have it for daily business, IMHO it's a
>> waste of resources to run multiple queries just to fetch a bunch of records
>> by a PK.
>>
>> Thanks in advance for any reply
>>
>>
>> --
>> 
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798 <+61%20437%20929%20798>
>>
>>
>>
>>
>>
> --
> 
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798 <+61%20437%20929%20798>
>




Re: DELETE/SELECT with multi-column PK and IN

2017-02-09 Thread Benjamin Roth

Maybe that makes it clear:

DELETE FROM ks.cf WHERE (partitionkey1, partitionkey2) IN ((1, 2), (1, 3),
(2, 3), (3, 4));

I want to delete or select a bunch of records identified by their
multi-column partition key tuples.

2017-02-09 10:18 GMT+01:00 Ben Slater <ben.sla...@instaclustr.com>:

> Are you looking this to be equivalent to (PK1=1 AND PK2=2) or are you
> looking for (PK1 IN (1,2) AND PK2 IN (1,2)) or something else?
>
> Cheers
> Ben
>
> On Thu, 9 Feb 2017 at 20:09 Benjamin Roth <benjamin.r...@jaumo.com> wrote:
>
>> Hi Guys,
>>
>> CQL says this is not allowed:
>>
>> DELETE FROM ks.cf WHERE (pk1, pk2) IN ((1, 2));
>>
>> 1. Is there a reason for it? There shouldn't be a performance penalty: it
>> is a PK lookup, and the same thing works with a single PK column.
>> 2. Is there a known workaround for it?
>>
>> It would be a big help to have it for daily business. IMHO it's a
>> waste of resources to run multiple queries just to fetch a bunch of records
>> by a PK.
>>
>> Thanks in advance for any reply
>>
>> --
>> Benjamin Roth
>> Prokurist
>>
>> Jaumo GmbH · www.jaumo.com
>> Wehrstraße 46 · 73035 Göppingen · Germany
>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>
> --
> 
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

DELETE/SELECT with multi-column PK and IN

2017-02-09 Thread Benjamin Roth

Hi Guys,

CQL says this is not allowed:

DELETE FROM ks.cf WHERE (pk1, pk2) IN ((1, 2));

1. Is there a reason for it? There shouldn't be a performance penalty: it
is a PK lookup, and the same thing works with a single PK column.
2. Is there a known workaround for it?

It would be a big help to have it for daily business. IMHO it's a waste
of resources to run multiple queries just to fetch a bunch of records by a
PK.

Thanks in advance for any reply

-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: Why does CockroachDB github website say Cassandra has no Availability on datacenter failure?

2017-02-07 Thread Benjamin Roth

Ask for forgiveness not for permission if you do marketing ;)

On 07.02.2017 13:11, "Kant Kodali" wrote:

> lol. But seriously are they even allowed to say something that is not true
> about another product ?
>
> On Tue, Feb 7, 2017 at 4:05 AM, kurt greaves  wrote:
>
>> Marketing never lies. Ever
>>
>
>

Re: CS process killed by kernel OOM

2017-02-06 Thread Benjamin Roth

Alright. Thanks a lot for that information!

2017-02-06 14:35 GMT+01:00 Avi Kivity <a...@scylladb.com>:

> It is a bug.  In some contexts, the kernel needs to be able to reclaim
> memory instantly, but this is not one of them.  Here, the java process is
> creating a new thread, and the kernel is allocating 16kB for its kernel
> stack; that is a regular allocation, not atomic. If you decode the gfp_mask
> value you'll see that the kernel is allowed to initiate I/O and perform
> filesystem operations to satisfy the allocation, which it apparently did
> not.
>
>
> I do recommend reporting it, it will help others avoid encountering the
> same problem if it gets fixed.
>
> On 02/06/2017 03:07 PM, Benjamin Roth wrote:
>
> Thanks for the reply. We got rid of the OOMs by increasing
> vm.min_free_kbytes; its default of approx. 90 MB is maybe a bit low for
> systems with 128 GB.
> I guess the OOM happens because the kernel could not reclaim enough paged
> memory instantly.
> I can't tell if this is really a kernel bug or not. That was also my first
> thought, but in the end the main thing is that it works again, and it does
> with a higher min_free_kbytes.
>
> 2017-02-06 11:53 GMT+01:00 Avi Kivity <a...@scylladb.com>:
>
>>
>> On 01/26/2017 07:36 AM, Benjamin Roth wrote:
>>
>> Hi there,
>>
>> We installed 2 new nodes these days. They run on ubuntu (Ubuntu 16.04.1
>> LTS) with kernel 4.4.0-59-generic. On these nodes (and only on these) CS
>> gets killed by the kernel due to OOM. It seems very strange to me because,
>> CS only takes roughly 20GB (out of 128GB), most of RAM is allocated to page
>> cache.
>>
>> Top looks typically like this:
>> KiB Mem : 13191691+total,  1974964 free, 20278184 used,
>> 10966376+buff/cache
>> KiB Swap:0 total,0 free,0 used. 11051503+avail Mem
>>
>> This is what kern.log says:
>> https://gist.github.com/brstgt/0f1aa6afb558a56d1cadce958db46cf9
>>
>> Has anyone encountered something like this before?
>>
>>
>> 2017-01-26T03:10:45.679458+00:00 cas10 kernel: [52226.449989] Node 0
>> Normal: 33850*4kB (UMEH) 8*8kB (UMH) 1*16kB (H) 0*32kB 0*64kB 0*128kB
>> 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 135480kB
>> 2017-01-26T03:10:45.679460+00:00 cas10 kernel: [52226.449995] Node 1
>> Normal: 34213*4kB (UME) 176*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
>> 0*512kB 0*1024kB 0*2048kB 0*4096kB = 138260kB
>>
>>
>> There is plenty of free memory left (33850+34213)*4kB = 270 MB, but it is
>> fragmented into 4k and 8k blocks, while the kernel is trying to allocate
>> 16kB.  Still, the kernel could have evicted some page cache or swapped out
>> anonymous memory.  You should report this to lkml, it is a kernel bug.
>>
>>
>>
>> --
>> Benjamin Roth
>> Prokurist
>>
>> Jaumo GmbH · www.jaumo.com
>> Wehrstraße 46 · 73035 Göppingen · Germany
>> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>
>>
>>
>
>
> --
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>
>
>


-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: CS process killed by kernel OOM

2017-02-06 Thread Benjamin Roth

Thanks for the reply. We got rid of the OOMs by increasing
vm.min_free_kbytes; its default of approx. 90 MB is maybe a bit low for
systems with 128 GB.
I guess the OOM happens because the kernel could not reclaim enough paged
memory instantly.
I can't tell if this is really a kernel bug or not. That was also my first
thought, but in the end the main thing is that it works again, and it does
with a higher min_free_kbytes.
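
As a side note on the fragmentation point in the quoted reply below: the two
free-list lines from kern.log can be tallied per block size to see how little
memory sits in blocks of 16 kB or more. A rough Python sketch; the helper
names parse_free_list and free_kb are made up for illustration, and the two
input strings are the log lines with the syslog prefix dropped:

```python
import re

NODE0 = ("Node 0 Normal: 33850*4kB (UMEH) 8*8kB (UMH) 1*16kB (H) 0*32kB 0*64kB "
         "0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 135480kB")
NODE1 = ("Node 1 Normal: 34213*4kB (UME) 176*8kB (UME) 0*16kB 0*32kB 0*64kB "
         "0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 138260kB")


def parse_free_list(line):
    """Map block size in kB to the number of free blocks of that size."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def free_kb(blocks, min_block_kb=0):
    """Sum the free memory held in blocks of at least min_block_kb."""
    return sum(size * count for size, count in blocks.items()
               if size >= min_block_kb)


for name, line in (("node 0", NODE0), ("node 1", NODE1)):
    blocks = parse_free_list(line)
    print(name, "total free kB:", free_kb(blocks),
          "| free in blocks >= 16 kB:", free_kb(blocks, 16))
# node 0 total free kB: 135480 | free in blocks >= 16 kB: 16
# node 1 total free kB: 138260 | free in blocks >= 16 kB: 0
```

So almost all of the free memory is in 4 kB and 8 kB blocks, which matches
the explanation in the quoted reply.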

2017-02-06 11:53 GMT+01:00 Avi Kivity <a...@scylladb.com>:

>
> On 01/26/2017 07:36 AM, Benjamin Roth wrote:
>
> Hi there,
>
> We installed 2 new nodes these days. They run on ubuntu (Ubuntu 16.04.1
> LTS) with kernel 4.4.0-59-generic. On these nodes (and only on these) CS
> gets killed by the kernel due to OOM. It seems very strange to me because,
> CS only takes roughly 20GB (out of 128GB), most of RAM is allocated to page
> cache.
>
> Top looks typically like this:
> KiB Mem : 13191691+total,  1974964 free, 20278184 used, 10966376+buff/cache
> KiB Swap:0 total,0 free,0 used. 11051503+avail Mem
>
> This is what kern.log says:
> https://gist.github.com/brstgt/0f1aa6afb558a56d1cadce958db46cf9
>
> Has anyone encountered something like this before?
>
>
> 2017-01-26T03:10:45.679458+00:00 cas10 kernel: [52226.449989] Node 0
> Normal: 33850*4kB (UMEH) 8*8kB (UMH) 1*16kB (H) 0*32kB 0*64kB 0*128kB
> 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 135480kB
> 2017-01-26T03:10:45.679460+00:00 cas10 kernel: [52226.449995] Node 1
> Normal: 34213*4kB (UME) 176*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
> 0*512kB 0*1024kB 0*2048kB 0*4096kB = 138260kB
>
>
> There is plenty of free memory left (33850+34213)*4kB = 270 MB, but it is
> fragmented into 4k and 8k blocks, while the kernel is trying to allocate
> 16kB.  Still, the kernel could have evicted some page cache or swapped out
> anonymous memory.  You should report this to lkml, it is a kernel bug.
>
>
>
> --
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>
>
>


-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

RE: Is it possible to have a column which can hold any data type (for inserting as json)

2017-02-01 Thread Benjamin Roth

This has to be done in your app. You can store your data as JSON in a text
column. You can use your favourite serializer. You can cast floats to
strings. You can even build a custom type. You can store it serialized as a
blob. But there is no all-purpose field that magically stores any data type.
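
To illustrate the "do it in your app" route, here is a minimal Python sketch
that JSON-encodes only the variable-typed field so it fits a text column,
then builds the INSERT ... JSON statement. The helper names to_row and
insert_json_statement are hypothetical, and the table and column names are
taken from the schema quoted below:

```python
import json


def to_row(doc, row_id):
    """Copy the document and store 'value' as its JSON encoding (a string)."""
    row = dict(doc)
    row["id"] = row_id
    # A double, a string or a boolean all become text here; json.loads()
    # on the stored text gives the original value back when reading.
    row["value"] = json.dumps(doc["value"])
    return row


def insert_json_statement(table, row):
    """Build an INSERT ... JSON statement, doubling single quotes for CQL."""
    payload = json.dumps(row).replace("'", "''")
    return f"INSERT INTO {table} JSON '{payload}';"


doc = {"datatype": "DOUBLE", "name": "Longitude", "value": 1.390692}
row = to_row(doc, 1)
print(insert_json_statement("data", row))
print(json.loads(row["value"]))  # the double again, decoded by the app
```

The datatype column already present in the document tells the reader how to
interpret the decoded value, so the rest of the schema can stay as defined.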

On 02.02.2017 05:30, "Rajeswari Menon" <rajeswar...@thinkpalm.com> wrote:

> Yes. Is there any way to define value to accept any data type as the json
> value data may vary? Or is there any way to do the same without defining a
> schema?
>
>
>
> Regards,
>
> Rajeswari
>
>
>
> *From:* Benjamin Roth [mailto:benjamin.r...@jaumo.com]
> *Sent:* 01 February 2017 15:36
> *To:* user@cassandra.apache.org
> *Subject:* RE: Is it possible to have a column which can hold any data
> type (for inserting as json)
>
>
>
> Value is defined as a text column and you are trying to insert a double.
> That's simply not allowed.
>
>
>
> On 01.02.2017 09:02, "Rajeswari Menon" <rajeswar...@thinkpalm.com> wrote:
>
> Given below is the sql query I executed.
>
>
>
> insert into data JSON '{
>   "id": 1,
>   "address": "",
>   "datatype": "DOUBLE",
>   "name": "Longitude",
>   "attributes": {
>     "ID": "1"
>   },
>   "category": "REAL",
>   "value": 1.390692,
>   "timestamp": 1485923271718,
>   "quality": "GOOD"
> }';
>
>
>
> Regards,
>
> Rajeswari
>
>
>
> *From:* Benjamin Roth [mailto:benjamin.r...@jaumo.com]
> *Sent:* 01 February 2017 12:35
> *To:* user@cassandra.apache.org
> *Subject:* Re: Is it possible to have a column which can hold any data
> type (for inserting as json)
>
>
>
> You should post the whole CQL query you try to execute! Why don't you use
> a native JSON type for your JSON data?
>
>
>
> 2017-02-01 7:51 GMT+01:00 Rajeswari Menon <rajeswar...@thinkpalm.com>:
>
> Hi,
>
>
>
> I have a json data as shown below.
>
>
>
> {
>
> "address":"127.0.0.1",
>
> "datatype":"DOUBLE",
>
> "name":"Longitude",
>
>  "attributes":{
>
> "Id":"1"
>
> },
>
> "category":"REAL",
>
> "value":1.390692,
>
> "timestamp":1485923271718,
>
> "quality":"GOOD"
>
> }
>
>
>
> To store the above json to Cassandra, I defined a table as shown below
>
>
>
> create table data
> (
>   id int primary key,
>   address text,
>   datatype text,
>   name text,
>   attributes map<text, text>,
>   category text,
>   value text,
>   "timestamp" timestamp,
>   quality text
> );
>
>
>
> When I try to insert the data as JSON I got the error : *Error decoding
> JSON value for value: Expected a UTF-8 string, but got a Double: 1.390692*.
> The message is clear that a double value cannot be inserted to text column.
> The real issue is that the value can be of any data type, so the schema
> cannot be predefined. Is there a way to create a column which can hold
> value of any data type. (I don’t want to hold the entire json as string. My
> preferred way is to define a schema.)
>
>
>
> Regards,
>
> Rajeswari
>
>
>
>
>
> --
>
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>
>

RE: Is it possible to have a column which can hold any data type (for inserting as json)

2017-02-01 Thread Benjamin Roth

Value is defined as a text column and you are trying to insert a double.
That's simply not allowed.

On 01.02.2017 09:02, "Rajeswari Menon" <rajeswar...@thinkpalm.com> wrote:

> Given below is the sql query I executed.
>
>
>
> insert into data JSON '{
>   "id": 1,
>   "address": "",
>   "datatype": "DOUBLE",
>   "name": "Longitude",
>   "attributes": {
>     "ID": "1"
>   },
>   "category": "REAL",
>   "value": 1.390692,
>   "timestamp": 1485923271718,
>   "quality": "GOOD"
> }';
>
>
>
> Regards,
>
> Rajeswari
>
>
>
> *From:* Benjamin Roth [mailto:benjamin.r...@jaumo.com]
> *Sent:* 01 February 2017 12:35
> *To:* user@cassandra.apache.org
> *Subject:* Re: Is it possible to have a column which can hold any data
> type (for inserting as json)
>
>
>
> You should post the whole CQL query you try to execute! Why don't you use
> a native JSON type for your JSON data?
>
>
>
> 2017-02-01 7:51 GMT+01:00 Rajeswari Menon <rajeswar...@thinkpalm.com>:
>
> Hi,
>
>
>
> I have a json data as shown below.
>
>
>
> {
>
> "address":"127.0.0.1",
>
> "datatype":"DOUBLE",
>
> "name":"Longitude",
>
>  "attributes":{
>
> "Id":"1"
>
> },
>
> "category":"REAL",
>
> "value":1.390692,
>
> "timestamp":1485923271718,
>
> "quality":"GOOD"
>
> }
>
>
>
> To store the above json to Cassandra, I defined a table as shown below
>
>
>
> create table data
> (
>   id int primary key,
>   address text,
>   datatype text,
>   name text,
>   attributes map<text, text>,
>   category text,
>   value text,
>   "timestamp" timestamp,
>   quality text
> );
>
>
>
> When I try to insert the data as JSON I got the error : *Error decoding
> JSON value for value: Expected a UTF-8 string, but got a Double: 1.390692*.
> The message is clear that a double value cannot be inserted to text column.
> The real issue is that the value can be of any data type, so the schema
> cannot be predefined. Is there a way to create a column which can hold
> value of any data type. (I don’t want to hold the entire json as string. My
> preferred way is to define a schema.)
>
>
>
> Regards,
>
> Rajeswari
>
>
>
>
>
> --
>
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>

Re: Is it possible to have a column which can hold any data type (for inserting as json)

2017-01-31 Thread Benjamin Roth

You should post the whole CQL query you try to execute! Why don't you use a
native JSON type for your JSON data?

2017-02-01 7:51 GMT+01:00 Rajeswari Menon <rajeswar...@thinkpalm.com>:

> Hi,
>
>
>
> I have a json data as shown below.
>
>
>
> {
>
> "address":"127.0.0.1",
>
> "datatype":"DOUBLE",
>
> "name":"Longitude",
>
>  "attributes":{
>
> "Id":"1"
>
> },
>
> "category":"REAL",
>
> "value":1.390692,
>
> "timestamp":1485923271718,
>
> "quality":"GOOD"
>
> }
>
>
>
> To store the above json to Cassandra, I defined a table as shown below
>
>
>
> create table data
> (
>   id int primary key,
>   address text,
>   datatype text,
>   name text,
>   attributes map<text, text>,
>   category text,
>   value text,
>   "timestamp" timestamp,
>   quality text
> );
>
>
>
> When I try to insert the data as JSON I got the error : *Error decoding
> JSON value for value: Expected a UTF-8 string, but got a Double: 1.390692*.
> The message is clear that a double value cannot be inserted to text column.
> The real issue is that the value can be of any data type, so the schema
> cannot be predefined. Is there a way to create a column which can hold
> value of any data type. (I don’t want to hold the entire json as string. My
> preferred way is to define a schema.)
>
>
>
> Regards,
>
> Rajeswari
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: Time series data model and tombstones

2017-01-28 Thread Benjamin Roth

Maybe trace your queries to see what's happening in detail.

On 28.01.2017 21:32, "John Sanda" wrote:

Thanks for the response. This version of the code is using STCS.
gc_grace_seconds was set to one day and then I changed it to zero since RF
= 1. I understand that expired data will still generate tombstones and that
STCS is not the best. More recent versions of the code use DTCS, and we'll
be switching over to TWCS shortly. The suggestions raised are excellent
ones, but I tend to think of them as optimizations that might not address
my issue, which I think may be 1) a problem with my data model, 2) a problem
with the queries used, or 3) some misunderstanding of how Cassandra performs
range scans.

I am doing append-only writes. There is no out of order data. There are no
deletes, just TTLs. Data is stored on disk in descending order, and queries
access recent data and never query past the TTL of seven days. Given this I
would not expect to be reading tombstones, certainly not the large numbers
that I am seeing.

On Sat, Jan 28, 2017 at 12:15 PM, Jonathan Haddad  wrote:

> Since you didn't specify a compaction strategy I'm guessing you're using
> STCS. Your TTL'ed data is becoming a tombstone. TWCS is a better strategy
> for this type of workload.
> On Sat, Jan 28, 2017 at 8:30 AM John Sanda  wrote:
>
>> I have a time series data model that is basically:
>>
>> CREATE TABLE metrics (
>> id text,
>> time timeuuid,
>> value double,
>> PRIMARY KEY (id, time)
>> ) WITH CLUSTERING ORDER BY (time DESC);
>>
>> I do append-only writes, no deletes, and use a TTL of seven days. Data
>> points are written every second. The UI queries data for the past hour,
>> two hours, day, or week. The UI refreshes and executes queries every 30
>> seconds. In one test environment I am seeing lots of tombstone threshold
>> warnings and Cassandra has even OOME'd. Since I am storing data in
>> descending order and always query for recent data, I do not understand why
>> I am running into this problem.
>>
>> I know that it is recommended to do some date partitioning in part to
>> ensure partitions do not grow too large. I already have some changes in
>> place to partition by day. Before I make those changes I want to
>> understand why I am scanning so many tombstones so that I can be more
>> confident that the date partitioning changes will help.
>>
>> Thanks
>>
>> - John
>>
>

-- 

- John

Re: Disc size for cluster

2017-01-26 Thread Benjamin Roth

Hi!

This is basically right, but:
1. How do you know the 3TB of storage will be 3TB on Cassandra? This depends
on how the data is serialized and compressed, how often it changes, and on
your compaction settings.
2. 50% free space on STCS is only required if you do a full compaction of a
single CF that takes all the space. Normally you need as much free space as
the target SSTable of a compaction will take. If you split your data across
more CFs, it's unlikely you will really hit this value.

You should probably do some tests. But in the end it is always good to
have some headroom. I personally would scale out if free space is < 30%, but
that always depends on your model.
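
For a back-of-the-envelope comparison of the two headroom figures mentioned
here, a small Python sketch; raw_storage_tb is a made-up helper, and it
assumes the on-disk size equals the logical size, which point 1 above
explicitly calls into question:

```python
def raw_storage_tb(usable_tb, replication_factor, free_fraction):
    """Raw capacity needed so that free_fraction of every disk stays empty."""
    if not 0 <= free_fraction < 1:
        raise ValueError("free_fraction must be in [0, 1)")
    return usable_tb * replication_factor / (1 - free_fraction)


# Worst case from the question below: 3 TB, RF 3, 50% free for STCS.
print(raw_storage_tb(3, 3, 0.5))             # 18.0
# With the 30% free-space threshold used as a scale-out trigger above.
print(round(raw_storage_tb(3, 3, 0.3), 2))   # 12.86
```

The difference between roughly 13 TB and 18 TB is why it is worth testing
with real data and real compaction settings before buying disks.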


2017-01-26 9:56 GMT+01:00 Raphael Vogel <raphael.vo...@web.de>:

> Hi
> Just want to validate my estimation for a C* cluster which should have
> around 3 TB of usable storage.
> Assuming a RF of 3 and SizeTiered Compaction Strategy.
> Is it correct, that SizeTiered Compaction Strategy needs (in the worst
> case) 50% free disc space during compaction?
>
> So this would then result in a cluster of 3TB x 3 x 2 == 18 TB of raw
> storage?
>
> Thanks and Regards
> Raphael Vogel
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

CS process killed by kernel OOM

2017-01-25 Thread Benjamin Roth

Hi there,

We installed 2 new nodes these days. They run on ubuntu (Ubuntu 16.04.1
LTS) with kernel 4.4.0-59-generic. On these nodes (and only on these) CS
gets killed by the kernel due to OOM. It seems very strange to me because,
CS only takes roughly 20GB (out of 128GB), most of RAM is allocated to page
cache.

Top looks typically like this:
KiB Mem : 13191691+total,  1974964 free, 20278184 used, 10966376+buff/cache
KiB Swap:0 total,0 free,0 used. 11051503+avail Mem

This is what kern.log says:
https://gist.github.com/brstgt/0f1aa6afb558a56d1cadce958db46cf9

Has anyone encountered something like this before?

-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: [Multi DC] Old Data Not syncing from Existing cluster to new Cluster

2017-01-24 Thread Benjamin Roth

Have you also altered RF of system_distributed as stated in the tutorial?
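
In case it helps, the statement for system_distributed has the same shape as
the one already used for the wls keyspace below. A small Python sketch that
builds it; alter_replication is a hypothetical helper, and the RF of 2 per DC
is only copied from the wls example, not a recommendation:

```python
def alter_replication(keyspace, dc_rf):
    """Build an ALTER KEYSPACE statement for NetworkTopologyStrategy."""
    options = ", ".join(f"'{dc}': '{rf}'" for dc, rf in dc_rf.items())
    return (f"ALTER KEYSPACE {keyspace} WITH replication = "
            f"{{'class': 'NetworkTopologyStrategy', {options}}};")


# DC names as they appear in the nodetool status output below.
print(alter_replication("system_distributed",
                        {"DRPOCcluster": 2, "dc_india": 2}))
```

The printed statement can be pasted into cqlsh; which system keyspaces need
it is described in the tutorial, so check the list there.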

2017-01-24 16:45 GMT+01:00 Abhishek Kumar Maheshwari <
abhishek.maheshw...@timesinternet.in>:

> My Mistake,
>
>
>
> Both clusters are up and running.
>
>
>
> Datacenter: DRPOCcluster
>
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address       Load      Tokens  Owns  Host ID                               Rack
> UN  172.29.XX.XX  1.65 GB   256     ?     badf985b-37da-4735-b468-8d3a058d4b60  01
> UN  172.29.XX.XX  1.64 GB   256     ?     317061b2-c19f-44ba-a776-bcd91c70bbdd  03
> UN  172.29.XX.XX  1.64 GB   256     ?     9bf0d1dc-6826-4f3b-9c56-cec0c9ce3b6c  02
>
> Datacenter: dc_india
>
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address       Load      Tokens  Owns  Host ID                               Rack
> UN  172.26.XX.XX  79.90 GB  256     ?     3e8133ed-98b5-418d-96b5-690a1450cd30  RACK1
> UN  172.26.XX.XX  80.21 GB  256     ?     7d3f5b25-88f9-4be7-b0f5-746619153543  RACK2
>
>
>
> *Thanks & Regards,*
> *Abhishek Kumar Maheshwari*
> *+91- 805591 (Mobile)*
>
> Times Internet Ltd. | A Times of India Group Company
>
> FC - 6, Sector 16A, Film City,  Noida,  U.P. 201301 | INDIA
>
> *P** Please do not print this email unless it is absolutely necessary.
> Spread environmental awareness.*
>
>
>
> *From:* Benjamin Roth [mailto:benjamin.r...@jaumo.com]
> *Sent:* Tuesday, January 24, 2017 9:11 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: [Multi DC] Old Data Not syncing from Existing cluster to
> new Cluster
>
>
>
> I am not an expert in bootstrapping new DCs but shouldn't the OLD nodes
> appear as UP to be used as a streaming source in rebuild?
>
>
>
> 2017-01-24 16:32 GMT+01:00 Abhishek Kumar Maheshwari <Abhishek.Maheshwari@
> timesinternet.in>:
>
> Yes, I took all the steps. New data that I insert is replicated to
> both DCs. Only the old data is not replicated to the new cluster.
>
>
>
> *Thanks & Regards,*
> *Abhishek Kumar Maheshwari*
> *+91- 805591 (Mobile)*
>
> Times Internet Ltd. | A Times of India Group Company
>
> FC - 6, Sector 16A, Film City,  Noida,  U.P. 201301 | INDIA
>
>
>
>
> *From:* Benjamin Roth [mailto:benjamin.r...@jaumo.com]
> *Sent:* Tuesday, January 24, 2017 8:55 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: [Multi DC] Old Data Not syncing from Existing cluster to
> new Cluster
>
>
>
> There is much more to it than just changing the RF in the keyspace!
>
>
>
> See here:
> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAddDCToCluster.html
>
>
>
> 2017-01-24 16:18 GMT+01:00 Abhishek Kumar Maheshwari <Abhishek.Maheshwari@
> timesinternet.in>:
>
> Hi All,
>
>
>
> I have Cassandra stack with 2 Dc
>
>
>
> Datacenter: DRPOCcluster
>
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address          Load      Tokens  Owns  Host ID                               Rack
> UN  172.29.xx.xxx    256 MB    256     ?     b6b8cbb9-1fed-471f-aea9-6a657e7ac80a  01
> UN  172.29.xx.xxx    240 MB    256     ?     604abbf5-8639-4104-8f60-fd6573fb2e17  03
> UN  172.29.xx.xxx    240 MB    256     ?     32fa79ee-93c6-4e5b-a910-f27a1e9d66c1  02
>
> Datacenter: dc_india
>
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address          Load      Tokens  Owns  Host ID                               Rack
> DN  172.26. .xx.xxx  78.97 GB  256     ?     3e8133ed-98b5-418d-96b5-690a1450cd30  RACK1
> DN  172.26. .xx.xxx  79.18 GB  256     ?     7d3f5b25-88f9-4be7-b0f5-746619153543  RACK2
>
>
>
> dc_india is old Dc which contains all data.
>
> I update keyspace as per below:
>
>
>
> alter KEYSPACE wls WITH replication = {'class': 'NetworkTopologyStrategy',
> 'DRPOCcluster': '2','dc_india':'2'}  AND durable_writes = true;
>
>
>
> But the old data is not showing up in DRPOCcluster (which is new). Also,
> while running nodetool rebuild I get the exception below:
>
> Command: ./nodetool rebuild -dc dc_india
>
>
>
> Exception : nodetool: U

Re: [Multi DC] Old Data Not syncing from Existing cluster to new Cluster

2017-01-24 Thread Benjamin Roth

I am not an expert in bootstrapping new DCs but shouldn't the OLD nodes
appear as UP to be used as a streaming source in rebuild?

2017-01-24 16:32 GMT+01:00 Abhishek Kumar Maheshwari <
abhishek.maheshw...@timesinternet.in>:

> Yes, I took all the steps. New data that I insert is replicated to
> both DCs. Only the old data is not replicated to the new cluster.
>
>
>
> *Thanks & Regards,*
> *Abhishek Kumar Maheshwari*
> *+91- 805591 (Mobile)*
>
> Times Internet Ltd. | A Times of India Group Company
>
> FC - 6, Sector 16A, Film City,  Noida,  U.P. 201301 | INDIA
>
>
>
>
> *From:* Benjamin Roth [mailto:benjamin.r...@jaumo.com]
> *Sent:* Tuesday, January 24, 2017 8:55 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: [Multi DC] Old Data Not syncing from Existing cluster to
> new Cluster
>
>
>
> There is much more to it than just changing the RF in the keyspace!
>
>
>
> See here:
> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAddDCToCluster.html
>
>
>
> 2017-01-24 16:18 GMT+01:00 Abhishek Kumar Maheshwari <Abhishek.Maheshwari@
> timesinternet.in>:
>
> Hi All,
>
>
>
> I have Cassandra stack with 2 Dc
>
>
>
> Datacenter: DRPOCcluster
>
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address          Load      Tokens  Owns  Host ID                               Rack
> UN  172.29.xx.xxx    256 MB    256     ?     b6b8cbb9-1fed-471f-aea9-6a657e7ac80a  01
> UN  172.29.xx.xxx    240 MB    256     ?     604abbf5-8639-4104-8f60-fd6573fb2e17  03
> UN  172.29.xx.xxx    240 MB    256     ?     32fa79ee-93c6-4e5b-a910-f27a1e9d66c1  02
>
> Datacenter: dc_india
>
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address          Load      Tokens  Owns  Host ID                               Rack
> DN  172.26. .xx.xxx  78.97 GB  256     ?     3e8133ed-98b5-418d-96b5-690a1450cd30  RACK1
> DN  172.26. .xx.xxx  79.18 GB  256     ?     7d3f5b25-88f9-4be7-b0f5-746619153543  RACK2
>
>
>
> dc_india is old Dc which contains all data.
>
> I update keyspace as per below:
>
>
>
> alter KEYSPACE wls WITH replication = {'class': 'NetworkTopologyStrategy',
> 'DRPOCcluster': '2','dc_india':'2'}  AND durable_writes = true;
>
>
>
> But the old data is not showing up in DRPOCcluster (which is new). Also,
> while running nodetool rebuild I get the exception below:
>
> Command: ./nodetool rebuild -dc dc_india
>
>
>
> Exception : nodetool: Unable to find sufficient sources for streaming
> range (-875697427424852,-8755484427030035332] in keyspace
> system_distributed
>
>
>
> Cassandra version : 3.0.9
>
>
>
>
>
> *Thanks & Regards,*
> *Abhishek Kumar Maheshwari*
> *+91- 805591 (Mobile)*
>
> Times Internet Ltd. | A Times of India Group Company
>
> FC - 6, Sector 16A, Film City,  Noida,  U.P. 201301 | INDIA
>
>
>
>
>
>
>
>
> --
>
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 · Fax +49 7161 304880-1
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: [Multi DC] Old Data Not syncing from Existing cluster to new Cluster

2017-01-24 Thread Benjamin Roth

There is much more to it than just changing the RF in the keyspace!

See here:
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAddDCToCluster.html

2017-01-24 16:18 GMT+01:00 Abhishek Kumar Maheshwari <
abhishek.maheshw...@timesinternet.in>:

> Hi All,
>
>
>
> I have Cassandra stack with 2 Dc
>
>
>
> Datacenter: DRPOCcluster
>
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address          Load      Tokens  Owns  Host ID                               Rack
> UN  172.29.xx.xxx    256 MB    256     ?     b6b8cbb9-1fed-471f-aea9-6a657e7ac80a  01
> UN  172.29.xx.xxx    240 MB    256     ?     604abbf5-8639-4104-8f60-fd6573fb2e17  03
> UN  172.29.xx.xxx    240 MB    256     ?     32fa79ee-93c6-4e5b-a910-f27a1e9d66c1  02
>
> Datacenter: dc_india
>
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address          Load      Tokens  Owns  Host ID                               Rack
> DN  172.26. .xx.xxx  78.97 GB  256     ?     3e8133ed-98b5-418d-96b5-690a1450cd30  RACK1
> DN  172.26. .xx.xxx  79.18 GB  256     ?     7d3f5b25-88f9-4be7-b0f5-746619153543  RACK2
>
>
>
> dc_india is old Dc which contains all data.
>
> I update keyspace as per below:
>
>
>
> alter KEYSPACE wls WITH replication = {'class': 'NetworkTopologyStrategy',
> 'DRPOCcluster': '2','dc_india':'2'}  AND durable_writes = true;
>
>
>
> But the old data is not showing up in DRPOCcluster (which is new). Also,
> while running nodetool rebuild I get the exception below:
>
> Command: ./nodetool rebuild -dc dc_india
>
>
>
> Exception : nodetool: Unable to find sufficient sources for streaming
> range (-875697427424852,-8755484427030035332] in keyspace
> system_distributed
>
>
>
> Cassandra version : 3.0.9
>
>
>
>
>
> *Thanks & Regards,*
> *Abhishek Kumar Maheshwari*
> *+91- 805591 (Mobile)*
>
> Times Internet Ltd. | A Times of India Group Company
>
> FC - 6, Sector 16A, Film City,  Noida,  U.P. 201301 | INDIA
>
>
>
>



-- 
Benjamin Roth
Prokurist

Jaumo GmbH · www.jaumo.com
Wehrstraße 46 · 73035 Göppingen · Germany
Phone +49 7161 304880-6 · Fax +49 7161 304880-1
AG Ulm · HRB 731058 · Managing Director: Jens Kammerer

Re: Huge size of system.batches table after dropping an incomplete Materialized View

2017-01-23 Thread Benjamin Roth

What exactly persists? I didn't really understand you, could you be more
specific?

2017-01-23 15:40 GMT+01:00 Vinci <vi...@protonmail.com>:

> Thanks for the response.
>
> After the MV failure and errors, MV was dropped and the table was
> truncated.
> Then I recreated the MV and Table from scratch which worked as expected.
>
> The huge sizes of sstables as I have mentioned are after that. Somehow it
> still persists with same last modification timestamps.
>
> Not sure if i can safely rm these sstables or truncate system.batches on
> that node.
>
>
>  Original Message 
> Subject: Re: Huge size of system.batches table after dropping an
> incomplete Materialized View
> Local Time: 22 January 2017 11:41 PM
> UTC Time: 22 January 2017 18:11
> From: benjamin.r...@jaumo.com
> To: user@cassandra.apache.org, Vinci <vi...@protonmail.com>
>
> I cannot tell you where these errors like "Attempting to mutate ..." come
> from but under certain circumstances all view mutations are stored in
> batches, so the batchlog can grow insanely large. I don't see why a repair
> should help you in this situation. I guess what you want is to recreate the
> table.
>
> 1. You should not repair MVs directly. The current design is to only
> repair the base table - though it's not properly documented. Repairing MVs
> can create inconsistent states. Only repairing the base tables won't.
> 2. A repair does only repair data and won't fix schema-issues
> 3. A repair of a base table that contains an MV is incredibly slow if the
> state is very inconsistent (which is probably the case in your situation)
>
> What to do?
> - If you don't care about the data of the MV, you of course can delete all
> SSTables (when CS is stopped) and all data will be gone. But I don't know
> if it helps.
> - If you are 100% sure that no other batch logs are going on, you could
> also truncate the system.batches, otherwise your log may be flooded with
> "non-existant table" things if the batch log is replayed. It is annoying
> but should not harm anyone.
>
> => Start over, try to drop and create the MV. Watch out for logs referring
> to schema changes and errors
>
> Side note:
> I'd recommend not to use MVs (yet) if you don't have an "inside"
> understanding of them or "know what you are doing". They can have a very
> big impact on your cluster performance in some situations and are not
> generally considered as stable yet.
>
> 2017-01-22 18:42 GMT+01:00 Vinci <vi...@protonmail.com>:
>
>> Hi there,
>>
>> Version :- Cassandra 3.0.7
>>
>> I attempted to create a Materialized View on a certain table and it
>> failed with never-ending WARN message "Mutation of  bytes is too
>> large for the maximum size of ".
>>
>> "nodetool stop VIEW_BUILD" also did not help.
>>
>> That seems to be a result of
>> https://issues.apache.org/jira/browse/CASSANDRA-11670 which is fixed in newer versions.
>>
>> So I tried dropping the view and that generated error messages like
>> following :-
>>
>> ERROR [CompactionExecutor:632] [Timestamp] Keyspace.java:475 - Attempting
>> to mutate non-existant table 7c2e1c40-b82b-11e6-9d20-4b0190661423
>> (keyspace_name.view_name)
>>
>> I performed an incremental repair of the table on which view was created
>> and a rolling restart to stop these errors.
>>
>> Now I see huge size of system.batches table on one of the nodes. It seems
>> related to issues mentioned above since last modification timestamps of the
>> sstable files inside system/batches is same as when I tried to drop the MV.
>>
>> Some insight and suggestions regarding it will be very helpful. I will
>> like to know if i can safely truncate the table, rm the files or any other
>> approach to clean it up?
>>
>> Thanks.
>>



Re: Getting Error while Writing in Multi DC mode when Remote Dc is Down.

2017-01-23 Thread Benjamin Roth

Sorry for the short answer, I am on the run:
I guess your hints expired. The default hint window is 3 hours. If a node is
down for longer than that, no further hints are written for it.
Only a repair will help then.
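
As a rough illustration of the rule above, here is a minimal Java sketch that compares an outage duration with the hint window. The class and method names are hypothetical, and the 3 hour value is simply the default mentioned above (the `max_hint_window_in_ms` setting in cassandra.yaml); a cluster may be configured differently, and hints can be lost for other reasons, so treat a "false" result as a hint and not a guarantee.

```java
import java.time.Duration;

// Hypothetical helper, not part of Cassandra or the Java driver.
final class HintWindowCheck {
    // Default hint window of 3 hours, as stated in the message above.
    static final Duration DEFAULT_HINT_WINDOW = Duration.ofHours(3);

    // Returns true when the node was down longer than the hint window,
    // meaning writes after the window were not hinted and a repair is needed.
    static boolean repairNeeded(Duration downtime, Duration hintWindow) {
        return downtime.compareTo(hintWindow) > 0;
    }

    public static void main(String[] args) {
        System.out.println(repairNeeded(Duration.ofMinutes(30), DEFAULT_HINT_WINDOW));
        System.out.println(repairNeeded(Duration.ofHours(5), DEFAULT_HINT_WINDOW));
    }
}
```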

2017-01-23 12:47 GMT+01:00 Abhishek Kumar Maheshwari <
abhishek.maheshw...@timesinternet.in>:

> Hi Benjamin,
>
>
>
> I found the issue: when making the query, I was overriding LOCAL_QUORUM
> with QUORUM.
>
> Also, one more question:
>
> I was able to insert data in DRPOCcluster. But when I bring up the dc_india
> DC, the data doesn't appear in the dc_india keyspace and column family (I
> waited about 30 minutes)?
>
>
>
>
>
>
>
>
>
> *Thanks & Regards,*
> *Abhishek Kumar Maheshwari*
> *+91- 805591 <+91%208%2005591> (Mobile)*
>
> Times Internet Ltd. | A Times of India Group Company
>
> FC - 6, Sector 16A, Film City,  Noida,  U.P. 201301 | INDIA
>
> *P** Please do not print this email unless it is absolutely necessary.
> Spread environmental awareness.*
>
>
>
> *From:* Benjamin Roth [mailto:benjamin.r...@jaumo.com]
> *Sent:* Monday, January 23, 2017 5:05 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Getting Error while Writing in Multi DC mode when Remote
> Dc is Down.
>
>
>
> The query has QUORUM not LOCAL_QUORUM. So 3 of 5 nodes are required. Maybe
> 1 node in DRPOCcluster also was temporarily unavailable during that query?
>
>
>
> 2017-01-23 12:16 GMT+01:00 Abhishek Kumar Maheshwari <Abhishek.Maheshwari@
> timesinternet.in>:
>
> Hi All,
>
>
>
> I have a Cassandra stack with 2 DCs:
>
>
>
> Datacenter: DRPOCcluster
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address          Load      Tokens  Owns  Host ID                               Rack
> UN  172.29.xx.xxx    88.88 GB  256     ?     b6b8cbb9-1fed-471f-aea9-6a657e7ac80a  01
> UN  172.29.xx.xxx    73.95 GB  256     ?     604abbf5-8639-4104-8f60-fd6573fb2e17  03
> UN  172.29.xx.xxx    66.42 GB  256     ?     32fa79ee-93c6-4e5b-a910-f27a1e9d66c1  02
>
> Datacenter: dc_india
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address          Load      Tokens  Owns  Host ID                               Rack
> DN  172.26. .xx.xxx  78.97 GB  256     ?     3e8133ed-98b5-418d-96b5-690a1450cd30  RACK1
> DN  172.26. .xx.xxx  79.18 GB  256     ?     7d3f5b25-88f9-4be7-b0f5-746619153543  RACK2
>
>
>
>
>
> I am using the code below to connect with the Java driver:
>
> cluster = Cluster.builder()
>     .addContactPoints(hostAddresses)
>     .withRetryPolicy(DefaultRetryPolicy.INSTANCE)
>     .withReconnectionPolicy(new ConstantReconnectionPolicy(3L))
>     .withLoadBalancingPolicy(new TokenAwarePolicy(
>         new DCAwareRoundRobinPolicy.Builder()
>             .withLocalDc("DRPOCcluster")
>             .withUsedHostsPerRemoteDc(2)
>             .build()))
>     .build();
>
> cluster.getConfiguration().getQueryOptions()
>     .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
>
>
>
> hostAddresses is 172.29.xx.xxx. When the DC with IP 172.26. .xx.xxx is
> down, we get the exception below:
>
>
>
>
>
> Exception in thread "main" com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (3 required but only 2 alive)
>     at com.datastax.driver.core.exceptions.UnavailableException.copy(UnavailableException.java:109)
>     at com.datastax.driver.core.exceptions.UnavailableException.copy(UnavailableException.java:27)
>     at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
>     at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
>
>
>
> Cassandra version: 3.0.9
>
> Datastax Java Driver version: com.datastax.cassandra:cassandra-driver-core:3.1.2
>
>
>
>
>
> *Thanks & Regards,*
> *Abhishek Kumar Maheshwari*
> *+91- 805591 <+91%208%2005591> (Mobile)*
>
> Times Internet Ltd. | A Times of India Group Company
>
> FC - 6, Sector 16A, Film City,  Noida,  U.P. 201301 | INDIA
>

Re: Getting Error while Writing in Multi DC mode when Remote Dc is Down.

2017-01-23 Thread Benjamin Roth

The query has QUORUM not LOCAL_QUORUM. So 3 of 5 nodes are required. Maybe
1 node in DRPOCcluster also was temporarily unavailable during that query?
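
To make the "3 of 5" arithmetic concrete, here is a small Java sketch of how the required replica counts are derived. The thread does not state the replication factors; the 3 + 2 split below is only an assumption chosen so that the total is 5, and the class name is hypothetical. QUORUM is a majority of the replicas summed over all datacenters, while LOCAL_QUORUM is a majority of the replicas in the local datacenter only.

```java
// Hypothetical helper, not part of the DataStax driver.
final class QuorumMath {
    // QUORUM: majority of the sum of the replication factors of all datacenters.
    static int quorum(int... replicationFactors) {
        int total = 0;
        for (int rf : replicationFactors) {
            total += rf;
        }
        return total / 2 + 1;
    }

    // LOCAL_QUORUM: majority of the replicas in the local datacenter only.
    static int localQuorum(int localReplicationFactor) {
        return localReplicationFactor / 2 + 1;
    }

    public static void main(String[] args) {
        int rfLocal = 3;  // assumed replication factor for DRPOCcluster
        int rfRemote = 2; // assumed replication factor for dc_india
        System.out.println("QUORUM needs " + quorum(rfLocal, rfRemote) + " replicas");
        System.out.println("LOCAL_QUORUM needs " + localQuorum(rfLocal) + " replicas");
    }
}
```

Under these assumed replication factors, QUORUM needs 3 replicas across both datacenters, whereas LOCAL_QUORUM needs only 2 in the local one, which is why the latter can keep working while the remote DC is down.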

2017-01-23 12:16 GMT+01:00 Abhishek Kumar Maheshwari <
abhishek.maheshw...@timesinternet.in>:

> Hi All,
>
>
>
> I have a Cassandra stack with 2 DCs:
>
>
>
> Datacenter: DRPOCcluster
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address          Load      Tokens  Owns  Host ID                               Rack
> UN  172.29.xx.xxx    88.88 GB  256     ?     b6b8cbb9-1fed-471f-aea9-6a657e7ac80a  01
> UN  172.29.xx.xxx    73.95 GB  256     ?     604abbf5-8639-4104-8f60-fd6573fb2e17  03
> UN  172.29.xx.xxx    66.42 GB  256     ?     32fa79ee-93c6-4e5b-a910-f27a1e9d66c1  02
>
> Datacenter: dc_india
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address          Load      Tokens  Owns  Host ID                               Rack
> DN  172.26. .xx.xxx  78.97 GB  256     ?     3e8133ed-98b5-418d-96b5-690a1450cd30  RACK1
> DN  172.26. .xx.xxx  79.18 GB  256     ?     7d3f5b25-88f9-4be7-b0f5-746619153543  RACK2
>
>
>
>
>
> I am using the code below to connect with the Java driver:
>
> cluster = Cluster.builder()
>     .addContactPoints(hostAddresses)
>     .withRetryPolicy(DefaultRetryPolicy.INSTANCE)
>     .withReconnectionPolicy(new ConstantReconnectionPolicy(3L))
>     .withLoadBalancingPolicy(new TokenAwarePolicy(
>         new DCAwareRoundRobinPolicy.Builder()
>             .withLocalDc("DRPOCcluster")
>             .withUsedHostsPerRemoteDc(2)
>             .build()))
>     .build();
>
> cluster.getConfiguration().getQueryOptions()
>     .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
>
>
>
> hostAddresses is 172.29.xx.xxx. When the DC with IP 172.26. .xx.xxx is
> down, we get the exception below:
>
>
>
>
>
> Exception in thread "main" com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency QUORUM (3 required but only 2 alive)
>     at com.datastax.driver.core.exceptions.UnavailableException.copy(UnavailableException.java:109)
>     at com.datastax.driver.core.exceptions.UnavailableException.copy(UnavailableException.java:27)
>     at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
>     at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
>
>
>
> Cassandra version: 3.0.9
>
> Datastax Java Driver version: com.datastax.cassandra:cassandra-driver-core:3.1.2
>
>
>
>
>
> *Thanks & Regards,*
> *Abhishek Kumar Maheshwari*
> *+91- 805591 <+91%208%2005591> (Mobile)*
>
> Times Internet Ltd. | A Times of India Group Company
>
> FC - 6, Sector 16A, Film City,  Noida,  U.P. 201301 | INDIA
>
>
>
> We the soldiers of our new economy, pledge to stop doubting and start
> spending, to enable others to go digital, to use less cash. We pledge to
> #RemonetiseIndia. Join the Times Network ‘Remonetise India’ movement today.
> To pledge for growth, give a missed call on +91 9223515515
> <+91%2092235%2015515>. Visit www.remonetiseindia.com
>




Re: Huge size of system.batches table after dropping an incomplete Materialized View

2017-01-22 Thread Benjamin Roth

I cannot tell you where errors like "Attempting to mutate ..." come
from, but under certain circumstances all view mutations are stored in
batches, so the batchlog can grow insanely large. I don't see why a repair
should help you in this situation. I guess what you want is to recreate the
table.

1. You should not repair MVs directly. The current design is to repair only
the base table, though this is not properly documented. Repairing MVs
can create inconsistent states; repairing only the base tables won't.
2. A repair only repairs data and won't fix schema issues.
3. A repair of a base table that has an MV is incredibly slow if the
state is very inconsistent (which is probably the case in your situation).

What to do?
- If you don't care about the data of the MV, you can of course delete all
its SSTables (while Cassandra is stopped) and all data will be gone. But I
don't know if it helps.
- If you are 100% sure that no other batches are in flight, you could
also truncate system.batches. If you don't, your log may be flooded with
"non-existant table" messages when the batch log is replayed. That is
annoying but should not harm anyone.

=> Start over, try to drop and create the MV. Watch out for logs referring
to schema changes and errors

Side note:
I'd recommend not to use MVs (yet) if you don't have an "inside"
understanding of them or "know what you are doing". They can have a very
big impact on your cluster performance in some situations and are not
generally considered as stable yet.

2017-01-22 18:42 GMT+01:00 Vinci <vi...@protonmail.com>:

> Hi there,
>
> Version :- Cassandra 3.0.7
>
> I attempted to create a Materialized View on a certain table and it failed
> with never-ending WARN message "Mutation of  bytes is too large for
> the maximum size of ".
>
> "nodetool stop VIEW_BUILD" also did not help.
>
> That seems to be a result of
> https://issues.apache.org/jira/browse/CASSANDRA-11670 which is fixed in newer versions.
>
> So I tried dropping the view and that generated error messages like
> following :-
>
> ERROR [CompactionExecutor:632] [Timestamp] Keyspace.java:475 - Attempting
> to mutate non-existant table 7c2e1c40-b82b-11e6-9d20-4b0190661423
> (keyspace_name.view_name)
>
> I performed an incremental repair of the table on which view was created
> and a rolling restart to stop these errors.
>
> Now I see huge size of system.batches table on one of the nodes. It seems
> related to issues mentioned above since last modification timestamps of the
> sstable files inside system/batches is same as when I tried to drop the MV.
>
> Some insight and suggestions regarding it will be very helpful. I will
> like to know if i can safely truncate the table, rm the files or any other
> approach to clean it up?
>
> Thanks.
>




Re: parallel processing - splitting data

2017-01-19 Thread Benjamin Roth

I meant the whole global token range, which is -(2^64/2) to (2^64/2 - 1),
i.e. -2^63 to 2^63 - 1.
I remember there are classes that already generate the right slices, but I
don't know by heart which ones they were.
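
A minimal Java sketch of slicing that global range into equal pieces, so each of the 4 nodes can take one slice. It assumes the signed 64-bit token range described above (as used by Murmur3Partitioner); the class name `TokenSlices` and the partition key column `pk` in the printed query are hypothetical. BigInteger is used because the width of the range (2^64) does not fit into a long.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a driver class.
final class TokenSlices {
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE); // -2^63
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE); // 2^63 - 1

    // Splits [MIN, MAX] into n contiguous slices; each entry is {start, end}, both inclusive.
    static List<long[]> slices(int n) {
        BigInteger width = MAX.subtract(MIN).add(BigInteger.ONE); // 2^64 tokens in total
        BigInteger parts = BigInteger.valueOf(n);
        List<long[]> result = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            BigInteger start = MIN.add(width.multiply(BigInteger.valueOf(i)).divide(parts));
            BigInteger end = MIN.add(width.multiply(BigInteger.valueOf(i + 1)).divide(parts))
                    .subtract(BigInteger.ONE);
            result.add(new long[] { start.longValueExact(), end.longValueExact() });
        }
        return result;
    }

    public static void main(String[] args) {
        // Each node would run the query for exactly one of these slices.
        for (long[] s : slices(4)) {
            System.out.println("SELECT ... WHERE token(pk) >= " + s[0] + " AND token(pk) <= " + s[1]);
        }
    }
}
```

Because the slices are computed from the fixed global range and not from per-node metadata, every node derives the same four slices and only needs to agree on which index it processes.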

2017-01-19 13:29 GMT+01:00 Frank Hughes <frankhughes...@gmail.com>:

> I have tried to retrieve the token ranges and slice them in 4, but the
> response I get for the following code is different on each node:
>
> TokenRange[] tokenRanges = unwrapTokenRanges(
>     metadata.getTokenRanges(keyspaceName, localHost)).toArray(new TokenRange[0]);
>
> On each node, the 1024 token ranges are different, so I'm not sure how to
> do the split.
>
> e.g. from node 1
>
> Token ranges - start:-5144720537407094184 end:-5129226025397315327
>
> This token range isn't returned by node 2, 3 or 4.
>
> Thanks again
>
> Frank
>
> On 19 January 2017 at 12:19, Benjamin Roth <benjamin.r...@jaumo.com>
> wrote:
>
>> If you have 4 Nodes with RF 4 then all data is on every node. So you can
>> just slice the whole token range into 4 pieces and let each node process 1
>> slice.
>> Determining local ranges also only helps if you read with CL_ONE.
>>
>> 2017-01-19 13:05 GMT+01:00 Frank Hughes <frankhughes...@gmail.com>:
>>
>>> Hello there,
>>>
>>> I'm running a 4 node cluster of Cassandra 3.9 with a replication factor
>>> of 4.
>>>
>>> I want to be able to run a Java process on each node that selects only
>>> 25% of the data,
>>> so I can process all of the data in parallel across the nodes.
>>>
>>> What is the best way to do this with the Java driver?
>>>
>>> I was assuming I could retrieve the token ranges for each node and page
>>> through the data using these ranges, but this includes the replicated data.
>>> I was hoping there was a way of selecting only the data that a node is
>>> responsible for and avoiding the replicated data.
>>>
>>> Many thanks for any help and guidance,
>>>
>>> Frank Hughes
>>>
>>
>>
>>
>>
>
>


