Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Oleksandr Shulgin
On Wed, Feb 21, 2018 at 7:54 PM, Durity, Sean R  wrote:

>
>
> However, I think the shots at Cassandra are generally unfair. When I
> started working with it, the DataStax documentation was some of the best
> documentation I had seen on any project, especially an open source one.
>

Oh, don't get me started on documentation, especially the DataStax one.  I
come from Postgres.  In comparison, Cassandra documentation is mostly
non-existent (and this is just a way to avoid listing other uncomfortable
epithets).

Not sure I would be able to submit patches to improve that, however,
since most of the time it would require me to already know the answers to
the very questions the incomplete docs leave open.

The move from DataStax to Apache.org for docs is actually good, IMO, since
the docs were maintained very poorly and there was no real leverage to
influence that.

Cheers,
--
Alex


Re: Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-21 Thread kurt greaves
>
> Also, I was wondering if the key cache maintains a count of how many local
> accesses a key undergoes. Such information might be very useful for
> compactions of sstables by splitting data by frequency of use so that those
> can be preferentially compacted.

No, we don't currently have metrics for that, only overall cache
hits/misses. Measuring individual local accesses would probably have a
performance and memory impact, but there's probably a way to do it
efficiently.
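Purely as an illustration of the idea (this is not Cassandra code, and all names here are invented), a low-overhead per-key access counter might look something like this, using `LongAdder` to keep contention on the hot read path low:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;

// Hypothetical sketch: a cheap per-key access counter a cache could
// maintain. LongAdder keeps increments contention-friendly; memory is
// bounded by the number of distinct keys tracked.
class KeyAccessTracker {
    private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();

    void recordAccess(String key) {
        counts.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    long accessCount(String key) {
        LongAdder a = counts.get(key);
        return a == null ? 0L : a.sum();
    }

    // Keys seen at least `threshold` times, hottest first -- candidates
    // for a "frequently read" bucket that compaction could prioritize.
    List<String> hotKeys(long threshold) {
        return counts.entrySet().stream()
                .filter(e -> e.getValue().sum() >= threshold)
                .sorted(Comparator.comparingLong(
                        (Map.Entry<String, LongAdder> e) -> e.getValue().sum()).reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        KeyAccessTracker t = new KeyAccessTracker();
        for (int i = 0; i < 5; i++) t.recordAccess("hot-key");
        t.recordAccess("cold-key");
        System.out.println(t.hotKeys(3)); // prints [hot-key]
    }
}
```

A real implementation inside the key cache would need to cap or decay the map (e.g. a sampling or count-min-sketch approach) to avoid unbounded memory, which is exactly the impact mentioned above.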

> Has this been exploited... ever?

Not that I know of. I've theorised about using it previously with some
friends, but never got around to trying it. I imagine if you did you'd
probably have to fix some parts of the code to make it work (like
potentially discoverComponentsFor).
Typically I think any conversation that is relevant to the internals of
Cassandra is fine for the dev list, and that's the desired audience. Not
every dev watches the user list and only developers will really be able to
answer these questions. Let's face it, the dev list is pretty dead so not
sure why we care about a few emails landing there.


RE: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Kenneth Brotman
 

Jeff,

 

I already addressed everything you said.  Boy! Would I like to bring up the
out-of-date articles on the web that trip people up and the lousy documentation on
the Apache website but I can’t because a lot of folks don’t know me or why I’m 
saying these things.  

 

I will be making another post that I hope clarifies what’s going on with me.  
After that I will either be a freakishly valuable asset to this community or I 
will be a freakishly valuable asset to another community.  

 

You sure have a funny way of reining in people that are used to helping out.  
You sure misjudged me.  Wow.

 

Kenneth Brotman

 

From: Jeff Jirsa [mailto:jji...@gmail.com] 
Sent: Wednesday, February 21, 2018 3:12 PM
To: cassandra
Cc: Cassandra DEV
Subject: Re: Cassandra Needs to Grow Up by Version Five!

 

 

On Wed, Feb 21, 2018 at 2:53 PM, Kenneth Brotman  
wrote:

Hi Akash,

I get the part about outside work which is why in replying to Jeff Jirsa I was 
suggesting the big companies could justify taking it on easy enough and you 
know actually pay the people who would be working at it so those people could 
have a life.

The part I don't get is the aversion to usability.  Isn't that what you think 
about when you are coding?  "Am I making this thing I'm building easy to use?"  
If you were programming for me, we would be constantly talking about what we 
are building and how we can make things easier for users.  If I had to fight 
with a developer, architect or engineer about usability all the time, they 
would be gone and quick.  How do you approach programming if you aren't trying
to make things easy?

 

 

There's no aversion to usability, you're assuming things that just aren't true. 
Nobody's against usability, we've just prioritized other things HIGHER. We make 
those decisions in part by looking at open JIRAs and determining what's asked 
for the most, what members of the community have contributed, and then balance 
that against what we ourselves care about. You're making a statement that it 
should be the top priority for the next release, with no JIRA, no history of 
contributing (and indeed, no real clear sign that you even understand the full 
extent of the database), no sign that you're willing to do the work yourself, 
and making a ton of assumptions about the level of effort and ROI.

 

I would love for Cassandra to be easier to use, I'm sure everyone does. There's 
a dozen features I'd love to add if I had infinite budget and infinite 
manpower. But what you're asking for is A LOT of effort and / or A LOT of 
money, and you're assuming someone's going to step up and foot the bill, but 
there's no real reason to believe that's the case. 

 

In the mean time, everyone's spending hours replying to this thread that is 0% 
actionable. We would all have been objectively better off had everyone ignored 
this thread and just spent 10 minutes writing some section of the docs. So the 
next time I get the urge to reply, I'm just going to do that instead.

 

 

 



Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread kurt greaves
>
> Instead of saying "Make X better" you can quantify "Here's how we can make
> X better" in a jira and the conversation will continue with interested
> parties (opening jiras are free!). Being combative and insulting the project
> on the mailing list may help vent some frustrations but it is
> counterproductive and makes people defensive.

Yep. In the Cassandra project you'll have a very hard time convincing
someone else (under someone else's pay) to work on what you want even if you
approach it in the right way. Being assertive/aggressive is sure to remove
all chances entirely.
OSS for such large projects as Cassandra only works if we have a variety of
perspectives all working on the project together, as it's not very feasible
for volunteers to get into the C* project on their own time (nor will it
ever be). At the moment we don't have enough different perspectives working
on the project and the only way to improve that is get involved (preferably
writing some code).

I have to disagree with people here and point out that just creating JIRAs
and (trying to) have discussions about these issues will not lead to change
in any reasonable timeframe, because everyone who could do the work has an
endless list of bigger fish to fry. I strongly encourage you to get
involved and write some code, or pay someone to do it, because to put it
bluntly, it's *very* unlikely your JIRAs will get actioned unless you
contribute significantly to them yourself.

Of course there are other ways to contribute as well, but by far the
most effective would be to contribute fixes; the next most effective would
be to contribute documentation and help users on the mailing list. Your
Slender Cassandra project is a great example of this, because despite C*
being hard to administer, it would give a lot of users examples to work
off. If people can get it working properly with the right advice, usability
is not such a big issue.


Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Chris Lohfink
Instead of saying "Make X better" you can quantify "Here's how we can make X 
better" in a jira and the conversation will continue with interested parties 
(opening jiras are free!). Being combative and insulting the project on the 
mailing list may help vent some frustrations, but it is counterproductive and 
makes people defensive.

People are not averse to usability, quite the opposite actually. People do tend 
to be averse to conversations opened up with "cassandra is an idiot" with no 
clear definition of how to make it better or what a better solution would look 
like, though. Note, however, that saying "make backups better" or "look at the 
marketing literature for these guys" is hard for an engineer or architect to 
break into an actionable item. Coming up with cool ideas on how to do something 
will more likely hook a developer into working on it than trying to shame the 
community with a sales pitch from another DB's sales guy.

Chris

> On Feb 21, 2018, at 4:53 PM, Kenneth Brotman  
> wrote:
> 
> Hi Akash,
> 
> I get the part about outside work which is why in replying to Jeff Jirsa I 
> was suggesting the big companies could justify taking it on easy enough and 
> you know actually pay the people who would be working at it so those people 
> could have a life.
> 
> The part I don't get is the aversion to usability.  Isn't that what you think 
> about when you are coding?  "Am I making this thing I'm building easy to 
> use?"  If you were programming for me, we would be constantly talking about 
> what we are building and how we can make things easier for users.  If I had 
> to fight with a developer, architect or engineer about usability all the 
> time, they would be gone and quick.  How do you approach programming if you 
> aren't trying to make things easy?
> 
> Kenneth Brotman
> 
> -Original Message-
> From: Akash Gangil [mailto:akashg1...@gmail.com] 
> Sent: Wednesday, February 21, 2018 2:24 PM
> To: d...@cassandra.apache.org
> Cc: user@cassandra.apache.org
> Subject: Re: Cassandra Needs to Grow Up by Version Five!
> 
> I would second Jon in the arguments he made. Contributing outside work is 
> draining and really requires a lot of commitment. If someone requires 
> features around usability etc, just pay for it, period.
> 
> On Wed, Feb 21, 2018 at 2:20 PM, Kenneth Brotman < 
> kenbrot...@yahoo.com.invalid> wrote:
> 
>> Jon,
>> 
>> Very sorry that you don't see the value of the time I'm taking for this.
>> I don't have demands; I do have a stern warning and I'm right Jon.  
>> Please be very careful not to mischaracterize my words Jon.
>> 
>> You suggest I put things in JIRA's, then seem to suggest that I'd be 
>> lucky if anyone looked at it and did anything. That's what I figured too.
>> 
>> I don't appreciate the hostility.  You will understand more fully in 
>> the next post where I'm coming from.  Try to keep the conversation civilized.
>> I'm trying or at least so you understand I think what I'm doing is 
>> saving your gig and mine.  I really like a lot of people in this group.
>> 
>> I've come to a preliminary assessment on things.  Soon the cloud will 
>> clear or I'll be gone.  Don't worry.  I'm a very peaceful person and 
>> like you I am driven by real important projects that I feel compelled 
>> to work on for the good of others.  I don't have time for people to 
>> hand hold a database and I can't get stuck with my projects on the wrong 
>> stuff.
>> 
>> Kenneth Brotman
>> 
>> 
>> -Original Message-
>> From: Jon Haddad [mailto:jonathan.had...@gmail.com] On Behalf Of Jon 
>> Haddad
>> Sent: Wednesday, February 21, 2018 12:44 PM
>> To: user@cassandra.apache.org
>> Cc: d...@cassandra.apache.org
>> Subject: Re: Cassandra Needs to Grow Up by Version Five!
>> 
>> Ken,
>> 
>> Maybe it’s not clear how open source projects work, so let me try to 
>> explain.  There’s a bunch of us who either get paid by someone or 
>> volunteer on our free time.  The folks that get paid, (yay!) usually 
>> take direction on what the priorities are, and work on projects that 
>> directly affect our jobs.  That means that someone needs to care 
>> enough about the features you want to work on them, if you’re not going to 
>> do it yourself.
>> 
>> Now as others have said already, please put your list of demands in 
>> JIRA, if someone is interested, they will work on it.  You may need to 
>> contribute a little more than you’ve done already, be prepared to get 
>> involved if you actually want to see something get done.  Perhaps 
>> learning a little more about Cassandra’s internals and the people 
>> involved will reveal some of the design decisions and priorities of the 
>> project.
>> 
>> Third, you seem to be a little obsessed with market share.  While 
>> market share is fun to talk about, *most* of us that are working on 
>> and contributing to Cassandra do so because it does actually solve a 
>> problem we have, and solves it reasonably well.  If some magic open 
>> 

Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Jason Brown
Hi all,

I'd like to deescalate a bit here.

Since this is an Apache and an OSS project, contributions come in many
forms: code, speaking/advocacy, documentation, support, project management,
and so on. None of these things come for free.

Ken, I appreciate you bringing up these usability topics; they are certainly
valid concerns. You've mentioned you are working on a posting of some sort
that I think will amount to an enumerated list of the topics/issues you
feel need addressing. Some may be simple changes, some may be more
invasive, some we can consider implementing, some not. I look forward to a
positive discussion.

I think what would be best would be for you to complete that list and work
with the community, in a *positive and constructive manner*, towards
getting it done. That is certainly contributing, and contributing in a big
way: project management. Working with the community is going to be the most
beneficial path for everyone.

Ken, if you feel like you'd like some help getting such an initiative
going, and contributing substantively to it (not necessarily in terms of
code), please feel free to reach out to me directly (jasedbr...@gmail.com).

Hoping this leads somewhere positive, that benefits everyone,

-Jason



On Wed, Feb 21, 2018 at 2:53 PM, Kenneth Brotman <
kenbrot...@yahoo.com.invalid> wrote:

> Hi Akash,
>
> I get the part about outside work which is why in replying to Jeff Jirsa I
> was suggesting the big companies could justify taking it on easy enough and
> you know actually pay the people who would be working at it so those people
> could have a life.
>
> The part I don't get is the aversion to usability.  Isn't that what you
> think about when you are coding?  "Am I making this thing I'm building easy
> to use?"  If you were programming for me, we would be constantly talking
> about what we are building and how we can make things easier for users.  If
> I had to fight with a developer, architect or engineer about usability all
> the time, they would be gone and quick.  How do you approach programming if
> you aren't trying to make things easy?
>
> Kenneth Brotman
>
> -Original Message-
> From: Akash Gangil [mailto:akashg1...@gmail.com]
> Sent: Wednesday, February 21, 2018 2:24 PM
> To: d...@cassandra.apache.org
> Cc: user@cassandra.apache.org
> Subject: Re: Cassandra Needs to Grow Up by Version Five!
>
> I would second Jon in the arguments he made. Contributing outside work is
> draining and really requires a lot of commitment. If someone requires
> features around usability etc, just pay for it, period.
>
> On Wed, Feb 21, 2018 at 2:20 PM, Kenneth Brotman <
> kenbrot...@yahoo.com.invalid> wrote:
>
> > Jon,
> >
> > Very sorry that you don't see the value of the time I'm taking for this.
> > I don't have demands; I do have a stern warning and I'm right Jon.
> > Please be very careful not to mischaracterize my words Jon.
> >
> > You suggest I put things in JIRA's, then seem to suggest that I'd be
> > lucky if anyone looked at it and did anything. That's what I figured too.
> >
> > I don't appreciate the hostility.  You will understand more fully in
> > the next post where I'm coming from.  Try to keep the conversation
> civilized.
> > I'm trying or at least so you understand I think what I'm doing is
> > saving your gig and mine.  I really like a lot of people in this group.
> >
> > I've come to a preliminary assessment on things.  Soon the cloud will
> > clear or I'll be gone.  Don't worry.  I'm a very peaceful person and
> > like you I am driven by real important projects that I feel compelled
> > to work on for the good of others.  I don't have time for people to
> > hand hold a database and I can't get stuck with my projects on the wrong
> stuff.
> >
> > Kenneth Brotman
> >
> >
> > -Original Message-
> > From: Jon Haddad [mailto:jonathan.had...@gmail.com] On Behalf Of Jon
> > Haddad
> > Sent: Wednesday, February 21, 2018 12:44 PM
> > To: user@cassandra.apache.org
> > Cc: d...@cassandra.apache.org
> > Subject: Re: Cassandra Needs to Grow Up by Version Five!
> >
> > Ken,
> >
> > Maybe it’s not clear how open source projects work, so let me try to
> > explain.  There’s a bunch of us who either get paid by someone or
> > volunteer on our free time.  The folks that get paid, (yay!) usually
> > take direction on what the priorities are, and work on projects that
> > directly affect our jobs.  That means that someone needs to care
> > enough about the features you want to work on them, if you’re not going
> to do it yourself.
> >
> > Now as others have said already, please put your list of demands in
> > JIRA, if someone is interested, they will work on it.  You may need to
> > contribute a little more than you’ve done already, be prepared to get
> > involved if you actually want to see something get done.  Perhaps
> > learning a little more about Cassandra’s internals and the people
> > involved will reveal some of the design decisions and priorities of the
> project.
> >

Re: Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-21 Thread Carl Mueller
Also, I was wondering if the key cache maintains a count of how many local
accesses a key undergoes. Such information might be very useful for
compactions of sstables by splitting data by frequency of use so that those
can be preferentially compacted.

On Wed, Feb 21, 2018 at 5:08 PM, Carl Mueller 
wrote:

> Looking through the 2.1.X code I see this:
>
> org.apache.cassandra.io.sstable.Component.java
>
> In the enum for component types there is a CUSTOM enum value which seems
> to indicate a catchall for providing metadata for sstables.
>
> Has this been exploited... ever? I noticed in some of the patches for the
> archival options on TWCS there are complaints about not being able to
> identify sstables that are archived and those that aren't.
>
> I would be interested in using it to mark the sstables with metadata
> indicating the date range an sstable is targeted at for compactions.
>
> discoverComponentsFor seems to explicitly exclude the loading of any
> files/sstable components that are CUSTOM in SSTable.java
>
> On Wed, Feb 21, 2018 at 10:05 AM, Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
>> jon: I am planning on writing a custom compaction strategy. That's why
>> the question is here, I figured the specifics of memtable -> sstable and
>> cassandra internals are not a user question. If that still isn't deep
>> enough for the dev thread, I will move all those questions to user.
>>
>> On Wed, Feb 21, 2018 at 9:59 AM, Carl Mueller <
>> carl.muel...@smartthings.com> wrote:
>>
>>> Thank you all!
>>>
>>> On Tue, Feb 20, 2018 at 7:35 PM, kurt greaves 
>>> wrote:
>>>
 Probably a lot of work but it would be incredibly useful for vnodes if
 flushing was range aware (to be used with RangeAwareCompactionStrategy).
 The writers are already range aware for JBOD, but that's not terribly
 valuable ATM.

 On 20 February 2018 at 21:57, Jeff Jirsa  wrote:

> There are some arguments to be made that the flush should consider
> compaction strategy - would allow a big flush to respect LCS filesizes or
> break into smaller pieces to try to minimize range overlaps going from l0
> into l1, for example.
>
> I have no idea how much work would be involved, but may be worthwhile.
>
>
> --
> Jeff Jirsa
>
>
> On Feb 20,  2018, at 1:26 PM, Jon Haddad  wrote:
>
> The file format is independent from compaction.  A compaction strategy
> only selects sstables to be compacted; that’s its only job.  It could have
> side effects, like generating other files, but any decent compaction
> strategy will account for the fact that those other files don’t exist.
>
> I wrote a blog post a few months ago going over some of the nuance of
> compaction you might find informative:
> http://thelastpickle.com/blog/2017/03/16/compaction-nuance.html
>
> This is also the wrong mailing list, please direct future user
> questions to the user list.  The dev list is for development of Cassandra
> itself.
>
> Jon
>
> On Feb 20, 2018, at 1:10 PM, Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
> When memtables/CommitLogs are flushed to disk/sstable, does the sstable go
> through sstable organization specific to each compaction strategy, or is
> the sstable creation the same for all compaction strategies, and it is up
> to the compaction strategy to recompact the sstable if desired?
>
>
>

>>>
>>
>


Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Jeff Jirsa
On Wed, Feb 21, 2018 at 2:53 PM, Kenneth Brotman <
kenbrot...@yahoo.com.invalid> wrote:

> Hi Akash,
>
> I get the part about outside work which is why in replying to Jeff Jirsa I
> was suggesting the big companies could justify taking it on easy enough and
> you know actually pay the people who would be working at it so those people
> could have a life.
>
> The part I don't get is the aversion to usability.  Isn't that what you
> think about when you are coding?  "Am I making this thing I'm building easy
> to use?"  If you were programming for me, we would be constantly talking
> about what we are building and how we can make things easier for users.  If
> I had to fight with a developer, architect or engineer about usability all
> the time, they would be gone and quick.  How do you approach programming if
> you aren't trying to make things easy?
>


There's no aversion to usability, you're assuming things that just aren't
true. Nobody's against usability, we've just prioritized other things
HIGHER. We make those decisions in part by looking at open JIRAs and
determining what's asked for the most, what members of the community have
contributed, and then balance that against what we ourselves care about.
You're making a statement that it should be the top priority for the next
release, with no JIRA, no history of contributing (and indeed, no real
clear sign that you even understand the full extent of the database), no
sign that you're willing to do the work yourself, and making a ton of
assumptions about the level of effort and ROI.

I would love for Cassandra to be easier to use, I'm sure everyone does.
There's a dozen features I'd love to add if I had infinite budget and
infinite manpower. But what you're asking for is A LOT of effort and / or A
LOT of money, and you're assuming someone's going to step up and foot the
bill, but there's no real reason to believe that's the case.

In the mean time, everyone's spending hours replying to this thread that is
0% actionable. We would all have been objectively better off had everyone
ignored this thread and just spent 10 minutes writing some section of the
docs. So the next time I get the urge to reply, I'm just going to do that
instead.


Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Brandon Williams
The only progress from this point is what Jon said: enumerate and detail
your issues in jira tickets.

On Wed, Feb 21, 2018 at 4:53 PM, Kenneth Brotman <
kenbrot...@yahoo.com.invalid> wrote:

> Hi Akash,
>
> I get the part about outside work which is why in replying to Jeff Jirsa I
> was suggesting the big companies could justify taking it on easy enough and
> you know actually pay the people who would be working at it so those people
> could have a life.
>
> The part I don't get is the aversion to usability.  Isn't that what you
> think about when you are coding?  "Am I making this thing I'm building easy
> to use?"  If you were programming for me, we would be constantly talking
> about what we are building and how we can make things easier for users.  If
> I had to fight with a developer, architect or engineer about usability all
> the time, they would be gone and quick.  How do you approach programming if
> you aren't trying to make things easy?
>
> Kenneth Brotman
>
> -Original Message-
> From: Akash Gangil [mailto:akashg1...@gmail.com]
> Sent: Wednesday, February 21, 2018 2:24 PM
> To: d...@cassandra.apache.org
> Cc: user@cassandra.apache.org
> Subject: Re: Cassandra Needs to Grow Up by Version Five!
>
> I would second Jon in the arguments he made. Contributing outside work is
> draining and really requires a lot of commitment. If someone requires
> features around usability etc, just pay for it, period.
>
> On Wed, Feb 21, 2018 at 2:20 PM, Kenneth Brotman <
> kenbrot...@yahoo.com.invalid> wrote:
>
> > Jon,
> >
> > Very sorry that you don't see the value of the time I'm taking for this.
> > I don't have demands; I do have a stern warning and I'm right Jon.
> > Please be very careful not to mischaracterize my words Jon.
> >
> > You suggest I put things in JIRA's, then seem to suggest that I'd be
> > lucky if anyone looked at it and did anything. That's what I figured too.
> >
> > I don't appreciate the hostility.  You will understand more fully in
> > the next post where I'm coming from.  Try to keep the conversation
> civilized.
> > I'm trying or at least so you understand I think what I'm doing is
> > saving your gig and mine.  I really like a lot of people in this group.
> >
> > I've come to a preliminary assessment on things.  Soon the cloud will
> > clear or I'll be gone.  Don't worry.  I'm a very peaceful person and
> > like you I am driven by real important projects that I feel compelled
> > to work on for the good of others.  I don't have time for people to
> > hand hold a database and I can't get stuck with my projects on the wrong
> stuff.
> >
> > Kenneth Brotman
> >
> >
> > -Original Message-
> > From: Jon Haddad [mailto:jonathan.had...@gmail.com] On Behalf Of Jon
> > Haddad
> > Sent: Wednesday, February 21, 2018 12:44 PM
> > To: user@cassandra.apache.org
> > Cc: d...@cassandra.apache.org
> > Subject: Re: Cassandra Needs to Grow Up by Version Five!
> >
> > Ken,
> >
> > Maybe it’s not clear how open source projects work, so let me try to
> > explain.  There’s a bunch of us who either get paid by someone or
> > volunteer on our free time.  The folks that get paid, (yay!) usually
> > take direction on what the priorities are, and work on projects that
> > directly affect our jobs.  That means that someone needs to care
> > enough about the features you want to work on them, if you’re not going
> to do it yourself.
> >
> > Now as others have said already, please put your list of demands in
> > JIRA, if someone is interested, they will work on it.  You may need to
> > contribute a little more than you’ve done already, be prepared to get
> > involved if you actually want to see something get done.  Perhaps
> > learning a little more about Cassandra’s internals and the people
> > involved will reveal some of the design decisions and priorities of the
> project.
> >
> > Third, you seem to be a little obsessed with market share.  While
> > market share is fun to talk about, *most* of us that are working on
> > and contributing to Cassandra do so because it does actually solve a
> > problem we have, and solves it reasonably well.  If some magic open
> > source DB appears out of no where and does everything you want
> > Cassandra to, and is bug free, keeps your data consistent,
> > automatically does backups, comes with really nice cert management, ad
> > hoc querying, amazing materialized views that are perfect, no caveats
> > to secondary indexes, and somehow still gives you linear scalability
> > without any mental overhead whatsoever then sure, people might start
> > using it.  And that’s actually OK, because if that happens we’ll all
> > be incredibly pumped out of our minds because we won’t have to work as
> > hard.  If on the slim chance that doesn’t manifest, those of us that
> > use Cassandra and are part of the community will keep working on the
> > things we care about, iterating, and improving things.  Maybe someone
> will even take a look at your JIRA issues.
> >
> > 

Re: Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-21 Thread Carl Mueller
Looking through the 2.1.X code I see this:

org.apache.cassandra.io.sstable.Component.java

In the enum for component types there is a CUSTOM enum value which seems to
indicate a catchall for providing metadata for sstables.

Has this been exploited... ever? I noticed in some of the patches for the
archival options on TWCS there are complaints about not being able to
identify sstables that are archived and those that aren't.

I would be interested in using it to mark the sstables with metadata
indicating the date range an sstable is targeted at for compactions.

discoverComponentsFor seems to explicitly exclude the loading of any
files/sstable components that are CUSTOM in SSTable.java
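To make the idea concrete: the kind of date-range tag I have in mind could live in a small sidecar file next to the sstable's other components. This is only a sketch under my own assumptions (the file naming, the `TimeWindowTag` class, and the component suffix are all invented here, not Cassandra's actual CUSTOM component API):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical sketch: persist a compaction-relevant time window for an
// sstable as a sidecar "custom component" file stored alongside the
// sstable's other component files.
class TimeWindowTag {
    final long minTimestampMillis;
    final long maxTimestampMillis;

    TimeWindowTag(long min, long max) {
        this.minTimestampMillis = min;
        this.maxTimestampMillis = max;
    }

    // e.g. mc-1-big-Data.db -> mc-1-big-TimeWindow.tag (naming invented)
    static Path componentPath(Path sstableData) {
        String name = sstableData.getFileName().toString()
                .replace("-Data.db", "-TimeWindow.tag");
        return sstableData.resolveSibling(name);
    }

    void write(Path sstableData) throws IOException {
        Properties p = new Properties();
        p.setProperty("min", Long.toString(minTimestampMillis));
        p.setProperty("max", Long.toString(maxTimestampMillis));
        try (OutputStream out = Files.newOutputStream(componentPath(sstableData))) {
            p.store(out, "sstable time window tag");
        }
    }

    static TimeWindowTag read(Path sstableData) throws IOException {
        Properties p = new Properties();
        try (InputStream in = Files.newInputStream(componentPath(sstableData))) {
            p.load(in);
        }
        return new TimeWindowTag(Long.parseLong(p.getProperty("min")),
                                 Long.parseLong(p.getProperty("max")));
    }
}
```

A custom compaction strategy could then read the tag at startup to decide which time bucket an sstable belongs to, without reopening the data file, assuming discoverComponentsFor were fixed to surface CUSTOM components.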

On Wed, Feb 21, 2018 at 10:05 AM, Carl Mueller  wrote:

> jon: I am planning on writing a custom compaction strategy. That's why the
> question is here, I figured the specifics of memtable -> sstable and
> cassandra internals are not a user question. If that still isn't deep
> enough for the dev thread, I will move all those questions to user.
>
> On Wed, Feb 21, 2018 at 9:59 AM, Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
>> Thank you all!
>>
>> On Tue, Feb 20, 2018 at 7:35 PM, kurt greaves 
>> wrote:
>>
>>> Probably a lot of work but it would be incredibly useful for vnodes if
>>> flushing was range aware (to be used with RangeAwareCompactionStrategy).
>>> The writers are already range aware for JBOD, but that's not terribly
>>> valuable ATM.
>>>
>>> On 20 February 2018 at 21:57, Jeff Jirsa  wrote:
>>>
 There are some arguments to be made that the flush should consider
 compaction strategy - would allow a big flush to respect LCS filesizes or
 break into smaller pieces to try to minimize range overlaps going from l0
 into l1, for example.

 I have no idea how much work would be involved, but may be worthwhile.


 --
 Jeff Jirsa


 On Feb 20,  2018, at 1:26 PM, Jon Haddad  wrote:

 The file format is independent from compaction.  A compaction strategy
 only selects sstables to be compacted; that’s its only job.  It could have
 side effects, like generating other files, but any decent compaction
 strategy will account for the fact that those other files don’t exist.

 I wrote a blog post a few months ago going over some of the nuance of
 compaction you might find informative:
 http://thelastpickle.com/blog/2017/03/16/compaction-nuance.html

 This is also the wrong mailing list, please direct future user
 questions to the user list.  The dev list is for development of Cassandra
 itself.

 Jon

 On Feb 20, 2018, at 1:10 PM, Carl Mueller 
 wrote:

 When memtables/CommitLogs are flushed to disk/sstable, does the sstable go
 through sstable organization specific to each compaction strategy, or is
 the sstable creation the same for all compaction strategies, and it is up
 to the compaction strategy to recompact the sstable if desired?



>>>
>>
>
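For readers skimming the thread, Jon's point above (a compaction strategy only selects sstables to compact) can be sketched roughly like this. This is illustrative only, not Cassandra's real `AbstractCompactionStrategy` interface; the names and the toy size-tiered rule are invented:

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch: the strategy's core job is selection -- given the
// live sstables, return the next batch to merge, or an empty list if
// nothing is worth compacting. SSTable writing/format is handled elsewhere.
interface CompactionStrategy {
    List<String> nextBackgroundTask(Set<String> liveSSTables);
}

// A toy size-tiered flavor: compact everything once at least
// `threshold` sstables exist.
class ToySizeTiered implements CompactionStrategy {
    private final int threshold;

    ToySizeTiered(int threshold) { this.threshold = threshold; }

    @Override
    public List<String> nextBackgroundTask(Set<String> live) {
        return live.size() >= threshold ? List.copyOf(live) : List.of();
    }
}
```

Memtable flush, by contrast, produces sstables through a single write path regardless of which strategy is configured; the strategy only sees the resulting files and picks candidates from them.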


RE: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Kenneth Brotman
Hi Akash,

I get the part about outside work which is why in replying to Jeff Jirsa I was 
suggesting the big companies could justify taking it on easy enough and you 
know actually pay the people who would be working at it so those people could 
have a life.

The part I don't get is the aversion to usability.  Isn't that what you think 
about when you are coding?  "Am I making this thing I'm building easy to use?"  
If you were programming for me, we would be constantly talking about what we 
are building and how we can make things easier for users.  If I had to fight 
with a developer, architect or engineer about usability all the time, they 
would be gone and quick.  How do you approach programming if you aren't trying
to make things easy?

Kenneth Brotman

-Original Message-
From: Akash Gangil [mailto:akashg1...@gmail.com] 
Sent: Wednesday, February 21, 2018 2:24 PM
To: d...@cassandra.apache.org
Cc: user@cassandra.apache.org
Subject: Re: Cassandra Needs to Grow Up by Version Five!

I would second Jon in the arguments he made. Contributing outside work is 
draining and really requires a lot of commitment. If someone requires features 
around usability etc, just pay for it, period.

On Wed, Feb 21, 2018 at 2:20 PM, Kenneth Brotman < 
kenbrot...@yahoo.com.invalid> wrote:

> Jon,
>
> Very sorry that you don't see the value of the time I'm taking for this.
> I don't have demands; I do have a stern warning and I'm right Jon.  
> Please be very careful not to mischaracterize my words Jon.
>
> You suggest I put things in JIRA's, then seem to suggest that I'd be 
> lucky if anyone looked at it and did anything. That's what I figured too.
>
> I don't appreciate the hostility.  You will understand more fully in 
> the next post where I'm coming from.  Try to keep the conversation civilized.
> I'm trying, or at least I hope you understand that I think what I'm doing is 
> saving your gig and mine.  I really like a lot of people in this group.
>
> I've come to a preliminary assessment on things.  Soon the cloud will 
> clear or I'll be gone.  Don't worry.  I'm a very peaceful person and 
> like you I am driven by real important projects that I feel compelled 
> to work on for the good of others.  I don't have time for people to 
> hand hold a database and I can't get stuck with my projects on the wrong 
> stuff.
>
> Kenneth Brotman
>
>
> -Original Message-
> From: Jon Haddad [mailto:jonathan.had...@gmail.com] On Behalf Of Jon 
> Haddad
> Sent: Wednesday, February 21, 2018 12:44 PM
> To: user@cassandra.apache.org
> Cc: d...@cassandra.apache.org
> Subject: Re: Cassandra Needs to Grow Up by Version Five!
>
> Ken,
>
> Maybe it’s not clear how open source projects work, so let me try to 
> explain.  There’s a bunch of us who either get paid by someone or 
> volunteer on our free time.  The folks that get paid, (yay!) usually 
> take direction on what the priorities are, and work on projects that 
> directly affect our jobs.  That means that someone needs to care 
> enough about the features you want to work on them, if you’re not going to do 
> it yourself.
>
> Now as others have said already, please put your list of demands in 
> JIRA, if someone is interested, they will work on it.  You may need to 
> contribute a little more than you’ve done already, be prepared to get 
> involved if you actually want to see something get done.  Perhaps 
> learning a little more about Cassandra’s internals and the people 
> involved will reveal some of the design decisions and priorities of the 
> project.
>
> Third, you seem to be a little obsessed with market share.  While 
> market share is fun to talk about, *most* of us that are working on 
> and contributing to Cassandra do so because it does actually solve a 
> problem we have, and solves it reasonably well.  If some magic open 
> source DB appears out of nowhere and does everything you want 
> Cassandra to, and is bug free, keeps your data consistent, 
> automatically does backups, comes with really nice cert management, ad 
> hoc querying, amazing materialized views that are perfect, no caveats 
> to secondary indexes, and somehow still gives you linear scalability 
> without any mental overhead whatsoever then sure, people might start 
> using it.  And that’s actually OK, because if that happens we’ll all 
> be incredibly pumped out of our minds because we won’t have to work as 
> hard.  If on the slim chance that doesn’t manifest, those of us that 
> use Cassandra and are part of the community will keep working on the 
> things we care about, iterating, and improving things.  Maybe someone will 
> even take a look at your JIRA issues.
>
> Further filling the mailing list with your grievances will likely not 
> help you progress towards your goal of a Cassandra that’s easier to 
> use, so I encourage you to try to be a little more productive and try 
> to help rather than just complain, which is not constructive.  I did a 
> quick search for your name on the 

Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Akash Gangil
I would second Jon in the arguments he made. Contributing outside work is
draining and really requires a lot of commitment. If someone requires
features around usability etc, just pay for it, period.

On Wed, Feb 21, 2018 at 2:20 PM, Kenneth Brotman <
kenbrot...@yahoo.com.invalid> wrote:

> Jon,
>
> Very sorry that you don't see the value of the time I'm taking for this.
> I don't have demands; I do have a stern warning and I'm right Jon.  Please
> be very careful not to mischaracterize my words Jon.
>
> You suggest I put things in JIRA's, then seem to suggest that I'd be lucky
> if anyone looked at it and did anything. That's what I figured too.
>
> I don't appreciate the hostility.  You will understand more fully in the
> next post where I'm coming from.  Try to keep the conversation civilized.
> I'm trying, or at least I hope you understand that I think what I'm doing is saving
> your gig and mine.  I really like a lot of people in this group.
>
> I've come to a preliminary assessment on things.  Soon the cloud will
> clear or I'll be gone.  Don't worry.  I'm a very peaceful person and like
> you I am driven by real important projects that I feel compelled to work on
> for the good of others.  I don't have time for people to hand hold a
> database and I can't get stuck with my projects on the wrong stuff.
>
> Kenneth Brotman
>
>
> -Original Message-
> From: Jon Haddad [mailto:jonathan.had...@gmail.com] On Behalf Of Jon
> Haddad
> Sent: Wednesday, February 21, 2018 12:44 PM
> To: user@cassandra.apache.org
> Cc: d...@cassandra.apache.org
> Subject: Re: Cassandra Needs to Grow Up by Version Five!
>
> Ken,
>
> Maybe it’s not clear how open source projects work, so let me try to
> explain.  There’s a bunch of us who either get paid by someone or volunteer
> on our free time.  The folks that get paid, (yay!) usually take direction
> on what the priorities are, and work on projects that directly affect our
> jobs.  That means that someone needs to care enough about the features you
> want to work on them, if you’re not going to do it yourself.
>
> Now as others have said already, please put your list of demands in JIRA,
> if someone is interested, they will work on it.  You may need to contribute
> a little more than you’ve done already, be prepared to get involved if you
> actually want to see something get done.  Perhaps learning a little more
> about Cassandra’s internals and the people involved will reveal some of the
> design decisions and priorities of the project.
>
> Third, you seem to be a little obsessed with market share.  While market
> share is fun to talk about, *most* of us that are working on and
> contributing to Cassandra do so because it does actually solve a problem we
> have, and solves it reasonably well.  If some magic open source DB appears
> out of nowhere and does everything you want Cassandra to, and is bug free,
> keeps your data consistent, automatically does backups, comes with really
> nice cert management, ad hoc querying, amazing materialized views that are
> perfect, no caveats to secondary indexes, and somehow still gives you
> linear scalability without any mental overhead whatsoever then sure, people
> might start using it.  And that’s actually OK, because if that happens
> we’ll all be incredibly pumped out of our minds because we won’t have to
> work as hard.  If on the slim chance that doesn’t manifest, those of us
> that use Cassandra and are part of the community will keep working on the
> things we care about, iterating, and improving things.  Maybe someone will
> even take a look at your JIRA issues.
>
> Further filling the mailing list with your grievances will likely not help
> you progress towards your goal of a Cassandra that’s easier to use, so I
> encourage you to try to be a little more productive and try to help rather
> than just complain, which is not constructive.  I did a quick search for
> your name on the mailing list, and I’ve seen very little from you, so to
> everyone who’s been around for a while and trying to help you it looks
> like you’re just some random dude asking for people to work for free on the
> things you’re asking for, without offering anything back in return.
>
> Jon
>
>
> > On Feb 21, 2018, at 11:56 AM, Kenneth Brotman
>  wrote:
> >
> > Josh,
> >
> > To say nothing is indifference.  If you care about your community,
> sometimes don't you have to bring up a subject even though you know it's
> also temporarily adding some discomfort?
> >
> > As to opening a JIRA, I've got a very specific topic to try in mind
> now.  An easy one I'll work on and then announce.  Someone else will have
> to do the coding.  A year from now I would probably just knock it out to
> make sure it's as easy as I expect it to be but to be honest, as I've been
> saying, I'm not set up to do that right now.  I've barely looked at any
> Cassandra code; for one; everyone on this list probably codes more than I
> do, secondly; and 

RE: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Kenneth Brotman
Jon,

Very sorry that you don't see the value of the time I'm taking for this.  I 
don't have demands; I do have a stern warning and I'm right Jon.  Please be 
very careful not to mischaracterize my words Jon.

You suggest I put things in JIRA's, then seem to suggest that I'd be lucky if 
anyone looked at it and did anything. That's what I figured too.  

I don't appreciate the hostility.  You will understand more fully in the next 
post where I'm coming from.  Try to keep the conversation civilized.  I'm 
trying, or at least I hope you understand that I think what I'm doing is saving your gig 
and mine.  I really like a lot of people in this group.

I've come to a preliminary assessment on things.  Soon the cloud will clear or 
I'll be gone.  Don't worry.  I'm a very peaceful person and like you I am 
driven by real important projects that I feel compelled to work on for the good 
of others.  I don't have time for people to hand hold a database and I can't 
get stuck with my projects on the wrong stuff.  

Kenneth Brotman


-Original Message-
From: Jon Haddad [mailto:jonathan.had...@gmail.com] On Behalf Of Jon Haddad
Sent: Wednesday, February 21, 2018 12:44 PM
To: user@cassandra.apache.org
Cc: d...@cassandra.apache.org
Subject: Re: Cassandra Needs to Grow Up by Version Five!

Ken,

Maybe it’s not clear how open source projects work, so let me try to explain.  
There’s a bunch of us who either get paid by someone or volunteer on our free 
time.  The folks that get paid, (yay!) usually take direction on what the 
priorities are, and work on projects that directly affect our jobs.  That means 
that someone needs to care enough about the features you want to work on them, 
if you’re not going to do it yourself. 

Now as others have said already, please put your list of demands in JIRA, if 
someone is interested, they will work on it.  You may need to contribute a 
little more than you’ve done already, be prepared to get involved if you 
actually want to see something get done.  Perhaps learning a little more 
about Cassandra’s internals and the people involved will reveal some of the 
design decisions and priorities of the project.  

Third, you seem to be a little obsessed with market share.  While market share 
is fun to talk about, *most* of us that are working on and contributing to 
Cassandra do so because it does actually solve a problem we have, and solves it 
reasonably well.  If some magic open source DB appears out of nowhere and does 
everything you want Cassandra to, and is bug free, keeps your data consistent, 
automatically does backups, comes with really nice cert management, ad hoc 
querying, amazing materialized views that are perfect, no caveats to secondary 
indexes, and somehow still gives you linear scalability without any mental 
overhead whatsoever then sure, people might start using it.  And that’s 
actually OK, because if that happens we’ll all be incredibly pumped out of our 
minds because we won’t have to work as hard.  If on the slim chance that 
doesn’t manifest, those of us that use Cassandra and are part of the community 
will keep working on the things we care about, iterating, and improving things. 
 Maybe someone will even take a look at your JIRA issues.  

Further filling the mailing list with your grievances will likely not help you 
progress towards your goal of a Cassandra that’s easier to use, so I encourage 
you to try to be a little more productive and try to help rather than just 
complain, which is not constructive.  I did a quick search for your name on the 
mailing list, and I’ve seen very little from you, so to everyone who’s been 
around for a while and trying to help you it looks like you’re just some random 
dude asking for people to work for free on the things you’re asking for, 
without offering anything back in return.

Jon


> On Feb 21, 2018, at 11:56 AM, Kenneth Brotman  
> wrote:
> 
> Josh,
> 
> To say nothing is indifference.  If you care about your community, sometimes 
> don't you have to bring up a subject even though you know it's also 
> temporarily adding some discomfort?  
> 
> As to opening a JIRA, I've got a very specific topic to try in mind now.  An 
> easy one I'll work on and then announce.  Someone else will have to do the 
> coding.  A year from now I would probably just knock it out to make sure it's 
> as easy as I expect it to be but to be honest, as I've been saying, I'm not 
> set up to do that right now.  I've barely looked at any Cassandra code; for 
> one; everyone on this list probably codes more than I do, secondly; and 
> lastly, it's a good one for someone that wants an easy one to start with: 
> vNodes.  I've already seen too many people seeking assistance with the vNode 
> setting.
> 
> And you can expect as others have been mentioning that there should be 
> similar ones on compaction, repair and backup. 
> 
> Microsoft knows poor usability gives them an easy market to take over. And 
> 

Re: Performance Of IN Queries On Wide Rows

2018-02-21 Thread Jeff Jirsa
Slight nuance: we don't load the whole row into memory, but the column
index (and the result set, and the tombstones in the partition), which can
still spike your GC/heap (and potentially overflow the row cache, if you
have it on, which is atypical).

On Wed, Feb 21, 2018 at 1:35 PM, Carl Mueller 
wrote:

> Cass 2.1.14 is missing some wide row optimizations done in later cass
> releases IIRC.
>
> Speculation: IN won't matter; it will load the entire wide row into memory
> regardless, which might spike your GC/heap and overflow the row cache.
>
> On Wed, Feb 21, 2018 at 2:16 PM, Gareth Collins <
> gareth.o.coll...@gmail.com> wrote:
>
>> Thanks for the response!
>>
>> I could understand that being the case if the Cassandra cluster is not
>> loaded. Splitting the work across multiple nodes would obviously make
>> the query faster.
>>
>> But if this was just a single node, shouldn't one IN query be faster
>> than multiple due to the fact that, if I understand correctly,
>> Cassandra should need to do less work?
>>
>> thanks in advance,
>> Gareth
>>
>> On Wed, Feb 21, 2018 at 7:27 AM, Rahul Singh
>>  wrote:
>> > That depends on the driver you use but separate queries asynchronously
>> > around the cluster would be faster.
>> >
>> >
>> > --
>> > Rahul Singh
>> > rahul.si...@anant.us
>> >
>> > Anant Corporation
>> >
>> > On Feb 20, 2018, 6:48 PM -0500, Eric Stevens ,
>> wrote:
>> >
>> > Someone can correct me if I'm wrong, but I believe if you do a large
>> IN() on
>> > a single partition's cluster keys, all the reads are going to be served
>> from
>> > a single replica.  Compared to many concurrent individual equal
>> statements
>> > you can get the performance gain of leaning on several replicas for
>> > parallelism.
>> >
>> > On Tue, Feb 20, 2018 at 11:43 AM Gareth Collins <
>> gareth.o.coll...@gmail.com>
>> > wrote:
>> >>
>> >> Hello,
>> >>
>> >> When querying large wide rows for multiple specific values is it
>> >> better to do separate queries for each value...or do it with one query
>> >> and an "IN"? I am using Cassandra 2.1.14
>> >>
>> >> I am asking because I had changed my app to use 'IN' queries and it
>> >> **appears** to be slower rather than faster. I had assumed that the
>> >> "IN" query should be faster...as I assumed it only needs to go down
>> >> the read path once (i.e. row cache -> memtable -> key cache -> bloom
>> >> filter -> index summary -> index -> compaction -> sstable) rather than
>> >> once for each entry? Or are there some additional caveats that I
>> >> should be aware of for 'IN' query performance (e.g. ordering of 'IN'
>> >> query entries, closeness of 'IN' query values in the SSTable etc.)?
>> >>
>> >> thanks in advance,
>> >> Gareth Collins
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> >> For additional commands, e-mail: user-h...@cassandra.apache.org
>> >>
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>
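
Eric's point about replica parallelism can be illustrated with a toy model. Everything here is hypothetical (node names, a trivially simplified round-robin policy); real drivers and Cassandra's read path are far more involved, but the load-distribution difference is the essence of the argument:

```python
# Toy illustration: with RF=3, all clustering keys of one partition live on
# the same three replicas.  A single IN(...) is one request served by one of
# them, while N separate single-key queries issued asynchronously can be
# spread across all three.  Node names and routing are made up for the sketch.

from collections import Counter
from itertools import cycle

REPLICAS = ["node1", "node2", "node3"]  # the replica set for this partition

def run_in_query(keys):
    """One IN query: a single replica does all the reads."""
    return Counter({REPLICAS[0]: len(keys)})

def run_individual_queries(keys):
    """N async single-key queries round-robined over the replica set."""
    load = Counter()
    for _key, node in zip(keys, cycle(REPLICAS)):
        load[node] += 1
    return load

keys = list(range(9))
print(dict(run_in_query(keys)))           # all 9 reads land on one node
print(dict(run_individual_queries(keys))) # 3 reads per node
```

On an unloaded single node the IN query can still win by walking the read path once, which is exactly the trade-off Gareth is asking about; on a loaded cluster the fan-out version wins by parallelism.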


Re: Performance Of IN Queries On Wide Rows

2018-02-21 Thread Carl Mueller
Cass 2.1.14 is missing some wide row optimizations done in later cass
releases IIRC.

Speculation: IN won't matter; it will load the entire wide row into memory
regardless, which might spike your GC/heap and overflow the row cache.

On Wed, Feb 21, 2018 at 2:16 PM, Gareth Collins 
wrote:

> Thanks for the response!
>
> I could understand that being the case if the Cassandra cluster is not
> loaded. Splitting the work across multiple nodes would obviously make
> the query faster.
>
> But if this was just a single node, shouldn't one IN query be faster
> than multiple due to the fact that, if I understand correctly,
> Cassandra should need to do less work?
>
> thanks in advance,
> Gareth
>
> On Wed, Feb 21, 2018 at 7:27 AM, Rahul Singh
>  wrote:
> > That depends on the driver you use but separate queries asynchronously
> > around the cluster would be faster.
> >
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
> >
> > Anant Corporation
> >
> > On Feb 20, 2018, 6:48 PM -0500, Eric Stevens , wrote:
> >
> > Someone can correct me if I'm wrong, but I believe if you do a large
> IN() on
> > a single partition's cluster keys, all the reads are going to be served
> from
> > a single replica.  Compared to many concurrent individual equal
> statements
> > you can get the performance gain of leaning on several replicas for
> > parallelism.
> >
> > On Tue, Feb 20, 2018 at 11:43 AM Gareth Collins <
> gareth.o.coll...@gmail.com>
> > wrote:
> >>
> >> Hello,
> >>
> >> When querying large wide rows for multiple specific values is it
> >> better to do separate queries for each value...or do it with one query
> >> and an "IN"? I am using Cassandra 2.1.14
> >>
> >> I am asking because I had changed my app to use 'IN' queries and it
> >> **appears** to be slower rather than faster. I had assumed that the
> >> "IN" query should be faster...as I assumed it only needs to go down
> >> the read path once (i.e. row cache -> memtable -> key cache -> bloom
> >> filter -> index summary -> index -> compaction -> sstable) rather than
> >> once for each entry? Or are there some additional caveats that I
> >> should be aware of for 'IN' query performance (e.g. ordering of 'IN'
> >> query entries, closeness of 'IN' query values in the SSTable etc.)?
> >>
> >> thanks in advance,
> >> Gareth Collins
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> >> For additional commands, e-mail: user-h...@cassandra.apache.org
> >>
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
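
The lookup chain Gareth lists (row cache -> memtable -> key cache -> bloom filter -> index summary -> index -> sstable) can be sketched as a cascade of checks. This is a toy model with hypothetical stage behaviour, not the actual Cassandra read path, but it shows why repeating the walk per query has a cost:

```python
# Toy cascade: each read walks the same ordered stages until one can answer.
# One IN query walks this chain once, while N separate queries repeat the
# whole walk N times.  Purely illustrative; stage semantics are simplified.

READ_PATH = ["row_cache", "memtable", "key_cache", "bloom_filter",
             "index_summary", "partition_index", "sstable"]

def read(key, populated):
    """Return (stage_that_answered, stages_visited) for one key lookup.

    populated maps a stage name to the set of keys it can serve.
    """
    visited = []
    for stage in READ_PATH:
        visited.append(stage)
        if key in populated.get(stage, set()):
            return stage, visited
    return None, visited

populated = {"row_cache": {"hot"}, "sstable": {"hot", "cold"}}
print(read("hot", populated))   # answered immediately by the row cache
print(read("cold", populated))  # falls all the way through to the sstable
```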


Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-21 Thread Carl Mueller
Hmm, nodetool decommission performs the stream-out of the replicated data, and
you said that was apparently without error...

But if you dropped three nodes from one AZ/rack of a five-node rack with RF3, then
we are missing a replica unless NetworkTopologyStrategy fails over to
another AZ. But that would also entail cross-AZ streaming, queries, and
repair.

On Wed, Feb 21, 2018 at 3:30 PM, Carl Mueller 
wrote:

> sorry for the idiot questions...
>
> data was allowed to fully rebalance/repair/drain before the next node was
> taken off?
>
> did you take 1 off per rack/AZ?
>
>
> On Wed, Feb 21, 2018 at 12:29 PM, Fred Habash  wrote:
>
>> One node at a time
>>
>> On Feb 21, 2018 10:23 AM, "Carl Mueller" 
>> wrote:
>>
>>> What is your replication factor?
>>> Single datacenter, three availability zones, is that right?
>>> You removed one node at a time or three at once?
>>>
>>> On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash  wrote:
>>>
 We have had a 15 node cluster across three zones and cluster repairs
 using ‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrunk
 the cluster to 12. Since then, same repair job has taken up to 12 hours to
 finish and most times, it never does.



 More importantly, at some point during the repair cycle, we see read
 latencies jumping to 1-2 seconds and applications immediately notice the
 impact.



 stream_throughput_outbound_megabits_per_sec is set at 200 and
 compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is
 around ~500GB at 44% usage.



 When shrinking the cluster, the ‘nodetool decommission’ was uneventful.
 It completed successfully with no issues.



 What could possibly cause repairs to cause this impact following
 cluster downsizing? Taking three nodes out does not seem compatible with
 such a drastic effect on repair and read latency.



 Any expert insights will be appreciated.

 
 Thank you



>>>
>>>
>
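
Carl's rack question matters because NetworkTopologyStrategy tries to place the RF replicas of each partition in distinct racks. The toy placement model below (heavily simplified: no vnodes, no real token math; node and rack names are invented) sketches why removing several nodes from one rack behaves differently from removing one node per rack:

```python
# Toy NetworkTopologyStrategy-style placement: walk the ring in order and
# take RF nodes, preferring racks not yet used for this partition, then
# topping up if fewer distinct racks than RF remain.  Illustrative only.

def place_replicas(ring, rf):
    """ring: ordered list of (node, rack); returns the chosen replica nodes."""
    chosen, racks = [], set()
    # First pass: one replica per distinct rack.
    for node, rack in ring:
        if len(chosen) < rf and rack not in racks:
            chosen.append(node)
            racks.add(rack)
    # Second pass: top up from remaining nodes if racks ran out.
    for node, rack in ring:
        if len(chosen) < rf and node not in chosen:
            chosen.append(node)
    return chosen

ring = [("a1", "rack-a"), ("b1", "rack-b"), ("c1", "rack-c"),
        ("a2", "rack-a"), ("b2", "rack-b"), ("c2", "rack-c")]

print(place_replicas(ring, 3))  # one replica in each rack

# Remove every rack-c node: RF=3 now needs two replicas in one rack, so
# the surviving racks absorb extra data, streaming, and repair load.
shrunk = [(n, r) for n, r in ring if r != "rack-c"]
print(place_replicas(shrunk, 3))
```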


Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-21 Thread Carl Mueller
sorry for the idiot questions...

data was allowed to fully rebalance/repair/drain before the next node was
taken off?

did you take 1 off per rack/AZ?


On Wed, Feb 21, 2018 at 12:29 PM, Fred Habash  wrote:

> One node at a time
>
> On Feb 21, 2018 10:23 AM, "Carl Mueller" 
> wrote:
>
>> What is your replication factor?
>> Single datacenter, three availability zones, is that right?
>> You removed one node at a time or three at once?
>>
>> On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash  wrote:
>>
>>> We have had a 15 node cluster across three zones and cluster repairs
>>> using ‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrunk
>>> the cluster to 12. Since then, same repair job has taken up to 12 hours to
>>> finish and most times, it never does.
>>>
>>>
>>>
>>> More importantly, at some point during the repair cycle, we see read
>>> latencies jumping to 1-2 seconds and applications immediately notice the
>>> impact.
>>>
>>>
>>>
>>> stream_throughput_outbound_megabits_per_sec is set at 200 and
>>> compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is
>>> around ~500GB at 44% usage.
>>>
>>>
>>>
>>> When shrinking the cluster, the ‘nodetool decommission’ was uneventful. It
>>> completed successfully with no issues.
>>>
>>>
>>>
>>> What could possibly cause repairs to cause this impact following cluster
>>> downsizing? Taking three nodes out does not seem compatible with such a
>>> drastic effect on repair and read latency.
>>>
>>>
>>>
>>> Any expert insights will be appreciated.
>>>
>>> 
>>> Thank you
>>>
>>>
>>>
>>
>>


Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Jon Haddad
Ken,

Maybe it’s not clear how open source projects work, so let me try to explain.  
There’s a bunch of us who either get paid by someone or volunteer on our free 
time.  The folks that get paid, (yay!) usually take direction on what the 
priorities are, and work on projects that directly affect our jobs.  That means 
that someone needs to care enough about the features you want to work on them, 
if you’re not going to do it yourself. 

Now as others have said already, please put your list of demands in JIRA, if 
someone is interested, they will work on it.  You may need to contribute a 
little more than you’ve done already, be prepared to get involved if you 
actually want to see something get done.  Perhaps learning a little more 
about Cassandra’s internals and the people involved will reveal some of the 
design decisions and priorities of the project.  

Third, you seem to be a little obsessed with market share.  While market share 
is fun to talk about, *most* of us that are working on and contributing to 
Cassandra do so because it does actually solve a problem we have, and solves it 
reasonably well.  If some magic open source DB appears out of nowhere and does 
everything you want Cassandra to, and is bug free, keeps your data consistent, 
automatically does backups, comes with really nice cert management, ad hoc 
querying, amazing materialized views that are perfect, no caveats to secondary 
indexes, and somehow still gives you linear scalability without any mental 
overhead whatsoever then sure, people might start using it.  And that’s 
actually OK, because if that happens we’ll all be incredibly pumped out of our 
minds because we won’t have to work as hard.  If on the slim chance that 
doesn’t manifest, those of us that use Cassandra and are part of the community 
will keep working on the things we care about, iterating, and improving things. 
 Maybe someone will even take a look at your JIRA issues.  

Further filling the mailing list with your grievances will likely not help you 
progress towards your goal of a Cassandra that’s easier to use, so I encourage 
you to try to be a little more productive and try to help rather than just 
complain, which is not constructive.  I did a quick search for your name on the 
mailing list, and I’ve seen very little from you, so to everyone who’s been 
around for a while and trying to help you it looks like you’re just some random 
dude asking for people to work for free on the things you’re asking for, 
without offering anything back in return.

Jon


> On Feb 21, 2018, at 11:56 AM, Kenneth Brotman  
> wrote:
> 
> Josh, 
> 
> To say nothing is indifference.  If you care about your community, sometimes 
> don't you have to bring up a subject even though you know it's also 
> temporarily adding some discomfort?  
> 
> As to opening a JIRA, I've got a very specific topic to try in mind now.  An 
> easy one I'll work on and then announce.  Someone else will have to do the 
> coding.  A year from now I would probably just knock it out to make sure it's 
> as easy as I expect it to be but to be honest, as I've been saying, I'm not 
> set up to do that right now.  I've barely looked at any Cassandra code; for 
> one; everyone on this list probably codes more than I do, secondly; and 
> lastly, it's a good one for someone that wants an easy one to start with: 
> vNodes.  I've already seen too many people seeking assistance with the vNode 
> setting.
> 
> And you can expect as others have been mentioning that there should be 
> similar ones on compaction, repair and backup. 
> 
> Microsoft knows poor usability gives them an easy market to take over. And 
> they make it easy to switch.
> 
> Beginning at 4:17 in the video, it says the following:
> 
>   "You don't need to worry about replica sets, quorum or read repair.  
> You can focus on writing correct application logic."
> 
> At 4:42, it says:
>   "Hopefully this gives you a quick idea of how seamlessly you can bring 
> your existing Cassandra applications to Azure Cosmos DB.  No code changes are 
> required.  It works with your favorite Cassandra tools and drivers including 
> for example native Cassandra driver for Spark. And it takes seconds to get 
> going, and it's elastically and globally scalable."
> 
> More to come,
> 
> Kenneth Brotman
> 
> -Original Message-
> From: Josh McKenzie [mailto:jmcken...@apache.org] 
> Sent: Wednesday, February 21, 2018 8:28 AM
> To: d...@cassandra.apache.org
> Cc: User
> Subject: Re: Cassandra Needs to Grow Up by Version Five!
> 
> There's a disheartening amount of "here's where Cassandra is bad, and here's 
> what it needs to do for me for free" happening in this thread.
> 
> This is open-source software. Everyone is *strongly encouraged* to submit a 
> patch to move the needle on *any* of these things being complained about in 
> this thread.
> 
> For the Apache Way  to 

Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Durity, Sean R
It is instructive to listen to the concerns of new and existing users in order 
to improve a product like Cassandra, but I think the school yard taunt model 
isn’t the most effective.

In my experience with open and closed source databases, there are always things 
that could be improved. Many have a historical base in how the product evolved 
over time. A newcomer sees those as rough edges right away. In other cases, the 
database creators have often widened their scope to try and solve every data 
problem. This creates the complexity of too many configuration options, etc. 
Even the best RDBMS (Informix!) battled these kinds of issues.

Cassandra, though, introduced another angle of difficulty. In trying to relate 
to RDBMS users (pun intended), it often borrowed terminology to make it seem 
familiar. But they don’t work the same way or even solve the same problems. The 
classic example is secondary indexes. For RDBMS, they are very useful; for 
Cassandra, they are anathema (except for very narrow cases).
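
Sean's secondary-index point can be made concrete with a toy model: a Cassandra secondary index is node-local, so a lookup by an indexed column generally has to consult every node, while a lookup by partition key touches only that key's replicas. The sketch below is a hypothetical simplification (invented node names, toy hash placement), not driver or server code, and real scatter-gather behaviour depends on version and fetch size:

```python
# Toy model of why secondary indexes scale badly in Cassandra: the index
# covers only each node's local data, so an index lookup fans out across
# the whole cluster, while a partition-key lookup goes only to the RF
# replicas that own the key.  Simplified illustration.

NODES = ["node%d" % i for i in range(12)]
RF = 3

def nodes_for_partition_key(key):
    """RF replicas chosen by (toy) hash placement on a 12-node ring."""
    start = hash(key) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(RF)]

def nodes_for_secondary_index_lookup():
    """The indexed column says nothing about placement: ask everyone."""
    return list(NODES)

print(len(nodes_for_partition_key("user-42")))   # 3 nodes contacted
print(len(nodes_for_secondary_index_lookup()))   # all 12 nodes contacted
```

The narrow cases where a secondary index is acceptable are essentially those where the query is already restricted to one partition, so the fan-out never happens.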

However, I think the shots at Cassandra are generally unfair. When I started 
working with it, the DataStax documentation was some of the best documentation 
I had seen on any project, especially an open source one. (If anything the 
cooling off between Apache Cassandra and DataStax may be the most serious 
misstep so far…) The more I learned about how Cassandra worked, the more I 
marveled at the clever combination of intricate solutions (gossip, merkle 
trees, compaction strategies, etc.) to solve specific data problems. This is a 
great product! It has given me lots of sleep-filled nights over the last 4+ 
years. My customers love it, once I explain what it should be used for (and 
what it shouldn’t). I applaud the contributors, whether coders or users. Thank 
you!

Finally, a note on backup. Backing up a distributed system is tough, but 
restores are even more complex (if you want no down-time, no extra disk space, 
point-in-time recovery, etc). If you want to investigate why it is a tough 
problem for Cassandra, go look at RecoverX from Datos IO. They have solved many 
of the problems, but it isn’t an easy task. You could ask people to try and 
recreate all that, or just point them to a working solution. If backup and 
recovery is required (and I would argue it isn’t always required), it is 
probably worth paying for.


Sean Durity
From: Josh McKenzie [mailto:jmcken...@apache.org]
Sent: Wednesday, February 21, 2018 11:28 AM
To: d...@cassandra.apache.org
Cc: User 
Subject: [EXTERNAL] Re: Cassandra Needs to Grow Up by Version Five!

There's a disheartening amount of "here's where Cassandra is bad, and here's 
what it needs to do for me for free" happening in this thread.

This is open-source software. Everyone is *strongly encouraged* to submit a 
patch to move the needle on *any* of these things being complained about in 
this thread.

For the Apache Way to work, people need to step up and meaningfully contribute to a project to 
scratch their own itch instead of just waiting for a random 
corporation-subsidized engineer to happen to have interests that align with 
them and contribute that to the project.

Beating a dead horse for things everyone on the project knows are serious pain 
points is not productive.

On Wed, Feb 21, 2018 at 5:45 AM, Oleksandr Shulgin 
> wrote:
On Mon, Feb 19, 2018 at 10:01 AM, Kenneth Brotman <
kenbrot...@yahoo.com.invalid> wrote:

>
> >> Cluster wide management should be a big theme in any next major release.
> >>
> >Na. Stability and testing should be a big theme in the next major release.
> >
>
> Double Na on that one Jeff.  I think you have a concern there about the
> need to test sufficiently to ensure the stability of the next major
> release.  That makes perfect sense.- for every release, especially the
> major ones.  Continuous improvement is not a phase of development for
> example.  CI should be in everything, in every phase.  Stability and
> testing a part of every release not just one.  A major release should be a
> nice step from the previous major release though.
>

I guess what Jeff refers to is the tick-tock release cycle experiment,
which has proven to be a complete disaster by popular opinion.

There's also the "materialized views" feature which failed to materialize
in the end (pun intended) and had to be declared experimental retroactively.

Another prominent example is incremental repair which was introduced as the
default option in 2.2 and now is not recommended to use because of so many
corner cases where it can fail.  So again experimental as an afterthought.

Not to 

Re: Performance Of IN Queries On Wide Rows

2018-02-21 Thread Gareth Collins
Thanks for the response!

I could understand that being the case if the Cassandra cluster is not
loaded. Splitting the work across multiple nodes would obviously make
the query faster.

But if this was just a single node, shouldn't one IN query be faster
than multiple due to the fact that, if I understand correctly,
Cassandra should need to do less work?

thanks in advance,
Gareth

On Wed, Feb 21, 2018 at 7:27 AM, Rahul Singh
 wrote:
> That depends on the driver you use but separate queries asynchronously
> around the cluster would be faster.
>
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Feb 20, 2018, 6:48 PM -0500, Eric Stevens , wrote:
>
> Someone can correct me if I'm wrong, but I believe if you do a large IN() on
> a single partition's cluster keys, all the reads are going to be served from
> a single replica.  Compared to many concurrent individual equal statements
> you can get the performance gain of leaning on several replicas for
> parallelism.
>
> On Tue, Feb 20, 2018 at 11:43 AM Gareth Collins 
> wrote:
>>
>> Hello,
>>
>> When querying large wide rows for multiple specific values is it
>> better to do separate queries for each value...or do it with one query
>> and an "IN"? I am using Cassandra 2.1.14
>>
>> I am asking because I had changed my app to use 'IN' queries and it
>> **appears** to be slower rather than faster. I had assumed that the
>> "IN" query should be faster...as I assumed it only needs to go down
>> the read path once (i.e. row cache -> memtable -> key cache -> bloom
>> filter -> index summary -> index -> compaction -> sstable) rather than
>> once for each entry? Or are there some additional caveats that I
>> should be aware of for 'IN' query performance (e.g. ordering of 'IN'
>> query entries, closeness of 'IN' query values in the SSTable etc.)?
>>
>> thanks in advance,
>> Gareth Collins
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org
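
Eric's parallelism point can be sketched as follows. The `fetch_one` stub stands in for a real per-key read (with the DataStax Python driver this would be something like `session.execute_async(...)`), so the names and data here are illustrative only:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a per-key read.  With the Python driver this
# would be session.execute_async("SELECT ... WHERE pk = %s AND ck = %s", ...).
def fetch_one(ck):
    return ("row-for", ck)

clustering_keys = ["a", "b", "c", "d"]

# A single IN() query is one request: the coordinator reads every key from
# one replica.  Issuing the keys as independent queries lets the driver run
# them concurrently and spread the load across replicas:
with ThreadPoolExecutor(max_workers=8) as pool:
    rows = list(pool.map(fetch_one, clustering_keys))
```

On a lightly loaded single node the IN() form may indeed win (one read path traversal); the parallel form pays off when multiple replicas can share the work.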



Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread DuyHai Doan
So before buying any marketing claims from Microsoft or whoever, maybe
should you try to use it extensively ?

And talking about backup, have a look at DynamoDB:
http://i68.tinypic.com/n1b6yr.jpg

From my POV, if a multi-billion company like Amazon doesn't get it right
or can't make it easy for the end user (without involving unwieldy Hadoop
machinery:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html),
what Cassandra offers in terms of backup/restore is more than satisfactory.




On Wed, Feb 21, 2018 at 8:56 PM, Kenneth Brotman <
kenbrot...@yahoo.com.invalid> wrote:

>  Josh,
>
> To say nothing is indifference.  If you care about your community,
> sometimes don't you have to bring up a subject even though you know it's
> also temporarily adding some discomfort?
>
> As to opening a JIRA, I've got a very specific topic to try in mind now.
> An easy one I'll work on and then announce.  Someone else will have to do
> the coding.  A year from now I would probably just knock it out to make
> sure it's as easy as I expect it to be but to be honest, as I've been
> saying, I'm not set up to do that right now.  I've barely looked at any
> Cassandra code, for one; everyone on this list probably codes more than I
> do, for another; and lastly, it's a good one for someone that wants an easy
> one to start with: vNodes.  I've already seen too many people seeking
> assistance with the vNode setting.
>
> And you can expect as others have been mentioning that there should be
> similar ones on compaction, repair and backup.
>
> Microsoft knows poor usability gives them an easy market to take over. And
> they make it easy to switch.
>
> Beginning at 4:17 in the video, it says the following:
>
> "You don't need to worry about replica sets, quorum or read
> repair.  You can focus on writing correct application logic."
>
> At 4:42, it says:
> "Hopefully this gives you a quick idea of how seamlessly you can
> bring your existing Cassandra applications to Azure Cosmos DB.  No code
> changes are required.  It works with your favorite Cassandra tools and
> drivers including for example native Cassandra driver for Spark. And it
> takes seconds to get going, and it's elastically and globally scalable."
>
> More to come,
>
> Kenneth Brotman
>
> -Original Message-
> From: Josh McKenzie [mailto:jmcken...@apache.org]
> Sent: Wednesday, February 21, 2018 8:28 AM
> To: d...@cassandra.apache.org
> Cc: User
> Subject: Re: Cassandra Needs to Grow Up by Version Five!
>
> There's a disheartening amount of "here's where Cassandra is bad, and
> here's what it needs to do for me for free" happening in this thread.
>
> This is open-source software. Everyone is *strongly encouraged* to submit
> a patch to move the needle on *any* of these things being complained about
> in this thread.
>
> For the Apache Way to
> work, people need to step up and meaningfully contribute to a project to
> scratch their own itch instead of just waiting for a random
> corporation-subsidized engineer to happen to have interests that align with
> them and contribute that to the project.
>
> Beating a dead horse for things everyone on the project knows are serious
> pain points is not productive.
>
> On Wed, Feb 21, 2018 at 5:45 AM, Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
> > On Mon, Feb 19, 2018 at 10:01 AM, Kenneth Brotman <
> > kenbrot...@yahoo.com.invalid> wrote:
> >
> > >
> > > >> Cluster wide management should be a big theme in any next major
> > release.
> > > >>
> > > >Na. Stability and testing should be a big theme in the next major
> > release.
> > > >
> > >
> > > Double Na on that one Jeff.  I think you have a concern there about
> > > the need to test sufficiently to ensure the stability of the next
> > > major release.  That makes perfect sense, for every release,
> > > especially the major ones.  Continuous improvement is not a phase of
> > > development for example.  CI should be in everything, in every
> > > phase.  Stability and testing a part of every release not just one.
> > > A major release should be
> > a
> > > nice step from the previous major release though.
> > >
> >
> > I guess what Jeff refers to is the tick-tock release cycle experiment,
> > which has proven to be a complete disaster by popular opinion.
> >
> > There's also the "materialized views" feature which failed to
> > materialize in the end (pun intended) and had to be declared
> > experimental retroactively.
> >
> > Another prominent example is incremental repair which was introduced
> > as the default option in 2.2 and now is not recommended to use because
> > of so many corner cases where it can fail.  So again experimental as an
> afterthought.
> >
> > Not to mention that even if you are aware of the default incremental
> > and go with full repair instead, you're still up for a sad surprise:
> > anti-compaction will be triggered despite the "full" 

RE: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Kenneth Brotman
 Josh, 

To say nothing is indifference.  If you care about your community, sometimes 
don't you have to bring up a subject even though you know it's also temporarily 
adding some discomfort?  

As to opening a JIRA, I've got a very specific topic to try in mind now.  An 
easy one I'll work on and then announce.  Someone else will have to do the 
coding.  A year from now I would probably just knock it out to make sure it's 
as easy as I expect it to be but to be honest, as I've been saying, I'm not set 
up to do that right now.  I've barely looked at any Cassandra code, for one; 
everyone on this list probably codes more than I do, for another; and lastly, it's 
a good one for someone that wants an easy one to start with: vNodes.  I've 
already seen too many people seeking assistance with the vNode setting.

And you can expect as others have been mentioning that there should be similar 
ones on compaction, repair and backup. 

Microsoft knows poor usability gives them an easy market to take over. And they 
make it easy to switch.

Beginning at 4:17 in the video, it says the following:

"You don't need to worry about replica sets, quorum or read repair.  
You can focus on writing correct application logic."

At 4:42, it says:
"Hopefully this gives you a quick idea of how seamlessly you can bring 
your existing Cassandra applications to Azure Cosmos DB.  No code changes are 
required.  It works with your favorite Cassandra tools and drivers including 
for example native Cassandra driver for Spark. And it takes seconds to get 
going, and it's elastically and globally scalable."

More to come,

Kenneth Brotman

-Original Message-
From: Josh McKenzie [mailto:jmcken...@apache.org] 
Sent: Wednesday, February 21, 2018 8:28 AM
To: d...@cassandra.apache.org
Cc: User
Subject: Re: Cassandra Needs to Grow Up by Version Five!

There's a disheartening amount of "here's where Cassandra is bad, and here's 
what it needs to do for me for free" happening in this thread.

This is open-source software. Everyone is *strongly encouraged* to submit a 
patch to move the needle on *any* of these things being complained about in 
this thread.

For the Apache Way to work, 
people need to step up and meaningfully contribute to a project to scratch 
their own itch instead of just waiting for a random corporation-subsidized 
engineer to happen to have interests that align with them and contribute that 
to the project.

Beating a dead horse for things everyone on the project knows are serious pain 
points is not productive.

On Wed, Feb 21, 2018 at 5:45 AM, Oleksandr Shulgin < 
oleksandr.shul...@zalando.de> wrote:

> On Mon, Feb 19, 2018 at 10:01 AM, Kenneth Brotman < 
> kenbrot...@yahoo.com.invalid> wrote:
>
> >
> > >> Cluster wide management should be a big theme in any next major
> release.
> > >>
> > >Na. Stability and testing should be a big theme in the next major
> release.
> > >
> >
> > Double Na on that one Jeff.  I think you have a concern there about 
> > the need to test sufficiently to ensure the stability of the next 
> > major release.  That makes perfect sense, for every release, 
> > especially the major ones.  Continuous improvement is not a phase of 
> > development for example.  CI should be in everything, in every 
> > phase.  Stability and testing a part of every release not just one.  
> > A major release should be
> a
> > nice step from the previous major release though.
> >
>
> I guess what Jeff refers to is the tick-tock release cycle experiment, 
> which has proven to be a complete disaster by popular opinion.
>
> There's also the "materialized views" feature which failed to 
> materialize in the end (pun intended) and had to be declared 
> experimental retroactively.
>
> Another prominent example is incremental repair which was introduced 
> as the default option in 2.2 and now is not recommended to use because 
> of so many corner cases where it can fail.  So again experimental as an 
> afterthought.
>
> Not to mention that even if you are aware of the default incremental 
> and go with full repair instead, you're still up for a sad surprise:
> anti-compaction will be triggered despite the "full" repair.  Because 
> anti-compaction is only disabled in case of sub-range repair (don't 
> ask why), so you need to use something advanced like Reaper if you 
> want to avoid that.  I don't think you'll ever find this in the documentation.
>
> Honestly, for an eventually-consistent system like Cassandra 
> anti-entropy repair is one of the most important pieces to get right.  
> And Cassandra fails really badly on that one: the feature is not 
> really well designed, poorly implemented and under-documented.
>
> In a summary, IMO, Cassandra is a poor implementation of some good ideas.
> It is a collection of hacks, not features.  They sometimes play 
> together accidentally, and rarely by design.
>
> Regards,
> --
> Alex
>
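
The sub-range workaround Alex mentions (anti-compaction is skipped for sub-range repairs) can be sketched as a small command generator. The keyspace name and split count are made up, and in practice each sub-range should line up with the node's actual token ranges (e.g. from `nodetool ring`), so treat this as a sketch, not a runbook:

```python
# Murmur3 partitioner token bounds
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1

def subrange_repairs(start, end, splits, keyspace):
    """Split (start, end] into `splits` pieces and emit one repair per piece."""
    span = end - start
    cmds = []
    for i in range(splits):
        st = start + span * i // splits        # sub-range start token
        et = start + span * (i + 1) // splits  # sub-range end token
        cmds.append(f"nodetool repair -full -st {st} -et {et} {keyspace}")
    return cmds

for cmd in subrange_repairs(MIN_TOKEN, MAX_TOKEN, 4, "my_keyspace"):
    print(cmd)
```

This is essentially what tools like Reaper automate, along with scheduling and retries.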



Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-21 Thread Fred Habash
RF of 3 with three racks (AZs) in a single region.

On Feb 21, 2018 10:23 AM, "Carl Mueller" 
wrote:

> What is your replication factor?
> Single datacenter, three availability zones, is that right?
> You removed one node at a time or three at once?
>
> On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash  wrote:
>
>> We have had a 15 node cluster across three zones and cluster repairs
>> using ‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrunk
>> the cluster to 12. Since then, the same repair job has taken up to 12 hours to
>> finish, and often it never completes at all.
>>
>>
>>
>> More importantly, at some point during the repair cycle, we see read
>> latencies jumping to 1-2 seconds and applications immediately notice the
>> impact.
>>
>>
>>
>> stream_throughput_outbound_megabits_per_sec is set at 200 and
>> compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is
>> around ~500GB at 44% usage.
>>
>>
>>
>> When shrinking the cluster, the ‘nodetool decommission’ was uneventful. It
>> completed successfully with no issues.
>>
>>
>>
>> What could possibly cause repairs to cause this impact following cluster
>> downsizing? Taking three nodes out does not seem compatible with such a
>> drastic effect on repair and read latency.
>>
>>
>>
>> Any expert insights will be appreciated.
>>
>> 
>> Thank you
>>
>>
>>
>
>


Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-21 Thread Fred Habash
One node at a time

On Feb 21, 2018 10:23 AM, "Carl Mueller" 
wrote:

> What is your replication factor?
> Single datacenter, three availability zones, is that right?
> You removed one node at a time or three at once?
>
> On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash  wrote:
>
>> We have had a 15 node cluster across three zones and cluster repairs
>> using ‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrunk
>> the cluster to 12. Since then, the same repair job has taken up to 12 hours to
>> finish, and often it never completes at all.
>>
>>
>>
>> More importantly, at some point during the repair cycle, we see read
>> latencies jumping to 1-2 seconds and applications immediately notice the
>> impact.
>>
>>
>>
>> stream_throughput_outbound_megabits_per_sec is set at 200 and
>> compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is
>> around ~500GB at 44% usage.
>>
>>
>>
>> When shrinking the cluster, the ‘nodetool decommission’ was uneventful. It
>> completed successfully with no issues.
>>
>>
>>
>> What could possibly cause repairs to cause this impact following cluster
>> downsizing? Taking three nodes out does not seem compatible with such a
>> drastic effect on repair and read latency.
>>
>>
>>
>> Any expert insights will be appreciated.
>>
>> 
>> Thank you
>>
>>
>>
>
>


Re: Missing 3.11.X cassandra debian packages

2018-02-21 Thread Michael Shuler
On 02/21/2018 11:56 AM, Zachary Marois wrote:
> In the last two weeks (I successfully installed Cassandra earlier in that
> window), probably on 2/19 when version 3.11.2 was released, the cassandra
> apt package version 3.11.1 became uninstallable. It doesn't seem to be
> published in the
> http://www.apache.org/dist/cassandra/debian repository anymore (at least
> not in a valid state).
> 
> 
> Despite the package actually being in the repository still
> 
> http://www.apache.org/dist/cassandra/debian/pool/main/c/cassandra/cassandra_3.11.1_all.deb
>  
> 
> It is no longer in the Packages list

This is just the way reprepro works - only the latest version will be
reported.

This is a feature. Users really should be installing the latest release
version.

> http://dl.bintray.com/apache/cassandra/dists/311x/main/binary-amd64/Packages 
> 
> It looks like version 3.11.2 was released on 2/19.
> 
> I'm guessing that publishing dropped the 3.11.1 version from the
> packages list.

This happens for every release. Every so often, branches will be culled
from http://www.apache.org/dist/, when they are no longer supported, so
periodically, complete series of packages will disappear. However, they
will always be available from the canonical Apache release repository.

The canonical release repository for Apache projects is
archive.apache.org. Every release artifact of the Apache Cassandra
project appears at:

  http://archive.apache.org/dist/cassandra/

A debian package user that cannot upgrade to the latest version via
apt-get can always use, for example wget to fetch .deb. files they need
from the repo pool dir:

  http://archive.apache.org/dist/cassandra/debian/pool/main/c/cassandra/

You will not find Cassandra 0.7.9 in the "current" apt repositories any
longer, since they are unsupported, but there are indeed people using
that version. The above spot is where to find the .deb packages for
0.7.9, and all older releases.

-- 
Warm regards,
Michael

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org
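
The manual-fetch workflow Michael describes can be sketched as a URL builder for the canonical archive; the version string is whatever old release you need:

```python
ARCHIVE = "http://archive.apache.org/dist/cassandra/debian/pool/main/c/cassandra"

def deb_url(version):
    # e.g. pass the result of deb_url("3.11.1") to wget
    return f"{ARCHIVE}/cassandra_{version}_all.deb"

print(deb_url("3.11.1"))
```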



Missing 3.11.X cassandra debian packages

2018-02-21 Thread Zachary Marois
In the last two weeks (I successfully installed Cassandra earlier in that 
window), probably on 2/19 when version 3.11.2 was released, the cassandra apt 
package version 3.11.1 became uninstallable. It doesn't seem to be 
published in the http://www.apache.org/dist/cassandra/debian repository anymore 
(at least not in a valid state).

Despite the package actually being in the repository still

http://www.apache.org/dist/cassandra/debian/pool/main/c/cassandra/cassandra_3.11.1_all.deb

It is no longer in the Packages list

http://dl.bintray.com/apache/cassandra/dists/311x/main/binary-amd64/Packages

It looks like version 3.11.2 was released on 2/19.

I'm guessing that publishing dropped the 3.11.1 version from the packages list.



FINAL REMINDER: CFP for Apache EU Roadshow Closes 25th February

2018-02-21 Thread Sharan F

Hello Apache Supporters and Enthusiasts

This is your FINAL reminder that the Call for Papers (CFP) for the 
Apache EU Roadshow is closing soon. Our Apache EU Roadshow will focus on 
Cloud, IoT, Apache Tomcat, Apache HTTP Server and will run from 13-14 June 2018 
in Berlin.
Note that the CFP deadline has been extended to 25th February and 
it will be your final opportunity to submit a talk for this event.


Please make your submissions at http://apachecon.com/euroadshow18/

Also note that early bird ticket registrations to attend FOSS Backstage 
including the Apache EU Roadshow, have also been extended and will be 
available until 23rd February. Please register at 
https://foss-backstage.de/tickets


We look forward to seeing you in Berlin!

Thanks
Sharan Foga, VP Apache Community Development

PLEASE NOTE: You are receiving this message because you are subscribed 
to a user@ or dev@ list of one or more Apache Software Foundation projects.




Re: Best approach to Replace existing 8 smaller nodes in production cluster with New 8 nodes that are bigger in capacity, without a downtime

2018-02-21 Thread Carl Mueller
I don't disagree with jon.

On Wed, Feb 21, 2018 at 10:27 AM, Jonathan Haddad  wrote:

> The easiest way to do this is replacing one node at a time by using
> rsync.  I don't know why it has to be more complicated than copying data to
> a new machine and replacing it in the cluster.   Bringing up a new DC with
> snapshots is going to be a nightmare in comparison.
>
> On Wed, Feb 21, 2018 at 8:16 AM Carl Mueller 
> wrote:
>
>> DCs can be stood up with snapshotted data.
>>
>>
>> Stand up a new cluster with your old cluster snapshots:
>>
>> https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_snapshot_restore_new_cluster.html
>>
>> Then link the DCs together.
>>
>> Disclaimer: I've never done this in real life.
>>
>> On Wed, Feb 21, 2018 at 9:25 AM, Nitan Kainth 
>> wrote:
>>
>>> New dc will be faster but may impact cluster performance due to
>>> streaming.
>>>
>>> Sent from my iPhone
>>>
>>> On Feb 21, 2018, at 8:53 AM, Leena Ghatpande 
>>> wrote:
>>>
>>> We do use LOCAL_ONE and LOCAL_Quorum currently. But these 8 nodes need
>>> to be in 2 different DCs, so we would end up creating 2 additional new DCs and
>>> dropping 2.
>>>
>>> Are there any advantages to adding a DC over replacing one node at a time?
>>>
>>>
>>> --
>>> *From:* Jeff Jirsa 
>>> *Sent:* Wednesday, February 21, 2018 1:02 AM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Best approach to Replace existing 8 smaller nodes in
>>> production cluster with New 8 nodes that are bigger in capacity, without a
>>> downtime
>>>
>>> You add the nodes with rf=0 so there’s no streaming, then bump it to
>>> rf=1 and run repair, then rf=2 and run repair, then rf=3 and run repair,
>>> then you either change the app to use local quorum in the new dc, or
>>> reverse the process by decreasing the rf in the original dc by 1 at a time
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> > On Feb 20, 2018, at 8:51 PM, Kyrylo Lebediev 
>>> wrote:
>>> >
>>> > I'd say, "add new DC, then remove old DC" approach is more risky
>>> especially if they use QUORUM CL (in this case they will need to change CL
>>> to LOCAL_QUORUM, otherwise they'll run into a lot of blocking read repairs).
>>> > Also, if there is a chance to get rid of streaming, it's worth doing, as
>>> direct data copy (not by means of C*) is usually more effective and less
>>> troublesome.
>>> >
>>> > Regards,
>>> > Kyrill
>>> >
>>> > 
>>> > From: Nitan Kainth 
>>> > Sent: Wednesday, February 21, 2018 1:04:05 AM
>>> > To: user@cassandra.apache.org
>>> > Subject: Re: Best approach to Replace existing 8 smaller nodes in
>>> production cluster with New 8 nodes that are bigger in capacity, without a
>>> downtime
>>> >
>>> > You can also create a new DC and then terminate old one.
>>> >
>>> > Sent from my iPhone
>>> >
>>> >> On Feb 20, 2018, at 2:49 PM, Kyrylo Lebediev <
>>> kyrylo_lebed...@epam.com> wrote:
>>> >>
>>> >> Hi,
>>> >> Consider using this approach, replacing nodes one by one:
>>> https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement/
>>>
>>> 
>>>
>>> >>
>>> >> Regards,
>>> >> Kyrill
>>> >>
>>> >> 
>>> >> From: Leena Ghatpande 
>>> >> Sent: Tuesday, February 20, 2018 10:24:24 PM
>>> >> To: user@cassandra.apache.org
>>> >> Subject: Best approach to Replace existing 8 smaller nodes in
>>> production cluster with New 8 nodes that are bigger in capacity, without a
>>> downtime
>>> >>
>>> >> Best approach to replace existing 8 smaller nodes in a production
>>> cluster with 8 new nodes that are bigger in capacity, without downtime
>>> >>
>>> >> We have 4 nodes each in 2 DC, and we want to replace these 8 nodes
>>> with new 8 nodes that are bigger in capacity in terms of RAM,CPU and
>>> Diskspace without a downtime.
>>> >> The RF is set to 3 currently, and we have 2 large tables with up to
>>> 70 million rows
>>> >>
>>> >> What would be the best approach to implement this
>>> >>- Add 1 new node and decommission 1 old node at a time?
>>> >>- Add all New nodes to the cluster, and then decommission old
>>> nodes ?
>>> >>If we do this, can we still keep the RF=3 while we have 16
>>> nodes at a point in the cluster before we start decommissioning?
>>> >>   - How long do we wait in between adding a node or decommissioning to
>>> ensure the process is 
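
The RF-stepping procedure Jeff outlines above can be sketched as a generated sequence of CQL and repair steps. Keyspace and DC names are placeholders and this is a sketch of the sequencing, not a tested runbook:

```python
def rf_step_plan(keyspace, old_dc, new_dc, target_rf=3):
    """Raise the new DC's RF one step at a time, repairing between steps."""
    plan = []
    for rf in range(1, target_rf + 1):
        plan.append(
            f"ALTER KEYSPACE {keyspace} WITH replication = "
            f"{{'class': 'NetworkTopologyStrategy', "
            f"'{old_dc}': {target_rf}, '{new_dc}': {rf}}};"
        )
        # repair the keyspace on each node in the new DC before the next step
        plan.append(f"nodetool repair {keyspace}")
    return plan

for step in rf_step_plan("my_keyspace", "dc_old", "dc_new"):
    print(step)
```

Once the new DC is at the target RF, switch the application to LOCAL_QUORUM against it, then step the old DC's RF down the same way.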

Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Josh McKenzie
There's a disheartening amount of "here's where Cassandra is bad, and
here's what it needs to do for me for free" happening in this thread.

This is open-source software. Everyone is *strongly encouraged* to submit a
patch to move the needle on *any* of these things being complained about in
this thread.

For the Apache Way to work,
people need to step up and meaningfully contribute to a project to scratch
their own itch instead of just waiting for a random corporation-subsidized
engineer to happen to have interests that align with them and contribute
that to the project.

Beating a dead horse for things everyone on the project knows are serious
pain points is not productive.

On Wed, Feb 21, 2018 at 5:45 AM, Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Mon, Feb 19, 2018 at 10:01 AM, Kenneth Brotman <
> kenbrot...@yahoo.com.invalid> wrote:
>
> >
> > >> Cluster wide management should be a big theme in any next major
> release.
> > >>
> > >Na. Stability and testing should be a big theme in the next major
> release.
> > >
> >
> > Double Na on that one Jeff.  I think you have a concern there about the
> > need to test sufficiently to ensure the stability of the next major
> > release.  That makes perfect sense, for every release, especially the
> > major ones.  Continuous improvement is not a phase of development for
> > example.  CI should be in everything, in every phase.  Stability and
> > testing a part of every release not just one.  A major release should be
> a
> > nice step from the previous major release though.
> >
>
> I guess what Jeff refers to is the tick-tock release cycle experiment,
> which has proven to be a complete disaster by popular opinion.
>
> There's also the "materialized views" feature which failed to materialize
> in the end (pun intended) and had to be declared experimental
> retroactively.
>
> Another prominent example is incremental repair which was introduced as the
> default option in 2.2 and now is not recommended to use because of so many
> corner cases where it can fail.  So again experimental as an afterthought.
>
> Not to mention that even if you are aware of the default incremental and go
> with full repair instead, you're still up for a sad surprise:
> anti-compaction will be triggered despite the "full" repair.  Because
> anti-compaction is only disabled in case of sub-range repair (don't ask
> why), so you need to use something advanced like Reaper if you want to
> avoid that.  I don't think you'll ever find this in the documentation.
>
> Honestly, for an eventually-consistent system like Cassandra anti-entropy
> repair is one of the most important pieces to get right.  And Cassandra
> fails really badly on that one: the feature is not really well designed,
> poorly implemented and under-documented.
>
> In a summary, IMO, Cassandra is a poor implementation of some good ideas.
> It is a collection of hacks, not features.  They sometimes play together
> accidentally, and rarely by design.
>
> Regards,
> --
> Alex
>


Re: Best approach to Replace existing 8 smaller nodes in production cluster with New 8 nodes that are bigger in capacity, without a downtime

2018-02-21 Thread Jonathan Haddad
The easiest way to do this is replacing one node at a time by using rsync.
I don't know why it has to be more complicated than copying data to a new
machine and replacing it in the cluster.   Bringing up a new DC with
snapshots is going to be a nightmare in comparison.

On Wed, Feb 21, 2018 at 8:16 AM Carl Mueller 
wrote:

> DCs can be stood up with snapshotted data.
>
>
> Stand up a new cluster with your old cluster snapshots:
>
>
> https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_snapshot_restore_new_cluster.html
>
> Then link the DCs together.
>
> Disclaimer: I've never done this in real life.
>
> On Wed, Feb 21, 2018 at 9:25 AM, Nitan Kainth 
> wrote:
>
>> New dc will be faster but may impact cluster performance due to streaming.
>>
>> Sent from my iPhone
>>
>> On Feb 21, 2018, at 8:53 AM, Leena Ghatpande 
>> wrote:
>>
>> We do use LOCAL_ONE and LOCAL_Quorum currently. But these 8 nodes need to
>> be in 2 different DCs, so we would end up creating 2 additional new DCs and
>> dropping 2.
>>
>> Are there any advantages to adding a DC over replacing one node at a time?
>>
>>
>> --
>> *From:* Jeff Jirsa 
>> *Sent:* Wednesday, February 21, 2018 1:02 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Best approach to Replace existing 8 smaller nodes in
>> production cluster with New 8 nodes that are bigger in capacity, without a
>> downtime
>>
>> You add the nodes with rf=0 so there’s no streaming, then bump it to rf=1
>> and run repair, then rf=2 and run repair, then rf=3 and run repair, then
>> you either change the app to use local quorum in the new dc, or reverse the
>> process by decreasing the rf in the original dc by 1 at a time
>>
>> --
>> Jeff Jirsa
>>
>>
>> > On Feb 20, 2018, at 8:51 PM, Kyrylo Lebediev 
>> wrote:
>> >
>> > I'd say, "add new DC, then remove old DC" approach is more risky
>> especially if they use QUORUM CL (in this case they will need to change CL
>> to LOCAL_QUORUM, otherwise they'll run into a lot of blocking read repairs).
>> > Also, if there is a chance to get rid of streaming, it's worth doing, as
>> direct data copy (not by means of C*) is usually more effective and less
>> troublesome.
>> >
>> > Regards,
>> > Kyrill
>> >
>> > 
>> > From: Nitan Kainth 
>> > Sent: Wednesday, February 21, 2018 1:04:05 AM
>> > To: user@cassandra.apache.org
>> > Subject: Re: Best approach to Replace existing 8 smaller nodes in
>> production cluster with New 8 nodes that are bigger in capacity, without a
>> downtime
>> >
>> > You can also create a new DC and then terminate old one.
>> >
>> > Sent from my iPhone
>> >
>> >> On Feb 20, 2018, at 2:49 PM, Kyrylo Lebediev 
>> wrote:
>> >>
>> >> Hi,
>> >> Consider using this approach, replacing nodes one by one:
>> https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement/
>>
>> 
>>
>> >>
>> >> Regards,
>> >> Kyrill
>> >>
>> >> 
>> >> From: Leena Ghatpande 
>> >> Sent: Tuesday, February 20, 2018 10:24:24 PM
>> >> To: user@cassandra.apache.org
>> >> Subject: Best approach to Replace existing 8 smaller nodes in
>> production cluster with New 8 nodes that are bigger in capacity, without a
>> downtime
>> >>
>> >> Best approach to replace existing 8 smaller nodes in production
>> cluster with New 8 nodes that are bigger in capacity without a downtime
>> >>
>> >> We have 4 nodes each in 2 DC, and we want to replace these 8 nodes
>> with new 8 nodes that are bigger in capacity in terms of RAM,CPU and
>> Diskspace without a downtime.
>> >> The RF is set to 3 currently, and we have 2 large tables with up to
>> 70 million rows.
>> >>
>> >> What would be the best approach to implement this?
>> >>- Add 1 new node and decommission 1 old node at a time?
>> >>- Add all new nodes to the cluster, and then decommission old nodes?
>> >>If we do this, can we still keep the RF=3 while we have 16
>> nodes at a point in the cluster before we start decommissioning?
>> >>   - How long do we wait in between adding a node or decommissioning to
>> ensure the process is complete before we proceed?
>> >>   - Any tool that we can use to monitor if the add/decommission of a
>> node is done before we proceed to the next?
>> >>
>> >> Any other suggestion?
>> >>
>> >>
>> >> 

Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-21 Thread Jeff Jirsa
nodetool cfhistograms, nodetool compactionstats would be helpful

Compaction is probably behind from streaming, and reads are touching many 
sstables.

-- 
Jeff Jirsa


> On Feb 21, 2018, at 8:20 AM, Fd Habash  wrote:
> 
> We have had a 15 node cluster across three zones and cluster repairs using 
> ‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrunk the 
> cluster to 12. Since then, same repair job has taken up to 12 hours to finish 
> and most times, it never does.
>  
> More importantly, at some point during the repair cycle, we see read 
> latencies jumping to 1-2 seconds and applications immediately notice the 
> impact.
>  
> stream_throughput_outbound_megabits_per_sec is set at 200 and 
> compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is around 
> ~500GB at 44% usage.
>  
>> > When shrinking the cluster, the ‘nodetool decommission’ was uneventful. It
>> > completed successfully with no issues.
>  
>> > What could possibly cause repairs to have this impact following cluster 
>> > downsizing? Taking three nodes out does not seem commensurate with such a 
>> > drastic effect on repair and read latency.
>  
> Any expert insights will be appreciated.
> 
> Thank you
>  


Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-21 Thread Carl Mueller
What is your replication factor?
Single datacenter, three availability zones, is that right?
You removed one node at a time or three at once?

On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash  wrote:

> We have had a 15 node cluster across three zones and cluster repairs using
> ‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrunk the
> cluster to 12. Since then, same repair job has taken up to 12 hours to
> finish and most times, it never does.
>
>
>
> More importantly, at some point during the repair cycle, we see read
> latencies jumping to 1-2 seconds and applications immediately notice the
> impact.
>
>
>
> stream_throughput_outbound_megabits_per_sec is set at 200 and
> compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is
> around ~500GB at 44% usage.
>
>
>
> When shrinking the cluster, the ‘nodetool decommission’ was uneventful. It
> completed successfully with no issues.
>
>
>
> What could possibly cause repairs to have this impact following cluster
> downsizing? Taking three nodes out does not seem commensurate with such a
> drastic effect on repair and read latency.
>
>
>
> Any expert insights will be appreciated.
>
> 
> Thank you
>
>
>


Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-21 Thread Fd Habash
We have had a 15 node cluster across three zones and cluster repairs using 
‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrunk the 
cluster to 12. Since then, same repair job has taken up to 12 hours to finish 
and most times, it never does. 

More importantly, at some point during the repair cycle, we see read latencies 
jumping to 1-2 seconds and applications immediately notice the impact.

stream_throughput_outbound_megabits_per_sec is set at 200 and 
compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is around 
~500GB at 44% usage. 

When shrinking the cluster, the ‘nodetool decommission’ was uneventful. It 
completed successfully with no issues.

What could possibly cause repairs to have this impact following cluster 
downsizing? Taking three nodes out does not seem commensurate with such a drastic 
effect on repair and read latency. 

Any expert insights will be appreciated. 

Thank you



Re: Best approach to Replace existing 8 smaller nodes in production cluster with New 8 nodes that are bigger in capacity, without a downtime

2018-02-21 Thread Carl Mueller
DCs can be stood up with snapshotted data.


Stand up a new cluster with your old cluster snapshots:

https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_snapshot_restore_new_cluster.html

Then link the DCs together.

Disclaimer: I've never done this in real life.
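A minimal sketch of the first step—gathering one named snapshot's files across all keyspaces and tables before copying them to the new cluster. It assumes the standard on-disk layout `<data_dir>/<keyspace>/<table-uuid>/snapshots/<name>/`; the helper name is ours, not a Cassandra API:

```python
import pathlib

def snapshot_files(data_dir, snapshot_name):
    # Collect every file belonging to one named snapshot, across all
    # keyspaces and tables, assuming the default data directory layout:
    #   <data_dir>/<keyspace>/<table-uuid>/snapshots/<snapshot_name>/...
    root = pathlib.Path(data_dir)
    return sorted(root.glob(f"*/*/snapshots/{snapshot_name}/*"))
```

The collected files would then be copied into the matching keyspace/table directories on the new cluster (or fed to sstableloader), per the DataStax procedure linked above.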

On Wed, Feb 21, 2018 at 9:25 AM, Nitan Kainth  wrote:

> New dc will be faster but may impact cluster performance due to streaming.
>
> Sent from my iPhone
>
> On Feb 21, 2018, at 8:53 AM, Leena Ghatpande 
> wrote:
>
> We do use LOCAL_ONE and LOCAL_QUORUM currently. But these 8 nodes need to
> be in 2 different DCs, so we would end up creating 2 additional new DCs and
> dropping 2.
>
> Are there any advantages to adding a DC over replacing one node at a time?
>
>
> --
> *From:* Jeff Jirsa 
> *Sent:* Wednesday, February 21, 2018 1:02 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Best approach to Replace existing 8 smaller nodes in
> production cluster with New 8 nodes that are bigger in capacity, without a
> downtime
>
> You add the nodes with rf=0 so there’s no streaming, then bump it to rf=1
> and run repair, then rf=2 and run repair, then rf=3 and run repair, then
> you either change the app to use local quorum in the new dc, or reverse the
> process by decreasing the rf in the original dc by 1 at a time
>
> --
> Jeff Jirsa
>
>
> > On Feb 20, 2018, at 8:51 PM, Kyrylo Lebediev 
> wrote:
> >
> > I'd say, "add new DC, then remove old DC" approach is more risky
> especially if they use QUORUM CL (in this case they will need to change CL
> to LOCAL_QUORUM, otherwise they'll run into a lot of blocking read repairs).
> > Also, if there is a chance to get rid of streaming, it is worth doing,
> as direct data copy (not by means of C*) is usually more effective and
> less troublesome.
> >
> > Regards,
> > Kyrill
> >
> > 
> > From: Nitan Kainth 
> > Sent: Wednesday, February 21, 2018 1:04:05 AM
> > To: user@cassandra.apache.org
> > Subject: Re: Best approach to Replace existing 8 smaller nodes in
> production cluster with New 8 nodes that are bigger in capacity, without a
> downtime
> >
> > You can also create a new DC and then terminate old one.
> >
> > Sent from my iPhone
> >
> >> On Feb 20, 2018, at 2:49 PM, Kyrylo Lebediev 
> wrote:
> >>
> >> Hi,
> >> Consider using this approach, replacing nodes one by one:
> https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement/
>
> 
>
> >>
> >> Regards,
> >> Kyrill
> >>
> >> 
> >> From: Leena Ghatpande 
> >> Sent: Tuesday, February 20, 2018 10:24:24 PM
> >> To: user@cassandra.apache.org
> >> Subject: Best approach to Replace existing 8 smaller nodes in
> production cluster with New 8 nodes that are bigger in capacity, without a
> downtime
> >>
> >> Best approach to replace existing 8 smaller nodes in production
> cluster with New 8 nodes that are bigger in capacity without a downtime
> >>
> >> We have 4 nodes each in 2 DC, and we want to replace these 8 nodes with
> new 8 nodes that are bigger in capacity in terms of RAM,CPU and Diskspace
> without a downtime.
> >> The RF is set to 3 currently, and we have 2 large tables with up to
> 70 million rows.
> >>
> >> What would be the best approach to implement this?
> >>- Add 1 new node and decommission 1 old node at a time?
> >>- Add all new nodes to the cluster, and then decommission old nodes?
> >>If we do this, can we still keep the RF=3 while we have 16 nodes
> at a point in the cluster before we start decommissioning?
> >>   - How long do we wait in between adding a node or decommissioning to
> ensure the process is complete before we proceed?
> >>   - Any tool that we can use to monitor if the add/decommission of a
> node is done before we proceed to the next?
> >>
> >> Any other suggestion?
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> >> For additional commands, e-mail: user-h...@cassandra.apache.org
> >>
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
> >
> > -
> > To unsubscribe, e-mail: 

Re: Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-21 Thread Carl Mueller
jon: I am planning on writing a custom compaction strategy. That's why the
question is here; I figured the specifics of memtable -> sstable and
Cassandra internals are not a user question. If that still isn't deep
enough for the dev thread, I will move all those questions to user.

On Wed, Feb 21, 2018 at 9:59 AM, Carl Mueller 
wrote:

> Thank you all!
>
> On Tue, Feb 20, 2018 at 7:35 PM, kurt greaves 
> wrote:
>
>> Probably a lot of work but it would be incredibly useful for vnodes if
>> flushing was range aware (to be used with RangeAwareCompactionStrategy).
>> The writers are already range aware for JBOD, but that's not terribly
>> valuable ATM.
>>
>> On 20 February 2018 at 21:57, Jeff Jirsa  wrote:
>>
>>> There are some arguments to be made that the flush should consider
>>> compaction strategy - it would allow a big flush to respect LCS file sizes
>>> or break into smaller pieces to try to minimize range overlaps going from
>>> L0 into L1, for example.
>>>
>>> I have no idea how much work would be involved, but may be worthwhile.
>>>
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> On Feb 20,  2018, at 1:26 PM, Jon Haddad  wrote:
>>>
>>> The file format is independent from compaction.  A compaction strategy
>>> only selects sstables to be compacted; that’s its only job.  It could have
>>> side effects, like generating other files, but any decent compaction
>>> strategy will account for the fact that those other files don’t exist.
>>>
>>> I wrote a blog post a few months ago going over some of the nuance of
>>> compaction you might find informative:
>>> http://thelastpickle.com/blog/2017/03/16/compaction-nuance.html
>>>
>>> This is also the wrong mailing list, please direct future user questions
>>> to the user list.  The dev list is for development of Cassandra itself.
>>>
>>> Jon
>>>
>>> On Feb 20, 2018, at 1:10 PM, Carl Mueller 
>>> wrote:
>>>
>>> When memtables/commit logs are flushed to disk as an sstable, does the
>>> sstable go through organization specific to each compaction strategy, or
>>> is sstable creation the same for all compaction strategies, with it being
>>> up to the compaction strategy to recompact the sstable if desired?
>>>
>>>
>>>
>>
>


Re: Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-21 Thread Carl Mueller
Thank you all!

On Tue, Feb 20, 2018 at 7:35 PM, kurt greaves  wrote:

> Probably a lot of work but it would be incredibly useful for vnodes if
> flushing was range aware (to be used with RangeAwareCompactionStrategy).
> The writers are already range aware for JBOD, but that's not terribly
> valuable ATM.
>
> On 20 February 2018 at 21:57, Jeff Jirsa  wrote:
>
>> There are some arguments to be made that the flush should consider
>> compaction strategy - it would allow a big flush to respect LCS file sizes
>> or break into smaller pieces to try to minimize range overlaps going from
>> L0 into L1, for example.
>>
>> I have no idea how much work would be involved, but may be worthwhile.
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Feb 20,  2018, at 1:26 PM, Jon Haddad  wrote:
>>
>> The file format is independent from compaction.  A compaction strategy
>> only selects sstables to be compacted; that’s its only job.  It could have
>> side effects, like generating other files, but any decent compaction
>> strategy will account for the fact that those other files don’t exist.
>>
>> I wrote a blog post a few months ago going over some of the nuance of
>> compaction you might find informative:
>> http://thelastpickle.com/blog/2017/03/16/compaction-nuance.html
>>
>> This is also the wrong mailing list, please direct future user questions
>> to the user list.  The dev list is for development of Cassandra itself.
>>
>> Jon
>>
>> On Feb 20, 2018, at 1:10 PM, Carl Mueller 
>> wrote:
>>
>> When memtables/commit logs are flushed to disk as an sstable, does the
>> sstable go through organization specific to each compaction strategy, or is
>> sstable creation the same for all compaction strategies, with it being up to
>> the compaction strategy to recompact the sstable if desired?
>>
>>
>>
>
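To make Jon's point concrete—a compaction strategy's only job is picking which sstables to compact together—here is a toy size-tiered selector. This is an illustration of the idea only, not Cassandra's actual STCS code; the function name and threshold defaults are ours:

```python
def pick_compaction_candidates(sstable_sizes, min_threshold=4, bucket_ratio=1.5):
    # Toy size-tiered selection: group sstables of similar size into
    # buckets, then return the first bucket with enough members to be
    # worth compacting. Everything else is left alone this round.
    buckets = []
    for size in sorted(sstable_sizes):
        if buckets and size <= buckets[-1][0] * bucket_ratio:
            buckets[-1].append(size)   # close enough in size: same tier
        else:
            buckets.append([size])     # start a new tier
    for bucket in buckets:
        if len(bucket) >= min_threshold:
            return bucket
    return []  # nothing eligible; no compaction this round
```

Note how the selector never touches file contents or formats—it only decides *which* sstables go together, which is why flush output can be strategy-agnostic.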


Re: Best approach to Replace existing 8 smaller nodes in production cluster with New 8 nodes that are bigger in capacity, without a downtime

2018-02-21 Thread Nitan Kainth
New dc will be faster but may impact cluster performance due to streaming.

Sent from my iPhone

> On Feb 21, 2018, at 8:53 AM, Leena Ghatpande  wrote:
> 
> We do use LOCAL_ONE and LOCAL_QUORUM currently. But these 8 nodes need to be 
> in 2 different DCs, so we would end up creating 2 additional new DCs and 
> dropping 2. 
> Are there any advantages to adding a DC over replacing one node at a time? 
> 
> 
> From: Jeff Jirsa 
> Sent: Wednesday, February 21, 2018 1:02 AM
> To: user@cassandra.apache.org
> Subject: Re: Best approach to Replace existing 8 smaller nodes in production 
> cluster with New 8 nodes that are bigger in capacity, without a downtime
>  
> You add the nodes with rf=0 so there’s no streaming, then bump it to rf=1 and 
> run repair, then rf=2 and run repair, then rf=3 and run repair, then you 
> either change the app to use local quorum in the new dc, or reverse the 
> process by decreasing the rf in the original dc by 1 at a time
> 
> -- 
> Jeff Jirsa
> 
> 
> > On Feb 20, 2018, at 8:51 PM, Kyrylo Lebediev  
> > wrote:
> > 
> > I'd say, "add new DC, then remove old DC" approach is more risky especially 
> > if they use QUORUM CL (in this case they will need to change CL to 
> > LOCAL_QUORUM, otherwise they'll run into a lot of blocking read repairs).
> > Also, if there is a chance to get rid of streaming, it is worth doing, as 
> > direct data copy (not by means of C*) is usually more effective and less 
> > troublesome.
> > 
> > Regards,
> > Kyrill
> > 
> > 
> > From: Nitan Kainth 
> > Sent: Wednesday, February 21, 2018 1:04:05 AM
> > To: user@cassandra.apache.org
> > Subject: Re: Best approach to Replace existing 8 smaller nodes in 
> > production cluster with New 8 nodes that are bigger in capacity, without a 
> > downtime
> > 
> > You can also create a new DC and then terminate old one.
> > 
> > Sent from my iPhone
> > 
> >> On Feb 20, 2018, at 2:49 PM, Kyrylo Lebediev  
> >> wrote:
> >> 
> >> Hi,
> >> Consider using this approach, replacing nodes one by one: 
> >> https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement/
> 
> 
> >> 
> >> Regards,
> >> Kyrill
> >> 
> >> 
> >> From: Leena Ghatpande 
> >> Sent: Tuesday, February 20, 2018 10:24:24 PM
> >> To: user@cassandra.apache.org
> >> Subject: Best approach to Replace existing 8 smaller nodes in production 
> >> cluster with New 8 nodes that are bigger in capacity, without a downtime
> >> 
> >> Best approach to replace existing 8 smaller nodes in production cluster 
> >> with New 8 nodes that are bigger in capacity without a downtime
> >> 
> >> We have 4 nodes each in 2 DC, and we want to replace these 8 nodes with 
> >> new 8 nodes that are bigger in capacity in terms of RAM,CPU and Diskspace 
> >> without a downtime.
> >> The RF is set to 3 currently, and we have 2 large tables with up to 
> >> 70 million rows.
> >> 
> >> What would be the best approach to implement this?
> >>- Add 1 new node and decommission 1 old node at a time?
> >>- Add all new nodes to the cluster, and then decommission old nodes?
> >>If we do this, can we still keep the RF=3 while we have 16 nodes at 
> >> a point in the cluster before we start decommissioning?
> >>   - How long do we wait in between adding a node or decommissioning to 
> >> ensure the process is complete before we proceed?
> >>   - Any tool that we can use to monitor if the add/decommission of a node 
> >> is done before we proceed to the next?
> >> 
> >> Any other suggestion?
> >> 
> >> 
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> >> For additional commands, e-mail: user-h...@cassandra.apache.org
> >> 
> > 
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> > 
> > 
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> > 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 


Re: Best approach to Replace existing 8 smaller nodes in production cluster with New 8 nodes that are bigger in capacity, without a downtime

2018-02-21 Thread Leena Ghatpande
We do use LOCAL_ONE and LOCAL_QUORUM currently. But these 8 nodes need to be in 
2 different DCs, so we would end up creating 2 additional new DCs and dropping 2.

Are there any advantages to adding a DC over replacing one node at a time?



From: Jeff Jirsa 
Sent: Wednesday, February 21, 2018 1:02 AM
To: user@cassandra.apache.org
Subject: Re: Best approach to Replace existing 8 smaller nodes in production 
cluster with New 8 nodes that are bigger in capacity, without a downtime

You add the nodes with rf=0 so there’s no streaming, then bump it to rf=1 and 
run repair, then rf=2 and run repair, then rf=3 and run repair, then you either 
change the app to use local quorum in the new dc, or reverse the process by 
decreasing the rf in the original dc by 1 at a time

--
Jeff Jirsa
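The RF-bump sequence described above can be sketched as a generator of the CQL and nodetool commands to run at each step. The keyspace and DC names below are placeholders, and the helper is our own illustration, not a real tool:

```python
def expansion_steps(keyspace, old_dc, new_dc, old_rf=3, target_rf=3):
    # Emit the "bump RF by one, then repair" sequence for adding a new DC:
    # each ALTER KEYSPACE raises the new DC's RF, and each repair streams
    # the data for the newly-owned replicas before the next bump.
    steps = []
    for rf in range(1, target_rf + 1):
        steps.append(
            f"ALTER KEYSPACE {keyspace} WITH replication = "
            f"{{'class': 'NetworkTopologyStrategy', "
            f"'{old_dc}': {old_rf}, '{new_dc}': {rf}}};"
        )
        steps.append(f"nodetool repair -pr   # run on every node in {new_dc}")
    return steps
```

Reversing the process (draining the old DC) would mirror this, decreasing the old DC's RF by one per iteration instead.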


> On Feb 20, 2018, at 8:51 PM, Kyrylo Lebediev  wrote:
>
> I'd say, "add new DC, then remove old DC" approach is more risky especially 
> if they use QUORUM CL (in this case they will need to change CL to 
> LOCAL_QUORUM, otherwise they'll run into a lot of blocking read repairs).
> Also, if there is a chance to get rid of streaming, it is worth doing, as 
> direct data copy (not by means of C*) is usually more effective and less troublesome.
>
> Regards,
> Kyrill
>
> 
> From: Nitan Kainth 
> Sent: Wednesday, February 21, 2018 1:04:05 AM
> To: user@cassandra.apache.org
> Subject: Re: Best approach to Replace existing 8 smaller nodes in production 
> cluster with New 8 nodes that are bigger in capacity, without a downtime
>
> You can also create a new DC and then terminate old one.
>
> Sent from my iPhone
>
>> On Feb 20, 2018, at 2:49 PM, Kyrylo Lebediev  
>> wrote:
>>
>> Hi,
>> Consider using this approach, replacing nodes one by one: 
>> https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement/


>>
>> Regards,
>> Kyrill
>>
>> 
>> From: Leena Ghatpande 
>> Sent: Tuesday, February 20, 2018 10:24:24 PM
>> To: user@cassandra.apache.org
>> Subject: Best approach to Replace existing 8 smaller nodes in production 
>> cluster with New 8 nodes that are bigger in capacity, without a downtime
>>
>> Best approach to replace existing 8 smaller nodes in production cluster 
>> with New 8 nodes that are bigger in capacity without a downtime
>>
>> We have 4 nodes each in 2 DC, and we want to replace these 8 nodes with new 
>> 8 nodes that are bigger in capacity in terms of RAM,CPU and Diskspace 
>> without a downtime.
>> The RF is set to 3 currently, and we have 2 large tables with up to 70 
>> million rows.
>>
>> What would be the best approach to implement this?
>>- Add 1 new node and decommission 1 old node at a time?
>>- Add all new nodes to the cluster, and then decommission old nodes?
>>If we do this, can we still keep the RF=3 while we have 16 nodes at a 
>> point in the cluster before we start decommissioning?
>>   - How long do we wait in between adding a node or decommissioning to 
>> ensure the process is complete before we proceed?
>>   - Any tool that we can use to monitor if the add/decommission of a node 
>> is done before we proceed to the next?
>>
>> Any other suggestion?
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Installing the common service to start cassandrea

2018-02-21 Thread Rahul Singh
Jeff,

Check the service configuration to see what path it’s using for the JRE 
execution and whether it’s specifying any classpath parameters. The system user 
may not have the environment variables available, whereas your user may have them.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Feb 20, 2018, 5:29 PM -0500, Jeff Hechter , wrote:
> Hi,
>
> I have Cassandra running on my machine (Windows). I have downloaded 
> commons-daemon-1.1.0-bin-windows.zip and extracted it to
> cassandra\bin\daemon. I successfully created the service using cassandra.bat 
> -install.
>
> When I go to start the service I get the error below. When I start it from 
> the command line it works fine. Any idea where I can update the location of 
> the IBM JRE?
>
> The description for Event ID 2 from source IBM Java cannot be found. Either 
> the component that raises this event is not installed on your local computer 
> or the installation is corrupted. You can install or repair the component on 
> the local computer.
>
>
> Thank You
> Jeff Hechter
>
> Scrum Master - Spectrum Control Install Development
>
> Phone: 1-520-799-5146
> Email : jhech...@us.ibm.com
>
>


Re: Performance Of IN Queries On Wide Rows

2018-02-21 Thread Rahul Singh
That depends on the driver you use, but issuing separate queries asynchronously 
around the cluster would be faster.


--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Feb 20, 2018, 6:48 PM -0500, Eric Stevens , wrote:
> Someone can correct me if I'm wrong, but I believe if you do a large IN() on 
> a single partition's cluster keys, all the reads are going to be served from 
> a single replica.  Compared to many concurrent individual equal statements 
> you can get the performance gain of leaning on several replicas for 
> parallelism.
>
> > On Tue, Feb 20, 2018 at 11:43 AM Gareth Collins 
> >  wrote:
> > > Hello,
> > >
> > > When querying large wide rows for multiple specific values is it
> > > better to do separate queries for each value...or do it with one query
> > > and an "IN"? I am using Cassandra 2.1.14
> > >
> > > I am asking because I had changed my app to use 'IN' queries and it
> > > **appears** to be slower rather than faster. I had assumed that the
> > > "IN" query should be faster...as I assumed it only needs to go down
> > > the read path once (i.e. row cache -> memtable -> key cache -> bloom
> > > filter -> index summary -> index -> compaction -> sstable) rather than
> > > once for each entry? Or are there some additional caveats that I
> > > should be aware of for 'IN' query performance (e.g. ordering of 'IN'
> > > query entries, closeness of 'IN' query values in the SSTable etc.)?
> > >
> > > thanks in advance,
> > > Gareth Collins
> > >
> > > -
> > > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > > For additional commands, e-mail: user-h...@cassandra.apache.org
> > >
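The alternative Eric describes—many concurrent single-key reads instead of one large IN(), so several replicas serve reads in parallel—looks roughly like this. The real DataStax Python driver would use execute_async; here the per-key request is simulated with asyncio so the fan-out pattern stands on its own:

```python
import asyncio

async def fetch_one(key):
    # Stand-in for one single-key SELECT via session.execute_async(...).
    # Each such request can land on a different coordinator/replica.
    await asyncio.sleep(0.01)          # simulated network latency
    return (key, f"value-{key}")

async def fetch_many(keys):
    # Fan out one query per key and gather results; total wall time is
    # roughly one round trip, not len(keys) round trips.
    return await asyncio.gather(*(fetch_one(k) for k in keys))

results = asyncio.run(fetch_many(["a", "b", "c"]))
```

With a large IN() on one partition, by contrast, the whole list is typically served by a single replica chosen by the coordinator, which is the bottleneck being described above.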


Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Oleksandr Shulgin
On Mon, Feb 19, 2018 at 10:01 AM, Kenneth Brotman <
kenbrot...@yahoo.com.invalid> wrote:

>
> >> Cluster wide management should be a big theme in any next major release.
> >>
> >Na. Stability and testing should be a big theme in the next major release.
> >
>
> Double Na on that one Jeff.  I think you have a concern there about the
> need to test sufficiently to ensure the stability of the next major
> release.  That makes perfect sense.- for every release, especially the
> major ones.  Continuous improvement is not a phase of development for
> example.  CI should be in everything, in every phase.  Stability and
> testing a part of every release not just one.  A major release should be a
> nice step from the previous major release though.
>

I guess what Jeff refers to is the tick-tock release cycle experiment,
which has proven to be a complete disaster by popular opinion.

There's also the "materialized views" feature which failed to materialize
in the end (pun intended) and had to be declared experimental retroactively.

Another prominent example is incremental repair which was introduced as the
default option in 2.2 and now is not recommended to use because of so many
corner cases where it can fail.  So again experimental as an afterthought.

Not to mention that even if you are aware of the default incremental and go
with full repair instead, you're still up for a sad surprise:
anti-compaction will be triggered despite the "full" repair.  Because
anti-compaction is only disabled in case of sub-range repair (don't ask
why), so you need to use something advanced like Reaper if you want to
avoid that.  I don't think you'll ever find this in the documentation.
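Sub-range repair works by splitting each node's token range into many small pieces and repairing them one at a time with `-st`/`-et`. A minimal sketch of the splitting arithmetic (tools like Reaper do this, with more care around Murmur3 token bounds and ownership):

```python
def split_token_range(start, end, parts):
    # Split one (start, end] token range into `parts` contiguous
    # sub-ranges; the last sub-range absorbs any rounding remainder.
    width = (end - start) // parts
    bounds = [start + i * width for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))
```

Each resulting (st, et) pair would become one `nodetool repair -st <st> -et <et>` invocation, which is the form of repair that skips anti-compaction.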

Honestly, for an eventually-consistent system like Cassandra anti-entropy
repair is one of the most important pieces to get right.  And Cassandra
fails really badly on that one: the feature is not really well designed,
poorly implemented and under-documented.

In a summary, IMO, Cassandra is a poor implementation of some good ideas.
It is a collection of hacks, not features.  They sometimes play together
accidentally, and rarely by design.

Regards,
--
Alex


Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Ben Slater
I’ve been biting my tongue because I don’t normally like to directly plug
our service on the mailing list, but if you’re going to compare Cassandra to
a fully managed service from Microsoft then you really should check out
Instaclustr (www.instaclustr.com) and you’ll find that we take care of many
of the issues you have raised in just the same way that Microsoft does
with CosmosDB (i.e. hiding them behind our managed service tooling).

Cheers
Ben

On Wed, 21 Feb 2018 at 19:22 DuyHai Doan  wrote:

> For UI and interactive data exploration there is already the Cassandra
> interpreter for Apache Zeppelin that is more than decent for the job
>
> On Wed, Feb 21, 2018 at 9:19 AM, Daniel Hölbling-Inzko <
> daniel.hoelbling-in...@bitmovin.com> wrote:
>
>> But what does this video really show? That Microsoft managed to run
>> Cassandra as a SaaS product with nice UI?
>> Google did that years ago with BigTable and Amazon with DynamoDB.
>>
>> I agree that we need more tools, but not so much for querying (although
>> that would also help a bit), but just in general the project feels
>> unapproachable right now.
>> Besides the excellent DataStax documentation there is little best
>> practice knowledge about how to operate and provision Cassandra clusters.
>> Having some recipes for Chef, Puppet or Ansible that show the most common
>> settings (or some Cloudfoundry/GCP Templates or Helm Charts) would be
>> really useful.
>> Also a list of all the projects that Cassandra goes well with (like TLP
>> Reaper and and Netflix's Priam etc..)
>>
>> greetings Daniel
>>
>> On Wed, 21 Feb 2018 at 07:23 Kenneth Brotman 
>> wrote:
>>
>>> If you watch this video through you'll see why usability is so
>>> important.  You can't ignore usability issues.
>>>
>>> Cassandra does not exist in a vacuum.  The competitors are world class.
>>>
>>> The video is on the New Cassandra API for Azure Cosmos DB:
>>> https://www.youtube.com/watch?v=1Sf4McGN1AQ
>>>
>>> Kenneth Brotman
>>>
>>> -Original Message-
>>> From: Daniel Hölbling-Inzko [mailto:daniel.hoelbling-in...@bitmovin.com]
>>> Sent: Tuesday, February 20, 2018 1:28 AM
>>> To: user@cassandra.apache.org; James Briggs
>>> Cc: d...@cassandra.apache.org
>>> Subject: Re: Cassandra Needs to Grow Up by Version Five!
>>>
>>> Hi,
>>>
>>> I have to add my own two cents here as the main thing that keeps me from
>>> really running Cassandra is the amount of pain running it incurs.
>>> Not so much because it's actually painful but because the tools are so
>>> different and the documentation and best practices are scattered across a
>>> dozen outdated DataStax articles and this mailing list etc.. We've been
>>> hesitant (although our use case is perfect for using Cassandra) to deploy
>>> Cassandra to any critical systems as even after a year of running it we
>>> still don't have the operational experience to confidently run critical
>>> systems with it.
>>>
>>> Simple things like a foolproof / safe cluster-wide S3 Backup (like
>>> Elasticsearch has it) would for example solve a TON of issues for new
>>> people. I don't need it auto-scheduled or something, but having to
>>> configure cron jobs across the whole cluster is a pain in the ass for small
>>> teams.
>>> To be honest, even the way snapshots are done right now is already super
>>> painful. Every other system I operated so far will just create one backup
>>> folder I can export, in C* the Backup is scattered across a bunch of
>>> different Keyspace folders etc.. needless to say that it took a while until
>>> I trusted my backup scripts fully.
>>>
>>> And especially for a Database I believe Backup/Restore needs to be a
>>> non-issue that's documented front and center. If not smaller teams just
>>> don't have the resources to dedicate to learning and building the tools
>>> around it.
>>>
>>> Now that the team is getting larger we could spare the resources to
>>> operate these things, but switching from a well-understood RDBMs schema to
>>> Cassandra is now incredibly hard and will probably take years.
>>>
>>> greetings Daniel
>>>
>>> On Tue, 20 Feb 2018 at 05:56 James Briggs >> .invalid>
>>> wrote:
>>>
>>> > Kenneth:
>>> >
>>> > What you said is not wrong.
>>> >
>>> > Vertica and Riak are examples of distributed databases that don't
>>> > require hand-holding.
>>> >
>>> > Cassandra is for Java-programmer DIYers, or more often Datastax
>>> > clients, at this point.
>>> > Thanks, James.
>>> >
>>> > --
>>> > *From:* Kenneth Brotman 
>>> > *To:* user@cassandra.apache.org
>>> > *Cc:* d...@cassandra.apache.org
>>> > *Sent:* Monday, February 19, 2018 4:56 PM
>>> >
>>> > *Subject:* RE: Cassandra Needs to Grow Up by Version Five!
>>> >
>>> > Jeff, you helped me figure out what I was missing.  It just took me a
>>> > day to digest what you wrote.  I’m coming over from another type of
>>> > engineering.  I didn’t know and 

Re: LEAK DETECTED while minor compaction

2018-02-21 Thread Дарья Меленцова
Bloom filter settings have not changed; they are the defaults. The table is
set to bloom_filter_fp_chance = 0.01. Should I increase it?

DESC TABLE "PerBoxEventSeriesEventIds"

CREATE TABLE "EventsKeyspace"."PerBoxEventSeriesEventIds" (
key blob,
column1 text,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'enabled': 'True',
'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32'}
AND compression = {'sstable_compression':
'org.apache.cassandra.io.compress.SnappyCompressor'}
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.1
AND speculative_retry = 'NONE';
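As a rough illustration of what fp_chance costs at this key count, the textbook Bloom filter sizing formula gives the following (a sketch only; Cassandra's actual implementation allocates a filter per SSTable and rounds sizes, so treat these as ballpark figures):

```python
import math

def bloom_filter_bits_per_key(fp_chance: float) -> float:
    # Textbook optimal Bloom filter size: m/n = -ln(p) / (ln 2)^2
    return -math.log(fp_chance) / (math.log(2) ** 2)

keys = 6_614_724_684  # "Number of keys (estimate)" from nodetool tablestats
for p in (0.01, 0.1):
    bits = bloom_filter_bits_per_key(p)
    print(f"fp_chance={p}: {bits:.2f} bits/key, "
          f"~{keys * bits / 8 / 2**30:.1f} GiB of off-heap memory")
```

Raising fp_chance from 0.01 to 0.1 roughly halves the filter's off-heap footprint, at the cost of more false-positive disk reads; setting it to 0, as Jeff warns, would demand an unbounded filter.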

-- 
Darya Melentsova

2018-02-21 12:06 GMT+05:00 Jeff Jirsa :
> Your bloom filter settings look broken. Did you set the FP ratio to 0? If so 
> that’s a bad idea and we should have stopped you from doing it.
>
>
> --
> Jeff Jirsa
>
>
>> On Feb 20, 2018, at 11:01 PM, Дарья Меленцова  wrote:
>>
>> Hello.
>>
>> Could you help me with a LEAK DETECTED error during minor compaction?
>>
>> There is a table with a lot of small records, 6.6*10^9 of them (mapping
>> (eventId, boxId) -> cellId).
>> A minor compaction starts and then fails at 99% done with an error:
>>
>> Stacktrace
>> ERROR [Reference-Reaper:1] 2018-02-05 10:06:17,032 Ref.java:207 - LEAK
>> DETECTED: a reference
>> (org.apache.cassandra.utils.concurrent.Ref$State@1ca1bf87) to class
>> org.apache.cassandra.io.util.MmappedSegmentedFile$Cleanup@308695651:/storage1/cassandra_events/data/EventsKeyspace/PerBoxEventSeriesEvents-41847c3049a211e6af50b9221207cca8/tmplink-lb-102593-big-Index.db
>> was not released before the reference was garbage collected
>> ERROR [Reference-Reaper:1] 2018-02-05 10:06:17,033 Ref.java:207 - LEAK
>> DETECTED: a reference
>> (org.apache.cassandra.utils.concurrent.Ref$State@1659d4f7) to class
>> org.apache.cassandra.utils.concurrent.WrappedSharedCloseable$1@1398495320:[Memory@[0..dc),
>> Memory@[0..898)] was not released before the reference was garbage
>> collected
>> ERROR [Reference-Reaper:1] 2018-02-05 10:06:17,033 Ref.java:207 - LEAK
>> DETECTED: a reference
>> (org.apache.cassandra.utils.concurrent.Ref$State@42978833) to class
>> org.apache.cassandra.utils.concurrent.WrappedSharedCloseable$1@1648504648:[[OffHeapBitSet]]
>> was not released before the reference was garbage collected
>> ERROR [Reference-Reaper:1] 2018-02-05 10:06:17,033 Ref.java:207 - LEAK
>> DETECTED: a reference
>> (org.apache.cassandra.utils.concurrent.Ref$State@3a64a19b) to class
>> org.apache.cassandra.io.sstable.format.SSTableReader$DescriptorTypeTidy@863282967:/storage1/cassandra_events/data/EventsKeyspace/PerBoxEventSeriesEvents-41847c3049a211e6af50b9221207cca8/tmplink-lb-102593-big
>> was not released before the reference was garbage collected
>> ERROR [Reference-Reaper:1] 2018-02-05 10:06:17,033 Ref.java:207 - LEAK
>> DETECTED: a reference
>> (org.apache.cassandra.utils.concurrent.Ref$State@4ddc775a) to class
>> org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Cleanup@1041709510:/storage1/cassandra_events/data/EventsKeyspace/PerBoxEventSeriesEvents-41847c3049a211e6af50b9221207cca8/tmplink-lb-102593-big-Data.db
>> was not released before the reference was garbage collected
>>
>> I have tried increasing the max heap size (8GB -> 16GB), but got the same error.
>> How can I resolve the issue?
>>
>>
>> Cassandra parameters and the problem table
>> Cassandra v 2.2.9
>> MAX_HEAP_SIZE="16G"
>> java version "1.8.0_121"
>>
>> compaction = {'min_threshold': '4', 'enabled': 'True', 'class':
>> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
>> 'max_threshold': '32'}
>> compression = {'sstable_compression':
>> 'org.apache.cassandra.io.compress.SnappyCompressor'}
>>
nodetool tablestats
>> Read Count: 1454605
>>Read Latency: 2.0174777647540054 ms.
>>Write Count: 12034909
>>Write Latency: 0.044917336558174224 ms.
>>Pending Flushes: 0
>>Table: PerBoxEventSeriesEventIds
>>SSTable count: 20
>>Space used (live): 885969527458
>>Space used (total): 885981801994
>>Space used by snapshots (total): 0
>>Off heap memory used (total): 19706226232
>>SSTable Compression Ratio: 0.5722091068132875
>>Number of keys (estimate): 6614724684
>>Memtable cell count: 375796
>>Memtable data size: 31073510
>>Memtable off heap memory used: 0
>>Memtable switch count: 30
>>   

Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread DuyHai Doan
For UI and interactive data exploration there is already the Cassandra
interpreter for Apache Zeppelin that is more than decent for the job

On Wed, Feb 21, 2018 at 9:19 AM, Daniel Hölbling-Inzko <
daniel.hoelbling-in...@bitmovin.com> wrote:

> But what does this video really show? That Microsoft managed to run
> Cassandra as a SaaS product with a nice UI?
> Google did that years ago with BigTable and Amazon with DynamoDB.
>
> I agree that we need more tools, but not so much for querying (although
> that would also help a bit), but just in general the project feels
> unapproachable right now.
> Besides the excellent DataStax documentation there is little best practice
> knowledge about how to operate and provision Cassandra clusters.
> Having some recipes for Chef, Puppet or Ansible that show the most common
> settings (or some Cloudfoundry/GCP Templates or Helm Charts) would be
> really useful.
> Also a list of all the projects that Cassandra goes well with (like TLP
> Reaper and Netflix's Priam, etc.)
>
> greetings Daniel
>
> On Wed, 21 Feb 2018 at 07:23 Kenneth Brotman 
> wrote:
>
>> If you watch this video through you'll see why usability is so
>> important.  You can't ignore usability issues.
>>
>> Cassandra does not exist in a vacuum.  The competitors are world class.
>>
>> The video is on the New Cassandra API for Azure Cosmos DB:
>> https://www.youtube.com/watch?v=1Sf4McGN1AQ
>>
>> Kenneth Brotman
>>
>> -Original Message-
>> From: Daniel Hölbling-Inzko [mailto:daniel.hoelbling-in...@bitmovin.com]
>> Sent: Tuesday, February 20, 2018 1:28 AM
>> To: user@cassandra.apache.org; James Briggs
>> Cc: d...@cassandra.apache.org
>> Subject: Re: Cassandra Needs to Grow Up by Version Five!
>>
>> Hi,
>>
>> I have to add my own two cents here as the main thing that keeps me from
>> really running Cassandra is the amount of pain running it incurs.
>> Not so much because it's actually painful but because the tools are so
>> different and the documentation and best practices are scattered across a
>> dozen outdated DataStax articles and this mailing list etc.. We've been
>> hesitant (although our use case is perfect for using Cassandra) to deploy
>> Cassandra to any critical systems as even after a year of running it we
>> still don't have the operational experience to confidently run critical
>> systems with it.
>>
>> Simple things like a foolproof / safe cluster-wide S3 Backup (like
>> Elasticsearch has it) would for example solve a TON of issues for new
>> people. I don't need it auto-scheduled or something, but having to
>> configure cron jobs across the whole cluster is a pain in the ass for small
>> teams.
>> To be honest, even the way snapshots are done right now is already super
>> painful. Every other system I operated so far will just create one backup
>> folder I can export, in C* the Backup is scattered across a bunch of
>> different Keyspace folders etc.. needless to say that it took a while until
>> I trusted my backup scripts fully.
>>
>> And especially for a Database I believe Backup/Restore needs to be a
>> non-issue that's documented front and center. If not smaller teams just
>> don't have the resources to dedicate to learning and building the tools
>> around it.
>>
>> Now that the team is getting larger we could spare the resources to
>> operate these things, but switching from a well-understood RDBMs schema to
>> Cassandra is now incredibly hard and will probably take years.
>>
>> greetings Daniel
>>
>> On Tue, 20 Feb 2018 at 05:56 James Briggs > invalid>
>> wrote:
>>
>> > Kenneth:
>> >
>> > What you said is not wrong.
>> >
>> > Vertica and Riak are examples of distributed databases that don't
>> > require hand-holding.
>> >
>> > Cassandra is for Java-programmer DIYers, or more often Datastax
>> > clients, at this point.
>> > Thanks, James.
>> >
>> > --
>> > *From:* Kenneth Brotman 
>> > *To:* user@cassandra.apache.org
>> > *Cc:* d...@cassandra.apache.org
>> > *Sent:* Monday, February 19, 2018 4:56 PM
>> >
>> > *Subject:* RE: Cassandra Needs to Grow Up by Version Five!
>> >
>> > Jeff, you helped me figure out what I was missing.  It just took me a
>> > day to digest what you wrote.  I’m coming over from another type of
>> > engineering.  I didn’t know and it’s not really documented.  Cassandra
>> > runs in a data center.  Nowadays that means the nodes are going to be
>> > in managed containers, Docker containers, managed by Kubernetes,
>> > Mesos or something, and for that reason anyone operating Cassandra in a
>> > real world setting would not encounter the issues I raised in the way I
>> described.
>> >
>> > Shouldn’t the architectural diagrams people reference indicate that in
>> > some way?  That would have helped me.
>> >
>> > Kenneth Brotman
>> >
>> > *From:* Kenneth Brotman [mailto:kenbrot...@yahoo.com]
>> > *Sent:* Monday, February 19, 2018 10:43 AM
>> > *To:* 

Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Daniel Hölbling-Inzko
But what does this video really show? That Microsoft managed to run
Cassandra as a SaaS product with a nice UI?
Google did that years ago with BigTable and Amazon with DynamoDB.

I agree that we need more tools, but not so much for querying (although
that would also help a bit), but just in general the project feels
unapproachable right now.
Besides the excellent DataStax documentation there is little best practice
knowledge about how to operate and provision Cassandra clusters.
Having some recipes for Chef, Puppet or Ansible that show the most common
settings (or some Cloudfoundry/GCP Templates or Helm Charts) would be
really useful.
Also a list of all the projects that Cassandra goes well with (like TLP
Reaper and Netflix's Priam, etc.)

greetings Daniel

On Wed, 21 Feb 2018 at 07:23 Kenneth Brotman 
wrote:

> If you watch this video through you'll see why usability is so important.
> You can't ignore usability issues.
>
> Cassandra does not exist in a vacuum.  The competitors are world class.
>
> The video is on the New Cassandra API for Azure Cosmos DB:
> https://www.youtube.com/watch?v=1Sf4McGN1AQ
>
> Kenneth Brotman
>
> -Original Message-
> From: Daniel Hölbling-Inzko [mailto:daniel.hoelbling-in...@bitmovin.com]
> Sent: Tuesday, February 20, 2018 1:28 AM
> To: user@cassandra.apache.org; James Briggs
> Cc: d...@cassandra.apache.org
> Subject: Re: Cassandra Needs to Grow Up by Version Five!
>
> Hi,
>
> I have to add my own two cents here as the main thing that keeps me from
> really running Cassandra is the amount of pain running it incurs.
> Not so much because it's actually painful but because the tools are so
> different and the documentation and best practices are scattered across a
> dozen outdated DataStax articles and this mailing list etc.. We've been
> hesitant (although our use case is perfect for using Cassandra) to deploy
> Cassandra to any critical systems as even after a year of running it we
> still don't have the operational experience to confidently run critical
> systems with it.
>
> Simple things like a foolproof / safe cluster-wide S3 Backup (like
> Elasticsearch has it) would for example solve a TON of issues for new
> people. I don't need it auto-scheduled or something, but having to
> configure cron jobs across the whole cluster is a pain in the ass for small
> teams.
> To be honest, even the way snapshots are done right now is already super
> painful. Every other system I operated so far will just create one backup
> folder I can export, in C* the Backup is scattered across a bunch of
> different Keyspace folders etc.. needless to say that it took a while until
> I trusted my backup scripts fully.
>
> And especially for a Database I believe Backup/Restore needs to be a
> non-issue that's documented front and center. If not smaller teams just
> don't have the resources to dedicate to learning and building the tools
> around it.
>
> Now that the team is getting larger we could spare the resources to
> operate these things, but switching from a well-understood RDBMs schema to
> Cassandra is now incredibly hard and will probably take years.
>
> greetings Daniel
>
> On Tue, 20 Feb 2018 at 05:56 James Briggs 
> wrote:
>
> > Kenneth:
> >
> > What you said is not wrong.
> >
> > Vertica and Riak are examples of distributed databases that don't
> > require hand-holding.
> >
> > Cassandra is for Java-programmer DIYers, or more often Datastax
> > clients, at this point.
> > Thanks, James.
> >
> > --
> > *From:* Kenneth Brotman 
> > *To:* user@cassandra.apache.org
> > *Cc:* d...@cassandra.apache.org
> > *Sent:* Monday, February 19, 2018 4:56 PM
> >
> > *Subject:* RE: Cassandra Needs to Grow Up by Version Five!
> >
> > Jeff, you helped me figure out what I was missing.  It just took me a
> > day to digest what you wrote.  I’m coming over from another type of
> > engineering.  I didn’t know and it’s not really documented.  Cassandra
> > runs in a data center.  Nowadays that means the nodes are going to be
> > in managed containers, Docker containers, managed by Kubernetes,
> > Mesos or something, and for that reason anyone operating Cassandra in a
> > real world setting would not encounter the issues I raised in the way I
> described.
> >
> > Shouldn’t the architectural diagrams people reference indicate that in
> > some way?  That would have helped me.
> >
> > Kenneth Brotman
> >
> > *From:* Kenneth Brotman [mailto:kenbrot...@yahoo.com]
> > *Sent:* Monday, February 19, 2018 10:43 AM
> > *To:* 'user@cassandra.apache.org'
> > *Cc:* 'd...@cassandra.apache.org'
> > *Subject:* RE: Cassandra Needs to Grow Up by Version Five!
> >
> > Well said.  Very fair.  I wouldn’t mind hearing from others still.
> > You’re a good guy!
> >
> > Kenneth Brotman
> >
> > *From:* Jeff Jirsa [mailto:jji...@gmail.com ]
> > *Sent:* Monday, February 19, 2018 9:10 AM
> > *To:*