Re: Cassandra Needs to Grow Up by Version Five!

2018-02-20 Thread Prasenjit Sarkar
Jeff,

I don't think you can push the topic of usability back to developers by
asking them to open JIRAs. It is upon the technical leaders of the
Cassandra community to take the initiative in this regard. We can argue
back and forth about the dynamics of open source projects, but the usability
concerns around Cassandra are a reality that cannot be ignored.

Prasenjit

PS My views, not those of my employer

On Tue, Feb 20, 2018 at 10:22 PM, Kenneth Brotman <
kenbrot...@yahoo.com.invalid> wrote:

> If you watch this video through you'll see why usability is so important.
> You can't ignore usability issues.
>
> Cassandra does not exist in a vacuum.  The competitors are world class.
>
> The video is on the New Cassandra API for Azure Cosmos DB:
> https://www.youtube.com/watch?v=1Sf4McGN1AQ
>
> Kenneth Brotman
>
> -Original Message-
> From: Daniel Hölbling-Inzko [mailto:daniel.hoelbling-in...@bitmovin.com]
> Sent: Tuesday, February 20, 2018 1:28 AM
> To: user@cassandra.apache.org; James Briggs
> Cc: d...@cassandra.apache.org
> Subject: Re: Cassandra Needs to Grow Up by Version Five!
>
> Hi,
>
> I have to add my own two cents here, as the main thing that keeps me from
> really running Cassandra is the amount of pain running it incurs.
> Not so much because it's actually painful, but because the tools are so
> different and the documentation and best practices are scattered across a
> dozen outdated DataStax articles, this mailing list, etc. We've been
> hesitant (although our use case is a perfect fit for Cassandra) to deploy
> Cassandra to any critical systems, as even after a year of running it we
> still don't have the operational experience to confidently run critical
> systems with it.
>
> Simple things like a foolproof / safe cluster-wide S3 backup (like
> Elasticsearch has) would, for example, solve a TON of issues for new
> people. I don't need it auto-scheduled or anything, but having to
> configure cron jobs across the whole cluster is a pain in the ass for small
> teams.
> To be honest, even the way snapshots are done right now is already super
> painful. Every other system I have operated so far just creates one backup
> folder I can export; in C* the backup is scattered across a bunch of
> different keyspace folders, etc. Needless to say, it took a while until
> I trusted my backup scripts fully.
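>
> To make "scattered" concrete, here is a minimal sketch of what collecting a
> snapshot looks like (assuming the default /var/lib/cassandra/data layout;
> the tag name is just a placeholder):
>
>   # take a snapshot on this node
>   nodetool snapshot -t backup_2018_02_20
>   # the result is one snapshot directory per table, spread across keyspaces
>   find /var/lib/cassandra/data -type d -path '*/snapshots/backup_2018_02_20'
>   # each of those directories has to be collected and shipped (e.g. to S3) separately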
>
> And especially for a database, I believe backup/restore needs to be a
> non-issue that's documented front and center. If not, smaller teams just
> don't have the resources to dedicate to learning and building the tools
> around it.
>
> Now that the team is getting larger we could spare the resources to
> operate these things, but switching from a well-understood RDBMS schema to
> Cassandra is now incredibly hard and will probably take years.
>
> greetings Daniel
>
> On Tue, 20 Feb 2018 at 05:56 James Briggs 
> wrote:
>
> > Kenneth:
> >
> > What you said is not wrong.
> >
> > Vertica and Riak are examples of distributed databases that don't
> > require hand-holding.
> >
> > Cassandra is for Java-programmer DIYers, or more often Datastax
> > clients, at this point.
> > Thanks, James.
> >
> > --
> > *From:* Kenneth Brotman 
> > *To:* user@cassandra.apache.org
> > *Cc:* d...@cassandra.apache.org
> > *Sent:* Monday, February 19, 2018 4:56 PM
> >
> > *Subject:* RE: Cassandra Needs to Grow Up by Version Five!
> >
> > Jeff, you helped me figure out what I was missing.  It just took me a
> > day to digest what you wrote.  I’m coming over from another type of
> > engineering.  I didn’t know, and it’s not really documented, that Cassandra
> > runs in a data center.  Nowadays that means the nodes are going to be
> > in managed containers, Docker containers managed by Kubernetes,
> > Mesos or something, and for that reason anyone operating Cassandra in a
> > real-world setting would not encounter the issues I raised in the way I
> > described.
> >
> > Shouldn’t the architectural diagrams people reference indicate that in
> > some way?  That would have helped me.
> >
> > Kenneth Brotman
> >
> > *From:* Kenneth Brotman [mailto:kenbrot...@yahoo.com]
> > *Sent:* Monday, February 19, 2018 10:43 AM
> > *To:* 'user@cassandra.apache.org'
> > *Cc:* 'd...@cassandra.apache.org'
> > *Subject:* RE: Cassandra Needs to Grow Up by Version Five!
> >
> > Well said.  Very fair.  I wouldn’t mind hearing from others still.
> > You’re a good guy!
> >
> > Kenneth Brotman
> >
> > *From:* Jeff Jirsa [mailto:jji...@gmail.com ]
> > *Sent:* Monday, February 19, 2018 9:10 AM
> > *To:* cassandra
> > *Cc:* Cassandra DEV
> > *Subject:* Re: Cassandra Needs to Grow Up by Version Five!
> >
> > There's a lot of things below I disagree with, but it's ok. I
> > convinced myself not to nit-pick every point.
> >
> > https://issues.apache.org/jira/browse/CASSANDRA-13971 has some of
> > Stefan's work with cert management
> >
> > Beyond that, I encourage 

Re: Backups eating up disk space

2017-01-11 Thread Prasenjit Sarkar
Hi Kunal,

Razi's post gives a very lucid description of how Cassandra manages the
hard links inside the backup directory.

Where it needs clarification is the following:
--> incremental backup is a system-wide setting, so it is an all-or-nothing
approach

--> as multiple people have stated, incremental backups do not create hard
links to compacted sstables; however, the un-compacted sstables that do get
linked can still bloat the size of your backups

--> again, as stated, it is general industry practice to place backups in a
secondary storage location separate from the main production site, so it is
best to move them to secondary storage before applying rm to the backups
folder (a rough sketch follows below)
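
A rough per-node sketch of that ordering (the bucket name and the table
directory are placeholders, not a recommendation of any particular tool):

  # ship the incremental backups for one table offsite first...
  aws s3 sync /var/lib/cassandra/data/my_keyspace/my_table-abc123/backups/ \
      s3://my-backup-bucket/$(hostname)/my_keyspace/my_table/
  # ...and only then reclaim the local space
  rm -rf /var/lib/cassandra/data/my_keyspace/my_table-abc123/backups/*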

In my experience with production clusters, managing the backups folder
across multiple nodes can be painful if the objective is to ever recover
data. With the usual disclaimers, it is better to rely on third-party vendors
to accomplish this rather than on scripts or tablesnap.

Regards
Prasenjit

On Wed, Jan 11, 2017 at 7:49 AM, Khaja, Raziuddin (NIH/NLM/NCBI) [C] <
raziuddin.kh...@nih.gov> wrote:

> Hello Kunal,
>
>
>
> Caveat: I am not a super-expert on Cassandra, but it helps to explain things
> to others in order to eventually become an expert, so if my explanation is
> wrong, I would hope others would correct me. :)
>
>
>
> The active sstables/data files are all the files located in the
> directory for the table.
>
> You can safely remove all files under the backups/ directory and the
> directory itself.
>
> Removing any files that are currently hard-linked inside backups/ won’t cause
> any issues, and I will explain why.
>
>
>
> Have you looked at your cassandra.yaml file and checked the setting for
> incremental_backups?  If it is set to true and you don’t want to make new
> backups, you can set it to false, so that after you clean up, you will not
> have to clean up the backups again.
>
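> For reference, a minimal sketch of how that is controlled (the values shown
> are only for illustration):
>
>   # cassandra.yaml
>   incremental_backups: false
>
>   # or, if your nodetool version has it, toggle it on the running node
>   # without a restart (it reverts to the yaml value on restart):
>   nodetool disablebackup
>   nodetool statusbackup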
>
>
> Explanation:
>
> Let’s look at the definition of incremental backups again: “Cassandra
> creates a hard link to each SSTable flushed or streamed locally in
> a backups subdirectory of the keyspace data.”
>
>
>
> Suppose we have a directory path: my_keyspace/my_table-some-uuid/backups/
>
> In the rest of the discussion, when I refer to “table directory”, I
> explicitly mean the directory: my_keyspace/my_table-some-uuid/
>
> When I refer to backups/ directory, I explicitly mean:
> my_keyspace/my_table-some-uuid/backups/
>
>
>
> Suppose that you have an sstable-A that was either flushed from a memtable
> or streamed from another node.
>
> At this point, you have a hardlink to sstable-A in your table directory,
> and a hardlink to sstable-A in your backups/ directory.
>
> Suppose that you have another sstable-B that was also either flushed from
> a memtable or streamed from another node.
>
> At this point, you have a hardlink to sstable-B in your table directory,
> and a hardlink to sstable-B in your backups/ directory.
>
>
>
> Next, suppose compaction were to occur, where say sstable-A and sstable-B
> would be compacted to produce sstable-C, representing all the data from A
> and B.
>
> Now, sstable-C will live in your main table directory, and the hardlinks
> to sstable-A and sstable-B will be deleted from the main table directory, but
> sstable-A and sstable-B will continue to exist in backups/.
>
> At this point, in your main table directory, you will have a hardlink to
> sstable-C. In your backups/ directory you will have hardlinks to sstable-A,
> and sstable-B.
>
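> (If hard links are new to you, this is plain filesystem behavior, not
> anything Cassandra-specific; you can reproduce it in a scratch directory:)
>
>   mkdir -p demo/backups && cd demo
>   echo data > sstable-A
>   ln sstable-A backups/sstable-A   # second hard link, no data is copied
>   rm sstable-A                     # like compaction removing it from the table dir
>   cat backups/sstable-A            # still readable; disk space is freed only
>                                    # when the last remaining link is removed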
>
>
> Thus, your main table directory is not cluttered with old un-compacted
> sstables, and only has the sstables along with other files that are
> actively being used.
>
>
>
> To drive the point home, …
>
> Suppose that you have another sstable-D that was either flushed from a
> memtable or streamed from another node.
>
> At this point, in your main table directory, you will have sstable-C and
> sstable-D. In your backups/ directory you will have hardlinks to sstable-A,
> sstable-B, and sstable-D.
>
>
>
> Next, suppose compaction were to occur where say sstable-C and sstable-D
> would be compacted to produce sstable-E, representing all the data from C
> and D.
>
> Now, sstable-E will live in your main table directory, and the hardlinks
> to sstable-C and sstable-D will be deleted from the main table directory, but
> sstable-D will continue to exist in backups/ (sstable-C never had a link in
> backups/, since it was produced by compaction).
>
> At this point, in your main table directory, you will have a hardlink to
> sstable-E. In your backups/ directory you will have hardlinks to sstable-A,
> sstable-B and sstable-D.
>
>
>
> As you can see, the backups/ directory quickly accumulates all of the
> un-compacted sstables and progressively uses up more and more space.
>
> Also, note that the backups/ directory does not contain sstables generated
> by compaction, such as sstable-C and sstable-E.
>
> It is safe to delete the entire backups/ directory because all the data is
> represented in the compacted sstable-E.
>
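> If you want to double-check on a node before deleting anything, the hard
> link counts tell the story (assuming the default data directory; the table
> directory name below is the same placeholder used above):
>
>   cd /var/lib/cassandra/data/my_keyspace/my_table-some-uuid
>   ls -li ./*-Data.db          # a link count above 1 means the file is also
>                               # linked from backups/ or a snapshot
>   ls -li backups/*-Data.db    # a link count of 1 here means the data now
>                               # exists only as a backup (it was compacted away)
>   rm -r backups/              # frees space only for files whose last link was here
>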
> I hope this explanation was clear and gives you confidence in using rm to
> delete 

Re: What is the merit of incremental backup

2016-07-14 Thread Prasenjit Sarkar
Hi Satoshi

You are correct that incremental backups offer you the opportunity to
reduce the amount of data you need to transfer offsite. On the recovery
path, you need to piece together the full backup and subsequent incremental
backups.
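
For example, a rough sketch of that piecing-together on a single node
(the paths, table directory, and snapshot tag are placeholders; exact steps
vary with version and topology):

  TABLE_DIR=/var/lib/cassandra/data/my_keyspace/my_table-abc123
  TAG=weekly_full
  # start from the last full snapshot...
  cp "$TABLE_DIR"/snapshots/"$TAG"/* "$TABLE_DIR"/
  # ...then layer on the incremental backups (in practice, only those taken
  # after that snapshot)
  cp "$TABLE_DIR"/backups/* "$TABLE_DIR"/
  # pick up the copied sstables without restarting the node
  nodetool refresh my_keyspace my_table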

However, where incremental backups help is with respect to the RTO, due to
the data reduction effect you mentioned. The RPO can be reduced only if you
take incremental backups more frequently than you take full backups.

Hope this helps,
Prasenjit

On Wed, Jul 13, 2016 at 11:54 PM, Satoshi Hikida  wrote:

> Hi,
>
> I want to know the actual advantage of using incremental backup.
>
> I've read through the DataStax documentation, and it says the merits of
> using incremental backups are as follows:
>
> - It allows storing backups offsite without transferring entire snapshots
> - With incremental backups and snapshots, you can get a more recent RPO
> (Recovery Point Objective)
>
> Is my understanding correct? I would appreciate it if someone could give me
> some advice or correct me.
>
> References:
> - DataStax, "Enabling incremental backups",
> http://docs.datastax.com/en/cassandra/2.2/cassandra/operations/opsBackupIncremental.html
>
> Regards,
> Satoshi
>