Re: [jira] [Commented] (CASSANDRA-9104) Unit test failures, trunk + Windows

2015-04-24 Thread Branimir Lambov
+1
On 24 Apr 2015 21:34, Joshua McKenzie (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/CASSANDRA-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14511672#comment-14511672
 ]

 Joshua McKenzie commented on CASSANDRA-9104:
 

 Nope - you're totally right. Simple patch to remove the redundant recover
 calls [here|
 https://github.com/apache/cassandra/compare/trunk...josh-mckenzie:9104_followup_simple].
 Tested on ci box and this fixes the problem as well (with a much smaller
 hammer).

  Unit test failures, trunk + Windows
  ---
 
  Key: CASSANDRA-9104
  URL:
 https://issues.apache.org/jira/browse/CASSANDRA-9104
  Project: Cassandra
   Issue Type: Test
 Reporter: Joshua McKenzie
 Assignee: Joshua McKenzie
   Labels: Windows
  Fix For: 3.0
 
  Attachments: 9104_CFSTest.txt, 9104_KeyCache.txt,
 9104_KeyCache_ScrubTest_v2.txt, 9104_RecoveryManager.txt,
 9104_RecoveryManager_v2.txt, 9104_ScrubTest.txt
 
 
  A variety of test failures have cropped up over the past 2-3 weeks:
  h6. -org.apache.cassandra.cql3.UFTest FAILED (timeout)- // No longer
 failing / timing out
  h6.
 testLoadNewSSTablesAvoidsOverwrites(org.apache.cassandra.db.ColumnFamilyStoreTest):
  FAILED
  {noformat}
 12 SSTables unexpectedly exist
 junit.framework.AssertionFailedError: 12 SSTables unexpectedly exist
 at
 org.apache.cassandra.db.ColumnFamilyStoreTest.testLoadNewSSTablesAvoidsOverwrites(ColumnFamilyStoreTest.java:1896)
  {noformat}
  h6. org.apache.cassandra.db.KeyCacheTest FAILED
  {noformat}
 expected:<4> but was:<2>
 junit.framework.AssertionFailedError: expected:<4> but was:<2>
 at
 org.apache.cassandra.db.KeyCacheTest.assertKeyCacheSize(KeyCacheTest.java:221)
 at
 org.apache.cassandra.db.KeyCacheTest.testKeyCache(KeyCacheTest.java:181)
  {noformat}
  h6. RecoveryManagerTest:
  {noformat}
 org.apache.cassandra.db.RecoveryManagerTest FAILED
 org.apache.cassandra.db.RecoveryManager2Test FAILED
 org.apache.cassandra.db.RecoveryManager3Test FAILED
 org.apache.cassandra.db.RecoveryManagerTruncateTest FAILED
 All are the following:
java.nio.file.AccessDeniedException:
 build\test\cassandra\commitlog;0\CommitLog-5-1427995105229.log
FSWriteError in
 build\test\cassandra\commitlog;0\CommitLog-5-1427995105229.log
   at
 org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:128)
   at
 org.apache.cassandra.db.commitlog.CommitLogSegmentManager.recycleSegment(CommitLogSegmentManager.java:360)
   at
 org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:156)
   at
 org.apache.cassandra.db.RecoveryManagerTest.testNothingToRecover(RecoveryManagerTest.java:75)
Caused by: java.nio.file.AccessDeniedException:
 build\test\cassandra\commitlog;0\CommitLog-5-1427995105229.log
   at
 sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:83)
   at
 sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
   at
 sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
   at
 sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
   at
 sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
   at java.nio.file.Files.delete(Files.java:1079)
   at
 org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:124)
  {noformat}
  h6. testScrubCorruptedCounterRow(org.apache.cassandra.db.ScrubTest):
 FAILED
  {noformat}
  Expecting new size of 1, got 2 while replacing
 [BigTableReader(path='C:\src\refCassandra\build\test\cassandra\data;0\Keyspace1\Counter1-deab62b2d95c11e489c6e117fe147c1d\la-1-big-Data.db')]
 by
 [BigTableReader(path='C:\src\refCassandra\build\test\cassandra\data;0\Keyspace1\Counter1-deab62b2d95c11e489c6e117fe147c1d\la-1-big-Data.db')]
 in View(pending_count=0,
 sstables=[BigTableReader(path='C:\src\refCassandra\build\test\cassandra\data;0\Keyspace1\Counter1-deab62b2d95c11e489c6e117fe147c1d\la-3-big-Data.db')],
 compacting=[])
  junit.framework.AssertionFailedError: Expecting new size of 1, got 2
 while replacing
 [BigTableReader(path='C:\src\refCassandra\build\test\cassandra\data;0\Keyspace1\Counter1-deab62b2d95c11e489c6e117fe147c1d\la-1-big-Data.db')]
 by
 [BigTableReader(path='C:\src\refCassandra\build\test\cassandra\data;0\Keyspace1\Counter1-deab62b2d95c11e489c6e117fe147c1d\la-1-big-Data.db')]
 in View(pending_count=0,
 sstables=[BigTableReader(path='C:\src\refCassandra\build\test\cassandra\data;0\Keyspace1\Counter1-deab62b2d95c11e489c6e117fe147c1d\la-3-big-Data.db')],
 compacting=[])
 at
 org.apache.cassandra.db.DataTracker$View.replace(DataTracker.java:767)

Re: [VOTE] CEP-11: Pluggable memtable implementations

2021-08-24 Thread Branimir Lambov
Vote passes with 7 binding and 4 non-binding +1 votes and no vetoes.

Thank you all. JIRA ticket will be opened soon.

Regards,
Branimir

On Fri, Aug 20, 2021 at 10:41 AM Sam Tunnicliffe  wrote:

> +1
>
> > On 19 Aug 2021, at 17:10, Branimir Lambov  wrote:
> >
> > Hello everyone,
> >
> > I am proposing the CEP-11 (Pluggable memtable implementations) for
> adoption
> >
> > Discussion thread:
> >
> https://lists.apache.org/thread.html/rb5e950f882196764744c31bc3c13dfbf0603cb9f8bc2f6cfb976d285%40%3Cdev.cassandra.apache.org%3E
> >
> >
> > The vote will be open for 72 hours.
> > Votes by PMC members are considered binding.
> > A vote passes if there are at least three binding +1s and no binding
> vetoes.
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>


[VOTE] CEP-11: Pluggable memtable implementations

2021-08-19 Thread Branimir Lambov
Hello everyone,

I am proposing the CEP-11 (Pluggable memtable implementations) for adoption

Discussion thread:
https://lists.apache.org/thread.html/rb5e950f882196764744c31bc3c13dfbf0603cb9f8bc2f6cfb976d285%40%3Cdev.cassandra.apache.org%3E


The vote will be open for 72 hours.
Votes by PMC members are considered binding.
A vote passes if there are at least three binding +1s and no binding vetoes.


Re: [VOTE] Release Apache Cassandra 4.0.0 (take2)

2021-07-15 Thread Branimir Lambov
+1

On Thu, Jul 15, 2021 at 12:55 AM Scott Andreas  wrote:

> +1nb.
>
> Thank you for sharing a Circle run, Sumanth!
>
> 
> From: Sumanth Pasupuleti 
> Sent: Wednesday, July 14, 2021 12:52 PM
> To: dev@cassandra.apache.org
> Subject: Re: [VOTE] Release Apache Cassandra 4.0.0 (take2)
>
> +1 (nb)
> Confirmed passing j8 UTs and dtests
>
> https://app.circleci.com/pipelines/github/sumanth-pasupuleti/cassandra/77/workflows/7b0ad00d-7ae3-41d2-b1a7-82fa63b7
>
> On Wed, Jul 14, 2021 at 11:03 AM Jeremy Hanna 
> wrote:
>
> > +1 (nb)
> >
> > > On Jul 15, 2021, at 3:42 AM, Blake Eggleston
> >  wrote:
> > >
> > > +1
> > >
> > >> On Jul 14, 2021, at 8:21 AM, Aleksey Yeschenko 
> > wrote:
> > >>
> > >> +1
> > >>
> > >>>> On 14 Jul 2021, at 15:37, Jonathan Ellis  wrote:
> > >>>
> > >>> +1
> > >>>
> > >>>> On Tue, Jul 13, 2021 at 5:14 PM Mick Semb Wever 
> > wrote:
> > >>>>
> > >>>> Proposing the test build of Cassandra 4.0.0 for release.
> > >>>>
> > >>>> sha1: 924bf92fab1820942137138c779004acaf834187
> > >>>> Git:
> > >>>>
> >
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/4.0.0-tentative
> > >>>> Maven Artifacts:
> > >>>>
> > >>>>
> >
> https://repository.apache.org/content/repositories/orgapachecassandra-1242/org/apache/cassandra/cassandra-all/4.0.0/
> > >>>>
> > >>>> The Source and Build Artifacts, and the Debian and RPM packages and
> > >>>> repositories, are available here:
> > >>>> https://dist.apache.org/repos/dist/dev/cassandra/4.0.0/
> > >>>>
> > >>>> The vote will be open for 72 hours (longer if needed). Everyone who
> > >>>> has tested the build is invited to vote. Votes by PMC members are
> > >>>> considered binding. A vote passes if there are at least three
> binding
> > >>>> +1s and no -1's.
> > >>>>
> > >>>> [1]: CHANGES.txt:
> > >>>>
> > >>>>
> >
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/4.0.0-tentative
> > >>>> [2]: NEWS.txt:
> > >>>>
> >
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/4.0.0-tentative
> > >>>>
> > >>>>
> -
> > >>>>
> > >>>>
> > >>>
> > >>> --
> > >>> Jonathan Ellis
> > >>> co-founder, http://www.datastax.com
> > >>> @spyced
> > >>
> > >>
> > >> -
> > >>
> > >
> > > -
> > >
> >
> > -
> >
> >
>


-- 
Branimir Lambov
e. branimir.lam...@datastax.com
w. www.datastax.com


Re: [DISCUSS] CEP-17: SSTable format API (CASSANDRA-17056)

2021-11-15 Thread Branimir Lambov
01 PM David Capwell
> >>  >>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>>> We already have many interfaces similar to these for Compaction
> >>>>>>> Strategy, Indexing, Query Handler.
> >>>>>>>
> >>>>>>> Today-I-Learned QueryHandler is not allowed to be touched in a
> minor…
> >>>> good
> >>>>>>> to know…
> >>>>>>>
> >>>>>>>> not trunk -> try not to change these interfaces
> >>>>>>>
> >>>>>>> Outside of MBeans, I honestly do not know what interfaces fall into
> >>>> this
> >>>>>>> group; and for MBeans we have tests which block breaking changes.
> >> The
> >>>>>>> point I am making is that not everyone is aware of the rules, so
> >> having
> >>>>>>> something in place to help enforce such rules should be thought
> >> about;
> >>>> if
> >>>>>>> we want to add pluggable hooks with the intent that external
> parties
> >>>> can
> >>>>>>> leverage such hooks, we should also add to the scope the
> maintenance
> >> of
> >>>>>>> these interfaces (we should not assume “tribal knowledge” will
> work).
> >>>>>>>
> >>>>>>> I am not trying to ask for something large or something requiring a
> >>>> ton of
> >>>>>>> work, I am just asking that this gets thought about during the
> >> project
> >>>> so
> >>>>>>> it doesn’t get neglected.  This could be as simple as an annotation
> >>>> like
> >>>>>>> @ExposedTo3rdParties (Hadoop does this to show an interface is
> >> exposed
> >>>> and
> >>>>>>> must be maintained), or it could be something like split
> directories
> >>>>>>> (src/java = private, src/java-exposed = public); I am trying not to
> >>>> dictate
> >>>>>>> an implementation, only trying to make sure we are setup to support
> >>>> the CEP
> >>>>>>> after the work is done.
> >>>>>>>
> >>>>>>>
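The marker-annotation idea David floats above (Hadoop's real-world equivalent is its InterfaceAudience/InterfaceStability annotations) could be sketched roughly as follows. The annotation name is the one suggested in the thread; everything else here is illustrative, not actual Cassandra code:

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Marker for interfaces third parties may implement; tooling can audit
// the annotated API surface for breaking changes between releases.
@Documented
@Retention(RetentionPolicy.RUNTIME) // keep visible to compatibility-checking tools
@Target({ElementType.TYPE, ElementType.METHOD})
@interface ExposedTo3rdParties {}

// Hypothetical exposed interface: changes to it would be gated by
// compatibility checks keyed off the annotation.
@ExposedTo3rdParties
interface PluggableFormat {
    String name();
}

public class AnnotationDemo {
    public static void main(String[] args) {
        boolean exposed = PluggableFormat.class.isAnnotationPresent(ExposedTo3rdParties.class);
        System.out.println("PluggableFormat exposed: " + exposed); // prints: true
    }
}
```

A build-time check could then scan the classpath and fail the build if an annotated type's signature changed in a minor release.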
> >>>>>>>> On Nov 9, 2021, at 9:52 AM, Jeremiah D Jordan <
> >>>> jeremiah.jor...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> We already have many interfaces similar to these for Compaction
> >>>>>>> Strategy, Indexing, Query Handler.  I would hope that commiters are
> >>>> already
> >>>>>>> following a policy along the lines of trunk -> anything goes, not
> >>>> trunk ->
> >>>>>>> try not to change these interfaces.  I would expect that to be the
> >> same
> >>>>>>> policy for any new internal interfaces that are added.  But given
> we
> >>>>>>> already have many such interfaces, I see no reason to block adding
> >>>> more of
> >>>>>>> them while change policies are discussed.
> >>>>>>>>
> >>>>>>>> -Jeremiah
> >>>>>>>>
> >>>>>>>>> On Nov 9, 2021, at 10:44 AM, David Capwell
> >>>> 
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> I still have one outstanding comment, but this is a comment for
> >>>> several
> >>>>>>> of the CEPs being worked on
> >>>>>>>>>
> >>>>>>>>>> And last comment, which I have also done in the other modularity
> >>>>>>> thread… backwards compatibility and maintenance. It is not clear
> >> right
> >>>> now
> >>>>>>> what java interfaces may not break and how we can maintain and
> extend
> >>>> such
> >>>>>>> interfaces in the future.  If the goal is to allow 3rd parties to
> >>>> plugin
> >>>>>>> and offer new SSTable formats, are we as a project ok with having a
> >>>> minor
> >>>>>>> release do a binary or source non-compatible change?  If not how do
> >> we
> >>>

[VOTE] CEP-17: SSTable format API

2021-11-15 Thread Branimir Lambov
Hi everyone,

I would like to start a vote on this CEP.

Proposal:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-17%3A+SSTable+format+API

Discussion:
https://lists.apache.org/thread.html/r636bebcab4e678dbee042285449193e8e75d3753200a1b404fcc7196%40%3Cdev.cassandra.apache.org%3E

The vote will be open for 72 hours.
A vote passes if there are at least three binding +1s and no binding vetoes.

Regards,
Branimir


Re: [VOTE] CEP-17: SSTable format API

2021-11-22 Thread Branimir Lambov
The vote passes with 13 +1 votes and no -1 votes.

Thanks to everyone.

Regards,
Branimir

On Thu, Nov 18, 2021 at 8:46 PM Dinesh Joshi  wrote:

> +1
>
> > On Nov 17, 2021, at 1:22 AM, Benjamin Lerer  wrote:
> >
> > +1
> >
> > Le mar. 16 nov. 2021 à 18:05, Joshua McKenzie  a
> > écrit :
> >
> >> +1
> >>
> >> On Tue, Nov 16, 2021 at 10:14 AM Andrés de la Peña <
> adelap...@apache.org>
> >> wrote:
> >>
> >>> +1
> >>>
> >>> On Tue, 16 Nov 2021 at 08:39, Sam Tunnicliffe  wrote:
> >>>
> >>>> +1
> >>>>
> >>>>> On 15 Nov 2021, at 19:42, Branimir Lambov 
> >> wrote:
> >>>>>
> >>>>> Hi everyone,
> >>>>>
> >>>>> I would like to start a vote on this CEP.
> >>>>>
> >>>>> Proposal:
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-17%3A+SSTable+format+API
> >>>>>
> >>>>> Discussion:
> >>>>>
> >>>>
> >>>
> >>
> https://lists.apache.org/thread.html/r636bebcab4e678dbee042285449193e8e75d3753200a1b404fcc7196%40%3Cdev.cassandra.apache.org%3E
> >>>>>
> >>>>> The vote will be open for 72 hours.
> >>>>> A vote passes if there are at least three binding +1s and no binding
> >>>> vetoes.
> >>>>>
> >>>>> Regards,
> >>>>> Branimir
> >>>>
> >>>>
> >>>> -
> >>>>
> >>>>
> >>>
> >>
>
>
> -
>
>


Re: [DISCUSS] CEP-17: SSTable format API (CASSANDRA-17056)

2021-11-09 Thread Branimir Lambov
Does anyone have any further comments or questions on the proposal, or are
we ready to move forward to a vote?

Regards,
Branimir

On Tue, Nov 2, 2021 at 7:15 PM David Capwell 
wrote:

> > I apologize I did not mention those things explicitly. All the places
> where
> > sstable files are accessed directly would have to be refactored.
>
> Works for me
>
> > Speaking about the implementation, one idea I was thinking about was that
> > the factories for formats are registered using Java's native service
> > loader.
>
> I am a fan of ServiceLoader as a means of plugging in.
>
> > I hope this explains a bit
>
> Yep; thanks!
>
> > On Nov 2, 2021, at 1:46 AM, Jacek Lewandowski <
> lewandowski.ja...@gmail.com> wrote:
> >
> > David,
> >
> > I apologize I did not mention those things explicitly. All the places
> where
> > sstable files are accessed directly would have to be refactored.
> >
> > Regarding TableMetrics - currently it includes many metrics, some of them
> > are unrelated to sstables at all, but there are metrics which are
> specific
> > to the current sstable format, like metrics related to index summaries or
> > bloom filters. The created gauges query certain methods on sstable
> reader -
> > I think the only common metrics for sstables we can leave in TableMetrics
> > are those for which there are query methods in generic sstable interface.
> > Other metrics, specific to the certain sstable format should be
> registered
> > by the implementation itself.
> >
> > Speaking about the implementation, one idea I was thinking about was that
> > the factories for formats are registered using Java's native service
> > loader. This way we could get the list of all the factories on the
> > classpath and call some method, like `registerMetrics` during system
> > initialization. That could be also implemented in static initializer in
> the
> > factory but it would make it less obvious for the implementors where such
> > initialization should be done.
> >
> > I hope this explains a bit
> >
> > Thanks,
> > Jacek
>
>
> -
>
>
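The service-loader registration Jacek describes above could be sketched roughly like this. The factory interface and class names are illustrative placeholders, not Cassandra's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical factory interface; real implementations would expose
// reader/writer creation and a registerMetrics() hook.
interface SSTableFormatFactory {
    String formatName();
}

public class FormatDiscovery {
    // Implementations are declared in provider-configuration files under
    // META-INF/services/ on the classpath; ServiceLoader finds them all.
    static List<SSTableFormatFactory> discover() {
        List<SSTableFormatFactory> found = new ArrayList<>();
        for (SSTableFormatFactory f : ServiceLoader.load(SSTableFormatFactory.class))
            found.add(f); // system init could invoke per-format setup here
        return found;
    }

    public static void main(String[] args) {
        // With no provider files on the classpath, discovery yields nothing.
        System.out.println("formats discovered: " + discover().size());
    }
}
```

This keeps format registration declarative: dropping a jar with a provider file onto the classpath is enough to make a new format visible at system initialization.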


Re: [DISCUSS] CEP-17: SSTable format API (CASSANDRA-17056)

2021-11-01 Thread Branimir Lambov
As Jacek is not a committer, this proposal needs a shepherd. I would be
happy to take this role.

> to me the interfaces has to be at the SSTable level, which then expose
readers/writers, but also has to expose the other things we do outside of
those paths

Could you give some detail on what these things are? Are they something
different from what the standalone Cassandra tools (scrub/verify/upgrade)
are currently doing? Obviously, any pluggability proposal will have to
provide a solution to these, and it would be helpful to know what needs to
be done beyond making sure the bundled tools work correctly (which includes
iterating indexes; format-specific operations (e.g. index summary
redistribution) are excluded as they are to be handled by the individual
format).

There is another problem in the current code alluded to in the question, in
the fact that "SSTableReader" (tied to the sstable format and ready for
querying data (i.e. with open data files and bloom filters loaded in
memory)) is the only concept that the code uses to work with sstables. As I
understand it, this proposal does not aim to solve that problem, only to
make sure that we can properly read and write sstables of a given format,
including in streaming and standalone tools. In other words, to provide the
machinery to convert sstable descriptors into sstable readers and writers.
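The "descriptor to reader/writer" machinery mentioned above could be sketched as a simple dispatch keyed on the format name. All names here (FormatDispatch, SSTableFormat, Descriptor) are illustrative, not Cassandra's actual classes:

```java
import java.util.HashMap;
import java.util.Map;

public class FormatDispatch {
    // A format would expose reader/writer factories; name() suffices here.
    interface SSTableFormat {
        String name(); // e.g. "big"
    }

    // A descriptor identifies an sstable on disk, including its format name.
    static final class Descriptor {
        final String formatName;
        final String path;
        Descriptor(String formatName, String path) {
            this.formatName = formatName;
            this.path = path;
        }
    }

    private final Map<String, SSTableFormat> formats = new HashMap<>();

    void register(SSTableFormat f) { formats.put(f.name(), f); }

    // Resolve a descriptor to the format implementation able to open it.
    SSTableFormat formatFor(Descriptor d) {
        SSTableFormat f = formats.get(d.formatName);
        if (f == null)
            throw new IllegalArgumentException("unknown sstable format: " + d.formatName);
        return f;
    }

    public static void main(String[] args) {
        FormatDispatch dispatch = new FormatDispatch();
        dispatch.register(() -> "big"); // lambda stands in for a real format
        Descriptor d = new Descriptor("big", "/data/ks/tbl/big-1-Data.db");
        System.out.println(dispatch.formatFor(d).name()); // prints: big
    }
}
```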

I see this as an expansion of CASSANDRA-7443 and cleanup of any changes
that came after it and broke the intended capability.

Regards,
Branimir

On Thu, Oct 28, 2021 at 7:43 PM David Capwell 
wrote:

> Sorry about that; used -1/+1 to show preference, not binding action
>
> > On Oct 28, 2021, at 5:50 AM, bened...@apache.org wrote:
> >
> >> I am -1 here, for the reasons listed above; the problem (in my eye) is
> not reader/writer but higher level at the actual SSTable.  If we plug out
> read/write but still allow direct file access, then these abstractions fail
> to provide the goals of the CEP.
> >
> > Be careful dropping -1s, as your -1s here are binding. I realise this
> isn’t a vote thread, but the effect is the same. IMO we should try to
> express our preferences and defer to the collective opinion where possible.
> True -1s should very rarely appear.
> >
> >
> > From: David Capwell 
> > Date: Wednesday, 27 October 2021 at 15:33
> > To: dev@cassandra.apache.org 
> > Subject: Re: [DISCUSS] CEP-17: SSTable format API (CASSANDRA-17056)
> > Reading the CEP I don’t see any mention to the systems which access
> SSTables; such as streaming (small callout to zero-copy-streaming with
> ZeroCopyBigTableWriter) and repair.  If you are abstracting out
> BigTableReader then you are not dealing with the implementation assumptions
> that users of SSTables have (such as direct mutation to auxiliary files
> outside of -Data.db).
> >
> >> Audience
> >>   • Cassandra developers who wish to see SSTableReader and
> SSTableWriter more modular than they are today,
> >
> > This statement relates to the above comment, many parts of the code do
> not use Reader/Writer but instead use direct format knowledge to apply
> changes to the file format (normally outside of -Data.db); to me the
> interfaces has to be at the SSTable level, which then expose
> readers/writers, but also has to expose the other things we do outside of
> those paths.
> >
> >>   • move the metrics related to sstable format out from
> TableMetrics class and make them tied to certain sstable implementation
> >
> > I am curious about this comment, are you removing exposing this
> information?
> >
> >>   • have a single factory for creating both readers and writers for
> particular implementation of sstable and use it consistently - no direct
> creation of any reader / writer
> >
> > I am -1 here, for the reasons listed above; the problem (in my eye) is
> not reader/writer but higher level at the actual SSTable.  If we plug out
> read/write but still allow direct file access, then these abstractions fail
> to provide the goals of the CEP.
> >
> > I am +1 to the intent of the CEP.
> >
> > And last comment, which I have also done in the other modularity thread…
> backwards compatibility and maintenance. It is not clear right now what
> java interfaces may not break and how we can maintain and extend such
> interfaces in the future.  If the goal is to allow 3rd parties to plugin
> and offer new SSTable formats, are we as a project ok with having a minor
> release do a binary or source non-compatible change?  If not how do we
> detect this?  Until this problem is solved, I do not think we should add
> any such interfaces.
> >
> >> On Oct 22, 2021, at 7:23 AM, Jeremiah Jordan 
> wrote:
> >>
> >> Hi Stefan,
> >> That idea is not related to this CEP which is about the file formats of
> the
> >> sstables, not file system access.  But you should take a look at the
> work
> >> recently committed in
> https://issues.apache.org/jira/browse/CASSANDRA-16926
> >> to switch to using java.nio.file.Path for file access.  This 

[DISCUSS] CEP-11: Pluggable memtable implementations

2021-07-20 Thread Branimir Lambov
Proposal for a mechanism for plugging in memtable implementations:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-11%3A+Pluggable+memtable+implementations

The proposal supports using custom memtable implementations to support
development and testing of improved alternatives, but also enables a
broader definition of "memtable" to better support more advanced use cases
like persistent memory. To this end, memtable implementations are given
control over flushing and storing data in the commit log, enabling
solutions that implement their own durability mechanisms and live much
longer than their classical counterparts. Taken to the extreme, this also
enables memtables that never flush (in other words, alternative storage
engines) in a minimally-invasive manner.

I am curious to hear your thoughts on the proposal.

Regards,
Branimir


Re: [DISCUSS] CEP-11: Pluggable memtable implementations

2021-07-21 Thread Branimir Lambov
data is otherwise always only kept in the memtable instead of writing
> to the SSTable (for performance reasons). Same implementation of memtable
> still.
>
> Why would the write process of the table not ask the table what settings it
> has and instead asks the memtable what settings the table has? This seems
> counterintuitive to me. Even the persistent memory case is a bit
> questionable, why not simply disable commitlog in the writing process? Why
> ask the memtable?
>
> This feels like memtable is going to be the write pipeline, but to me that
> doesn't feel like the correct architectural decision. I'd rather see these
> decisions done outside the memtable. Even a persistent memory memtable user
> might want to have a commitlog enabled for data capture / shipping logs, or
> layers of persistence speed. The whole persistent memory without any
> commercially known future is a bit weird at the moment (even Optane has no
> known manufacturing anymore with last factory being dismantled based on
> public information).
>
> > boolean streamToMemtable()
>
> And that one I don't understand. Why is streaming in the memtable? This
> smells like a scope creep from something else. The explanation would
> indicate to me that the wanted behavior is just disabling automated
> flushing.
>
> But these are just some questions that came to my mind while reading this.
> And I don't want to sound too negative (most of the features are really
> something I'd like to see), perhaps I just misunderstood some of the
> motivations why stuff should be brought to memtable instead of being
> implemented outside memtable. Perhaps there's something else in the write
> pipeline arch that needs fixing but is now masqueraded inside this CEP.
>
> I'm definitely interested to hear more.
>
>   - Micke
>
> On Wed, 21 Jul 2021 at 08:24, Berenguer Blasi 
> wrote:
>
> > +1. De-tangling, going more modular and clean interfaces sgtm.
> >
> > On 20/7/21 21:45, Nate McCall wrote:
> > > Yay for pluggable memtables!! I havent gone over this in detail yet,
> but
> > > personally I've always thought integrating something like Arrow would
> be
> > > cool for sharing data (that's as far as i've gotten, but anything that
> > > makes that kind of experimentation easier would also help with mocking
> > test
> > > plumbing, so +1 from me).
> > >
> > > Thanks for putting this together!
> > >
> > > -Nate
> > >
> > > On Tue, Jul 20, 2021 at 10:11 PM Branimir Lambov <
> > > branimir.lam...@datastax.com> wrote:
> > >
> > >> Proposal for a mechanism for plugging in memtable implementations:
> > >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-11%3A+Pluggable+memtable+implementations
> > >>
> > >> The proposal supports using custom memtable implementations to support
> > >> development and testing of improved alternatives, but also enables a
> > >> broader definition of "memtable" to better support more advanced use
> > cases
> > >> like persistent memory. To this end, memtable implementations are
> given
> > >> control over flushing and storing data in the commit log, enabling
> > >> solutions that implement their own durability mechanisms and live much
> > >> longer than their classical counterparts. Taken to the extreme, this
> > also
> > >> enables memtables that never flush (in other words, alternative
> storage
> > >> engines) in a minimally-invasive manner.
> > >>
> > >> I am curious to hear your thoughts on the proposal.
> > >>
> > >> Regards,
> > >> Branimir
> > >>
> >
> > -
> >
> >
>


-- 
Branimir Lambov
e. branimir.lam...@datastax.com
w. www.datastax.com


Re: [DISCUSS] CEP-11: Pluggable memtable implementations

2021-07-21 Thread Branimir Lambov
> Why is flushing control bad to do in CFS and better in the
  memtable?

I wonder why you would understand this as something that takes away
control instead of giving it. The CFS is not configurable. With the
CEP, memtables are configurable at the table level. It is entirely
possible to implement a memtable wrapper that provides any of the
examples of functionalities you mention -- and that would be fully
configurable (just as example, one could very well select a
time-series-optimized-flush wrapper over skip-list memtable).
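A delegating wrapper of the kind described above could look roughly like this: the wrapper keeps the underlying skip-list storage but swaps in a time-window flush policy. The interface and classes are a sketch, not Cassandra's actual memtable API:

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Minimal stand-in for a memtable interface.
interface Memtable {
    void put(String key, String value);
    boolean shouldFlush(); // the implementation decides when it wants to flush
}

class SkipListMemtable implements Memtable {
    final ConcurrentSkipListMap<String, String> map = new ConcurrentSkipListMap<>();
    public void put(String key, String value) { map.put(key, value); }
    public boolean shouldFlush() { return map.size() >= 1_000; } // size-based default
}

// Wrapper that reuses the delegate's storage but replaces the flush policy,
// e.g. flushing on a time-window boundary for time-series workloads.
class TimeWindowFlushWrapper implements Memtable {
    private final Memtable delegate;
    private final long windowMillis;
    private final long createdAt = System.currentTimeMillis();

    TimeWindowFlushWrapper(Memtable delegate, long windowMillis) {
        this.delegate = delegate;
        this.windowMillis = windowMillis;
    }
    public void put(String key, String value) { delegate.put(key, value); }
    public boolean shouldFlush() {
        return System.currentTimeMillis() - createdAt >= windowMillis;
    }
}

public class WrapperDemo {
    public static void main(String[] args) {
        Memtable m = new TimeWindowFlushWrapper(new SkipListMemtable(), 3_600_000L);
        m.put("sensor-1", "42");
        System.out.println("flush due: " + m.shouldFlush()); // false: window still open
    }
}
```

Because the wrapper is itself a memtable, it would be selectable per table like any other implementation.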



> is this proposal going to take an angle towards per-range
  memtables?

This is another question that the proposal leaves to the memtable
implementation (or wrapper), but it does make sense to make sure the
interfaces provide the necessary support for sharding (e.g. by
providing suitable shard boundaries that split the owned space; note
that we already have sstable/compaction-per-range functionality with
multiple data directories and it makes sense to ensure that the
provided splits are in some agreement with the data directory
boundaries).



> Why would the write process of the table not ask the table what
  settings it has and instead asks the memtable what settings the
  table has?

The reason for this is that memtables are the primary reason the
commit log needs to preserve data. The question of whether or not the
memtable needs its content to be present and retained in the commit
log until flush (writesAreDurable) is a question that only the
memtable can answer.

writesShouldSkipCommitLog is a result of scope reduction (call it
laziness on my part). I could not find a way to tell if commit log
data may be required for point-in-time-restore or any other feature,
and the existing method of turning the commit log off does not have
the right granularity. I am very open to suggestions here.



> Why is streaming in the memtable? [...] the wanted behavior is just
  disabling automated flushing

Yes, if zero-copy-streaming is not enabled. And that's exactly what
this method is there for -- to make sure sstables are not copied
whole, and that a flush is not done at the end.

Regards,
Branimir

On Wed, Jul 21, 2021 at 4:33 PM bened...@apache.org 
wrote:

> I would love to help out with this in any way that I can, FYI. Definitely
> one of the more impactful performance improvements to the codebase, given
> the benefits to compaction and memory behaviour.
>
> From: bened...@apache.org 
> Date: Wednesday, 21 July 2021 at 14:32
> To: dev@cassandra.apache.org 
> Subject: Re: [DISCUSS] CEP-11: Pluggable memtable implementations
> > memtable-as-a-commitlog-index
>
> Heh, based on 7282? Yeah, I’ve had this idea for a while now (actually
> there was a paper that did this a long time ago), and it could be very nice
> (if for no other benefit than reducing heap utilisation). I don’t think
> this requires that they be modelled as the same concept, however, only that
> the Memtable must be able to receive an address into a commit log entry and
> to adopt partial ownership over the entry’s lifecycle.
>
>
> From: Branimir Lambov 
> Date: Wednesday, 21 July 2021 at 14:28
> To: dev@cassandra.apache.org 
> Subject: Re: [DISCUSS] CEP-11: Pluggable memtable implementations
> > In general, I think we need to make up our mind as to whether we
>   consider the Memtable and CommitLog one logical entity [...], or
>   whether we want to further untangle those two components from an
>   architectural perspective which we started down that road on with
>   the pluggable storage engine work.
>
> This CEP is intentionally not attempting to answer this question. FWIW
> I do not see them as separable (there's evidence to this fact in the
> codebase), but there are valid secondary uses of the commit log that
> are served well enough by the current architecture.
>
> It is important, however, to let the memtable implementation opt out,
> to permit it to provide its own solution for data persistence.
>
> We should revisit this in the future, especially if Benedict's shared
> log facility and my plans for a memtable-as-a-commitlog-index
> evolve.
>
> Regards,
> Branimir
>
> On Wed, Jul 21, 2021 at 1:34 PM Michael Burman  wrote:
>
> > Hi,
> >
> > It is nice to see these going forward (and a great use of CEP) so thanks
> > for the proposal. I have my reservations regarding the linking of
> memtable
> > to CommitLog and flushing and should not leak abstraction from one to
> > another. And I don't see the reasoning why they should be, it doesn't
> seem
> > to add anything else than tight coupling of components, reducing reuse
> and
> > making things unnecessarily complicated. Also, the streaming notions seem
> > weird to me - how are they related to memtable? Why should memtable care
> > about the b

Re: [DISCUSS] CEP-11: Pluggable memtable implementations

2021-07-23 Thread Branimir Lambov
> CEP indicates the flushing behavior is suddenly more tied to the Memtable
  implementation level rather than being configurable at the table level

The specific things that change with the proposal are:
- Flushes are supplied with a reason (e.g. memory full, schema change,
  prepare to stream).
- The memtable can reject a flush request.
- The logic to initiate "memory full" and "period expired" flushes moves
  to the memtable where it conceptually belongs.

Is the latter what worries you? For reusability, the current logic is
extracted
in a base class that the skiplist/trie/7282 implementations derive from.
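To make the division of responsibilities concrete, here is a minimal sketch of what such flush control could look like. All names here (FlushReason, MemtableFlushControl, PmemFlushControl) are illustrative assumptions, not the actual CEP-11 API:

```java
// Hypothetical sketch (illustrative names, not the actual CEP-11 interfaces):
// a reason accompanies every flush request, and the memtable may reject it.
enum FlushReason { MEMORY_FULL, PERIOD_EXPIRED, SCHEMA_CHANGE, STREAMING_PREPARE, USER_FORCED }

interface MemtableFlushControl {
    /** Return false to reject the flush request. */
    boolean shouldFlush(FlushReason reason);
}

// A persistent-memory memtable could reject everything except explicit requests:
class PmemFlushControl implements MemtableFlushControl {
    @Override
    public boolean shouldFlush(FlushReason reason) {
        return reason == FlushReason.USER_FORCED;
    }
}

public class FlushDemo {
    public static void main(String[] args) {
        MemtableFlushControl pmem = new PmemFlushControl();
        System.out.println(pmem.shouldFlush(FlushReason.MEMORY_FULL)); // false
        System.out.println(pmem.shouldFlush(FlushReason.USER_FORCED)); // true
    }
}
```

The base class mentioned above would then carry the shared "memory full" / "period expired" triggering logic, while implementations override only the decision.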


> I'm not sure if the "isDurable" + "shouldSkip" is interesting instead
  of "shouldWrite" (etc). But I also wonder in cases where point-in-time
  restore is required how one could achieve it without a commit log (can
  persistent memory memtable be rolled back?).

That's exactly the reason why the two flags are separate. To use PITR,
you use the commit log but make sure that it does not treat the segments
covered by the persistent memtable as dirty (i.e. writesAreDurable but
not writesShouldSkipCommitLog); commit log segments are written only to
be archived, and PITR restores a memtable snapshot and applies the
mutations after it.

Am I misunderstanding the question?
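A tiny sketch of how the two flags could combine for the PITR case described above. Method names are hypothetical, chosen only to illustrate the separation:

```java
// Hypothetical sketch: how two separate flags drive commit log behavior.
// writesAreDurable          -> the memtable persists writes itself, so segments
//                              it covers need not be marked dirty (no replay).
// writesShouldSkipCommitLog -> mutations bypass the commit log entirely.
public class CommitLogPolicy {
    static boolean writeToCommitLog(boolean writesShouldSkipCommitLog) {
        // PITR/CDC still receive mutations as long as writes are not skipped
        return !writesShouldSkipCommitLog;
    }

    static boolean markSegmentDirty(boolean writesAreDurable) {
        // a durable memtable never requires segment replay on restart
        return !writesAreDurable;
    }

    public static void main(String[] args) {
        // Persistent memtable + PITR: segments are written (for archiving)
        // but immediately clean (never replayed).
        System.out.println(writeToCommitLog(false) && !markSegmentDirty(true)); // true
    }
}
```

With a single "shouldWrite" flag, the archived-but-never-replayed combination would not be expressible.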


> Although I do feel like persistent memory exceptions make stuff more
  complex.

The persistent memtables were the reason that drove this functionality,
but think about it also as an easy way to do pluggable storage engines.
I may not be up to date with the consensus in the community on this, but
I don't see us investing the effort to have fully-fledged pluggable
storage engines of the CASSANDRA-13475 type any time soon.

To make the memtable a storage engine you need two things:
- an opt out of flushing, so that the memtable is the only component
  that serves reads,
- an opt out of the commit log, so that the memtable is the only
  component that serves writes,

plus some solutions for the secondary uses of sstables (streaming) and
the commit log (PITR, CDC).

The proposal gives it that, with a little more control than just
opt-out. It can work for the pmem (opt out of both) and rocksdb (opt
out of flushing only) use cases, but for me it will also be useful to
experiment with a memtable that includes its own version of a commit
log (opt out of commit log only).
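The combinations above can be summarized in a small table. The sketch below is purely illustrative (the Mode record is not part of the proposed API):

```java
// Illustrative summary of the opt-out combinations described above.
public class EngineModes {
    record Mode(String name, boolean flushesToSSTables, boolean usesCommitLog) {}

    public static void main(String[] args) {
        Mode[] modes = {
            new Mode("pmem",             false, false), // opts out of both
            new Mode("rocksdb-style",    false, true),  // opts out of flushing only
            new Mode("own-commit-log",   true,  false), // opts out of commit log only
            new Mode("default-skiplist", true,  true)   // current behavior
        };
        for (Mode m : modes)
            System.out.printf("%-17s flushes=%-5b commitLog=%b%n",
                              m.name(), m.flushesToSSTables(), m.usesCommitLog());
    }
}
```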


On Thu, Jul 22, 2021 at 4:00 PM Michael Burman  wrote:

> On Wed, 21 Jul 2021 at 17:24, Branimir Lambov <
> branimir.lam...@datastax.com>
> wrote:
>
> > > Why is flushing control bad to do in CFS and better in the
> >   memtable?
> >
> > I wonder why you would understand this as something that takes away
> > control instead of giving it. The CFS is not configurable. With the
> > CEP, memtables are configurable at the table level. It is entirely
> > possible to implement a memtable wrapper that provides any of the
> > examples of functionalities you mention -- and that would be fully
> > configurable (just as example, one could very well select a
> > time-series-optimized-flush wrapper over skip-list memtable).
> >
> >
> I think this was a bit of miscommunication. I'm not in favor of keeping it
> in the CFS, but at least to me (as a reader) CEP indicates the flushing
> behavior is suddenly more tied to the Memtable implementation level rather
> than being configurable at the table level. Thus that would not reduce
> coupling of different flush strategies, but instead just move it from CFS
> to Memtable-implementation. And especially with multiple Memtable
> implementations that would mean the reusable parts of flushing could end up
> being difficult to reuse. If not the intention, then good.
>
>
> >
> > This is another question that the proposal leaves to the memtable
> > implementation (or wrapper), but it does make sense to make sure the
> > interfaces provide the necessary support for sharding
> >
>
> + 1 to this, that's a good limitation of scope to get forward. I think this
> was originally touched in 7282 (where I had it in the memtable impl), but
> then got pushed one step outside.
>
> > writesShouldSkipCommitLog is a result of scope reduction (call it
> > laziness on my part). I could not find a way to tell if commit log
> > data may be required for point-in-time-restore or any other feature,
> > and the existing method of turning the commit log off does not have
> > the right granularity. I am very open to suggestions here.
> >
>
> Could this be limited to a single parameter? I'm not sure if the
> "isDurable" + "shouldSkip" is interesting instead of "shouldWrite" (etc).
> But I also wonder in cases where point-in-time restore is required how one
> could achieve it without 

Re: [VOTE] Release Apache Cassandra 4.0.0 (third time is the charm)

2021-07-23 Thread Branimir Lambov
+1

On Fri, Jul 23, 2021 at 4:15 PM Aleksey Yeschenko 
wrote:

> +1
>
> > On 23 Jul 2021, at 14:03, Joshua McKenzie  wrote:
> >
> > +1
> >
> > On Fri, Jul 23, 2021 at 8:07 AM Dinesh Joshi  >
> > wrote:
> >
> >> +1
> >>
> >>
> >>> On Jul 23, 2021, at 4:56 AM, Paulo Motta 
> >> wrote:
> >>>
> >>> +1
> >>>
> >>>> On Fri, 23 Jul 2021 at 08:37, Andrés de la Peña <
> >>>> a.penya.gar...@gmail.com> wrote:
> >>>>
> >>>> +1
> >>>>
> >>>>> On Fri, 23 Jul 2021 at 11:56, Sam Tunnicliffe 
> wrote:
> >>>>>
> >>>>> +1
> >>>>>
> >>>>>> On 22 Jul 2021, at 23:40, Brandon Williams <
> >> brandonwilli...@apache.org
> >>>>>
> >>>>> wrote:
> >>>>>>
> >>>>>> I am proposing the test build of Cassandra 4.0.0 for release.
> >>>>>>
> >>>>>> sha1: 902b4d31772eaa84f05ffdc1e4f4b7a66d5b17e6
> >>>>>> Git:
> >>>>>
> >>>>
> >>
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/4.0.0-tentative
> >>>>>> Maven Artifacts:
> >>>>>>
> >>>>>
> >>>>
> >>
> https://repository.apache.org/content/repositories/orgapachecassandra-1244/org/apache/cassandra/cassandra-all/4.0.0/
> >>>>>>
> >>>>>> The Source and Build Artifacts, and Debian and RPM packages and
> >>>>>> repositories are available here:
> >>>>>> https://dist.apache.org/repos/dist/dev/cassandra/4.0.0/
> >>>>>>
> >>>>>> The vote will be open for 72 hours (longer if needed). Everyone who
> >>>>>> has tested the build is invited to vote. Votes by PMC members are
> >>>>>> considered binding. A vote passes if there are at least three
> binding
> >>>>>> +1s and no -1's.
> >>>>>>
> >>>>>> [1]: CHANGES.txt:
> >>>>>>
> >>>>>
> >>>>
> >>
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/4.0.0-tentative
> >>>>>> [2]: NEWS.txt:
> >>>>>
> >>>>
> >>
> https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/4.0.0-tentative
> >>>>>>
> >>>>>>
> -
> >>>>>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> >>>>>> For additional commands, e-mail: dev-h...@cassandra.apache.org
> >>>>>>
> >>>>>
> >>>>>
> >>>>> -
> >>>>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> >>>>> For additional commands, e-mail: dev-h...@cassandra.apache.org
> >>>>>
> >>>>>
> >>>>
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> >> For additional commands, e-mail: dev-h...@cassandra.apache.org
> >>
> >>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>

-- 
Branimir Lambov
e. branimir.lam...@datastax.com
w. www.datastax.com


[DISCUSS] CEP-19: Trie memtable implementation

2022-01-10 Thread Branimir Lambov
We would like to contribute our TrieMemtable to Cassandra.

https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-19%3A+Trie+memtable+implementation

This is a new memtable solution aimed at replacing the legacy
implementation, developed with the following objectives:
- lowering on-heap complexity and enabling memtable indexing structures
  to be stored off-heap,
- leveraging byte order and a trie structure to lower the memory
  footprint and improve mutation and lookup performance.

The new memtable relies on CASSANDRA-6936 to translate to and from
byte-ordered representations of types, and CASSANDRA-17034 / CEP-11 to plug
into Cassandra. The memtable is built on multiple shards of custom
in-memory single-writer multiple-reader tries, whose implementation uses a
combination of state-of-the-art and novel features for greater efficiency.
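As a rough illustration of why byte order matters for a trie memtable: if every key is encoded so that unsigned byte-by-byte comparison reproduces the type's natural order, the trie can store and compare keys without any type-specific logic. The encoding below is a simplified sketch of the idea behind CASSANDRA-6936, not the actual ByteComparable implementation:

```java
import java.util.Arrays;

// Simplified sketch of byte-ordered encoding: flip the sign bit of a
// big-endian int so that unsigned lexicographic byte comparison matches
// signed integer order. (Hypothetical helper, not Cassandra's real code.)
public class ByteOrderedDemo {
    static byte[] encodeInt(int v) {
        int flipped = v ^ 0x80000000; // sign-bit flip: signed order -> unsigned order
        return new byte[] {
            (byte) (flipped >>> 24), (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),  (byte) flipped };
    }

    static int compareUnsigned(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b); // what a trie effectively does per byte
    }

    public static void main(String[] args) {
        // -5 < 3 must hold after encoding, even though raw two's-complement
        // bytes would compare the other way.
        System.out.println(compareUnsigned(encodeInt(-5), encodeInt(3)) < 0); // true
    }
}
```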

The CEP's JIRA ticket (https://issues.apache.org/jira/browse/CASSANDRA-17240)
contains the initial version of the implementation. In its current form it
achieves much better garbage collection latency, significantly bigger data
sizes between flushes for the same memory allocation, as well as
drastically increased write throughput, and we expect the memory and
garbage collection improvements to go much further with upcoming
improvements to the solution.

I am interested in hearing your thoughts on the proposal.

Regards,
Branimir


Re: [DISCUSS] CEP-19: Trie memtable implementation

2022-02-10 Thread Branimir Lambov
Let us continue the configuration discussion in the CEP-11 JIRA (
https://issues.apache.org/jira/browse/CASSANDRA-17034).

Any further comments on the alternate memtable? Are we ready for a vote?

Regards,
Branimir


On Wed, Feb 9, 2022 at 12:13 PM Bowen Song  wrote:

> TBH, I don't have an opinion on the configuration. I just want to say that
> if at the end we decide the configuration in the YAML should override the
> table schema, I would like to recommend that we specify a list of
> whitelisted (or blacklisted) "templates" in the YAML file, and the template
> chosen by the table schema is used if it's enabled, otherwise falling back
> to a default template, which could be the first element in the whitelist if
> that's used, or a separate configuration entry if a blacklist is used. The
> list should be optional in the YAML, and an empty list or the absence of it
> means everything is enabled.
>
> Advantage of this:
>
> 1. it doesn't require the operator to configure this, as an empty or
> absent list by default enables all templates and should work fine in most
> cases.
>
> 2. it allows the operator to whitelist / blacklist any template if ever
> needed (e.g. due to a bug), and also allow them to choose a fallback option.
>
> 3. the table schema has priority as long as the chosen template is not
> explicitly disabled by the YAML.
>
> 4. it allows the operator to selectively disable some templates without
> forcing all tables to use the same template specified by the YAML.
>
>
> On 09/02/2022 09:43, bened...@apache.org wrote:
>
> Why not have some default templates that can be specified by the schema
> without touching the yaml, but overridden in the yaml as necessary?
>
>
>
> *From: *Branimir Lambov  
> *Date: *Wednesday, 9 February 2022 at 09:35
> *To: *dev@cassandra.apache.org 
> 
> *Subject: *Re: [DISCUSS] CEP-19: Trie memtable implementation
>
> If I understand this correctly, you prefer _not_ to have an option to give
> the configuration explicitly in the schema. I.e. force the configurations
> ("templates" in current terms) to be specified in the yaml, and only allow
> tables to specify which one to use among them?
>
>
>
> This does sound at least as good to me, and I'll happily change the API.
>
>
>
> Regards,
>
> Branimir
>
>
>
> On Tue, Feb 8, 2022 at 10:40 PM Dinesh Joshi  wrote:
>
> My quick reading of the code suggests that schema will override the
> operator's default preference in the YAML. In the event of a bug in the new
> implementation, there could be a situation where the operator might need to
> override this via the YAML.
>
>
>
> On Feb 8, 2022, at 12:29 PM, Jeremiah D Jordan 
> wrote:
>
>
>
> I don’t really see most users touching the default implementation.  I
> would expect the main reason someone would change would be
>
> 1. They run into some bug that is only in one of the implementations.
>
> 2. They have persistent memory and so want to use
> https://issues.apache.org/jira/browse/CASSANDRA-13981
>
>
>
> Given that I doubt most people will touch it, I think it is good to give
> advanced operators the ability to have more control over switching to
> things that have new performance characteristics. So I like that the
> proposed configuration approach allows someone to change to a new
> implementation one node at a time and only for specific tables.
>
>
>
> On Feb 8, 2022, at 2:21 PM, Dinesh Joshi  wrote:
>
>
>
> Thank you for sharing the perf test results.
>
>
>
> Going back to the schema vs yaml configuration. I am concerned users may
> pick the wrong implementation for their use-case. Is there any chance for
> us to automatically pick a MemTable implementation based on heuristics? Do
> we foresee users ever picking the existing SkipList implementation over the
> Trie? Given the performance tests, it seems the Trie implementation is the
> clear winner.
>
>
>
> To be clear, I am not suggesting we remove the existing implementation. I
> am for maintaining a pluggable API for various components.
>
>
>
> Dinesh
>
>
>
> On Feb 7, 2022, at 8:39 AM, Branimir Lambov  wrote:
>
>
>
> Added some performance results to the ticket:
> https://issues.apache.org/jira/browse/CASSANDRA-17240
>
>
>
> Regards,
>
> Branimir
>
>
>
> On Sat, Feb 5, 2022 at 10:59 PM Dinesh Joshi  wrote:
>
> This is excellent. Thanks for opening up this CEP. It would be great to
> get some stats around GC allocation rate / memory pressure, read & write
> latencies, etc. compared to existing implementation.
>
>
>
> Dinesh
>
>
>
> On Jan 18, 2022, at 2:13 AM, Branimi

[VOTE] CEP-19: Trie memtable implementation

2022-02-16 Thread Branimir Lambov
Hi everyone,

I'd like to propose CEP-19 for approval.

Proposal:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-19%3A+Trie+memtable+implementation
Discussion: https://lists.apache.org/thread/fdvf1wmxwnv5jod59jznbnql23nqosty

The vote will be open for 72 hours.
Votes by committers are considered binding.
A vote passes if there are at least three binding +1s and no binding vetoes.

Thank you,
Branimir


Re: [VOTE] CEP-19: Trie memtable implementation

2022-02-25 Thread Branimir Lambov
The vote passes with 8 binding and 6 non-binding +1s and no vetoes.

Thanks everyone!

Regards,
Branimir

On Thu, Feb 17, 2022 at 12:43 PM Benjamin Lerer  wrote:

> +1
>
> On Thu, 17 Feb 2022 at 08:22, Dinesh Joshi  wrote:
>
>> +1
>>
>> On 2/16/22 21:45, Berenguer Blasi wrote:
>> > +1
>> >
>> > On 16/2/22 23:50, Joseph Lynch wrote:
>> >> +1 nb
>> >>
>> >> Really excited for this, Thank you Branimir!
>> >>
>> >> -Joey
>> >>
>> >> On Wed, Feb 16, 2022 at 12:58 AM Branimir Lambov 
>> >> wrote:
>> >>> Hi everyone,
>> >>>
>> >>> I'd like to propose CEP-19 for approval.
>> >>>
>> >>> Proposal:
>> >>>
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-19%3A+Trie+memtable+implementation
>> >>>
>> >>> Discussion:
>> >>> https://lists.apache.org/thread/fdvf1wmxwnv5jod59jznbnql23nqosty
>> >>>
>> >>> The vote will be open for 72 hours.
>> >>> Votes by committers are considered binding.
>> >>> A vote passes if there are at least three binding +1s and no binding
>> >>> vetoes.
>> >>>
>> >>> Thank you,
>> >>> Branimir
>>
>>

-- 
Branimir Lambov
e. branimir.lam...@datastax.com
w. www.datastax.com


Re: [DISCUSS] CEP-19: Trie memtable implementation

2022-02-07 Thread Branimir Lambov
Added some performance results to the ticket:
https://issues.apache.org/jira/browse/CASSANDRA-17240

Regards,
Branimir

On Sat, Feb 5, 2022 at 10:59 PM Dinesh Joshi  wrote:

> This is excellent. Thanks for opening up this CEP. It would be great to
> get some stats around GC allocation rate / memory pressure, read & write
> latencies, etc. compared to existing implementation.
>
> Dinesh
>
> On Jan 18, 2022, at 2:13 AM, Branimir Lambov  wrote:
>
> The memtable pluggability API (CEP-11) is per-table to enable memtable
> selection that suits specific workflows. It also makes full sense to permit
> per-node configuration, both to be able to modify the configuration to suit
> heterogeneous deployments better, as well as to test changes for
> improvements such as this one.
> Recognizing this, the patch comes with a modification to the API
> <https://github.com/blambov/cassandra/commit/24b558ba2f71a2f040804e28993cc914b31298f5>
> that defines memtable templates in cassandra.yaml (i.e. per node) and
> allows the schema to select a template (in addition to being able to
> specify the full memtable configuration). One could use this e.g. by adding:
>
> memtable_templates:
>     trie:
>         class: TrieMemtable
>         shards: 16
>     skiplist:
>         class: SkipListMemtable
> memtable:
>     template: skiplist
>
> (which defines two templates and specifies the default memtable
> implementation to use) to cassandra.yaml and specifying  WITH memtable =
> {'template' : 'trie'} in the table schema.
>
> I intend to commit this modification with the memtable API
> (CASSANDRA-17034/CEP-11).
>
> Performance comparisons will be published soon.
>
> Regards,
> Branimir
>
> On Fri, Jan 14, 2022 at 4:15 PM Jeff Jirsa  wrote:
>
>> Sounds like a great addition
>>
>> Can you share some of the details around gc and latency improvements
>> you’ve observed with the list?
>>
>> Any specific reason the confirmation is through schema vs yaml?
>> Presumably it’s so a user can test per table, but this changes every host
>> in a cluster, so the impact of a bug/regression is much higher.
>>
>>
>> On Jan 10, 2022, at 1:30 AM, Branimir Lambov  wrote:
>>
>> 
>> We would like to contribute our TrieMemtable to Cassandra.
>>
>>
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-19%3A+Trie+memtable+implementation
>>
>> This is a new memtable solution aimed to replace the legacy
>> implementation, developed with the following objectives:
>> - lowering the on-heap complexity and the ability to store memtable
>> indexing structures off-heap,
>> - leveraging byte order and a trie structure to lower the memory
>> footprint and improve mutation and lookup performance.
>>
>> The new memtable relies on CASSANDRA-6936 to translate to and from
>> byte-ordered representations of types, and CASSANDRA-17034 / CEP-11 to plug
>> into Cassandra. The memtable is built on multiple shards of custom
>> in-memory single-writer multiple-reader tries, whose implementation uses a
>> combination of state-of-the-art and novel features for greater efficiency.
>>
>> The CEP's JIRA ticket (
>> https://issues.apache.org/jira/browse/CASSANDRA-17240) contains the
>> initial version of the implementation. In its current form it achieves much
>> better garbage collection latency, significantly bigger data sizes between
>> flushes for the same memory allocation, as well as drastically increased
>> write throughput, and we expect the memory and garbage collection
>> improvements to go much further with upcoming improvements to the solution.
>>
>> I am interested in hearing your thoughts on the proposal.
>>
>> Regards,
>> Branimir
>>
>>
>


Re: [DISCUSS] CEP-19: Trie memtable implementation

2022-02-09 Thread Branimir Lambov
If I understand this correctly, you prefer _not_ to have an option to give
the configuration explicitly in the schema. I.e. force the configurations
("templates" in current terms) to be specified in the yaml, and only allow
tables to specify which one to use among them?

This does sound at least as good to me, and I'll happily change the API.

Regards,
Branimir

On Tue, Feb 8, 2022 at 10:40 PM Dinesh Joshi  wrote:

> My quick reading of the code suggests that schema will override the
> operator's default preference in the YAML. In the event of a bug in the new
> implementation, there could be a situation where the operator might need to
> override this via the YAML.
>
> On Feb 8, 2022, at 12:29 PM, Jeremiah D Jordan 
> wrote:
>
> I don’t really see most users touching the default implementation.  I
> would expect the main reason someone would change would be
> 1. They run into some bug that is only in one of the implementations.
> 2. They have persistent memory and so want to use
> https://issues.apache.org/jira/browse/CASSANDRA-13981
>
> Given that I doubt most people will touch it, I think it is good to give
> advanced operators the ability to have more control over switching to
> things that have new performance characteristics. So I like that the
> proposed configuration approach allows someone to change to a new
> implementation one node at a time and only for specific tables.
>
> On Feb 8, 2022, at 2:21 PM, Dinesh Joshi  wrote:
>
> Thank you for sharing the perf test results.
>
> Going back to the schema vs yaml configuration. I am concerned users may
> pick the wrong implementation for their use-case. Is there any chance for
> us to automatically pick a MemTable implementation based on heuristics? Do
> we foresee users ever picking the existing SkipList implementation over the
> Trie? Given the performance tests, it seems the Trie implementation is the
> clear winner.
>
> To be clear, I am not suggesting we remove the existing implementation. I
> am for maintaining a pluggable API for various components.
>
> Dinesh
>
> On Feb 7, 2022, at 8:39 AM, Branimir Lambov  wrote:
>
> Added some performance results to the ticket:
> https://issues.apache.org/jira/browse/CASSANDRA-17240
>
> Regards,
> Branimir
>
> On Sat, Feb 5, 2022 at 10:59 PM Dinesh Joshi  wrote:
>
>> This is excellent. Thanks for opening up this CEP. It would be great to
>> get some stats around GC allocation rate / memory pressure, read & write
>> latencies, etc. compared to existing implementation.
>>
>> Dinesh
>>
>> On Jan 18, 2022, at 2:13 AM, Branimir Lambov  wrote:
>>
>> The memtable pluggability API (CEP-11) is per-table to enable memtable
>> selection that suits specific workflows. It also makes full sense to permit
>> per-node configuration, both to be able to modify the configuration to suit
>> heterogeneous deployments better, as well as to test changes for
>> improvements such as this one.
>> Recognizing this, the patch comes with a modification to the API
>> <https://github.com/blambov/cassandra/commit/24b558ba2f71a2f040804e28993cc914b31298f5>
>> that defines memtable templates in cassandra.yaml (i.e. per node) and
>> allows the schema to select a template (in addition to being able to
>> specify the full memtable configuration). One could use this e.g. by adding:
>>
>> memtable_templates:
>>     trie:
>>         class: TrieMemtable
>>         shards: 16
>>     skiplist:
>>         class: SkipListMemtable
>> memtable:
>>     template: skiplist
>>
>> (which defines two templates and specifies the default memtable
>> implementation to use) to cassandra.yaml and specifying  WITH memtable =
>> {'template' : 'trie'} in the table schema.
>>
>> I intend to commit this modification with the memtable API
>> (CASSANDRA-17034/CEP-11).
>>
>> Performance comparisons will be published soon.
>>
>> Regards,
>> Branimir
>>
>> On Fri, Jan 14, 2022 at 4:15 PM Jeff Jirsa  wrote:
>>
>>> Sounds like a great addition
>>>
>>> Can you share some of the details around gc and latency improvements
>>> you’ve observed with the list?
>>>
>>> Any specific reason the confirmation is through schema vs yaml?
>>> Presumably it’s so a user can test per table, but this changes every host
>>> in a cluster, so the impact of a bug/regression is much higher.
>>>
>>>
>>> On Jan 10, 2022, at 1:30 AM, Branimir Lambov  wrote:
>>>
>>> 
>>> We would like to contribute our TrieMemtable to Cassandra.
>>>
>>&g

Re: [DISCUSS] CEP-19: Trie memtable implementation

2022-01-18 Thread Branimir Lambov
The memtable pluggability API (CEP-11) is per-table to enable memtable
selection that suits specific workflows. It also makes full sense to permit
per-node configuration, both to be able to modify the configuration to suit
heterogeneous deployments better, as well as to test changes for
improvements such as this one.
Recognizing this, the patch comes with a modification to the API
<https://github.com/blambov/cassandra/commit/24b558ba2f71a2f040804e28993cc914b31298f5>
that defines memtable templates in cassandra.yaml (i.e. per node) and
allows the schema to select a template (in addition to being able to
specify the full memtable configuration). One could use this e.g. by adding:

memtable_templates:
    trie:
        class: TrieMemtable
        shards: 16
    skiplist:
        class: SkipListMemtable
memtable:
    template: skiplist

(which defines two templates and specifies the default memtable
implementation to use) to cassandra.yaml and specifying  WITH memtable =
{'template' : 'trie'} in the table schema.

I intend to commit this modification with the memtable API
(CASSANDRA-17034/CEP-11).

Performance comparisons will be published soon.

Regards,
Branimir

On Fri, Jan 14, 2022 at 4:15 PM Jeff Jirsa  wrote:

> Sounds like a great addition
>
> Can you share some of the details around gc and latency improvements
> you’ve observed with the list?
>
> Any specific reason the confirmation is through schema vs yaml? Presumably
> it’s so a user can test per table, but this changes every host in a
> cluster, so the impact of a bug/regression is much higher.
>
>
> On Jan 10, 2022, at 1:30 AM, Branimir Lambov  wrote:
>
> 
> We would like to contribute our TrieMemtable to Cassandra.
>
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-19%3A+Trie+memtable+implementation
>
> This is a new memtable solution aimed to replace the legacy
> implementation, developed with the following objectives:
> - lowering the on-heap complexity and the ability to store memtable
> indexing structures off-heap,
> - leveraging byte order and a trie structure to lower the memory footprint
> and improve mutation and lookup performance.
>
> The new memtable relies on CASSANDRA-6936 to translate to and from
> byte-ordered representations of types, and CASSANDRA-17034 / CEP-11 to plug
> into Cassandra. The memtable is built on multiple shards of custom
> in-memory single-writer multiple-reader tries, whose implementation uses a
> combination of state-of-the-art and novel features for greater efficiency.
>
> The CEP's JIRA ticket (
> https://issues.apache.org/jira/browse/CASSANDRA-17240) contains the
> initial version of the implementation. In its current form it achieves much
> better garbage collection latency, significantly bigger data sizes between
> flushes for the same memory allocation, as well as drastically increased
> write throughput, and we expect the memory and garbage collection
> improvements to go much further with upcoming improvements to the solution.
>
> I am interested in hearing your thoughts on the proposal.
>
> Regards,
> Branimir
>
>


Re: [DISCUSS] Improve Commitlog write path

2022-09-21 Thread Branimir Lambov
Hello Amit,

This paper may be of interest to you:
https://www.vldb.org/pvldb/vol15/p3359-lambov.pdf

We did a range of tests that are similar to your scenario and realized
several things early on:

- Memory-mapping the commit log in combination with memory-mapped data
  or index files causes long msync delays. This can be solved by
  switching to a compressed log.
- For smaller mutation sizes, using a large segment size practically
  removes the commit log bottleneck, even with compression and even
  though compression is currently single-threaded.
- Write performance scaling with available CPU threads is limited by
  memtable congestion. Scaling can be improved by using a sharded
  memtable (introduced with CASSANDRA-17034).
Needing a compressed log to achieve the best write performance is not
ideal, and implementing a non-compressed non-memory-mapped option is
fairly easy, with or without Direct IO. If you are looking for a simple
performance improvement for the commit log, this is where I would start.
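For reference, switching to a compressed commit log and a larger segment size is purely a configuration change. The fragment below is an illustrative sketch only; option names and units differ across Cassandra versions, and the values shown are assumptions rather than tuned recommendations:

```yaml
# Illustrative cassandra.yaml fragment (verify option names/units for your version):
commitlog_compression:
  - class_name: LZ4Compressor   # compressed log avoids the memory-mapped write path
commitlog_segment_size: 128MiB  # larger segments help small-mutation workloads
```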

Regards,
Branimir

On Tue, Jul 26, 2022 at 3:36 PM Pawar, Amit  wrote:

>
>
> Hi Bowen,
>
>
>
> Thanks for your reply. Now it is clear that what are some benefits of this
> patch. I will send it for review once it is ready and hopefully it gets
> accepted.
>
>
>
> Thanks,
>
> Amit
>
>
>
> *From:* Bowen Song via dev 
> *Sent:* Tuesday, July 26, 2022 5:36 PM
> *To:* dev@cassandra.apache.org
> *Subject:* Re: [DISCUSS] Improve Commitlog write path
>
>
>
>
> Hi Amit,
>
> That's some brilliant tests you have done there. It shows that the
> compaction throughput not only can be a bottleneck on the speed of insert
> operations, but it can also stress the JVM garbage collector. As a result
> of GC pressure, it can cause other things, such as insert, to fail.
>
> Your last statement is correct. The commit log change can be beneficial
> for atypical workloads where large volume of data is getting inserted and
> then expired soon, for example when using the TimeWindowCompactionStrategy
> with short TTL. But I must point out that this kind of atypical usage is
> often an anti-pattern in Cassandra, as Cassandra is a database, not a queue
> or cache system.
>
> This, however, is not saying the commit log change should not be
> introduced. As others have pointed out, it's down to a balancing act
> between the cost and benefit, and it will depend on the code complexity and
> the effect it has on typical workload, such as CPU and JVM heap usage.
> After all, we should prioritise the performance and reliability of typical
> usage before optimising for atypical use cases.
>
> Best,
> Bowen
>
> On 26/07/2022 12:41, Pawar, Amit wrote:
>
>
>
>
> Hi Bowen,
>
>
>
> Thanks for the reply; it helped to identify the failure point. I tested
> compaction throughput with different values. With threads active in
> compaction, the “java.lang.OutOfMemoryError: Map failed” error appears
> earlier with 1024 MB/s than with other values. This shows that with lower
> throughput such issues are still going to come up, not immediately but in
> days or weeks. Test results are given below.
>
>
>
>
> +---------+------------------------------+--------------------------+-----------------+
> | Records | Compaction Throughput (MB/s) | 5 large files (GB)       | Disk usage (GB) |
> +---------+------------------------------+--------------------------+-----------------+
> | 20      | 8                            | Not collected            | 500             |
> | 20      | 16                           | Not collected            | 500             |
> | 9       | 64                           | 3.5, 3.5, 3.5, 3.5, 3.5  | 273             |
> | 9       | 128                          | 3.5, 3.9, 4.9, 8.0, 15   | 287             |
> | 9       | 256                          | 11, 11, 12, 16, 20       | 359             |
> | 9       | 512                          | 14, 19, 23, 27, 28       | 469             |
> | 9       | 1024                         | 14, 18, 23, 27, 28       | 458             |
> | 9       | 0                            | 6.9, 6.9, 7.0, 28, 28    | 223             |
> +---------+------------------------------+--------------------------+-----------------+
>
>
>
> Issues observed with 

[DISCUSS] Adding dependency on agrona

2022-09-21 Thread Branimir Lambov
Hi everyone,

CASSANDRA-17240 (Trie memtable implementation) introduces a dependency on
the agrona  library (https://github.com/real-logic/agrona).

Does anyone have any objections to adding this dependency?

Regards,
Branimir


Re: [DISCUSS] Adding dependency on agrona

2022-09-23 Thread Branimir Lambov
The usage in the trie memtable is only for volatile access to buffers. In
this case I chose the library instead of reimplementing the functionality
(e.g. as methods in `ByteBufferUtil`) because the relevant interface makes
sense and the library is a good quality one that contains a range of other
utilities that can be very useful to Cassandra.

In other words, I personally would welcome opening Cassandra up to using
other parts of Agrona, and am asking if the community shares this sentiment.
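For readers unfamiliar with the capability in question: "volatile access to buffers" means reads and writes with memory-visibility guarantees at arbitrary buffer offsets. Agrona's buffer types expose this directly; the JDK-only sketch below shows the equivalent via a VarHandle view (an illustration, not Cassandra's actual code):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// JDK-only sketch of volatile buffer access: a VarHandle view over a
// ByteBuffer permits getVolatile/setVolatile at aligned byte offsets.
public class VolatileBufferDemo {
    private static final VarHandle INT_VIEW =
        MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

    static int writeAndRead(int byteOffset, int value) {
        ByteBuffer buf = ByteBuffer.allocateDirect(64);
        INT_VIEW.setVolatile(buf, byteOffset, value);   // volatile write
        return (int) INT_VIEW.getVolatile(buf, byteOffset); // volatile read
    }

    public static void main(String[] args) {
        System.out.println(writeAndRead(8, 42)); // 42
    }
}
```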


Regards,
Branimir

On Wed, Sep 21, 2022 at 9:15 PM Derek Chen-Becker 
wrote:

> Agrona looks like it has quite a bit more than just buffers, so if we add
> this as a dependency for the new memtable, would it potentially open up use
> of other parts of Agrona (wittingly or not)? Unless I misunderstood, wasn't
> part of the new memtable implementation an interface to allow this to be
> pluggable? Could we avoid bringing it in as a full dependency for Cassandra
> if the trie memtable were packaged separately as a plugin instead of being
> included directly?
>
> Cheers,
>
> Derek
>
> On Wed, Sep 21, 2022 at 6:41 AM Benedict  wrote:
>
>> In principle no, it’s a high quality library. But it might help to
>> briefly outline what it’s used for. I assume it is instead of ByteBuffer?
>> In which case it could maybe be worthwhile discussing as a project how we
>> foresee interaction with existing buffer machinery, and maybe how we expect
>> our buffer use to evolve on the project, as we already have several buffers.
>>
>> That said, I anticipate our buffer use changing significantly with the
>> introduction of value types and native memory improvements coming in future
>> Java releases, so my personal inclination is just to accept the dependency.
>>
>> On 21 Sep 2022, at 13:29, Branimir Lambov  wrote:
>>
>> 
>> Hi everyone,
>>
>> CASSANDRA-17240 (Trie memtable implementation) introduces a dependency on
>> the agrona library (https://github.com/real-logic/agrona).
>>
>> Does anyone have any objections to adding this dependency?
>>
>> Regards,
>> Branimir
>>
>>
>
> --
> +---+
> | Derek Chen-Becker |
> | GPG Key available at https://keybase.io/dchenbecker
> and   |
> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org
> |
> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> +---+
>
>

-- 
Branimir Lambov
e. branimir.lam...@datastax.com
w. www.datastax.com


Re: [VOTE] CEP-25: Trie-indexed SSTable format

2022-12-23 Thread Branimir Lambov
The vote passes with 13 +1s (11 binding) and no negative votes.

Thank you all!
Branimir

On Tue, Dec 20, 2022 at 7:39 AM Dinesh Joshi  wrote:

> +1
>
> On Dec 19, 2022, at 6:28 PM, Jake Luciani  wrote:
>
> 
> +1
>
> On Mon, Dec 19, 2022 at 7:27 PM C. Scott Andreas 
> wrote:
>
>> +1nb
>>
>> On Dec 19, 2022, at 1:27 PM, Josh McKenzie  wrote:
>>
>>
>> +1
>>
>> On Mon, Dec 19, 2022, at 11:54 AM, SAURABH VERMA wrote:
>>
>> +1
>>
>> On Mon, Dec 19, 2022 at 9:36 PM Benjamin Lerer  wrote:
>>
>> +1
>>
>> On Mon, Dec 19, 2022 at 4:31 PM, Andrés de la Peña  wrote:
>>
>> +1
>>
>> On Mon, 19 Dec 2022 at 15:11, Aleksey Yeshchenko 
>> wrote:
>>
>> +1
>>
>> On 19 Dec 2022, at 13:42, Ekaterina Dimitrova 
>> wrote:
>>
>> +1
>>
>> On Mon, 19 Dec 2022 at 8:30, J. D. Jordan 
>> wrote:
>>
>> +1 nb
>>
>> > On Dec 19, 2022, at 7:07 AM, Brandon Williams  wrote:
>> >
>> > +1
>> >
>> > Kind Regards,
>> > Brandon
>> >
>> >> On Mon, Dec 19, 2022 at 6:59 AM Branimir Lambov 
>> wrote:
>> >>
>> >> Hi everyone,
>> >>
>> >> I'd like to propose CEP-25 for approval.
>> >>
>> >> Proposal:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A+Trie-indexed+SSTable+format
>> >> Discussion:
>> https://lists.apache.org/thread/3dpdg6dgm3rqxj96cyhn58b50g415dyh
>> >>
>> >> The vote will be open for 72 hours.
>> >> Votes by committers are considered binding.
>> >> A vote passes if there are at least three binding +1s and no binding
>> vetoes.
>> >>
>> >> Thank you,
>> >> Branimir
>>
>>
>>
>>
>> --
>> Thanks & Regards,
>> Saurabh Verma,
>> India
>>
>>
>
> --
> http://twitter.com/tjake
>
>


[VOTE] CEP-25: Trie-indexed SSTable format

2022-12-19 Thread Branimir Lambov
Hi everyone,

I'd like to propose CEP-25 for approval.

Proposal:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A+Trie-indexed+SSTable+format
Discussion: https://lists.apache.org/thread/3dpdg6dgm3rqxj96cyhn58b50g415dyh

The vote will be open for 72 hours.
Votes by committers are considered binding.
A vote passes if there are at least three binding +1s and no binding vetoes.

Thank you,
Branimir


[DISCUSS] CEP-26: Unified Compaction Strategy

2022-12-19 Thread Branimir Lambov
Hello everyone,

I would like to open the discussion on our proposal for a unified
compaction strategy that aims to solve well-known problems with compaction
and improve parallelism to permit higher levels of sustained write
throughput.

The proposal is here:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-26%3A+Unified+Compaction+Strategy

The strategy is based on two main observations:
- that tiered and levelled compaction can be generalized as the same thing
if one observes that both form exponentially-growing levels based on the
size of sstables (or non-overlapping sstable runs) and trigger a compaction
when more than a given number of sstables are present on one level;
- that instead of "size" in the description above we can use "density",
i.e. the size of an sstable divided by the width of the token range it
covers, which permits sstables to be split at arbitrary points when the
output of a compaction is written and still produce a levelled hierarchy.

The latter allows us to shard the compaction space into
progressively higher numbers of shards as data moves to the higher levels
of the hierarchy, improving parallelism, space requirements and the
duration of compactions, and the former allows us to cover the existing
strategies, as well as hybrid mixtures that can prove more efficient for
some workloads.
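
As a hedged sketch of the level assignment this generalization implies (names and parameters are illustrative, not CEP-26's actual code): with a fanout factor f and base density d0, level n holds sstables whose density falls in [d0·f^n, d0·f^(n+1)):

```java
// Illustrative sketch of the generalization above: an sstable's "density"
// is its size divided by the fraction of the token range it covers, and
// levels grow exponentially by a fanout factor. Parameter names are
// assumptions, not CEP-26 code.
public class DensityLevelSketch {
    // Bytes per unit of token space; tokenRangeFraction is in (0, 1].
    static double density(long sstableBytes, double tokenRangeFraction) {
        return sstableBytes / tokenRangeFraction;
    }

    // Level n covers densities in [baseDensity * fanout^n, baseDensity * fanout^(n+1)).
    static int level(double density, double baseDensity, int fanout) {
        if (density <= baseDensity) {
            return 0;
        }
        return (int) Math.floor(Math.log(density / baseDensity) / Math.log(fanout));
    }

    public static void main(String[] args) {
        // A 100 MiB sstable covering 1/8 of the token range has density 800 MiB.
        double d = density(100L << 20, 0.125);
        // With a 1 MiB base density and fanout 10 it lands on level 2.
        System.out.println(level(d, 1L << 20, 10));
    }
}
```

Splitting compaction output at arbitrary token boundaries leaves each shard's density, and hence its level, well defined, which is what allows the sharding described above.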

Thank you,
Branimir


Re: [DISCUSS] CEP-25: Trie-indexed SSTable format

2022-11-21 Thread Branimir Lambov
There is no intention to introduce any new versions of the format
specifically for DSE. If there are any further changes to the format, they
will be OSS-first. In other words this support only extends to preexisting
versions of the format.

Inline row index in the data file is not something we have implemented, and
it's currently not in any plans. I personally am not sure how it can be
done to provide a benefit: if we place it at the end of a partition, it
does not help much compared to a separate file; if we place it in front, we
have to buffer the partition content, which will affect write performance.
In either case it may be harder to cache. Do you have something different
in mind?

Regards,
Branimir

On Mon, Nov 21, 2022 at 3:01 PM Benedict  wrote:

> Personally very pleased to see this proposal, and I’m not opposed to
> easing your migration by maintaining some light support for internal file
> versions - though would prefer the support have some version limit where it
> can be excised (maybe for one minor version bump?)
>
> One implementation question: are there any plans to support inline row
> index in the big sstable format files? Is this something DSE supports, and
> on the roadmap just not for initial work, or currently not envisioned?
>
> I would anticipate significant advantage to this for many workloads, and
> no downside (except for streaming - which could be resolved fairly easily
> by skipping over these sections when streaming to an old node, but since we
> don’t generally stream between versions I don’t see any major issue anyway).
>
>
> On 21 Nov 2022, at 12:43, Branimir Lambov  wrote:
>
> 
> Hi everyone,
>
> We would like to put CEP-25 for discussion.
>
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A+Trie-indexed+SSTable+format
>
> The proposal describes DSE's Big Trie-indexed SSTable format, which
> replaces the primary index with on-disk tries to improve lookup performance
> and index size, better handle wide partitions, and remove the need to
> manage key caching and index summaries.
>
> We would like to discuss this proposal with you.
>
> One of the questions that we want to ask is whether anyone objects to
> maintaining full compatibility with existing files created by DataStax
> Enterprise.
>
> Regards,
> Branimir
>
>


Re: [DISCUSS] CEP-25: Trie-indexed SSTable format

2022-11-21 Thread Branimir Lambov
I see. This does make a lot of sense for full row indexing, and also if one
can specify sub-kb granularity (at the current default we just won't have
an index in these cases). How does opening a ticket to do these two* after
the current code is committed sound?

* embedded index for sub-X-byte partitions + granularity in bytes

On Mon, Nov 21, 2022 at 3:38 PM Benedict  wrote:

> Buffering on write up to at most one page seems fine? Once you are past a
> single page it’s fine to write either to the end of the partition or to a
> separate file, there’s nothing much to be gained, but esp. for small
> partitions there’s likely significant value in prepending it?
>
> It might be preferable to retain the separate index for those that
> overflow this buffer, and simply encode in the partition index whether the
> row index is inline or in the separate file.
>
> On 21 Nov 2022, at 13:29, Branimir Lambov  wrote:
>
> 
> There is no intention to introduce any new versions of the format
> specifically for DSE. If there are any further changes to the format, they
> will be OSS-first. In other words this support only extends to preexisting
> versions of the format.
>
> Inline row index in the data file is not something we have implemented,
> and it's currently not in any plans. I personally am not sure how it can be
> done to provide a benefit: if we place it at the end of a partition, it
> does not help much compared to a separate file; if we place it in front, we
> have to buffer the partition content, which will affect write performance.
> In either case it may be harder to cache. Do you have something different
> in mind?
>
> Regards,
> Branimir
>
> On Mon, Nov 21, 2022 at 3:01 PM Benedict  wrote:
>
>> Personally very pleased to see this proposal, and I’m not opposed to
>> easing your migration by maintaining some light support for internal file
>> versions - though would prefer the support have some version limit where it
>> can be excised (maybe for one minor version bump?)
>>
>> One implementation question: are there any plans to support inline row
>> index in the big sstable format files? Is this something DSE supports, and
>> on the roadmap just not for initial work, or currently not envisioned?
>>
>> I would anticipate significant advantage to this for many workloads, and
>> no downside (except for streaming - which could be resolved fairly easily
>> by skipping over these sections when streaming to an old node, but since we
>> don’t generally stream between versions I don’t see any major issue anyway).
>>
>>
>> On 21 Nov 2022, at 12:43, Branimir Lambov  wrote:
>>
>> 
>> Hi everyone,
>>
>> We would like to put CEP-25 for discussion.
>>
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A+Trie-indexed+SSTable+format
>>
>> The proposal describes DSE's Big Trie-indexed SSTable format, which
>> replaces the primary index with on-disk tries to improve lookup performance
>> and index size, better handle wide partitions, and remove the need to
>> manage key caching and index summaries.
>>
>> We would like to discuss this proposal with you.
>>
>> One of the questions that we want to ask is whether anyone objects to
>> maintaining full compatibility with existing files created by DataStax
>> Enterprise.
>>
>> Regards,
>> Branimir
>>
>>
>
>
>


[DISCUSS] CEP-25: Trie-indexed SSTable format

2022-11-21 Thread Branimir Lambov
Hi everyone,

We would like to put CEP-25 for discussion.
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A+Trie-indexed+SSTable+format

The proposal describes DSE's Big Trie-indexed SSTable format, which
replaces the primary index with on-disk tries to improve lookup performance
and index size, better handle wide partitions, and remove the need to
manage key caching and index summaries.

We would like to discuss this proposal with you.

One of the questions that we want to ask is whether anyone objects to
maintaining full compatibility with existing files created by DataStax
Enterprise.

Regards,
Branimir


Re: [DISCUSS] CEP-26: Unified Compaction Strategy

2023-03-17 Thread Branimir Lambov
The prototype of UCS can now be found in this pull request:
https://github.com/apache/cassandra/pull/2228

Its description is given in the included markdown documentation:
https://github.com/blambov/cassandra/blob/UCS-density/src/java/org/apache/cassandra/db/compaction/UnifiedCompactionStrategy.md

The latest code includes some new elements compared to the link Henrik
posted, including density levelling, bucketing based solely on overlap, and
output splitting by expected density. It goes a little further than what is
described in the CEP-26 proposal as prototyping showed that we can make the
selection of sstables to compact and the sharding decisions independent of
each other. This makes the strategy more stable and better able to react to
changes in configuration and environment.

Regards,
Branimir

On Wed, Dec 21, 2022 at 10:01 AM Benedict  wrote:

> I’m personally very excited by this work. Compaction could do with a
> spring clean and this feels to formalise things much more cleanly, but
> density tiering in particular is something I’ve wanted to incorporate for
> years now, as it should significantly improve STCS behaviour (most
> importantly reducing read amplification and the amount of disk space
> required, narrowing the performance delta to LCS in these important
> dimensions), and simplifies re-levelling of LCS, making large streams much
> less painful.
>
> On 21 Dec 2022, at 07:19, Henrik Ingo  wrote:
>
> 
> I noticed the CEP doesn't link to this, so it should be worth mentioning
> that the UCS documentation is available here:
> https://github.com/datastax/cassandra/blob/ds-trunk/doc/unified_compaction.md
>
> Both of the above seem to do a poor job referencing the literature we've
> been inspired by. I will link to Mark Callaghan's blog on the subject:
>
>
> http://smalldatum.blogspot.com/2018/07/tiered-or-leveled-compaction-why-not.html?m=1
>
> ...and lazily will also borrow from Mark a post that references a bunch of
> LSM (not just UCS related) academic papers:
> http://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html?m=1
>
> Finally, it's perhaps worth mentioning that UCS has been in production in
> our Astra Serverless cloud service since it was launched in March 2021. The
> version described by the CEP therefore already incorporates some
> improvements based on observed production behaviour.
>
> Henrik
>
> On Mon, 19 Dec 2022, 15:41 Branimir Lambov,  wrote:
>
>> Hello everyone,
>>
>> I would like to open the discussion on our proposal for a unified
>> compaction strategy that aims to solve well-known problems with compaction
>> and improve parallelism to permit higher levels of sustained write
>> throughput.
>>
>> The proposal is here:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-26%3A+Unified+Compaction+Strategy
>>
>> The strategy is based on two main observations:
>> - that tiered and levelled compaction can be generalized as the same
>> thing if one observes that both form exponentially-growing levels based on
>> the size of sstables (or non-overlapping sstable runs) and trigger a
>> compaction when more than a given number of sstables are present on one
>> level;
>> - that instead of "size" in the description above we can use "density",
>> i.e. the size of an sstable divided by the width of the token range it
>> covers, which permits sstables to be split at arbitrary points when the
>> output of a compaction is written and still produce a levelled hierarchy.
>>
>> The latter allows us to shard the compaction space into
>> progressively higher numbers of shards as data moves to the higher levels
>> of the hierarchy, improving parallelism, space requirements and the
>> duration of compactions, and the former allows us to cover the existing
>> strategies, as well as hybrid mixtures that can prove more efficient for
>> some workloads.
>>
>> Thank you,
>> Branimir
>>
>>


Re: [EXTERNAL] Re: [DISCUSS] Next release date

2023-03-09 Thread Branimir Lambov
CEPs 25 (trie-indexed sstables) and 26 (unified compaction strategy) should
both be ready for review by mid-April.

Both are around 10k LOC, fairly isolated, and in need of a committer to
review.

Regards,
Branimir

On Mon, Mar 6, 2023 at 11:25 AM Benjamin Lerer  wrote:

> Sorry, I realized that when I started the discussion I probably did not
> frame it enough as I see that it is now going into different directions.
> The concerns I am seeing are:
> 1) A too small amount of time between releases  is inefficient from a
> development perspective and from a user perspective. From a development
> point of view because we are missing time to deliver some features. From a
> user perspective because they cannot follow with the upgrade.
> 2) Some features are so anticipated (Accord being the one mentioned) that
> people would prefer to delay the release to make sure that it is available
> as soon as possible.
> 3) We do not know how long we need to go from the freeze to GA. We hope
> for 2 months but our last experience was 6 months. So delaying the release
> could mean not releasing this year.
> 4) For people doing marketing it is really hard to promote a product when
> you do not know when the release will come and what features might be there.
>
> All those concerns are probably even made worse by the fact that we do not
> have a clear visibility on where we are.
>
> Should we clarify that part first by getting an idea of the status of the
> different CEPs and other big pieces of work? From there we could agree on
> some timeline for the freeze. We could then discuss how to make predictable
> the time from freeze to GA.
>
>
>
> On Sat, Mar 4, 2023 at 6:14 PM, Josh McKenzie  wrote:
>
>> (for convenience sake, I'm referring to both Major and Minor semver
>> releases as "major" in this email)
>>
>> The big feature from our perspective for 5.0 is ACCORD (CEP-15) and I
>> would advocate to delay until this has sufficient quality to be in
>> production.
>>
>> This approach can be pretty unpredictable in this domain; often
>> unforeseen things come up in implementation that can give you a long tail
>> on something being production ready. For the record - I don't intend to
>> single Accord out *at all* on this front, quite the opposite given how
>> much rigor's gone into the design and implementation. I'm just thinking
>> from my personal experience: everything I've worked on, overseen, or
>> followed closely on this codebase always has a few tricks up its sleeve
>> along the way to having edge-cases stabilized.
>>
>> Much like on some other recent topics, I think there's a nuanced middle
>> ground where we take things on a case-by-case basis. Some factors that have
>> come up in this thread that resonated with me:
>>
>> For a given potential release date 'X':
>> 1. How long has it been since the last release?
>> 2. How long do we expect qualification to take from a "freeze" (i.e. no
>> new improvement or features, branch) point?
>> 3. What body of merged production ready work is available?
>> 4. What body of new work do we have high confidence will be ready within
>> Y time?
>>
>> I think it's worth defining a loose "minimum bound and upper bound" on
>> release cycles we want to try and stick with barring extenuating
>> circumstances. For instance: try not to release sooner than maybe 10 months
>> out from a prior major, and try not to release later than 18 months out
>> from a prior major. Make exceptions if truly exceptional things land, are
>> about to land, or bugs are discovered around those boundaries.
>>
>> Applying the above framework to what we have in flight, our last release
>> date, expectations on CI, etc - targeting an early fall freeze (pending CEP
>> status) and mid to late fall or December release "feels right" to me.
>>
>> With the exception, of course, that if something merges earlier, is
>> stable, and we feel is valuable enough to cut a major based on that, we do
>> it.
>>
>> ~Josh
>>
>> On Fri, Mar 3, 2023, at 7:37 PM, German Eichberger via dev wrote:
>>
>> Hi,
>>
>> We shouldn't release just for releases sake. Are there enough new
>> features and are they working well enough (quality!).
>>
>> The big feature from our perspective for 5.0 is ACCORD (CEP-15) and I
>> would advocate to delay until this has sufficient quality to be in
>> production.
>>
>> Just because something is released doesn't mean anyone is gonna use it.
>> To add some operator perspective: Every time there is a new release we need
>> to decide
>> 1) are we supporting it
>> 2) which other release can we deprecate
>>
>> and potentially migrate people - which is also a tough sell if there are
>> no significant features and/or breaking changes.  So from my perspective
>> less frequent releases are better - after all we haven't gotten around to
>> support 4.1 
>>
>> The 5.0 release is also coupled with deprecating  3.11 which is what a
>> significant amount of people are using - given 4.1 took longer I am not
>> sure how many 

Re: Downgradability

2023-02-22 Thread Branimir Lambov
> 1. Major SSTable changes should begin with forward-compatibility in a
prior release.

This requires "feature" changes, i.e. new non-trivial code for previous
patch releases. It also entails porting over any further format
modification.

Instead of this, in combination with your second point, why not implement
backwards write compatibility? The opt-in is then clearer to define (i.e.
upgrades start with e.g. a "4.1-compatible" settings set that includes file
format compatibility and disabling of new features, new nodes start with
"current" settings set). When the upgrade completes and the user is happy
with the result, the settings set can be replaced.

Doesn't this achieve what you want (and we all agree is a worthy goal) with
much less effort for everyone? Supporting backwards-compatible writing is
trivial, and we even have a proof-of-concept in the stats metadata
serializer. It also simplifies by a serious margin the amount of work and
thinking one has to do when a format improvement is implemented -- e.g. the
TTL patch can just address this in exactly the way the problem was
addressed in earlier versions of the format, by capping to 2038, without
any need to specify, obey or test any configuration flags.
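
For illustration, the TTL case reduces to a one-line clamp when writing in a legacy-compatible format (a hedged sketch; names are assumptions, not the actual patch): the legacy expiration field is a signed 32-bit count of seconds since the epoch, so a backwards-compatible writer caps at its maximum, which corresponds to 2038-01-19:

```java
// Hedged sketch of the "cap to 2038" behaviour: when producing
// legacy-compatible sstables whose local expiration time is a signed
// 32-bit number of seconds since the epoch, clamp instead of overflowing.
// Names are illustrative, not the actual Cassandra code.
public class LegacyTtlCapSketch {
    // Integer.MAX_VALUE seconds since the epoch is 2038-01-19T03:14:07Z.
    static final long MAX_LEGACY_EXPIRATION_SECONDS = Integer.MAX_VALUE;

    static int capExpiration(long nowSeconds, int ttlSeconds) {
        long expiration = nowSeconds + ttlSeconds;
        return (int) Math.min(expiration, MAX_LEGACY_EXPIRATION_SECONDS);
    }
}
```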

>> It’s a commitment, and it requires every contributor to consider it as
part of work they produce.

> But it shouldn't be a burden. Ability to downgrade is a testable problem,
so I see this work as a function of the suite of tests the project is
willing to agree on supporting.

I fully agree with this sentiment, and I feel that the current "try to not
introduce breaking changes" approach is adding the burden, but not the
benefits -- because the latter cannot be proven, and are most likely
already broken.

Regards,
Branimir

On Wed, Feb 22, 2023 at 1:01 AM Abe Ratnofsky  wrote:

> Some interesting existing work on this subject is "Understanding and
> Detecting Software Upgrade Failures in Distributed Systems" -
> https://dl.acm.org/doi/10.1145/3477132.3483577,
> also summarized by Andrey Satarin here:
> https://asatarin.github.io/talks/2022-09-upgrade-failures-in-distributed-systems/
>
> They specifically tested Cassandra upgrades, and have a solid list of
> defects that they found. They also describe their testing mechanism
> DUPTester, which includes a component that confirms that the leftover state
> from one version can start up on the next version. There is a wider scope
> of upgrade defects highlighted in the paper, beyond SSTable version support.
>
> I believe the project would benefit from expanding our test suite
> similarly, by parametrizing more tests on upgrade version pairs.
>
> Also, per Benedict's comment:
>
> > It’s a commitment, and it requires every contributor to consider it as
> part of work they produce.
>
> But it shouldn't be a burden. Ability to downgrade is a testable problem,
> so I see this work as a function of the suite of tests the project is
> willing to agree on supporting.
>
> Specifically - I agree with Scott's proposal to emulate the HDFS
> upgrade-then-finalize approach. I would also support automatic finalization
> based on a time threshold or similar, to balance the priorities of safe and
> straightforward upgrades. Users need to be aware of the range of SSTable
> formats supported by a given version, and how to handle when their SSTables
> wouldn't be supported by an upcoming upgrade.
>
> --
> Abe
>


-- 
Branimir Lambov
e. branimir.lam...@datastax.com
w. www.datastax.com


Downgradability

2023-02-20 Thread Branimir Lambov
Hi everyone,

There has been a discussion lately about changes to the sstable format in
the context of being able to abort a cluster upgrade, and the fact that
changes to sstables can prevent downgraded nodes from reading any data
written during their temporary operation with the new version.

Most of the discussion is in CASSANDRA-18134, and is spreading into
CASSANDRA-14277 and CASSANDRA-17698,
none of which is a good place to discuss the topic seriously.

Downgradability is a worthy goal and is listed in the current roadmap. I
would like to open a discussion here on how it would be achieved.

My understanding of what has been suggested so far translates to:
- avoid changes to sstable formats;
- if there are changes, implement them in a way that is
backwards-compatible, e.g. by duplicating data, so that a new version is
presented in a component or portion of a component that legacy nodes will
not try to read;
- if the latter is not feasible, make sure the changes are only applied if
a feature flag has been enabled.

To me this approach introduces several risks:
- it bloats file and parsing complexity;
- it discourages improvement (e.g. CASSANDRA-17698 is no longer a LHF
ticket once this requirement is in place);
- it needs care to avoid risky solutions to address technical issues with
the format versioning (e.g. staying on n-versions for 5.0 and needing a
bump for a 4.1 bugfix might require porting over support for new features);
- it requires separate and uncoordinated solutions to the problem and
switching mechanisms for each individual change.

An alternative solution is to implement/complete CASSANDRA-8110
, which provides a
method of writing sstables for a target version. During upgrades, a node
could be set to produce sstables corresponding to the older version, and
there is a very straightforward way to implement modifications to formats
like the tickets above to conform to its requirements.
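
To make the idea concrete (a hedged sketch; the version strings and serializer shape are illustrative, not CASSANDRA-8110's actual API): the writer consults a configured target version and dispatches to the matching serializer, so downgrade-safe output becomes a single configuration choice rather than a per-feature flag:

```java
import java.util.Map;

// Hedged sketch of writing sstables "for a target version": a registry of
// per-version serializers, selected by configuration during an upgrade.
// Version strings and the serializer interface are illustrative only.
public class VersionedWriterSketch {
    interface StatsSerializer {
        String serialize(long minTimestamp);
    }

    // "old" mimics a prior release's on-disk stats layout; "current" the new one.
    static final Map<String, StatsSerializer> SERIALIZERS = Map.of(
            "old", ts -> "old:" + ts,
            "current", ts -> "current:" + ts + ":extra-fields");

    static String writeStats(String targetVersion, long minTimestamp) {
        StatsSerializer s = SERIALIZERS.get(targetVersion);
        if (s == null) {
            throw new IllegalArgumentException("unknown target version: " + targetVersion);
        }
        return s.serialize(minTimestamp);
    }
}
```

During an upgrade a node would be configured with the older target version, and switched to the current one once the operator decides not to roll back.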

What do people think should be the way forward?

Regards,
Branimir


Re: Downgradability

2023-02-21 Thread Branimir Lambov
ven 8110 addresses this - just writing sstables
>>> in old versions won't help if we ever add things like new types or new
>>> types of collections without other control abilities. Claude's other email
>>> in another thread a few hours ago talks about some of these surprises -
>>> "Specifically during the 3.1 -> 4.0 changes a column broadcast_port was
>>> added to system/local.  This means that 3.1 system can not read the table
>>> as it has no definition for it.  I tried marking the column for deletion in
>>> the metadata and in the serialization header.  The later got past the
>>> column not found problem, but I suspect that it just means that data
>>> columns after broadcast_port shifted and so incorrectly read." - this is a
>>> harder problem to solve than just versioning sstables and network
>>> protocols.
>>>
>>> Stepping back a bit, we have downgrade ability listed as a goal, but
>>> it's not (as far as I can tell) universally enforced, nor is it clear at
>>> which point we will be able to concretely say "this release can be
>>> downgraded to X".   Until we actually define and agree that this is a
>>> real goal with a concrete version where downgrade-ability becomes real, it
>>> feels like things are somewhat arbitrarily enforced, which is probably very
>>> frustrating for people trying to commit work/tickets.
>>>
>>> - Jeff
>>>
>>>
>>>
>>> On Mon, Feb 20, 2023 at 11:48 AM Dinesh Joshi  wrote:
>>>
>>>> I’m a big fan of maintaining backward compatibility. Downgradability
>>>> implies that we could potentially roll back an upgrade at any time. While I
>>>> don’t think we need to retain the ability to downgrade in perpetuity it
>>>> would be a good objective to maintain strict backward compatibility and
>>>> therefore downgradability until a certain point. This would imply
>>>> versioning metadata and extending it in such a way that prior version(s)
>>>> could continue functioning. This can certainly be expensive to implement
>>>> and might bloat on-disk storage. However, we could always offer an option
>>>> for the operator to optimize the on-disk structures for the current version
>>>> then we can rewrite them in the latest version. This optimizes the storage
>>>> and opens up new functionality. This means new features that can work with
>>>> old on-disk structures will be available while others that strictly require
>>>> new versions of the data structures will be unavailable until the operator
>>>> migrates to the new version. This migration IMO should be irreversible.
>>>> Beyond this point the operator will lose the ability to downgrade which is
>>>> ok.
>>>>
>>>> Dinesh
>>>>
>>>> On Feb 20, 2023, at 10:40 AM, Jake Luciani  wrote:
>>>>
>>>> 
>>>> There has been progress on
>>>>
>>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-8928
>>>>
>>>> Which is similar to what datastax does for DSE. Would this be an
>>>> acceptable solution?
>>>>
>>>> Jake
>>>>
>>>> On Mon, Feb 20, 2023 at 11:17 AM guo Maxwell 
>>>> wrote:
>>>>
>>>>> It seems “An alternative solution is to implement/complete
>>>>> CASSANDRA-8110 <https://issues.apache.org/jira/browse/CASSANDRA-8110>”
>>>>> can give us more options if it is finished
>>>>>
>>>>> On Mon, Feb 20, 2023 at 11:03 PM, Branimir Lambov wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> There has been a discussion lately about changes to the sstable
>>>>>> format in the context of being able to abort a cluster upgrade, and the
>>>>>> fact that changes to sstables can prevent downgraded nodes from reading 
>>>>>> any
>>>>>> data written during their temporary operation with the new version.
>>>>>>
>>>>>> Most of the discussion is in CASSANDRA-18134
>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-18134>, and is
>>>>>> spreading into CASSANDRA-14277
>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-14227> and
>>>>>> CASSANDRA-17698
>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-17698>, none of
>>>>>> which is a goo

Re: [VOTE] CEP-26: Unified Compaction Strategy

2023-04-07 Thread Branimir Lambov
The vote passes with 11 +1 binding votes and no vetoes.

Thank you all! The patch will be ready to review soon.

Regards,
Branimir

On Thu, Apr 6, 2023 at 10:52 PM Patrick McFadin  wrote:

> +1
>
> Thanks to Lorina for getting people excited about it at Cassandra Forward!
>
> On Thu, Apr 6, 2023 at 10:37 AM Mick Semb Wever  wrote:
>
>> +1
>>
>> On Thu, 6 Apr 2023 at 19:32, Francisco Guerrero 
>> wrote:
>>
>>> +1 (nb)
>>>
>>> On 2023/04/06 17:30:37 Josh McKenzie wrote:
>>> > +1
>>> >
>>> > On Thu, Apr 6, 2023, at 12:18 PM, Joseph Lynch wrote:
>>> > > +1
>>> > >
>>> > > This proposal looks really exciting!
>>> > >
>>> > > -Joey
>>> > >
>>> > > On Wed, Apr 5, 2023 at 2:13 AM Aleksey Yeshchenko 
>>> wrote:
>>> > > >
>>> > > > +1
>>> > > >
>>> > > > On 4 Apr 2023, at 16:56, Ekaterina Dimitrova <
>>> e.dimitr...@gmail.com> wrote:
>>> > > >
>>> > > > +1
>>> > > >
>>> > > > On Tue, 4 Apr 2023 at 11:44, Benjamin Lerer 
>>> wrote:
>>> > > >>
>>> > > >> +1
>>> > > >>
>>> > > >> On Tue, Apr 4, 2023 at 5:17 PM, Andrés de la Peña <
>>> adelap...@apache.org> wrote:
>>> > > >>>
>>> > > >>> +1
>>> > > >>>
>>> > > >>> On Tue, 4 Apr 2023 at 15:09, Jeremy Hanna <
>>> jeremy.hanna1...@gmail.com> wrote:
>>> > > >>>>
>>> > > >>>> +1 nb, will be great to have this in the codebase - it will
>>> make nearly every table's compaction work more efficiently.  The only
>>> possible exception is tables that are well suited for TWCS.
>>> > > >>>>
>>> > > >>>> On Apr 4, 2023, at 8:00 AM, Berenguer Blasi <
>>> berenguerbl...@gmail.com> wrote:
>>> > > >>>>
>>> > > >>>> +1
>>> > > >>>>
>>> > > >>>> On 4/4/23 14:36, J. D. Jordan wrote:
>>> > > >>>>
>>> > > >>>> +1
>>> > > >>>>
>>> > > >>>> On Apr 4, 2023, at 7:29 AM, Brandon Williams 
>>> wrote:
>>> > > >>>>
>>> > > >>>> 
>>> > > >>>> +1
>>> > > >>>>
>>> > > >>>> On Tue, Apr 4, 2023, 7:24 AM Branimir Lambov <
>>> blam...@apache.org> wrote:
>>> > > >>>>>
>>> > > >>>>> Hi everyone,
>>> > > >>>>>
>>> > > >>>>> I would like to put CEP-26 to a vote.
>>> > > >>>>>
>>> > > >>>>> Proposal:
>>> > > >>>>>
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-26%3A+Unified+Compaction+Strategy
>>> > > >>>>>
>>> > > >>>>> JIRA and draft implementation:
>>> > > >>>>> https://issues.apache.org/jira/browse/CASSANDRA-18397
>>> > > >>>>>
>>> > > >>>>> Up-to-date documentation:
>>> > > >>>>>
>>> https://github.com/blambov/cassandra/blob/CASSANDRA-18397/src/java/org/apache/cassandra/db/compaction/UnifiedCompactionStrategy.md
>>> > > >>>>>
>>> > > >>>>> Discussion:
>>> > > >>>>>
>>> https://lists.apache.org/thread/8xf5245tclf1mb18055px47b982rdg4b
>>> > > >>>>>
>>> > > >>>>> The vote will be open for 72 hours.
>>> > > >>>>> A vote passes if there are at least three binding +1s and no
>>> binding vetoes.
>>> > > >>>>>
>>> > > >>>>> Thanks,
>>> > > >>>>> Branimir
>>> > > >>>>
>>> > > >>>>
>>> > > >
>>> > >
>>>
>>

-- 
Branimir Lambov
e. branimir.lam...@datastax.com
w. www.datastax.com


[VOTE] CEP-26: Unified Compaction Strategy

2023-04-04 Thread Branimir Lambov
Hi everyone,

I would like to put CEP-26 to a vote.

Proposal:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-26%3A+Unified+Compaction+Strategy

JIRA and draft implementation:
https://issues.apache.org/jira/browse/CASSANDRA-18397

Up-to-date documentation:
https://github.com/blambov/cassandra/blob/CASSANDRA-18397/src/java/org/apache/cassandra/db/compaction/UnifiedCompactionStrategy.md

Discussion:
https://lists.apache.org/thread/8xf5245tclf1mb18055px47b982rdg4b

The vote will be open for 72 hours.
A vote passes if there are at least three binding +1s and no binding vetoes.

Thanks,
Branimir


Re: [DISCUSS] CEP-26: Unified Compaction Strategy

2023-03-20 Thread Branimir Lambov
It seems I have created some confusion.

This version of UCS (let's call it V2) is ahead of the one in DSE (V1),
with the main difference that it no longer uses a fixed number of shards.
Because of this, V2 behaves similarly to LCS in the extra space it requires,
because the sstables it constructs aim to be close to a target size. V1 UCS
had some special features to deal with the large sstables it created in the
top levels of each shard, which are not as important for V2: when the
target size is small enough, there should be no need for limiting
compactions to the available free space, or for making sure that large
top-level compactions cannot cause sstables to accumulate on L0.

Because of this, such features of the V1 UCS have been omitted in order to
keep the initial commit small enough to fit the C* 5 timeline (they rely on
some sizable refactorings of the compaction interfaces which should come at
a later date).
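The target-size-based sharding described above can be sketched roughly as follows. This is an illustrative toy, not the actual UCS code; the function name and the base_shard_count/target_size parameters are assumptions, though UCS does pick shard counts at power-of-two multiples of a base count so that split points stay stable as the data grows:

```python
import math

def shard_count(local_data_size, target_size, base_shard_count=4):
    """Smallest power-of-two multiple of base_shard_count that brings the
    per-shard data size at or below the target (illustrative only)."""
    if local_data_size <= base_shard_count * target_size:
        return base_shard_count
    ratio = local_data_size / (base_shard_count * target_size)
    return base_shard_count * 2 ** math.ceil(math.log2(ratio))

# 100 GiB of data with a 1 GiB target: 128 shards, so each shard
# holds roughly 0.78 GiB, keeping top-level sstables near the target.
print(shard_count(100 * 2**30, 2**30))  # -> 128
```

With shards scaling this way, no single sstable grows without bound, which is why the V1 features for handling very large top-level sstables matter less in V2.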

Regards,
Branimir

On Sat, Mar 18, 2023 at 1:05 AM Jeff Jirsa  wrote:

> I’m without laptop this week but looks like CompactionTask#
> reduceScopeForLimitedSpace
>
> So maybe it just comes for free with UCS
>
>
> On Mar 17, 2023, at 6:21 PM, Jeremy Hanna 
> wrote:
>
> You're right that it doesn't handle it in the sense that it doesn't
> resolve the problem, but it also doesn't do what STCS does.  From what
> I've seen, STCS blindly tries to compact and then the disk will fill up
> triggering the disk failure policy.  With UCS it's much less likely and if
> it does happen, my understanding is that it will skip the compaction.  I
> didn't realize that LCS would try to reduce the scope of the compaction.  I
> can't find in the branch where it handles that.
>
> Branimir, can you point to where it handles the scenario?
>
> Thanks,
>
> Jeremy
>
> On Mar 17, 2023, at 4:52 PM, Jeff Jirsa  wrote:
>
>
>
>
> On Mar 17, 2023, at 1:46 PM, Jeremy Hanna 
> wrote:
>
>
>
>
> One much more graceful element of UCS is how it handles running out of
> space: with previous compaction strategies the server just shuts down,
> forcing system administrators to be paranoid about headroom. UCS instead
> has a target overhead (default 20%). First, since the ranges are sharded,
> it is less likely that there will be large sstables whose compaction
> requires as much headroom; second, if it detects that a compaction would
> violate the target overhead, it will log that and skip the compaction - a
> much more graceful way of handling it.
>
>
> Skipping doesn’t really handle it though?
>
>
> If you have a newly flushed sstable full of tombstones and it naturally
> somehow triggers you to exceed that target overhead you never free that
> space? Usually LCS would try to reduce the scope of the compaction, and I
> assume UCS will too?
>
>
>
>
>
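The "log and skip compactions that would exceed the target overhead" behavior discussed in this thread can be sketched as follows. This is assumed logic for illustration only; the function and its parameters are not Cassandra's actual API:

```python
def permitted_compactions(candidates, total_data_size, target_overhead=0.20):
    """candidates: list of (name, input_bytes) pairs. A compaction
    temporarily needs roughly its input size in extra space; skip any
    candidate that would push temporary usage past
    target_overhead * total_data_size (illustrative only)."""
    budget = target_overhead * total_data_size
    allowed, used = [], 0
    for name, input_bytes in candidates:
        if used + input_bytes <= budget:
            allowed.append(name)
            used += input_bytes
        # else: log and skip rather than filling the disk
    return allowed

# budget = 20: "a" fits (10 used), "b" would exceed (25), "c" fits (15)
print(permitted_compactions([("a", 10), ("b", 15), ("c", 5)], 100))
# -> ['a', 'c']
```

As the thread notes, skipping alone cannot free space held by tombstone-heavy sstables, which is where scope reduction (as in reduceScopeForLimitedSpace) comes in.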


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-15 Thread Branimir Lambov
ne some basic cleanup of test
>> variations, so this is not a duplication of the pipeline.  It's a
>> significant improvement.
>>
>> I'm ok with cassandra_latest being committed and added to the pipeline,
>> *if* the authors genuinely believe there's significant time and effort
>> saved in doing so.
>>
>> How many broken tests are we talking about ?
>> Are they consistently broken or flaky ?
>> Are they ticketed up and 5.0-rc blockers ?
>>
>> Having to deal with flakies and broken tests is an unfortunate reality to
>> having a pipeline of 170k tests.
>>
>> Despite real frustrations I don't believe the broken windows analogy is
>> appropriate here – it's more of a leave the campground cleaner…   That
>> being said, knowingly introducing a few broken tests is not that either,
>> but still having to deal with a handful of consistently breaking tests
>> for a short period of time is not the same cognitive burden as flakies.
>> There are currently other broken tests in 5.0: VectorUpdateDeleteTest,
>> upgrade_through_versions_test; are these compounding to the frustrations ?
>>
>> It's also been questioned about why we don't just enable settings we
>> recommend.  These are settings we recommend for new clusters.  Our existing
>> cassandra.yaml needs to be tailored for existing clusters being upgraded,
>> where we are very conservative about changing defaults.
>>
>>

-- 
Branimir Lambov
e. branimir.lam...@datastax.com
w. www.datastax.com


[Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-13 Thread Branimir Lambov
Hi All,

CASSANDRA-18753 introduces a second set of defaults (in a separate
"cassandra_latest.yaml") that enable new features of Cassandra. The
objective is two-fold: to be able to test the database in this
configuration, and to point potential users that are evaluating the
technology to an optimized set of defaults that give a clearer picture of
the expected performance of the database for a new user. The objective is
to get this configuration into 5.0 to have the extra bit of confidence that
we are not releasing (and recommending) options that have not gone through
thorough CI.

The implementation has already gone through review, but I'd like to get
people's opinion on two things:
- There are currently a number of test failures when the new options are
selected, some of which appear to be genuine problems. Is the community
okay with committing the patch before all of these are addressed? This
should prevent the introduction of new failures and make sure we don't
release before clearing the existing ones.
- I'd like to get an opinion on what's suitable wording and documentation
for the new defaults set. Currently, the patch proposes adding the
following text to the yaml (see
https://github.com/apache/cassandra/pull/2896/files):
# NOTE:
#   This file is provided in two versions:
# - cassandra.yaml: Contains configuration defaults for a "compatible"
#   configuration that operates using settings that are backwards-compatible
#   and interoperable with machines running older versions of Cassandra.
#   This version is provided to facilitate pain-free upgrades for existing
#   users of Cassandra running in production who want to gradually and
#   carefully introduce new features.
# - cassandra_latest.yaml: Contains configuration defaults that enable
#   the latest features of Cassandra, including improved functionality as
#   well as higher performance. This version is provided for new users of
#   Cassandra who want to get the most out of their cluster, and for users
#   evaluating the technology.
#   To use this version, simply copy this file over cassandra.yaml, or specify
#   it using the -Dcassandra.config system property, e.g. by running
#     cassandra -Dcassandra.config=file:/$CASSANDRA_HOME/conf/cassandra_latest.yaml
# /NOTE
Does this sound sensible? Should we add a pointer to this defaults set
elsewhere in the documentation?

Regards,
Branimir


Re: [Discuss] "Latest" configuration for testing and evaluation (CASSANDRA-18753)

2024-02-14 Thread Branimir Lambov
> is there a reason all guardrails and reliability (aka repair retries)
> configs are off by default?  They are off by default in the normal config
> for backwards compatibility reasons, but if we are defining a config saying
> what we recommend, we should enable these things by default IMO.

This is one more question to be answered by this discussion. Are there
other options that should be enabled by the "latest" configuration? To what
values should they be set?
Is there something that is currently enabled that should not be?
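For context, Cassandra's guardrails follow a warn/fail threshold convention in which a value of -1 disables the check; a minimal sketch of that pattern (names assumed, not the actual implementation):

```python
def check_guardrail(value, warn_threshold, fail_threshold):
    """Generic warn/fail guardrail check. A threshold of -1 means the
    guardrail is disabled, matching the cassandra.yaml convention."""
    if fail_threshold >= 0 and value > fail_threshold:
        return "fail"   # reject the operation
    if warn_threshold >= 0 and value > warn_threshold:
        return "warn"   # allow, but emit a client warning / log entry
    return "ok"

print(check_guardrail(5, -1, -1))      # -> ok   (guardrail disabled)
print(check_guardrail(150, 100, 200))  # -> warn
print(check_guardrail(250, 100, 200))  # -> fail
```

"Enabling guardrails by default" in cassandra_latest.yaml would then amount to shipping non-negative thresholds instead of -1.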

> Should we merge the configs breaking these tests?  No…. When we have
> failing tests people do not spend the time to figure out if their logic
> caused a regression and merge, making things more unstable… so when we
> merge failing tests that leads to people merging even more failing tests...

In this case this also means that people will not see at all failures that
they introduce in any of the advanced features, as they are not tested at
all. Also, since CASSANDRA-19167 and 19168 already have fixes, the
non-latest test suite will remain clean after merge. Note that these two
problems demonstrate that we have failures in the configuration we ship
with, because we are not actually testing it at all. IMHO this is a problem
that we should not delay fixing.

Regards,
Branimir

On Wed, Feb 14, 2024 at 1:07 AM David Capwell  wrote:

> so can cause repairs to deadlock forever
>
>
> Small correction, I finished fixing the tests in CASSANDRA-19042 and we
> don’t deadlock, we timeout and fail repair if any of those messages are
> dropped.
>
> On Feb 13, 2024, at 11:04 AM, David Capwell  wrote:
>
> and to point potential users that are evaluating the technology to an
> optimized set of defaults
>
>
> Left this comment in the GH… is there a reason all guardrails and
> reliability (aka repair retries) configs are off by default?  They are
> off by default in the normal config for backwards compatibility reasons,
> but if we are defining a config saying what we recommend, we should enable
> these things by default IMO.
>
> There are currently a number of test failures when the new options are
> selected, some of which appear to be genuine problems. Is the community
> okay with committing the patch before all of these are addressed?
>
>
> I was tagged on CASSANDRA-19042, the paxos repair message handling does
> not have the repair reliability improvements that 5.0 has, so can cause
> repairs to deadlock forever (same as current 4.x repairs).  Bringing these
> up to par with the rest of repair would be very much welcome (they are also
> lacking visibility, so need to fallback to heap dumps to see what’s going
> on; same as 4.0.x but not 4.1.x), but I doubt I have cycles to do that….
> This refactor is not 100% trivial as it has fun subtle concurrency issues
> to address (message retries and dedupping), and making sure this logic
> works with the existing repair simulation tests does require refactoring
> how the paxos cleanup state is tracked, which could have subtle consequences.
>
> I do think this should be fixed, but should it block 5.0?  Not sure… will
> leave to others….
>
> Should we merge the configs breaking these tests?  No…. When we have
> failing tests people do not spend the time to figure out if their logic
> caused a regression and merge, making things more unstable… so when we
> merge failing tests that leads to people merging even more failing tests...
>
> On Feb 13, 2024, at 8:41 AM, Branimir Lambov  wrote:
>
> Hi All,
>
> CASSANDRA-18753 introduces a second set of defaults (in a separate
> "cassandra_latest.yaml") that enable new features of Cassandra. The
> objective is two-fold: to be able to test the database in this
> configuration, and to point potential users that are evaluating the
> technology to an optimized set of defaults that give a clearer picture of
> the expected performance of the database for a new user. The objective is
> to get this configuration into 5.0 to have the extra bit of confidence that
> we are not releasing (and recommending) options that have not gone through
> thorough CI.
>
> The implementation has already gone through review, but I'd like to get
> people's opinion on two things:
> - There are currently a number of test failures when the new options are
> selected, some of which appear to be genuine problems. Is the community
> okay with committing the patch before all of these are addressed? This
> should prevent the introduction of new failures and make sure we don't
> release before clearing the existing ones.
> - I'd like to get an opinion on what's suitable wording and documentation
> for the new defaults set. Currently, the patch proposes adding the
> following text to the yaml (see
> https://github.com/apache/cassandra/pull/2896/