Re: Cassandra on RocksDB experiment result

2017-04-26 Thread Dikang Gu
@Samba, that's a very good point. I definitely do not expect all storage
engines to provide exactly the same features; each storage engine should have
its own strengths and sweet spots as well. For features not supported by a
given storage engine, I think it should throw an exception and fail the
request. That is better than silently swallowing the error.
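The fail-fast behavior described here could look roughly like this (an illustrative Python sketch with invented names, not Cassandra's actual API): each engine declares the features it supports, and an unsupported operation raises instead of being silently swallowed.

```python
# Illustrative sketch only: class and feature names are invented.
class StorageEngine:
    SUPPORTED_FEATURES = set()

    def check_feature(self, feature):
        # Fail the request up front rather than returning a partial result.
        if feature not in self.SUPPORTED_FEATURES:
            raise NotImplementedError(
                f"{type(self).__name__} does not support {feature!r}")


class RocksEngine(StorageEngine):
    SUPPORTED_FEATURES = {"key_value", "wide_column"}


class NativeEngine(StorageEngine):
    SUPPORTED_FEATURES = {"key_value", "wide_column", "secondary_index"}


def execute(engine, feature, operation):
    engine.check_feature(feature)   # throws instead of swallowing the error
    return operation()
```

The query layer then only needs one generic check per request, rather than per-engine special cases scattered through the read/write path.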




-- 
Dikang


Re: Cassandra on RocksDB experiment result

2017-04-26 Thread Samba
Some features may work with some storage engines but not with others; for
example, storing large blobs may be efficient in one storage engine while
much worse in another. Perhaps some storage engines may want to SKIP some
features, or add more.

If a storage engine skips a feature, how should the query executor handle
the response, or lack of it?
If a storage engine provides a new feature, how should that be enabled for
that particular storage engine alone?
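One possible answer to the second question, sketched with invented names: each engine declares the table options it understands, so an engine-specific feature can be enabled only for tables backed by that engine, and anything else is rejected at definition time.

```python
# Illustrative sketch only: option names and classes are invented.
class BaseEngine:
    KNOWN_OPTIONS = {"compression"}           # options every engine accepts

    @classmethod
    def validate_options(cls, options):
        # Reject options this engine does not understand.
        unknown = set(options) - cls.KNOWN_OPTIONS
        if unknown:
            raise ValueError(
                f"{cls.__name__} does not know options: {sorted(unknown)}")
        return dict(options)


class RocksDBEngine(BaseEngine):
    # An engine-specific knob, enabled for this engine alone.
    KNOWN_OPTIONS = BaseEngine.KNOWN_OPTIONS | {"bloom_bits_per_key"}
```

Under this scheme a new feature never leaks into other engines: tables on the default engine simply cannot declare the RocksDB-only option.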



Re: Cassandra on RocksDB experiment result

2017-04-25 Thread Dikang Gu
I created several tickets to start the discussion; please feel free to
comment on the JIRAs. I'm also open to suggestions about other efficient
ways to discuss this.

https://issues.apache.org/jira/browse/CASSANDRA-13474
https://issues.apache.org/jira/browse/CASSANDRA-13475
https://issues.apache.org/jira/browse/CASSANDRA-13476

Thanks
Dikang.



-- 
Dikang


Re: Cassandra on RocksDB experiment result

2017-04-24 Thread Dikang Gu
Thanks everyone for the feedback and suggestions! They are all very
helpful. I'm looking forward to having more discussions about the
implementation details.

As the next step, we will focus on three areas:
1. Pluggable storage engine interface.
2. Wide column support on RocksDB.
3. Streaming support on RocksDB.
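The three areas above suggest an interface along these lines (a minimal illustrative sketch with invented names; a toy in-memory engine stands in for RocksDB):

```python
# Illustrative sketch only: not the actual proposed interface.
import abc


class PluggableStorageEngine(abc.ABC):
    @abc.abstractmethod
    def put(self, partition_key, clustering_key, value):
        """Write one cell."""

    @abc.abstractmethod
    def get(self, partition_key, clustering_key):
        """Read one cell, or None if absent."""

    @abc.abstractmethod
    def scan_partition(self, partition_key):
        """Wide-column read: (clustering_key, value) pairs in sorted order."""

    @abc.abstractmethod
    def stream_partitions(self, keys):
        """Yield whole partitions, e.g. to stream data to another replica."""


class InMemoryEngine(PluggableStorageEngine):
    def __init__(self):
        self._data = {}

    def put(self, pk, ck, value):
        self._data.setdefault(pk, {})[ck] = value

    def get(self, pk, ck):
        return self._data.get(pk, {}).get(ck)

    def scan_partition(self, pk):
        return sorted(self._data.get(pk, {}).items())

    def stream_partitions(self, keys):
        for pk in keys:
            if pk in self._data:
                yield pk, self.scan_partition(pk)
```

The point of the sketch is that wide-column reads and streaming are part of the engine contract, so a RocksDB-backed implementation can satisfy them with its own on-disk layout.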

I will go ahead and create some JIRAs, to start the discussion about
pluggable storage interface, and how to plug RocksDB into Cassandra.

Please let me know your thoughts.

Thanks!
Dikang.




-- 
Dikang


Re: Cassandra on RocksDB experiment result

2017-04-24 Thread Patrick McFadin
Dikang,

First, I want to thank you and everyone else at Instagram for the
engineering talent you have devoted to the Cassandra project. Here's yet
another great example.

He's going to hate me for dragging him into this, but Vijay Parthasarathy
has done some exploratory work before on integrating non-Java storage into
Cassandra. He might be a helpful person to consult.

Patrick






Re: Cassandra on RocksDB experiment result

2017-04-24 Thread Nate McCall
> Please take a look and let me know your thoughts. I think the biggest
> latency win comes from getting rid of most of the Java garbage created by the
> current read/write path and compactions, which reduces the JVM overhead and
> makes the latency more predictable.
>

I want to put this here for the record:
https://issues.apache.org/jira/browse/CASSANDRA-2995

There are some valid points in the above about increased surface area
and end-user confusion. That said, just under six years is a long
time. I think we are a more mature project now and I completely agree
with others about the positive impacts of testability this would
inherently provide.

+1 from me.

Dikang, thank you for opening this discussion and sharing your efforts so far.


Re: Cassandra on RocksDB experiment result

2017-04-22 Thread Eric Stevens
In the spirit of what Eric mentions, as a community member, I'm
enthusiastically +1 on the idea.

>


Re: Cassandra on RocksDB experiment result

2017-04-21 Thread Eric Evans
On Fri, Apr 21, 2017 at 4:32 AM, benjamin roth  wrote:
> I am not a PMC member or sth but just my 2 cents:

Somewhat off-topic here, but I'd like to start discouraging people
from prefacing remarks like this ("not a PMC member", "non-binding
+1").  The exchange rate here is 1:1 IMO, your 2 cents are worth the
same as any others! ;)


-- 
Eric Evans
john.eric.ev...@gmail.com


Re: Cassandra on RocksDB experiment result

2017-04-20 Thread Jason Brown
I'm +1 on the idea of a pluggable storage engine. There's clearly a
bandwidth problem in developing, reviewing, and maintaining multiple storage
engines, but I think having the interface is a good thing and can enhance
testability.

At a minimum I think it's worthwhile to explore the storage engine
interface, although it may turn out that it's infeasible/impractical given
the current system. And that's OK.

Thanks,

-Jason


>


Re: Cassandra on RocksDB experiment result

2017-04-20 Thread Jeff Jirsa
Let's try to make this actionable. Long time
contributors/committers/members of the PMC (especially you guys who have
been working on internals for 4-8 years):

Setting aside details of the implementation, does anyone feel that
pluggable storage in itself is inherently a bad idea (so much so that you'd
-1 it if someone else did the work)?

If we can establish loose consensus on it being something generally
acceptable (assuming someone can come up with an interface/abstraction upon
which everyone can agree), then it seems like the next step is working on
defining the proper interface.

- Jeff


On Wed, Apr 19, 2017 at 9:21 AM, Dikang Gu  wrote:

> Hi Cassandra developers,
>
> This is Dikang from Instagram. I'd like to share with you some experiment
> results we did recently, using RocksDB as Cassandra's storage engine. In
> the experiment, I built a prototype to integrate Cassandra 3.0.12 and
> RocksDB on a single-column (key-value) use case, shadowed one of our
> production use cases, and saw about a 4-6X P99 read latency drop during peak
> time, compared to 3.0.12. Also, the P99 latency became more predictable as
> well.
>
> Here is a detailed note with more metrics:
>
> https://docs.google.com/document/d/1Ztqcu8Jzh4USKoWBgDJQw82DBurQmsV-PmfiJYvu_Dc/edit?usp=sharing
>
> Please take a look and let me know your thoughts. I think the biggest
> latency win comes from getting rid of most of the Java garbage created by the
> current read/write path and compactions, which reduces the JVM overhead and
> makes the latency more predictable.
>
> We are very excited about the potential performance gain. As the next step,
> I propose to make the Cassandra storage engine pluggable (like MySQL
> and MongoDB), and we are very interested in providing RocksDB as one
> storage option with more predictable performance, together with the community.
>
> Thanks.
>
> --
> Dikang
>


RE: Cassandra on RocksDB experiment result

2017-04-20 Thread Jacques-Henri Berthemet
It's an interesting experiment!
Did you test with more than 2 nodes? Do you expect scalability to be as linear
as with regular Cassandra?

--
Jacques-Henri Berthemet

-Original Message-
From: Eric Stevens [mailto:migh...@gmail.com] 
Sent: Thursday, 20 April 2017 00:41
To: dev@cassandra.apache.org
Subject: Re: Cassandra on RocksDB experiment result

> Right now all the compaction strategies share the assumption that the data
> structure and layout on disk is fixed. With a pluggable storage engine, we
> need to special-case each compaction strategy (or at least the abstract
> class of compaction strategy) for each engine.

As Ben points out, compaction should be treated as _part_ of the storage
layer.  The need to engage in compaction is a consequence of immutable data
stores behind a mutable interface.  A different storage layer might require
a different approach or might not even require compaction in the same sense.
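That idea, compaction as an engine-internal concern rather than something upper layers orchestrate, can be sketched as follows (a toy illustration with invented names): a log-structured engine merges its immutable runs, while an engine whose on-disk structure doesn't need compaction simply makes it a no-op.

```python
# Toy illustration only: names and structure are invented.
class Engine:
    def maybe_compact(self):
        """Engines that don't need compaction do nothing."""
        return 0


class LSMEngine(Engine):
    def __init__(self):
        self.tables = []              # immutable sorted runs, oldest first

    def write_table(self, table):
        self.tables.append(dict(table))

    def maybe_compact(self):
        merged = {}
        for t in self.tables:         # newer tables win on key overlap
            merged.update(t)
        compacted = len(self.tables)
        self.tables = [merged]        # replace N runs with one merged run
        return compacted
```

With this shape, the upper tiers never need a per-engine matrix of compaction strategies; they only ask the engine to do whatever maintenance it requires.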



Re: Cassandra on RocksDB experiment result

2017-04-19 Thread Ben Bromhead
This looks super cool; I'd love to see more details.

On a general note, a pluggable storage layer allows other storage engines
(and possibly datastores) to leverage Cassandra's distributed primitives
(Dynamo, gossip, Paxos?, drivers, CQL, etc.). This could allow Cassandra to
fill similar use cases as Dynomite from Netflix.

Also, as Sankalp mentioned, we get some other benefits, including better
testability.

In my experience with pluggable storage engines (in the MySQL world), the
> engine manages all storage that it "owns." The higher tiers in the
> architecture don't need to get involved unless multiple storage engines
> have to deal with compaction (or similar) issues over the entire database,
> e.g., every storage engine has read/write access to every piece of data,
> even if that data is owned by another storage engine.
>
> I don't know enough about Cassandra internals to have an opinion as to
> whether or not the above scenario makes sense in the Cassandra context. But
> "sharing" (processes or data) between storage engines gets pretty hairy,
> easily deadlocky (!), even in something as relatively straightforward as
> MySQL.


This would be an implementation detail, but given that tables in Cassandra
don't know about each other (no joins, foreign keys, etc.; ignore materialized
views for the moment), storage engine interactions probably wouldn't be an issue.


> This was a long and old debate we had several times in the past. One of
> the difficulties of a pluggable storage engine is that we need to manage the
> differences between the LSMT of native C* and the RocksDB engine for
> compaction, repair, streaming, etc.
>
> Right now all the compaction strategies share the assumption that the
> data structure and layout on disk is fixed. With a pluggable storage engine,
> we need to special-case each compaction strategy (or at least the abstract
> class of compaction strategy) for each engine.


> The current approach is one storage engine, many compaction strategies for
> different use-cases (TWCS for time series, LCS for heavy update...).
>
> With pluggable storage engine, we'll have a matrix of storage engine x
> compaction strategies.
>

Compaction is part of the storage engine, and if I understand Dikang's
design spec, it is bypassed?

Cassandra's current storage engine is a log-structured merge tree. RocksDB
does its own thing.

Again, this is an implementation detail about where the storage engine
interface line is drawn, but from the compaction example above I think it
is a non-issue?


> And not even mentioning the other operations to handle like streaming and
> repair.
>

Streaming and repair would be harder problems to solve than compaction,
imho.
-- 
Ben Bromhead
CTO | Instaclustr 
+1 650 284 9692
Managed Cassandra / Spark on AWS, Azure and Softlayer


RE: Cassandra on RocksDB experiment result

2017-04-19 Thread Bob Dourandish
There is probably something I missed in the message below.

In my experience with pluggable storage engines (in the MySQL world), the 
engine manages all storage that it "owns." The higher tiers in the architecture 
don't need to get involved unless multiple storage engines have to deal with 
compaction (or similar) issues over the entire database, e.g., every storage 
engine has read/write access to every piece of data, even if that data is owned 
by another storage engine.

I don't know enough about Cassandra internals to have an opinion as to whether 
or not the above scenario makes sense in the Cassandra context. But "sharing" 
(processes or data) between storage engines gets pretty hairy, easily deadlocky 
(!), even in something as relatively straightforward as MySQL. 

So this could be a way cool project and I'd love to get involved if it gets 
off the ground.

Bob





Re: Cassandra on RocksDB experiment result

2017-04-19 Thread DuyHai Doan
"I have no clue what it would take to accomplish a pluggable storage
engine, but I love this idea."

This was a long and old debate we had several times in the past. One of the
difficulties of a pluggable storage engine is that we need to manage the
differences between the LSMT of native C* and the RocksDB engine for
compaction, repair, streaming, etc.

Right now all the compaction strategies share the assumption that the data
structure and layout on disk is fixed. With a pluggable storage engine, we
need to special-case each compaction strategy (or at least the abstract
class of compaction strategy) for each engine.

The current approach is one storage engine, many compaction strategies for
different use-cases (TWCS for time series, LCS for heavy update...).

With a pluggable storage engine, we'll have a matrix of storage engines x
compaction strategies.

And that's not even mentioning the other operations to handle, like
streaming and repair.

Another question that arose is: will the storage engine run in the same
JVM as the C* server, or in a separate process? For the latter, we're
opening the door to yet-another-distributed-system complexity. For
instance, how will the C* JVM communicate with the storage engine process?
How do we handle failure, crash, resume, etc.?
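[Editor's note] As an illustration of that out-of-process complexity, here is a minimal sketch of a bounded call to a separate engine process (all names are hypothetical; a real design would use a proper RPC layer, and a crashed engine would surface as a timeout the coordinator must handle):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical transport to an out-of-process engine (socket, pipe, RPC). */
interface EngineTransport {
    byte[] call(byte[] request) throws Exception;
}

final class RemoteEngineClient {
    private final EngineTransport transport;
    private final long timeoutMs;
    // Daemon threads so a hung engine call cannot block JVM shutdown.
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    RemoteEngineClient(EngineTransport transport, long timeoutMs) {
        this.transport = transport;
        this.timeoutMs = timeoutMs;
    }

    /** One bounded attempt; the caller decides whether to retry or fail. */
    byte[] call(byte[] request)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<byte[]> f = pool.submit(() -> transport.call(request));
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            f.cancel(true);            // free the worker on timeout/failure
        }
    }
}
```

Even this toy version shows the new failure modes (timeouts, orphaned requests, restart detection) that an in-JVM engine never has to expose to the read/write path.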

That being said, if we manage to get the code base to this stage eventually,
it'd be super cool!

On Wed, Apr 19, 2017 at 12:03 PM, Salih Gedik  wrote:

> Hi Dikang,
>
> I guess there is something wrong with the link that you shared.
>
>
> On 19.04.2017 at 19:21, Dikang Gu wrote:
>
> Hi Cassandra developers,
>>
>> This is Dikang from Instagram. I'd like to share with you some experiment
>> results from our recent work using RocksDB as Cassandra's storage engine.
>> In the experiment, I built a prototype integrating Cassandra 3.0.12 and
>> RocksDB for a single-column (key-value) use case, shadowed one of our
>> production use cases, and saw about a 4-6x P99 read latency drop during
>> peak time compared to 3.0.12. The P99 latency also became more
>> predictable.
>>
>> Here is a detailed note with more metrics:
>>
>> https://docs.google.com/document/d/1Ztqcu8Jzh4USKoWBgDJQw82DBurQmsV-PmfiJYvu_Dc/edit?usp=sharing
>>
>> Please take a look and let me know your thoughts. I think the biggest
>> latency win comes from eliminating most of the Java garbage created by
>> the current read/write path and compactions, which reduces the JVM
>> overhead and makes latency more predictable.
>>
>> We are very excited about the potential performance gain. As the next
>> step, I propose making the Cassandra storage engine pluggable (like MySQL
>> and MongoDB), and we are very interested in providing RocksDB as one
>> storage option with more predictable performance, together with the
>> community.
>>
>> Thanks.
>>
>>
>


Re: Cassandra on RocksDB experiment result

2017-04-19 Thread sankalp kohli
We should definitely evaluate a pluggable storage engine... Besides several
other advantages, it would also help in adding a lot of tests to the storage
engine.

On Wed, Apr 19, 2017 at 11:22 AM, Jon Haddad wrote:

> I have no clue what it would take to accomplish a pluggable storage
> engine, but I love this idea.
>
> Obviously the devil is in the details, & a simple K/V is very different
> from supporting partitions, collections, etc, but this is very cool & seems
> crazy not to explore further.  Will you be open sourcing this work?
>
> Jon
>
>


Re: Cassandra on RocksDB experiment result

2017-04-19 Thread Jon Haddad
I have no clue what it would take to accomplish a pluggable storage engine, but 
I love this idea.  

Obviously the devil is in the details, & a simple K/V is very different from 
supporting partitions, collections, etc, but this is very cool & seems crazy 
not to explore further.  Will you be open sourcing this work?

Jon

