Re: Cassandra data model right definition

selcuk mart Fri, 14 Oct 2016 07:01:46 -0700

unsubscribe


On 3.10.2016 at 16:25, Edward Capriolo wrote:

The phrase is defensible, but that is the root of the problem. Take, for example, a skateboard.


"A skateboard is like a bike because it has wheels and you ride on it."

That is true, and defensibly true. :) However, with not much more text you can accurately describe what it is, as opposed to something it is almost like.

"A skateboard is a thin piece of wood on top of four small wheels that you stand on and ride."

The old Cassandra statement was something to the effect of "with the storage model of Bigtable and the consistency model of Dynamo". This accurately described the system and gave reference to specific known quantities (Bigtable/Dynamo) for which white papers existed for further reading.

On Mon, Oct 3, 2016 at 6:24 AM, Benedict Elliott Smith <bened...@apache.org <mailto:bened...@apache.org>> wrote:


    While that sentence leaves a lot to be desired (for me because it
    confers a different meaning on row store), it doesn't say
    "Cassandra is like a RDBMS" - it says "like an RDBMS, it organises
    data by rows and columns" - i.e., in this regard only it is like
    an RDBMS, not more generally.

    I believe it was meant to help people, especially those afraid of
    the NoSQL thrift world, understand that it still uses the basic
    concept of rows and columns they are used to.  I agree it could
    be improved to minimise the chance of misreading it, and I'm
    certain contributions would be welcome here.

    I don't personally want to get bogged down in analysing every
    piece of text anyone has ever written, so I'll bow out of further
    discussion on this.  These phrases may all be suboptimal, but they
    are certainly defensible.  Column store is not, that's all I
    wanted to contribute here.





    On 1 October 2016 at 19:35, Peter Lin <wool...@gmail.com
    <mailto:wool...@gmail.com>> wrote:

        I'll second Ed's comment.

        The documentation should be more careful when using phrases
        such as "like relational databases". When we look at the history
        of relational databases, people expect certain things like ACID
        transactions, primary/foreign key constraints, query planners,
        joins and relational algebra. Clearly Cassandra's storage
        engine does not follow most of those principles, for good reason.

        The term row oriented storage would be more descriptive and
        appropriate. It avoids conflating Cassandra storage engine
        with "traditional" relational storage engines. Those of us
        that have spent over a decade using IBM DB2, Oracle, Sql
        Server and Sybase tend to think of relational databases in a
        certain way. If we go back to 1998, most RDBMS storage engines
        had a max row size limit. Databases like Sybase before version
        9 preferred RAW disk for optimal performance. I can go on and
        on, but there's no point really.

        Cassandra's storage engine is "row oriented", but it's not
        relational in the RDBMS sense. We do everyone a huge disservice by
        using confusing terminology and then making fun of those who
        get confused. No one wins when that happens. At the end of the
        day, what differentiates Cassandra's storage engine is that it
        supports static and dynamic columns, which traditional RDBMSs
        don't support today. Calling Cassandra storage "distributed
        tables" doesn't really help, in my biased opinion.

        For example, if you tell a SqlServer or Oracle RAC admin
        "cassandra uses distributed tables" they might answer "so
        what, sql server and oracle can do that too." The difference
        is with RDBMS the partitioning is optional and requires more
        work to configure. Whereas with Cassandra you can have
        everything in 1 node, which means there is only 1 partition
        and no different to 1 instance of sql server. Where you win is
        when you need to add 2 more nodes, Cassandra makes this easier
        whereas with SqlServer and Oracle you have to do a little bit
        more work. I've lost count of how many times I've had to explain
        NoSQL databases to RDBMS admins and had to explain that the
        official docs are stupid.



        On Sat, Oct 1, 2016 at 11:31 AM, Edward Capriolo
        <edlinuxg...@gmail.com <mailto:edlinuxg...@gmail.com>> wrote:

            https://github.com/apache/cassandra
            <https://github.com/apache/cassandra>

            Row store
            <http://wiki.apache.org/cassandra/DataModel> means that
            like relational databases, Cassandra organizes data by
            rows and columns. The Cassandra Query Language (CQL) is a
            close relative of SQL.

            I generally do not know what to say about these high level
            "oversimplifications" like "firewalls block hackers". Are
            there "firewalls" or do they mean IP routers with layer 4
            packet inspections and layer 3 Access Control Lists?

            We say (and I catch myself doing it all the time) "like
            relational databases" often, as if all relational databases
            work alike. A columnar store like HP Vertica is a
            relational database. MySQL has different storage engines;
            does MyISAM work like InnoDB?

            Google docs organizes data by rows and columns as well.
            You can wrap any storage system into an API that makes
            them look like rows and columns. Microsoft LINQ can
            enumerate your network cards and query them
            https://msdn.microsoft.com/en-us/library/bb308959.aspx
            <https://msdn.microsoft.com/en-us/library/bb308959.aspx> ,
            that really does not make your network cards a "row store"

            "Theoretically a row can have 2 billion columns, but in
            practice it shouldn't have more than 100 million columns."
            In practice (in my experience) the number is much lower
            than 100 million, and if the data is actually deleted and
            re-added frequently, the number of live columns (rows,
            whatever) you can use happily is even lower.


            I believe on twitter (I am unable to find the tweet)
            someone was trying to convince me Cassandra was a
            "columnar analytic database".  ROFL

            I believe telling someone it is a "row store", "like a
            database", is not a good idea. They might walk away content
            with that explanation. You are setting them up to walk
            into an anti-pattern, like a case where the user is
            attempting to write and delete 1 row and 1 column 6
            billion times a day. Then you end up explaining to them
            http://stackoverflow.com/questions/21755286/what-exactly-happens-when-tombstone-limit-is-reached
            <http://stackoverflow.com/questions/21755286/what-exactly-happens-when-tombstone-limit-is-reached>
            and how the Cassandra storage model is not "like a
            relational database".

            On Fri, Sep 30, 2016 at 9:22 PM, Edward Capriolo
            <edlinuxg...@gmail.com <mailto:edlinuxg...@gmail.com>> wrote:

                I can iterate over JSON data stored in mongo and
                present it as a table with rows and columns. It does
                not make mongo a row store.

                On Fri, Sep 30, 2016 at 9:16 PM, Edward Capriolo
                <edlinuxg...@gmail.com <mailto:edlinuxg...@gmail.com>>
                wrote:

                    The problem with calling it a row store:

                    https://en.wikipedia.org/wiki/Row_(database)
                    <https://en.wikipedia.org/wiki/Row_%28database%29>

                    In the context of a relational database
                    <https://en.wikipedia.org/wiki/Relational_database>,
                    a *row* (also called a record
                    <https://en.wikipedia.org/wiki/Record_%28computer_science%29> or
                    tuple
                    <https://en.wikipedia.org/wiki/Tuple>) represents a
                    single, implicitly structured data
                    <https://en.wikipedia.org/wiki/Data> item in a
                    table
                    <https://en.wikipedia.org/wiki/Table_%28database%29>.
                    In simple terms, a database table can be thought
                    of as consisting of /rows/ and columns
                    <https://en.wikipedia.org/wiki/Column_%28database%29> or
                    fields
                    <https://en.wikipedia.org/wiki/Field_%28computer_science%29>.
                    Each row in a table represents a set of related
                    data, and every row in the table has the same
                    structure.

                    When you have static columns and rows with maps
                    and lists, it is hard to argue that every row has
                    the same structure. Physically, at the storage
                    layer, they do not have the same structure, and
                    logically, when accessing the data, they barely have
                    the same structure, because the static column merely
                    appears inside each row; it is not actually
                    contained in it.
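
To make the static-column point concrete, here is a toy Python model. It is only a sketch and not Cassandra's actual storage format: the class `ToyPartition` and its methods are hypothetical names invented for this illustration. The static value is stored once per partition and is only projected into each row at read time, and rows that hold a list or a map end up with different shapes.

```python
class ToyPartition:
    """Toy model of one partition; hypothetical, not Cassandra's real layout."""

    def __init__(self):
        self.static = {}  # static columns: stored once for the whole partition
        self.rows = {}    # clustering key -> regular columns of that row

    def set_static(self, **columns):
        self.static.update(columns)

    def write_row(self, clustering_key, **columns):
        self.rows.setdefault(clustering_key, {}).update(columns)

    def read(self):
        # The static columns are merged into every row only when reading.
        return [
            {"clustering_key": key, **self.static, **columns}
            for key, columns in sorted(self.rows.items())
        ]


partition = ToyPartition()
partition.set_static(owner="ed")
partition.write_row(1, tags=["a", "b"])   # this row carries a list
partition.write_row(2, attrs={"k": "v"})  # this row carries a map

for row in partition.read():
    print(row)
```

In this sketch `owner` shows up in every row returned by `read()`, yet it is absent from `partition.rows`, which is the sense in which a static column "appears inside" rows that do not contain it.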

                    On Fri, Sep 30, 2016 at 4:47 PM, Jonathan Haddad
                    <j...@jonhaddad.com <mailto:j...@jonhaddad.com>> wrote:

                        +1000 to what Benedict says. I usually call it
                        a "partitioned row store", which usually needs
                        some extra explanation but is more accurate
                        than "column family" or whatever other Thrift-era
                        terminology people still use.

                        On Fri, Sep 30, 2016 at 1:53 PM DuyHai Doan
                        <doanduy...@gmail.com
                        <mailto:doanduy...@gmail.com>> wrote:

                            I used to present Cassandra as a NoSQL
                            datastore with "distributed" tables. This
                            definition is closer to CQL and has some
                            academic background (distributed hash table).


                            On Fri, Sep 30, 2016 at 7:43 PM, Benedict
                            Elliott Smith <bened...@apache.org
                            <mailto:bened...@apache.org>> wrote:

                                Cassandra is not a "wide column store"
                                anymore.  It has a schema. Only Thrift
                                users still think they don't have a
                                schema (though they do), and Thrift is
                                being deprecated.

                                I really wish everyone would kill the
                                term "wide column store" with fire. It
                                seems to have never meant anything
                                beyond "schema-less, row-oriented",
                                and a "column store" means literally
                                the opposite of this.

                                Not only that, but people don't even
                                seem to realise the term "column
                                store" existed long before "wide
                                column store" and the latter is often
                                abbreviated to the former, as here:
                                http://www.planetcassandra.org/what-is-nosql/
                                <http://www.planetcassandra.org/what-is-nosql/>


                                Since it no longer applies, let's all
                                agree as a community to forget this
                                awful nomenclature ever existed.



                                On 30 September 2016 at 18:09, Joaquin
                                Casares <joaq...@thelastpickle.com
                                <mailto:joaq...@thelastpickle.com>> wrote:

                                    Hi Mehdi,

                                    I can help clarify a few things.

                                    As Carlos said, Cassandra is a
                                    Wide Column Store. Theoretically a
                                    row can have 2 billion columns,
                                    but in practice it shouldn't have
                                    more than 100 million columns.

                                    Cassandra partitions data to
                                    certain nodes based on the
                                    partition key(s), but does provide
                                    the option of setting zero or more
                                    clustering keys. Together,
                                    the partition key(s) and
                                    clustering key(s) form the primary
                                    key.

                                    When writing to Cassandra, you
                                    will need to provide the full
                                    primary key; when reading
                                    from Cassandra, however, you only
                                    need to provide the full partition key.

                                    When you only provide the
                                    partition key for a read
                                    operation, you're able to return
                                    all columns that exist on that
                                    partition with low latency. These
                                    columns are displayed as "CQL
                                    rows" to make it easier to reason
                                    about.

                                    Consider the schema:

                                        CREATE TABLE foo (
                                          bar uuid,      -- partition key, part 1
                                          boz uuid,      -- partition key, part 2
                                          baz timeuuid,  -- clustering key
                                          data1 text,
                                          data2 text,
                                          PRIMARY KEY ((bar, boz), baz)
                                        );


                                    When you write to Cassandra you
                                    will need to send bar, boz, and
                                    baz and optionally data*, if it's
                                    relevant for that CQL row. If you
                                    choose not to set a data* field
                                    for a particular CQL row, then
                                    nothing is stored or allocated on
                                    disk. But I wouldn't consider that
                                    caveat to be "schema-less".

                                    However, all writes to the same
                                    bar/boz will end up on the same
                                    Cassandra replica set (a
                                    configurable number of nodes) and
                                    be stored in the same place(s) on
                                    disk within the SSTable(s). And on
                                    disk, each field that's not a
                                    partition key is stored as a
                                    column, including clustering keys
                                    (this is optimized in Cassandra
                                    3+, but now we're getting deep
                                    into internals).

                                    In this way you can get fast
                                    responses for all activity for
                                    bar/boz either over time, or for a
                                    specific time, with roughly the
                                    same number of disk seeks, with
                                    varying lengths on the disk scans.
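
The behaviour described above can be sketched with a toy Python model of the foo table. This is only an illustration under simplifying assumptions: `ToyTable`, `write` and `read` are hypothetical names, plain integers stand in for timeuuid values, and nothing here reflects how SSTables are really laid out.

```python
class ToyTable:
    """Toy model of foo with PRIMARY KEY ((bar, boz), baz); hypothetical."""

    def __init__(self):
        # partition key (bar, boz) -> {clustering key baz: {column: value}}
        self.partitions = {}

    def write(self, bar, boz, baz, **data):
        # A write must name the full primary key. Data columns are optional,
        # and a column that is not supplied is simply not stored.
        rows = self.partitions.setdefault((bar, boz), {})
        rows.setdefault(baz, {}).update(data)

    def read(self, bar, boz, start=None, end=None):
        # A read needs only the full partition key. Rows come back in
        # clustering order, optionally narrowed to a range of baz values.
        rows = self.partitions.get((bar, boz), {})
        return [
            (baz, columns)
            for baz, columns in sorted(rows.items())
            if (start is None or baz >= start) and (end is None or baz <= end)
        ]


table = ToyTable()
table.write("a", "b", 3, data1="late")
table.write("a", "b", 1, data1="early", data2="x")
table.write("c", "d", 2)  # no data* columns at all

print(table.read("a", "b"))           # every row of partition (a, b)
print(table.read("a", "b", start=2))  # only rows with baz >= 2
```

Reading with only the partition key returns every CQL row of that partition in clustering order, and a range on baz narrows the scan without touching any other partition.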

                                    Hope that helps!

                                    Joaquin Casares
                                    Consultant
                                    Austin, TX

                                    Apache Cassandra Consulting
                                    http://www.thelastpickle.com

                                    On Fri, Sep 30, 2016 at 11:40 AM,
                                    Carlos Alonso <i...@mrcalonso.com
                                    <mailto:i...@mrcalonso.com>> wrote:

                                        Cassandra is a Wide Column
                                        Store
                                        http://db-engines.com/en/system/Cassandra
                                        <http://db-engines.com/en/system/Cassandra>

                                        Carlos Alonso | Software
                                        Engineer | @calonso
                                        <https://twitter.com/calonso>

                                        On 30 September 2016 at 18:24,
                                        Mehdi Bada
                                        <mehdi.b...@dbi-services.com
                                        <mailto:mehdi.b...@dbi-services.com>>
                                        wrote:

                                            Hi all,

                                            I have a theoretical
                                            question:
                                            - Is Apache Cassandra
                                            really a column store?
                                            Column store means storing
                                            the data as columns rather
                                            than as rows.

                                            In fact C* stores the data
                                            as rows, and the data is
                                            partitioned by row key.

                                            So, for me, Cassandra
                                            is a row-oriented, schema-less
                                            DBMS. Is that true
                                            for you also?

                                            Many thanks in advance for
                                            your reply

                                            Best Regards
                                            Mehdi Bada
                                            ----

                                            *Mehdi Bada* | Consultant
                                            Phone: +41 32 422 96 00
                                            | Mobile: +41 79 928 75 48
                                            | Fax: +41 32 422 96 15

                                            dbi services, Rue de la
                                            Jeunesse 2, CH-2800 Delémont
                                            mehdi.b...@dbi-services.com
                                            <mailto:mehdi.b...@dbi-services.com>

                                            www.dbi-services.com
                                            <http://www.dbi-services.com>





--
Best regards
Selçuk MART
ONLINE KURUM
Hacettepe Üniversitesi Teknokent,
Üniversiteliler Mah. 1596. Sok.
Safir Blokları, E BLOK 802/A,
Beytepe, Çankaya/ANKARA
Tel: +90 (312) 227 000 5
