Thanks for your answer. Ok - I think I understand your points and the
worries you have about my architecture.
To give some more background information: we are working on the open
source project Imixs-Workflow <http://www.imixs.org>. This is a
human-centric workflow engine based on Java EE. The engine runs on
JPA/SQL databases, which gives us full transactional support. We also
use Lucene search technology to find records in large amounts of
unstructured business data. Everything runs stable and fast (for
example with PostgreSQL), even when records contain 100MB of attachments.
But we also need a stable archive strategy. Normal backups are not
really an option, because databases grow over the years, so we are
seeking a Bigtable-style solution. Cassandra seems much stronger in
this area than traditional SQL solutions, and it seems easy to set up a
cluster of 3 nodes. It is not easy to build the same with a traditional
SQL database.
Our Cassandra approach is not for live data access. It is for an
asynchronous archive service with the goal of highly consistent,
decentralized storage. This is why I am not worried about performance.
Only in case of a restore or a big-data analysis do we read data from
Cassandra.
I can't change the fact that I have business transactions that contain
files with more than 100MB of data.
Do you really think Cassandra performs worse than PostgreSQL when
writing/reading a 200MB media file? In my first tests I did not see
that. My concern is that some Internet discussions create the
impression that Cassandra is worse than a traditional SQL solution
here. I thought Cassandra was basically a big-data solution?
If Cassandra is not suitable for storing records larger than 100MB, I
wonder if the only alternative would be HBase?
To put it more clearly: it is always a challenge to handle a record
with more than 100MB. But the question is: does Cassandra break in this
kind of situation?
So if we exclude the performance issue for a moment, would you agree
with the solution or advise against it?
Thanks again for your help.
On 10.06.2018 at 17:43, daemeon reiydelle wrote:
I'd like to split your question into two parts.
Part one is around recovery. If you lose a copy of the underlying data
because a node fails, and let's assume you have three copies, how long
can you tolerate the time to restore the third copy?
The second question is about the absolute length of a row. This
question is more about the time to read a row: a single, very long row
can only be read from one node, whereas if the row is split into
multiple shorter rows, then in most cases there is an opportunity to
read them in parallel.
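(A minimal sketch of what I mean by splitting, with made-up table and
column names:)

    -- chunk_no in the partition key spreads the chunks of one large
    -- document across the cluster instead of pinning them to one node
    CREATE TABLE documents_chunked (
        id       text,
        chunk_no int,
        chunk    blob,   -- e.g. roughly 1 MB per chunk
        PRIMARY KEY ((id, chunk_no))
    );

The client then fires one query per chunk in parallel and reassembles
the blob on its side.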
The sizes you're looking at are not in themselves an issue; it's more
about how you want to access and use the data.
I might argue that you might not want to use Cassandra if this is your
only use case for it. I might suggest you look at something like the
ELK stack; whether you end up using Elasticsearch or Cassandra, it
might get you thinking about your architecture for this particular
business case. But of course, if you have multiple use cases, some
storing tables with shorter rows and columns and others like this one,
then overall Cassandra would be an excellent choice.
But as is often the case, and I do hope I'm being helpful in this
response, your overall family of business processes can drive
compromises in one business process to facilitate a single storage
solution and simplified administration.
Daemeon (Dæmœn) Reiydelle
On Sun, Jun 10, 2018, 02:54 Ralph Soika <ralph.so...@imixs.com> wrote:
I have a general question concerning the Cassandra technology. I
have already read two books, but I am more and more confused about
whether Cassandra is the right technology. My goal is to store
business data from a workflow engine in Cassandra. I want to use
Cassandra as a kind of archive service because of its
fault-tolerant and decentralized approach.
But there are two things confusing me. On the one hand, the project
claims that a single column value can be 2 GB (1 MB is recommended).
On the other hand, people explain that a partition should not be
larger than 100MB.
I plan only one single, simple table:

    CREATE TABLE documents (
        created text, id text, data text,
        PRIMARY KEY (created, id)
    );
'created' is the partition key holding the date in ISO format
(YYYY-MM-DD). The 'id' is a clustering key and is unique.
But my 'data' column holds an XML document with business data. This
cell contains a lot of unstructured data and also media data. The data
cell will usually be between 1 and 10 MB, BUT it can also hold more
than 100MB and less than 2GB in some cases.
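(Just to illustrate the intended write path; the id value is made up:)

    INSERT INTO documents (created, id, data)
    VALUES ('2018-06-10', 'workitem-4711', '<document>...</document>');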
Is Cassandra able to handle this kind of table? Or is Cassandra in
the end not recommended for this kind of data?
For example I would like to ask if data for a specific date is
available:

    SELECT created, id FROM documents WHERE created = '2018-06-10';
I select without the data column and just ask if data exists. Is
the performance automatically poor only because the data cell (not
a primary key) of some rows is greater than 100MB? Or does Cassandra
run out of heap space in any case? It is perfectly clear that it
makes no sense to select multiple cells, each containing over 100 MB
of data, in one single query. But this is a fundamental problem and
has nothing to do with Cassandra. My Java application running in
Wildfly would also not be able to handle a result set with multiple
GB of data. But I would expect that I can select a set of keys just
to decide whether to load one single data cell.
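(A sketch of the two-step access I have in mind; the id is made up:)

    -- step 1: key-only probe, the large data cells are not returned
    SELECT created, id FROM documents WHERE created = '2018-06-10';

    -- step 2: fetch exactly one data cell by its full primary key
    SELECT data FROM documents
        WHERE created = '2018-06-10' AND id = 'workitem-4711';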
Cassandra seems like a great system. But many people seem to claim
that it is only suitable for mapping a user status list à la
Facebook? Is this true? Thanks for your comments in advance.
*Imixs Software Solutions GmbH*
*Web:* www.imixs.com <http://www.imixs.com> *Phone:* +49 (0)89-452136 16
*Office:* Agnes-Pockels-Bogen 1, 80992 München
Register court: Amtsgericht Muenchen, HRB 136045
Managing directors: Gaby Heinle and Ralph Soika
*Imixs* is an open source company, read more: www.imixs.org