Re: Size of a single Data Row?

Ralph Soika Sun, 10 Jun 2018 23:50:28 -0700

Hi Jeff,

thanks for that answer. I understand the problem much better now. As you explain, the problem also exists in the VM and therefore also in the 'other' part of my application, which is running on JavaEE/JPA. In the end, the 100MB byte arrays also cause a heap space problem there. So Cassandra is not the core problem in my consideration.

Your solution of splitting up the blob into chunks is good, but I do not need it because the same problem indeed exists on the Wildfly application server side. It was my mistake to say that I have no heap size problem with files over 100MB.


So I will simply create two tables:

CREATE TABLE documents_meta (
    created text,
    document_id text,
    hash text,
    PRIMARY KEY (created, document_id));

CREATE TABLE documents_data (
    document_id text,
    data blob,
    PRIMARY KEY (document_id));

The table 'documents_meta' is used to verify the data consistency of files stored in the JavaEE part. As I explained, Cassandra plays the role of a highly available backup cluster.
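
As a rough illustration of that consistency check, here is a minimal Python sketch. It is not part of the actual application: the function names, the choice of SHA-256, and the plain dict standing in for the 'documents_meta' table (keyed by created and document_id) are all assumptions. The file is hashed in small blocks, so a 100MB document never has to sit in memory as one byte array.

```python
import hashlib
import io


def stream_hash(stream, block_size=64 * 1024):
    """Return the SHA-256 hex digest of a binary stream, read block by block."""
    digest = hashlib.sha256()
    # iter() with a sentinel keeps calling read() until it returns b"" (end of stream)
    for block in iter(lambda: stream.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()


def is_consistent(created, document_id, stream, documents_meta):
    """Compare a stream's hash with the hash recorded for (created, document_id).

    documents_meta is a dict standing in for the CQL table (an assumption).
    A document with no recorded hash is reported as inconsistent.
    """
    expected = documents_meta.get((created, document_id))
    return expected is not None and stream_hash(stream) == expected


if __name__ == "__main__":
    payload = b"example business data"
    meta = {("2018-06-10", "doc-1"): stream_hash(io.BytesIO(payload))}
    print(is_consistent("2018-06-10", "doc-1", io.BytesIO(payload), meta))
```

In a real setup the dict lookup would be a SELECT of the hash column for the given partition and clustering key.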

What I was not aware of is the "problem" with the partition size. Can you give me a link where I can read about the CQL partition issue? In the book "Cassandra: The Definitive Guide" I did not find this.


best regards

===
Ralph


On 10.06.2018 at 19:04, Jeff Jirsa wrote:

Let's talk about what the real limitations are. There are two here that you should care about:

1) Cassandra runs in the JVM. When you read and write to Cassandra, those objects end up in the heap as byte arrays. If you're regularly reading and writing 100MB byte arrays, it's easy to see situations where you'll have some latency pains, especially if you have a lot of concurrent requests.

2) On the read path, we build up an index of CQL rows within a CQL partition. You've been reading books, I suspect you know the difference (if not, ask, and I'll re-explain). In all versions of Cassandra released so far, the cost of that index scales with the width of the partition and is paid ON READ (not on write like other databases). If you have a very wide CQL partition and you query it quickly, you will create JVM GC pressure. It sounds like this is a secondary concern here.

That doesn't mean it's not a good fit. There are workarounds to both of these issues.



For example:

- On the write path, running with offheap memtables will get the cell value into direct memory for the period of time between when it's written in the commitlog and when it's flushed to disk. This is likely important for you.
- Instead of writing the 100MB document in a single cell, chunk it into 1MB chunks:


CREATE TABLE documents (
document_id text,
chunk_order int,
chunk_id text,
PRIMARY KEY (document_id, chunk_order))

CREATE TABLE chunks (
chunk_id text,
chunk blob,
PRIMARY KEY(chunk_id))

Then when you go to write the document, you break it into 1MB blobs and take the hash of each blob (md5, sha1, sha256, whatever suits your needs based on pain of collisions) to use as its chunk_id. You write the chunk into the chunks table, and the chunk_id into the documents table for the document (in the right order).
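
The write path described here could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: two plain dicts stand in for the 'documents' and 'chunks' tables, SHA-256 is picked as the hash, and the function names are made up for the example. A real client would issue INSERT statements through a Cassandra driver instead of writing to dicts.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1MB chunks, as suggested in the text


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Yield (chunk_order, chunk_id, chunk) for each slice of the document."""
    for order, start in enumerate(range(0, len(data), chunk_size)):
        chunk = data[start:start + chunk_size]
        # The content hash doubles as the chunk's primary key
        chunk_id = hashlib.sha256(chunk).hexdigest()
        yield order, chunk_id, chunk


def write_document(document_id, data, documents, chunks, chunk_size=CHUNK_SIZE):
    """Store a document as ordered chunk references plus the chunk blobs.

    documents: dict standing in for the 'documents' table,
               keyed by (document_id, chunk_order) -> chunk_id
    chunks:    dict standing in for the 'chunks' table,
               keyed by chunk_id -> chunk
    """
    for order, chunk_id, chunk in split_into_chunks(data, chunk_size):
        chunks[chunk_id] = chunk  # identical chunks land on the same key
        documents[(document_id, order)] = chunk_id


if __name__ == "__main__":
    documents, chunks = {}, {}
    write_document("doc-1", b"a" * 25, documents, chunks, chunk_size=10)
    print(len(documents), "chunk references,", len(chunks), "stored chunks")
```

Because the chunk_id is derived from the content, writing the same chunk twice simply overwrites the same row with the same value.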


This does a few things:

1) You can reassemble the document chunk by chunk by querying it in pieces. Each piece is small enough not to overwhelm the garbage collector (and you control that with paging).

2) The only partition here that can get large is document_id, and it'd be incredibly unlikely that you'll get 100MB per partition here based on your description, so you don't have to worry about the index pain on the read path.

3) You naturally dedup chunks, which you didn't ask for, but may care about.
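
To make the reassembly step concrete, here is a small Python sketch. It is an illustration only: dicts again stand in for the 'documents' and 'chunks' tables, SHA-256 is assumed as the chunk hash, and read_document is a hypothetical helper name. Chunks are yielded one at a time, so the caller can stream them to a file without ever holding the whole document in memory.

```python
import hashlib


def read_document(document_id, documents, chunks):
    """Yield a document's chunks in chunk_order, verifying each chunk's hash.

    documents: dict keyed by (document_id, chunk_order) -> chunk_id
    chunks:    dict keyed by chunk_id -> chunk bytes
    Both dicts are stand-ins for the CQL tables (an assumption).
    """
    # Equivalent of selecting one partition, ordered by the clustering column
    rows = sorted(
        (order, chunk_id)
        for (doc_id, order), chunk_id in documents.items()
        if doc_id == document_id
    )
    for order, chunk_id in rows:
        chunk = chunks[chunk_id]
        # The chunk_id is the content hash, so corruption is detectable on read
        if hashlib.sha256(chunk).hexdigest() != chunk_id:
            raise ValueError(f"chunk {order} of {document_id} is corrupt")
        yield chunk


if __name__ == "__main__":
    parts = [b"hello ", b"world"]
    chunks = {hashlib.sha256(p).hexdigest(): p for p in parts}
    documents = {
        ("doc-1", i): hashlib.sha256(p).hexdigest() for i, p in enumerate(parts)
    }
    print(b"".join(read_document("doc-1", documents, chunks)))
```

With a real driver, the sorted lookup would be a paged query on the 'documents' partition followed by one single-row read from 'chunks' per chunk_id.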


Hope that helps,
- Jeff

On Sun, Jun 10, 2018 at 9:35 AM, Ralph Soika <ralph.so...@imixs.com> wrote:


    Thanks for your answer. Ok - I think I understand your points and
    the worries you have about my architecture.
    To give more inside information: We are working on the Open Source
    Project Imixs-Workflow <http://www.imixs.org>. This is a
    human-centric workflow engine based on Java EE. The engine runs on
    JPA/SQL Databases. This is to have full transactional support. We
    also use Lucene Search technology to find records in a very
    unstructured amount of business data. Everything runs stable and
    fast (for example with PostgreSQL), even if we have records
    containing 100MB of attachments.

    But we need also a stable archive strategy. Normal Backups are not
    really an option because of the fact that databases grow over the
    years and so we are seeking a Big Table solution. Cassandra seems
    much stronger in this area than traditional SQL solutions. And it
    seems to be easy to setup a cluster of 3 nodes. It is not easy to
    build the same with Hadoop.

    Our Cassandra approach is not for live data access. It is for an
    asynchronous archive service with the goal of a highly consistent,
    decentralized storage. And this is why I am not worried about
    performance. Only in case of a restore or a big-data analysis are
    we reading data from Cassandra.

    I can't change the fact that I have business transactions that
    contain files with more than 100MB of data.
    Do you really think Cassandra performs worse at writing/reading a
    200MB media file than PostgreSQL? In my first test I did not see
    that. My concern is that some Internet discussions give the
    impression that Cassandra is worse than a traditional SQL
    solution. I thought Cassandra is basically a big-data solution??
    If Cassandra is not suitable to store records larger than 100MB, I
    ask if the only alternative would be HBase?

    To put it more clearly: it's always a challenge to handle a record
    with more than 100MB. But the question is: Does Cassandra break in
    this kind of task?

    So if we exclude the performance issue for a moment, would you
    agree to the solution or advise against it?

    Thanks again for your help


    ===
    Ralph



    On 10.06.2018 at 17:43, daemeon reiydelle wrote:

    I'd like to split your question into two parts.

    Part one is around recovery. If you lose a copy of the underlying
    data because a node fails, and let's assume you have three copies,
    how long can you tolerate the time to restore the third copy?

    The second question is about the absolute length of a row. This
    question is more about the time to read a row. A single super
    long row can only be read from one node. If the row is split
    into multiple shorter rows, then in most cases there is an
    opportunity to read it in parallel.

    The sizes you're looking at are not in themselves an issue, it's
    more how you want to access and use the data.

    I might argue that you might not want to use Cassandra, if this
    is your only use case for Cassandra. I might suggest you look at
    something like ELK. Whether or not you use Elasticsearch or
    Cassandra, it might get you thinking about your architecture to
    meet this particular business case. But of course if you have
    multiple use cases, some storing tables with shorter columns and
    others like this one, then overall Cassandra would be an
    excellent choice.

    But as is often the case, and I do hope I'm being helpful in this
    response, your overall family of business processes can drive
    compromises in one business process to facilitate a single
    storage solution and simplified Administration


    Daemeon (Dæmœn) Reiydelle
    USA 1.415.501.0198

    On Sun, Jun 10, 2018, 02:54 Ralph Soika <ralph.so...@imixs.com> wrote:

        Hi,
        I have a general question concerning the Cassandra
        technology. I already read 2 books but after all I am more
        and more confused about the question if Cassandra is the
        right technology. My goal is to store business data from a
        workflow engine in Cassandra. I want to use Cassandra as a
        kind of archive service because of its fault-tolerant and
        decentralized approach.

        But here are two things which are confusing me. On the one
        hand the project claims that a single column value can be 2
        GB (1 MB is recommended). On the other hand people explain
        that a partition should not be larger than 100MB.

        I plan only one single simple table:

            CREATE TABLE documents (
               created text,
               id text,
               data text,
               PRIMARY KEY (created,id)
            );

        'created' is the partition key holding the date in ISO format
        (YYYY-MM-DD). The 'id' is a clustering key and is unique.

        But my 'data' column holds an XML document with business data.
        This cell contains a lot of unstructured data and also media
        data. The data cell will be between 1 and 10 MB. BUT it can
        also hold more than 100MB and less than 2GB in some cases.

        Is Cassandra able to handle this kind of table? Or is
        Cassandra at the end not recommended for this kind of data?

        For example I would like to ask if data for a specific date
        is available :

            SELECT created, id FROM documents WHERE created = '2018-06-10';

        I select without the data column and just ask if data exists.
        Is the performance automatically poor only because the data
        cell (not part of the primary key) of some rows is greater than
        100MB? Or is Cassandra running out of heap space in any case? It
        is perfectly clear that it makes no sense to select multiple
        cells which each contain over 100 MB of data in one single
        query. But this is a fundamental problem and has nothing to
        do with Cassandra. My Java application running in Wildfly
        would also not be able to handle a data result with multiple
        GB of data. But I would expect that I can select a set of
        keys just to decide whether to load one single data cell.

        Cassandra seems like a great system. But many people seem to
        claim that it is only suitable for mapping a user status list
        à la Facebook? Is this true? Thanks for your comments in advance.




        ===
        Ralph
