I'd like to split your question into two parts.

Part one is around recovery. If you lose a copy of the underlying data
because a node fails (let's assume you have three copies), how long can
you tolerate the time it takes to restore the third copy?

The second question is about the absolute length of a row. This is more
about the time to read a row: a single, very long row can only be read from
one node, whereas if the row is split into multiple shorter rows, there is
in most cases an opportunity to read it in parallel.

The sizes you're looking at are not in themselves an issue; it's more about
how you want to access and use the data.
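
One common way to keep the access pattern cheap, sketched here with
illustrative names of my own, is to separate the small, queryable metadata
from the large payload:

    -- Small rows only: listing what exists for a date never touches
    -- the large cells.
    CREATE TABLE documents_by_date (
        created text,
        id      text,
        PRIMARY KEY (created, id)
    );

    -- The payload, fetched one document at a time and only when needed.
    CREATE TABLE document_data (
        id   text PRIMARY KEY,
        data blob
    );

A query like SELECT id FROM documents_by_date WHERE created = '2018-06-10';
then stays cheap no matter how large the payloads grow.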

I might argue that you may not want to use Cassandra if this is your only
use case for it. I might suggest you look at something like the ELK stack;
weighing Elasticsearch against Cassandra may get you thinking about your
architecture for this particular business case. But of course, if you have
multiple use cases, some tables with shorter columns and others with larger
ones, then overall Cassandra would be an excellent choice.

But as is often the case, and I do hope I'm being helpful in this response,
your overall family of business processes can drive compromises in one
business process to facilitate a single storage solution and simplified
administration.


Daemeon (Dæmœn) Reiydelle
USA 1.415.501.0198

On Sun, Jun 10, 2018, 02:54 Ralph Soika <ralph.so...@imixs.com> wrote:

> Hi,
> I have a general question concerning the Cassandra technology. I already
> read 2 books but after all I am more and more confused about the question
> if Cassandra is the right technology. My goal is to store Business Data
> form a workflow engine into Cassandra. I want to use Cassandra as a kind of
> archive service because of its fault tolerant and decentralized approach.
>
> But here are two things that confuse me. On the one hand, the
> project claims that a single column value can be 2 GB (1 MB is
> recommended). On the other hand, people explain that a partition should
> not be larger than 100 MB.
>
> I plan only one single simple table:
>
>     CREATE TABLE documents (
>        created text,
>        id text,
>        data text,
>        PRIMARY KEY (created,id)
>     );
>
> 'created' is the partition key holding the date in ISO format
> (YYYY-MM-DD). The 'id' is a clustering key and is unique.
>
> But my 'data' column holds an XML document with business data. This cell
> contains a lot of unstructured data and also media data. The data cell
> will usually be between 1 and 10 MB, but in some cases it can also hold
> more than 100 MB (and less than 2 GB).
>
> Is Cassandra able to handle this kind of table? Or is Cassandra in the end
> not recommended for this kind of data?
>
> For example, I would like to ask whether data for a specific date is
> available:
>
>     SELECT created, id FROM documents WHERE created = '2018-06-10';
>
> I select without the data column and just ask whether data exists. Is the
> performance automatically poor merely because the data cell (not a primary
> key) of some rows is greater than 100 MB? Or does Cassandra run out of
> heap space in any case? It is perfectly clear that it makes no sense to
> select multiple cells which each contain over 100 MB of data in one single
> query. But this is a fundamental problem and has nothing to do with
> Cassandra. My Java application running in WildFly would also not be able
> to handle a result with multiple GB of data. But I would expect that I can
> select a set of keys just to decide whether to load one single data cell.
>
> Cassandra seems like a great system. But many people seem to claim that it
> is only suitable for mapping a user status list à la Facebook. Is this
> true? Thanks for your comments in advance.
>
>
>
>
> ===
> Ralph
>
>
