Hi Ralph,

Yes, partitions of 100MB will seriously hurt your performance. That said, the people usually bitten by this are those handling large numbers of transactions and aiming for low latency. My understanding is that 2GB is the hard maximum for a single column value: beyond that the system would start to fail, but well before that you are going to see a significant performance hit (for most use cases).
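If the files really must live in Cassandra, one common workaround is to split each file into fixed-size chunks so that no single cell holds the whole blob. A rough sketch (table and column names here are just illustrative, not from your schema):

CREATE TABLE document_chunks (
    doc_id   text,
    chunk_no int,
    chunk    blob,
    PRIMARY KEY (doc_id, chunk_no)
);

Writing, say, 1MB chunks keeps every individual value well under the recommended size, and reading a file back is a single partition scan ordered by chunk_no. Note the partition as a whole still grows with the file, though, so for files in the 100MB-2GB range you would also want to fold some kind of chunk-group number into the partition key.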
I think an important question for you is: are you going to be reading these files from Cassandra regularly? This sounds like something S3 or Hadoop might be more appropriate for. The other option, if your XML files have some regular structure, is to extract the data from them and store that instead. One final point: I'm pretty sure a TEXT column isn't the right place for a 10MB file, let alone a 1GB one.

Regards,
Eevee.

> On 10 Jun 2018, at 7:54 pm, Ralph Soika <ralph.so...@imixs.com> wrote:
>
> Hi,
> I have a general question concerning the Cassandra technology. I have already
> read two books, but I am more and more confused about whether Cassandra is the
> right technology. My goal is to store business data from a workflow engine in
> Cassandra. I want to use Cassandra as a kind of archive service because of its
> fault-tolerant and decentralized approach.
>
> But there are two things which are confusing me. On the one hand the project
> claims that a single column value can be 2 GB (1 MB is recommended). On the
> other hand people explain that a partition should not be larger than 100 MB.
>
> I plan only one single simple table:
>
> CREATE TABLE documents (
>     created text,
>     id text,
>     data text,
>     PRIMARY KEY (created, id)
> );
>
> 'created' is the partition key holding the date in ISO format (YYYY-MM-DD).
> The 'id' is a clustering key and is unique.
>
> But my 'data' column holds an XML document with business data. This cell
> contains a lot of unstructured data and also media data. The data cell will
> usually be between 1 and 10 MB, BUT in some cases it can also hold more than
> 100 MB and up to 2 GB.
>
> Is Cassandra able to handle this kind of table? Or is Cassandra in the end
> not recommended for this kind of data?
>
> For example, I would like to ask whether data for a specific date is
> available:
>
> SELECT created, id FROM documents WHERE created = '2018-06-10';
>
> I select without the data column and just ask whether data exists.
> Is the performance automatically poor only because the data cell (not a
> primary key) of some rows is greater than 100 MB? Or does Cassandra run out
> of heap space in any case? It is perfectly clear that it makes no sense to
> select multiple cells which each contain over 100 MB of data in one single
> query. But that is a fundamental problem and has nothing to do with
> Cassandra. My Java application running in WildFly would also not be able to
> handle a result set with multiple GB of data. But I would expect that I can
> select a set of keys just to decide whether to load one single data cell.
>
> Cassandra seems like a great system. But many people seem to claim that it
> is only suitable for mapping a user status list à la Facebook. Is this true?
> Thanks for your comments in advance.
>
> ===
> Ralph
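P.S. To make the S3/Hadoop idea concrete: a rough sketch (table and column names are just illustrative) of keeping only metadata in Cassandra plus a pointer to wherever the bytes actually live:

CREATE TABLE documents_meta (
    created  text,    -- ISO date (YYYY-MM-DD), as in your schema
    id       text,
    size     bigint,
    location text,    -- e.g. an S3 object key or HDFS path to the XML file
    PRIMARY KEY (created, id)
);

SELECT created, id, location FROM documents_meta WHERE created = '2018-06-10';

Your existence check stays a cheap single-partition query, and the 100MB+ payloads never touch Cassandra at all.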