Re: questions related to the SSTable file

Takenori Sato(Cloudian) Mon, 16 Sep 2013 19:13:42 -0700

Hi,

> 1) I will expect same row key could show up in both sstable2json output, as this one row exists in both SSTable files, right?


Yes.

> 2) If so, what is the boundary? Will Cassandra guarantee the column level as the boundary? What I mean is that for one column's data, it will be guaranteed to be either in the first file, or 2nd file, right? There is no chance that Cassandra will cut the data of one column into 2 part, and one part stored in first SSTable file, and the other part stored in second SSTable file. Is my understanding correct?

No.

> 3) If what we are talking about are only the SSTable files in snapshot, incremental backup SSTable files, exclude the runtime SSTable files, will anything change? For snapshot or incremental backup SSTable files, first can one row data still may exist in more than one SSTable file? And any boundary change in this case?
> 4) If I want to use incremental backup SSTable files as the way to catch data being changed, is it a good way to do what I try to achieve? In this case, what happen in the following example:

I don't fully understand, but a snapshot will do. It will create hard links to all the SSTable files present at the time of the snapshot.
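To illustrate what "hard links to all the SSTable files" means, here is a minimal sketch in Python. It is not Cassandra's code: the `snapshot` function, the directory names, and the placeholder file contents are all made up for the example. It only shows that a hard link is a second name for the same file on disk, so a snapshot takes almost no extra space and the data survives even after compaction removes the original file name.

```python
import os
import tempfile


def snapshot(data_dir, snapshot_dir):
    """Hard-link every *-Data.db file in data_dir into snapshot_dir.

    Hypothetical helper for illustration only; not Cassandra's implementation.
    """
    os.makedirs(snapshot_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(data_dir)):
        src = os.path.join(data_dir, name)
        if os.path.isfile(src) and name.endswith("-Data.db"):
            # A hard link adds a second directory entry for the same file;
            # no data is copied.
            os.link(src, os.path.join(snapshot_dir, name))
            linked.append(name)
    return linked


with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, "Color")
    os.makedirs(data_dir)
    for n in (1, 2):
        with open(os.path.join(data_dir, f"Color-{n}-Data.db"), "w") as f:
            f.write("placeholder")

    snap_dir = os.path.join(data_dir, "snapshots", "demo")
    linked = snapshot(data_dir, snap_dir)

    # Both paths refer to the same underlying file.
    same_file = os.path.samefile(
        os.path.join(data_dir, linked[0]), os.path.join(snap_dir, linked[0])
    )

    # Removing the original name (as compaction eventually does) leaves the
    # snapshot copy readable.
    os.remove(os.path.join(data_dir, linked[0]))
    still_there = os.path.exists(os.path.join(snap_dir, linked[0]))
```
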



Let me explain how SSTables and compaction work.

Suppose we have 4 files being compacted (the last one has just been flushed, which then triggered compaction). Note that file names are simplified.


- Color-1-Data.db: [{Lavender: {hex: #E6E6FA}}, {Blue: {hex: #0000FF}}]
- Color-2-Data.db: [{Green: {hex: #008000}}, {Blue: {hex2: #2c86ff}}]
- Color-3-Data.db: [{Aqua: {hex: #00FFFF}}, {Green: {hex2: #32CD32}}, {Blue: {}}]
- Color-4-Data.db: [{Magenta: {hex: #FF00FF}}, {Gold: {hex: #FFD700}}]

They are created by the following operations.

- Add a row of (key, column, column_value = Blue, hex, #0000FF)
- Add a row of (key, column, column_value = Lavender, hex, #E6E6FA)
---- memtable is flushed => Color-1-Data.db ----
- Add a row of (key, column, column_value = Green, hex, #008000)
- Add a column of (key, column, column_value = Blue, hex2, #2c86ff)
---- memtable is flushed => Color-2-Data.db ----
- Add a column of (key, column, column_value = Green, hex2, #32CD32)
- Add a row of (key, column, column_value = Aqua, hex, #00FFFF)
- Delete a row of (key = Blue)
---- memtable is flushed => Color-3-Data.db ----
- Add a row of (key, column, column_value = Magenta, hex, #FF00FF)
- Add a row of (key, column, column_value = Gold, hex, #FFD700)
---- memtable is flushed => Color-4-Data.db ----

Then, a compaction will merge all those fragments together into the latest one as follows.

- Color-5-Data.db: [{Lavender: {hex: #E6E6FA}}, {Aqua: {hex: #00FFFF}}, {Green: {hex: #008000, hex2: #32CD32}}, {Magenta: {hex: #FF00FF}}, {Gold: {hex: #FFD700}}]

* assuming RandomPartitioner is used
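The merge above can be sketched in a few lines of Python. This is a toy model, not how Cassandra actually implements compaction: each SSTable is a plain dict, the files are applied oldest first instead of comparing per-column timestamps, the `ROW_DELETED` marker and `compact` function are names invented for the example, and the row tombstone is dropped right away (a real cluster keeps tombstones for a grace period). Row ordering by partitioner token is also ignored.

```python
# Marker standing in for the row tombstone written as {Blue: {}} above.
ROW_DELETED = object()

# The four flushed files, oldest first (toy representation).
sstables = [
    {"Lavender": {"hex": "#E6E6FA"}, "Blue": {"hex": "#0000FF"}},
    {"Green": {"hex": "#008000"}, "Blue": {"hex2": "#2c86ff"}},
    {"Aqua": {"hex": "#00FFFF"}, "Green": {"hex2": "#32CD32"}, "Blue": ROW_DELETED},
    {"Magenta": {"hex": "#FF00FF"}, "Gold": {"hex": "#FFD700"}},
]


def compact(tables):
    """Merge row fragments from several tables into one, newest write wins."""
    merged = {}
    for table in tables:  # oldest first, so later tables override earlier ones
        for key, columns in table.items():
            if columns is ROW_DELETED:
                # A row delete shadows every older fragment of that row.
                merged[key] = ROW_DELETED
            elif merged.get(key, ROW_DELETED) is ROW_DELETED:
                # First fragment seen for this row (or first write after a delete).
                merged[key] = dict(columns)
            else:
                # Another fragment of a row seen before: merge column by column.
                merged[key].update(columns)
    # Drop deleted rows from the output (simplification, see the note above).
    return {k: v for k, v in merged.items() if v is not ROW_DELETED}


compacted = compact(sstables)
for key, columns in compacted.items():
    print(key, columns)
```

Note how Green ends up with both `hex` and `hex2` even though the two columns were flushed into different files, while Blue disappears because the delete in the third file is newer than both of its fragments.
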

Hope this helps.

- Takenori

(2013/09/17 10:51), java8964 java8964 wrote:

Hi, I have some questions related to the SSTable in Cassandra, as I am doing a project to use it and hope someone in this list can share some thoughts.
My understanding is that the SSTable is per column family. But each column family could have multiple SSTable files. During the runtime, one row COULD be split into more than one SSTable file; even though this is not good for performance, it does happen, and Cassandra will try to merge and store one row's data into one SSTable file during compaction.
The question is, when one row is split across multiple SSTable files, what is the boundary? Or let me ask this way: if one row exists in 2 SSTable files, and I run the sstable2json tool on both SSTable files individually:
1) I will expect same row key could show up in both sstable2json output, as this one row exists in both SSTable files, right?
2) If so, what is the boundary? Will Cassandra guarantee the column level as the boundary? What I mean is that for one column's data, it will be guaranteed to be either in the first file, or 2nd file, right? There is no chance that Cassandra will cut the data of one column into 2 part, and one part stored in first SSTable file, and the other part stored in second SSTable file. Is my understanding correct?
3) If what we are talking about are only the SSTable files in snapshot, incremental backup SSTable files, exclude the runtime SSTable files, will anything change? For snapshot or incremental backup SSTable files, first can one row data still may exist in more than one SSTable file? And any boundary change in this case?
4) If I want to use incremental backup SSTable files as the way to catch data being changed, is it a good way to do what I try to achieve? In this case, what happen in the following example:
For column family A:
at Time 0, one row key (key1) has some data. It is stored and backed up in SSTable file 1.
at Time 1, if any column for key1 has any change (a new column inserted, a column updated/deleted, or even the whole row deleted), I would expect this whole row to exist in any incremental backup SSTable files after Time 1, right?
What happens if the above row just happens to be stored in more than one SSTable file?
at Time 0, one row key (key1) has some data, and it just happens to be stored in SSTable file1 and file2, and backed up.
at Time 1, if one column is added to row key1, and the change in fact happens in SSTable file2 only in this case, and if we do an incremental backup after that, which SSTable files should I expect in this backup? Both SSTable files? Or just SSTable file 2?
I was thinking incremental backup SSTable files are a good candidate for catching data being changed, but the fact that one row's data could exist in multiple SSTable files makes things complex now. Does anyone have any experience using SSTable files in this way? What are the lessons?
Thanks

Yong
