Hi all,

As a way of gaining familiarity with Cassandra I am migrating a table
that is currently stored in a relational database and mapping it into
a Cassandra column family. We add about 700,000 new rows a day to this
table, and the average disk space used per row is ~ 300 bytes
including indexes.

The mapping from table to column family is straightforward: there is
a one-to-one relationship between table columns and column family column
names. The relational table has 19 columns. The combined length of the
column names is nearly 200 bytes, whereas the average amount of data per
row is only 130 bytes.
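
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It uses only the figures above, treats "nearly 200" as exactly 200, assumes one byte per single-letter name, and ignores whatever else Cassandra stores per column (timestamps, length fields, indexes), so the results are a lower bound on the real footprint rather than a measurement.

```python
# Figures from this post; "nearly 200" is rounded to 200 (an assumption).
ROWS_PER_DAY = 700_000
NAME_BYTES_PER_ROW = 200   # combined length of the 19 column names
DATA_BYTES_PER_ROW = 130   # average column data per row
NUM_COLUMNS = 19


def daily_bytes(name_bytes_per_row: int) -> int:
    """Raw bytes per day for names plus data, ignoring all storage overhead."""
    return ROWS_PER_DAY * (name_bytes_per_row + DATA_BYTES_PER_ROW)


full_names = daily_bytes(NAME_BYTES_PER_ROW)
short_names = daily_bytes(NUM_COLUMNS)  # one byte per single-letter name
name_share = NAME_BYTES_PER_ROW / (NAME_BYTES_PER_ROW + DATA_BYTES_PER_ROW)

print(f"full names:          {full_names / 1e6:.1f} MB/day")
print(f"single-letter names: {short_names / 1e6:.1f} MB/day")
print(f"share of bytes spent on names (full names): {name_share:.0%}")
```

Under these assumptions the names, not the data, account for most of the bytes written per row.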

Initially I used the identity map for this translation, i.e. my
Cassandra column names were the same as the relational column names. I
then found out I could save a lot of disk space by using single-letter
column names instead of the original relational names, e.g. 'L'
instead of 'LINK_IDENTIFIER'.
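
A minimal sketch of what this client-side translation could look like, in Python. Only the LINK_IDENTIFIER to 'L' pair comes from my actual schema; the other column names and the helper names `shorten` and `expand` are made up for illustration, and no Cassandra client call is shown.

```python
# Long relational name -> single-letter Cassandra column name.
# Only LINK_IDENTIFIER -> 'L' is real; the other entries are hypothetical.
SHORT_NAMES = {
    "LINK_IDENTIFIER": "L",
    "CREATED_AT": "C",     # hypothetical column
    "STATUS_CODE": "S",    # hypothetical column
}

# Reverse lookup for reads; fail early if two long names share a letter.
LONG_NAMES = {short: long for long, short in SHORT_NAMES.items()}
assert len(LONG_NAMES) == len(SHORT_NAMES), "short names must be unique"


def shorten(row: dict) -> dict:
    """Re-key a row from relational names to short names before inserting."""
    return {SHORT_NAMES[name]: value for name, value in row.items()}


def expand(row: dict) -> dict:
    """Re-key a row read back from Cassandra to the relational names."""
    return {LONG_NAMES[name]: value for name, value in row.items()}


row = {"LINK_IDENTIFIER": 42, "STATUS_CODE": "OK"}
stored = shorten(row)          # what the client would actually write
assert expand(stored) == row   # the translation round-trips
```

The obvious drawback is that every client reading the column family needs the same mapping, which is part of why I am asking whether this is common practice.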

The procedure I use to determine space used is:

1. rm -rf the cassandra var-lib directory
2. start cassandra, create keyspace, column families, etc.
3. insert records
4. stop cassandra
5. re-start cassandra
6. measure disk space by running du -s on the cassandra var-lib directory

The stop and restart (steps 4 and 5) seem to replace the commit logs
with .db files.
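
For anyone who wants to repeat the measurement in step 6 from a script, a rough Python equivalent is sketched below. It sums apparent file sizes, whereas du -s reports allocated disk blocks, so the two numbers will differ somewhat. The function name `dir_size_bytes` and the path in the usage comment are assumptions; substitute your own data directory.

```python
import os


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


# Example usage (the path is an assumption about where the data lives):
# print(dir_size_bytes("/var/lib/cassandra"))
```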

My questions are:

1. Is this a common practice (i.e. making the client responsible for
shortening the column names) when dealing with a large number of fixed
column names and a high volume of inserts? Is there any way that
Cassandra can help out here?

2. Is there another way to transform the commit logs into .db files
without stopping and starting the server?

Thanks,
ER
