compression for regular column names?

2011-06-16 Thread E R
Hi all,

As a way of gaining familiarity with Cassandra I am migrating a table
that is currently stored in a relational database and mapping it into
a Cassandra column family. We add about 700,000 new rows a day to this
table, and the average disk space used per row is ~ 300 bytes
including indexes.

The mapping from table to column family is straightforward: there is
a one-to-one relationship between table columns and column family column
names. The relational table has 19 columns. The column names add up to
nearly 200 bytes, whereas the average amount of data per row is only
130 bytes.
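
To put those figures in perspective, here is a rough back-of-the-envelope sketch. It counts only the raw name and value bytes quoted above and assumes every row stores all 19 column names; it ignores per-column timestamps, length prefixes, and index overhead, none of which were measured here.

```python
# Figures taken from the post; "nearly 200 bytes" is rounded up to 200.
ROWS_PER_DAY = 700_000
NAME_BYTES_PER_ROW = 200   # total length of the 19 column names
DATA_BYTES_PER_ROW = 130   # average payload per row


def daily_bytes(per_row_bytes, rows=ROWS_PER_DAY):
    """Raw bytes written per day for a given per-row byte count."""
    return per_row_bytes * rows


# Fraction of each row's raw bytes that is column names rather than data.
name_share = NAME_BYTES_PER_ROW / (NAME_BYTES_PER_ROW + DATA_BYTES_PER_ROW)

print(f"names: {daily_bytes(NAME_BYTES_PER_ROW) / 1e6:.1f} MB/day")
print(f"data:  {daily_bytes(DATA_BYTES_PER_ROW) / 1e6:.1f} MB/day")
print(f"share of raw bytes spent on names: {name_share:.0%}")
```

Under these assumptions the names alone account for roughly three fifths of the raw bytes written.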

Initially I used the identity map for this translation, i.e. my
Cassandra column names were the same as the relational column names. I
then found out I could save a lot of disk space by using single-letter
column names instead of the original relational names, e.g. 'L'
instead of 'LINK_IDENTIFIER'.
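
A minimal sketch of how such a client-side translation might look. Only the LINK_IDENTIFIER to 'L' pair comes from this post; the other column names, and the helper names `shorten` and `expand`, are made up for illustration.

```python
# Long relational name -> short Cassandra column name.
# Only LINK_IDENTIFIER -> L is from the post; the rest are hypothetical.
SHORT_NAMES = {
    "LINK_IDENTIFIER": "L",
    "CREATED_AT": "C",    # hypothetical column
    "STATUS_CODE": "S",   # hypothetical column
}

# Reverse mapping, used when reading rows back.
LONG_NAMES = {short: long for long, short in SHORT_NAMES.items()}

# Two long names sharing a short name would silently overwrite each other.
if len(LONG_NAMES) != len(SHORT_NAMES):
    raise ValueError("short column names must be unique")


def shorten(row):
    """Translate a {long_name: value} dict before writing it."""
    return {SHORT_NAMES[name]: value for name, value in row.items()}


def expand(columns):
    """Translate a {short_name: value} dict after reading it."""
    return {LONG_NAMES[name]: value for name, value in columns.items()}
```

Keeping the table in one place in the client means the short names never leak into the rest of the application; an unknown column name raises a KeyError instead of being written under the wrong key.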

The procedure I use to determine space used is:

1. rm -rf the cassandra var-lib directory
2. start cassandra, create keyspace, column families, etc.
3. insert records
4. stop cassandra
5. re-start cassandra
6. measure disk space with du -s the cassandra var-lib directory

The stop and restart seems to replace the commit logs with .db files.
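
The measurement in step 6 could also be scripted, roughly as below. This is a sketch, not an exact equivalent of `du -s`: it sums apparent file sizes, whereas `du` reports allocated disk blocks, so the two numbers can differ slightly. The function names and the directory path in the usage note are assumptions.

```python
import os
from collections import defaultdict


def dir_size_bytes(root):
    """Sum the apparent sizes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # skip symlinks, count real files only
                total += os.path.getsize(path)
    return total


def size_by_extension(root):
    """Break the total down by file extension, e.g. '.db' vs '.log'."""
    sizes = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                sizes[os.path.splitext(name)[1]] += os.path.getsize(path)
    return dict(sizes)
```

For example, `dir_size_bytes("/var/lib/cassandra")` would give the total for that directory, assuming that is where the data lives on the machine in question. The per-extension breakdown makes it easier to see how much of the total is commit log versus .db files.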

My questions are:

1. Is this a common practice (i.e. making the client responsible for
shortening the column names) when dealing with a large number of fixed
column names and a high volume of inserts? Is there any way that
Cassandra can help out here?

2. Is there another way to transform the commit logs into .db files
without stopping and starting the server?

Thanks,
ER


Re: compression for regular column names?

2011-06-16 Thread Ryan King
On Thu, Jun 16, 2011 at 3:41 PM, E R pc88m...@gmail.com wrote:
 1. Is this a common practice (i.e. making the client responsible for
 shortening the column names) when dealing with a large number of fixed
 column names and a high volume of inserts? Is there any way that
 Cassandra can help out here?

Yes, we're working on a new, compressed format (CASSANDRA-674).

 2. Is there another way to transform the commit logs into .db files
 without stopping and starting the server?

nodetool flush.
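
If the flush needs to happen from a test script rather than by hand, something like the following sketch could wrap the command. It assumes `nodetool` is on the PATH and the node is reachable on localhost; the helper names and the keyspace and column family names in the usage note are hypothetical.

```python
import subprocess


def flush_command(keyspace=None, column_families=(), host="localhost"):
    """Build the nodetool argument list; nothing is executed here."""
    cmd = ["nodetool", "-h", host, "flush"]
    if keyspace:
        cmd.append(keyspace)
        # Column families can only be named together with a keyspace.
        cmd.extend(column_families)
    return cmd


def flush(keyspace=None, column_families=(), host="localhost"):
    """Run nodetool flush and raise CalledProcessError if it fails."""
    subprocess.run(flush_command(keyspace, column_families, host), check=True)
```

Calling `flush("MyKeyspace", ["MyColumnFamily"])` would then take the place of the stop and restart steps before measuring disk usage.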

-ryan