Re: data size difference between supercolumn and regular column

Yiming Sun Fri, 06 Apr 2012 13:46:33 -0700

Thanks for the advice, Maki, especially on the ulimit!  Yes, we will play
with the configuration and figure out some optimal sstable size.


-- Y.
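
A rough way to see why the 5MB default matters for file descriptors: the number of sstables on a node is roughly the data size divided by sstable_size_in_mb, and each sstable is made up of several files on disk. The sketch below is only back-of-the-envelope arithmetic in Python. It assumes the 488GB mentioned later in this thread is spread evenly over the 3 nodes. The figure of 4 component files per sstable is an illustrative assumption, not a measured value, and estimate_sstable_files is a hypothetical helper, not part of Cassandra.

```python
import math


def estimate_sstable_files(data_gb, sstable_size_mb, components_per_sstable=4):
    """Return (sstable count, file count) for one node's data.

    components_per_sstable is an assumed number of on-disk files per
    sstable (data, index, filter, ...); adjust it for your version.
    """
    sstables = math.ceil(data_gb * 1024 / sstable_size_mb)
    return sstables, sstables * components_per_sstable


# 488GB over 3 nodes; an even spread is an assumption.
per_node_gb = 488 / 3

# Candidate sizes to try in test iterations (arbitrary example values).
for size_mb in (5, 10, 50, 100):
    sstables, files = estimate_sstable_files(per_node_gb, size_mb)
    print(f"{size_mb:>4} MB sstables -> ~{sstables} sstables, ~{files} files")
```

With these assumptions the 5MB setting gives tens of thousands of sstables per node, which is why the open-file limit and the sstable size are worth testing together.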

On Wed, Apr 4, 2012 at 9:49 AM, Watanabe Maki <watanabe.m...@gmail.com> wrote:

> LeveledCompaction will use less disk space (load), but needs more IO.
> If your traffic is too high for your disks, you will have many pending
> compaction tasks and a large number of sstables waiting to be compacted.
> Also, the default sstable_size_in_mb (5MB) will be too small for a large
> data set. You should run test iterations with different size
> configurations.
> Don't forget to raise the limit on the number of file descriptors, and
> monitor tpstats and iostat.
>
> maki
>
> From iPhone
>
>
> On 2012/04/04, at 22:19, Yiming Sun <yiming....@gmail.com> wrote:
>
> Cool, I will look into this new leveled compaction strategy and give it a
> try.
>
> BTW, Aaron, I think the last word of your message meant to say
> "compression", correct?
>
> -- Y.
>
> On Mon, Apr 2, 2012 at 9:37 PM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> If you have a workload with overwrites you will end up with some data
>> needing compaction. Running a nightly manual compaction would remove this,
>> but it will also soak up some IO so it may not be the best solution.
>>
>> I do not know if Leveled compaction would result in a smaller disk load
>> for the same workload.
>>
>> I agree with other people, turn on compaction.
>>
>> Cheers
>>
>>   -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 3/04/2012, at 9:19 AM, Yiming Sun wrote:
>>
>> Yup Jeremiah, I learned a hard lesson on how cassandra behaves when it
>> runs out of disk space :-S.    I didn't try the compression, but when it
>> ran out of disk space, or near running out, compaction would fail because
>> it needs space to create some tmp data files.
>>
>> I shall get a tattoo that says keep it around 50% -- this is a valuable tip.
>>
>> -- Y.
>>
>> On Sun, Apr 1, 2012 at 11:25 PM, Jeremiah Jordan <jeremiah.jor...@morningstar.com> wrote:
>>
>>>  Is that 80% with compression?  If not, the first thing to do is turn on
>>> compression.  Cassandra doesn't behave well when it runs out of disk space.
>>>  You really want to try and stay around 50%,  60-70% works, but only if it
>>> is spread across multiple column families, and even then you can run into
>>> issues when doing repairs.
>>>
>>>  -Jeremiah
>>>
>>>
>>>
>>>  On Apr 1, 2012, at 9:44 PM, Yiming Sun wrote:
>>>
>>> Thanks Aaron.  Well I guess it is possible the data files from
>>> supercolumns could've been reduced in size after compaction.
>>>
>>>  This brings up yet another question.  Say I am on a shoestring budget and
>>> can only put together a cluster with very limited storage space.  The first
>>> iteration of pushing data into cassandra would drive the disk usage up into
>>> the 80% range.  As time goes by, there will be updates to the data, and
>>> many columns will be overwritten.  If I just push the updates in, the disks
>>> will run out of space on all of the cluster nodes.  What would be the best
>>> way to handle such a situation if I cannot buy larger disks? Do I need
>>> to delete the rows/columns that are going to be updated, do a compaction,
>>> and then insert the updates?  Or is there a better way?  Thanks
>>>
>>>  -- Y.
>>>
>>> On Sat, Mar 31, 2012 at 3:28 AM, aaron morton <aa...@thelastpickle.com> wrote:
>>>
>>>>   does cassandra 1.0 perform some default compression?
>>>>
>>>>  No.
>>>>
>>>>  The on disk size depends to some degree on the work load.
>>>>
>>>>  If there are a lot of overwrites or deletes you may have rows/columns
>>>> that need to be compacted. You may have some big old SSTables that have not
>>>> been compacted for a while.
>>>>
>>>>  There is some overhead involved in the super columns: the super col
>>>> name, length of the name and the number of columns.
>>>>
>>>>  Cheers
>>>>
>>>>     -----------------
>>>> Aaron Morton
>>>> Freelance Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>>  On 29/03/2012, at 9:47 AM, Yiming Sun wrote:
>>>>
>>>> Actually, after I read an article on cassandra 1.0 compression just now
>>>> (http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression),
>>>> I am more puzzled.  In our schema, we didn't specify any compression
>>>> options -- does cassandra 1.0 perform some default compression? or is the
>>>> data reduction purely because of the schema change?  Thanks.
>>>>
>>>>  -- Y.
>>>>
>>>> On Wed, Mar 28, 2012 at 4:40 PM, Yiming Sun <yiming....@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>  We are trying to estimate the amount of storage we need for a
>>>>> production cassandra cluster.  While I was doing the calculation, I noticed
>>>>> a very dramatic difference in terms of storage space used by cassandra data
>>>>> files.
>>>>>
>>>>>  Our previous setup consists of a single-node cassandra 0.8.x with no
>>>>> replication, and the data is stored using supercolumns, and the data files
>>>>> total about 534GB on disk.
>>>>>
>>>>>  A few weeks ago, I put together a cluster consisting of 3 nodes
>>>>> running cassandra 1.0 with replication factor of 2, and the data is
>>>>> flattened out and stored using regular columns.  And the aggregated data
>>>>> file size is only 488GB (would be 244GB if no replication).
>>>>>
>>>>>  This is a very dramatic reduction in terms of storage needs, and is
>>>>> certainly good news in terms of how much storage we need to provision.
>>>>>  However, because of the dramatic reduction, I also would like to make sure
>>>>> it is absolutely correct before submitting it - and also get a sense of why
>>>>> there was such a difference. -- I know cassandra 1.0 does data compression,
>>>>> but does the schema change from supercolumn to regular column also help
>>>>> reduce storage usage?  Thanks.
>>>>>
>>>>>  -- Y.
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
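
The figures quoted in this thread can be put side by side with a little arithmetic. The Python sketch below uses only the numbers from the original post (534GB on the old single node, 488GB total on the new 3-node cluster with replication factor 2) and the "stay around 50%" advice from the replies. The helper names are hypothetical, and an even spread of data across the nodes is an assumption. It does not explain the difference, which may be partly down to uncompacted data on the old node, as suggested above.

```python
def per_replica_gb(total_gb, replication_factor):
    """Size of one copy of the data, given the total across all replicas."""
    return total_gb / replication_factor


def raw_capacity_needed_gb(total_gb, target_utilization):
    """Disk capacity needed so that total_gb fills only target_utilization."""
    return total_gb / target_utilization


old_gb = 534        # single node, 0.8.x, supercolumns, no replication
new_total_gb = 488  # 3 nodes, 1.0, regular columns, replication factor 2
nodes = 3

new_unique_gb = per_replica_gb(new_total_gb, 2)
reduction = 1 - new_unique_gb / old_gb

# Capacity needed to keep disks around 50% full, per the advice above.
cluster_capacity_gb = raw_capacity_needed_gb(new_total_gb, 0.5)
per_node_capacity_gb = cluster_capacity_gb / nodes  # assumes even spread

print(f"one replica: {new_unique_gb:.0f} GB, reduction vs old: {reduction:.0%}")
print(f"capacity for 50% utilization: {cluster_capacity_gb:.0f} GB total, "
      f"{per_node_capacity_gb:.0f} GB per node")
```

A single copy of the flattened data is a little under half the size of the old supercolumn data files, and keeping the current 488GB at about 50% utilization calls for roughly twice that in raw disk across the cluster.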
