Thanks for the advice, Maki, especially on the ulimit! Yes, we will play with the configuration and figure out some optimal sstable size.
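For a rough sense of why the 5MB default and the file descriptor limit matter together, here is a back-of-the-envelope sketch in Python. It assumes the 488GB from the original post is spread evenly over the 3 nodes and that every sstable is exactly sstable_size_in_mb; the function name and the candidate sizes are made up for illustration, not taken from Cassandra. Each sstable is also several files on disk, so the real file count would be higher.

```python
import math


def estimate_sstable_count(data_mb: float, sstable_size_mb: float) -> int:
    """Rough number of fixed-size sstables needed to hold data_mb of data."""
    return math.ceil(data_mb / sstable_size_mb)


# Figures from the thread: 488GB aggregate on a 3-node cluster.
# Assumption: the data is spread evenly across the nodes.
per_node_mb = 488 * 1024 / 3

# Candidate sizes other than 5 are arbitrary test values, not recommendations.
for size_mb in (5, 10, 50, 100):
    count = estimate_sstable_count(per_node_mb, size_mb)
    print(f"sstable_size_in_mb={size_mb}: roughly {count} sstables per node")
```

The counts only show the trend: the smaller the sstable size, the more files each node has to keep open, which is where the ulimit advice comes in.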
-- Y.

On Wed, Apr 4, 2012 at 9:49 AM, Watanabe Maki <watanabe.m...@gmail.com> wrote:

> LeveledCompaction will use less disk space (load), but needs more IO.
> If your traffic is too high for your disk, you will have many pending
> compaction tasks, and a large number of sstables waiting to be compacted.
> Also the default sstable_size_in_mb (5MB) will be too small for a large
> data set. You had better run test iterations with different size
> configurations.
> Don't forget to unlimit the number of file descriptors, and monitor
> tpstats and iostat.
>
> maki
>
> From iPhone
>
>
> On 2012/04/04, at 22:19, Yiming Sun <yiming....@gmail.com> wrote:
>
> Cool, I will look into this new leveled compaction strategy and give it
> a try.
>
> BTW, Aaron, I think the last word of your message was meant to say
> "compression", correct?
>
> -- Y.
>
> On Mon, Apr 2, 2012 at 9:37 PM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> If you have a workload with overwrites you will end up with some data
>> needing compaction. Running a nightly manual compaction would remove
>> this, but it will also soak up some IO so it may not be the best
>> solution.
>>
>> I do not know if Leveled compaction would result in a smaller disk load
>> for the same workload.
>>
>> I agree with other people, turn on compaction.
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 3/04/2012, at 9:19 AM, Yiming Sun wrote:
>>
>> Yup Jeremiah, I learned a hard lesson on how cassandra behaves when it
>> runs out of disk space :-S. I didn't try the compression, but when it
>> ran out of disk space, or came near to running out, compaction would
>> fail because it needs space to create some tmp data files.
>>
>> I shall get a tattoo that says keep it around 50% -- this is a valuable
>> tip.
>>
>> -- Y.
>>
>> On Sun, Apr 1, 2012 at 11:25 PM, Jeremiah Jordan <
>> jeremiah.jor...@morningstar.com> wrote:
>>
>>> Is that 80% with compression? If not, the first thing to do is turn on
>>> compression. Cassandra doesn't behave well when it runs out of disk
>>> space. You really want to try and stay around 50%. 60-70% works, but
>>> only if it is spread across multiple column families, and even then
>>> you can run into issues when doing repairs.
>>>
>>> -Jeremiah
>>>
>>> On Apr 1, 2012, at 9:44 PM, Yiming Sun wrote:
>>>
>>> Thanks Aaron. Well I guess it is possible the data files from
>>> supercolumns could've been reduced in size after compaction.
>>>
>>> This brings up yet another question. Say I am on a shoestring budget
>>> and can only put together a cluster with very limited storage space.
>>> The first iteration of pushing data into cassandra would drive the
>>> disk usage up into the 80% range. As time goes by, there will be
>>> updates to the data, and many columns will be overwritten. If I just
>>> push the updates in, the disks will run out of space on all of the
>>> cluster nodes. What would be the best way to handle such a situation
>>> if I cannot buy larger disks? Do I need to delete the rows/columns
>>> that are going to be updated, do a compaction, and then insert the
>>> updates? Or is there a better way? Thanks
>>>
>>> -- Y.
>>>
>>> On Sat, Mar 31, 2012 at 3:28 AM, aaron morton
>>> <aa...@thelastpickle.com> wrote:
>>>
>>>> does cassandra 1.0 perform some default compression?
>>>>
>>>> No.
>>>>
>>>> The on-disk size depends to some degree on the workload.
>>>>
>>>> If there are a lot of overwrites or deletes you may have rows/columns
>>>> that need to be compacted. You may have some big old SSTables that
>>>> have not been compacted for a while.
>>>>
>>>> There is some overhead involved in the super columns: the super col
>>>> name, length of the name and the number of columns.
>>>>
>>>> Cheers
>>>>
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 29/03/2012, at 9:47 AM, Yiming Sun wrote:
>>>>
>>>> Actually, after I read an article on cassandra 1.0 compression just
>>>> now
>>>> (http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression),
>>>> I am more puzzled. In our schema, we didn't specify any compression
>>>> options -- does cassandra 1.0 perform some default compression? Or is
>>>> the data reduction purely because of the schema change? Thanks.
>>>>
>>>> -- Y.
>>>>
>>>> On Wed, Mar 28, 2012 at 4:40 PM, Yiming Sun <yiming....@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are trying to estimate the amount of storage we need for a
>>>>> production cassandra cluster. While I was doing the calculation, I
>>>>> noticed a very dramatic difference in terms of storage space used by
>>>>> cassandra data files.
>>>>>
>>>>> Our previous setup consists of a single-node cassandra 0.8.x with no
>>>>> replication, and the data is stored using supercolumns, and the data
>>>>> files total about 534GB on disk.
>>>>>
>>>>> A few weeks ago, I put together a cluster consisting of 3 nodes
>>>>> running cassandra 1.0 with a replication factor of 2, and the data
>>>>> is flattened out and stored using regular columns. The aggregated
>>>>> data file size is only 488GB (would be 244GB if no replication).
>>>>>
>>>>> This is a very dramatic reduction in terms of storage needs, and is
>>>>> certainly good news in terms of how much storage we need to
>>>>> provision. However, because of the dramatic reduction, I also would
>>>>> like to make sure it is absolutely correct before submitting it, and
>>>>> also get a sense of why there was such a difference. -- I know
>>>>> cassandra 1.0 does data compression, but does the schema change from
>>>>> supercolumn to regular column also help reduce storage usage? Thanks.
>>>>>
>>>>> -- Y.