http://wiki.apache.org/cassandra/LargeDataSetConsiderations
On Thu, Dec 6, 2012 at 9:53 AM, Poziombka, Wade L <wade.l.poziom...@intel.com> wrote:
> “Having so much data on each node is a potential bad day.”
>
> Is this discussed somewhere in the Cassandra documentation (limits,
> practices etc.)? We are also trying to load up quite a lot of data and have
> hit memory issues (bloom filter etc.) in 1.0.10. I would like to read up
> on big data usage of Cassandra, meaning terabyte-size databases.
>
> I do get your point about the amount of time required to recover a downed
> node. But this 300-400MB business is interesting to me.
>
> Thanks in advance.
>
> Wade
>
> From: aaron morton [mailto:aa...@thelastpickle.com]
> Sent: Wednesday, December 05, 2012 9:23 PM
> To: user@cassandra.apache.org
> Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered
> compaction.
>
> > Basically we were successful on two of the nodes. They both took ~2 days
> > and 11 hours to complete and at the end we saw one very large file ~900GB
> > and the rest much smaller (the overall size decreased). This is what we
> > expected!
>
> I would recommend having up to 300MB to 400MB per node on a regular HDD
> with 1GB networking.
>
> > But on the 3rd node, we suspect major compaction didn't actually finish
> > its job…
>
> The file list looks odd. Check the timestamps on the files. You should
> not have files older than when compaction started.
>
> > 8GB heap
>
> The default max is 4GB nowadays.
>
> > 1) Do you expect problems with the 3rd node during 2 weeks more of
> > operations, in the conditions seen below?
>
> I cannot answer that.
>
> > 2) Should we restart with leveled compaction next year?
>
> I would run some tests to see how it works for your workload.
>
> > 4) Should we consider increasing the cluster capacity?
>
> IMHO yes.
>
> You may also want to do some experiments with turning compression on if it
> is not already enabled.
>
> Having so much data on each node is a potential bad day. If instead you
> had to move or repair one of those nodes, how long would it take for
> Cassandra to stream all the data over? (Or to rsync the data over.) How
> long does it take to run nodetool repair on the node?
>
> With RF 3, if you lose a node you have lost your redundancy. It's
> important to have a plan about how to get it back and how long it may take.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 6/12/2012, at 3:40 AM, Alexandru Sicoe <adsi...@gmail.com> wrote:
>
> Hi guys,
> Sorry for the late follow-up but I waited to run major compactions on all
> 3 nodes at a time before replying with my findings.
>
> Basically we were successful on two of the nodes. They both took ~2 days
> and 11 hours to complete and at the end we saw one very large file ~900GB
> and the rest much smaller (the overall size decreased). This is what we
> expected!
>
> But on the 3rd node, we suspect major compaction didn't actually finish
> its job. First of all, nodetool compact returned much earlier than the
> rest, after one day and 15 hrs. Secondly, from the 1.4TB initially on the
> node only about 36GB were freed up (almost the same size as before). Saw
> nothing in the server log (debug not enabled). Below I pasted some more
> details about file sizes before and after compaction on this third node
> and disk occupancy.
>
> The situation is maybe not so dramatic for us because in less than 2 weeks
> we will have a downtime until after the new year.
> During this we can completely delete all the data in the cluster and
> start fresh with TTLs of 1 month (as suggested by Aaron) and an 8GB heap
> (as suggested by Alain), thanks.
>
> Questions:
>
> 1) Do you expect problems with the 3rd node during 2 weeks more of
> operations, in the conditions seen below?
> [Note: we expect the minor compactions to continue building up files but
> never really getting to compacting the large file and thus not needing much
> temporary extra disk space].
>
> 2) Should we restart with leveled compaction next year?
> [Note: Aaron was right, we have 1 week rows which get deleted after 1
> month, which means older rows end up in big files => to free up space with
> SizeTiered we will have no choice but to run major compactions, which we
> don't know will work given that we get to ~1TB / node / 1 month. You
> can see we are at the limit!]
>
> 3) In case we keep SizeTiered:
>
>   - How can we improve the performance of our major compactions? (We
>     left all config parameters as default.) Would increasing compaction
>     throughput interfere with writes and reads? What about multi-threaded
>     compactions?
>
>   - Do we still need to run regular repair operations as well? Do these
>     also do a major compaction or are they completely separate operations?
>
> [Note: we have 3 nodes with RF=2, inserting at consistency level ONE and
> reading at consistency level ALL. We read primarily for exporting reasons:
> we export 1 week's worth of data at a time].
>
> 4) Should we consider increasing the cluster capacity?
> [We generate ~5 million new rows every week, which shouldn't come close to
> the hundreds of millions of rows on a node mentioned by Aaron, which are
> the volumes that would create problems with bloom filters and indexes].
>
> Cheers,
> Alex
> ------------------
>
> The situation in the data folder
>
> before calling nodetool compact:
>
> du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
> 444G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
> 376G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-46431-Data.db
> 305G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-68959-Data.db
> 39G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-7352-Data.db
> 78G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-74076-Data.db
> 81G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-79663-Data.db
> 205M  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80370-Data.db
> 20G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80968-Data.db
> 20G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-82330-Data.db
> 20G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-83710-Data.db
> 4.9G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84015-Data.db
> 4.9G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84356-Data.db
> 4.9G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84696-Data.db
> 333M  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84707-Data.db
> 92M   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84712-Data.db
> 92M   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84717-Data.db
> 99M   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84722-Data.db
> 2.5G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-tmp-he-84723-Data.db
> 1.4T  total
>
> after nodetool compact returned:
>
> du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
> 444G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
> 910G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84723-Data.db
> 19G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-86229-Data.db
> 19G   /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87639-Data.db
> 5.0G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87923-Data.db
> 4.8G  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88261-Data.db
> 338M  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88271-Data.db
> 339M  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88292-Data.db
> 339M  /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88312-Data.db
> 98M
>
> Looking at the disk occupancy for the logical partition where the data
> folder is located:
>
> df /data_bst
> Filesystem  1K-blocks   Used        Available   Use%  Mounted on
> /dev/sdb1   2927242720  1482502260  1444740460  51%   /data_bst
>
> and the situation in the cluster
>
> nodetool -h $HOSTNAME ring (before major compaction)
> Address       DC           Rack   Status  State   Load     Effective-Ownership  Token
>                                                                                 113427455640312821154458202477256070484
> 10.146.44.17  datacenter1  rack1  Up      Normal  1.37 TB  66.67%               0
> 10.146.44.18  datacenter1  rack1  Up      Normal  1.04 TB  66.67%               56713727820156410577229101238628035242
> 10.146.44.32  datacenter1  rack1  Up      Normal  1.14 TB  66.67%               113427455640312821154458202477256070484
>
> nodetool -h $HOSTNAME ring (after major compaction) (Note we were
> inserting data in the meantime)
> Address       DC           Rack   Status  State   Load     Effective-Ownership  Token
>                                                                                 113427455640312821154458202477256070484
> 10.146.44.17  datacenter1  rack1  Up      Normal  1.38 TB  66.67%               0
> 10.146.44.18  datacenter1  rack1  Up      Normal  1.08 TB  66.67%               56713727820156410577229101238628035242
> 10.146.44.32  datacenter1  rack1  Up      Normal  1.19 TB  66.67%               113427455640312821154458202477256070484
>
> On Fri, Nov 23, 2012 at 2:16 AM, aaron morton <aa...@thelastpickle.com>
> wrote:
>
> > From what I know having too much data on one node is bad, not really
> > sure why, but I think that performance will go down due to the size of
> > indexes and bloom filters (I may be wrong on the reasons but I'm quite
> > sure you can't store too much data per node).
>
> If you have many hundreds of millions of rows on a node the memory needed
> for bloom filters and index sampling can be significant. These can both be
> tuned.
>
> If you have 1.1T per node the time to do a compaction, repair or upgrade
> may be very significant. Also the time taken to copy this data, should you
> need to remove or replace a node, may be prohibitive.
>
> > 2. Switch to Leveled compaction strategy.
>
> I would avoid making a change like that on an unstable / at-risk system.
>
> > - Our usage pattern is write once, read once (export) and delete once!
>
> The column TTL may be of use to you; it removes the need to do a delete.
>
> > - We were thinking of relying on the automatic minor compactions to free
> > up space for us but as..
>
> There are some usage patterns which make life harder for size-tiered
> compaction. For example, if you have very long-lived rows that are written
> to and deleted a lot. Row fragments that have been around for a while will
> end up in bigger files, and these files get compacted less often.
>
> In this situation, if you are running low on disk space and you think
> there is a lot of deleted data in there, I would run a major compaction. A
> word of warning though: if you do this you will need to continue to do it
> regularly. Major compaction creates a single big file that will not get
> compacted often. There are ways to resolve this, and moving to leveled
> compaction may help in the future.
>
> If you are stuck and worried about disk space it's what I would do. Once
> you are stable again then look at leveled compaction:
> http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 23/11/2012, at 9:18 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
> > Hi Alexandru,
> >
> > "We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk
> > per node for the data dir and separate disk for the commitlog, 12 cores,
> > 24 GB RAM"
> >
> > I think you should tune your architecture in a very different way.
> > From what I know having too much data on one node is bad, not really
> > sure why, but I think that performance will go down due to the size of
> > indexes and bloom filters (I may be wrong on the reasons but I'm quite
> > sure you can't store too much data per node).
> >
> > Anyway, I think 6 nodes with half of these resources (6 cores / 12GB)
> > would be better if you have the choice.
> >
> > "(12GB to Cassandra heap)."
> >
> > The max recommended heap is 8GB because if you use more than 8GB, GC
> > will start degrading your performance.
> >
> > "We now have 1.1 TB worth of data per node (RF = 2)."
> >
> > You should use RF=3 unless either consistency or avoiding a SPOF doesn't
> > matter to you.
> >
> > With RF=2 you are obliged to write at CL.ONE to remove the single point
> > of failure.
> >
> > "1. Start issuing regular major compactions (nodetool compact).
> >   - This is not recommended:
> >     - Stops minor compactions.
> >     - Major performance hit on node (very bad for us because
> >       we need to be taking data all the time)."
> >
> > Actually, major compaction *does not* stop minor compactions. What
> > happens is that, due to the size of the sstable that remains after your
> > major compaction, it will never be compacted with the upcoming new
> > sstables, and because of that, your read performance will go down until
> > you run another major compaction.
> >
> > "2. Switch to Leveled compaction strategy.
> >   - It is mentioned to help with deletes and disk space usage. Can
> >     someone confirm?"
> >
> > From what I know, Leveled compaction will not free disk space. It will
> > allow you to use a greater percentage of your total disk space (50% max
> > for size-tiered compaction vs about 80% for leveled compaction).
> >
> > "Our usage pattern is write once, read once (export) and delete once!"
> >
> > In this case, I think that leveled compaction fits your needs.
> >
> > "Can anyone suggest which (if any) is better? Are there better
> > solutions?"
> >
> > Are your sstables compressed? You have 2 types of built-in compression
> > and you may use them depending on the model of each of your CFs.
> >
> > see:
> > http://www.datastax.com/docs/1.1/operations/tuning#configure-compression
> >
> > Alain
> >
> > 2012/11/22 Alexandru Sicoe <adsi...@gmail.com>
> >
> > We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk
> > per node for the data dir and separate disk for the commitlog, 12 cores,
> > 24 GB RAM (12GB to Cassandra heap).
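
A back-of-the-envelope sketch for the question raised in the thread about how long it would take to stream or rsync a node's data to a replacement. It assumes "1GB networking" means a 1 Gbit/s link and that only part of that bandwidth is usable in practice; the function name `transfer_hours` and the default `efficiency` of 0.5 are illustrative assumptions, not figures from the thread. Real streaming is often slower still because of disk I/O, concurrent compaction, and throttling.

```python
def transfer_hours(data_tb: float, link_gbit_per_s: float,
                   efficiency: float = 0.5) -> float:
    """Rough hours needed to copy data_tb (decimal terabytes) over a link.

    efficiency is the assumed usable fraction of the nominal link speed
    (a guess, not a measured value).
    """
    data_bits = data_tb * 1e12 * 8                      # terabytes -> bits
    usable_bits_per_s = link_gbit_per_s * 1e9 * efficiency
    return data_bits / usable_bits_per_s / 3600         # seconds -> hours


if __name__ == "__main__":
    # Per-node data sizes mentioned in the thread: 1.1 TB and 1.4 TB.
    for tb in (1.1, 1.4):
        hours = transfer_hours(tb, link_gbit_per_s=1.0)
        print(f"{tb} TB over 1 Gbit/s at 50% efficiency: ~{hours:.1f} h")
```

Under these assumptions the copy alone takes several hours per node, during which the cluster runs with reduced redundancy; plugging in a measured throughput instead of the assumed efficiency gives a more realistic figure.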