http://wiki.apache.org/cassandra/LargeDataSetConsiderations


On Thu, Dec 6, 2012 at 9:53 AM, Poziombka, Wade L <
wade.l.poziom...@intel.com> wrote:

>  “Having so much data on each node is a potential bad day.”
>
> Is this discussed somewhere in the Cassandra documentation (limits,
> practices, etc.)? We are also trying to load up quite a lot of data and have
> hit memory issues (bloom filters, etc.) in 1.0.10. I would like to read up
> on big-data usage of Cassandra, meaning terabyte-sized databases.
>
> I do get your point about the amount of time required to recover a downed
> node. But this 300-400GB business is interesting to me.
>
> Thanks in advance.
>
> Wade
>
> From: aaron morton [mailto:aa...@thelastpickle.com]
> Sent: Wednesday, December 05, 2012 9:23 PM
> To: user@cassandra.apache.org
> Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered
> compaction.
>
>
> Basically we were successful on two of the nodes. They both took ~2 days
> and 11 hours to complete and at the end we saw one very large file ~900GB
> and the rest much smaller (the overall size decreased). This is what we
> expected!
>
> I would recommend having up to 300GB to 400GB per node on a regular HDD
> with 1Gb networking.
>
> But on the 3rd node, we suspect major compaction didn't actually finish
> its job…
>
> The file list looks odd. Check the timestamps on the files. You should
> not have files older than when compaction started.
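>
> A quick way to eyeball that (just a sketch; the path is the data directory
> from the listing further down in this thread):
>
> ls -lt /data_bst/cassandra/data/ATLAS/Data/*-Data.db
>
> Any -Data.db file with a modification time older than the start of the
> major compaction would be suspicious.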
>
> 8GB heap
>
> The default is 4GB max nowadays.
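>
> For reference, the heap is normally set in conf/cassandra-env.sh; the
> values below are illustrative only, not a recommendation:
>
> MAX_HEAP_SIZE="8G"
> HEAP_NEWSIZE="800M"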
>
> 1) Do you expect problems with the 3rd node during two more weeks of
> operation, in the conditions seen below?
>
> I cannot answer that.
>
> 2) Should we restart with leveled compaction next year?
>
> I would run some tests to see how it works for your workload.
>
> 4) Should we consider increasing the cluster capacity?
>
> IMHO yes.
>
> You may also want to do some experiments with turning compression on if it
> is not already enabled.
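>
> A rough sketch of what enabling compression could look like with
> cassandra-cli on 1.1 (keyspace and column family names are taken from the
> listing below; please verify the exact syntax against your version):
>
> cassandra-cli -h $HOSTNAME -k ATLAS
> [default@ATLAS] UPDATE COLUMN FAMILY Data
>     WITH compression_options = {sstable_compression: SnappyCompressor, chunk_length_kb: 64};
>
> Note that existing sstables are not rewritten by the change; only newly
> written sstables are compressed (nodetool scrub / upgradesstables can
> rebuild the old ones).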
>
> Having so much data on each node is a potential bad day. If instead you
> had to move or repair one of those nodes, how long would it take for
> Cassandra to stream all the data over? (Or to rsync the data over.) How
> long does it take to run nodetool repair on the node?
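>
> A simple way to get a baseline for the repair question (keyspace and column
> family names are taken from the listing below):
>
> time nodetool -h $HOSTNAME repair ATLAS Data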
>
> With RF 3, if you lose a node you have lost your redundancy. It's
> important to have a plan for how to get it back and how long it may take.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 6/12/2012, at 3:40 AM, Alexandru Sicoe <adsi...@gmail.com> wrote:
>
> Hi guys,
> Sorry for the late follow-up but I waited until I had run major compactions
> on all 3 nodes before replying with my findings.
>
> Basically we were successful on two of the nodes. They both took ~2 days
> and 11 hours to complete and at the end we saw one very large file ~900GB
> and the rest much smaller (the overall size decreased). This is what we
> expected!
>
> But on the 3rd node, we suspect major compaction didn't actually finish
> its job. First of all, nodetool compact returned much earlier than on the
> other nodes - after one day and 15 hrs. Secondly, of the 1.4TB initially on
> the node, only about 36GB were freed up (almost the same size as before).
> We saw nothing in the server log (debug not enabled). Below I pasted some
> more details about file sizes before and after compaction on this third
> node and disk occupancy.
>
> The situation is maybe not so dramatic for us because in less than 2 weeks
> we will have downtime until after the new year. During this downtime we can
> completely delete all the data in the cluster and start fresh with 1-month
> TTLs (as suggested by Aaron) and an 8GB heap (as suggested by Alain - thanks).
>
> Questions:
>
> 1) Do you expect problems with the 3rd node during two more weeks of
> operation, in the conditions seen below?
> [Note: we expect the minor compactions to continue building up files but
> never really getting to compacting the large file, and thus not needing much
> extra temporary disk space.]
>
> 2) Should we restart with leveled compaction next year?
> [Note: Aaron was right, we have 1-week rows which get deleted after 1
> month, which means older rows end up in big files => to free up space with
> SizeTiered we will have no choice but to run major compactions, and we don't
> know if those will work given that we accumulate ~1TB / node / month. You
> can see we are at the limit!]
>
> 3) In case we keep SizeTiered:
>
>     - How can we improve the performance of our major compactions? (We
> left all config parameters at their defaults.) Would increasing compaction
> throughput interfere with writes and reads? What about multi-threaded
> compaction? (See the sketch after this question for the relevant knobs.)
>
>     - Do we still need to run regular repair operations as well? Do these
> also do a major compaction or are they completely separate operations?
>
> [Note: we have 3 nodes with RF=2, inserting at consistency level ONE and
> reading at consistency level ALL. We read primarily for exporting reasons -
> we export 1 week's worth of data at a time.]
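>
> [Sketch for the compaction-throughput question above; option names are as I
> recall them for 1.1, so please double-check against your cassandra.yaml.
> Throughput can be changed at runtime with
>
> nodetool -h $HOSTNAME setcompactionthroughput 32
>
> while the permanent settings are compaction_throughput_mb_per_sec (default
> 16) and multithreaded_compaction (default false) in cassandra.yaml. A higher
> throughput means compaction competes harder with reads and writes for disk
> I/O.]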
>
> 4) Should we consider increasing the cluster capacity?
> [We generate ~5 million new rows every week, which shouldn't come close to
> the hundreds of millions of rows per node mentioned by Aaron as the volumes
> that would create problems with bloom filters and indexes.]
>
> Cheers,
> Alex
> ------------------
>
> The situation in the data folder
>
>     before calling nodetool compact:
>
> du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
> 444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
> 376G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-46431-Data.db
> 305G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-68959-Data.db
> 39G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-7352-Data.db
> 78G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-74076-Data.db
> 81G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-79663-Data.db
> 205M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80370-Data.db
> 20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-80968-Data.db
> 20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-82330-Data.db
> 20G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-83710-Data.db
> 4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84015-Data.db
> 4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84356-Data.db
> 4.9G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84696-Data.db
> 333M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84707-Data.db
> 92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84712-Data.db
> 92M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84717-Data.db
> 99M     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84722-Data.db
> 2.5G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-tmp-he-84723-Data.db
> 1.4T    total
>
>     after nodetool compact returned:
>
> du -csh /data_bst/cassandra/data/ATLAS/Data/*-Data.db
> 444G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-24370-Data.db
> 910G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-84723-Data.db
> 19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-86229-Data.db
> 19G     /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87639-Data.db
> 5.0G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-87923-Data.db
> 4.8G    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88261-Data.db
> 338M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88271-Data.db
> 339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88292-Data.db
> 339M    /data_bst/cassandra/data/ATLAS/Data/ATLAS-Data-he-88312-Data.db
> 98M
>
>
> Looking at the disk occupancy for the logical partition that the data
> folder is on:
>
> df /data_bst
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/sdb1            2927242720 1482502260 1444740460  51% /data_bst
>
>
> and the situation in the cluster
>
> nodetool -h $HOSTNAME ring (before major compaction)
> Address       DC          Rack   Status State  Load     Effective-Ownership  Token
>                                                                              113427455640312821154458202477256070484
> 10.146.44.17  datacenter1 rack1  Up     Normal 1.37 TB  66.67%               0
> 10.146.44.18  datacenter1 rack1  Up     Normal 1.04 TB  66.67%               56713727820156410577229101238628035242
> 10.146.44.32  datacenter1 rack1  Up     Normal 1.14 TB  66.67%               113427455640312821154458202477256070484
>
> nodetool -h $HOSTNAME ring (after major compaction; note we were
> inserting data in the meantime)
> Address       DC          Rack   Status State  Load     Effective-Ownership  Token
>                                                                              113427455640312821154458202477256070484
> 10.146.44.17  datacenter1 rack1  Up     Normal 1.38 TB  66.67%               0
> 10.146.44.18  datacenter1 rack1  Up     Normal 1.08 TB  66.67%               56713727820156410577229101238628035242
> 10.146.44.32  datacenter1 rack1  Up     Normal 1.19 TB  66.67%               113427455640312821154458202477256070484
>
> On Fri, Nov 23, 2012 at 2:16 AM, aaron morton <aa...@thelastpickle.com>
> wrote:
>
> > From what I know having too much data on one node is bad, not really
> > sure why, but I think that performance will go down due to the size of
> > indexes and bloom filters (I may be wrong on the reasons but I'm quite
> > sure you can't store too much data per node).
>
> If you have many hundreds of millions of rows on a node the memory needed
> for bloom filters and index sampling can be significant. These can both be
> tuned.
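>
> A sketch of those two knobs (option names are as I recall them for 1.1, so
> please verify before use): bloom_filter_fp_chance is a per-column family
> setting, e.g. via cassandra-cli
>
> [default@ATLAS] UPDATE COLUMN FAMILY Data WITH bloom_filter_fp_chance = 0.01;
>
> while index sampling is controlled by index_interval in cassandra.yaml
> (default 128; a larger value uses less heap at some read-latency cost). A
> bloom filter change only applies to sstables built after the change.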
>
> If you have 1.1T per node, the time to do a compaction, repair or upgrade
> may be very significant. Also the time taken to copy this data, should you
> need to remove or replace a node, may be prohibitive.
>
>
> > 2. Switch to Leveled compaction strategy.
>
> I would avoid making a change like that on an unstable / at risk system.
>
> > - Our usage pattern is write once, read once (export) and delete once!
>
> The column TTL may be of use to you; it removes the need to do a delete.
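>
> A minimal illustration with cassandra-cli (the row key, column name and
> value are made up; 2592000 seconds is 30 days):
>
> [default@ATLAS] SET Data['some-row-key']['some-column'] = 'some-value' WITH ttl = 2592000;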
>
> > - We were thinking of relying on the automatic minor compactions to free
> > up space for us but as..
>
> There are some usage patterns which make life harder for STS. For example,
> if you have very long-lived rows that are written to and deleted a lot, row
> fragments that have been around for a while will end up in bigger files,
> and these files get compacted less often.
>
> In this situation, if you are running low on disk space and you think
> there is a lot of deleted data in there, I would run a major compaction. A
> word of warning though: if you do this you will need to continue to do it
> regularly. Major compaction creates a single big file that will not get
> compacted often. There are ways to resolve this, and moving to LDB may
> help in the future.
>
> If you are stuck and worried about disk space it's what I would do. Once
> you are stable again, then look at LDB:
> http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
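>
> If you do get to that point, the switch is a per-column family change,
> roughly like this with cassandra-cli on 1.1 (a sketch only; verify the
> syntax and pick an sstable_size_in_mb that suits your data):
>
> [default@ATLAS] UPDATE COLUMN FAMILY Data
>     WITH compaction_strategy = 'LeveledCompactionStrategy'
>     AND compaction_strategy_options = {sstable_size_in_mb: 10};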
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
>
> On 23/11/2012, at 9:18 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
> > Hi Alexandru,
> >
> > "We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk
> per node for the data dir and separate disk for the commitlog, 12 cores, 24
> GB RAM"
> >
> > I think you should tune your architecture in a very different way. From
> > what I know having too much data on one node is bad, not really sure why,
> > but I think that performance will go down due to the size of indexes and
> > bloom filters (I may be wrong on the reasons but I'm quite sure you can't
> > store too much data per node).
> >
> > Anyway, I think 6 nodes with half of these resources (6 cores / 12GB)
> > would be better, if you have the choice.
> >
> > "(12GB to Cassandra heap)."
> >
> > The max heap recommended is 8GB because if you use more than these 8GB,
> > the GC jobs will start decreasing your performance.
> >
> > "We now have 1.1 TB worth of data per node (RF = 2)."
> >
> > You should use RF=3 unless one of consistency or avoiding a SPOF doesn't
> > matter to you.
> >
> > With RF=2 you are obliged to write at CL ONE to remove the single point
> > of failure.
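> >
> > (For reference: with RF=3, QUORUM is floor(3/2) + 1 = 2, so writing and
> > reading at QUORUM gives W + R = 2 + 2 = 4 > 3 = RF and still tolerates one
> > node being down; with RF=2, QUORUM is also 2, so a single node failure
> > blocks QUORUM operations, which is why you are pushed down to CL ONE.)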
> >
> > "1. Start issuing regular major compactions (nodetool compact).
> >      - This is not recommended:
> >             - Stops minor compactions.
> >             - Major performance hit on node (very bad for us because we
> > need to be taking data all the time)."
> >
> > Actually, major compaction *does not* stop minor compactions. What
> > happens is that, due to the size of the sstable that remains after your
> > major compaction, it will never be compacted with the upcoming new
> > sstables, and because of that, your read performance will go down until
> > you run another major compaction.
> >
> > "2. Switch to Leveled compaction strategy.
> >       - It is mentioned to help with deletes and disk space usage. Can
> > someone confirm?"
> >
> > From what I know, Leveled compaction will not free disk space. It will
> > allow you to use a greater percentage of your total disk space (50% max
> > for size-tiered compaction vs about 80% for leveled compaction).
> >
> > "Our usage pattern is write once, read once (export) and delete once! "
> >
> > In this case, I think that leveled compaction fits your needs.
> >
> > "Can anyone suggest which (if any) is better? Are there better
> solutions?"
> >
> > Are your sstable compressed ? You have 2 types of built-in compression
> and you may use them depending on the model of each of your CF.
> >
> > see:
> http://www.datastax.com/docs/1.1/operations/tuning#configure-compression
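> >
> > (As far as I recall, the two built-in sstable compressors in 1.1 are
> > SnappyCompressor and DeflateCompressor; Snappy is faster, Deflate
> > compresses better.)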
> >
> > Alain
> >
> > 2012/11/22 Alexandru Sicoe <adsi...@gmail.com>
> > We are running a 3 node Cassandra 1.1.5 cluster with a 3TB Raid 0 disk
> > per node for the data dir and separate disk for the commitlog, 12 cores,
> > 24 GB RAM (12GB to Cassandra heap).
>
