Yes, some sort of data structure to coordinate this could reduce the problem
as well.
I made some comments on that at the end of 2558.

I believe a coordinator could be put in place both to
- plan the start of compactions, and
- coordinate compaction thread shutdown and tmp file deletion before we
completely run out of disk space.
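
As a minimal sketch of what I have in mind (all names here are hypothetical,
this is not existing Cassandra code): one shared object per data directory
tracks how many bytes in-flight compactions have claimed, so concurrent
compaction threads cannot collectively over-commit the disk, and it can also
tell running compactions when free space has dropped low enough that they
should abort and clean up their tmp files.

import java.io.File;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical coordinator, one instance shared by all compaction threads
// writing to the same data directory.
public class DiskSpaceCoordinator
{
    private final File dataDirectory;
    private final AtomicLong reservedBytes = new AtomicLong(0);

    public DiskSpaceCoordinator(File dataDirectory)
    {
        this.dataDirectory = dataDirectory;
    }

    // Try to claim space before starting a compaction; false means "do not start".
    public boolean tryReserve(long estimatedBytes)
    {
        while (true)
        {
            long current = reservedBytes.get();
            long free = dataDirectory.getUsableSpace();
            // Only count space that no other in-flight compaction has already claimed.
            if (free - current < estimatedBytes)
                return false;
            if (reservedBytes.compareAndSet(current, current + estimatedBytes))
                return true;
        }
    }

    // Release the claim when the compaction finishes or is aborted.
    public void release(long estimatedBytes)
    {
        reservedBytes.addAndGet(-estimatedBytes);
    }

    // True if free space (minus outstanding reservations) has dropped below a
    // safety floor; running compactions could poll this and abort early,
    // deleting their tmp files, instead of filling the node completely.
    public boolean shouldAbortCompactions(long safetyFloorBytes)
    {
        return dataDirectory.getUsableSpace() - reservedBytes.get() < safetyFloorBytes;
    }
}

A compare-and-set loop on an AtomicLong is enough here since the reservation is
just a counter; the estimate could simply be the combined size of the input
sstables.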

Regards,
Terje

On Wed, May 4, 2011 at 10:09 PM, Jonathan Ellis <jbel...@gmail.com> wrote:

> Or we could "reserve" space when starting a compaction.
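
A rough usage sketch of that reservation idea, building on the hypothetical
DiskSpaceCoordinator above (SSTableDesc, estimateCompactedSize and compact are
placeholders, not the real compaction code):

import java.util.Collection;

// Hypothetical glue showing how a compaction task could reserve before it
// starts and always release when it finishes or aborts.
public class ReservingCompactionTask
{
    public static boolean compactWithReservation(DiskSpaceCoordinator coordinator,
                                                 Collection<SSTableDesc> sstables)
    {
        long estimated = estimateCompactedSize(sstables);
        if (!coordinator.tryReserve(estimated))
            return false;                      // defer instead of over-committing the disk
        try
        {
            compact(sstables);                 // placeholder for the real compaction
            return true;
        }
        finally
        {
            coordinator.release(estimated);    // release even if the compaction aborts
        }
    }

    // Crude placeholder estimate: worst case, the output is as large as the inputs combined.
    private static long estimateCompactedSize(Collection<SSTableDesc> sstables)
    {
        long total = 0;
        for (SSTableDesc s : sstables)
            total += s.sizeOnDiskBytes;
        return total;
    }

    private static void compact(Collection<SSTableDesc> sstables)
    {
        // the actual merge/rewrite would happen here
    }

    // Minimal stand-in for an sstable descriptor.
    public static class SSTableDesc
    {
        public final long sizeOnDiskBytes;
        public SSTableDesc(long sizeOnDiskBytes) { this.sizeOnDiskBytes = sizeOnDiskBytes; }
    }
}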
>
> On Wed, May 4, 2011 at 2:32 AM, Terje Marthinussen
> <tmarthinus...@gmail.com> wrote:
> > Partially, I guess this may be a side effect of multithreaded compactions?
> > Before running out of space completely, I do see a few of these:
> >  WARN [CompactionExecutor:448] 2011-05-02 01:08:10,480 CompactionManager.java
> > (line 516) insufficient space to compact all requested files
> > SSTableReader(path='/data/cassandra/JP_MALL_PH/Order-f-12858-Data.db'),
> > SSTableReader(path='/data/cassandra/JP_MALL_PH/Order-f-12851-Data.db'),
> > SSTableReader(path='/data/cassandra/JP_MALL_PH/Order-f-12864-Data.db')
> >  INFO [CompactionExecutor:448] 2011-05-02 01:08:10,481 StorageService.java
> > (line 2066) requesting GC to free disk space
> > In this case, there would be 24 threads that asked if there was enough free
> > disk space.
> > Most of them probably succeeded in that check, but in theory they could have
> > requested 24x the available space, since I do not think there is any global
> > pool of reserved disk space that tracks how much will be needed by
> > already-started compactions.
> > Of course, regardless of how much checking there is in advance, we could
> > still run out of disk, so I guess there is also a need to check whether disk
> > space is about to run out while a compaction runs, so things may be
> > halted/aborted.
> > Unfortunately that would need global coordination so we do not stop all
> > compaction threads....
> > After reducing to 6 compaction threads in 0.8 beta2, the data has compacted
> > just fine without any disk space issues. I guess another problem you may hit
> > as you get a lot of sstables with updates (that is, duplicates) to the same
> > data is that, of course, the massively concurrent compaction taking place
> > with nproc threads could also concurrently duplicate all the duplicates on a
> > large scale.
> > Yes, this is an argument in favour of multithreaded compaction, as it should
> > normally help keep sstables at a sane level and avoid such problems, but it
> > is unfortunately just a kludge for the real problem, which is to segment the
> > sstables somehow on keyspace so we can get the disk requirements down and
> > recover from scenarios where disk usage gets above 50%.
> > Regards,
> > Terje
> >
> >
> > On Wed, May 4, 2011 at 3:33 PM, Terje Marthinussen <tmarthinus...@gmail.com>
> > wrote:
> >>
> >> Well, I just did not look at these logs very well at all last night.
> >> First out-of-disk message:
> >> ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
> >> AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> >> Thread[CompactionExecutor:387,1,main]
> >> java.io.IOException: No space left on device
> >> Then finally the last one
> >> ERROR [FlushWriter:128] 2011-05-02 01:51:06,112
> >> AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> >> Thread[FlushWriter:128,5,main]
> >> java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk
> >> space to flush 554962 bytes
> >>         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
> >>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>         at java.lang.Thread.run(Thread.java:662)
> >> Caused by: java.lang.RuntimeException: Insufficient disk space to flush
> >> 554962 bytes
> >>         at org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
> >>         at org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
> >>         at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
> >>         at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
> >>         at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
> >>         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
> >>         ... 3 more
> >>  INFO [CompactionExecutor:451] 2011-05-02 01:51:06,113 StorageService.java
> >> (line 2066) requesting GC to free disk space
> >> [lots of sstables deleted]
> >> After this it starts running again (although not entirely fine, it seems).
> >> So the disk seems to have been full for 35 minutes due to un-deleted
> >> sstables.
> >> Terje
> >> On Wed, May 4, 2011 at 6:34 AM, Terje Marthinussen
> >> <tmarthinus...@gmail.com> wrote:
> >>>
> >>> Hm... peculiar.
> >>> Post flush is not involved in compactions, right?
> >>>
> >>> May 2nd:
> >>> 01:06 - Out of disk
> >>> 01:51 - Starts a mix of major and minor compactions on different column
> >>> families
> >>> It then starts a few extra minor compactions over the day, but given that
> >>> there are more than 1000 sstables and we are talking about 3 minor
> >>> compactions started, that is not normal, I think.
> >>> May 3rd: 1 minor compaction started.
> >>> When I checked today, there was a bunch of tmp files on the disk with last
> >>> modify times from 01:something on May 2nd, and 200GB of empty disk...
> >>> Definitely no compaction going on.
> >>> Guess I will add some debug logging and see if I get lucky and run out of
> >>> disk again.
> >>> Terje
> >>> On Wed, May 4, 2011 at 5:06 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> >>>>
> >>>> Compaction does, but flush didn't until
> >>>> https://issues.apache.org/jira/browse/CASSANDRA-2404
> >>>>
> >>>> On Tue, May 3, 2011 at 2:26 PM, Terje Marthinussen
> >>>> <tmarthinus...@gmail.com> wrote:
> >>>> > Yes, I realize that.
> >>>> > I am a bit curious why it ran out of disk, or rather, why I have 200GB of
> >>>> > empty disk now, but unfortunately it seems like we may not have had
> >>>> > monitoring enabled on this node to tell me what happened in terms of disk
> >>>> > usage.
> >>>> > I also thought that compaction was supposed to resume (try again with
> >>>> > less data) if it fails?
> >>>> > Terje
> >>>> >
> >>>> > On Wed, May 4, 2011 at 3:50 AM, Jonathan Ellis <jbel...@gmail.com>
> >>>> > wrote:
> >>>> >>
> >>>> >> The post flusher is responsible for updating the commitlog header after
> >>>> >> a flush; each task waits for a specific flush to complete, then does its
> >>>> >> thing.
> >>>> >>
> >>>> >> So when you had a flush catastrophically fail, its corresponding
> >>>> >> post-flush task will be stuck.
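
A toy illustration of that coupling (not the actual Cassandra classes, just a
sketch of why a failed flush leaves its post-flush task pending forever: the
post-flush work waits on a signal that only a successful flush delivers):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PostFlushStuckDemo
{
    public static void main(String[] args)
    {
        ExecutorService flushWriter = Executors.newSingleThreadExecutor();
        ExecutorService postFlusher = Executors.newSingleThreadExecutor();

        // One latch per flush; the post-flush task for that flush waits on it.
        final CountDownLatch flushed = new CountDownLatch(1);

        flushWriter.execute(new Runnable()
        {
            public void run()
            {
                // The real flush would write the memtable and then signal
                // completion, e.g. writeSortedContents(); flushed.countDown();
                // If it dies first ("Insufficient disk space to flush ..."),
                // the latch is never counted down.
                throw new RuntimeException("Insufficient disk space to flush");
            }
        });

        postFlusher.execute(new Runnable()
        {
            public void run()
            {
                try
                {
                    // Blocks forever, so this task (and everything queued
                    // behind it) shows up as active/pending.
                    flushed.await();
                    // ... would update the commitlog header here ...
                }
                catch (InterruptedException e)
                {
                    Thread.currentThread().interrupt();
                }
            }
        });
        // The JVM never exits: the post-flush thread is parked in await().
    }
}

Since the post flusher runs tasks one at a time, every later post-flush task
queues up behind the stuck one, which is consistent with the growing pending
count in the tpstats output further down.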
> >>>> >>
> >>>> >> On Tue, May 3, 2011 at 1:20 PM, Terje Marthinussen
> >>>> >> <tmarthinus...@gmail.com> wrote:
> >>>> >> > Just a very tiny amount of writes in the background here (some hints
> >>>> >> > spooled up on another node slowly coming in).
> >>>> >> > No new data.
> >>>> >> >
> >>>> >> > I thought there were no exceptions, but I did not look far enough back
> >>>> >> > in the log at first.
> >>>> >> > Going back a bit further now, however, I see this from about 50 hours
> >>>> >> > ago:
> >>>> >> > ERROR [CompactionExecutor:387] 2011-05-02 01:16:01,027
> >>>> >> > AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> >>>> >> > Thread[CompactionExecutor:387,1,main]
> >>>> >> > java.io.IOException: No space left on device
> >>>> >> >         at java.io.RandomAccessFile.writeBytes(Native Method)
> >>>> >> >         at java.io.RandomAccessFile.write(RandomAccessFile.java:466)
> >>>> >> >         at org.apache.cassandra.io.util.BufferedRandomAccessFile.flush(BufferedRandomAccessFile.java:160)
> >>>> >> >         at org.apache.cassandra.io.util.BufferedRandomAccessFile.reBuffer(BufferedRandomAccessFile.java:225)
> >>>> >> >         at org.apache.cassandra.io.util.BufferedRandomAccessFile.writeAtMost(BufferedRandomAccessFile.java:356)
> >>>> >> >         at org.apache.cassandra.io.util.BufferedRandomAccessFile.write(BufferedRandomAccessFile.java:335)
> >>>> >> >         at org.apache.cassandra.io.PrecompactedRow.write(PrecompactedRow.java:102)
> >>>> >> >         at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:130)
> >>>> >> >         at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:566)
> >>>> >> >         at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:146)
> >>>> >> >         at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:112)
> >>>> >> >         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>> >> >         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>> >> >         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>> >> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>> >> >         at java.lang.Thread.run(Thread.java:662)
> >>>> >> > [followed by a few more of those...]
> >>>> >> > and then a bunch of these:
> >>>> >> > ERROR [FlushWriter:123] 2011-05-02 01:21:12,690
> >>>> >> > AbstractCassandraDaemon.java (line 112) Fatal exception in thread
> >>>> >> > Thread[FlushWriter:123,5,main]
> >>>> >> > java.lang.RuntimeException: java.lang.RuntimeException: Insufficient
> >>>> >> > disk space to flush 40009184 bytes
> >>>> >> >         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
> >>>> >> >         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>> >> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>> >> >         at java.lang.Thread.run(Thread.java:662)
> >>>> >> > Caused by: java.lang.RuntimeException: Insufficient disk space to
> >>>> >> > flush 40009184 bytes
> >>>> >> >         at org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:597)
> >>>> >> >         at org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2100)
> >>>> >> >         at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:239)
> >>>> >> >         at org.apache.cassandra.db.Memtable.access$400(Memtable.java:50)
> >>>> >> >         at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:263)
> >>>> >> >         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
> >>>> >> >         ... 3 more
> >>>> >> > It seems like compactions stopped after this (a bunch of tmp tables are
> >>>> >> > still there from when those errors were generated), and I can only
> >>>> >> > suspect the post flusher may have stopped at the same time.
> >>>> >> > There is 890GB of disk for data, sstables are currently using 604GB
> >>>> >> > (139GB of which is old tmp tables from when it ran out of disk), and
> >>>> >> > "ring" tells me the load on the node is 313GB.
> >>>> >> > Terje
> >>>> >> >
> >>>> >> >
> >>>> >> > On Wed, May 4, 2011 at 3:02 AM, Jonathan Ellis <jbel...@gmail.com>
> >>>> >> > wrote:
> >>>> >> >>
> >>>> >> >> ... and are there any exceptions in the log?
> >>>> >> >>
> >>>> >> >> On Tue, May 3, 2011 at 1:01 PM, Jonathan Ellis <jbel...@gmail.com>
> >>>> >> >> wrote:
> >>>> >> >> > Does it resolve down to 0 eventually if you stop doing writes?
> >>>> >> >> >
> >>>> >> >> > On Tue, May 3, 2011 at 12:56 PM, Terje Marthinussen
> >>>> >> >> > <tmarthinus...@gmail.com> wrote:
> >>>> >> >> >> Cassandra 0.8 beta trunk from about 1 week ago:
> >>>> >> >> >> Pool Name                    Active   Pending      Completed
> >>>> >> >> >> ReadStage                         0         0              5
> >>>> >> >> >> RequestResponseStage              0         0          87129
> >>>> >> >> >> MutationStage                     0         0         187298
> >>>> >> >> >> ReadRepairStage                   0         0              0
> >>>> >> >> >> ReplicateOnWriteStage             0         0              0
> >>>> >> >> >> GossipStage                       0         0        1353524
> >>>> >> >> >> AntiEntropyStage                  0         0              0
> >>>> >> >> >> MigrationStage                    0         0             10
> >>>> >> >> >> MemtablePostFlusher               1       190            108
> >>>> >> >> >> StreamStage                       0         0              0
> >>>> >> >> >> FlushWriter                       0         0            302
> >>>> >> >> >> FILEUTILS-DELETE-POOL             0         0             26
> >>>> >> >> >> MiscStage                         0         0              0
> >>>> >> >> >> FlushSorter                       0         0              0
> >>>> >> >> >> InternalResponseStage             0         0              0
> >>>> >> >> >> HintedHandoff                     1         4              7
> >>>> >> >> >>
> >>>> >> >> >> Anyone with nice theories about the pending value on the
> >>>> >> >> >> memtable post flusher?
> >>>> >> >> >> Regards,
> >>>> >> >> >> Terje
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> > --
> >>>> >> >> > Jonathan Ellis
> >>>> >> >> > Project Chair, Apache Cassandra
> >>>> >> >> > co-founder of DataStax, the source for professional Cassandra
> >>>> >> >> > support
> >>>> >> >> > http://www.datastax.com
> >>>> >> >> >
> >>>> >> >>
> >>>> >> >>
> >>>> >> >>
> >>>> >> >> --
> >>>> >> >> Jonathan Ellis
> >>>> >> >> Project Chair, Apache Cassandra
> >>>> >> >> co-founder of DataStax, the source for professional Cassandra
> >>>> >> >> support
> >>>> >> >> http://www.datastax.com
> >>>> >> >
> >>>> >> >
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >> --
> >>>> >> Jonathan Ellis
> >>>> >> Project Chair, Apache Cassandra
> >>>> >> co-founder of DataStax, the source for professional Cassandra support
> >>>> >> http://www.datastax.com
> >>>> >
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Jonathan Ellis
> >>>> Project Chair, Apache Cassandra
> >>>> co-founder of DataStax, the source for professional Cassandra support
> >>>> http://www.datastax.com
> >>>
> >>
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
