Hi Jeremiah,

Thank you for the information - it certainly is a relief. Two questions though:

1. I came across an old thread that seemed to say Cassandra 0.7.0 has a bug and doesn't remove these compacted files properly. Should we upgrade to a newer version that has this bug fixed?
2. Must we run the garbage collection manually via JConsole? Is there any way I can force the GC from our code? (We are using Hector as our Java client.)

Thanks!

On Tue, Aug 2, 2011 at 5:19 PM, Jeremiah Jordan <jeremiah.jor...@morningstar.com> wrote:

> Connect with jconsole and run garbage collection.
> All of the files that have a -Compacted with the same name will get
> deleted the next time a full garbage collection runs, or when the node
> is restarted. They have already been combined into new files, the old
> ones just haven't been deleted yet.
>
> On Tue, 2011-08-02 at 16:09 -0400, Yiming Sun wrote:
> > Hi,
> >
> > I am new to Cassandra, and am hoping someone could help me understand
> > the (large amount of small) data files on disk that Cassandra
> > generates.
> >
> > The reason we are using Cassandra is because we are dealing with
> > thousands to millions of small text files on disk, so we are
> > experimenting with Cassandra hoping that by dropping the files
> > contents into Cassandra, it will achieve more efficient disk usage
> > because Cassandra is going to aggregate them into bigger files (one
> > file per column family, according to the wiki).
> >
> > But after we pushed a subset of the files into a single node Cassandra
> > v0.7.0 instance, we noted that in the Cassandra data directory for the
> > keyspace, there are 8.5 million very small files, most are named
> >
> > <SuperColumnFamilyName>-e-<nnnnn>.Filter.db
> > <SuperColumnFamilyName>-e-<nnnnn>.Compacted.db
> > <SuperColumnFamilyName>-e-<nnnnn>.Index.db
> > <SuperColumnFamilyName>-e-<nnnnn>.Statistics.db
> >
> > and among these files, the Compacted.db are always empty, Filter and
> > Index are under 100 bytes, and Statistics are around 4k.
> >
> > What are these files? Why are there so many of them? We originally
> > hope that Cassandra was going to solve our issue with the small files
> > we have, but now it doesn't seem to help -- we still end up with tons
> > of small files. Is there any way to reduce/combine these small
> > files?
> >
> > Thanks.
> >
> > -- Y.
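
To make question 2 concrete, here is a minimal sketch of requesting a garbage collection on the Cassandra node's JVM over JMX from Java, which should be roughly what the jconsole "Perform GC" button does. It uses only the standard javax.management and java.lang.management APIs, not Hector. The class name ForceGc and the helper names are made up for illustration, and the host and port are assumptions: they must be whatever JMX endpoint jconsole connects to on the node. Note that calling System.gc() inside the Hector client would only collect the client's own JVM, not the node's.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper class; not part of Cassandra or Hector.
public class ForceGc {

    // Builds the standard RMI-based JMX service URL for a host and port.
    static String jmxUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    // Asks the JVM behind the given connection to run a garbage collection
    // by calling gc() on its java.lang:type=Memory MXBean.
    static void forceGc(MBeanServerConnection conn) {
        try {
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            memory.gc();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: java ForceGc <host> <jmx-port>");
            return;
        }
        // Assumption: the node exposes JMX without authentication on this port.
        JMXServiceURL url = new JMXServiceURL(jmxUrl(args[0], Integer.parseInt(args[1])));
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            forceGc(connector.getMBeanServerConnection());
        }
    }
}
```

It would be run as `java ForceGc <host> <jmx-port>` against the node. Whether a GC requested this way actually triggers deletion of the -Compacted files is something I have not verified; it only mirrors the manual jconsole step described above.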