Connect with jconsole and trigger a garbage collection. Every group of files that has a -Compacted marker with the same name prefix will be deleted the next time a full garbage collection runs, or when the node is restarted. Those files have already been combined into new files; the old ones just haven't been deleted yet.
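If you want to see how many of those files are just waiting for cleanup before you force a GC, something like the rough sketch below might help. It only looks at file names and deletes nothing. The naming pattern is my assumption (`<ColumnFamily>-<version>-<generation>` followed by `-` or `.` and a component name, with an optional `.db`), and `split_live_and_obsolete` is a made-up helper name, not part of Cassandra.

```python
import re
import sys
from collections import defaultdict
from pathlib import Path

# ASSUMPTION: files look like Foo-e-12-Data.db / Foo-e-12-Compacted, or
# Foo-e-12.Filter.db as written in the quoted message. Adjust if yours differ.
SSTABLE_NAME = re.compile(
    r"^(?P<prefix>.+-[a-z]+-\d+)[-.](?P<component>[A-Za-z]+)(?:\.db)?$"
)


def split_live_and_obsolete(file_names):
    """Group file names by prefix; a prefix with a Compacted marker is obsolete."""
    components = defaultdict(set)
    for name in file_names:
        match = SSTABLE_NAME.match(name)
        if match:  # names that do not fit the assumed pattern are ignored
            components[match.group("prefix")].add(match.group("component"))
    obsolete = sorted(p for p, c in components.items() if "Compacted" in c)
    live = sorted(p for p, c in components.items() if "Compacted" not in c)
    return live, obsolete


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_compacted.py /path/to/data/<keyspace>
    data_dir = Path(sys.argv[1])
    names = (p.name for p in data_dir.iterdir() if p.is_file())
    live, obsolete = split_live_and_obsolete(names)
    print(f"live groups: {len(live)}")
    print(f"groups marked Compacted (awaiting deletion): {len(obsolete)}")
```

If nearly all groups show up as marked Compacted, the GC or a restart should clear most of the directory.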

On Tue, 2011-08-02 at 16:09 -0400, Yiming Sun wrote:
> Hi,
>
> I am new to Cassandra, and am hoping someone could help me understand
> the (large amount of small) data files on disk that Cassandra
> generates.
>
> The reason we are using Cassandra is because we are dealing with
> thousands to millions of small text files on disk, so we are
> experimenting with Cassandra hoping that by dropping the files
> contents into Cassandra, it will achieve more efficient disk usage
> because Cassandra is going to aggregate them into bigger files (one
> file per column family, according to the wiki).
>
> But after we pushed a subset of the files into a single node Cassandra
> v0.7.0 instance, we noted that in the Cassandra data directory for the
> keyspace, there are 8.5 million very small files, most are named
>
> <SuperColumnFamilyName>-e-<nnnnn>.Filter.db
> <SuperColumnFamilyName>-e-<nnnnn>.Compacted.db
> <SuperColumnFamilyName>-e-<nnnnn>.Index.db
> <SuperColumnFamilyName>-e-<nnnnn>.Statistics.db
>
> and among these files, the Compacted.db are always empty, Filter and
> Index are under 100 bytes, and Statistics are around 4k.
>
> What are these files? Why are there so many of them? We originally
> hope that Cassandra was going to solve our issue with the small files
> we have, but now it doesn't seem to help -- we still end up with tons
> of small files. Is there any way to reduce/combine these small
> files?
>
> Thanks.
>
> -- Y.