Re: 8 million Cassandra data files on disk

Jeremiah Jordan Tue, 02 Aug 2011 14:19:57 -0700

Connect with jconsole and trigger a garbage collection (the "Perform GC"
button on the Memory tab).
Every data file whose name matches a -Compacted marker file will be
deleted the next time a full garbage collection runs, or when the node
is restarted.  Those files have already been combined into new files;
the old ones just haven't been deleted yet.
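
To see how many of the files on disk are just waiting for that cleanup,
something along these lines might help. It is only a rough sketch: it
assumes the file names look like the listing quoted below
(<ColumnFamily>-e-<number>, then "." or "-", then the component name and
".db"), and the function names are made up for illustration, not part of
Cassandra.

```python
import os
import re
from collections import defaultdict

# Assumed naming: <ColumnFamily>-e-<number><sep><Component>.db, where <sep>
# is "." (as in the listing in this thread) or "-". Adjust if yours differ.
SSTABLE_FILE = re.compile(r"^(?P<base>.+-\d+)[-.](?P<component>[A-Za-z]+)\.db$")


def group_by_sstable(filenames):
    """Map each SSTable base name to the set of component names seen for it."""
    groups = defaultdict(set)
    for name in filenames:
        match = SSTABLE_FILE.match(name)
        if match:
            groups[match.group("base")].add(match.group("component"))
    return groups


def pending_deletion(filenames):
    """Return the file names whose SSTable also has a Compacted marker."""
    groups = group_by_sstable(filenames)
    compacted = {base for base, parts in groups.items() if "Compacted" in parts}
    stale = []
    for name in filenames:
        match = SSTABLE_FILE.match(name)
        if match and match.group("base") in compacted:
            stale.append(name)
    return stale


def summarize(data_dir):
    """Return (stale_file_count, total_file_count) for one data directory."""
    names = os.listdir(data_dir)
    return len(pending_deletion(names)), len(names)
```

Calling summarize() on the keyspace's data directory returns a pair of
counts. If the first number is close to the second, nearly everything in
that directory should disappear after the next full GC or restart.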


On Tue, 2011-08-02 at 16:09 -0400, Yiming Sun wrote:
> Hi,
> 
> I am new to Cassandra, and am hoping someone could help me understand
> the large number of small data files that Cassandra generates on
> disk.
> 
> We are trying Cassandra because we are dealing with thousands to
> millions of small text files on disk. We hope that by loading the
> files' contents into Cassandra we will get more efficient disk usage,
> since Cassandra should aggregate them into bigger files (one file per
> column family, according to the wiki).
> 
> But after we pushed a subset of the files into a single-node Cassandra
> v0.7.0 instance, we noticed that the Cassandra data directory for the
> keyspace contains 8.5 million very small files, most of them named
> 
>     <SuperColumnFamilyName>-e-<nnnnn>.Filter.db
>     <SuperColumnFamilyName>-e-<nnnnn>.Compacted.db
>     <SuperColumnFamilyName>-e-<nnnnn>.Index.db
>     <SuperColumnFamilyName>-e-<nnnnn>.Statistics.db
> 
> Among these files, the Compacted.db files are always empty, the
> Filter and Index files are under 100 bytes, and the Statistics files
> are around 4k.
> 
> What are these files? Why are there so many of them? We originally
> hoped that Cassandra would solve our problem with small files, but
> now it doesn't seem to help: we still end up with tons of small
> files. Is there any way to reduce or combine these small files?
> 
> Thanks.
> 
> -- Y.
