Hi Jason,

First question: what filesystem and OS are you running?
This has been an ongoing area of work: we fixed a few major issues in 1.2, a few more in 1.3, and 1.4 will ship a new tool ('kudu fs check') to diagnose and fix further issues. In some cases we underestimate the true size of the data, and in some cases we keep around data that could be cleaned up. I've listed the relevant JIRAs below if you are interested in the specifics.

It should be possible to get early access to the 'kudu fs check' tool by compiling Kudu locally, but I'm going to defer to Adar on that, since he's the resident expert on the subject.

KUDU-1755 <https://issues.apache.org/jira/browse/KUDU-1755>
KUDU-1853 <https://issues.apache.org/jira/browse/KUDU-1853>
KUDU-1856 <https://issues.apache.org/jira/browse/KUDU-1856>
KUDU-1769 <https://issues.apache.org/jira/browse/KUDU-1769>

On Wed, Apr 12, 2017 at 5:02 AM, Jason Heo <jason.heo....@gmail.com> wrote:

> Hello.
>
> I'm using Apache Kudu 1.2 on CDH 1.2.
>
> I'm estimating how many servers are needed to store my data.
>
> After loading my test data sets, total_kudu_on_disk_size_across_kudu_replicas
> in the CDH chart library is 27.9TB, whereas the sum of
> `du -sh /path/to/tablet_data/data` on each node is 39.9TB, which is 43%
> bigger than what the chart library reports.
>
> I also observed the same difference on another Kudu test cluster.
>
> I'm curious whether this is normal, and I wanted to know if there is a way
> to reduce the physical file size.
>
> Thanks,
>
> Jason.
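For what it's worth, the 43% figure in the quoted message checks out. A quick sketch of the arithmetic, using the two totals from the thread (the variable names are just illustrative):

```shell
# Totals reported in the thread, in TB:
metric_tb=27.9   # total_kudu_on_disk_size_across_kudu_replicas (CDH metric)
du_tb=39.9       # sum of `du -sh` over the tablet data dirs on each node

# Percentage by which on-disk usage exceeds the reported metric.
awk -v m="$metric_tb" -v d="$du_tb" \
    'BEGIN { printf "%.0f%%\n", (d - m) / m * 100 }'
# prints 43%
```

So the on-disk footprint really is about 43% larger than the metric, consistent with the known underestimation issues listed above.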