Hi Jason,

First question: what filesystem and OS are you running?

This has been an ongoing area of work; we fixed a few major issues in 1.2,
a few more in 1.3, and have a new tool ('kudu fs check') that will be
released in 1.4 to diagnose and fix further issues.  In some cases we are
underestimating the true size of the data, and in some cases we are keeping
around data that could be cleaned up.  I've included a list of the relevant
JIRAs below if you are interested in the specifics.  It should be possible
to get early access to the 'kudu fs check' tool by compiling Kudu locally,
but I'm going to defer to Adar on that, since he's the resident expert on
the subject.

KUDU-1755 <https://issues.apache.org/jira/browse/KUDU-1755>
KUDU-1853 <https://issues.apache.org/jira/browse/KUDU-1853>
KUDU-1856 <https://issues.apache.org/jira/browse/KUDU-1856>
KUDU-1769 <https://issues.apache.org/jira/browse/KUDU-1769>
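
If you do end up building from source to try 'kudu fs check' early, the
rough steps would look something like the following (just a sketch: the
exact command-line syntax may change before 1.4 ships, and the directory
paths are placeholders for whatever --fs_wal_dir / --fs_data_dirs your
tablet servers are actually configured with):

  # Build Kudu from source (see the Kudu "build from source" docs for
  # the full list of prerequisites):
  thirdparty/build-if-necessary.sh
  mkdir -p build/release && cd build/release
  cmake -DCMAKE_BUILD_TYPE=release ../..
  make -j4

  # With the tablet server stopped, point the tool at its WAL and data
  # directories:
  ./bin/kudu fs check --fs_wal_dir=/path/to/tablet_wal \
                      --fs_data_dirs=/path/to/tablet_data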

On Wed, Apr 12, 2017 at 5:02 AM, Jason Heo <jason.heo....@gmail.com> wrote:

> Hello.
>
> I'm using Apache Kudu 1.2 on CDH 1.2.
>
> I'm estimating how many servers are needed to store my data.
>
> After loading my test data sets, the
> total_kudu_on_disk_size_across_kudu_replicas metric in the CDH chart
> library is 27.9TB, whereas the sum of `du -sh /path/to/tablet_data/data`
> across the nodes is 39.9TB, which is 43% bigger than the chart library
> value.
>
> I also observed the same difference on another of my Kudu test clusters.
>
> I'm curious whether this is normal, and wanted to know if there is a way
> to reduce the physical file size.
>
> Thanks,
>
> Jason.