Hi,

Sorry for the late response here -- things got busy during the holidays.

Yes, the upcoming 1.2 release will include this fix.

-Todd

On Mon, Dec 12, 2016 at 1:41 AM, 阿香 <1654407...@qq.com> wrote:

>
> Todd,
>
> Thanks.
> I have not yet tried to reclaim the empty space from the containers.
> I will try it later this month.
>
> By the way, when will Kudu's next release come out? Will the 1.2 release
> in mid-January include this fix?
>
> Thanks.
> BR
> -GU
>
>
> ------------------ Original Message ------------------
> *From:* "Todd Lipcon" <t...@cloudera.com>
> *Sent:* Monday, December 12, 2016, 2:26 PM
> *To:* "user" <user@kudu.apache.org>
> *Subject:* Re: About data file size and on-disk size
>
> Just a follow-up note here: if you did end up cherry-picking that change,
> you should also be sure to cherry-pick 
> faa587c639aa9e5dcf3fac04259f46ba1921140a
> to avoid a potential data loss bug.
>
> On Wed, Nov 30, 2016 at 9:00 AM, Adar Dembo <a...@cloudera.com> wrote:
>
>> If you're comfortable rebuilding Kudu from source, you can apply
>> https://gerrit.cloudera.org/#/c/5254, rebuild the tserver, and restart
>> it. Once the tserver is done restarting, it should trim the empty space off
>> of the ends of all of your container data files.
>>
>> Otherwise, you'll have to wait until the next Kudu release.
>>
>> On Tue, Nov 29, 2016 at 5:48 PM, 阿香 <1654407...@qq.com> wrote:
>>
>>>
>>> Hi Todd,
>>>
>>> Thanks.
>>> From the results, it looks like you identified the bug correctly.
>>> By the way, can I reclaim the wasted disk space?
>>>
>>>
>>> # du -sm 542d51e55d524034a5274600c31abd11.data
>>> 29 542d51e55d524034a5274600c31abd11.data
>>>
>>> # filefrag -v -b 542d51e55d524034a5274600c31abd11.data
>>>
>>> filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
>>> Filesystem type is: ef53
>>> File size of 542d51e55d524034a5274600c31abd11.data is 10767867904 (10515496 blocks of 1024 bytes)
>>>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>>>    0: 10486144..10497543:  278086588.. 278097987:  11400:             unwritten
>>>    1: 10497544..10514191:  278691588.. 278708235:  16648:  278097988: unwritten
>>>    2: 10514192..10514199:  279581160.. 279581167:      8:  278708236: unwritten
>>>    3: 10514200..10514203:  280291284.. 280291287:      4:  279581168: unwritten
>>>    4: 10514204..10514227:  280652252.. 280652275:     24:  280291288: unwritten
>>>    5: 10514228..10515259:  281289216.. 281290247:   1032:  280652276: unwritten
>>>    6: 10515260..10515263:  282068816.. 282068819:      4:  281290248: unwritten
>>>    7: 10515264..10515495:  283429184.. 283429415:    232:  282068820: unwritten,eof
>>> 542d51e55d524034a5274600c31abd11.data: 8 extents found
>>>
>>> # echo $[11400 + 16648 + 1032 + 232]
>>> 29312
>>>
>>> # ls -l 542d51e55d524034a5274600c31abd11.data
>>> -rw-r--r-- 1 kudu kudu 10767867904 Oct 26 06:51
>>> 542d51e55d524034a5274600c31abd11.data
>>>
>>> # ls -lh 542d51e55d524034a5274600c31abd11.data
>>> -rw-r--r-- 1 kudu kudu 11G Oct 26 06:51 542d51e55d524034a5274600c31abd11.data
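>>> As a quick cross-check of the numbers above (a hedged sketch; the
>>> constants are copied from the filefrag and ls output, summing all
>>> eight extent lengths rather than only the four largest):

```shell
# Sum the eight extent lengths (1024-byte blocks) reported by filefrag,
# then compare against the apparent file size reported by ls.
apparent_bytes=10767867904
allocated_kb=$((11400 + 16648 + 8 + 4 + 24 + 1032 + 4 + 232))
echo "allocated: ${allocated_kb} KB (~$((allocated_kb / 1024)) MB; du rounds this up to 29 MB)"
echo "apparent:  ~$((apparent_bytes / 1024 / 1024 / 1024)) GB"
```

>>> So only about 29 MB of the ~10.7 GB file is actually allocated; the
>>> rest of the file is sparse, which is why du and ls disagree.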
>>>
>>> BR
>>> -GU
>>>
>>> ------------------ Original Message ------------------
>>> *From:* "Todd Lipcon" <t...@cloudera.com>
>>> *Sent:* Tuesday, November 29, 2016, 4:15 AM
>>> *To:* "user" <user@kudu.apache.org>
>>> *Subject:* Re: About data file size and on-disk size
>>>
>>> Hi Xiang,
>>>
>>> Adar and I did some investigation and came up with a likely cause:
>>> https://issues.apache.org/jira/browse/KUDU-1764
>>>
>>> Can you please run the following on one of your .data files (preferably
>>> one whose modification time is a few weeks old)?
>>>
>>> $ du -sm abcdef.data
>>> $ filefrag -v -b abcdef.data
>>> $ ls -l abcdef.data
>>>
>>> We can use this to confirm whether you are hitting the same bug we just
>>> discovered.
>>>
>>> Thanks
>>> -Todd
>>>
>>> On Thu, Nov 24, 2016 at 6:57 AM, 阿香 <1654407...@qq.com> wrote:
>>>
>>>>
>>>> > If the workload doesn't involve normal (merging) compactions, then
>>>> UNDOs won't be GCed at all. So, if you have a relatively static set of
>>>> keys, and are just updating them without causing many new inserts, this
>>>> could be the problem.
>>>>
>>>> The keys are not static; new keys are being inserted all the time.
>>>> The key of the table is a UUID string with hash partitioning (16 buckets).
>>>> Currently there are about 1,000,000,000 rows in this cluster.
>>>>
>>>> Will these big data files increase the latency of upsert operations?
>>>>
>>>> I see the following metrics in the Kudu web UI:
>>>>
>>>>             {
>>>>                 "name": "write_op_duration_client_propagated_consistency",
>>>>                 "total_count": 8568729,
>>>>                 "min": 116,
>>>>                 "mean": 2499.56,
>>>>                 "percentile_75": 2176,
>>>>                 "percentile_95": 7680,
>>>>                 "percentile_99": 29568,
>>>>                 "percentile_99_9": 78336,
>>>>                 "percentile_99_99": 123904,
>>>>                 "max": 1562967,
>>>>                 "total_sum": 21418050385
>>>>             }
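>>>> As a sanity check, the reported mean matches total_sum / total_count
>>>> (units are presumably microseconds; this is just arithmetic on the
>>>> histogram above):

```shell
# mean write-op latency = total_sum / total_count
awk 'BEGIN { printf "%.2f\n", 21418050385 / 8568729 }'   # prints 2499.56
```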
>>>>
>>>>
>>>>
>>>>
>>>> ------------------ Original Message ------------------
>>>> *From:* "Todd Lipcon" <t...@cloudera.com>
>>>> *Sent:* Thursday, November 24, 2016, 11:55 AM
>>>> *To:* "user" <user@kudu.apache.org>
>>>> *Subject:* Re: About data file size and on-disk size
>>>>
>>>> On Wed, Nov 23, 2016 at 2:30 PM, Adar Dembo <a...@cloudera.com> wrote:
>>>>
>>>>> The difference between du with --apparent-size and without suggests
>>>>> that hole punching is working properly. Quick back of the envelope
>>>>> math shows that with 8133 containers, each container is just over 10G
>>>>> of "apparent size", which means nearly all of the containers were full
>>>>> at one point or another. That makes sense; it means that Kudu is
>>>>> generally writing to a small number of containers at any given time,
>>>>> but is filling them up over time.
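>>>>> The back-of-the-envelope math can be reproduced directly (using the
>>>>> 81T apparent-size figure and the 8133-container count quoted below in
>>>>> this thread):

```shell
# 81 TiB of apparent size spread across 8133 containers is just over
# 10 GiB per container (the integer result is in MiB).
echo $((81 * 1024 * 1024 / 8133))   # prints 10443, i.e. ~10.2 GiB
```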
>>>>>
>>>>> I took a look at the tablet disk estimation code and found that it
>>>>> excludes the size of all of the UNDO data blocks. I think this is
>>>>> because the size estimation is also used to drive decisions regarding
>>>>> delta compaction, but with an UPSERT-only workload like yours, we'd
>>>>> expect to see many UNDO data blocks over time as updated (and now
>>>>> historical) data is further and further compacted. I filed
>>>>> https://issues.apache.org/jira/browse/KUDU-1755 to track these issues.
>>>>> However, if this were the case, I'd expect the "tablet history GC"
>>>>> feature (new in Kudu 1.0) to remove old data that was mutated in an
>>>>> UPSERT. The default value for --tablet_history_max_age_sec (which
>>>>> controls how old the data must be before it is removed) is 15 minutes;
>>>>> have you changed the value of this flag? If not, could you look at
>>>>> your tserver log for the presence of major delta compactions? Look for
>>>>> references to MajorDeltaCompactionOp. If there aren't any, that means
>>>>> Kudu isn't getting opportunities to age out old data.
>>>>>
>>>>
>>>> Worth noting that major delta compaction doesn't actually remove old
>>>> UNDOs. There are still some open JIRAs about scheduling tasks to age-off
>>>> UNDOs, but as it stands today, they only get collected during a normal
>>>> compaction.
>>>>
>>>> If the workload doesn't involve normal (merging) compactions, then
>>>> UNDOs won't be GCed at all. So, if you have a relatively static set of
>>>> keys, and are just updating them without causing many new inserts, this
>>>> could be the problem.
>>>>
>>>>
>>>>>
>>>>> It's also possible that simply not accounting for the composite index
>>>>> and bloom blocks (see KUDU-1755) is the reason. Take a look at
>>>>> https://issues.apache.org/jira/browse/KUDU-624?focusedCommentId=15165054&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15165054
>>>>> and run the same two commands to compare the total on-disk size of all
>>>>> the .data files to the number of bytes that the tserver is aware of.
>>>>> If the two numbers are close, it's a sign that, at the very least,
>>>>> Kudu is aware of and actively managing all that disk space (i.e.
>>>>> there's no "orphaned" data).
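>>>>> For reference, totaling the real (allocated) size of every .data file
>>>>> might look like the following; this is only a sketch using GNU du, and
>>>>> the exact commands to mirror are in the KUDU-624 comment linked above:

```shell
# Total on-disk (allocated) size of all container data files.
find /data/kudu/tserver/data -name '*.data' -print0 \
  | du -ch --files0-from=- | tail -n 1
```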
>>>>>
>>>>
>>>> -Todd
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Nov 23, 2016 at 12:39 AM, 阿香 <1654407...@qq.com> wrote:
>>>>> > Hi,
>>>>> >
>>>>> >> Can you tell us a little bit more about your table, as well as any
>>>>> >> deleted tables you once had? How many columns did they have?
>>>>> >
>>>>> > I have not deleted any tables.
>>>>> > There is only one table, with 12 columns (string and int types), in
>>>>> > the Kudu cluster. This cluster has three tablet servers.
>>>>> >
>>>>> > I use the upsert operation to insert and update rows.
>>>>> >
>>>>> >> what version of Kudu are you using?
>>>>> >
>>>>> > kudu -version
>>>>> > kudu 1.0.0
>>>>> > revision 6f6e49ca98c3e3be7d81f88ab8a0f9173959b191
>>>>> > build type RELEASE
>>>>> > built by jenkins at 16 Sep 2016 00:23:10 PST on
>>>>> > impala-ec2-pkg-centos-7-0dc0.vpc.cloudera.com
>>>>> > build id 2016-09-16_00-03-04
>>>>> >
>>>>> >> It's conceivable that there's a pathological case wherein each of
>>>>> >> the 8133 data files is used, one at a time, to store data blocks,
>>>>> >> which would cause each to allocate 32 MB of disk space (totaling
>>>>> >> about 254G).
>>>>> >
>>>>> > Can the number of data files be decreased? The SSD disk is almost
>>>>> > out of space now.
>>>>> >
>>>>> >> Can you try running du with --apparent-size and compare the results?
>>>>> >
>>>>> > # du -sh /data/kudu/tserver/data/
>>>>> > 213G /data/kudu/tserver/data/
>>>>> > # du -sh --apparent-size  /data/kudu/tserver/data/
>>>>> > 81T /data/kudu/tserver/data/
>>>>> >
>>>>> >> What filesystem is being used for /data/kudu/tserver/data?
>>>>> >
>>>>> > # file -s /dev/vdb1
>>>>> > /dev/vdb1: Linux rev 1.0 ext4 filesystem data,
>>>>> > UUID=9f95ba79-f387-42be-a43f-d1421c83e2e5 (needs journal recovery)
>>>>> > (extents) (64bit) (large files) (huge files)
>>>>> >
>>>>> >
>>>>> > Thanks.
>>>>> >
>>>>> >
>>>>> > ------------------ Original Message ------------------
>>>>> > From: "Adar Dembo" <a...@cloudera.com>
>>>>> > Sent: Wednesday, November 23, 2016, 9:35 AM
>>>>> > To: "user" <user@kudu.apache.org>
>>>>> > Subject: Re: About data file size and on-disk size
>>>>> >
>>>>> > Also, if you haven't explicitly disabled it, each .data file is going
>>>>> > to preallocate 32 MB of data when used. It's conceivable that there's
>>>>> > a pathological case wherein each of the 8133 data files is used, one
>>>>> > at a time, to store data blocks, which would cause each to allocate
>>>>> > 32 MB of disk space (totaling about 254G).
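>>>>> > That worst case is easy to check numerically:

```shell
# 8133 containers x 32 MiB of preallocation each is roughly 254 GiB.
echo $((8133 * 32 / 1024))   # prints 254
```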
>>>>> >
>>>>> > Can you tell us a little bit more about your table, as well as any
>>>>> > deleted tables you once had? How many columns did they have? Also,
>>>>> > what version of Kudu are you using?
>>>>> >
>>>>> > On Tue, Nov 22, 2016 at 11:39 AM, Adar Dembo <a...@cloudera.com> wrote:
>>>>> >> The files in /data/kudu/tserver/data are supposed to be sparse; that
>>>>> >> is, when Kudu decides to delete data, it'll punch a hole in one of
>>>>> >> those files, allowing the filesystem to reclaim the space in that
>>>>> >> hole. Yet, 'du' should reflect that because it measures real space
>>>>> >> usage. Can you try running du with --apparent-size and compare the
>>>>> >> results? If they're the same or similar, it suggests that the hole
>>>>> >> punching behavior isn't working properly. What distribution are you
>>>>> >> using? What filesystem is being used for /data/kudu/tserver/data?
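>>>>> >> Hole punching itself is easy to demonstrate in isolation (a sketch
>>>>> >> that assumes GNU fallocate and a filesystem supporting
>>>>> >> FALLOC_FL_PUNCH_HOLE, such as ext4 or xfs):

```shell
# Write 4 MiB to a file, punch a 2 MiB hole at the start, then compare
# the apparent size (unchanged) with the real usage (reduced).
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1M count=4 2>/dev/null
fallocate --punch-hole --offset 0 --length $((2 * 1024 * 1024)) "$f"
ls -l "$f"    # apparent size: still 4 MiB
du -k "$f"    # real usage: roughly 2 MiB less than before
rm -f "$f"
```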
>>>>> >>
>>>>> >> You should also check if maybe Kudu has failed to delete the data
>>>>> >> belonging to deleted tables. Has this tserver hosted any tablets
>>>>> >> belonging to tables that have since been deleted? Does the tserver
>>>>> >> log describe any errors when trying to delete the data belonging to
>>>>> >> those tablets?
>>>>> >>
>>>>> >> On Tue, Nov 22, 2016 at 7:19 AM, 阿香 <1654407...@qq.com> wrote:
>>>>> >>> Hi,
>>>>> >>>
>>>>> >>>
>>>>> >>> I have a table with 16 buckets over 3 physical machines. Each
>>>>> >>> tablet has only one replica.
>>>>> >>>
>>>>> >>>
>>>>> >>> The Tablets web UI shows that each tablet has an on-disk size of
>>>>> >>> around ~4.5G.
>>>>> >>>
>>>>> >>> On one machine there are 8 tablets in total, so the on-disk size
>>>>> >>> should be about 4.5 * 8 = 36G.
>>>>> >>>
>>>>> >>> However, the disk space actually used on that machine is about 211G.
>>>>> >>>
>>>>> >>>
>>>>> >>> # du -sh /data/kudu/tserver/data/
>>>>> >>>
>>>>> >>> 210G /data/kudu/tserver/data/
>>>>> >>>
>>>>> >>>
>>>>> >>> # find /data/kudu/tserver/data/ -name "*.data" | wc -l
>>>>> >>>
>>>>> >>> 8133
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> What's the difference between the data file size and the reported
>>>>> >>> on-disk size?
>>>>> >>>
>>>>> >>> Can files in /data/kudu/tserver/data/ be compacted or purged, or
>>>>> >>> can some of them be deleted?
>>>>> >>>
>>>>> >>>
>>>>> >>> Thanks very much.
>>>>> >>>
>>>>> >>>
>>>>> >>> BR
>>>>> >>>
>>>>> >>> Brooks
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera
