[jira] [Resolved] (KUDU-1508) Log block manager triggers ext4 hole punching bug in el6

2017-03-22 Thread Adar Dembo (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adar Dembo resolved KUDU-1508.
--
Resolution: Fixed

Spoke offline with JD and Todd. We've decided to make the following the  
project's official position:
# Because of the block limits in place, it's pretty rare to encounter this 
corruption in the wild. In our test cluster, only 21 containers out of ~600,000 
were corrupted.
# We believe the corruption is harmless and can be repaired as part of a 
regular fsck.
# It's specific to ext4, so switching to xfs will solve the problem.
# Red Hat has fixed the underlying issue in el6.9 and has also backported the 
fix to the kernel in el6.8.

So I'm going to leave this FIXED for the time being, but please reopen if 
you're experiencing this and the above recommendations do not work for you!

> Log block manager triggers ext4 hole punching bug in el6
> 
>
> Key: KUDU-1508
> URL: https://issues.apache.org/jira/browse/KUDU-1508
> Project: Kudu
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 0.9.0
>Reporter: Todd Lipcon
>Assignee: Adar Dembo
>Priority: Blocker
> Fix For: 1.2.0
>
> Attachments: debugfs.txt, e9f83e4acef3405f99d01914317351ce.metadata, 
> filefrag.txt, pbc_dump.txt, replay_container.py
>
>
> I've experienced many times that when I reboot an el6 node that was running 
> Kudu tservers, fsck reports issues like:
> data6 contains a file system with errors, check forced.
> data6: Interior extent node level 0 of inode 5259348:
> Logical start 154699 does not match logical start 2623046 at next level.  
> After some investigation, I've determined that this is due to an ext4 kernel 
> bug: https://patchwork.ozlabs.org/patch/206123/
> Details in a comment to follow.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-1508) Log block manager triggers ext4 hole punching bug in el6

2017-03-22 Thread Adar Dembo (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937297#comment-15937297
 ] 

Adar Dembo commented on KUDU-1508:
--

Some additional stats.
* Each node has had ~167m blocks created and ~166m blocks deleted, with ~600k 
live blocks.
* Each node is consuming ~1.6 TB of disk space.
* Of the ~120,000 containers on each node, ~117,000 have no live blocks. I 
calculated this using the following shell snippet:
{noformat}
for md in $(find /data/*/kudu/data/ -name '*.metadata'); do
  kudu pbc dump --oneline $md | \
    awk '/CREATE/ { creates++ } /DELETE/ { deletes++ } END { if (creates != deletes) exit 1 }' \
    && echo $md
done | wc -l
{noformat}
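
For readability, here is a rough Python equivalent of that one-liner. It makes the same assumption the awk script does, namely that each block creation and deletion shows up as a line containing CREATE or DELETE in the {{kudu pbc dump --oneline}} output; the glob pattern is only a sketch of the find above.

{code}
# Rough Python equivalent of the shell one-liner above (illustrative sketch):
# a container is counted as "dead" when every block CREATE in its metadata has
# a matching DELETE, i.e. it holds no live blocks.
import glob
import subprocess

dead = 0
for md in glob.glob('/data/*/kudu/data/**/*.metadata', recursive=True):
    out = subprocess.check_output(['kudu', 'pbc', 'dump', '--oneline', md])
    lines = out.splitlines()
    creates = sum(1 for l in lines if b'CREATE' in l)
    deletes = sum(1 for l in lines if b'DELETE' in l)
    if creates == deletes:
        dead += 1
print(dead)
{code}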


> Log block manager triggers ext4 hole punching bug in el6
> 
>
> Key: KUDU-1508
> URL: https://issues.apache.org/jira/browse/KUDU-1508
> Project: Kudu
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 0.9.0
>Reporter: Todd Lipcon
>Assignee: Adar Dembo
>Priority: Blocker
> Fix For: 1.2.0
>
> Attachments: debugfs.txt, e9f83e4acef3405f99d01914317351ce.metadata, 
> filefrag.txt, pbc_dump.txt, replay_container.py
>
>
> I've experienced many times that when I reboot an el6 node that was running 
> Kudu tservers, fsck reports issues like:
> data6 contains a file system with errors, check forced.
> data6: Interior extent node level 0 of inode 5259348:
> Logical start 154699 does not match logical start 2623046 at next level.  
> After some investigation, I've determined that this is due to an ext4 kernel 
> bug: https://patchwork.ozlabs.org/patch/206123/
> Details in a comment to follow.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-1830) Reduce Kudu WAL log disk usage

2017-03-22 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-1830:
--
Target Version/s: 1.4.0

> Reduce Kudu WAL log disk usage
> --
>
> Key: KUDU-1830
> URL: https://issues.apache.org/jira/browse/KUDU-1830
> Project: Kudu
>  Issue Type: Improvement
>  Components: consensus, log
>Reporter: Juan Yu
>  Labels: data-scalability
>
> The WAL can take significant disk space. There are some configs to limit it, 
> but usage can still grow very high.
> WAL size = #tablets * log_segment_size_mb * #retained segments (at least 1 if 
> there are write ops to the tablet, up to log_max_segments_to_retain); see the 
> rough arithmetic below.
> Segments are retained even if there have been no writes for a while.
> We could reduce WAL usage by:
> - reducing min_segments_to_retain to 1 instead of 2, and
> - reducing the steady-state consumption of idle tablets: roll the WAL if a 
> tablet has had no writes for a few minutes and its segment is larger than a 
> MB or two, so that "idle" tablets have 0 WAL space consumed.
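
For a rough sense of scale for the formula above, a back-of-the-envelope calculation; the tablet count and flag values below are hypothetical, chosen only for illustration, not asserted defaults.

{code}
# Back-of-the-envelope WAL footprint for one tablet server, using the formula
# above. Every number here is hypothetical.
num_tablets = 1000          # tablets hosted on this tserver
segment_size_mb = 8         # log_segment_size_mb
min_segments = 2            # min_segments_to_retain
max_segments = 10           # log_max_segments_to_retain (illustrative value)

floor_gb = num_tablets * segment_size_mb * min_segments / 1024.0
ceiling_gb = num_tablets * segment_size_mb * max_segments / 1024.0
print("WAL usage: %.1f GB (floor) up to %.1f GB (ceiling)" % (floor_gb, ceiling_gb))
# -> WAL usage: 15.6 GB (floor) up to 78.1 GB (ceiling)
{code}

Even the floor grows quickly once a server hosts thousands of tablets, which is what makes rolling idle tablets' WALs down to zero attractive.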



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-680) block cache limit seems to not be fully respected

2017-03-22 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937258#comment-15937258
 ] 

Todd Lipcon commented on KUDU-680:
--

I wrote some code to sample cache block references and keep stack traces for 
those that are removed from the LRU but not freed, and found the leak was a bug 
in an in-flight patch I'm working on.

I think the remaining cases are mostly the bloom reader caches, which I'm also 
working on removing (in the same patch that introduced the new leak, no less). 
Will confirm with a longer run.
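
Not the actual patch, but a minimal sketch of that debugging technique: record a stack trace for a sampled subset of cache-handle acquisitions, forget it on release, and whatever remains points at the code path holding a leaked reference. All names here are made up for illustration.

{code}
# Illustrative sketch only; this is not Kudu's actual implementation.
import random
import traceback

SAMPLE_RATE = 0.01
outstanding = {}  # handle id -> formatted stack at acquisition time

def on_acquire(handle_id):
    # Sample a small fraction of acquisitions to keep overhead low.
    if random.random() < SAMPLE_RATE:
        outstanding[handle_id] = "".join(traceback.format_stack())

def on_release(handle_id):
    outstanding.pop(handle_id, None)

def dump_suspected_leaks():
    # Handles acquired but never released; their stacks point at the holder.
    for handle_id, stack in outstanding.items():
        print("handle %r still referenced; acquired at:\n%s" % (handle_id, stack))
{code}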

> block cache limit seems to not be fully respected
> -
>
> Key: KUDU-680
> URL: https://issues.apache.org/jira/browse/KUDU-680
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: Private Beta
>Reporter: Todd Lipcon
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> The ITBLL cluster configures block cache capacity to 512MB, but the 
> memtracker is reporting 685MB of usage. It's clearly not un-bounded, because 
> this server has been up for days doing lots of work, but we're either not 
> properly counting the memory, or not properly respecting the configured 
> limit. Maybe we're only counting the values and not the keys, or something of 
> that nature.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-1860) ksck doesn't identify tablets that are evicted but still in config

2017-03-22 Thread Mike Percy (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937037#comment-15937037
 ] 

Mike Percy commented on KUDU-1860:
--

Even worse than this, but related: we saw a case in the field where a tablet had 
no leader, and ksck did not report that (it reported the tablet as healthy) 
because the old leader was still cached in the master's data.

> ksck doesn't identify tablets that are evicted but still in config
> --
>
> Key: KUDU-1860
> URL: https://issues.apache.org/jira/browse/KUDU-1860
> Project: Kudu
>  Issue Type: Bug
>  Components: ksck, ops-tooling
>Affects Versions: 1.2.0
>Reporter: Jean-Daniel Cryans
>
> As reported by a user on Slack, ksck can give you a wrong output such as:
> {noformat}
>   ca199fafca544df2a1b2a01be9d5266d (server1:7250): RUNNING [LEADER]
>   a077957f627c4758ab5a989aca8a1ca8 (server2:7250): RUNNING
>   5c09a555c205482b8131f15b2c249ec6 (server3:7250): bad state
> State:   NOT_STARTED
> Data state:  TABLET_DATA_TOMBSTONED
> Last status: Tablet initializing...
> {noformat}
> The problem is that server2 was already evicted from the configuration (based 
> on reading the logs), but the change wasn't committed in the config (which 
> contains servers 1 and 3) since there's really only 1 server left out of 3.
> Ideally ksck should ask each server what it thinks the configuration is and 
> compare that with what's in the master. As it is, it looks like we're missing 
> 1 replica, but in reality this is a broken tablet.
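
A rough sketch of that suggestion, with entirely hypothetical data and no real ksck APIs: gather the Raft config the master has for the tablet and the config each reachable replica reports, and flag any disagreement.

{code}
# Hypothetical sketch of the suggested ksck check: compare the master's view of
# a tablet's Raft config against what each replica itself reports. The tablet
# and server names below are made up; this is not ksck's real data model.
def check_config_agreement(tablet_id, master_config, replica_configs):
    master_peers = set(master_config)
    for ts, cfg in sorted(replica_configs.items()):
        peers = set(cfg)
        if peers != master_peers:
            print("Tablet %s: %s reports config %s but the master has %s"
                  % (tablet_id, ts, sorted(peers), sorted(master_peers)))

# Example: the master still lists three peers, but two replicas have already
# agreed on a config that evicted the third.
check_config_agreement(
    "tablet-1",
    ["A", "B", "C"],
    {"ts-A": ["A", "C"], "ts-C": ["A", "C"]},
)
{code}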



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-1954) Improve maintenance manager behavior in heavy write workload

2017-03-22 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936844#comment-15936844
 ] 

Todd Lipcon commented on KUDU-1954:
---

Started writing various notes here:
https://docs.google.com/document/d/17-2CcmrjxZY0Gd9wDUh83xCCNL574famw8_2Bhfu-_8/edit?usp=sharing

> Improve maintenance manager behavior in heavy write workload
> 
>
> Key: KUDU-1954
> URL: https://issues.apache.org/jira/browse/KUDU-1954
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tserver
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
> Attachments: mm-trace.png
>
>
> During the investigation in [this 
> doc|https://docs.google.com/document/d/1U1IXS1XD2erZyq8_qG81A1gZaCeHcq2i0unea_eEf5c/edit]
>  I found a few maintenance-manager-related issues during heavy writes:
> - we don't schedule flushes until we are already in "backpressure" realm, so 
> we spent most of our time doing backpressure
> - even if we configure N maintenance threads, we typically are only using 
> ~50% of those threads due to the scheduling granularity
> - when we do hit the "memory-pressure flush" threshold, all threads quickly 
> switch to flushing, which then brings us far beneath the threshold
> - long running compactions can temporarily starve flushes
> - high volume of writes can starve compactions



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-861) Add client APIs for more alter-schema operations

2017-03-22 Thread Matthew Jacobs (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936801#comment-15936801
 ] 

Matthew Jacobs commented on KUDU-861:
-

[~wdberkeley] is this still being considered?

> Add client APIs for more alter-schema operations
> 
>
> Key: KUDU-861
> URL: https://issues.apache.org/jira/browse/KUDU-861
> Project: Kudu
>  Issue Type: Improvement
>  Components: api, client
>Affects Versions: Private Beta
>Reporter: Todd Lipcon
>Assignee: Will Berkeley
>Priority: Critical
>
> We don't seem to have any APIs to do some important operations:
> - SET DEFAULT / DROP DEFAULT
> - change storage properties (encoding/compression)
> We should add these in both clients and add some tests.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KUDU-1954) Improve maintenance manager behavior in heavy write workload

2017-03-22 Thread Todd Lipcon (JIRA)

 [ 
https://issues.apache.org/jira/browse/KUDU-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated KUDU-1954:
--
Attachment: mm-trace.png

> Improve maintenance manager behavior in heavy write workload
> 
>
> Key: KUDU-1954
> URL: https://issues.apache.org/jira/browse/KUDU-1954
> Project: Kudu
>  Issue Type: Improvement
>  Components: perf, tserver
>Affects Versions: 1.3.0
>Reporter: Todd Lipcon
> Attachments: mm-trace.png
>
>
> During the investigation in [this 
> doc|https://docs.google.com/document/d/1U1IXS1XD2erZyq8_qG81A1gZaCeHcq2i0unea_eEf5c/edit]
>  I found a few maintenance-manager-related issues during heavy writes:
> - we don't schedule flushes until we are already in "backpressure" realm, so 
> we spent most of our time doing backpressure
> - even if we configure N maintenance threads, we typically are only using 
> ~50% of those threads due to the scheduling granularity
> - when we do hit the "memory-pressure flush" threshold, all threads quickly 
> switch to flushing, which then brings us far beneath the threshold
> - long running compactions can temporarily starve flushes
> - high volume of writes can starve compactions



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (KUDU-1954) Improve maintenance manager behavior in heavy write workload

2017-03-22 Thread Todd Lipcon (JIRA)
Todd Lipcon created KUDU-1954:
-

 Summary: Improve maintenance manager behavior in heavy write 
workload
 Key: KUDU-1954
 URL: https://issues.apache.org/jira/browse/KUDU-1954
 Project: Kudu
  Issue Type: Improvement
  Components: perf, tserver
Affects Versions: 1.3.0
Reporter: Todd Lipcon


During the investigation in [this 
doc|https://docs.google.com/document/d/1U1IXS1XD2erZyq8_qG81A1gZaCeHcq2i0unea_eEf5c/edit]
 I found a few maintenance-manager-related issues during heavy writes:
- we don't schedule flushes until we are already in "backpressure" realm, so we 
spent most of our time doing backpressure
- even if we configure N maintenance threads, we typically are only using ~50% 
of those threads due to the scheduling granularity
- when we do hit the "memory-pressure flush" threshold, all threads quickly 
switch to flushing, which then brings us far beneath the threshold
- long running compactions can temporarily starve flushes
- high volume of writes can starve compactions



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-680) block cache limit seems to not be fully respected

2017-03-22 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936621#comment-15936621
 ] 

Todd Lipcon commented on KUDU-680:
--

The block cache 'usage_' members seem to respect the size, but the memtracker 
doesn't. Verified using gdb:

{code}
# Sum the block cache's per-shard 'usage_' counters (64 shards) to get the
# total bytes the cache itself thinks it is using.
set $c = 'Singleton::instance_'
set $c2 = $c->cache_.impl_.data_.ptr
set $sum = 0
set $i = 0
while ($i < 64)
  set $u = $c2->shards_._M_impl._M_start[$i]->usage_
  set $i = $i + 1
  set $sum = $sum + $u
end
p $sum
{code}

sum = 536539241 (512MB). But the memtracker on this server is reporting 3GB 
(and seems to be accurate as to actual space usage).

So the question is what's grabbing block cache references and then not 
dereffing them.

> block cache limit seems to not be fully respected
> -
>
> Key: KUDU-680
> URL: https://issues.apache.org/jira/browse/KUDU-680
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: Private Beta
>Reporter: Todd Lipcon
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> The ITBLL cluster configures block cache capacity to 512MB, but the 
> memtracker is reporting 685MB of usage. It's clearly not un-bounded, because 
> this server has been up for days doing lots of work, but we're either not 
> properly counting the memory, or not properly respecting the configured 
> limit. Maybe we're only counting the values and not the keys, or something of 
> that nature.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KUDU-680) block cache limit seems to not be fully respected

2017-03-22 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936573#comment-15936573
 ] 

Todd Lipcon commented on KUDU-680:
--

I'm now able to reproduce this by starting a Kudu server and then inserting 
7-8B rows into it via Impala. Interestingly, starting a server with the same 
settings and inserting twice as many rows using tpch_real_world doesn't seem to 
have an effect of the same magnitude.

> block cache limit seems to not be fully respected
> -
>
> Key: KUDU-680
> URL: https://issues.apache.org/jira/browse/KUDU-680
> Project: Kudu
>  Issue Type: Bug
>  Components: tablet
>Affects Versions: Private Beta
>Reporter: Todd Lipcon
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> The ITBLL cluster configures block cache capacity to 512MB, but the 
> memtracker is reporting 685MB of usage. It's clearly not un-bounded, because 
> this server has been up for days doing lots of work, but we're either not 
> properly counting the memory, or not properly respecting the configured 
> limit. Maybe we're only counting the values and not the keys, or something of 
> that nature.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)