[jira] [Commented] (KUDU-2359) tserver should allow starting with a small number of missing data dirs
[ https://issues.apache.org/jira/browse/KUDU-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405844#comment-16405844 ]

Todd Lipcon commented on KUDU-2359:
-----------------------------------

cc [~anjuwong] for thoughts

> tserver should allow starting with a small number of missing data dirs
> ----------------------------------------------------------------------
>
>                 Key: KUDU-2359
>                 URL: https://issues.apache.org/jira/browse/KUDU-2359
>             Project: Kudu
>          Issue Type: Improvement
>          Components: fs, tserver
>            Reporter: Todd Lipcon
>            Priority: Major
>
> Often when a disk fails, its mount point will not come back up when the
> server is restarted. Currently, Kudu will respond to this by failing to
> restart with an error like:
> F0314 18:23:39.353916 112051 tablet_server_main.cc:80] Check failed: _s.ok()
> Bad status: Already present: FS layout already exists; not overwriting
> existing layout. See
> https://kudu.apache.org/releases/1.8.0-SNAPSHOT/docs/troubleshooting.html:
> unable to create file system roots: FSManager roots already exist:
> /data/1/kudu,/data/2/kudu,/data/3/kudu,/data/5/kudu,/data/6/kudu,/data/7/kudu,/data/8/kudu,/data/1/kudu-wal
> However, this defeats some of the advantages of the "allow single disk
> failure" work. One could use the update_data_dirs tool to remove the missing
> disk, but you'd also need to persistently change the configuration of the
> daemon, which is hard to do with consistent configuration management.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KUDU-2359) tserver should allow starting with a small number of missing data dirs
Todd Lipcon created KUDU-2359:
---------------------------------

             Summary: tserver should allow starting with a small number of missing data dirs
                 Key: KUDU-2359
                 URL: https://issues.apache.org/jira/browse/KUDU-2359
             Project: Kudu
          Issue Type: Improvement
          Components: fs, tserver
            Reporter: Todd Lipcon

Often when a disk fails, its mount point will not come back up when the server is restarted. Currently, Kudu will respond to this by failing to restart with an error like:

F0314 18:23:39.353916 112051 tablet_server_main.cc:80] Check failed: _s.ok() Bad status: Already present: FS layout already exists; not overwriting existing layout. See https://kudu.apache.org/releases/1.8.0-SNAPSHOT/docs/troubleshooting.html: unable to create file system roots: FSManager roots already exist: /data/1/kudu,/data/2/kudu,/data/3/kudu,/data/5/kudu,/data/6/kudu,/data/7/kudu,/data/8/kudu,/data/1/kudu-wal

However, this defeats some of the advantages of the "allow single disk failure" work. One could use the update_data_dirs tool to remove the missing disk, but you'd also need to persistently change the configuration of the daemon, which is hard to do with consistent configuration management.
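As a rough illustration of the behavior KUDU-2359 asks for, the sketch below starts with whatever configured data directories are present, as long as no more than a small number are missing. It is not Kudu code: the function name `select_usable_data_dirs` and the `max_missing` threshold are hypothetical, invented here only to make the idea concrete.

```python
import os
from typing import List, Sequence, Tuple


def select_usable_data_dirs(
    configured: Sequence[str], max_missing: int = 1
) -> Tuple[List[str], List[str]]:
    """Split configured data dirs into (usable, missing).

    Hypothetical policy: tolerate up to `max_missing` absent directories
    instead of refusing to start when any single one is gone.
    """
    missing = [d for d in configured if not os.path.isdir(d)]
    if len(missing) > max_missing:
        raise RuntimeError(
            f"{len(missing)} data dirs missing (limit {max_missing}): {missing}"
        )
    usable = [d for d in configured if d not in missing]
    return usable, missing
```

With the directory list from the error above plus the absent /data/4/kudu, a threshold of 1 would let the server come up on the remaining directories, while two missing mount points would still be treated as fatal.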
[jira] [Created] (KUDU-2358) ~600MB of untracked string memory
Todd Lipcon created KUDU-2358:
---------------------------------

             Summary: ~600MB of untracked string memory
                 Key: KUDU-2358
                 URL: https://issues.apache.org/jira/browse/KUDU-2358
             Project: Kudu
          Issue Type: Improvement
          Components: tserver
    Affects Versions: 1.8.0
            Reporter: Todd Lipcon

Looking at a heap dump of a tserver which has 1.6G of tracked memory, there is about 600MB of apparently-untracked memory held in std::string objects. Since libstdcxx doesn't have frame pointers, tcmalloc profiling can't show the stack trace responsible here, but this merits further investigation (probably using a custom-built libstdcxx with frame pointers).
[jira] [Updated] (KUDU-2351) Error message for recv failure should include IP/port, etc
[ https://issues.apache.org/jira/browse/KUDU-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated KUDU-2351:
------------------------------
    Labels: newbie  (was: )

> Error message for recv failure should include IP/port, etc
> ----------------------------------------------------------
>
>                 Key: KUDU-2351
>                 URL: https://issues.apache.org/jira/browse/KUDU-2351
>             Project: Kudu
>          Issue Type: Improvement
>          Components: client, rpc, supportability
>            Reporter: Todd Lipcon
>            Priority: Major
>              Labels: newbie
>
> I was running an Impala query and killed a server. The resulting error was
> just:
> WARNINGS: Unable to advance iterator: Network error: Recv() got EOF
> from remote (error 108)
> We should make the error coming back at the scanner level more informative
> for when a non-fault-tolerant scan fails due to a single tablet failure.
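One way to picture the improvement requested in KUDU-2351 is a helper that appends the remote endpoint to the low-level error before it is surfaced to the scanner. This is only a sketch: `describe_recv_failure` is a hypothetical name, the address 192.0.2.10 is a documentation placeholder, and nothing here reflects how the Kudu client or RPC layer actually builds its messages.

```python
def describe_recv_failure(
    base_error: str, remote_host: str, remote_port: int, **context: str
) -> str:
    """Append the remote endpoint (and any extra context) to a low-level error."""
    details = [f"remote={remote_host}:{remote_port}"]
    # Extra key/value context is optional and sorted for stable output.
    details += [f"{key}={value}" for key, value in sorted(context.items())]
    return f"{base_error} [{', '.join(details)}]"


# Placeholder address; 7050 is the tserver port that appears in ksck output.
message = describe_recv_failure(
    "Network error: Recv() got EOF from remote (error 108)",
    "192.0.2.10",
    7050,
)
```

The original text of the error is kept intact at the front, so anything that already matches on it keeps working, and the endpoint is added as a suffix.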
[jira] [Created] (KUDU-2357) Allow altering the replication factor of a table or even a single tablet
Mike Percy created KUDU-2357:
--------------------------------

             Summary: Allow altering the replication factor of a table or even a single tablet
                 Key: KUDU-2357
                 URL: https://issues.apache.org/jira/browse/KUDU-2357
             Project: Kudu
          Issue Type: New Feature
          Components: consensus
            Reporter: Mike Percy

It would be useful in certain cases to be able to alter the replication factor of an existing table or partition. This could be useful in a backup / restore context, or when converting a testing cluster to be a production cluster.
[jira] [Assigned] (KUDU-2250) Document odd interaction between upserts and Spark Datasets
[ https://issues.apache.org/jira/browse/KUDU-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fengling Wang reassigned KUDU-2250:
-----------------------------------
    Assignee: Fengling Wang

> Document odd interaction between upserts and Spark Datasets
> -----------------------------------------------------------
>
>                 Key: KUDU-2250
>                 URL: https://issues.apache.org/jira/browse/KUDU-2250
>             Project: Kudu
>          Issue Type: Task
>          Components: spark
>    Affects Versions: 1.6.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Fengling Wang
>            Priority: Major
>              Labels: newbie
>
> We need to document a specific behavior of Spark Datasets that runs contrary
> to how Kudu works.
> Say you have 3 columns "k, x, y" where k is the primary key.
> You run a first insert on a row "k=1, x=2, y=3".
> Now you upsert "k=1, y=4".
> Using any Kudu API, the full row would now be "k=1, x=2, y=4" but with
> Datasets you have "k=1, x=*NULL*, y=4". This means that Datasets put a null
> value when some columns aren't specified.
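The difference described in KUDU-2250 can be modeled with plain dictionaries. The sketch below uses neither the Kudu client nor Spark; `kudu_style_upsert` and `dataset_style_upsert` are hypothetical names that only mimic the two outcomes quoted in the issue for the "k, x, y" example.

```python
from typing import Dict, Optional

Row = Dict[str, Optional[int]]
COLUMNS = ("k", "x", "y")  # k is the primary key


def kudu_style_upsert(table: Dict[int, Row], partial: Row) -> None:
    """Write only the columns present in `partial`; others keep stored values."""
    key = partial["k"]
    existing = table.get(key, {column: None for column in COLUMNS})
    table[key] = {**existing, **partial}


def dataset_style_upsert(table: Dict[int, Row], partial: Row) -> None:
    """Write every column; columns absent from `partial` become NULL (None)."""
    key = partial["k"]
    table[key] = {column: partial.get(column) for column in COLUMNS}


kudu_table: Dict[int, Row] = {}
kudu_style_upsert(kudu_table, {"k": 1, "x": 2, "y": 3})
kudu_style_upsert(kudu_table, {"k": 1, "y": 4})
# kudu_table[1] is now {"k": 1, "x": 2, "y": 4}

dataset_table: Dict[int, Row] = {}
dataset_style_upsert(dataset_table, {"k": 1, "x": 2, "y": 3})
dataset_style_upsert(dataset_table, {"k": 1, "y": 4})
# dataset_table[1] is now {"k": 1, "x": None, "y": 4}
```

The second model overwrites x because every row it writes carries all columns, so an unspecified column is indistinguishable from an explicit null.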
[jira] [Commented] (KUDU-2152) Tablet stuck under-replicated after some kind of tablet copy issue
[ https://issues.apache.org/jira/browse/KUDU-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405252#comment-16405252 ]

Mike Percy commented on KUDU-2152:
----------------------------------

I think this may have a related cause to KUDU-2293, where certain faults are no longer fatal due to disk failure work and so our error handling isn't as robust as it should be in the tablet copy client cleanup code.

> Tablet stuck under-replicated after some kind of tablet copy issue
> ------------------------------------------------------------------
>
>                 Key: KUDU-2152
>                 URL: https://issues.apache.org/jira/browse/KUDU-2152
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.5.0
>            Reporter: Todd Lipcon
>            Assignee: Andrew Wong
>            Priority: Critical
>         Attachments: raft_consensus_stress-itest.txt.gz
>
>
> I was stress testing with the following setup:
> - 8 servers (n1-standard-4 GCE boxes)
> - created a bunch of 100-tablet tables using loadgen until I had ~2500
> replicas on each server
> - mounted another server using sshfs and put cmeta on that mount point (to
> make slower cmeta writes)
> - stress -c4 on all machines
> - shut down a server and wait for re-replication (green ksck), restart the
> server, rinse repeat
> Eventually I got a stuck tablet. ksck reports:
> {code}
> Tablet 271df8901d98442cb478593babd8a609 of table
> 'loadgen_auto_8e32cb07eb83458da4ec4d228bcb0f5a' is under-replicated: 1
> replica(s) not RUNNING
>   20d4d86f182043398594b67492d13fdc (kudu513-8.gce.cloudera.com:7050): RUNNING [LEADER]
>   c2ea8f22f4034bcc97e26c9236811960 (kudu513-1.gce.cloudera.com:7050): bad state
>     State:       STOPPED
>     Data state:  TABLET_DATA_COPYING
>     Last status: Deleted tablet blocks from disk
>   cd0997b908ad41839f56a1b61210f2d4 (kudu513-3.gce.cloudera.com:7050): RUNNING
> 1 replicas' active configs differ from the master's.
>   All the peers reported by the master and tablet servers are:
>   A = 20d4d86f182043398594b67492d13fdc
>   D = 471027436ee8405ab7cdf8d22407696b
>   B = c2ea8f22f4034bcc97e26c9236811960
>   C = cd0997b908ad41839f56a1b61210f2d4
> The consensus matrix is:
>  Config source | Voters   | Current term | Config index | Committed?
> ---------------+----------+--------------+--------------+------------
>  master        | A*  B  C |              |              | Yes
>  A             | A*  B  C | 11           | 29           | Yes
>  B             | D   B  C | 9            | 23           | Yes
>  C             | A*  B  C | 11           | 29           | Yes
> {code}
> The leader ("A" above) just keeps reporting that it's failing to send
> requests to "B" because it's getting TABLET_NOT_RUNNING. So it never evicts
> it (the leader treats TABLET_NOT_RUNNING as a temporary condition assuming
> that it actually means BOOTSTRAPPING).
> The last entries in "B"'s logs were:
> {code}
> I0920 16:41:48.556422 3808 tablet_copy_client.cc:209] T
> 271df8901d98442cb478593babd8a609 P c2ea8f22f4034bcc97e26c9236811960: tablet
> copy: Beginning tablet copy session from remote peer at address
> kudu513-8.gce.cloudera.com:7050
> I0920 16:41:48.562335 3808 ts_tablet_manager.cc:1118] T
> 271df8901d98442cb478593babd8a609 P c2ea8f22f4034bcc97e26c9236811960: Deleting
> tablet data with delete state TABLET_DATA_COPYING
> W0920 16:41:48.578610 3808 env_util.cc:277] Failed to determine if path is a
> directory:
> /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609.kudutmp.2Tu0Uy:
> Not found:
> /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609.kudutmp.2Tu0Uy:
> No such file or directory (error 2)
> {code}
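The consensus matrix in the ksck output above can be read mechanically: compare each replica's voter set with the master's and list the ones that disagree. The sketch below does that for the four rows shown. It is an illustration of how to read the matrix, not ksck's actual implementation, and `replicas_disagreeing_with_master` is a hypothetical name.

```python
from typing import Dict, FrozenSet, List


def replicas_disagreeing_with_master(
    voters_by_source: Dict[str, FrozenSet[str]]
) -> List[str]:
    """Return config sources whose voter set differs from the master's view."""
    master_view = voters_by_source["master"]
    return sorted(
        source
        for source, voters in voters_by_source.items()
        if source != "master" and voters != master_view
    )


# Voter sets transcribed from the matrix; the leader marker (*) is dropped.
matrix = {
    "master": frozenset({"A", "B", "C"}),
    "A": frozenset({"A", "B", "C"}),
    "B": frozenset({"D", "B", "C"}),
    "C": frozenset({"A", "B", "C"}),
}

disagreeing = replicas_disagreeing_with_master(matrix)  # ["B"]
```

Only "B" is returned, which matches the "1 replicas' active configs differ from the master's" line: B still holds the older config (term 9, index 23) that lists D instead of A.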