[jira] [Created] (KUDU-1579) into "safe mode" when large number of node crash
zhangsong created KUDU-1579: --- Summary: into "safe mode" when large number of node crash Key: KUDU-1579 URL: https://issues.apache.org/jira/browse/KUDU-1579 Project: Kudu Issue Type: New Feature Reporter: zhangsong Currently, replication is triggered whenever a node crashes. However, when a large number of nodes crash at once, this leads to a replication storm, which can cause disorder and data loss. Replication should be more prudent, and the cluster should enter a "safe mode" when many nodes crash as described above. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
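The request above amounts to a guard in front of re-replication. A minimal sketch of that idea, assuming a simple "fraction of dead nodes" trigger: the names `ClusterView`, `should_enter_safe_mode`, `plan_recovery`, and the `max_dead_fraction` threshold are all hypothetical and are not part of Kudu.

```python
from dataclasses import dataclass


@dataclass
class ClusterView:
    """Hypothetical snapshot of what a master knows about its nodes."""
    total_nodes: int
    dead_nodes: int


def should_enter_safe_mode(view: ClusterView, max_dead_fraction: float = 0.2) -> bool:
    """Return True when so many nodes are down that re-replication should pause.

    The 0.2 default is an illustrative assumption, not a recommended value.
    """
    if view.total_nodes <= 0:
        return False
    return view.dead_nodes / view.total_nodes > max_dead_fraction


def plan_recovery(view: ClusterView, max_dead_fraction: float = 0.2) -> str:
    """Choose between normal re-replication and a cluster-wide safe mode."""
    if should_enter_safe_mode(view, max_dead_fraction):
        return "safe_mode"      # stop re-replicating; wait for an administrator
    if view.dead_nodes > 0:
        return "re_replicate"   # an isolated failure: replace lost replicas as usual
    return "healthy"
```

With these assumptions, one dead node out of 100 still triggers ordinary re-replication, while 40 dead nodes out of 100 puts the cluster into safe mode instead of starting a replication storm.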
[jira] [Created] (KUDU-1578) kudu-tserver should refuse service or "freeze" instead of crash
zhangsong created KUDU-1578: --- Summary: kudu-tserver should refuse service or "freeze" instead of crash Key: KUDU-1578 URL: https://issues.apache.org/jira/browse/KUDU-1578 Project: Kudu Issue Type: Bug Reporter: zhangsong Currently, kudu-tserver crashes when NTP is unsynchronized. However, this may not be the right behavior in a large cluster, where a crash can trigger replication that is useless or even harmful to cluster availability. Instead, kudu-tserver should suspend itself, for example by refusing to serve writes, and let the administrator decide what to do.
[jira] [Comment Edited] (KUDU-1534) expose software version in ListMaster RPC response
[ https://issues.apache.org/jira/browse/KUDU-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438290#comment-15438290 ] Dinesh Bhat edited comment on KUDU-1534 at 8/26/16 2:24 AM: Cool, thanks. Also clarifying: when you say 'trunk', you mean top-of-the-tree, right? We currently have '0.10.0-SNAPSHOT' in trunk, so when you said 0.10.0, you probably meant our RC release bits? I guess there shouldn't be any difference between 0.9.1 and 0.10.0 as far as this test is concerned (additionally, 0.9.1 is missing the tserver software version too). was (Author: dineshabbi): Cool, thanks. Also clarifying: when you say 'trunk', you mean top-of-the-tree, right? We currently have '0.10.0-SNAPSHOT' in version_defines.h, so when you said 0.10.0, you probably meant our RC release bits? > expose software version in ListMaster RPC response > -- > > Key: KUDU-1534 > URL: https://issues.apache.org/jira/browse/KUDU-1534 > Project: Kudu > Issue Type: Improvement >Reporter: Dan Burkert >Assignee: Dinesh Bhat >Priority: Minor > Labels: newbie > Attachments: cluster-downgrade.log, cluster-upgrade.log > > > KUDU-1490 exposed the software version of tablet servers in the > GetTabletServers RPC response, but an equivalent doesn't exist for > ListMasters response. This will become more important as multi-master setups > get more common.
[jira] [Comment Edited] (KUDU-1534) expose software version in ListMaster RPC response
[ https://issues.apache.org/jira/browse/KUDU-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438290#comment-15438290 ] Dinesh Bhat edited comment on KUDU-1534 at 8/26/16 1:51 AM: Cool, thanks. Also clarifying: when you say 'trunk', you mean top-of-the-tree, right? We currently have '0.10.0-SNAPSHOT' in version_defines.h, so when you said 0.10.0, you probably meant our RC release bits? was (Author: dineshabbi): Also clarifying: when you say 'trunk', you mean top-of-the-tree, right? We currently have '0.10.0-SNAPSHOT' in version_defines.h, so when you said 0.10.0, you probably meant our RC release bits? > expose software version in ListMaster RPC response > -- > > Key: KUDU-1534 > URL: https://issues.apache.org/jira/browse/KUDU-1534 > Project: Kudu > Issue Type: Improvement >Reporter: Dan Burkert >Assignee: Dinesh Bhat >Priority: Minor > Labels: newbie > Attachments: cluster-downgrade.log, cluster-upgrade.log > > > KUDU-1490 exposed the software version of tablet servers in the > GetTabletServers RPC response, but an equivalent doesn't exist for > ListMasters response. This will become more important as multi-master setups > get more common.
[jira] [Commented] (KUDU-1534) expose software version in ListMaster RPC response
[ https://issues.apache.org/jira/browse/KUDU-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438244#comment-15438244 ]
Adar Dembo commented on KUDU-1534:
--
Thanks for doing all of that testing. Unfortunately, I don't think it covers the scenarios at work here. Let's enumerate the changes:
* TSRegistrationPB -> ServerRegistrationPB in TSHeartbeatRequestPB. To test this, you need to run a master/tserver combination of mixed versions. For example, a cluster where the master is built from trunk and the tserver is running on 0.10.0 bits.
* TSRegistrationPB -> ServerRegistrationPB in ListTabletServersResponsePB. To test this, you need to run a master/client combination of mixed versions.
As far as I can tell, your testing didn't mix versions. Moreover, the client you did use (ts-cli) doesn't issue a ListTabletServers request.
How about this test plan:
# Start a master and tserver using bits from 0.10.0. Create a non-replicated table. Write some data into it. Stop the master but keep the tserver running.
# Restart the master using bits from trunk.
# Verify that the master is receiving the tserver's heartbeats. You can use the master's web UI for this, or you can look at the tserver/master logs for heartbeating errors.
# Run kudu-ksck from 0.10.0. One of the things it'll do is issue a ListTabletServers call to the master.
> expose software version in ListMaster RPC response
> --
>
> Key: KUDU-1534
> URL: https://issues.apache.org/jira/browse/KUDU-1534
> Project: Kudu
> Issue Type: Improvement
> Reporter: Dan Burkert
> Assignee: Dinesh Bhat
> Priority: Minor
> Labels: newbie
> Attachments: cluster-downgrade.log, cluster-upgrade.log
>
> KUDU-1490 exposed the software version of tablet servers in the GetTabletServers RPC response, but an equivalent doesn't exist for ListMasters response. This will become more important as multi-master setups get more common.
[jira] [Created] (KUDU-1577) Spark insert-ignore is significantly slower that upsert
Dan Burkert created KUDU-1577: - Summary: Spark insert-ignore is significantly slower that upsert Key: KUDU-1577 URL: https://issues.apache.org/jira/browse/KUDU-1577 Project: Kudu Issue Type: Bug Reporter: Dan Burkert We have seen cases where running an insert-ignore Spark ingestion job is significantly (10x) slower than the equivalent job using upsert. Not sure at this point if it's due to the increased number of error messages being returned by the server, or if it's the client not handling them efficiently.
[jira] [Commented] (KUDU-1534) expose software version in ListMaster RPC response
[ https://issues.apache.org/jira/browse/KUDU-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438194#comment-15438194 ]
Dinesh Bhat commented on KUDU-1534:
---
Code reviews are posted at: https://gerrit.cloudera.org/#/c/4099/
I also performed a downgrade/upgrade of clusters as per [~adar]'s suggestion in the review. Here are the steps I followed:
Start the cluster with the 0.10.0 release:
- Picked a random test which creates a table and distributes the replicas evenly; the test elects the master as leader
- Left the FS layout files intact even after the test was done (using the flag leave_test_files=always)
- Finished the test
Downgrade the cluster to the 0.9.1 release:
- Started the cluster using the same FS layout used earlier
- Started the master first and then the tservers, and also reversed the bring-up order
- Logs are attached as cluster-downgrade.log
- Killed the master/tserver instances
Upgrade back to the 0.10.0 release:
- Started the cluster using the same FS layout used earlier
- Started the master first and then the tservers, and tried other permutations of the bring-up order
- Logs are in cluster-upgrade.log
- Killed the master/tservers
> expose software version in ListMaster RPC response
> --
>
> Key: KUDU-1534
> URL: https://issues.apache.org/jira/browse/KUDU-1534
> Project: Kudu
> Issue Type: Improvement
> Reporter: Dan Burkert
> Assignee: Dinesh Bhat
> Priority: Minor
> Labels: newbie
>
> KUDU-1490 exposed the software version of tablet servers in the GetTabletServers RPC response, but an equivalent doesn't exist for ListMasters response. This will become more important as multi-master setups get more common.
[jira] [Updated] (KUDU-668) Log block container metadata files should be more forgiving to truncation
[ https://issues.apache.org/jira/browse/KUDU-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Lipcon updated KUDU-668:
-
Priority: Critical (was: Major)
> Log block container metadata files should be more forgiving to truncation
> -
>
> Key: KUDU-668
> URL: https://issues.apache.org/jira/browse/KUDU-668
> Project: Kudu
> Issue Type: Sub-task
> Components: fs
> Affects Versions: M5
> Reporter: Adar Dembo
> Assignee: Adar Dembo
> Priority: Critical
>
> Log block container metadata files are resilient to many different kinds of failures (see pb_util.h for details). However, they are also overly strict with respect to truncation. Ideally, a truncation in the middle of a log block record should result in the record being discarded and the container reused for additional writes. The only way to do this safely is to prove that, between the truncation and the end of the file, there do not exist any other valid log block records. The WAL segment reader code has the same problem, and it handles this by trying to decode a segment header at every byte position between the point of truncation and the end of the file. Log block container metadata files should do the same thing.
> Here's what needs to happen:
> # Containerized PB files should add a CRC32 checksum to the message header structure. Otherwise we can't tell if a particular read in the file comprises a "valid" message header.
> # In the event of truncation, they should do what the WAL segment reader does and scan ahead in the file looking for valid message headers. If one is found, this is not truncation but corruption, and is unrecoverable. If none are found (or if the remainder of the file is all zeroes), it's recoverable truncation.
> # If the truncation is recoverable, we should make sure to start writing new metadata at the point of truncation, not at the end of the file.
> Once this is done, containerized PB files will be almost identical to WAL segments, and we could consider merging the two. As far as I can tell, the only remaining major difference is that WAL segments allow one to write different kinds of PB messages, while containerized PB files are restricted to one type of PB message per file.
> For the time being, log block container metadata files don't use memory mapped writing or preallocation, so that the likelihood of extra zeroes in the file is low. Still, if we believe that the underlying filesystem or disk could truncate the file unexpectedly, we will consider such truncation fatal instead of recovering gracefully.
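The three steps in the KUDU-668 description can be illustrated with a small, self-contained sketch. It is not Kudu's pb_util code: the record layout (a 4-byte little-endian length, a CRC32 of that length field, the payload, then a CRC32 of the payload) and the names `encode_record`, `header_is_valid`, `read_records`, and `classify_tail` are assumptions made up for this example.

```python
import struct
import zlib

# Hypothetical header layout: payload length, then CRC32 of the 4 length bytes.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Serialize one record: checksummed header, payload, payload checksum."""
    length = struct.pack("<I", len(payload))
    header = length + struct.pack("<I", zlib.crc32(length))
    return header + payload + struct.pack("<I", zlib.crc32(payload))


def header_is_valid(buf: bytes, pos: int) -> bool:
    """Step 1: a header counts as valid only if its CRC32 matches."""
    if pos + HEADER.size > len(buf):
        return False
    _length, crc = HEADER.unpack_from(buf, pos)
    return zlib.crc32(buf[pos:pos + 4]) == crc


def read_records(buf: bytes):
    """Read records until the first bad one; return (payloads, end_of_valid_data)."""
    pos, payloads = 0, []
    while pos < len(buf):
        if not header_is_valid(buf, pos):
            break
        length, _crc = HEADER.unpack_from(buf, pos)
        start = pos + HEADER.size
        end = start + length
        if end + 4 > len(buf):
            break  # payload or its checksum runs past the end of the file
        (payload_crc,) = struct.unpack_from("<I", buf, end)
        if zlib.crc32(buf[start:end]) != payload_crc:
            break
        payloads.append(buf[start:end])
        pos = end + 4
    return payloads, pos


def classify_tail(buf: bytes, bad_pos: int) -> str:
    """Step 2: scan every byte position after bad_pos for a valid header."""
    for pos in range(bad_pos + 1, len(buf)):
        if header_is_valid(buf, pos):
            return "corruption"   # a later record exists: unrecoverable
    return "truncation"           # nothing valid follows: recoverable
```

When `classify_tail` reports `"truncation"`, the third step corresponds to resuming writes at the offset returned by `read_records` rather than at `len(buf)`. An all-zero tail is treated as truncation in this layout because a zeroed header never carries a matching CRC32.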