http://git-wip-us.apache.org/repos/asf/incubator-slider/blob/209cee43/src/site/markdown/specification/cli-actions.md ---------------------------------------------------------------------- diff --git a/src/site/markdown/specification/cli-actions.md b/src/site/markdown/specification/cli-actions.md deleted file mode 100644 index 5060ab5..0000000 --- a/src/site/markdown/specification/cli-actions.md +++ /dev/null @@ -1,675 +0,0 @@ -<!--- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---> - -# Apache Slider CLI Actions - - -## Important - -1. This document is still being updated from the original hoya design -2. The new cluster model of separated specification files for internal, resource and application configuration -has not been incorporated. -1. What is up to date is the CLI command list and arguments - -## client configuration - -As well as the CLI options, the `conf/slider-client.xml` XML file can define arguments used to communicate with the Application instance - - -#### `fs.defaultFS` - -Equivalent to setting the filesystem with `--filesystem` - - - -## Common - -### System Properties - -Arguments of the form `-S key=value` define JVM system properties. - -These are supported primarily to define options needed for some Kerberos configurations. 
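The flow of `-S key=value` arguments into JVM system properties (and, below, `-D key=value` into client definitions) can be illustrated with a small parser. This is a sketch only — Slider's actual CLI parsing is done in its Java client — and `parse_defines` is a hypothetical helper, not part of Slider:

```python
# Illustrative sketch: split repeated `-S key=value` (JVM system properties)
# and `-D key=value` (client configuration definitions) into two maps.

def parse_defines(argv):
    """Return (system_properties, definitions) parsed from a CLI argument list."""
    sysprops, definitions = {}, {}
    it = iter(argv)
    for arg in it:
        if arg in ("-S", "-D"):
            pair = next(it)                  # the key=value token follows the flag
            key, _, value = pair.partition("=")
            if not key or "=" not in pair:
                raise ValueError("expected key=value after %s" % arg)
            (sysprops if arg == "-S" else definitions)[key] = value
    return sysprops, definitions
```

For example, `parse_defines(["-S", "java.security.krb5.realm=EXAMPLE.COM", "-D", "fs.defaultFS=hdfs://nn:8020"])` routes the Kerberos property into the system-property map and the filesystem binding into the definitions map.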
- -### Definitions - -Arguments of the form `-D key=value` define client configuration properties. - -These can define client options that are not set in `conf/slider-client.xml`, -or override those that are. - -### Cluster names - -All actions that take an instance name will fail with `EXIT_UNKNOWN_INSTANCE` -if one is not provided. - -## Action: Build - -Builds a cluster: creates all the on-filesystem data structures and generates a cluster description -that is both well-defined and deployable -*but does not actually start the cluster*. - - build (instancename, - options:List[(String,String)], - components:List[(String, int)], - componentOptions:List[(String,String, String)], - resourceOptions:List[(String,String)], - resourceComponentOptions:List[(String,String, String)], - confdir: URI, - provider: String - zkhosts, - zkport, - image - apphome - appconfdir - - -#### Preconditions - -(Note that the ordering of these preconditions is not guaranteed to remain constant) - -The instance name is valid - - if not valid-instance-name(instancename) : raise SliderException(EXIT_COMMAND_ARGUMENT_ERROR) - -The instance must not be live. This is purely a safety check, as the next test should have the same effect. - - if slider-instance-live(YARN, instancename) : raise SliderException(EXIT_CLUSTER_IN_USE) - -The instance must not exist - - if is-dir(HDFS, instance-path(FS, instancename)) : raise SliderException(EXIT_CLUSTER_EXISTS) - -The configuration directory must exist. It does not have to be in the instance's HDFS filesystem, -as it will be copied there, -and it must contain only files - - let FS = FileSystem.get(appconfdir) - if not isDir(FS, appconfdir) raise SliderException(EXIT_COMMAND_ARGUMENT_ERROR) - forall f in children(FS, appconfdir) : - if not isFile(f): raise IOException - -There's a race condition at build time: between the preconditions being met and the instance specification being saved, the instance -may be created by another process.
This is addressed by creating a lock file, `writelock.json`, in the destination directory. If the file -exists, no other process may acquire the lock. - -There is a less exclusive read-lock file, `readlock.json`, which may be created by any process that wishes to read the configuration. -If it exists when another process wishes to access the files, the subsequent process may read the data, but MUST NOT delete it -afterwards. A process attempting to acquire the write lock must check for the existence of this file before AND after creating the -writelock file, failing if it is present. This retains a small race condition: a second or later reader may still be reading the data -when a process successfully acquires the write lock. If this proves to be an issue, a stricter model could be implemented, with each reading process creating a uniquely named readlock- file. - - - - -#### Postconditions - -All the instance directories exist - - is-dir(HDFS', instance-path(HDFS', instancename)) - is-dir(HDFS', original-conf-path(HDFS', instancename)) - is-dir(HDFS', generated-conf-path(HDFS', instancename)) - -The saved application cluster specification is well-defined and deployable - - let instance-description = parse(data(HDFS', instance-json-path(HDFS', instancename))) - well-defined-instance(instance-description) - deployable-application-instance(HDFS', instance-description) - -More precisely: the specification is validated as well-defined and deployable before it is saved as JSON; no JSON file will be created -if the validation fails.
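The validate-before-persist postcondition can be sketched as follows. Here `well_defined` and `deployable` are stand-ins for the `well-defined-instance()` and `deployable-application-instance()` predicates, and the write-then-rename step is one illustrative way of meeting the "no JSON file on failure" guarantee:

```python
import json
import os
import tempfile

def save_instance_json(path, spec, well_defined, deployable):
    """Persist the instance specification only if validation passes.

    `well_defined` and `deployable` are callables standing in for the
    specification predicates; no file is created if either rejects the spec.
    """
    if not well_defined(spec):
        raise ValueError("specification is not well-defined")
    if not deployable(spec):
        raise ValueError("specification is not deployable")
    # Write to a temporary file and rename into place, so a half-written
    # cluster.json can never be observed by a concurrent reader.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(spec, f, indent=2)
    os.replace(tmp, path)
```

The rename step is an assumption about how atomicity could be achieved on a local filesystem; HDFS offers the analogous create-then-rename idiom.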
- -Fields in the cluster description have been filled in - - internal.global["internal.provider.name"] == provider - app_conf.global["zookeeper.port"] == zkport - app_conf.global["zookeeper.hosts"] == zkhosts - - - package => app_conf.global["agent.package"] = package - - - -Any `apphome` and `image` properties have propagated - - apphome == null or clusterspec.options["cluster.application.home"] == apphome - image == null or clusterspec.options["cluster.application.image.path"] == image - -(The `well-defined-application-instance()` requirement above defines the valid states -of this pair of options) - - -All role sizes have been mapped to `component.instances` fields - - forall (name, size) in components : - resources.components[name]["components.instances"] == size - - - - -All option parameters have been added to the `options` map in the specification - - forall (opt, val) in options : - app_conf.global[opt] == val - - forall (opt, val) in resourceOptions : - resource.global[opt] == val - -All component option parameters have been added to the specific components's option map -in the relevant configuration file - - forall (name, opt, val) in componentOptions : - app_conf.components[name][opt] == val - - forall (name, opt, val) in resourceComponentOptions : - resourceComponentOptions.components[name][opt] == val - -To avoid some confusion as to where keys go, all options beginning with the -prefix `component.` are automatically copied into the resources file: - - forall (opt, val) in options where startswith(opt, "component.") - or startswith(opt, "role.") - or startswith(opt, "yarn."): - resource.global[opt] == val - - forall (name, opt, val) in componentOptions where startswith(opt, "component.") - or startswith(opt, "role.") - or startswith(opt, "yarn."): - resourceComponentOptions.components[name][opt] == val - - -There's no explicit rejection of duplicate options, the outcome of that -state is 'undefined'. 
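The prefix-based copying rule above can be sketched as a small routing function (illustrative only; `route_options` is not part of Slider):

```python
# Sketch of the documented rule: every CLI option is applied to app_conf's
# global map, and options whose names start with "component.", "role." or
# "yarn." are additionally copied into the resources configuration.

ROUTED_PREFIXES = ("component.", "role.", "yarn.")

def route_options(options):
    """Split (key, value) CLI options into app_conf and resource globals."""
    app_conf_global, resource_global = {}, {}
    for opt, val in options:
        app_conf_global[opt] = val
        if opt.startswith(ROUTED_PREFIXES):
            resource_global[opt] = val
    return app_conf_global, resource_global
```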
- -What is defined is that if Slider or its provider provided a default option value, -the command-line supplied option will override it. - -All files that were in the configuration directory are now copied into the "original" configuration directory - - let FS = FileSystem.get(appconfdir) - let dest = original-conf-path(HDFS', instancename) - forall [c in children(FS, confdir) : - data(HDFS', dest + [filename(c)]) == data(FS, c) - -All files that were in the configuration directory now have equivalents in the generated configuration directory - - let FS = FileSystem.get(appconfdir) - let dest = generated-conf-path(HDFS', instancename) - forall [c in children(FS, confdir) : - isfile(HDFS', dest + [filename(c)]) - - -## Action: Thaw - - thaw <instancename> [--wait <timeout>] - -Thaw takes an application instance with configuration and (possibly) data on disk, and -attempts to create a live application with the specified number of nodes - -#### Preconditions - - if not valid-instance-name(instancename) : raise SliderException(EXIT_COMMAND_ARGUMENT_ERROR) - -The cluster must not be live. This is purely a safety check as the next test should have the same effect. - - if slider-instance-live(YARN, instancename) : raise SliderException(EXIT_CLUSTER_IN_USE) - -The cluster must not exist - - if is-dir(HDFS, application-instance-path(FS, instancename)) : raise SliderException(EXIT_CLUSTER_EXISTS) - -The cluster specification must exist, be valid and deployable - - if not is-file(HDFS, cluster-json-path(HDFS, instancename)) : SliderException(EXIT_UNKNOWN_INSTANCE) - if not well-defined-application-instance(HDFS, application-instance-path(HDFS, instancename)) : raise SliderException(EXIT_BAD_CLUSTER_STATE) - if not deployable-application-instance(HDFS, application-instance-path(HDFS, instancename)) : raise SliderException(EXIT_BAD_CLUSTER_STATE) - -### Postconditions - - -After the thaw has been performed, there is now a queued request in YARN -for the chosen (how?) 
queue - - YARN'.Queues'[amqueue] = YARN.Queues[amqueue] + [launch("slider", instancename, requirements, context)] - -If a wait timeout was specified, the CLI waits until the application is considered -running by YARN (the AM is running), the wait timeout has been reached, or -the application has failed - - waittime < 0 or (exists a in slider-running-application-instances(yarn-application-instances(YARN', instancename, user)) - where a.YarnApplicationState == RUNNING) - - -## Outcome: AM-launched state - -Some time after the AM was queued, if the relevant -prerequisites of the launch request are met, the AM will be deployed - -#### Preconditions - -* The resources referenced in HDFS are (still) accessible by the user -* The requested YARN memory and core requirements could be met on the YARN cluster and -the specific YARN application queue. -* There is sufficient capacity in the YARN cluster to create a container for the AM. - -#### Postconditions - -Define the YARN state at a specific time `t` as `YARN(t)`; the launch of the AM -can then be expressed in terms of later states. - -The AM is deployed if there is some time `t1` after the submission time `t0` -at which the application is listed - - exists t1 where t1 > t0 and slider-instance-live(YARN(t1), user, instancename) - -At which time there is a container in the cluster hosting the AM -its -context is the launch context - - exists c in containers(YARN(t1)) where c.context = launch.context - -There's no way to determine when this time `t1` will be reached -or if it ever -will -the launch may be postponed due to a lack of resources and/or higher-priority -requests using resources as they become available. - -For tests on a dedicated YARN cluster, a few tens of seconds appear to be enough -for the AM-launched state to be reached, a failure to occur, or to conclude -that the resource requirements are unsatisfiable.
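The `--wait` behaviour can be sketched as a polling loop. This is an illustration, not the client's actual implementation; `get_state` stands in for fetching the YARN application report, and the state names mirror `YarnApplicationState`:

```python
import time

def wait_for_am_running(get_state, timeout, poll_interval=1.0):
    """Poll the YARN application state until RUNNING, a terminal state,
    or the timeout; returns True only if RUNNING was observed."""
    terminal = {"FINISHED", "FAILED", "KILLED"}
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state == "RUNNING":
            return True
        if state in terminal:
            return False          # the application failed or was killed
        if time.monotonic() >= deadline:
            return False          # wait timeout exceeded; AM may still launch later
        time.sleep(poll_interval)
```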
- -## Outcome: AM-started state - -A (usually short) time after the AM is launched, it should start, provided that: - -* The node hosting the container is working reliably -* The supplied command line could start the process -* The localized resources in the context could be copied to the container (which implies -that they are readable by the user account the AM is running under) -* The combined classpath of YARN, extra JAR files included in the launch context, -and the resources in the slider client 'conf' dir contain all necessary dependencies -to run Slider. -* There's no issue with the cluster specification that causes the AM to exit -with an error code. - -Node failures/command line failures are treated by YARN as an AM failure which -will trigger a restart attempt -this may be on the same or a different node. - -#### Preconditions - -The AM was launched at an earlier time, `t1` - - exists t1 where t1 > t0 and am-launched(YARN(t1)) - - -#### Postconditions - -The application is actually started if it is listed in the YARN application list -as being in the state `RUNNING`, an RPC port has been registered with YARN (visible as the `rpcPort` -attribute in the YARN Application Report), and that port is servicing RPC requests -from authenticated callers. - - exists t2 where: - t2 > t1 - and slider-instance-live(YARN(t2), instancename, user) - and slider-live-instances(YARN(t2))[0].rpcPort != 0 - and rpc-connection(slider-live-instances(YARN(t2))[0], SliderClusterProtocol) - -A test for accepting cluster requests is querying the cluster status -with `SliderClusterProtocol.getJSONClusterStatus()`. If this returns -a parseable cluster description, the AM considers itself live. - -## Outcome: Application Instance operational state - -Once started, Slider enters the operational state of trying to keep the number -of live role instances matching the numbers specified in the cluster specification.
- -The AM must request a container for each desired instance of each role of the -application, wait for those requests to be granted, and then instantiate -the specific application roles on the allocated containers. - -Such a request is made on startup, whenever a failure occurs, or when the -cluster size is dynamically updated. - -The AM releases containers when the cluster size is shrunk during a flex operation, -or during teardown. - -### Steady state condition - -The steady state of a Slider cluster is that the number of live instances of a role, -plus the number of requested instances, minus the number of instances for -which release requests have been made, must match the desired number. - -If the internal state of the Slider AM is defined as `AppState` - - forall r in clusterspec.roles : - r["yarn.component.instances"] == - AppState.Roles[r].live + AppState.Roles[r].requested - AppState.Roles[r].released - -The `AppState` represents Slider's view of the external YARN system state, based on its -history of notifications received from YARN. - -It is indirectly observable from the cluster state, which an AM can be queried for - - - forall r in AM.getJSONClusterStatus().roles : - r["yarn.component.instances"] == - r["role.actual.instances"] + r["role.requested.instances"] - r["role.releasing.instances"] - -Slider does not consider it an error if the number of actual instances remains below -the desired value (i.e. outstanding requests are not being satisfied) -this is -an operational state of the cluster that Slider cannot address. - -### Cluster startup - -On a healthy dedicated test cluster, the time for the requests to be satisfied is -a few tens of seconds at most: a failure to achieve this state is a sign of a problem. - -### Node or process failure - -After a container or node failure, a new container for a new instance of that role -is requested.
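The steady-state condition can be checked mechanically against a status report of the shape returned by `getJSONClusterStatus()`. The helper itself is a sketch; the key names are those used in this specification:

```python
# Sketch: verify desired == actual + requested - releasing for every role
# in a cluster status report.

def in_steady_state(roles):
    """True iff every role's desired count equals its actual plus
    outstanding-requested minus releasing instance counts."""
    return all(
        r["yarn.component.instances"] ==
        r["role.actual.instances"]
        + r["role.requested.instances"]
        - r["role.releasing.instances"]
        for r in roles.values()
    )
```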
- -The failure count is incremented -it can be accessed via the `"role.failed.instances"` -attribute of a role in the status report. - -The number of failures of a role is tracked, and used by Slider to decide when to -conclude that the role is somehow failing consistently -and that it should fail the -entire application. - -This has initially been implemented as a simple counter, with the cluster -option `"slider.container.failure.threshold"` defining that threshold. - - let status = AM.getJSONClusterStatus() - forall r in status.roles : - r["role.failed.instances"] < status.options["slider.container.failure.threshold"] - - -### Instance startup failure - - -Startup failures are measured alongside general node failures. - -A container is deemed to have failed to start if either of the following conditions -is met: - -1. The AM received an `onNodeManagerContainerStartFailed` event. - -1. The AM received an `onCompletedNode` event for a container that started less than -a specified number of seconds earlier -a number given in the cluster option -`"slider.container.failure.shortlife"`. - -More sophisticated failure handling logic than is currently implemented may treat -startup failures differently from ongoing failures -as they can usually be -treated as a sign that the container is failing to launch the program reliably - -either the generated command line is invalid, or the application is failing -to run, or is exiting immediately or almost immediately. - -## Action: Create - -Create is simply `build` + `thaw` in sequence - the postconditions of the first -action are intended to match the preconditions of the second. - -## Action: Freeze - - freeze instancename [--wait time] [--message message] - -The *freeze* action "freezes" the cluster: all its nodes running in the YARN -cluster are stopped, leaving all the persistent state.
- -The operation is intended to be idempotent: it is not an error if -freeze is invoked on an already frozen cluster. - -#### Preconditions - -The cluster name is valid and it matches a known cluster - - if not valid-instance-name(instancename) : raise SliderException(EXIT_COMMAND_ARGUMENT_ERROR) - - if not is-file(HDFS, application-instance-path(HDFS, instancename)) : - raise SliderException(EXIT_UNKNOWN_INSTANCE) - -#### Postconditions - -If the cluster was running, an RPC call `stopCluster(message)` has been sent to it. - -If the `--wait` argument specified a wait time, then the command will block -until the cluster has finished or the wait time was exceeded. - -If the `--message` argument specified a message, it must appear in the -YARN logs as the reason the cluster was frozen. - - -The outcome should be the same: - - not slider-instance-live(YARN', instancename) - -## Action: Flex - -Flex the cluster size: add or remove roles. - - flex instancename - components:List[(String, int)] - -1. The JSON cluster specification in the filesystem is updated -1. If the cluster is running, it is given the new cluster specification, -which will change the desired steady state of the application - -#### Preconditions - - if not is-file(HDFS, cluster-json-path(HDFS, instancename)) : - raise SliderException(EXIT_UNKNOWN_INSTANCE) - -#### Postconditions - - let originalSpec = data(HDFS, cluster-json-path(HDFS, instancename)) - - let updatedSpec = originalSpec where: - forall (name, size) in components : - updatedSpec.roles[name]["yarn.component.instances"] == size - data(HDFS', cluster-json-path(HDFS', instancename)) == updatedSpec - rpc-connection(slider-live-instances(YARN(t2))[0], SliderClusterProtocol) - let flexed = rpc-connection(slider-live-instances(YARN(t2))[0], SliderClusterProtocol).flexCluster(updatedSpec) - - -#### AM actions on flex - - boolean SliderAppMaster.flexCluster(ClusterDescription updatedSpec) - -If the cluster is in a state where flexing is possible (i.e.
it is not in teardown), -then `AppState` is updated with the new desired role counts. The operation will -return once all requests to add or remove role instances have been queued, -and be `True` iff the desired steady state of the cluster has been changed. - -#### Preconditions - - well-defined-application-instance(HDFS, updatedSpec) - - -#### Postconditions - - forall role in AppState.Roles.keys: - AppState'.Roles'[role].desiredCount = updatedSpec.roles[role]["yarn.component.instances"] - result = AppState' != AppState - - -The flexing may change the desired steady state of the cluster, in which -case the relevant requests will have been queued by the completion of the -action. It is not possible to state whether or when the requests will be -satisfied. - -## Action: Destroy - -Idempotent operation to destroy a frozen cluster -it succeeds if the -cluster has already been destroyed/is unknown, but not if it is -actually running. - -#### Preconditions - - if not valid-instance-name(instancename) : raise SliderException(EXIT_COMMAND_ARGUMENT_ERROR) - - if slider-instance-live(YARN, instancename) : raise SliderException(EXIT_CLUSTER_IN_USE) - - -#### Postconditions - -The cluster directory and all its children do not exist - - not is-dir(HDFS', application-instance-path(HDFS', instancename)) - - -## Action: Status - - status instancename [--out outfile] - -#### Preconditions - - if not slider-instance-live(YARN, instancename) : raise SliderException(EXIT_UNKNOWN_INSTANCE) - -#### Postconditions - -The status of the application has been successfully queried and printed out: - - let status = slider-live-instances(YARN).rpcPort.getJSONClusterStatus() - -If the `outfile` value is not defined, the status appears as part of stdout - - status in STDOUT' - -otherwise, the outfile exists in the local filesystem - - (outfile != "") ==> data(LocalFS', outfile) == status - (outfile == "") ==> status in STDOUT'
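The status `--out` postcondition can be sketched as follows (an illustrative helper, not the actual client code): the JSON body lands in exactly one destination, the named file or stdout.

```python
import json
import sys

def emit_status(status, outfile=""):
    """Write the JSON cluster status to `outfile`, or to stdout if no
    outfile was given; returns the serialized body either way."""
    body = json.dumps(status, indent=2)
    if outfile:
        with open(outfile, "w") as f:
            f.write(body)
    else:
        sys.stdout.write(body + "\n")
    return body
```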
actually being in the running -state. - -In the running state; it is essentially the status -operation with only the exit code returned - -#### Preconditions - - - if not is-file(HDFS, application-instance-path(HDFS, instancename)) : - raise SliderException(EXIT_UNKNOWN_INSTANCE) - -#### Postconditions - -The operation succeeds if the cluster is running and the RPC call returns the cluster -status. - - if live and not slider-instance-live(YARN, instancename): - retcode = -1 - else: - retcode = 0 - -## Action: getConf - -This returns the live client configuration of the cluster -the -site-xml file. - - getconf --format (xml|properties) --out [outfile] - -*We may want to think hard about whether this is needed* - -#### Preconditions - - if not slider-instance-live(YARN, instancename) : raise SliderException(EXIT_UNKNOWN_INSTANCE) - - -#### Postconditions - -The operation succeeds if the cluster status can be retrieved and saved to -the named file/printed to stdout in the format chosen - - let status = slider-live-instances(YARN).rpcPort.getJSONClusterStatus() - let conf = status.clientProperties - if format == "xml" : - let body = status.clientProperties.asXmlDocument() - else: - let body = status.clientProperties.asProperties() - - if outfile != "" : - data(LocalFS', outfile) == body - else - body in STDOUT' - -## Action: list - - list [instancename] - -Lists all clusters of a user, or only the one given - -#### Preconditions - -If a instancename is specified it must be in YARNs list of active or completed applications -of that user: - - if instancename != "" and [] == yarn-application-instances(YARN, instancename, user) - raise SliderException(EXIT_UNKNOWN_INSTANCE) - - -#### Postconditions - -If no instancename was given, all slider applications of that user are listed, -else only the one running (or one of the finished ones) - - if instancename == "" : - forall a in yarn-application-instances(YARN, user) : - a.toString() in STDOUT' - else - let e = 
yarn-application-instances(YARN, instancename, user) - e.toString() in STDOUT' - -## Action: killcontainer - -This is an operation added for testing. It will kill a container in the cluster -*without flexing the cluster size*. As a result, the cluster will detect the -failure and attempt to recover from it by instantiating a new instance -of that role - - killcontainer cluster --id container-id - -#### Preconditions - - if not slider-instance-live(YARN, instancename) : raise SliderException(EXIT_UNKNOWN_INSTANCE) - - exists c in slider-app-containers(YARN, instancename, user) where c.id == container-id - - let status := AM.getJSONClusterStatus() - exists role in status.instances where container-id in status.instances[role].values - - -#### Postconditions - -The container is not in the list of containers in the cluster - - not exists c in containers(YARN) where c.id == container-id - -And implicitly, not in the running containers of that application - - not exists c in slider-app-containers(YARN', instancename, user) where c.id == container-id - -At some time `t1 > t`, the status of the application (`AM'`) will be updated to reflect -that YARN has notified the AM of the loss of the container - - - let status' = AM'.getJSONClusterStatus() - len(status'.instances[role]) < len(status.instances[role]) - status'.roles[role]["role.failed.instances"] == status.roles[role]["role.failed.instances"]+1 - -At some time `t2 > t1` in the future, the number of containers of the application -in the YARN cluster `YARN''` will be as before - - let status'' = AM''.getJSONClusterStatus() - len(status''.instances[r]) == len(status.instances[r])
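A test for these postconditions can be sketched as a polling check (hypothetical helper; `get_status` stands in for `AM.getJSONClusterStatus()`, `baseline` is the status captured before the kill, and the key names follow this specification):

```python
import time

def await_recovery(get_status, role, baseline, timeout=60, poll=1.0):
    """After killing one container of `role`, wait until the AM both records
    the failure and restores the original instance count."""
    deadline = time.monotonic() + timeout
    want_instances = len(baseline["instances"][role])
    want_failed = baseline["roles"][role]["role.failed.instances"] + 1
    while time.monotonic() < deadline:
        s = get_status()
        if (len(s["instances"][role]) == want_instances
                and s["roles"][role]["role.failed.instances"] >= want_failed):
            return True               # failure recorded and cluster size restored
        time.sleep(poll)
    return False                      # recovery not observed within the timeout
```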
http://git-wip-us.apache.org/repos/asf/incubator-slider/blob/209cee43/src/site/markdown/specification/index.md ---------------------------------------------------------------------- diff --git a/src/site/markdown/specification/index.md b/src/site/markdown/specification/index.md deleted file mode 100644 index f4c8d67..0000000 --- a/src/site/markdown/specification/index.md +++ /dev/null @@ -1,41 +0,0 @@ -<!--- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---> - -# Specification of Apache Slider behaviour - -This is a a "more rigorous" definition of the behavior of Slider in terms -of its state and its command-line operations -by defining a 'formal' model -of HDFS, YARN and Slider's internal state, then describing the operations -that can take place in terms of their preconditions and postconditions. - -This is to show what tests we can create to verify that an action -with a valid set of preconditions results in an outcome whose postconditions -can be verified. It also makes more apparent what conditions should be -expected to result in failures, as well as what the failure codes should be. - -Specifying the behavior has also helped identify areas where there was ambiguity, -where clarification and more tests were needed. 
- -The specification depends on ongoing work in [HADOOP-9361](https://issues.apache.org/jira/browse/HADOOP-9361): -to define the Hadoop Filesytem APIs --This specification uses [the same notation](https://github.com/steveloughran/hadoop-trunk/blob/stevel/HADOOP-9361-filesystem-contract/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/notation.md) - - -1. [Model: YARN And Slider](slider-model.html) -1. [CLI actions](cli-actions.html) - -Exceptions and operations may specify exit codes -these are listed in -[Client Exit Codes](../exitcodes.html) http://git-wip-us.apache.org/repos/asf/incubator-slider/blob/209cee43/src/site/markdown/specification/slider-model.md ---------------------------------------------------------------------- diff --git a/src/site/markdown/specification/slider-model.md b/src/site/markdown/specification/slider-model.md deleted file mode 100644 index 75f8c68..0000000 --- a/src/site/markdown/specification/slider-model.md +++ /dev/null @@ -1,286 +0,0 @@ -<!--- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. ---> - -# Formal Apache Slider Model - -This is the model of Slider and YARN for the rest of the specification. 
- -## File System - -A File System `HDFS` represents a Hadoop FileSystem -either HDFS or another File -System which spans the cluster. There are also other filesystems that -can act as sources of data that is then copied into HDFS. These will be marked -as `FS` or with the generic `FileSystem` type. - - -There's ongoing work in [HADOOP-9361](https://issues.apache.org/jira/browse/HADOOP-9361) -to define the Hadoop Filesytem APIs using the same notation as here, -the latest version being available on [github](https://github.com/steveloughran/hadoop-trunk/tree/stevel/HADOOP-9361-filesystem-contract/hadoop-common-project/hadoop-common/src/site/markdown/filesystem) -Two key references are - - 1. [The notation reused in the Slider specifications](https://github.com/steveloughran/hadoop-trunk/blob/stevel/HADOOP-9361-filesystem-contract/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/notation.md) - 1. [The model of the filesystem](https://github.com/steveloughran/hadoop-trunk/blob/stevel/HADOOP-9361-filesystem-contract/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/model.md) - - The model and its predicates and invariants will be used in these specifications. - -## YARN - -From the perspective of YARN application, The YARN runtime is a state, `YARN`, -comprised of: ` (Apps, Queues, Nodes)` - - Apps: Map[AppId, ApplicationReport] - -An application has a name, an application report and a list of outstanding requests - - App: (Name, report: ApplicationReport, Requests:List[AmRequest]) - -An application report contains a mixture of static and dynamic state of the application -and the AM. - - ApplicationReport: AppId, Type, User, YarnApplicationState, AmContainer, RpcPort, TrackingURL, - -YARN applications have a number of states. These are ordered such that if the -`state.ordinal() > RUNNING.ordinal() ` then the application has entered an exit state. 
- - YarnApplicationState : [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED ] - -AMs can request containers to be added or released - - AmRequest = { add-container(priority, requirements), release(containerId)} - -Job queues are named queues of job requests; there is always a queue called `"default"` - - Queues: Map[String:Queue] - Queue: List[Requests] - Request = { - launch(app-name, app-type, requirements, context) - } - Context: (localized-resources: Map[String,URL], command) - - -This doesn't completely model the cluster from the AM perspective -there's no -notion of node operations (launching code in a container) or events coming from YARN. - -The `Nodes` structure models the nodes in a cluster - - Nodes: Map[nodeID,(name, containers:List[Container])] - -A container contains some state - - Container: (containerId, appId, context) - -The containers in a cluster are the aggregate set of all containers across -all nodes - - def containers(YARN) = - [c for n in keys(YARN.Nodes) for c in YARN.Nodes[n].Containers ] - - -The containers of an application are all containers that are considered owned by it: - - def app-containers(YARN, appId: AppId) = - [c in containers(YARN) where c.appId == appId ] - -### Operations & predicates used in the specifications - - - def applications(YARN, type) = - [ app.report for app in YARN.Apps.values where app.report.Type == type] - - def user-applications(YARN, type, user) = - [a in applications(YARN, type) where: a.User == user] - - -## UserGroupInformation - -Applications are launched and executed on host computers: either client machines -or nodes in the cluster; these have their own state, which may need modeling - - HostState: Map[String, String] - -A key part of the host state is actually the identity of the current user, -which is used to define the location of the persistent state of the cluster -including -its data, and the identity under which a deployed container executes.
-
-In a secure cluster, this identity is accompanied by Kerberos tokens that grant the caller
-access to the filesystem and to parts of YARN itself.
-
-This specification does not currently explicitly model the username and credentials.
-If it did, they would be used throughout the specification to bind to a YARN or HDFS instance.
-
-`UserGroupInformation.getCurrentUser(): UserGroupInformation`
-
-Returns the current user information. This information is immutable and fixed for the duration of the process.
-
-## Slider Model
-
-### Cluster name
-
-A valid cluster name is a name of length > 1 which follows the internet hostname
-scheme: a letter, followed by letters, digits or hyphens
-
-    def valid-cluster-name(c) =
-      len(c) > 1
-      and c[0] in ['a'..'z']
-      and forall ch in c[1:]: ch in (['a'..'z'] + ['-'] + ['0'..'9'])
-
-### Persistent Cluster State
-
-A Slider cluster's persistent state is stored in a path
-
-    def cluster-path(FS, clustername) = user-home(FS) + ["clusters", clustername]
-    def cluster-json-path(FS, clustername) = cluster-path(FS, clustername) + ["cluster.json"]
-    def original-conf-path(FS, clustername) = cluster-path(FS, clustername) + ["original"]
-    def generated-conf-path(FS, clustername) = cluster-path(FS, clustername) + ["generated"]
-    def data-path(FS, clustername) = cluster-path(FS, clustername) + ["data"]
-
-When a cluster is built/created, the specified original configuration directory
-is copied to `original-conf-path(FS, clustername)`; this is patched for the
-specific instance bindings and saved into `generated-conf-path(FS, clustername)`.
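The name-validation rule and the path layout can be rendered as a minimal Python sketch. As in the specification, paths are modelled as lists of path elements; the `user_home` helper is illustrative (the real value comes from the filesystem binding), and the HDFS interaction itself is elided:

```python
import string

def valid_cluster_name(c):
    """A letter first, then letters, digits or hyphens; length > 1."""
    return (len(c) > 1
            and c[0] in string.ascii_lowercase
            and all(ch in string.ascii_lowercase + string.digits + "-"
                    for ch in c[1:]))

def user_home(user):
    # Illustrative stand-in for user-home(FS)
    return ["user", user]

def cluster_path(home, clustername):
    return home + ["clusters", clustername]

def cluster_json_path(home, clustername):
    return cluster_path(home, clustername) + ["cluster.json"]

def original_conf_path(home, clustername):
    return cluster_path(home, clustername) + ["original"]

def generated_conf_path(home, clustername):
    return cluster_path(home, clustername) + ["generated"]

def data_path(home, clustername):
    return cluster_path(home, clustername) + ["data"]
```

For example, `cluster_json_path(user_home("alice"), "c1")` resolves to `["user", "alice", "clusters", "c1", "cluster.json"]`.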
-
-A cluster *exists* if all of these paths are found:
-
-    def cluster-exists(FS, clustername) =
-      is-dir(FS, cluster-path(FS, clustername))
-      and is-file(FS, cluster-json-path(FS, clustername))
-      and is-dir(FS, original-conf-path(FS, clustername))
-      and is-dir(FS, generated-conf-path(FS, clustername))
-
-A cluster is considered `running` if there is an application of the Slider application
-type, belonging to the current user, in one of the states
-`{NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING}`.
-
-    def final-yarn-states = {FINISHED, FAILED, KILLED}
-
-    def slider-app-instances(YARN, clustername, user) =
-      [a in user-applications(YARN, "slider", user) where:
-        a.Name == clustername]
-
-    def slider-app-running-instances(YARN, clustername, user) =
-      [a in slider-app-instances(YARN, clustername, user) where:
-        not a.YarnApplicationState in final-yarn-states]
-
-    def slider-app-running(YARN, clustername, user) =
-      [] != slider-app-running-instances(YARN, clustername, user)
-
-    def slider-app-live-instances(YARN, clustername, user) =
-      [a in slider-app-instances(YARN, clustername, user) where:
-        a.YarnApplicationState == RUNNING]
-
-    def slider-app-live(YARN, clustername, user) =
-      [] != slider-app-live-instances(YARN, clustername, user)
-
-### Invariant: there must never be more than one running instance of a named Slider cluster
-
-There must never be more than one instance of the same Slider cluster running:
-
-    forall a in user-applications(YARN, "slider", user):
-      len(slider-app-running-instances(YARN, a.Name, user)) <= 1
-
-There may be multiple instances in a finished state, and one running instance
-alongside multiple finished instances; the applications
-that work with Slider MUST select a running cluster ahead of any terminated clusters.
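The liveness predicates can likewise be sketched in Python. Here `apps` is an illustrative list of application-report dicts rather than a live YARN connection, and the field names are assumptions made for the sketch:

```python
FINAL_YARN_STATES = {"FINISHED", "FAILED", "KILLED"}

def slider_app_instances(apps, clustername, user):
    """All of the user's Slider applications named for this cluster."""
    return [a for a in apps
            if a["type"] == "slider"
            and a["user"] == user
            and a["name"] == clustername]

def slider_app_running_instances(apps, clustername, user):
    """Instances that have not entered a final state."""
    return [a for a in slider_app_instances(apps, clustername, user)
            if a["state"] not in FINAL_YARN_STATES]

def slider_app_running(apps, clustername, user):
    return slider_app_running_instances(apps, clustername, user) != []

def slider_app_live(apps, clustername, user):
    """Live means actually in the RUNNING state."""
    return any(a["state"] == "RUNNING"
               for a in slider_app_instances(apps, clustername, user))

# One accepted (running-but-not-yet-live) instance plus one failed instance.
sample = [
    {"type": "slider", "user": "alice", "name": "c1", "state": "ACCEPTED"},
    {"type": "slider", "user": "alice", "name": "c1", "state": "FAILED"},
]
```

With the sample data, cluster `c1` is *running* (the `ACCEPTED` instance is not in a final state) but not yet *live*, which matches the distinction the predicates draw above.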
-
-### Containers of an application
-
-The containers of a Slider application are the containers of its (single)
-running application instance
-
-    def slider-app-containers(YARN, clustername, user) =
-      app-containers(YARN, appid) where
-        appid = slider-app-running-instances(YARN, clustername, user)[0]
-
-### RPC Access to a slider cluster
-
-An application is accepting RPC requests for a given protocol if there is a port binding
-defined and it is possible to authenticate a connection using the specified protocol
-
-    def rpc-connection(appReport, protocol) =
-      appReport.host != null
-      and appReport.rpcPort != 0
-      and RPC.getProtocolProxy(appReport.host, appReport.rpcPort, protocol)
-
-Being able to open an RPC port is the strongest definition of liveness it is
-possible to make: if the AM responds to RPC operations, it is doing useful work.
-
-### Valid Cluster Description
-
-The `cluster.json` file of a cluster configures Slider to deploy the application.
-
-#### well-defined-cluster(cluster-description)
-
-A Cluster Description is well-defined if it is valid JSON and required properties are present
-
-**OBSOLETE**
-
-Irrespective of specific details for deploying the Slider AM or any provider-specific role instances,
-a Cluster Description defined in a `cluster.json` file at the path `cluster-json-path(FS, clustername)`
-is well-defined if
-
-1. It is parseable by the Jackson JSON parser.
-1. Root elements required of a Slider cluster specification are defined, and, where appropriate, non-empty.
-1. It contains the extensible elements required of a Slider cluster specification, for example `options` and `roles`.
-1. The types of the extensible elements match those expected by Slider.
-1. The `version` element matches a supported version.
-1. Exactly one of `options/cluster.application.home` and `options/cluster.application.image.path` must exist.
-1. Any cluster options that are required to be integers must be integers.
-
-This specification is deliberately vague here to avoid duplication: the cluster description structure is currently implicitly defined in
-`org.apache.slider.api.ClusterDescription`.
-
-Currently Slider ignores unknown elements during parsing. This may be changed.
-
-The test for this state does not refer to the cluster filesystem.
-
-#### deployable-cluster(FS, cluster-description)
-
-A Cluster Description defines a deployable cluster if it is a well-defined cluster and its contents contain valid information to deploy a cluster.
-
-This extends the well-defined requirements with:
-
-* The entry `name` must match a supported provider
-* Any elements that name the cluster match the cluster name as defined by the path to the cluster:
-
-    originConfigurationPath == original-conf-path(FS, clustername)
-    generatedConfigurationPath == generated-conf-path(FS, clustername)
-    dataPath == data-path(FS, clustername)
-
-* The paths defined in `originConfigurationPath`, `generatedConfigurationPath` and `dataPath` must all exist.
-* `options/zookeeper.path` must be defined and refer to a path in the ZK cluster
-defined by (`options/zookeeper.hosts`, `options/zookeeper.port`) to which the user has write access (required by HBase and Accumulo).
-* If `options/cluster.application.image.path` is defined, it must exist and be readable by the user.
-* It must declare a type that maps to a provider entry in the Slider client's XML configuration:
-
-    len(clusterspec["type"]) > 0
-    clientconfig["slider.provider." + clusterspec["type"]] != null
-
-* That entry must map to a class on the classpath which can be instantiated
-and cast to `SliderProviderFactory`.
- - let classname = clientconfig["slider.provider."+ clusterspec["type"]] - (Class.forName(classname).newInstance()) instanceof SliderProviderFactory - -#### valid-for-provider(cluster-description, provider) - -A provider considers a specification valid if its own validation logic is satisfied. This normally -consists of rules about the number of instances of different roles; it may include other logic. - http://git-wip-us.apache.org/repos/asf/incubator-slider/blob/209cee43/src/site/markdown/troubleshooting.md ---------------------------------------------------------------------- diff --git a/src/site/markdown/troubleshooting.md b/src/site/markdown/troubleshooting.md deleted file mode 100644 index 42bef8e..0000000 --- a/src/site/markdown/troubleshooting.md +++ /dev/null @@ -1,154 +0,0 @@ -<!--- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. 
--->
-
-# Apache Slider Troubleshooting
-
-Slider can be tricky to start using because it combines the need to set
-up a YARN application with the need to have an HBase configuration
-that works.
-
-## Common problems
-
-### Classpath for Slider AM wrong
-
-The Slider Application Master (the "Slider AM") builds up its classpath from
-the JARs it has locally and the JARs pre-installed on the classpath.
-
-This often surfaces in an exception that can be summarized as
-"hadoop-common.jar is not on the classpath":
-
-    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ExitUtil$ExitException
-    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ExitUtil$ExitException
-        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
-        at java.security.AccessController.doPrivileged(Native Method)
-        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
-        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
-        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
-        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
-    Could not find the main class: org.apache.hadoop.yarn.service.launcher.ServiceLauncher. Program will exit.
-
-For Ambari-managed deployments, we recommend the following:
-
-    <property>
-      <name>yarn.application.classpath</name>
-      <value>
-        /etc/hadoop/conf,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*,/usr/lib/hadoop-hdfs/*,/usr/lib/hadoop-hdfs/lib/*,/usr/lib/hadoop-yarn/*,/usr/lib/hadoop-yarn/lib/*,/usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-mapreduce/lib/*
-      </value>
-    </property>
-
-The `yarn-site.xml` file for the site will contain the relevant value.
-
-### Application Instantiation fails, "TriggerClusterTeardownException: Unstable Cluster"
-
-Slider gives up if it cannot keep enough instances of a role running, or, more
-precisely, if they keep failing.
-
-If this happens on cluster startup, it means that the application is not working
-
-    org.apache.slider.core.exceptions.TriggerClusterTeardownException: Unstable Cluster:
-    failed with role worker failing 4 times (4 in startup); threshold is 2
-    last failure: Failure container_1386872971874_0001_01_000006 on host 192.168.1.86,
-    see http://hor12n22.gq1.ygridcore.net:19888/jobhistory/logs/192.168.1.86:45454/container_1386872971874_0001_01_000006/ctx/yarn
-
-This message warns that a role (here "worker") is failing to start and has failed
-more than the configured failure threshold. What it doesn't say is why the role failed,
-because that is not something the AM knows: the cause is hidden in the logs on
-the container that failed.
-
-The final part of the exception message can help you track down the problem,
-as it points you to the logs.
-
-In the example above the failure was in `container_1386872971874_0001_01_000006`
-on the host `192.168.1.86`. If you go to the node manager on that machine (the YARN
-RM web page will let you do this) and look for that container,
-you may be able to grab the logs from it.
-
-A quicker way is to browse to the URL on the last line.
-Note: that URL depends on `yarn.log.server.url` being properly configured.
-
-It is from those logs that the cause of the problem can be determined, because they
-are the actual output of the application which Slider is trying to deploy.
-
-### Not all the containers start, but whenever you kill one, another one comes up
-
-This is often caused by YARN not having enough capacity in the cluster to start
-up the requested set of containers. The AM has submitted a list of container
-requests to YARN, but only when an existing container is released or killed
-is one of the outstanding requests granted.
-
-Fix #1: Ask for smaller containers:
-edit the `yarn.memory` option for roles to be smaller: set it to 64 for a smaller
-YARN allocation. *This does not affect the actual heap size of the
-application component deployed.*
-
-Fix #2: Tell YARN to be less strict about memory consumption.
-
-Here are the properties in `yarn-site.xml` which we set to allow YARN
-to schedule more role instances than it nominally has room for.
-
-    <property>
-      <name>yarn.scheduler.minimum-allocation-mb</name>
-      <value>1</value>
-    </property>
-    <property>
-      <description>Whether physical memory limits will be enforced for
-        containers.
-      </description>
-      <name>yarn.nodemanager.pmem-check-enabled</name>
-      <value>false</value>
-    </property>
-    <!-- we really don't want checking here-->
-    <property>
-      <name>yarn.nodemanager.vmem-check-enabled</name>
-      <value>false</value>
-    </property>
-
-If you create too many instances, your hosts will start swapping and
-performance will collapse; we do not recommend using this in production.
-
-### Configuring YARN for better debugging
-
-One configuration to aid debugging is to tell the NodeManagers to
-keep data for a short period after containers finish
-
-    <!-- 10 minutes after a failure to see what is left in the directory-->
-    <property>
-      <name>yarn.nodemanager.delete.debug-delay-sec</name>
-      <value>600</value>
-    </property>
-
-You can then retrieve logs via the web UI, or by connecting to the
-server (usually by `ssh`) and retrieving them from the log directory.
-
-We also recommend making sure that YARN kills processes
-
-    <!-- time before the process gets a -9 -->
-    <property>
-      <name>yarn.nodemanager.sleep-delay-before-sigkill.ms</name>
-      <value>30000</value>
-    </property>
-
-
http://git-wip-us.apache.org/repos/asf/incubator-slider/blob/209cee43/src/site/resources/hoya_am_architecture.png
----------------------------------------------------------------------
diff --git a/src/site/resources/hoya_am_architecture.png b/src/site/resources/hoya_am_architecture.png
deleted file mode 100644
index 191a8db..0000000
Binary files a/src/site/resources/hoya_am_architecture.png and
/dev/null differ http://git-wip-us.apache.org/repos/asf/incubator-slider/blob/209cee43/src/site/resources/images/app_config_folders_01.png ---------------------------------------------------------------------- diff --git a/src/site/resources/images/app_config_folders_01.png b/src/site/resources/images/app_config_folders_01.png deleted file mode 100644 index 4e78b63..0000000 Binary files a/src/site/resources/images/app_config_folders_01.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/incubator-slider/blob/209cee43/src/site/resources/images/app_package_sample_04.png ---------------------------------------------------------------------- diff --git a/src/site/resources/images/app_package_sample_04.png b/src/site/resources/images/app_package_sample_04.png deleted file mode 100644 index 170256b..0000000 Binary files a/src/site/resources/images/app_package_sample_04.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/incubator-slider/blob/209cee43/src/site/resources/images/image_0.png ---------------------------------------------------------------------- diff --git a/src/site/resources/images/image_0.png b/src/site/resources/images/image_0.png deleted file mode 100644 index e62a3e7..0000000 Binary files a/src/site/resources/images/image_0.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/incubator-slider/blob/209cee43/src/site/resources/images/image_1.png ---------------------------------------------------------------------- diff --git a/src/site/resources/images/image_1.png b/src/site/resources/images/image_1.png deleted file mode 100644 index d0888ac..0000000 Binary files a/src/site/resources/images/image_1.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/incubator-slider/blob/209cee43/src/site/resources/images/managed_client.png ---------------------------------------------------------------------- diff --git a/src/site/resources/images/managed_client.png b/src/site/resources/images/managed_client.png deleted 
file mode 100644 index 9c094b1..0000000 Binary files a/src/site/resources/images/managed_client.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/incubator-slider/blob/209cee43/src/site/resources/images/slider-container.png ---------------------------------------------------------------------- diff --git a/src/site/resources/images/slider-container.png b/src/site/resources/images/slider-container.png deleted file mode 100644 index 2e02833..0000000 Binary files a/src/site/resources/images/slider-container.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/incubator-slider/blob/209cee43/src/site/resources/images/unmanaged_client.png ---------------------------------------------------------------------- diff --git a/src/site/resources/images/unmanaged_client.png b/src/site/resources/images/unmanaged_client.png deleted file mode 100644 index 739d56d..0000000 Binary files a/src/site/resources/images/unmanaged_client.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/incubator-slider/blob/209cee43/src/site/site.xml ---------------------------------------------------------------------- diff --git a/src/site/site.xml b/src/site/site.xml deleted file mode 100644 index 12dc5cf..0000000 --- a/src/site/site.xml +++ /dev/null @@ -1,84 +0,0 @@ -<?xml version="1.0"?> -<!-- - Licensed to the Apache Software Foundation (ASF) under one or more - contributor license agreements. See the NOTICE file distributed with - this work for additional information regarding copyright ownership. - The ASF licenses this file to You under the Apache License, Version 2.0 - (the "License"); you may not use this file except in compliance with - the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
- See the License for the specific language governing permissions and - limitations under the License. ---> -<project name="Apache Slider ${project.version} (incubating)"> -<!-- - - <skin> - <groupId>org.apache.maven.skins</groupId> - <artifactId>maven-stylus-skin</artifactId> - <version>1.2</version> - </skin> - - <skin> - <groupId>org.apache.maven.skins</groupId> - <artifactId>maven-application-skin</artifactId> - <version>1.0</version> - </skin> - ---> - - <skin> - <groupId>org.apache.maven.skins</groupId> - <artifactId>maven-fluido-skin</artifactId> - <version>1.3.0</version> - </skin> - - <custom> - <fluidoSkin> - <topBarEnabled>true</topBarEnabled> - <sideBarEnabled>false</sideBarEnabled> - </fluidoSkin> - </custom> - - <version position="right"/> - - <bannerLeft> - <name>Apache Slider (incubating)</name> - <href>http://slider.incubator.apache.org</href> - </bannerLeft> - - <bannerRight> - <src>http://incubator.apache.org/images/apache-incubator-logo.png</src> - </bannerRight> - - <body> - - <menu ref="reports"/> - - <menu name="Documents"> - <item name="Getting Started" href="/getting_started.html"/> - <item name="manpage" href="/manpage.html"/> - <item name="Troubleshooting" href="/troubleshooting.html"/> - <item name="Architecture" href="/architecture/index.html"/> - <item name="Developing" href="/developing/index.html"/> - <item name="Exitcodes" href="/exitcodes.html"/> - </menu> - - <menu name="ASF"> - <item name="How Apache Works" href="http://www.apache.org/foundation/how-it-works.html"/> - <item name="Developer Documentation" href="http://www.apache.org/dev/"/> - <item name="Foundation" href="http://www.apache.org/foundation/"/> - <item name="Sponsor Apache" href="http://www.apache.org/foundation/sponsorship.html"/> - <item name="Thanks" href="http://www.apache.org/foundation/thanks.html"/> - </menu> - - <footer> - <div class="row-fluid">Apache Slider, Slider, Apache, and the Apache Incubator logo are trademarks of The Apache Software 
Foundation.</div> - </footer> - </body> -</project>