Re: [openstack-dev] [TripleO] Fixing Swift rings when upscaling/replacing nodes in TripleO deployments
On 05.01.2017 17:03, Steven Hardy wrote:
> On Thu, Jan 05, 2017 at 02:56:15PM +, arkady.kanev...@dell.com wrote:
>> I have a concern about relying on the undercloud for overcloud Swift.
>> The undercloud is not HA (yet), so it may not be operational when a disk
>> fails or a Swift overcloud node is added/deleted.
>
> I think the proposal is only for a deploy-time dependency; after the
> overcloud is deployed there should be no dependency on the undercloud
> swift, because the ring data will have been copied to all the nodes.

Yes, exactly - there is no runtime dependency. The overcloud will continue
to work even if the undercloud is gone.

If you "lose" the undercloud (or, more precisely, the overcloud rings that
are stored on the undercloud Swift) you can copy them from any overcloud
node and run an update.

Even if one deletes the rings from the undercloud, the deployment will
continue to work after an update - puppet-swift will simply continue to
use the already existing .builder files on the nodes.

Only if one deletes the rings on the undercloud and then runs an update
with new/replaced nodes will it fail - the swift-recon check will raise an
error in step 5 because the rings are inconsistent on the new/replaced
nodes. But this inconsistency is already the case today (in fact it's the
same way it works today), except that there is no check and no warning to
the operator.

-- Christian

> During create/update operations you need the undercloud operational by
> definition, so I think this is probably OK?
>
> Steve
[openstack-dev] [TripleO] Fixing Swift rings when upscaling/replacing nodes in TripleO deployments
Hello everyone,

there was an earlier discussion on $subject last year [1] regarding a bug
when upscaling or replacing nodes in TripleO [2].

Shortly summarized: Swift rings are built on each node separately, and if
nodes (or disks) are added or replaced, this breaks the rings because they
are no longer consistent across the nodes. What's needed are the previous
ring builder files on each node before changing the rings.

My former idea in [1] was to build the rings in advance on the undercloud,
also using introspection data to gather the set of disks on each node for
the rings.

However, this changes the current way of deploying significantly, and also
requires more work in TripleO and Mistral (for example to trigger a ring
build on the undercloud after the nodes have been started, but before the
deployment triggers the Puppet run).

I prefer smaller steps to keep everything stable for now, and therefore I
changed my patches quite a bit. This is my updated proposal:

1. Two temporary undercloud Swift URLs (one PUT, one GET) will be computed
   before Mistral starts the deployments. A new Mistral action to create
   such URLs is required for this [3].
2. Each overcloud node will try to fetch rings from the undercloud Swift
   deployment using the temporary GET URL before updating its set of rings
   locally. This guarantees that each node uses the same source set of
   builder files. This happens in step 2 [4].
3. puppet-swift runs like today, updating the rings if required.
4. Finally, at the end of the deployment (in step 5), the nodes will
   upload their modified rings to the undercloud using the temporary PUT
   URLs. swift-recon will run before this, ensuring that all rings across
   all nodes are consistent.

The two required patches [3][4] are not overly complex IMO, but they solve
the problem of adding or replacing nodes without changing the current
workflow significantly. It should even be easy to backport them if needed.

I'll continue working on an improved way of deploying Swift rings (using
introspection data), but with this approach it could even be done using
today's workflow, feeding data into puppet-swift (probably with some
updates to puppet-swift/tripleo-heat-templates to allow support for
regions, zones, different disk layouts and the like). However, all of this
could be built on top of these two patches.

I'm curious about your thoughts and welcome any feedback or reviews!

Thanks,

-- Christian

[1] http://lists.openstack.org/pipermail/openstack-dev/2016-August/100720.html
[2] https://bugs.launchpad.net/tripleo/+bug/1609421
[3] https://review.openstack.org/#/c/413229/
[4] https://review.openstack.org/#/c/414460/
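For reference, the temporary GET/PUT URLs in steps 1, 2 and 4 are standard
Swift TempURLs. A minimal sketch of how such a signed URL is computed - the
account path, container and key are illustrative, but the signature scheme
is the one documented for Swift's tempurl middleware:

```python
import hmac
from hashlib import sha1
from time import time

def make_temp_url(method, path, key, ttl=3600):
    """Compute a Swift TempURL for one method on one object.

    path looks like '/v1/AUTH_<account>/<container>/<object>' and key is
    the account's X-Account-Meta-Temp-URL-Key.
    """
    expires = int(time() + ttl)
    hmac_body = '%s\n%s\n%s' % (method, expires, path)
    sig = hmac.new(key.encode(), hmac_body.encode(), sha1).hexdigest()
    return '%s?temp_url_sig=%s&temp_url_expires=%s' % (path, sig, expires)

# One GET URL for fetching the rings, one PUT URL for uploading them back
get_url = make_temp_url('GET', '/v1/AUTH_admin/overcloud/rings.tar.gz', 'secret')
put_url = make_temp_url('PUT', '/v1/AUTH_admin/overcloud/rings.tar.gz', 'secret')
```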
Re: [openstack-dev] [Heat][TripleO] How to run mistral workflows via templates
> we're trying to address in TripleO a couple of use cases for which we'd
> like to trigger a Mistral workflow from a Heat template.
>
> One example where this would be useful is the creation of the Swift
> rings, which need some data related to the Heat stack (like the list of
> Swift nodes and their disks), so it can't be executed in advance, yet it
> provides data which is needed to complete the deployment of the
> overcloud successfully.
>
> Currently we can create a workflow from Heat, but we can't trigger its
> execution, and we also can't block Heat on the result of the execution.
>
> I was wondering if it would make sense to have a property for the
> existing Workflow resource to let the user decide if the workflow should
> *also* be triggered on CREATE/UPDATE? And if it would make sense to
> block the Workflow resource until the execution result is returned in
> that case?

I think it needs to be triggered a bit later, actually? For the Swift use
case it needs to be executed after all instances are created (but
preferably before starting any Puppet actions on the nodes), not when the
CREATE/UPDATE itself starts.

> Alternatively, would an ex-novo Execution resource make more sense?
>
> Or are there different ideas, approaches to the problem?

As a workaround for now I'm using the signal URL and triggering it in a
shell script on the nodes (the shell script runs anyway to fetch and
validate the rings). To avoid multiple parallel workflow executions
triggered by a dozen nodes, I set a flag in the Mistral environment;
further actions will then return immediately.

I'd prefer a different and cleaner approach like you proposed, but for me
that's working well for the moment.

-- Christian
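The run-once guard described in the workaround could look roughly like
this; get_flag and set_flag are hypothetical helpers standing in for
reading and writing a variable in the Mistral environment through
python-mistralclient:

```python
def trigger_ring_sync(get_flag, set_flag, start_workflow):
    """Trigger the ring-sync workflow at most once per deployment.

    Every node calls this via the signal URL from its shell script; only
    the first caller past the flag check actually starts an execution.
    The check-and-set should live inside a single Mistral action to keep
    the race window small.
    """
    if get_flag('ring_sync_triggered'):
        return False  # another node already triggered the workflow
    set_flag('ring_sync_triggered', True)
    start_workflow()
    return True
```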
[openstack-dev] [tripleo] Requesting FFE for improved Swift deployments
Hello,

kindly asking for an FFE for a required setting to improve Swift-based
TripleO deployments:

https://review.openstack.org/#/c/358643/

This is required to land the last patch in a series of TripleO docs
patches:

https://review.openstack.org/#/c/293311/
https://review.openstack.org/#/c/360353/
https://review.openstack.org/#/c/361032/

The current idea is to automate the described manual actions for Ocata.
There was some discussion on the ML as well:

http://lists.openstack.org/pipermail/openstack-dev/2016-August/102053.html

If you are interested in testing this with tripleo-quickstart, here is a
patch to automatically add extra block devices to the overcloud VMs:

https://review.openstack.org/#/c/359630/

Thanks a lot!

-- Christian
Re: [openstack-dev] [TripleO] Improving Swift deployments with TripleO
On 04.08.16 15:39, Giulio Fidente wrote:
> On 08/04/2016 01:26 PM, Christian Schwede wrote:
>> On 04.08.16 10:27, Giulio Fidente wrote:
>>> On 08/02/2016 09:36 PM, Christian Schwede wrote:
>>>> Hello everyone,
>>>
>>> thanks Christian,
>>>
>>>> I'd like to improve the Swift deployments done by TripleO. There are
>>>> a few problems today when deployed with the current defaults:
>>>>
>>>> 1. Adding new nodes (or replacing existing nodes) is not possible,
>>>> because the rings are built locally on each host and a new node
>>>> doesn't know about the "history" of the rings. Therefore rings might
>>>> become different on the nodes, and that results in an unusable state
>>>> eventually.
>>>
>>> one of the ideas for this was to use a tempurl in the undercloud swift
>>> where to upload the rings built by a single overcloud node, not by the
>>> undercloud
>>>
>>> so I proposed a new heat resource which would permit us to create a
>>> swift tempurl in the undercloud during the deployment
>>>
>>> https://review.openstack.org/#/c/350707/
>>>
>>> if we build the rings on the undercloud we can ignore this and use a
>>> mistral action instead, as pointed out by Steven
>>>
>>> the good thing about building rings in the overcloud is that it
>>> doesn't force us to have a static node mapping for each and every
>>> deployment, but it makes it hard to cope with heterogeneous
>>> environments
>>
>> That's true. However - we still need to collect the device data from
>> all the nodes from the undercloud, push it to at least one overcloud
>> node, build/update the rings there, push it to the undercloud Swift and
>> use that on all overcloud nodes. Or not?
>
> sure, let's build on the undercloud, when automated with mistral it
> shouldn't make a big difference for the user
>
>> I was also thinking more about the static node mapping and how to avoid
>> this. Could we add a host alias using the node UUIDs? That would never
>> change, it's available from the introspection data and therefore could
>> be used in the rings.
>>
>> http://docs.openstack.org/developer/tripleo-docs/advanced_deployment/node_specific_hieradata.html#collecting-the-node-uuid
>
> right, this is the mechanism I wanted to use to provide per-node disk
> maps, it's how it works for ceph disks as well

I looked into this further and proposed a patch upstream:

https://review.openstack.org/358643

This worked fine in my tests; an example /etc/hosts looks like this:

http://paste.openstack.org/show/562206/

Based on that patch we could build the Swift rings even if the nodes are
down and have never been deployed, because the system uuid will never
change and is unique.

I updated my tripleo-swift-ring-tool and just ran a successful deployment
with the patch (also using the merged patches from Giulio).

Let me know what you think about it - I think with that patch we could
integrate the tripleo-swift-ring-tool.

-- Christian

>>>> 2. The rings are only using a single device, and it seems that this
>>>> is just a directory and not a mountpoint with a real device.
>>>> Therefore data is stored on the root device - even if you have 100TB
>>>> disk space in the background. If not fixed manually, your root device
>>>> will run out of space eventually.
>>>
>>> for the disks instead I am thinking to add a create_resources wrapper
>>> in puppet-swift:
>>>
>>> https://review.openstack.org/#/c/350790
>>> https://review.openstack.org/#/c/350840/
>>>
>>> so that we can pass via hieradata per-node swift::storage::disks maps
>>>
>>> we have a mechanism to push per-node hieradata based on the system
>>> uuid, we could extend the tool to capture the nodes' (system) uuid and
>>> generate per-node maps
>>
>> Awesome, thanks Giulio!
>>
>> I will test that today. So the tool could generate the mapping
>> automatically, and we don't need to filter puppet facts on the nodes
>> itself. Nice!
>
> and we could re-use the same tool to generate the ceph::osds disk maps
> as well :)
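A hedged sketch of what generating such per-node maps could look like in
the tripleo-swift-ring-tool: the hieradata file naming (<system uuid>.yaml)
follows the node-specific hieradata mechanism linked above, while the disk
filtering and the exact value format of the swift::storage::disks map are
illustrative assumptions:

```python
import yaml

def write_node_hieradata(system_uuid, disks, root_dev='vda'):
    """Write a per-node swift::storage::disks map keyed by system uuid.

    disks is a list of device names (e.g. ['vda', 'vdb', 'vdc']) taken
    from the node's introspection data.
    """
    devices = {}
    for dev in disks:
        if dev == root_dev:
            continue  # never put Swift data on the root device
        devices[dev] = {'device': dev,
                        'mount_point': '/srv/node/%s' % dev}
    hieradata = {'swift::storage::disks': devices}
    filename = '%s.yaml' % system_uuid
    with open(filename, 'w') as f:
        yaml.safe_dump(hieradata, f, default_flow_style=False)
    return filename

# Hypothetical node uuid and disk list for illustration
write_node_hieradata('32e87b4c-c4a7-41be-865b-191684a6883b',
                     ['vda', 'vdb', 'vdc'])
```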
Re: [openstack-dev] [TripleO] Improving Swift deployments with TripleO
On 04.08.16 10:27, Giulio Fidente wrote:
> On 08/02/2016 09:36 PM, Christian Schwede wrote:
>> Hello everyone,
>
> thanks Christian,
>
>> I'd like to improve the Swift deployments done by TripleO. There are a
>> few problems today when deployed with the current defaults:
>>
>> 1. Adding new nodes (or replacing existing nodes) is not possible,
>> because the rings are built locally on each host and a new node doesn't
>> know about the "history" of the rings. Therefore rings might become
>> different on the nodes, and that results in an unusable state
>> eventually.
>
> one of the ideas for this was to use a tempurl in the undercloud swift
> where to upload the rings built by a single overcloud node, not by the
> undercloud
>
> so I proposed a new heat resource which would permit us to create a
> swift tempurl in the undercloud during the deployment
>
> https://review.openstack.org/#/c/350707/
>
> if we build the rings on the undercloud we can ignore this and use a
> mistral action instead, as pointed out by Steven
>
> the good thing about building rings in the overcloud is that it doesn't
> force us to have a static node mapping for each and every deployment,
> but it makes it hard to cope with heterogeneous environments

That's true. However - we still need to collect the device data from all
the nodes from the undercloud, push it to at least one overcloud node,
build/update the rings there, push it to the undercloud Swift and use that
on all overcloud nodes. Or not?

That leaves some room for new inconsistencies IMO: how do we ensure that
the overcloud node uses the latest ring to start with? Also, ring building
has to be limited to one overcloud node, otherwise we might end up with
multiple ring-building nodes. And how can an operator manually modify the
rings?

The tool to build the rings on the undercloud could be further improved
later; for example, I'd like to be able to move data to new nodes slowly
over time, and also query existing storage servers about the progress.
Therefore we need some more functionality than is currently available in
the ring-building part of puppet-swift IMO.

I think if we move this step to the undercloud we could solve a lot of
these challenges in a consistent way. WDYT?

I was also thinking more about the static node mapping and how to avoid
this. Could we add a host alias using the node UUIDs? That would never
change, it's available from the introspection data and therefore could be
used in the rings.

http://docs.openstack.org/developer/tripleo-docs/advanced_deployment/node_specific_hieradata.html#collecting-the-node-uuid

>> 2. The rings are only using a single device, and it seems that this is
>> just a directory and not a mountpoint with a real device. Therefore
>> data is stored on the root device - even if you have 100TB disk space
>> in the background. If not fixed manually, your root device will run out
>> of space eventually.
>
> for the disks instead I am thinking to add a create_resources wrapper in
> puppet-swift:
>
> https://review.openstack.org/#/c/350790
> https://review.openstack.org/#/c/350840/
>
> so that we can pass via hieradata per-node swift::storage::disks maps
>
> we have a mechanism to push per-node hieradata based on the system uuid,
> we could extend the tool to capture the nodes' (system) uuid and
> generate per-node maps

Awesome, thanks Giulio!

I will test that today. So the tool could generate the mapping
automatically, and we don't need to filter puppet facts on the nodes
itself. Nice!

> then, with the above puppet changes and having the per-node map and the
> rings download url, we could feed them to the templates, replace the
> ring-building implementation via an environment, and deploy without
> further customizations
>
> what do you think?

Yes, that sounds like a good plan to me. I'll continue working on the
ringbuilder tool for now and see how I integrate this into the Mistral
actions.

-- Christian
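Building the rings on the undercloud can reuse Swift's own ring builder
library directly - this is what the swift-ring-builder CLI uses under the
hood. A minimal sketch with illustrative IPs, device names and weights:

```python
from swift.common.ring import RingBuilder

# part_power=10, replicas=3, min_part_hours=1
builder = RingBuilder(10, 3, 1)
for i, ip in enumerate(('192.168.24.10', '192.168.24.11', '192.168.24.12')):
    builder.add_dev({'id': i, 'region': 1, 'zone': 1,
                     'ip': ip, 'port': 6000,
                     'device': 'vdb', 'weight': 100.0, 'meta': ''})
builder.rebalance()
builder.save('object.builder')              # builder history, needed for updates
builder.get_ring().save('object.ring.gz')   # what the servers actually load
```

Keeping the .builder files around (and distributing them, e.g. via the
undercloud Swift) is exactly what preserves the ring "history" that the
per-node builds lose today.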
Re: [openstack-dev] [TripleO] Improving Swift deployments with TripleO
Thanks Steven for your feedback! Please see my answers inline.

On 02.08.16 23:46, Steven Hardy wrote:
> On Tue, Aug 02, 2016 at 09:36:45PM +0200, Christian Schwede wrote:
>> Hello everyone,
>>
>> I'd like to improve the Swift deployments done by TripleO. There are a
>> few problems today when deployed with the current defaults:
>
> Thanks for digging into this, I'm aware this has been something of a
> known issue for some time, so it's great to see it getting addressed :)
>
> Some comments inline;
>
>> 1. Adding new nodes (or replacing existing nodes) is not possible,
>> because the rings are built locally on each host and a new node doesn't
>> know about the "history" of the rings. Therefore rings might become
>> different on the nodes, and that results in an unusable state
>> eventually.
>>
>> 2. The rings are only using a single device, and it seems that this is
>> just a directory and not a mountpoint with a real device. Therefore
>> data is stored on the root device - even if you have 100TB disk space
>> in the background. If not fixed manually, your root device will run out
>> of space eventually.
>>
>> 3. Even if a real disk is mounted in /srv/node, replacing a faulty disk
>> is much more troublesome. Normally you would simply unmount a disk, and
>> then replace the disk sometime later. But because mount_check is set to
>> False in the storage servers, data will be written to the root device
>> in the meantime; and when you finally mount the disk again, you can't
>> simply clean up.
>>
>> 4. In general, it's not possible to change the cluster layout (using
>> different zones/regions/partition powers/device weights, slowly adding
>> new devices to avoid 25% of the data being moved immediately when
>> adding new nodes to a small cluster, ...). You could manage your rings
>> manually, but they will eventually be overwritten when updating your
>> overcloud.
>>
>> 5. Missing erasure coding support (or storage policies in general)
>>
>> This sounds bad; however, most of the current issues can be fixed using
>> customized templates and some tooling to create the rings in advance on
>> the undercloud node.
>>
>> The information about all the devices can be collected from the
>> introspection data, and by using node placement the node names in the
>> rings are known in advance even if the nodes are not yet powered on.
>> This ensures a consistent ring state, and an operator can modify the
>> rings if needed to customize the cluster layout.
>>
>> Using some customized templates we can already do the following:
>> - disable ringbuilding on the nodes
>> - create filesystems on the extra blockdevices
>> - copy ringfiles from the undercloud, using pre-built rings
>> - enable mount_check by default
>> - (define storage policies if needed)
>>
>> I started working on a POC using tripleo-quickstart, some custom
>> templates and a small Python tool to build rings based on the
>> introspection data:
>>
>> https://github.com/cschwede/tripleo-swift-ring-tool
>>
>> I'd like to get some feedback on the tool and templates.
>>
>> - Does this make sense to you?
>
> Yes, I think the basic workflow described should work, and it's good to
> see that you're passing the ring data via swift, as this is consistent
> with how we already pass some data to nodes via our DeployArtifacts
> interface:
>
> https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/deploy-artifacts.yaml
>
> Note however that there are no credentials to access the undercloud
> swift on the nodes, so you'll need to pass a tempurl reference in (which
> is what we do for deploy artifacts; obviously you will have credentials
> to create the container & tempurl on the undercloud).

Ah, that's very useful! I updated my POC; that makes one less customized
template and less code to support in the Python tool. Works as expected!

> One slight concern I have is mandating the use of predictable placement
> - it'd be nice to think about ways we might avoid that, but the
> undercloud-centric approach seems OK for a first pass (in either case I
> think the delivery via swift will be the same).

Do you mean the predictable artifact filename? We could just add a
randomized prefix to the filename IMO.

>> - How (and where) could we integrate this upstream?
>
> So I think the DeployArtifacts interface may work for this, and we have
> a helper script that can upload data to swift:
>
> https://github.com/openstack/tripleo-common/blob/master/scripts/upload-swift-artifacts
>
> This basically pushes a tarball to swift, creates a tempurl, then creates
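For illustration, this is roughly what such an artifact upload plus
tempurl creation looks like with python-swiftclient; the auth endpoint,
credentials, key and object names are assumptions, not values from the
actual helper script:

```python
from swiftclient import client as swift_client
from swiftclient.utils import generate_temp_url

conn = swift_client.Connection(authurl='http://192.168.24.1:5000/v2.0',
                               user='admin', key='password',
                               tenant_name='admin', auth_version='2.0')
conn.put_container('overcloud-artifacts')
with open('swift-rings.tar.gz', 'rb') as f:
    conn.put_object('overcloud-artifacts', 'swift-rings.tar.gz', f)

# Sign a GET TempURL so nodes without undercloud credentials can fetch it
key = 'secret-temp-url-key'
conn.post_account(headers={'x-account-meta-temp-url-key': key})
storage_url, _token = conn.get_auth()
host, account = storage_url.split('/v1/')
path = '/v1/%s/overcloud-artifacts/swift-rings.tar.gz' % account
temp_url = host + generate_temp_url(path, 3600, key, 'GET')
```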
[openstack-dev] [TripleO] Improving Swift deployments with TripleO
Hello everyone,

I'd like to improve the Swift deployments done by TripleO. There are a few
problems today when deployed with the current defaults:

1. Adding new nodes (or replacing existing nodes) is not possible, because
   the rings are built locally on each host and a new node doesn't know
   about the "history" of the rings. Therefore rings might become
   different on the nodes, and that results in an unusable state
   eventually.

2. The rings are only using a single device, and it seems that this is
   just a directory and not a mountpoint with a real device. Therefore
   data is stored on the root device - even if you have 100TB disk space
   in the background. If not fixed manually, your root device will run
   out of space eventually.

3. Even if a real disk is mounted in /srv/node, replacing a faulty disk
   is much more troublesome. Normally you would simply unmount a disk,
   and then replace the disk sometime later. But because mount_check is
   set to False in the storage servers, data will be written to the root
   device in the meantime; and when you finally mount the disk again, you
   can't simply clean up.

4. In general, it's not possible to change the cluster layout (using
   different zones/regions/partition powers/device weights, slowly adding
   new devices to avoid 25% of the data being moved immediately when
   adding new nodes to a small cluster, ...). You could manage your rings
   manually, but they will eventually be overwritten when updating your
   overcloud.

5. Missing erasure coding support (or storage policies in general)

This sounds bad; however, most of the current issues can be fixed using
customized templates and some tooling to create the rings in advance on
the undercloud node.

The information about all the devices can be collected from the
introspection data, and by using node placement the node names in the
rings are known in advance even if the nodes are not yet powered on. This
ensures a consistent ring state, and an operator can modify the rings if
needed to customize the cluster layout.

Using some customized templates we can already do the following:

- disable ringbuilding on the nodes
- create filesystems on the extra blockdevices
- copy ringfiles from the undercloud, using pre-built rings
- enable mount_check by default
- (define storage policies if needed)

I started working on a POC using tripleo-quickstart, some custom templates
and a small Python tool to build rings based on the introspection data:

https://github.com/cschwede/tripleo-swift-ring-tool

I'd like to get some feedback on the tool and templates.

- Does this make sense to you?
- How (and where) could we integrate this upstream?
- Could the templates be included in tripleo-heat-templates?

IMO the most important change would be to avoid overwriting rings on the
overcloud. There is a good chance of messing up your cluster if the
template to disable ring building isn't used and you already have working
rings in place. The same goes for the mount_check option.

I'm curious about your thoughts!

Thanks,

Christian

-- 
Christian Schwede

Red Hat GmbH
Technopark II, Haus C, Werner-von-Siemens-Ring 11-15, 85630 Grasbrunn
Handelsregister: Amtsgericht Muenchen HRB 153243
Geschaeftsfuehrer: Mark Hegarty, Charlie Peters, Michael Cunningham,
Charles Cachera
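A hedged sketch of the introspection part: ironic-inspector stores a JSON
blob per node whose inventory contains the discovered block devices. The
size threshold and the idea of skipping small disks are illustrative
assumptions about how a tool like the one above could pick devices:

```python
import json

def candidate_swift_devices(introspection_json, min_size=10 * 1024 ** 3):
    """Return block device names big enough to be used as Swift devices."""
    data = json.loads(introspection_json)
    devices = []
    for disk in data['inventory']['disks']:
        if disk['size'] < min_size:
            continue  # too small, probably not meant for object storage
        devices.append(disk['name'])  # e.g. '/dev/vdb'
    return devices
```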
Re: [openstack-dev] [swift] On Object placement
Hello Jonathan,

On 18.02.15 18:13, Halterman, Jonathan wrote:
>>> 1. Swift should allow authorized services to place a given number of
>>> object replicas onto a particular rack, and onto separate racks.
>>
>> This is already possible if you use zones and regions in your ring
>> files. For example, if you have 2 racks, you could assign one zone to
>> each of them and Swift places at least one replica on each rack.
>> Because Swift takes care of the device weight you could also ensure
>> that a specific rack gets two copies, and another rack only one.
>
> Presumably a deployment would/should match the DC layout, where racks
> could correspond to AZs.

yes, that makes a lot of sense (to assign zones to racks), because in this
case you can ensure that there aren't multiple replicas stored within the
same rack. You can still access your data if a rack goes down (power,
network, maintenance).

>> However, this is only true as long as all primary nodes are accessible.
>> If Swift stores data on a handoff node, this data might be written to a
>> different node first and moved to the primary node later on. Note that
>> placing objects on nodes other than the primary nodes (for example
>> using an authorized service as you described) will only store the data
>> on these nodes until the replicator moves the data to the primary nodes
>> described by the ring. As far as I can see there is no way to ensure
>> that an authorized service can decide where to place data and that this
>> data stays on the selected nodes. That would require a fundamental
>> change within Swift.
>
> So - how can we influence where data is stored? In terms of placement
> based on a hash ring, I'm thinking of perhaps restricting the placement
> of an object to a subset of the ring based on a zone. We can still hash
> an object somewhere on the ring; for the purposes of controlling
> locality, we just want it to be within (or without) a particular zone.
> Any ideas?

You can't (at least not from the client side). The ring determines the
placement, and if you have more zones (or regions) than replicas you can't
ensure an object replica is stored within a chosen rack. Even if you store
it on a handoff node it will be moved to the primary node sooner or later.
Determining that an object is stored in a specific zone is not possible
with the current architecture; you can only discover in which zone it will
finally be placed (based on the ring).

What you could do (especially if you have more racks than replicas) is to
use storage policies and assign only three racks to each policy, splitting
them into three zones (if you store three replicas). For example, let's
assume you have 5 racks; then you create 5 storage policies (SP) with the
following assignment:

          Rack
    SP    1  2  3  4  5
    0     x  x  x
    1        x  x  x
    2           x  x  x
    3     x        x  x
    4     x  x        x

Doing this you can ensure the following:

- Data is distributed somewhat evenly across the cluster (if you use the
  storage policies evenly, the data is also evenly distributed)
- For a given SP you can ensure that a replica is stored in a specific
  rack; and because an SP is assigned to a container, you can determine
  the SP based on the container metadata (name SP0 rack_1_2_3 and so on
  to make it even simpler for the application to determine the racks).

Could that help in your case?

>>> 2. Swift should allow authorized services and administrators to learn
>>> which racks an object resides on, along with endpoints.
>>
>> You already mentioned the endpoint middleware, though it is currently
>> not protected and unauthenticated access is allowed if enabled.
>
> This is good to know. We still need to learn which rack an object
> resides on though. This information is important in determining whether
> a swift object resides on the same rack as a VM.

Well, that information is available using the /endpoints middleware: you
know the server IPs in a rack, and can compare them to the output from the
endpoint middleware.

You could easily add another small middleware to the pipeline to check
authentication and grant or deny access to /endpoints based on the
authentication.

>> You can also get the node (and disk) if you have access to the ring
>> files. There is a tool included in the Swift source code called
>> swift-get-nodes; however you could simply reuse the existing code in
>> your own projects.
>
> I'm guessing this would not work for in-cloud services?

Do you mean public cloud services? You always need access to the storage
servers themselves to access objects directly, and these should be
accessible only via an internal, protected network (and only the proxy
servers should have access to that network).

Christian
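As mentioned above, with access to the ring files you can resolve objects
to nodes yourself; this is essentially what swift-get-nodes does (account,
container and object names here are illustrative):

```python
from swift.common.ring import Ring

ring = Ring('/etc/swift/object.ring.gz')
part, nodes = ring.get_nodes('AUTH_test', 'mycontainer', 'myobject')
for node in nodes:
    # Map node['ip'] to a rack using your own inventory to answer the
    # "same rack as the VM?" question.
    print('%(ip)s:%(port)s device=%(device)s zone=%(zone)s' % node)
```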
Re: [openstack-dev] [swift] On Object placement
Hello Jonathan,

On 17.02.15 22:17, Halterman, Jonathan wrote:
> Various services desire the ability to control the location of data
> placed in Swift in order to minimize network saturation when moving
> data to compute, or in the case of services like Hadoop, to ensure that
> compute can be moved to wherever the data resides. Read/write latency
> can also be minimized by allowing authorized services to place one or
> more replicas onto the same rack (with other replicas being placed on
> separate racks). Fault tolerance can also be enhanced by ensuring that
> some replica(s) are placed onto separate racks. Breaking this down we
> come up with the following potential requirements:
>
> 1. Swift should allow authorized services to place a given number of
> object replicas onto a particular rack, and onto separate racks.

This is already possible if you use zones and regions in your ring files.
For example, if you have 2 racks, you could assign one zone to each of
them and Swift places at least one replica on each rack. Because Swift
takes care of the device weight you could also ensure that a specific rack
gets two copies, and another rack only one.

However, this is only true as long as all primary nodes are accessible. If
Swift stores data on a handoff node, this data might be written to a
different node first and moved to the primary node later on.

Note that placing objects on nodes other than the primary nodes (for
example using an authorized service as you described) will only store the
data on these nodes until the replicator moves the data to the primary
nodes described by the ring. As far as I can see there is no way to ensure
that an authorized service can decide where to place data and that this
data stays on the selected nodes. That would require a fundamental change
within Swift.

> 2. Swift should allow authorized services and administrators to learn
> which racks an object resides on, along with endpoints.

You already mentioned the endpoint middleware, though it is currently not
protected and unauthenticated access is allowed if enabled. You could
easily add another small middleware to the pipeline to check
authentication and grant or deny access to /endpoints based on the
authentication.

You can also get the node (and disk) if you have access to the ring files.
There is a tool included in the Swift source code called swift-get-nodes;
however, you could simply reuse the existing code in your own projects.

Christian
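A minimal sketch of such a protecting middleware, assuming Keystone's
auth_token middleware runs earlier in the pipeline and sets the X-Roles
header; the role name and the paste filter wiring are illustrative:

```python
class EndpointsAuth(object):
    """Allow /endpoints requests only for a configured Keystone role."""

    def __init__(self, app, allowed_role='admin'):
        self.app = app
        self.allowed_role = allowed_role

    def __call__(self, env, start_response):
        if env.get('PATH_INFO', '').startswith('/endpoints'):
            roles = (env.get('HTTP_X_ROLES') or '').split(',')
            if self.allowed_role not in roles:
                start_response('403 Forbidden',
                               [('Content-Type', 'text/plain')])
                return ['Access to /endpoints denied\n']
        return self.app(env, start_response)

def filter_factory(global_conf, **local_conf):
    role = local_conf.get('allowed_role', 'admin')
    def factory(app):
        return EndpointsAuth(app, role)
    return factory
```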
Re: [openstack-dev] [swift] LTFS integration with OpenStack Swift for scenario like - Data Archival as a Service .
On 14.11.14 20:43, Tim Bell wrote:
> It would need to be tiered (i.e. migrate whole collections rather than
> files) and a local catalog would be needed to map containers to tapes.
> Timeouts would be an issue since we are often waiting hours for recall
> (to ensure that multiple recalls for the same tape are grouped). It is
> not an insolvable problem but it is not just a 'use LTFS' answer.

There were some ad-hoc discussions during the last summit about using
Swift (API) to access data stored on tape. At the same time we talked
about possible data migrations from one storage policy to another, and
this might be an option to think about. Something like this:

1. Data is stored in a container with a storage policy (SP) that defines
   a time-based data migration to some other place.
2. After some time, data is migrated to tape, and only some stubs
   (zero-byte objects) are left on disk.
3. If a client requests such an object, the client gets an error stating
   that the object is temporarily not available (unfortunately there is
   no well-suited HTTP response code for this yet).
4. At this point the object is scheduled to be restored from tape.
5. Finally the object is read from tape and stored on disk again, and
   will be deleted from disk again after some time.

Using this approach only minor modifications inside Swift are required,
for example to send a notification to an external consumer to migrate
data back and forth, and to handle requests for empty stub files. The
migration itself should be done by an external worker that works with
existing solutions from tape vendors.

Just an idea, but it might be worth investigating further (because more
and more people seem to be interested in this, especially from the
science community).

Christian
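A hedged sketch of step 3, the part that needs only minor changes inside
Swift: a small proxy middleware rejecting reads of zero-byte stubs. The
metadata header marking a stub and the 503/Retry-After choice are
illustrative assumptions (as noted above, no HTTP status fits perfectly):

```python
from swift.common.swob import Request, Response

class TapeStubCheck(object):
    """Reject GETs for objects whose data has been migrated to tape."""

    def __init__(self, app):
        self.app = app

    def __call__(self, env, start_response):
        req = Request(env)
        if req.method == 'GET':
            resp = req.get_response(self.app)
            if resp.headers.get('X-Object-Meta-Migrated-To-Tape') == 'true':
                # This would also be the place to notify the external
                # worker to schedule the restore from tape (step 4).
                resp = Response(status='503 Service Unavailable',
                                headers={'Retry-After': '3600'},
                                body='Object is on tape, restore scheduled\n')
            return resp(env, start_response)
        return self.app(env, start_response)

def filter_factory(global_conf, **local_conf):
    return TapeStubCheck
```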
Re: [openstack-dev] [Swift] domain-level quotas
Hi Matthieu,

On 22.01.14 20:02, Matthieu Huin wrote:
> The idea is to have a middleware checking a domain's current usage
> against a limit set in the configuration before allowing an upload. The
> domain id can be extracted from the token, then used to query keystone
> for a list of projects belonging to the domain. Swift would then compute
> the domain usage in a similar fashion as is currently done for accounts,
> and proceed from there.

The problem might be computing the current usage of all accounts within a
domain. It won't be a problem if you have only a few accounts in a domain,
but with tens, hundreds or even thousands of accounts in a domain there
will be a performance impact, because you need to iterate over all
accounts (doing a HEAD on every account) and sum up the total usage.

I think some performance tests would be helpful (doing a HEAD on all
accounts repeatedly, with some PUTs in between) to see if the performance
impact is an issue at all (since there will be a lot of caching involved).

Christian
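A rough sketch of the usage computation in question, written with
python-swiftclient and keystoneclient outside of the middleware for
clarity; the storage URL template (reseller prefix plus project id) is an
assumption about the deployment:

```python
from keystoneclient.v3 import client as ks_client
from swiftclient import client as swift_client

def domain_bytes_used(keystone, token, domain_id,
                      url_tmpl='https://swift.example.com/v1/AUTH_%s'):
    """Sum X-Account-Bytes-Used over all projects of a domain.

    This issues one HEAD per account, which is exactly the part that
    gets expensive with thousands of accounts per domain.
    """
    total = 0
    for project in keystone.projects.list(domain=domain_id):
        headers = swift_client.head_account(url_tmpl % project.id, token)
        total += int(headers.get('x-account-bytes-used', 0))
    return total

# keystone = ks_client.Client(token='...', endpoint='https://keystone/v3')
# print(domain_bytes_used(keystone, '...token...', 'default'))
```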
Re: [openstack-dev] [Swift] Increase Swift ring partition power
On 02.12.13 17:10, Gregory Holt wrote:
> On Dec 2, 2013, at 9:48 AM, Christian Schwede
> christian.schw...@enovance.com wrote:
>> That sounds great! Is someone already working on this (I know about
>> the ongoing DiskFile refactoring) or even a blueprint available?
>
> There is https://blueprints.launchpad.net/swift/+spec/ring-doubling
> though I'm uncertain how up to date it is.

Thanks for the link! I read all the linked entries, reviews and patches,
and it seems all of us wanted to use a similar approach. David put it in
a nutshell: "We can consider this to be the yearly event in which we try
to crack the part_power problem."

I'm going to write some docs and tests for my tool and will link it as a
related project afterwards.

Christian
[openstack-dev] [Swift] Increase Swift ring partition power
Hello everyone,

I'd like to discuss a way to increase the partition power of an existing
Swift cluster. This is most likely interesting for smaller clusters that
are growing beyond their originally planned size. As discussed earlier
[1], a rehashing is required after changing the partition power to make
existing data available again.

My idea is to increase the partition power by 1 and then assign the same
devices to the two resulting partitions (old_partition * 2 and
old_partition * 2 + 1). For example, with these assigned devices on the
old ring:

    Partition 0: 2 3 0
    Partition 1: 1 0 3

the assigned devices on the new ring with partition power +1 are:

    Partition 0: 2 3 0
    Partition 1: 2 3 0
    Partition 2: 1 0 3
    Partition 3: 1 0 3

The hash of an object doesn't change with a new partition power, only the
assigned partition. An object in partition 1 on the old ring will be
assigned to partition 2 OR 3 on the ring with the increased partition
power. Because the devices used are the same for both new partitions, no
data movement to other devices or storage nodes is required (only
locally).

A longer example together with a small tool can be found at
https://github.com/cschwede/swift-ring-tool

Since the device distribution on the new ring might not be optimal, it is
possible to start from a fresh distribution and migrate the ring with the
increased partition power to a ring with a new distribution.

So far this has worked for smaller clusters (with a few hundred TB) as
well as in local SAIO installations. I'd like to discuss this approach
and see if it makes sense to continue working on it and add the tool to
swift, python-swiftclient or stackforge (or whatever else might be
appropriate).

Please let me know what you think.

Best regards,

Christian

[1] http://lists.openstack.org/pipermail/openstack-operators/2013-January/002544.html
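To see why this works, consider how Swift maps an object to a partition:
the top part_power bits of the MD5 of the object path. A minimal sketch of
that mapping (the hash suffix is illustrative and must match the one in
swift.conf):

```python
import struct
from hashlib import md5

HASH_PATH_SUFFIX = b'changeme'  # must match swift.conf on the cluster

def partition(account, container, obj, part_power):
    path = ('/%s/%s/%s' % (account, container, obj)).encode('utf-8')
    digest = md5(path + HASH_PATH_SUFFIX).digest()
    return struct.unpack_from('>I', digest)[0] >> (32 - part_power)

# Increasing the partition power by one just exposes one more bit of the
# same hash, so the new partition is always 2*old or 2*old + 1 - and both
# of those map to the old partition's devices in the doubled ring.
old_part = partition('AUTH_test', 'c', 'o', 16)
new_part = partition('AUTH_test', 'c', 'o', 17)
assert new_part in (2 * old_part, 2 * old_part + 1)
```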
Re: [openstack-dev] [Swift] Increase Swift ring partition power
On 02.12.13 15:47, Gregory Holt wrote:
> Achieving this transparently is part of the ongoing plans, starting
> with things like the DiskFile refactoring and SSync. The idea is to
> isolate the direct disk access from other servers/tools, something that
> (for instance) RSync has today. Once the isolation is there, it should
> be fairly straightforward to have incoming requests for a ring^20
> partition look on the local disk in a directory structure that was
> originally created for a ring^19 partition, or even vice versa. Then,
> there will be no need to move data around just for a ring-doubling or
> halving, and no downtime to do so.

That sounds great! Is someone already working on this (I know about the
ongoing DiskFile refactoring) or even a blueprint available? I was aware
of the idea about multiple rings on the same policy, but not about support
for rings with a modified partition power.

> That said, if you want to create a tool that allows such ring shifting
> in the interim, it should work with smaller clusters that don't mind
> downtime. I would prefer that it not become a core tool checked
> directly into swift/python-swiftclient, just because of the plans
> stated above that should one day make it obsolete.

Yes, that makes a lot of sense. In fact the tool is already working; I
think the best way is to enhance the docs and to list it as a related
Swift project once I'm done with this.

Christian
Re: [openstack-dev] [Swift] Goals for Icehouse
Thanks John for the summary - and all contributors for their work!

> Others are looking into how to grow clusters (changing the partition
> power)

I'm interested in who else is also working on this - I have successfully
increased the partition power of several (smaller) clusters and would like
to discuss my approach with others. Please feel free to contact me so we
can work together on this :)
Re: [openstack-dev] [Swift] erasure codes, digging deeper
A solution to this might be to set the default policy as a configuration
setting in the proxy. If you want a replicated Swift cluster, just allow
this policy in the proxy and set it as the default. The same for an EC
cluster: just set the allowed policy to EC. If you want both (and want to
let your users decide which policy to use), simply configure a list of
allowed policies, with the first one in the list used as the default in
case a policy isn't set during container creation.

On 18.07.13 20:15, Chuck Thier wrote:
> I think you are missing the point. What I'm talking about is who
> chooses what data is EC and what is not. The point that I am trying to
> make is that the operators of swift clusters should decide what data is
> EC, not the clients/users. How the data is stored should be totally
> transparent to the user. Now if we want to offer user-defined classes
> of storage down the road (like how S3 does reduced redundancy), I'm
> cool with that; it should just be orthogonal to the implementation of
> EC.
>
> --
> Chuck
>
> On Thu, Jul 18, 2013 at 12:57 PM, John Dickinson m...@not.mn wrote:
>> Are you talking about the parameters for EC or the fact that something
>> is erasure coded vs replicated?
>>
>> For the first, that's exactly what we're thinking: a deployer sets up
>> one (or more) policies and calls them Alice, Bob, or whatever, and
>> then the API client can set that on a particular container. This
>> allows users who know what they are doing (ie those who know the
>> tradeoffs and their data characteristics) to make good choices. It
>> also allows deployers who want to have an automatic policy to set one
>> up to migrate data. For example, a deployer may choose to run a
>> migrator process that moves certain data from replicated to EC
>> containers over time (and drops a manifest file in the replicated tier
>> to point to the EC data so that the URL still works). Like existing
>> features in Swift (eg large objects), this gives users the ability to
>> flexibly store their data with a nice interface yet still have the
>> ability to get at some of the pokey bits underneath.
>>
>> --John
>>
>> On Jul 18, 2013, at 10:31 AM, Chuck Thier cth...@gmail.com wrote:
>>> I'm with Chmouel though. It seems to me that EC policy should be
>>> chosen by the provider and not the client. For public storage clouds,
>>> I don't think you can make the assumption that all users/clients will
>>> understand the storage/latency tradeoffs and benefits.
>>>
>>> On Thu, Jul 18, 2013 at 8:11 AM, John Dickinson m...@not.mn wrote:
>>>> Check out the slides I linked. The plan is to enable an EC policy
>>>> that is then set on a container. A cluster may have a replication
>>>> policy and one or more EC policies. Then the user will be able to
>>>> choose the policy for a particular container.
>>>>
>>>> --John
>>>>
>>>> On Jul 18, 2013, at 2:50 AM, Chmouel Boudjnah chmo...@enovance.com
>>>> wrote:
>>>>> On Thu, Jul 18, 2013 at 12:42 AM, John Dickinson m...@not.mn wrote:
>>>>>> * Erasure codes (vs replicas) will be set on a per-container basis
>>>>>
>>>>> I was wondering if there was any reason why it couldn't be on a
>>>>> per-account basis, as this would allow an operator to have
>>>>> different types of accounts and different pricing (i.e. tiered
>>>>> storage).
>>>>>
>>>>> Chmouel.