Re: [openstack-dev] [TripleO] Fixing Swift rings when upscaling/replacing nodes in TripleO deployments

2017-01-05 Thread Christian Schwede
On 05.01.2017 17:03, Steven Hardy wrote:
> On Thu, Jan 05, 2017 at 02:56:15PM +, arkady.kanev...@dell.com wrote:
>> I have a concern about relying on the undercloud for overcloud Swift.
>> The undercloud is not HA (yet), so it may not be operational when a disk
>> fails or a Swift overcloud node is added/deleted.
> 
> I think the proposal is only for a deploy-time dependency; after the
> overcloud is deployed there should be no dependency on the undercloud
> swift, because the ring data will have been copied to all the nodes.

Yes, exactly - there is no runtime dependency. The overcloud will
continue to work even if the undercloud is gone.

If you "loose" the undercloud (or more precisely, the overcloud rings
that are stored on the undercloud Swift) you can copy them from any
overcloud node and run an update.

Even if one deletes the rings from the undercloud, the deployment will
continue to work after an update - puppet-swift will simply continue to
use the already existing .builder files on the nodes.

It will only fail if one deletes the rings on the undercloud and then
runs an update with new/replaced nodes - the swift-recon check will
raise an error in step 5 because the rings are inconsistent on the
new/replaced nodes. But that inconsistency already exists today (in
fact, this is exactly how it works today), except that today there is
no check and no warning to the operator.

-- Christian

> During create/update operations you need the undercloud operational by
> definition, so I think this is probably OK?
> 
> Steve
>>
>> -Original Message-
>> From: Christian Schwede [mailto:cschw...@redhat.com] 
>> Sent: Thursday, January 05, 2017 6:14 AM
>> To: OpenStack Development Mailing List <openstack-dev@lists.openstack.org>
>> Subject: [openstack-dev] [TripleO] Fixing Swift rings when 
>> upscaling/replacing nodes in TripleO deployments
>>
>> Hello everyone,
>>
>> there was an earlier discussion on $subject last year [1] regarding a bug 
>> when upscaling or replacing nodes in TripleO [2].
>>
>> In short: Swift rings are built on each node separately, and adding or
>> replacing nodes (or disks) breaks the rings because they are no longer
>> consistent across the nodes. What's needed is to have the previous ring
>> builder files available on each node before changing the rings.
>>
>> My former idea in [1] was to build the rings in advance on the undercloud, 
>> and also using introspection data to gather a set of disks on each node for 
>> the rings.
>>
>> However, this changes the current way of deploying significantly, and also 
>> requires more work in TripleO and Mistral (for example to trigger a ring 
>> build on the undercloud after the nodes have been started, but before the 
>> deployment triggers the Puppet run).
>>
>> I prefer smaller steps to keep everything stable for now, and therefore I 
>> changed my patches quite a bit. This is my updated proposal:
>>
>> 1. Two temporary undercloud Swift URLs (one PUT, one GET) will be computed 
>> before Mistral starts the deployments. A new Mistral action to create such 
>> URLs is required for this [3].
>> 2. Each overcloud node will try to fetch rings from the undercloud Swift 
>> deployment before updating its set of rings locally, using the temporary GET
>> URL. This guarantees that each node uses the same source set of builder
>> files. This happens in step 2. [4]
>> 3. puppet-swift runs like today, updating the rings if required.
>> 4. Finally, at the end of the deployment (in step 5) the nodes will upload 
>> their modified rings to the undercloud using the temporary PUT urls. 
>> swift-recon will run before this, ensuring that all rings across all nodes 
>> are consistent.
>>
>> The two required patches [3][4] are not overly complex IMO, but they solve 
>> the problem of adding or replacing nodes without changing the current 
>> workflow significantly. It should even be easy to backport them if needed.
>>
>> I'll continue working on an improved way of deploying Swift rings (using 
>> introspection data), but with this approach it could even be done using
>> today's workflow, feeding data into puppet-swift (probably with some updates
>> to puppet-swift/tripleo-heat-templates to allow support for regions, zones, 
>> different disk layouts and the like). However, all of this could be built on 
>> top of these two patches.
>>
>> I'm curious about your thoughts and welcome any feedback or reviews!
>>
>> Thanks,
>>
>> -- Christian
>>
>>
>> [1]
>> http://lists.openstack.org/pipermail/openstack-dev/2016-August/100720.html
>

[openstack-dev] [TripleO] Fixing Swift rings when upscaling/replacing nodes in TripleO deployments

2017-01-05 Thread Christian Schwede
Hello everyone,

there was an earlier discussion on $subject last year [1] regarding a
bug when upscaling or replacing nodes in TripleO [2].

In short: Swift rings are built on each node separately, and adding or
replacing nodes (or disks) breaks the rings because they are no longer
consistent across the nodes. What's needed is to have the previous ring
builder files available on each node before changing the rings.

My former idea in [1] was to build the rings in advance on the
undercloud, and also using introspection data to gather a set of disks
on each node for the rings.

However, this changes the current way of deploying significantly, and
also requires more work in TripleO and Mistral (for example to trigger a
ring build on the undercloud after the nodes have been started, but
before the deployment triggers the Puppet run).

I prefer smaller steps to keep everything stable for now, and therefore
I changed my patches quite a bit. This is my updated proposal:

1. Two temporary undercloud Swift URLs (one PUT, one GET) will be
computed before Mistral starts the deployments. A new Mistral action to
create such URLs is required for this [3].
2. Each overcloud node will try to fetch rings from the undercloud Swift
deployment before updating its set of rings locally, using the temporary
GET URL. This guarantees that each node uses the same source set of
builder files. This happens in step 2. [4]
3. puppet-swift runs like today, updating the rings if required.
4. Finally, at the end of the deployment (in step 5) the nodes will
upload their modified rings to the undercloud using the temporary PUT
urls. swift-recon will run before this, ensuring that all rings across
all nodes are consistent.
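
For anyone not familiar with Swift temp URLs: they are plain signed URLs,
so a rough sketch of how such a pair could be computed looks like the
following (the container/object path and key are only examples here, not
what the Mistral action will actually use):

    import hmac
    import time
    from hashlib import sha1

    # key as set via X-Account-Meta-Temp-URL-Key on the undercloud Swift
    # account; path is the full object path of the ring tarball
    def temp_url(method, path, key, valid_for=86400):
        expires = int(time.time()) + valid_for
        body = '%s\n%d\n%s' % (method, expires, path)
        sig = hmac.new(key.encode(), body.encode(), sha1).hexdigest()
        return '%s?temp_url_sig=%s&temp_url_expires=%d' % (path, sig, expires)

    key = 'example-temp-url-key'
    path = '/v1/AUTH_admin/overcloud-rings/rings.tar.gz'
    get_url = temp_url('GET', path, key)
    put_url = temp_url('PUT', path, key)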

The two required patches [3][4] are not overly complex IMO, but they
solve the problem of adding or replacing nodes without changing the
current workflow significantly. It should even be easy to backport them
if needed.

I'll continue working on an improved way of deploying Swift rings (using
introspection data), but with this approach it could even be done using
today's workflow, feeding data into puppet-swift (probably with some
updates to puppet-swift/tripleo-heat-templates to allow support for
regions, zones, different disk layouts and the like). However, all of
this could be built on top of these two patches.

I'm curious about your thoughts and welcome any feedback or reviews!

Thanks,

-- Christian


[1]
http://lists.openstack.org/pipermail/openstack-dev/2016-August/100720.html
[2] https://bugs.launchpad.net/tripleo/+bug/1609421
[3] https://review.openstack.org/#/c/413229/
[4] https://review.openstack.org/#/c/414460/



Re: [openstack-dev] [Heat][TripleO] How to run mistral workflows via templates

2016-12-16 Thread Christian Schwede
> we're trying to address in TripleO a couple of use cases for which we'd
> like to trigger a Mistral workflow from a Heat template.
> 
> One example where this would be useful is the creation of the Swift
> rings, which need some data related to the Heat stack (like the list of
> Swift nodes and their disks), so it can't be executed in advance, yet it
> provides data which is needed to complete successfully the deployment of
> the overcloud.
> 
> Currently we can create a workflow from Heat, but we can't trigger its
> execution and also we can't block Heat on the result of the execution.
> 
> I was wondering if it would make sense to have a property for the
> existing Workflow resource to let the user decide if the workflow should
> *also* be triggered on CREATE/UPDATE? And if it would make sense to
> block the Workflow resource until the execution result is returned in
> that case?

I think it needs to be triggered a bit later actually? For the Swift use
case it needs to be executed after all instances are created (but
preferably before starting any Puppet actions on the nodes), not when
the CREATE/UPDATE itself actually starts.

> Alternatively, would an ex-novo Execution resource make more sense?
> 
> Or are there different ideas, approaches to the problem?

As a workaround for now I'm using the signal URL and triggering it in a
shell script on the nodes (the shell script runs anyway to fetch and
validate the rings). To avoid multiple parallel workflow executions
triggered by a dozen nodes, I set a flag in the Mistral environment;
subsequent actions then return immediately.
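
Roughly, the per-node part of that workaround looks like this (sketched
in Python here for readability - the real thing is a shell script, and
the two URL variables are just placeholders for values passed in via
Heat):

    import os
    import requests

    # fetch the current builder/ring files via the temporary GET URL
    resp = requests.get(os.environ['RING_GET_TEMPURL'], timeout=60)
    resp.raise_for_status()
    with open('/tmp/swift-rings.tar.gz', 'wb') as f:
        f.write(resp.content)

    # ... validate and extract the rings into /etc/swift here ...

    # hit the signal URL; on the Mistral side a flag in the workflow
    # environment ensures only the first node's signal actually starts an
    # execution, all later ones return immediately
    requests.post(os.environ['WORKFLOW_SIGNAL_URL'], json={}, timeout=60)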

I'd prefer a different and cleaner approach like the one you proposed,
but this is working well for me for the moment.

-- Christian




[openstack-dev] [tripleo] Requesting FFE for improved Swift deployments

2016-08-29 Thread Christian Schwede
Hello,

I'm kindly asking for an FFE for a required setting to improve
Swift-based TripleO deployments:

https://review.openstack.org/#/c/358643/

This is required to land the last patch in a series of TripleO-doc patches:

https://review.openstack.org/#/c/293311/
https://review.openstack.org/#/c/360353/
https://review.openstack.org/#/c/361032/

The current idea is to automate the described manual actions for Ocata.
There was some discussion on the ML as well:

http://lists.openstack.org/pipermail/openstack-dev/2016-August/102053.html

If one is interested in testing this with tripleo-quickstart, here is a
patch to automatically add extra blockdevices to the overcloud VMs:

https://review.openstack.org/#/c/359630/

Thanks a lot!

-- Christian



Re: [openstack-dev] [TripleO] Improving Swift deployments with TripleO

2016-08-22 Thread Christian Schwede
On 04.08.16 15:39, Giulio Fidente wrote:
> On 08/04/2016 01:26 PM, Christian Schwede wrote:
>> On 04.08.16 10:27, Giulio Fidente wrote:
>>> On 08/02/2016 09:36 PM, Christian Schwede wrote:
>>>> Hello everyone,
>>>
>>> thanks Christian,
>>>
>>>> I'd like to improve the Swift deployments done by TripleO. There are a
>>>> few problems today when deployed with the current defaults:
>>>>
>>>> 1. Adding new nodes (or replacing existing nodes) is not possible,
>>>> because the rings are built locally on each host and a new node doesn't
>>>> know about the "history" of the rings. Therefore rings might become
>>>> different on the nodes, and that results in an unusable state
>>>> eventually.
>>>
>>> one of the ideas for this was to use a tempurl in the undercloud swift
>>> where to upload the rings built by a single overcloud node, not by the
>>> undercloud
>>>
>>> so I proposed a new heat resource which would permit us to create a
>>> swift tempurl in the undercloud during the deployment
>>>
>>> https://review.openstack.org/#/c/350707/
>>>
>>> if we build the rings on the undercloud we can ignore this and use a
>>> mistral action instead, as pointed by Steven
>>>
>>> the good thing about building rings in the overcloud is that it doesn't
>>> force us to have a static node mapping for each and every deployment but
>>> it makes it hard to cope with heterogeneous environments
>>
>> That's true. However - we still need to collect the device data from all
>> the nodes on the undercloud, push it to at least one overcloud node,
>> build/update the rings there, push it to the undercloud Swift and use
>> that on all overcloud nodes. Or not?
> 
> sure, let's build on the undercloud, when automated with mistral it
> shouldn't make a big difference for the user
> 
>> I was also thinking more about the static node mapping and how to avoid
>> this. Could we add a host alias using the node UUIDs? That would never
>> change, it's available from the introspection data and therefore could
>> be used in the rings.
>>
>> http://docs.openstack.org/developer/tripleo-docs/advanced_deployment/node_specific_hieradata.html#collecting-the-node-uuid
>>
> 
> right, this is the mechanism I wanted to use to provide per-node disk
> maps, it's how it works for ceph disks as well

I looked into this further and proposed a patch upstream:

https://review.openstack.org/358643

This worked fine in my tests, an example /etc/hosts looks like this:

http://paste.openstack.org/show/562206/

And based on that patch we could build the Swift rings even if the nodes
are down and have never been deployed, because the system uuid will
never change and is unique. I updated my tripleo-swift-ring-tool and
just ran a successful deployment with the patch (also using the merged
patches from Giulio).

Let me know what you think about it - I think with that patch we could
integrate the tripleo-swift-ring-tool.

-- Christian

>>>> 2. The rings are only using a single device, and it seems that this is
>>>> just a directory and not a mountpoint with a real device. Therefore
>>>> data
>>>> is stored on the root device - even if you have 100TB disk space in the
>>>> background. If not fixed manually your root device will run out of
>>>> space
>>>> eventually.
>>>
>>> for the disks instead I am thinking to add a create_resources wrapper in
>>> puppet-swift:
>>>
>>> https://review.openstack.org/#/c/350790
>>> https://review.openstack.org/#/c/350840/
>>>
>>> so that we can pass via hieradata per-node swift::storage::disks maps
>>>
>>> we have a mechanism to push per-node hieradata based on the system uuid,
>>> we could extend the tool to capture the nodes (system) uuid and generate
>>> per-node maps
>>
>> Awesome, thanks Giulio!
>>
>> I will test that today. So the tool could generate the mapping
>> automatically, and we don't need to filter puppet facts on the nodes
>> themselves. Nice!
> 
> and we could re-use the same tool to generate the ceph::osds disk maps
> as well :)
> 




Re: [openstack-dev] [TripleO] Improving Swift deployments with TripleO

2016-08-04 Thread Christian Schwede
On 04.08.16 10:27, Giulio Fidente wrote:
> On 08/02/2016 09:36 PM, Christian Schwede wrote:
>> Hello everyone,
> 
> thanks Christian,
> 
>> I'd like to improve the Swift deployments done by TripleO. There are a
>> few problems today when deployed with the current defaults:
>>
>> 1. Adding new nodes (or replacing existing nodes) is not possible,
>> because the rings are built locally on each host and a new node doesn't
>> know about the "history" of the rings. Therefore rings might become
>> different on the nodes, and that results in an unusable state eventually.
> 
> one of the ideas for this was to use a tempurl in the undercloud swift
> where to upload the rings built by a single overcloud node, not by the
> undercloud
> 
> so I proposed a new heat resource which would permit us to create a
> swift tempurl in the undercloud during the deployment
> 
> https://review.openstack.org/#/c/350707/
> 
> if we build the rings on the undercloud we can ignore this and use a
> mistral action instead, as pointed by Steven
> 
> the good thing about building rings in the overcloud is that it doesn't
> force us to have a static node mapping for each and every deployment but
> it makes it hard to cope with heterogeneous environments

That's true. However - we still need to collect the device data from all
the nodes on the undercloud, push it to at least one overcloud node,
build/update the rings there, push it to the undercloud Swift and use
that on all overcloud nodes. Or not?

That leaves some room for new inconsistencies IMO: how do we ensure that
the overcloud node starts from the latest rings? Also, ring building has
to be limited to a single overcloud node, otherwise we might end up with
multiple ring-building nodes. And how can an operator manually modify
the rings?

The tool to build the rings on the undercloud could be further improved
later; for example, I'd like to be able to move data to new nodes slowly
over time, and also to query existing storage servers about the
progress. Therefore we need more functionality than is currently
available in the ring-building part of puppet-swift IMO.

I think if we move this step to the undercloud we could solve a lot of
these challenges in a consistent way. WDYT?
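
To give an idea what that means in practice: the per-node disk layout is
already part of the introspection data stored on the undercloud, so the
ring tool only needs something along these lines (a simplified sketch,
not the actual tripleo-swift-ring-tool code; the JSON layout follows
what ironic-inspector stores):

    def swift_devices(introspection_data, min_size_gb=10):
        """Pick candidate Swift devices from one node's introspection data."""
        root = introspection_data.get('root_disk', {}).get('name')
        devices = []
        for disk in introspection_data.get('inventory', {}).get('disks', []):
            if disk['name'] == root:
                continue  # never put the root disk into the rings
            if disk['size'] < min_size_gb * 1024 ** 3:
                continue  # skip devices that are too small
            devices.append(disk['name'].replace('/dev/', ''))
        return devices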

I was also thinking more about the static node mapping and how to avoid
this. Could we add a host alias using the node UUIDs? That would never
change, it's available from the introspection data and therefore could
be used in the rings.

http://docs.openstack.org/developer/tripleo-docs/advanced_deployment/node_specific_hieradata.html#collecting-the-node-uuid

>> 2. The rings are only using a single device, and it seems that this is
>> just a directory and not a mountpoint with a real device. Therefore data
>> is stored on the root device - even if you have 100TB disk space in the
>> background. If not fixed manually your root device will run out of space
>> eventually.
> 
> for the disks instead I am thinking to add a create_resources wrapper in
> puppet-swift:
> 
> https://review.openstack.org/#/c/350790
> https://review.openstack.org/#/c/350840/
>
> so that we can pass via hieradata per-node swift::storage::disks maps
> 
> we have a mechanism to push per-node hieradata based on the system uuid,
> we could extend the tool to capture the nodes (system) uuid and generate
> per-node maps

Awesome, thanks Giulio!

I will test that today. So the tool could generate the mapping
automatically, and we don't need to filter puppet facts on the nodes
themselves. Nice!

> then, with the above puppet changes and having the per-node map and the
> rings download URL, we could feed them to the templates, replace the
> ring-building implementation via an environment, and deploy without
> further customizations
> 
> what do you think?

Yes, that sounds like a good plan to me.

I'll continue working on the ringbuilder tool for now and see how I
integrate this into the Mistral actions.

-- Christian



Re: [openstack-dev] [TripleO] Improving Swift deployments with TripleO

2016-08-03 Thread Christian Schwede
Thanks Steven for your feedback! Please see my answers inline.

On 02.08.16 23:46, Steven Hardy wrote:
> On Tue, Aug 02, 2016 at 09:36:45PM +0200, Christian Schwede wrote:
>> Hello everyone,
>>
>> I'd like to improve the Swift deployments done by TripleO. There are a
>> few problems today when deployed with the current defaults:
> 
> Thanks for digging into this, I'm aware this has been something of a
> known-issue for some time, so it's great to see it getting addressed :)
> 
> Some comments inline;
> 
>> 1. Adding new nodes (or replacing existing nodes) is not possible,
>> because the rings are built locally on each host and a new node doesn't
>> know about the "history" of the rings. Therefore rings might become
>> different on the nodes, and that results in an unusable state eventually.
>>
>> 2. The rings are only using a single device, and it seems that this is
>> just a directory and not a mountpoint with a real device. Therefore data
>> is stored on the root device - even if you have 100TB disk space in the
>> background. If not fixed manually your root device will run out of space
>> eventually.
>>
>> 3. Even if a real disk is mounted in /srv/node, replacing a faulty disk
>> is much more troublesome. Normally you would simply unmount a disk, and
>> then replace the disk sometime later. But because mount_check is set to
>> False on the storage servers, data will be written to the root device in
>> the meantime; and when you finally mount the disk again, you can't
>> simply clean up.
>>
>> 4. In general, it's not possible to change cluster layout (using
>> different zones/regions/partition power/device weight, slowly adding new
>> devices to avoid 25% of the data being moved immediately when adding
>> new nodes to a small cluster, ...). You could manually manage your
>> rings, but they will eventually be overwritten when updating your overcloud.
>>
>> 5. Missing erasure coding support (or storage policies in general)
>>
>> This sounds bad, however most of the current issues can be fixed using
>> customized templates and some tooling to create the rings in advance on
>> the undercloud node.
>>
>> The information about all the devices can be collected from the
>> introspection data, and by using node placement the nodenames in the
>> rings are known in advance if the nodes are not yet powered on. This
>> ensures a consistent ring state, and an operator can modify the rings if
>> needed and to customize the cluster layout.
>>
>> Using some customized templates we can already do the following:
>> - disable ring building on the nodes
>> - create filesystems on the extra blockdevices
>> - copy ringfiles from the undercloud, using pre-built rings
>> - enable mount_check by default
>> - (define storage policies if needed)
>>
>> I started working on a POC using tripleo-quickstart, some custom
>> templates and a small Python tool to build rings based on the
>> introspection data:
>>
>> https://github.com/cschwede/tripleo-swift-ring-tool
>>
>> I'd like to get some feedback on the tool and templates.
>>
>> - Does this make sense to you?
> 
> Yes, I think the basic workflow described should work, and it's good to see
> that you're passing the ring data via swift as this is consistent with how
> we already pass some data to nodes via our DeployArtifacts interface:
> 
> https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/deploy-artifacts.yaml
> 
> Note however that there are no credentials to access the undercloud swift
> on the nodes, so you'll need to pass a tempurl reference in (which is what
> we do for deploy artifacts, obviously you will have credentials to create
> the container & tempurl on the undercloud).

Ah, that's very useful! I updated my POC; that means one less customized
template and less code to support in the Python tool. Works as expected!

> One slight concern I have is mandating the use of predictable placement -
> it'd be nice to think about ways we might avoid that but the undercloud
> centric approach seems OK for a first pass (in either case I think the
> delivery via swift will be the same).

Do you mean the predictable artifact filename? We could just add a
randomized prefix to the filename IMO.

>> - How (and where) could we integrate this upstream?
> 
> So I think the DeployArtefacts interface may work for this, and we have a
> helper script that can upload data to swift:
> 
> https://github.com/openstack/tripleo-common/blob/master/scripts/upload-swift-artifacts
> 
> This basically pushes a tarball to swift, creates a tempurl, then creates 

[openstack-dev] [TripleO] Improving Swift deployments with TripleO

2016-08-02 Thread Christian Schwede
Hello everyone,

I'd like to improve the Swift deployments done by TripleO. There are a
few problems today when deployed with the current defaults:

1. Adding new nodes (or replacing existing nodes) is not possible,
because the rings are built locally on each host and a new node doesn't
know about the "history" of the rings. Therefore rings might become
different on the nodes, and that results in an unusable state eventually.

2. The rings are only using a single device, and it seems that this is
just a directory and not a mountpoint with a real device. Therefore data
is stored on the root device - even if you have 100TB disk space in the
background. If not fixed manually your root device will run out of space
eventually.

3. Even if a real disk is mounted in /srv/node, replacing a faulty disk
is much more troublesome. Normally you would simply unmount a disk, and
then replace the disk sometime later. But because mount_check is set to
False on the storage servers, data will be written to the root device in
the meantime; and when you finally mount the disk again, you can't
simply clean up.

4. In general, it's not possible to change the cluster layout (using
different zones/regions/partition power/device weights, slowly adding
new devices to avoid 25% of the data being moved immediately when adding
new nodes to a small cluster, ...; see the sketch below the list). You
could manually manage your rings, but they will eventually be
overwritten when updating your overcloud.

5. Missing erasure coding support (or storage policies in general)
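
To illustrate point 4: adding a device gradually is a short sequence of
ring builder steps like the one sketched below (shown with Swift's
Python RingBuilder API; paths, addresses and weights are only examples),
and there is currently no supported way to do this through the
templates:

    from swift.common.ring import RingBuilder

    # add the new device with a low weight first, so only a small fraction
    # of the partitions moves with the next rebalance
    builder = RingBuilder.load('/etc/swift/object.builder')
    dev_id = builder.add_dev({'region': 1, 'zone': 2, 'ip': '172.16.0.15',
                              'port': 6000, 'device': 'sdb', 'weight': 10})
    builder.rebalance()
    builder.save('/etc/swift/object.builder')
    builder.get_ring().save('/etc/swift/object.ring.gz')

    # later, once replication has caught up, raise the weight step by step
    builder = RingBuilder.load('/etc/swift/object.builder')
    builder.set_dev_weight(dev_id, 50)
    builder.rebalance()
    builder.save('/etc/swift/object.builder')
    builder.get_ring().save('/etc/swift/object.ring.gz')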

This sounds bad, however most of the current issues can be fixed using
customized templates and some tooling to create the rings in advance on
the undercloud node.

The information about all the devices can be collected from the
introspection data, and by using node placement the nodenames in the
rings are known in advance if the nodes are not yet powered on. This
ensures a consistent ring state, and an operator can modify the rings if
needed and to customize the cluster layout.

Using some customized templates we can already do the following:
- disable ring building on the nodes
- create filesystems on the extra blockdevices
- copy ringfiles from the undercloud, using pre-built rings
- enable mount_check by default
- (define storage policies if needed)

I started working on a POC using tripleo-quickstart, some custom
templates and a small Python tool to build rings based on the
introspection data:

https://github.com/cschwede/tripleo-swift-ring-tool

I'd like to get some feedback on the tool and templates.

- Does this make sense to you?
- How (and where) could we integrate this upstream?
- Templates might be included in tripleo-heat-templates?

IMO the most important change would be to avoid overwriting rings on the
overcloud. There is a good chance of messing up your cluster if the
template to disable ring building isn't used and you already have
working rings in place. Same for the mount_check option.

I'm curious about your thoughts!

Thanks,

Christian


-- 
Christian Schwede
_

Red Hat GmbH
Technopark II, Haus C, Werner-von-Siemens-Ring 11-15, 85630 Grasbrunn,
Handelsregister: Amtsgericht Muenchen HRB 153243
Geschaeftsfuehrer: Mark Hegarty, Charlie Peters, Michael Cunningham,
Charles Cachera



Re: [openstack-dev] [swift] On Object placement

2015-02-19 Thread Christian Schwede
Hello Jonathan,

On 18.02.15 18:13, Halterman, Jonathan wrote:
 1. Swift should allow authorized services to place a given number
 of object replicas onto a particular rack, and onto separate
 racks.
 
 This is already possible if you use zones and regions in your ring 
 files. For example, if you have 2 racks, you could assign one zone
 to each of them and Swift places at least one replica on each
 rack.
 
 Because Swift takes care of the device weight you could also ensure
 that a specific rack gets two copies, and another rack only one.
 
 Presumably a deployment would/should match the DC layout, where
 racks could correspond to Azs.

yes, that makes a lot of sense (to assign zones to racks), because in
this case you can ensure that there aren't multiple replicas stored
within the same rack. You can still access your data if a rack goes down
(power, network, maintenance).

 However, this is only true as long as all primary nodes are
 accessible. If Swift stores data on a handoff node this data might
 be written to a different node first, and moved to the primary node
 later on.
 
 Note that placing objects on other than the primary nodes (for
 example using an authorized service you described) will only store
 the data on these nodes until the replicator moves the data to the
 primary nodes described by the ring. As far as I can see there is
 no way to ensure that an authorized service can decide where to
 place data, and that this data stays on the selected nodes. That
 would require a fundamental change within Swift.
 
 So - how can we influence where data is stored? In terms of
 placement based on a hash ring, I'm thinking of perhaps restricting
 the placement of an object to a subset of the ring based on a zone.
 We can still hash an object somewhere on the ring, for the purposes
 of controlling locality, we just want it to be within (or without) a
 particular zone. Any ideas?

You can't (at least not from the client side). The ring determines the
placement and if you have more zones (or regions) than replicas you
can't ensure an object replica is stored within a determined rack. Even
if you store it on a handoff node it will be moved to the primary node
sooner or later.
Determining that an object is stored in a specific zone is not possible
with the current architecture; you can only discover in which zone it
will be placed finally (based on the ring).

What you could do (especially if you have more racks than replicas) is
to use storage policies and assign only three racks to each policy,
splitting them into three zones (if you store three replicas).
For example, let's assume you have 5 racks, then you create 5 storage
policies (SP) with the following assignment:

       Rack
SP     1   2   3   4   5
0      x   x   x
1          x   x   x
2              x   x   x
3      x           x   x
4      x   x           x

Doing this you can ensure the following:
- Data is distributed more or less evenly across the cluster (provided
the storage policies themselves are used evenly)
- For a given SP you can ensure that a replica is stored in a specific
rack; and because an SP is assigned to a container, you can determine
the SP based on the container metadata (name SP0 rack_1_2_3 and so on to
make it even simpler for the application to determine the racks).

That could help in your case?
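
For illustration, such a layout could be expressed in swift.conf roughly
like this (the policy names are only examples; every policy needs its
own object ring - object-1.ring.gz and so on - built from the devices of
its three racks, and a client selects the policy at container creation
time via the X-Storage-Policy header):

    [storage-policy:0]
    name = rack-1-2-3
    default = yes

    [storage-policy:1]
    name = rack-2-3-4

    [storage-policy:2]
    name = rack-3-4-5

    [storage-policy:3]
    name = rack-4-5-1

    [storage-policy:4]
    name = rack-5-1-2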


 2. Swift should allow authorized services and administrators to
 learn which racks an object resides on, along with endpoints.
 
 You already mentioned the endpoint middleware, though it is
 currently not protected and unauthenticated access is allowed if
 enabled.
 
 This is good to know. We still need to learn which rack an object
 resides on though. This information is important in determining
 whether a swift object resides on the same rack as a VM.

Well, that information is available using the /endpoints middleware: you
know the server IPs in each rack, and can compare them to the output
from the endpoints middleware.
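
A rough sketch of what I mean (the proxy URL and the rack map are made
up; each entry returned by the middleware has the form
http://<ip>:<port>/<device>/<partition>/<account>/<container>/<object>):

    import requests

    RACK_BY_IP = {'172.16.1.10': 'rack1', '172.16.2.10': 'rack2'}  # example

    def racks_for_object(proxy_url, account, container, obj):
        resp = requests.get('%s/endpoints/%s/%s/%s'
                            % (proxy_url, account, container, obj))
        resp.raise_for_status()
        ips = set(url.split('/')[2].split(':')[0] for url in resp.json())
        return set(RACK_BY_IP.get(ip, 'unknown') for ip in ips)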

 You could easily add another small middleware in the pipeline to
 check authentication and grant or deny access to /endpoints based
 on the authentication. You can also get the node (and disk) if you
 have access to the ring files. There is a tool included in the
 Swift source code called swift-get-nodes; however you could
 simply reuse existing code to include it in your projects.
 
 I'm guessing this would not work for in-cloud services?

Do you mean public cloud services? You always need access to the storage
servers themselves to access objects directly, and these should be
accessible only via an internal, protected network (and only the proxy
servers should have access to that network).

Christian


Re: [openstack-dev] [swift] On Object placement

2015-02-18 Thread Christian Schwede
Hello Jonathan,

On 17.02.15 22:17, Halterman, Jonathan wrote:
 Various services desire the ability to control the location of data
 placed in Swift in order to minimize network saturation when moving data
 to compute, or in the case of services like Hadoop, to ensure that
 compute can be moved to wherever the data resides. Read/write latency
 can also be minimized by allowing authorized services to place one or
 more replicas onto the same rack (with other replicas being placed on
 separate racks). Fault tolerance can also be enhanced by ensuring that
 some replica(s) are placed onto separate racks. Breaking this down we
 come up with the following potential requirements:
 
 1. Swift should allow authorized services to place a given number of
 object replicas onto a particular rack, and onto separate racks.

This is already possible if you use zones and regions in your ring
files. For example, if you have 2 racks, you could assign one zone to
each of them and Swift places at least one replica on each rack.
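
A minimal sketch of such a two-rack/two-zone ring (using Swift's Python
RingBuilder; addresses and weights are made up) could look like this:

    from swift.common.ring import RingBuilder

    builder = RingBuilder(part_power=14, replicas=3, min_part_hours=1)
    for zone, ips in ((1, ['10.0.1.10', '10.0.1.11']),   # rack 1 -> zone 1
                      (2, ['10.0.2.10', '10.0.2.11'])):  # rack 2 -> zone 2
        for ip in ips:
            builder.add_dev({'region': 1, 'zone': zone, 'ip': ip,
                             'port': 6000, 'device': 'sdb', 'weight': 100})
    builder.rebalance()
    builder.save('object.builder')
    # with replicas spread as uniquely as possible, each zone (rack) gets
    # at least one of the three replicas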

Because Swift takes care of the device weight you could also ensure that
a specific rack gets two copies, and another rack only one.
However, this is only true as long as all primary nodes are accessible.
If Swift stores data on a handoff node this data might be written to a
different node first, and moved to the primary node later on.

Note that placing objects on other than the primary nodes (for example
using an authorized service you described) will only store the data on
these nodes until the replicator moves the data to the primary nodes
described by the ring.
As far as I can see there is no way to ensure that an authorized service
can decide where to place data, and that this data stays on the selected
nodes. That would require a fundamental change within Swift.

 2. Swift should allow authorized services and administrators to learn
 which racks an object resides on, along with endpoints.

You already mentioned the endpoint middleware, though it is currently
not protected and unauthenticated access is allowed if enabled. You
could easily add another small middleware in the pipeline to check
authentication and grant or deny access to /endpoints based on the
authentication.
You can also get the node (and disk) if you have access to the ring
files. There is a tool included in the Swift source code called
swift-get-nodes; however you could simply reuse existing code to
include it in your projects.

Christian



Re: [openstack-dev] [swift] LTFS integration with OpenStack Swift for scenario like - Data Archival as a Service .

2014-11-17 Thread Christian Schwede
On 14.11.14 20:43, Tim Bell wrote:
 It would need to be tiered (i.e. migrate whole collections rather than
 files) and a local catalog would be needed to map containers to tapes.
 Timeouts would be an issue since we are often waiting hours for recall
 (to ensure that multiple recalls for the same tape are grouped). 
 
 It is not an insolvable problem but it is not just a 'use LTFS' answer.

There were some ad-hoc discussions during the last summit about using
Swift (API) to access data that is stored on tape. At the same time we
talked about possible data migrations from one storage policy to
another, and this might be an option to think about.

Something like this:

1. Data is stored in a container with a Storage Policy (SP) that defines
a time-based data migration to some other place
2. After some time, data is migrated to tape, and only some stubs
(zero-byte objects) are left on disk.
3. If a client requests such an object the client gets an error stating
that the object is temporarily not available (unfortunately there is no
suitable http response code for this yet)
4. At this time the object is scheduled to be restored from tape
5. Finally the object is read from tape and stored on disk again. It
will be deleted from disk again after some time.

Using this approach only minor modifications inside Swift are required,
for example to send a notification to an external consumer to migrate
data back and forth, and to handle requests for empty stub files. The
migration itself should be done by an external worker that works with
existing solutions from tape vendors.

Just an idea, but it might be worth investigating further (because more
and more people seem to be interested in this, especially from the
science community).

Christian



Re: [openstack-dev] [Swift] domain-level quotas

2014-01-22 Thread Christian Schwede
Hi Matthieu,

On 22.01.14 20:02, Matthieu Huin wrote:
 The idea is to have a middleware checking a domain's current usage
 against a limit set in the configuration before allowing an upload.
 The domain id can be extracted from the token, then used to query
 keystone for a list of projects belonging to the domain. Swift would
 then compute the domain usage in a similar fashion as the way it is
 currently done for accounts, and proceed from there.

The problem might be computing the current usage of all accounts within
a domain. It won't be a problem if you have only a few accounts in a
domain, but with tens, hundreds or even thousands of accounts in a domain
there will be a performance impact because you need to iterate over all
accounts (doing a HEAD on every account) and sum up the total usage.
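
Just to illustrate what that iteration means (a rough sketch using
python-swiftclient; how the storage URL for each project is resolved is
left out):

    import swiftclient

    def domain_usage(project_storage_urls, token):
        # one HEAD request per project/account in the domain
        total = 0
        for url in project_storage_urls:
            headers = swiftclient.head_account(url, token)
            total += int(headers.get('x-account-bytes-used', 0))
        return total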

I think some performance tests would be helpful (doing a HEAD on all
accounts repeatedly with some PUTs in-between) to see if the performance
impact is an issue at all (since there will be a lot of caching involved).

Christian



Re: [openstack-dev] [Swift] Increase Swift ring partition power

2013-12-03 Thread Christian Schwede
On 02.12.13 17:10, Gregory Holt wrote:
 On Dec 2, 2013, at 9:48 AM, Christian Schwede
 christian.schw...@enovance.com wrote:

 That sounds great! Is someone already working on this (I know about
 the ongoing DiskFile refactoring) or even a blueprint available?

 There is https://blueprints.launchpad.net/swift/+spec/ring-doubling
 though I'm uncertain how up to date it is.

Thanks for the link! I read all the linked entries, reviews and patches
and it seems all of us wanted to use a similar approach.

David put it in a nutshell:

 We can consider this to be the yearly event in which we try to crack
 the part_power problem.

I'm going to write some docs and tests for my tool and will link it as
a related project afterwards.

Christian


[openstack-dev] [Swift] Increase Swift ring partition power

2013-12-02 Thread Christian Schwede
Hello together,

I'd like to discuss a way to increase the partition power of an existing
Swift cluster.
This is most likely interesting for smaller clusters that are growing
beyond their originally planned size.

As discussed earlier [1] a rehashing is required after changing the
partition power to make existing data available again.

My idea is to increase the partition power by 1 and then assign the same
devices to the partitions (old_partition*2) and (old_partition*2+1). For example:

Assigned devices on the old ring:

    Partition 0: 2 3 0
    Partition 1: 1 0 3

Assigned devices on the new ring with partition power +1:

    Partition 0: 2 3 0
    Partition 1: 2 3 0
    Partition 2: 1 0 3
    Partition 3: 1 0 3

The hash of an object doesn't change with the new partition power, only
the assigned partition. An object in partition 1 on the old ring will be
assigned to partition 2 OR 3 on the ring with the increased partition
power. Because the devices used for the new partitions are the same, no
data movement to other devices or storage nodes is required (only local
movement).
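
The reason this works can be seen from how Swift derives the partition
from the object hash (simplified - the real code also mixes a
cluster-wide hash path prefix/suffix into the name):

    import hashlib
    import struct

    def get_part(path, part_power):
        # Swift uses the top bits of the md5 of the object path
        digest = hashlib.md5(path.encode()).digest()
        return struct.unpack('>I', digest[:4])[0] >> (32 - part_power)

    old_part = get_part('/AUTH_test/container/object', 18)
    new_part = get_part('/AUTH_test/container/object', 19)
    assert new_part in (2 * old_part, 2 * old_part + 1)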

A longer example together with a small tool can be found at
https://github.com/cschwede/swift-ring-tool

Since the device distribution on the new ring might not be optimal it is
possible to use a fresh distribution and migrate the ring with the
increased partition power to a ring with a new distribution.

So far this worked for smaller clusters (with a few hundred TB) as well
as in local SAIO installations.

I'd like to discuss this approach and see if it makes sense to continue
working on this and to add this tool to swift, python-swiftclient or
stackforge (or whatever else might be appropriate).

Please let me know what you think.

Best regards,

Christian


[1]
http://lists.openstack.org/pipermail/openstack-operators/2013-January/002544.html



Re: [openstack-dev] [Swift] Increase Swift ring partition power

2013-12-02 Thread Christian Schwede
On 02.12.13 15:47, Gregory Holt wrote:
 Achieving this transparently is part of the ongoing plans, starting
 with things like the DiskFile refactoring and SSync. The idea is to
 isolate the direct disk access from other servers/tools, something
 that (for instance) RSync has today. Once the isolation is there, it
 should be fairly straightforward to have incoming requests for a
 ring^20 partition look on the local disk in a directory structure
 that was originally created for a ring^19 partition, or even vice
 versa. Then, there will be no need to move data around just for a
 ring-doubling or halving, and no down time to do so.

That sounds great! Is someone already working on this (I know about the
ongoing DiskFile refactoring), or is there even a blueprint available? I was aware
of the idea about multiple rings on the same policy but not about
support for rings with a modified partition power.

 That said, if you want create a tool that allows such ring shifting
 in the interim, it should work with smaller clusters that don't mind
 downtime. I would prefer that it not become a core tool checked
 directly into swift/python-swiftclient, just because of the plans
 stated above that should one day make it obsolete.

Yes, that makes a lot of sense. In fact the tool is already working; I
think the best way is to enhance the docs and to list it as a related
Swift project once I'm done with this.

Christian





Re: [openstack-dev] [Swift] Goals for Icehouse

2013-11-20 Thread Christian Schwede

Thanks John for the summary - and all contributors for their work!

Others are looking in to how to grow clusters (changing the partition 
power)


I'm interested in who else is also working on this - I successfully
increased the partition power of several (smaller) clusters and would like
to discuss my approach with others. Please feel free to contact me so we 
can work together on this :)




Re: [openstack-dev] [Swift] erasure codes, digging deeper

2013-07-18 Thread Christian Schwede
A solution to this might be to set the default policy as a configuration
setting in the proxy. If you want a replicated Swift cluster, just allow
this policy in the proxy and set it as the default. The same for an EC
cluster: just set the allowed policy to EC. If you want both (and let
your users decide which policy to use), simply configure a list of
allowed policies, with the first one in the list being the default
policy in case they don't set a policy during container creation.

On 18.07.13 20:15, Chuck Thier wrote:
 I think you are missing the point.  What I'm talking about is who
 chooses what data is EC and what is not.  The point that I am trying to
 make is that the operators of swift clusters should decide what data is
 EC, not the clients/users.  How the data is stored should be totally
 transparent to the user.
 
 Now if we want to down the road offer user defined classes of storage
 (like how S3 does reduced redundancy), I'm cool with that, just that it
 should be orthogonal to the implementation of EC.
 
 --
 Chuck
 
 
 On Thu, Jul 18, 2013 at 12:57 PM, John Dickinson m...@not.mn wrote:
 
 Are you talking about the parameters for EC or the fact that
 something is erasure coded vs replicated?
 
 For the first, that's exactly what we're thinking: a deployer sets
 up one (or more) policies and calls them Alice, Bob, or whatever,
 and then the API client can set that on a particular container.
 
 This allows users who know what they are doing (ie those who know
 the tradeoffs and their data characteristics) to make good choices.
 It also allows deployers who want to have an automatic policy to set
 one up to migrate data.
 
 For example, a deployer may choose to run a migrator process that
 moved certain data from replicated to EC containers over time (and
 drops a manifest file in the replicated tier to point to the EC data
 so that the URL still works).
 
 Like existing features in Swift (eg large objects), this gives users
 the ability to flexibly store their data with a nice interface yet
 still have the ability to get at some of the pokey bits underneath.
 
 --John
 
 
 
  On Jul 18, 2013, at 10:31 AM, Chuck Thier cth...@gmail.com wrote:
 
  I'm with Chmouel though.  It seems to me that EC policy should be
 chosen by the provider and not the client.  For public storage
 clouds, I don't think you can make the assumption that all
 users/clients will understand the storage/latency tradeoffs and
 benefits.
 
 
  On Thu, Jul 18, 2013 at 8:11 AM, John Dickinson m...@not.mn wrote:
  Check out the slides I linked. The plan is to enable an EC policy
 that is then set on a container. A cluster may have a replication
 policy and one or more EC policies. Then the user will be able to
 choose the policy for a particular container.
 
  --John
 
 
 
 
  On Jul 18, 2013, at 2:50 AM, Chmouel Boudjnah
 chmo...@enovance.com wrote:
 
   On Thu, Jul 18, 2013 at 12:42 AM, John Dickinson m...@not.mn wrote:
  * Erasure codes (vs replicas) will be set on a per-container
 basis
  
   I was wondering if there was any reasons why it couldn't be as
   per-account basis as this would allow an operator to have different
   type of an account and different pricing (i.e: tiered storage).
  
   Chmouel.
 
 