Re: [Openstack-operators] large high-performance ephemeral storage

2018-06-13 Thread Joe Topjian
fio is fine with me. I'll lazily defer to your expertise on the right fio
commands to run for each case. :)

If we're going to test within the guest, that's going to introduce a new
set of variables, right? Should we settle on a standard flavor (maybe two
if we wanted to include both virtio and virtio-scsi) or should the results
make note of what local configuration was used?
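
For concreteness, a strawman of the kind of runs being discussed might look like
the following -- block sizes, queue depth, runtime, and the 70/30 read/write mix
are placeholders rather than a proposal:

$ fio --name=iops-test --filename=/mnt/ephemeral/fio.dat --size=10G \
      --ioengine=libaio --direct=1 --rw=randrw --rwmixwrite=30 --bs=4k \
      --iodepth=32 --runtime=300 --time_based --group_reporting

$ fio --name=throughput-test --filename=/mnt/ephemeral/fio.dat --size=10G \
      --ioengine=libaio --direct=1 --rw=write --bs=1M --iodepth=8 \
      --runtime=300 --time_based --group_reporting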

On Wed, Jun 13, 2018 at 8:45 AM, Blair Bethwaite 
wrote:

> Hey Joe,
>
> Thanks! So shall we settle on fio as a standard IO micro benchmarking
> tool? Seems to me the minimum we want is throughput- and IOPS-oriented tests
> for both the guest OS workload profile and some sort of large working-set
> application workload. For the latter it is probably best to ignore
> multiple files and focus solely on queue depth for parallelism, some sort
> of mixed block size profile/s, and some sort of r/w mix (where write <=50%
> to acknowledge this is ephemeral storage so hopefully something is using it
> soon after storing). Thoughts?
>
> Cheers,
> Blair
>
> On Thu., 14 Jun. 2018, 00:24 Joe Topjian,  wrote:
>
>> Yes, you can! The kernel documentation for read/write limits actually
>> uses /dev/null in the examples :)
>>
>> But more seriously: while we have not architected specifically for high
>> performance, for the past few years, we have used a zpool of cheap spindle
>> disks and 1-2 SSD disks for caching. We have ZFS configured for
>> deduplication which helps for the base images but not so much for ephemeral.
>>
>> If you have a standard benchmark command in mind to run, I'd be happy to
>> post the results. Maybe others could do the same to create some type of
>> matrix?
>>
>> On Wed, Jun 13, 2018 at 8:18 AM, Blair Bethwaite <
>> blair.bethwa...@gmail.com> wrote:
>>
>>> Hi Jay,
>>>
>>> Ha, I'm sure there's some wisdom hidden behind the trolling here?
>>>
>>> Believe me, I have tried to push these sorts of use-cases toward volume
>>> or share storage, but in the research/science domain there is often more
>>> accessible funding available to throw at infrastructure stop-gaps than
>>> software engineering (parallelism is hard). PS: when I say ephemeral I
>>> don't necessarily mean they aren't doing backups and otherwise caring that
> they have 100+TB of data on a stand-alone host.
>>>
>>> PS: I imagine you can set QoS limits on /dev/null these days via CPU
>>> cgroups...
>>>
>>> Cheers,
>>>
>>>
>>> On Thu., 14 Jun. 2018, 00:03 Jay Pipes,  wrote:
>>>
>>>> On 06/13/2018 09:58 AM, Blair Bethwaite wrote:
>>>> > Hi all,
>>>> >
>>>> > Wondering if anyone can share experience with architecting Nova KVM
>>>> > boxes for large capacity high-performance storage? We have some
>>>> > particular use-cases that want both high-IOPs and large capacity
>>>> local
>>>> > storage.
>>>> >
>>>> > In the past we have used bcache with an SSD based RAID0 write-through
>>>> > caching for a hardware (PERC) backed RAID volume. This seemed to work
>>>> > ok, but we never really gave it a hard time. I guess if we followed a
>>>> > similar pattern today we would use lvmcache (or are people still
>>>> using
>>>> > bcache with confidence?) with a few TB of NVMe and a NL-SAS array
>>>> with
>>>> > write cache.
>>>> >
>>>> > Is the collective wisdom to use LVM based instances for these
>>>> use-cases?
>>>> > Putting a host filesystem with qcow2 based disk images on it can't
>>>> help
>>>> > performance-wise... Though we have not used LVM based instance
>>>> storage
>>>> > before, are there any significant gotchas? And furthermore, is it
>>>> > possible to set IO QoS limits on these?
>>>>
>>>> I've found /dev/null to be the fastest ephemeral storage system, bar
>>>> none.
>>>>
>>>> Not sure if you can set QoS limits on it though.
>>>>
>>>> Best,
>>>> -jay
>>>>
>>>> ___
>>>> OpenStack-operators mailing list
>>>> OpenStack-operators@lists.openstack.org
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>>
>>>
>>> ___
>>> OpenStack-operators mailing list
>>> OpenStack-operators@lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>
>>>
>>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] large high-performance ephemeral storage

2018-06-13 Thread Joe Topjian
Yes, you can! The kernel documentation for read/write limits actually uses
/dev/null in the examples :)

But more seriously: while we have not architected specifically for high
performance, for the past few years, we have used a zpool of cheap spindle
disks and 1-2 SSD disks for caching. We have ZFS configured for
deduplication which helps for the base images but not so much for ephemeral.
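
For anyone curious, the general shape of that setup is roughly the following
(device names, pool layout, and using the SSDs as L2ARC cache devices are
illustrative -- adjust for your own hardware):

$ zpool create ephemeral raidz2 sda sdb sdc sdd sde sdf cache sdg sdh
$ zfs set dedup=on ephemeral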

If you have a standard benchmark command in mind to run, I'd be happy to
post the results. Maybe others could do the same to create some type of
matrix?

On Wed, Jun 13, 2018 at 8:18 AM, Blair Bethwaite 
wrote:

> Hi Jay,
>
> Ha, I'm sure there's some wisdom hidden behind the trolling here?
>
> Believe me, I have tried to push these sorts of use-cases toward volume or
> share storage, but in the research/science domain there is often more
> accessible funding available to throw at infrastructure stop-gaps than
> software engineering (parallelism is hard). PS: when I say ephemeral I
> don't necessarily mean they aren't doing backups and otherwise caring that
> they have 100+TB of data on a stand-alone host.
>
> PS: I imagine you can set QoS limits on /dev/null these days via CPU
> cgroups...
>
> Cheers,
>
>
> On Thu., 14 Jun. 2018, 00:03 Jay Pipes,  wrote:
>
>> On 06/13/2018 09:58 AM, Blair Bethwaite wrote:
>> > Hi all,
>> >
>> > Wondering if anyone can share experience with architecting Nova KVM
>> > boxes for large capacity high-performance storage? We have some
>> > particular use-cases that want both high-IOPs and large capacity local
>> > storage.
>> >
>> > In the past we have used bcache with an SSD based RAID0 write-through
>> > caching for a hardware (PERC) backed RAID volume. This seemed to work
>> > ok, but we never really gave it a hard time. I guess if we followed a
>> > similar pattern today we would use lvmcache (or are people still using
>> > bcache with confidence?) with a few TB of NVMe and a NL-SAS array with
>> > write cache.
>> >
>> > Is the collective wisdom to use LVM based instances for these
>> use-cases?
>> > Putting a host filesystem with qcow2 based disk images on it can't help
>> > performance-wise... Though we have not used LVM based instance storage
>> > before, are there any significant gotchas? And furthermore, is it
>> > possible to set IO QoS limits on these?
>>
>> I've found /dev/null to be the fastest ephemeral storage system, bar none.
>>
>> Not sure if you can set QoS limits on it though.
>>
>> Best,
>> -jay
>>
>> ___
>> OpenStack-operators mailing list
>> OpenStack-operators@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [Openstack] Certifying SDKs

2017-12-15 Thread Joe Topjian
Hi all,

I've been meaning to reply to this thread. Volodymyr, your reply reminded
me :)

I agree with what you said that the SDK should support everything that the
API supports. In that way, one could simply review the API reference docs
and create a checklist for each possible action. I've often thought about
doing this for Gophercloud so devs/users can see its current state of
what's supported and what's missing.

But Melvin highlighted the word "guaranteed", so I think he's looking for
the most common scenarios/actions rather than an exhaustive list. For that,
I can recommend the suite of Terraform acceptance tests. I've added a test
each time a user has either reported a bug or requested a feature, so
they're scenarios that I know are being used "in the wild".

You can find these tests here:
https://github.com/terraform-providers/terraform-provider-openstack/tree/master/openstack

Each file that begins with "resource" and ends in "_test.go" will contain
various scenarios at the bottom. For example, compute instances:
https://github.com/terraform-providers/terraform-provider-openstack/blob/master/openstack/resource_openstack_compute_instance_v2_test.go#L637-L1134

This contains tests for:

* Basic launch of an instance
* Able to add and remove security groups from an existing instance
* Able to boot from a new volume or an existing volume
* Able to edit metadata of an instance.
* Able to create an instance with multiple ephemeral disks
* Able to create an instance with multiple NICs, some of which are on the
same network, some of which are defined as ports.
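
If anyone wants to run one of those against their own cloud, it follows the
usual Terraform acceptance-test pattern, roughly as below (the test name is just
an example, and the usual OS_* credentials need to be in the environment):

$ cd terraform-provider-openstack
$ TF_ACC=1 go test ./openstack -v -run TestAccComputeV2Instance_basic -timeout 120m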

Terraform is not an SDK, but it's a direct consumer of Gophercloud and is
more user-facing, so I think it's quite applicable here. The caveat being
that if Terraform or Gophercloud does not support something, it's not
available as a test. :)

Melvin, if this is of interest, I can either post a raw list of these
tests/scenarios here or edit the sheet directly.

Thanks,
Joe


On Fri, Dec 15, 2017 at 12:43 AM, Volodymyr Litovka  wrote:

> Hi Melvin,
>
> isn't SDK the same as Openstack REST API? In my opinion (can be erroneous,
> though), SDK should just support everything that API supports, providing
> some basic checks of parameters (e.g. verify compliancy of passed parameter
> to IP address format, etc) before calling API (in order to decrease load of
> Openstack by eliminating obviously broken requests).
>
> Thanks.
>
>
> On 12/11/17 8:35 AM, Melvin Hillsman wrote:
>
> Hey everyone,
>
> On the path to potentially certifying SDKs we would like to gather a list
> of scenarios folks would like to see "guaranteed" by an SDK.
>
> Some examples - boot instance from image, boot instance from volume,
> attach volume to instance, reboot instance; very much like InterOp works to
> ensure OpenStack clouds provide specific functionality.
>
> Here is a document we can share to do this - https://docs.google.com/
> spreadsheets/d/1cdzFeV5I4Wk9FK57yqQmp5JJdGfKzEOdB3Vtt9vnVJM/edit#gid=0
>
> --
> Kind regards,
>
> Melvin Hillsman
> mrhills...@gmail.com
> mobile: (832) 264-2646
>
>
> ___
> Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
> Post to : openst...@lists.openstack.org
> Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>
>
> --
> Volodymyr Litovka
>   "Vision without Execution is Hallucination." -- Thomas Edison
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] mitaka/xenial libvirt issues

2017-11-27 Thread Joe Topjian
We think we've pinned the qemu errors down to a mismatched group ID on a
handful of compute nodes.

The slow systemd/libvirt startup is still unsolved, but at the moment it does not
appear to be the cause of the qemu errors.
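
In case it helps anyone hitting the same thing, one way to spot the mismatch is
to compare the relevant group IDs across compute nodes, along these lines (group
names are the Ubuntu defaults; the host list is whatever yours happens to be):

$ for h in compute01 compute02 compute03; do
    ssh "$h" 'hostname; getent group kvm libvirtd'
  done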

On Mon, Nov 27, 2017 at 8:04 AM, Joe Topjian <j...@topjian.net> wrote:

> Hi all,
>
> To my knowledge, we don't use tunneled migrations. This issue is also
> happening with snapshots, so it's not restricted to just migrations.
>
> I haven't yet tried the apparmor patches that George mentioned. I plan on
> applying them once I get another report of a problematic instance.
>
> Thank you for the suggestions, though :)
> Joe
>
> On Mon, Nov 27, 2017 at 2:10 AM, Tobias Urdin <tobias.ur...@crystone.com>
> wrote:
>
>> Hello,
>>
>> That seems to assume tunnelled migrations; the live_migration_flag option is
>> removed in later versions but is there in Mitaka.
>>
>> Do you have the VIR_MIGRATE_TUNNELLED flag set for
>> [libvirt]live_migration_flag in nova.conf?
>>
>>
>> Might be a long shot, but I've removed VIR_MIGRATE_TUNNELLED in our clouds
>>
>> Best regards
>>
>> On 11/26/2017 01:01 PM, Sean Redmond wrote:
>>
>> Hi,
>>
>> I think it may be related to this:
>>
>> https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389
>>
>> Thanks
>>
>> On Thu, Nov 23, 2017 at 6:20 PM, Joe Topjian <j...@topjian.net> wrote:
>>
>>> OK, thanks. We'll definitely look at downgrading in a test environment.
>>>
>>> To add some further info to this problem, here are some log entries.
>>> When an instance fails to snapshot or fails to migrate, we see:
>>>
>>> libvirtd[27939]: Cannot start job (modify, none) for domain
>>> instance-4fe4; current job is (modify, none) owned by (27942
>>> remoteDispatchDomainBlockJobAbort, 0 ) for (69116s, 0s)
>>>
>>> libvirtd[27939]: Cannot start job (none, migration out) for domain
>>> instance-4fe4; current job is (modify, none) owned by (27942
>>> remoteDispatchDomainBlockJobAbort, 0 ) for (69361s, 0s)
>>>
>>>
>>> The one piece of this that I'm currently fixated on is the length of
>>> time it takes libvirt to start. I'm not sure if it's causing the above,
>>> though. When starting libvirt through systemd, it takes much longer to
>>> process the iptables and ebtables rules than if we start libvirtd on the
>>> command-line directly.
>>>
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>>> nat -L libvirt-J-vnet5'
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>>> nat -L libvirt-P-vnet5'
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>>> nat -F libvirt-J-vnet5'
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>>> nat -X libvirt-J-vnet5'
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>>> nat -F libvirt-P-vnet5'
>>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>>> nat -X libvirt-P-vnet5'
>>>
>>> We're talking about a difference between 5 minutes and 5 seconds
>>> depending on where libvirt was started. This doesn't seem normal to me.
>>>
>>> In general, is anyone aware of systemd performing restrictions of some
>>> kind on processes which create subprocesses? Or something like that? I've
>>> tried comparing cgroups and the various limits within systemd between my
>>> shell session and the libvirt-bin.service session and can't find anything
>>> immediately noticeable. Maybe it's apparmor?
>>>
>>> Thanks,
>>> Joe
>>>
>>> On Thu, Nov 23, 2017 at 11:03 AM, Chris Sarginson <csarg...@gmail.com>
>>> wrote:
>>>
>>>> I think we may have pinned libvirt-bin as well, (1.3.1), but I can't
>>>> guarantee that, sorry - I would suggest its worth trying pinning both
>>>> initially.
>>>>
>>>> Chris
>>>>
>>>> On Thu, 23 Nov 2017 at 17:42 Joe Topjian <j...@topjian.net> wrote:
>>>>
>>>>> Hi Chris,
>>>>>
>>>>> Thanks - we will definitely look into this. To confirm: did you also
>>>>> downgrade libvirt as well or was it all qemu?
>>>>>
>>>>> Thanks,
>>>>> Joe
>>>>>
>>>>> On Thu, Nov 23, 2017 at 9:16 AM, Chris Sarginson <csarg...@gmail.com>
>>>>> wrote

Re: [Openstack-operators] mitaka/xenial libvirt issues

2017-11-27 Thread Joe Topjian
Hi all,

To my knowledge, we don't use tunneled migrations. This issue is also
happening with snapshots, so it's not restricted to just migrations.
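
For reference, the setting Tobias mentions below lives in nova.conf on the
compute nodes; a typical Mitaka-era value without tunnelling looks something
like this (the exact flag list is illustrative):

[libvirt]
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE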

I haven't yet tried the apparmor patches that George mentioned. I plan on
applying them once I get another report of a problematic instance.

Thank you for the suggestions, though :)
Joe

On Mon, Nov 27, 2017 at 2:10 AM, Tobias Urdin <tobias.ur...@crystone.com>
wrote:

> Hello,
>
> That seems to assume tunnelled migrations; the live_migration_flag option is
> removed in later versions but is there in Mitaka.
>
> Do you have the VIR_MIGRATE_TUNNELLED flag set for
> [libvirt]live_migration_flag in nova.conf?
>
>
> Might be a long shot, but I've removed VIR_MIGRATE_TUNNELLED in our clouds
>
> Best regards
>
> On 11/26/2017 01:01 PM, Sean Redmond wrote:
>
> Hi,
>
> I think it may be related to this:
>
> https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1647389
>
> Thanks
>
> On Thu, Nov 23, 2017 at 6:20 PM, Joe Topjian <j...@topjian.net> wrote:
>
>> OK, thanks. We'll definitely look at downgrading in a test environment.
>>
>> To add some further info to this problem, here are some log entries. When
>> an instance fails to snapshot or fails to migrate, we see:
>>
>> libvirtd[27939]: Cannot start job (modify, none) for domain
>> instance-4fe4; current job is (modify, none) owned by (27942
>> remoteDispatchDomainBlockJobAbort, 0 ) for (69116s, 0s)
>>
>> libvirtd[27939]: Cannot start job (none, migration out) for domain
>> instance-4fe4; current job is (modify, none) owned by (27942
>> remoteDispatchDomainBlockJobAbort, 0 ) for (69361s, 0s)
>>
>>
>> The one piece of this that I'm currently fixated on is the length of time
>> it takes libvirt to start. I'm not sure if it's causing the above, though.
>> When starting libvirt through systemd, it takes much longer to process the
>> iptables and ebtables rules than if we start libvirtd on the command-line
>> directly.
>>
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>> nat -L libvirt-J-vnet5'
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>> nat -L libvirt-P-vnet5'
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>> nat -F libvirt-J-vnet5'
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>> nat -X libvirt-J-vnet5'
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>> nat -F libvirt-P-vnet5'
>> virFirewallApplyRule:839 : Applying rule '/sbin/ebtables --concurrent -t
>> nat -X libvirt-P-vnet5'
>>
>> We're talking about a difference between 5 minutes and 5 seconds
>> depending on where libvirt was started. This doesn't seem normal to me.
>>
>> In general, is anyone aware of systemd performing restrictions of some
>> kind on processes which create subprocesses? Or something like that? I've
>> tried comparing cgroups and the various limits within systemd between my
>> shell session and the libvirt-bin.service session and can't find anything
>> immediately noticeable. Maybe it's apparmor?
>>
>> Thanks,
>> Joe
>>
>> On Thu, Nov 23, 2017 at 11:03 AM, Chris Sarginson <csarg...@gmail.com>
>> wrote:
>>
>>> I think we may have pinned libvirt-bin as well, (1.3.1), but I can't
>>> guarantee that, sorry - I would suggest its worth trying pinning both
>>> initially.
>>>
>>> Chris
>>>
>>> On Thu, 23 Nov 2017 at 17:42 Joe Topjian <j...@topjian.net> wrote:
>>>
>>>> Hi Chris,
>>>>
>>>> Thanks - we will definitely look into this. To confirm: did you also
>>>> downgrade libvirt as well or was it all qemu?
>>>>
>>>> Thanks,
>>>> Joe
>>>>
>>>> On Thu, Nov 23, 2017 at 9:16 AM, Chris Sarginson <csarg...@gmail.com>
>>>> wrote:
>>>>
>>>>> We hit the same issue a while back (I suspect), which we seemed to
>>>>> resolve by pinning QEMU and related packages at the following version (you
>>>>> might need to hunt down the debs manually):
>>>>>
>>>>> 1:2.5+dfsg-5ubuntu10.5
>>>>>
>>>>> I'm certain there's a launchpad bug for Ubuntu qemu regarding this,
>>>>> but don't have it to hand.
>>>>>
>>>>> Hope this helps,
>>>>> Chris
>>>>>
>>>>> On Thu, 23 Nov 2017 at 15:33 Joe Topjian <j...@topjian.net> wrote:
>>>>>
>>

Re: [Openstack-operators] mitaka/xenial libvirt issues

2017-11-23 Thread Joe Topjian
Hi Chris,

Thanks - we will definitely look into this. To confirm: did you also
downgrade libvirt as well or was it all qemu?
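
For anyone following along, pinning to the version Chris mentions below would be
an entry in /etc/apt/preferences.d/ roughly like this (the package list is a
guess -- check "dpkg -l 'qemu*'" for what is actually installed):

Package: qemu-kvm qemu-system-x86 qemu-utils qemu-block-extra
Pin: version 1:2.5+dfsg-5ubuntu10.5
Pin-Priority: 1001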

Thanks,
Joe

On Thu, Nov 23, 2017 at 9:16 AM, Chris Sarginson <csarg...@gmail.com> wrote:

> We hit the same issue a while back (I suspect), which we seemed to resolve
> by pinning QEMU and related packages at the following version (you might
> need to hunt down the debs manually):
>
> 1:2.5+dfsg-5ubuntu10.5
>
> I'm certain there's a launchpad bug for Ubuntu qemu regarding this, but
> don't have it to hand.
>
> Hope this helps,
> Chris
>
> On Thu, 23 Nov 2017 at 15:33 Joe Topjian <j...@topjian.net> wrote:
>
>> Hi all,
>>
>> We're seeing some strange libvirt issues in an Ubuntu 16.04 environment.
>> It's running Mitaka, but I don't think this is a problem with OpenStack
>> itself.
>>
>> We're in the process of upgrading this environment from Ubuntu 14.04 with
>> the Mitaka cloud archive to 16.04. Instances are being live migrated (NFS
>> share) to a new 16.04 compute node (fresh install), so there's a change
>> between libvirt versions (1.2.2 to 1.3.1). The problem we're seeing is only
>> happening on the 16.04/1.3.1 nodes.
>>
>> We're getting occasional reports of instances not able to be snapshotted.
>> Upon investigation, the snapshot process quits early with a libvirt/qemu
>> lock timeout error. We then see that the instance's xml file has
>> disappeared from /etc/libvirt/qemu and must restart libvirt and hard-reboot
>> the instance to get things back to a normal state. Trying to live-migrate
>> the instance to another node causes the same thing to happen.
>>
>> However, at some random time, either the snapshot or the migration will
>> work without error. I haven't been able to reproduce this issue on my own
>> and haven't been able to figure out the root cause by inspecting instances
>> reported to me.
>>
>> One thing that has stood out is the length of time it takes for libvirt
>> to start. If I run "/etc/init.d/libvirt-bin start", it takes at least 5
>> minutes before a simple "virsh list" will work. The command will hang
>> otherwise. If I increase libvirt's logging level, I can see that during
>> this period of time, libvirt is working on iptables and ebtables (looks
>> like it's shelling out commands).
>>
>> But if I run "libvirtd -l" straight on the command line, all of this
>> completes within 5 seconds (including all of the shelling out).
>>
>> My initial thought is that systemd is doing some type of throttling
>> between the system and user slice, but I've tried comparing slice
>> attributes and, probably due to my lack of understanding of systemd, can't
>> find anything to prove this.
>>
>> Is anyone else running into this problem? Does anyone know what might be
>> the cause?
>>
>> Thanks,
>> Joe
>> ___
>> OpenStack-operators mailing list
>> OpenStack-operators@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] mitaka/xenial libvirt issues

2017-11-23 Thread Joe Topjian
Hi all,

We're seeing some strange libvirt issues in an Ubuntu 16.04 environment.
It's running Mitaka, but I don't think this is a problem with OpenStack
itself.

We're in the process of upgrading this environment from Ubuntu 14.04 with
the Mitaka cloud archive to 16.04. Instances are being live migrated (NFS
share) to a new 16.04 compute node (fresh install), so there's a change
between libvirt versions (1.2.2 to 1.3.1). The problem we're seeing is only
happening on the 16.04/1.3.1 nodes.

We're getting occasional reports of instances not able to be snapshotted.
Upon investigation, the snapshot process quits early with a libvirt/qemu
lock timeout error. We then see that the instance's xml file has
disappeared from /etc/libvirt/qemu and must restart libvirt and hard-reboot
the instance to get things back to a normal state. Trying to live-migrate
the instance to another node causes the same thing to happen.

However, at some random time, either the snapshot or the migration will
work without error. I haven't been able to reproduce this issue on my own
and haven't been able to figure out the root cause by inspecting instances
reported to me.

One thing that has stood out is the length of time it takes for libvirt to
start. If I run "/etc/init.d/libvirt-bin start", it takes at least 5
minutes before a simple "virsh list" will work. The command will hang
otherwise. If I increase libvirt's logging level, I can see that during
this period of time, libvirt is working on iptables and ebtables (looks
like it's shelling out commands).

But if I run "libvirtd -l" straight on the command line, all of this
completes within 5 seconds (including all of the shelling out).

My initial thought is that systemd is doing some type of throttling between
the system and user slice, but I've tried comparing slice attributes and,
probably due to my lack of understanding of systemd, can't find anything to
prove this.
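
For the record, the comparison amounts to something like the following, so feel
free to point out if I'm looking at the wrong knobs (the property list is far
from exhaustive):

$ systemctl show libvirt-bin.service -p Slice,CPUShares,TasksMax,LimitNOFILE,LimitNPROC
$ systemctl show system.slice user.slice -p CPUShares,TasksMax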

Is anyone else running into this problem? Does anyone know what might be
the cause?

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Successful nova-network to Neutron Migration

2017-05-20 Thread Joe Topjian
Yep, absolutely:

https://github.com/cybera/novanet2neutron

The changes are all rolled into one commit:

https://github.com/cybera/novanet2neutron/commit/31a6ef0eaebb84b94f4cc97d4f0acfffb4eed251

Note that we hard-coded the network and subnet names in there, so if you
end up using this, you'll want to change that.

Thanks,
Joe

On Sat, May 20, 2017 at 2:54 PM, Belmiro Moreira <
moreira.belmiro.email.li...@gmail.com> wrote:

> Hi Joe,
> congrats.
>
> Can you also make available your scripts changes for IPv6?
> The more the better for any site that is still working in the migration,
> like us :)
>
> thanks,
> Belmiro
>
> On Sat, May 20, 2017 at 6:51 PM, Joe Topjian <j...@topjian.net> wrote:
>
>> Hi all,
>>
>> There probably aren't a lot of people in this situation nowadays, but for
>> those that are, I wanted to report a successful nova-network to Neutron
>> migration.
>>
>> We used NeCTAR's migration scripts which can be found here:
>>
>> https://github.com/NeCTAR-RC/novanet2neutron
>>
>> These scripts allowed us to do an in-place upgrade with almost no
>> downtime. There was probably an hour or two of network downtime, but all
>> instances stayed up and running. There were also a handful of instances
>> that needed a hard reboot and some that had to give up their Floating IP to
>> Neutron. All acceptable, IMO.
>>
>> We modified them to suit our environment, specifically by adding support
>> for IPv6 and Floating IPs. In addition, we leaned on our existing Puppet
>> environment to deploy certain  Nova and Neutron settings in phases.
>>
>> But we wouldn't have been able to do this migration without these
>> scripts, so to Sam and the rest of the NeCTAR crew: thank you all very much!
>>
>> Joe
>>
>> ___
>> OpenStack-operators mailing list
>> OpenStack-operators@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [scientific] Resource reservation requirements (Blazar) - Forum session

2017-04-03 Thread Joe Topjian
On Mon, Apr 3, 2017 at 8:20 AM, Jay Pipes <jaypi...@gmail.com> wrote:

> On 04/01/2017 08:32 PM, Joe Topjian wrote:
>
>> On Sat, Apr 1, 2017 at 5:21 PM, Matt Riedemann <mriede...@gmail.com
>> <mailto:mriede...@gmail.com>> wrote:
>>
>> On 4/1/2017 8:36 AM, Blair Bethwaite wrote:
>>
>> Hi all,
>>
>> The below was suggested for a Forum session but we don't yet have
>> a
>> submission or name to chair/moderate. I, for one, would certainly
>> be
>> interested in providing input. Do we have any owners out there?
>>
>> Resource reservation requirements:
>> ==
>> The Blazar project [https://wiki.openstack.org/wiki/Blazar
>> <https://wiki.openstack.org/wiki/Blazar>] has been
>> revived following Barcelona and will soon release a new version.
>> Now
>> is a good time to get involved and share requirements with the
>> community. Our development priorities are described through
>> Blueprints
>> on Launchpad: https://blueprints.launchpad.net/blazar
>> <https://blueprints.launchpad.net/blazar>
>>
>> In particular, support for pre-emptible instances could be
>> combined
>> with resource reservation to maximize utilization on unreserved
>> resources.+1
>>
>>
>> Regarding resource reservation, please see this older Nova spec
>> which is related:
>>
>> https://review.openstack.org/#/c/389216/
>> <https://review.openstack.org/#/c/389216/>
>>
>> And see the points that Jay Pipes makes in that review. Before
>> spending a lot of time reviving the project, I'd encourage people to
>> read and digest the points made in that review, and if there are
>> responses or other use cases then let's discuss them *before*
>> bringing a service back from the dead and assuming it will be
>> integrated into the other projects.
>>
>> This is appreciated. I'll describe the way I've seen Blazar used and I
>> believe it's quite different than the above slot reservation as well as
>> spot instance support, but please let me know if I am incorrect or if
>> there have been other discussions about this use-case elsewhere:
>>
>> A research group has a finite amount of specialized hardware and there
>> are more people wanting to use this hardware than what's currently
>> available. Let's use high performance GPUs as an example. The group is
>> OK with publishing the amount of hardware they have available (normally
>> this is hidden as best as possible). By doing this, a researcher can use
>> Blazar as sort of a community calendar, see that there are 3 GPU nodes
>> available for the week of April 3, and reserve them for that time period.
>>
>
> Yeah, I totally understand this use case.
>
> However, implementing the above in any useful fashion requires that Blazar
> be placed *above* Nova and essentially that the cloud operator turns off
> access to Nova's  POST /servers API call for regular users. Because if not,
> the information that Blazar acts upon can be simply circumvented by any
> user at any time.
>
> In other words, your "3 GPU nodes available for the week of April 3" can
> change at any time by a user that goes and launches instances that consumes
> those 3 GPU nodes.
>
> If you have a certain type of OpenStack deployment that isn't multi-user
> and where the only thing that launches instances is an
> automation/orchestration tool (in other words, an NFV MANO system), the
> reservation concepts works great -- because you don't have pesky users that
> can sidestep the system and actually launch instances that would impact
> reserved consumables.
>
> However, if you *do* have normal users of your cloud -- as most scientific
> deployments must have -- then I'm afraid the only way to make this work is
> to have users *only* use the Blazar API to reserve instances and
> essentially shut off the normal Nova POST /servers API.
>
> Does that make sense?
>

Ah, yes, indeed it does. Thanks, Jay.
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [scientific] Resource reservation requirements (Blazar) - Forum session

2017-04-01 Thread Joe Topjian
On Sat, Apr 1, 2017 at 5:21 PM, Matt Riedemann  wrote:

> On 4/1/2017 8:36 AM, Blair Bethwaite wrote:
>
>> Hi all,
>>
>> The below was suggested for a Forum session but we don't yet have a
>> submission or name to chair/moderate. I, for one, would certainly be
>> interested in providing input. Do we have any owners out there?
>>
>> Resource reservation requirements:
>> ==
>> The Blazar project [https://wiki.openstack.org/wiki/Blazar] has been
>> revived following Barcelona and will soon release a new version. Now
>> is a good time to get involved and share requirements with the
>> community. Our development priorities are described through Blueprints
>> on Launchpad: https://blueprints.launchpad.net/blazar
>>
>> In particular, support for pre-emptible instances could be combined
>> with resource reservation to maximize utilization on unreserved
>> resources.+1
>>
>
> Regarding resource reservation, please see this older Nova spec which is
> related:
>
> https://review.openstack.org/#/c/389216/
>
> And see the points that Jay Pipes makes in that review. Before spending a
> lot of time reviving the project, I'd encourage people to read and digest
> the points made in that review, and if there are responses or other use cases
> then let's discuss them *before* bringing a service back from the dead and
> assuming it will be integrated into the other projects.


This is appreciated. I'll describe the way I've seen Blazar used and I
believe it's quite different than the above slot reservation as well as
spot instance support, but please let me know if I am incorrect or if there
have been other discussions about this use-case elsewhere:

A research group has a finite amount of specialized hardware and there are
more people wanting to use this hardware than what's currently available.
Let's use high performance GPUs as an example. The group is OK with
publishing the amount of hardware they have available (normally this is
hidden as best as possible). By doing this, a researcher can use Blazar as
sort of a community calendar, see that there are 3 GPU nodes available for
the week of April 3, and reserve them for that time period.


>
>> Is Blazar the right project to discuss reservations of finite
>> consumable resources like software licenses?
>>
>>   Blazar would like to ultimately support many different kinds of
>> resources (volumes, floating IPs, etc.). Software licenses can be
>> another type.
>> ==
>> (https://etherpad.openstack.org/p/BOS-UC-brainstorming-scientific-wg)
>>
>> Cheers,
>>
>>
> John Garbutt also has a WIP backlog spec in Nova related to pre-emptible
> instances:
>
> https://review.openstack.org/#/c/438640/
>
> --
>
> Thanks,
>
> Matt
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] backup to object store - tool recommendations

2017-03-27 Thread Joe Topjian
We use rclone quite a bit. It works great and has a wealth of features:

http://rclone.org/
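
A minimal Swift-backed run looks something like this (the remote name and
container are whatever you choose when creating the remote with "rclone config"):

$ rclone config            # define a remote of type "swift" using your OpenStack credentials
$ rclone sync /srv/data swift-backup:backups/data
$ rclone ls swift-backup:backups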


On Mon, Mar 27, 2017 at 7:50 AM, Nick Jones 
wrote:

> On 27 Mar 2017, at 12:59, Marcus Furlong  wrote:
>
>
> On 27 March 2017 at 22:39, Blair Bethwaite 
> wrote:
>
> Hi all,
>
> Does anyone have any recommendations for good tools to perform
> file-system/tree backups and restores to/from a (Ceph RGW-based)
> object store (Swift or S3 APIs)? Happy to hear about both FOSS and
> commercial options please.
>
>
> [..]
>
>
> I've used duplicity before and it seems to provide most of the
> features you are looking for (not sure about xattrs though):
>
>   http://duplicity.nongnu.org/
>
> S3 and Swift are both supported.
>
>
> +1 for Duplicity - we’ve used it with a degree of success for backups of
> various internal systems.
>
> I can also recommend CloudBerry Backup: https://www.
> cloudberrylab.com/backup.aspx
>
> And for personal stuff on MacOS and Windows Arq is great:
> https://www.arqbackup.com
>
> Cheers.
>
> —
>
> -Nick
>
> DataCentred Limited registered in England and Wales no. 05611763
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [neutron] Modify Default Quotas

2017-03-23 Thread Joe Topjian
We run a similar kind of script.

I think in most cases, a Floating IP means a publicly routable IP, and
those are now scarce resources. Because of that, I agree with what's been
mentioned about a conservative floating IP quota.

Since the other resource types aren't restricted by external availability,
they could easily be a higher value. Of course, a small floating IP quota
might restrict what a user can do with the other resources.

The only network resource I've had a user request an increase on is
security groups and rules. Users manage security groups and rules in a lot
of different ways. Some are very conservative and some make new groups for
*everything*.
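
For reference, the defaults in question live in the [quotas] section of
neutron.conf; tightening floating IPs while leaving more room elsewhere looks
something like this (values are just an example, not a recommendation):

[quotas]
quota_floatingip = 2
quota_security_group = 20
quota_security_group_rule = 200

Per-tenant exceptions can then be granted as requests come in, e.g. with
"neutron quota-update --tenant-id <id> --floatingip 10".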

On Thu, Mar 23, 2017 at 5:46 PM, Pierre Riteau  wrote:

> We’ve encountered the same issue in our cloud. I wouldn’t be surprised if
> it was quite common for systems with many tenants that are not active all
> the time.
>
> You may be interested by this OSOps script: https://git.openstack.org/
> cgit/openstack/osops-tools-generic/tree/neutron/orphan_tool/delete_orphan_
> floatingips.py
> The downside with this script is that it may delete a floating IP that was
> just allocated, if it runs just before the user attaches it to their
> instance.
>
> We have chosen to write a script that releases floating IPs held by
> tenants only if the tenant is inactive for a period of time. We define
> inactive by not having run any instance during this period.
> It is not a silver bullet though, because a tenant running only one
> instance can still keep 49 floating IPs unused, but we found that it helps
> a lot because most of the unused IPs were held by inactive tenants.
>
> Ideally Neutron would be able to track when a floating IP was last
> attached and release it automatically after a configurable period of time.
>
> > On 23 Mar 2017, at 12:47, Saverio Proto  wrote:
> >
> > Hello,
> >
> > floating IPs is the real issue.
> >
> > When using horizon it is very easy for users to allocate floating ips
> > but it is also very difficult to release them.
> >
> > In our production cloud we had to change the default from 50 to 2. We
> > have to be very conservative with floatingips quota because our
> > experience is that the user will never release a floating IP.
> >
> > A good starting point is to set the quota for floating IPs to the
> > same value as the quota for nova instances.
> >
> > Saverio
> >
> >
> > 2017-03-22 16:46 GMT+01:00 Morales, Victor :
> >> Hey there,
> >>
> >>
> >>
> >> I noticed that Ihar started working on a change to increase the default
> >> quotas values in Neutron[1].  Personally, I think that makes sense to
> change
> >> it but I’d like to complement it.  So, based on your experience, what
> should
> >> be the most common quota value for networks, subnets, ports, security
> >> groups, security rules, routers and Floating IPs per tenant?
> >>
> >>
> >>
> >> Regards/Saludos
> >>
> >> Victor Morales
> >>
> >> irc: electrocucaracha
> >>
> >>
> >>
> >> [1] https://review.openstack.org/#/c/444030
> >>
> >>
> >> ___
> >> OpenStack-operators mailing list
> >> OpenStack-operators@lists.openstack.org
> >> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
> >>
> >
> > ___
> > OpenStack-operators mailing list
> > OpenStack-operators@lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Flavors

2017-03-15 Thread Joe Topjian
Follow-up thought:

> This concept has never been questioned anywhere I can search, so I have a
> feeling I'm missing something big here. Maybe other ways are too
> complicated to implement?

This topic does get brought up from time to time, but in different areas
under different names. Off the top of my head, there was a past discussion about
quota management and how one could limit the amount of SSD disk space a
user could use while giving them larger access to spindle disk.

Being able to manage different characteristics about a resource (disk size
vs disk IOPS) is a complicated thing to do and it's certainly not a solved
problem. I don't want to say something like "flavors are just the accepted
norm" because that would be doing them a huge injustice, but I did want to
follow-up and say that you're not alone if you've hit issues with the way
resources are bundled. :)
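
As a partial example of per-characteristic limits that do exist today: with the
libvirt driver you can hang IO limits off a flavor via extra specs, something
like the following (flavor name and numbers are made up):

$ openstack flavor set \
    --property quota:disk_read_iops_sec=2000 \
    --property quota:disk_write_iops_sec=2000 \
    m1.fast-io

But that still bundles the limit into the flavor rather than letting a user dial
it in per instance, which I think is exactly the point being raised here.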

On Wed, Mar 15, 2017 at 10:31 PM, Joe Topjian <j...@topjian.net> wrote:

> Another benefit of flavors is that they provide ease of use. While there
> are users who are confident enough to spec out each instance they launch, I
> work with a lot of users who would feel overwhelmed if they had to do this.
> Providing a set of recommended instance specs can go a long way to lowering
> the barrier of usage.
>
> On Wed, Mar 15, 2017 at 10:19 PM, Mike Lowe <joml...@iu.edu> wrote:
>
>> How would you account for heterogeneous node types? Flavors by convention
>> put the hardware generation in the name as the digit.
>>
>> Sent from my iPad
>>
>> On Mar 15, 2017, at 11:42 PM, Kris G. Lindgren <klindg...@godaddy.com>
>> wrote:
>>
>> So how do you bill for someone when you have a 24 core, 256GB ram, with
>> 3TB of disk machine - and someone creates a 1 core, 512MB ram, 2.9TB disk –
>> flavor?  Are you going to charge them the same amount as if they created a 24
>> core, 250GB instance with 1TB of disk?  Because both of those flavors make
>> it practically impossible to use that hardware for another VM.  Thus, to
>> you they have exactly the same cost.
>>
>>
>>
>> With free-for-all flavor sizes your bin packing goes to shit and you are
>> left with inefficiently used hardware.  With free-for-all flavor sizes how
>> can you make sure that your large ram instances go to SKUs optimized to
>> handle those large ram VM’s?
>>
>>
>>
>> ___
>>
>> Kris Lindgren
>>
>> Senior Linux Systems Engineer
>>
>> GoDaddy
>>
>>
>>
>> *From: *Matthew Kaufman <mkfmn...@gmail.com>
>> *Date: *Wednesday, March 15, 2017 at 5:42 PM
>> *To: *"Fox, Kevin M" <kevin@pnnl.gov>
>> *Cc: *OpenStack Operators <openstack-operators@lists.openstack.org>
>> *Subject: *Re: [Openstack-operators] Flavors
>>
>>
>>
>> Screw the short answer -- that is annoying to read, and it doesn't
>> simplify BILLING from a CapEx/OpEx perspective, so please - wtf?
>>
>> Anyway, Vladimir - I love your question and have always wanted the same
>> thing.
>>
>>
>>
>> On Wed, Mar 15, 2017 at 6:10 PM, Fox, Kevin M <kevin@pnnl.gov> wrote:
>>
>> I think the really short answer is something like: It greatly simplifies
>> scheduling and billing.
>> --
>>
>> *From:* Vladimir Prokofev [v...@prokofev.me]
>> *Sent:* Wednesday, March 15, 2017 2:41 PM
>> *To:* OpenStack Operators
>> *Subject:* [Openstack-operators] Flavors
>>
>> A question of curiosity - why do we even need flavors?
>>
>>
>>
>> I do realise that we need a way to provide instance configuration, but
>> why use such a rigid construction? Wouldn't it be more flexible to provide
>> instance configuration as a set of parameters(metadata), and if you need
>> some presets - well, use a preconfigured set of them as a flavor in your
>> front-end(web/CLI client parameters)?
>>
>>
>>
>> Suppose commercial customer has an instance with high storage IO load.
>> Currently they have only one option - upsize instance to a flavor that
>> provides higher IOPS. But usually a provider has a limited amount of flavors
>> for purchase, and they upscale everything for a price. So instead of paying
>> only for IOPS customers are pushed to pay for whole package. This is good
>> from revenue point of view, but bad for customer's bank account and
>> marketing (i.e. product architecture limits).
>>
>> This applies to every resource - vCPU, RAM, storage, networking, etc -
>> everything is controlled by flavor.
>>
>>

Re: [Openstack-operators] Flavors

2017-03-15 Thread Joe Topjian
Another benefit of flavors is that they provide ease of use. While there
are users who are confident enough to spec out each instance they launch, I
work with a lot of users who would feel overwhelmed if they had to do this.
Providing a set of recommended instance specs can go a long way to lowering
the barrier of usage.

On Wed, Mar 15, 2017 at 10:19 PM, Mike Lowe  wrote:

> How would you account for heterogeneous node types? Flavors by convention
> put the hardware generation in the name as the digit.
>
> Sent from my iPad
>
> On Mar 15, 2017, at 11:42 PM, Kris G. Lindgren 
> wrote:
>
> So how do you bill for someone when you have a 24 core, 256GB ram, with
> 3TB of disk machine - and someone creates a 1 core, 512MB ram, 2.9TB disk –
> flavor?  Are you going to charge them the same amount as if they created a 24
> core, 250GB instance with 1TB of disk?  Because both of those flavors make
> it practically impossible to use that hardware for another VM.  Thus, to
> you they have exactly the same cost.
>
>
>
> With free-for-all flavor sizes your bin packing goes to shit and you are
> left with inefficiently used hardware.  With free-for-all flavor sizes how
> can you make sure that your large ram instances go to SKUs optimized to
> handle those large ram VM’s?
>
>
>
> ___
>
> Kris Lindgren
>
> Senior Linux Systems Engineer
>
> GoDaddy
>
>
>
> *From: *Matthew Kaufman 
> *Date: *Wednesday, March 15, 2017 at 5:42 PM
> *To: *"Fox, Kevin M" 
> *Cc: *OpenStack Operators 
> *Subject: *Re: [Openstack-operators] Flavors
>
>
>
> Screw the short answer -- that is annoying to read, and it doesn't
> simplify BILLING from a CapEx/OpEx perspective, so please - wtf?
>
> Anyway, Vladimir - I love your question and have always wanted the same
> thing.
>
>
>
> On Wed, Mar 15, 2017 at 6:10 PM, Fox, Kevin M  wrote:
>
> I think the really short answer is something like: It greatly simplifies
> scheduling and billing.
> --
>
> *From:* Vladimir Prokofev [v...@prokofev.me]
> *Sent:* Wednesday, March 15, 2017 2:41 PM
> *To:* OpenStack Operators
> *Subject:* [Openstack-operators] Flavors
>
> A question of curiosity - why do we even need flavors?
>
>
>
> I do realise that we need a way to provide instance configuration, but why
> use such a rigid construction? Wouldn't it be more flexible to provide
> instance configuration as a set of parameters(metadata), and if you need
> some presets - well, use a preconfigured set of them as a flavor in your
> front-end(web/CLI client parameters)?
>
>
>
> Suppose commercial customer has an instance with high storage IO load.
> Currently they have only one option - upsize instance to a flavor that
> provides higher IOPS. But usually a provider has a limited amount of flavors
> for purchase, and they upscale everything for a price. So instead of paying
> only for IOPS customers are pushed to pay for whole package. This is good
> from revenue point of view, but bad for customer's bank account and
> marketing (i.e. product architecture limits).
>
> This applies to every resource - vCPU, RAM, storage, networking, etc -
> everything is controlled by flavor.
>
>
>
> This concept has never been questioned anywhere I can search, so I have a
> feeling I'm missing something big here. Maybe other ways are too
> complicated to implement?
>
>
>
> So does anyone has any idea - why such rigid approach as flavors instead
> of something more flexible?
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] libvirt freezing when loading Nova instance nwfilters

2017-02-22 Thread Joe Topjian
We ran into the "virsh nwfilter-list hanging indefinitely" thing back in
early January. I spent hours and I almost went insane trying to figure it
out. We weren't upgrading nodes, though, it just sort of happened.

I have no idea if the following was the correct way of handling this, but
this ultimately got nova-compute back up and running:

I ran:

$ ss -ax

on the hypervisor and saw that some monitor sockets had a Recv-Q of
non-zero. On the processes related to those sockets, I ran:

$ strace -p <pid>

and saw no activity. Compared to sockets with zero Recv-Q, strace showed
activity. By now, I figured my only options were a full hypervisor reboot
or to kill the instances with no activity. Since those instances would be
killed from a full reboot anyway, I did a "virsh destroy" on the instances.
Once they were destroyed, nova-compute was able to start cleanly.
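
If it helps anyone else, a quick way to shortlist the suspect domains is
something like the following (column positions and the monitor socket path can
vary between ss and libvirt versions, so treat it as a rough filter), followed
by a "virsh destroy" of the matching domains as described above:

$ ss -ax | awk '$3 > 0' | grep monitor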

We had this happen on 3 hypervisors. Each one had between 1 and 3 of these
types of instances, so not a lot at all. Once they were destroyed,
nova-compute began working again on all 3.

We later had a user report that he noticed some problems with his instance
(not one of the ones destroyed) and thought it might have to do with the
leap second. No idea if that's true, but the timing kind of works out.

Hope that helps,
Joe


On Wed, Feb 22, 2017 at 8:33 AM, Edmund Rhudy (BLOOMBERG/ 120 PARK) <
erh...@bloomberg.net> wrote:

> I recently witnessed a strange issue with libvirt when upgrading one of
> our clusters from Kilo to Liberty. I'm not really looking for a specific
> diagnosis here because of the large number of confounding factors and the
> relative ease of remediating it, but I'm interested to hear if anyone else
> has witnessed this particular problem.
>
> Background is we had a number of Kilo-based clusters, all running Ubuntu
> 14.04.4 with OpenStack installed from the Ubuntu cloud archive. The upgrade
> process to Liberty involved upgrading the OpenStack components and their
> dependencies (including libvirt), then afterward upgrading all remaining
> packages via dist-upgrade (and staging a kernel upgrade from 3.13 to 4.4,
> to take effect on the next reboot). 7 clusters had all been upgraded
> successfully using this strategy.
>
> One cluster, however, decided to get a bit weird. After the upgrade, 4
> hypervisors showed that nova-compute was refusing to come up properly and
> was showing as enabled/down in nova service-list. Upon further
> investigation, nova-compute was starting up but was getting jammed on
> loading nwfilters. When I ran "virsh nwfilter-list", the command stalled
> indefinitely. Killing nova-compute and restarting libvirt-bin service
> allowed the command to work again, but it did not list any of the
> nova-instance-instance-* nwfilters. Once nova-compute was started, it tried
> to start loading the instance-specific filters and libvirt would wedge. I
> spent a while tinkering with the affected systems but could not find any
> way of correcting the issue other than rebooting the hypervisor, after
> which everything was fine.
>
> Has anyone ever seen anything like this? libvirt was upgraded from 1.2.12
> to 1.2.16. Hundreds of hypervisors had already received this exact same
> upgrade without showing this problem, and I have no idea how I could
> reproduce it. I'm interested to hear if anyone else has ever run into this
> and if they figured out what the root cause was, though I've already braced
> myself for tumbleweeds.
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Encrypted Cinder Volume Deployment

2017-02-05 Thread Joe Topjian
Just an update on this:

I've confirmed that specifying a fixed_key in both Cinder and Nova works
quite easily. However, if the key is changed, volumes created with the
original fixed_key are irrecoverable, and there doesn't seem to be a way to
safely rotate fixed keys...
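
For anyone wanting to try this, the setup boils down to putting the same hex key
into both nova.conf and cinder.conf, roughly as below (the key is obviously an
example -- generate your own with "openssl rand -hex 32" -- and the section was
[keymgr] on Kilo while newer releases use [key_manager]):

[keymgr]
fixed_key = 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef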

I spent some time trying to set up Barbican in a Packstack-deployed AIO
environment, but was unable to do so, so I couldn't test any other form of
encrypted block storage volumes. Because of time constraints, I'll have to
table this for another time.

If anyone comes across this in the mailing list archives and has an update,
do post :)

Thanks,
Joe

On Mon, Jan 23, 2017 at 8:58 PM, Joe Topjian <j...@topjian.net> wrote:

> Hi Kris,
>
> I came across that as well and I believe it has been fixed and ensures
> existing volumes are accessible:
>
> https://github.com/openstack/nova/blob/8c3f775743914fe083371a31433ef5
> 563015b029/releasenotes/notes/bug-1633518-0646722faac1a4b9.yaml
>
> Definitely worthwhile to bring up :)
>
> Joe
>
> On Mon, Jan 23, 2017 at 12:53 PM, Kris G. Lindgren <klindg...@godaddy.com>
> wrote:
>
>> Slightly off topic,
>>
>>
>>
>> But I remember a discussion involving encrypted volumes and nova(?) and
>> there was an issue/bug where nova was using the wrong key – like it
>> got hashed wrong and was using the badly hashed key/password vs.
>> what was configured.
>>
>>
>>
>>
>>
>> ___________
>>
>> Kris Lindgren
>>
>> Senior Linux Systems Engineer
>>
>> GoDaddy
>>
>>
>>
>> *From: *Joe Topjian <j...@topjian.net>
>> *Date: *Monday, January 23, 2017 at 12:41 PM
>> *To: *"openstack-operators@lists.openstack.org" <
>> openstack-operators@lists.openstack.org>
>> *Subject: *[Openstack-operators] Encrypted Cinder Volume Deployment
>>
>>
>>
>> Hi all,
>>
>>
>>
>> I'm investigating the options for configuring Cinder with encrypted
>> volumes and have a few questions.
>>
>>
>>
>> The Cinder environment is currently running Kilo which will be upgraded
>> to something between M-O later this year. The Kilo release supports the
>> fixed_key setting. I see fixed_key is still supported, but has been
>> abstracted into Castellan.
>>
>>
>>
>> Question: If I configure Kilo with a fixed key, will existing volumes
>> still be able to work with that same fixed key in an M, N, O release?
>>
>>
>>
>> Next, fixed_key is discouraged because of it being a single key for all
>> tenants. My understanding is that Barbican provides a way for each tenant
>> to generate their own key.
>>
>>
>>
>> Question: If I deploy with fixed_key (either now or in a later release),
>> can I move from a master key to Barbican without bricking all existing
>> volumes?
>>
>>
>>
>> Are there any other issues to be aware of? I've done a bunch of Googling
>> and searching on bugs.launchpad.net and am pretty satisfied with the
>> current state of support. My intention is to provide users with simple
>> native encrypted volume support - not so much supporting uploaded volumes,
>> bootable volumes, etc.
>>
>>
>>
>> But what I want to make sure of is that I'm not in a position where in
>> order to upgrade, a bunch of volumes become irrecoverable.
>>
>>
>>
>> Thanks,
>>
>> Joe
>>
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Encrypted Cinder Volume Deployment

2017-01-23 Thread Joe Topjian
Hi Kris,

I came across that as well and I believe it has been fixed and ensures
existing volumes are accessible:

https://github.com/openstack/nova/blob/8c3f775743914fe083371a31433ef5563015b029/releasenotes/notes/bug-1633518-0646722faac1a4b9.yaml

Definitely worthwhile to bring up :)

Joe

On Mon, Jan 23, 2017 at 12:53 PM, Kris G. Lindgren <klindg...@godaddy.com>
wrote:

> Slightly off topic,
>
>
>
> But I remember a discussion involving encrypted volumes and nova(?) and
> there was an issue/bug where nova was using the wrong key – like it
> got hashed wrong and was using the badly hashed key/password vs.
> what was configured.
>
>
>
>
>
> ___
>
> Kris Lindgren
>
> Senior Linux Systems Engineer
>
> GoDaddy
>
>
>
> *From: *Joe Topjian <j...@topjian.net>
> *Date: *Monday, January 23, 2017 at 12:41 PM
> *To: *"openstack-operators@lists.openstack.org" <
> openstack-operators@lists.openstack.org>
> *Subject: *[Openstack-operators] Encrypted Cinder Volume Deployment
>
>
>
> Hi all,
>
>
>
> I'm investigating the options for configuring Cinder with encrypted
> volumes and have a few questions.
>
>
>
> The Cinder environment is currently running Kilo which will be upgraded to
> something between M-O later this year. The Kilo release supports the
> fixed_key setting. I see fixed_key is still supported, but has been
> abstracted into Castellan.
>
>
>
> Question: If I configure Kilo with a fixed key, will existing volumes
> still be able to work with that same fixed key in an M, N, O release?
>
>
>
> Next, fixed_key is discouraged because of it being a single key for all
> tenants. My understanding is that Barbican provides a way for each tenant
> to generate their own key.
>
>
>
> Question: If I deploy with fixed_key (either now or in a later release),
> can I move from a master key to Barbican without bricking all existing
> volumes?
>
>
>
> Are there any other issues to be aware of? I've done a bunch of Googling
> and searching on bugs.launchpad.net and am pretty satisfied with the
> current state of support. My intention is to provide users with simple
> native encrypted volume support - not so much supporting uploaded volumes,
> bootable volumes, etc.
>
>
>
> But what I want to make sure of is that I'm not in a position where in
> order to upgrade, a bunch of volumes become irrecoverable.
>
>
>
> Thanks,
>
> Joe
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] Encrypted Cinder Volume Deployment

2017-01-23 Thread Joe Topjian
Hi all,

I'm investigating the options for configuring Cinder with encrypted volumes
and have a few questions.

The Cinder environment is currently running Kilo which will be upgraded to
something between M-O later this year. The Kilo release supports the
fixed_key setting. I see fixed_key is still supported, but has been
abstracted into Castellan.
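
For reference, the Kilo-era setting I'm referring to looks roughly like this
(a sketch only -- the section name later moves toward Castellan's
[key_manager], and my understanding is the same key needs to be present in
both cinder.conf and nova.conf for attach to work):

[keymgr]
# placeholder value; 64 hex characters
fixed_key = <64-character-hex-key>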

Question: If I configure Kilo with a fixed key, will existing volumes still
be able to work with that same fixed key in an M, N, O release?

Next, fixed_key is discouraged because of it being a single key for all
tenants. My understanding is that Barbican provides a way for each tenant
to generate their own key.

Question: If I deploy with fixed_key (either now or in a later release),
can I move from a master key to Barbican without bricking all existing
volumes?

Are there any other issues to be aware of? I've done a bunch of Googling
and searching on bugs.launchpad.net and am pretty satisfied with the
current state of support. My intention is to provide users with simple
native encrypted volume support - not so much supporting uploaded volumes,
bootable volumes, etc.

But what I want to make sure of is that I'm not in a position where in
order to upgrade, a bunch of volumes become irrecoverable.

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] VM monitoring suggestions

2016-11-17 Thread Joe Topjian
We have some custom scripts that run on the hypervisors which poll:

virsh dominfo
virsh domiflist
etc

The memory stats with "virsh dommemstat" are, AFAIK, not accurate since
there's nothing triggering kvm / the vm to release unused memory. But all
other virsh stuff works well for us.

We don't record "load", but we do record CPU time.
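
As a rough sketch of the kind of polling loop involved (field parsing here is
illustrative -- adjust to whatever your metrics pipeline expects):

for dom in $(virsh list --name); do
  # "CPU time" comes straight out of virsh dominfo
  cpu=$(virsh dominfo "$dom" | awk -F': *' '/CPU time/ {print $2}')
  # interface names from virsh domiflist, skipping the two header lines
  ifaces=$(virsh domiflist "$dom" | awk 'NR>2 && NF {print $1}' | tr '\n' ' ')
  echo "$dom cpu_time=$cpu ifaces=$ifaces"
done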

The "nova diagnostics" command can also be helpful. We have a custom policy
in place to allow users to query their own instances. I think a few others
are doing this as well -- there was a past discussion about it.
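
The policy change itself is roughly the following in nova's policy.json,
though the exact rule name depends on the release and whether you're on the
v2 or v2.1 API, so treat this as a sketch:

"compute_extension:server_diagnostics": "rule:admin_or_owner",
"os_compute_api:os-server-diagnostics": "rule:admin_or_owner",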

Hope that helps,
Joe

On Thu, Nov 17, 2016 at 9:57 AM, Jean-Philippe Methot <
jp.met...@planethoster.info> wrote:

> Hi,
>
> We are currently exploring monitoring solutions for the VMs we deploy for
> our customers in production. What I have been asked to deploy would be
> something akin to how you can see openvz container usage: you get memory
> usage, bandwidth, load and so forth for each container.
>
> I know that ceilometer may be an option, but I believe operators use all
> kinds of tools for their own resource usage monitoring. So what do you
> people use?
>
> (For this use case, we're looking for something that can be used without
> installing an agent in the VM, which makes it impossible to get a VM's load
> metric. I would be satisfied with cpu/memory/network/io metrics though.)
>
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [nova] Does anyone use the os-diagnostics API?

2016-10-12 Thread Joe Topjian
Hi Matt, Tim,

Thanks for asking. We’ve used the API in the past as a way of getting the
> usage data out of Nova. We had problems running ceilometer at scale and
> this was a way of retrieving the data for our accounting reports. We
> created a special policy configuration to allow authorised users query this
> data without full admin rights.
>

We do this as well.


> From the look of the new spec, it would be fairly straightforward to adapt
> the process to use the new format as all the CPU utilisation data is there.
>

I agree.
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Murano in Production

2016-09-26 Thread Joe Topjian
Hi Serg,

We were indeed hitting that bug, but the cert wasn't self-signed. It was
easier for us to manually patch the Ubuntu Cloud package of Murano with the
stable/mitaka fix linked in that bug report than trying to debug where
OpenSSL/python/requests/etc was going awry.

We might redeploy Murano strictly using virtualenvs and pip so we stay on
the latest stable patches.

Thanks,
Joe

On Mon, Sep 26, 2016 at 11:03 PM, Serg Melikyan 
wrote:

> Hi Joe,
>
> >Also, is it safe to say that communication between agent/engine only, and
> will only, happen during app deployment?
>
> murano-agent & murano-engine keep active connection to the Rabbit MQ
> broker but message exchange happens only during deployment of the app.
>
> >One thing we just ran into, though, was getting the agent/engine rmq
> config to work with SSL
>
> We had related bug fixed in Newton, can you confirm that you are *not*
> hitting bug #1578421 [0]
>
> References:
> [0] https://bugs.launchpad.net/murano/+bug/1578421
>
>
>
>
> On Mon, Sep 26, 2016 at 1:43 PM, Andrew Woodward  wrote:
> > In Fuel we deploy haproxy to all of the nodes that are part of the
> > VIP/endpoint service (this is usually part of the controller role). Then the
> > vips (internal or public) can be active on any member of the group.
> > Corosync/Pacemaker is used to move the VIP address (as opposed to
> > keepalived). In our case both haproxy and the vip live in a namespace, and
> > haproxy is always running on all of these nodes bound to 0/0.
> >
> > In the case of murano-rabbit we take the same approach as we do for galera:
> > all of the members are listed in the balancer, but with the others as
> > backups. This makes them inactive until the first node is down, which allows
> > the vip to move to any of the proxies in the cluster and continue to direct
> > traffic to the same node until that rabbit instance is also unavailable.
> >
> > listen mysqld
> >   bind 192.168.0.2:3306
> >   mode  tcp
> >   option  httpchk
> >   option  tcplog
> >   option  clitcpka
> >   option  srvtcpka
> >   stick on  dst
> >   stick-table  type ip size 1
> >   timeout client  28801s
> >   timeout server  28801s
> >   server node-1 192.168.0.4:3307  check port 49000 inter 20s fastinter
> 2s
> > downinter 2s rise 3 fall 3
> >   server node-3 192.168.0.6:3307 backup check port 49000 inter 20s
> fastinter
> > 2s downinter 2s rise 3 fall 3
> >   server node-4 192.168.0.5:3307 backup check port 49000 inter 20s
> fastinter
> > 2s downinter 2s rise 3 fall 3
> >
> > listen murano_rabbitmq
> >   bind 10.110.3.3:55572
> >   balance  roundrobin
> >   mode  tcp
> >   option  tcpka
> >   timeout client  48h
> >   timeout server  48h
> >   server node-1 192.168.0.4:55572  check inter 5000 rise 2 fall 3
> >   server node-3 192.168.0.6:55572 backup check inter 5000 rise 2 fall 3
> >   server node-4 192.168.0.5:55572 backup check inter 5000 rise 2 fall 3
> >
> >
> > On Fri, Sep 23, 2016 at 7:30 AM Mike Lowe  wrote:
> >>
> >> Would you mind sharing an example snippet from HA proxy config?  I had
> >> struggled in the past with getting this part to work.
> >>
> >>
> >> > On Sep 23, 2016, at 12:13 AM, Serg Melikyan 
> >> > wrote:
> >> >
> >> > Hi Joe,
> >> >
> >> > I can share some details on how murano is configured as part of the
> >> > default Mirantis OpenStack configuration and try to explain why it's
> >> > done the way it's done; I hope it helps you in your case.
> >> >
> >> > As part of Mirantis OpenStack, a second instance of RabbitMQ is
> >> > deployed specially for murano, but its configuration is
> >> > different than for the RabbitMQ instance used by the other OpenStack
> >> > components.
> >> >
> >> > Why use a separate instance of RabbitMQ?
> >> > 1. Prevent the possibility of getting access to the RabbitMQ supporting
> >> > the whole cloud infrastructure by limiting access on the networking level
> >> > rather than relying on authentication/authorization
> >> > 2. Prevent the possibility of DDoS by limiting access on the
> >> > networking level to the infrastructure RabbitMQ
> >> >
> >> > Given that the second RabbitMQ instance is used only for the murano-agent
> >> > <-> murano-engine communications and murano-agent is running on the
> >> > VMs, we had to make a couple of changes in the deployment of the RabbitMQ
> >> > (below I am referencing RabbitMQ as the RabbitMQ instance used by Murano
> >> > for m-agent <-> m-engine communications):
> >> >
> >> > 1. RabbitMQ is not clustered, just separate instance running on each
> >> > controller node
> >> > 2. RabbitMQ is exposed on the Public VIP where all OpenStack APIs are
> >> > exposed
> >> > 3. It has a different port number than the default
> >> > 4. HAProxy is used, RabbitMQ is hidden behind it and HAProxy is always
> >> > pointing to the RabbitMQ on the current primary controller
> >> >
> >> > Note: How murano-agent is working? Murano-engine creates queue 

Re: [Openstack-operators] Murano in Production

2016-09-23 Thread Joe Topjian
Hi Serg,

Thank you for sharing this information :)

If I'm understanding correctly, the main reason you're using a
non-clustered / corosync setup is because that's how most other components
in Mirantis OpenStack are configured? Is there anything to be aware of in
how Murano communicates over the agent/engine rmq in a clustered rmq setup?

Also, is it safe to say that communication between agent/engine only, and
will only, happen during app deployment? Meaning, if the rmq server goes
down (let's even say it goes away permanently for exaggeration), short of
some errors in the agent log, nothing else bad will come out of it?

With regard to a different port and a publicly accessible address, I agree
and we'll be deploying this same way.

One thing we just ran into, though, was getting the agent/engine rmq config
to work with SSL. For some reason the murano/openstack configuration (done
via oslo) had no problems recognizing our SSL cert, but the agent/engine
did not like it at all. The Ubuntu Cloud packages have not been updated for
a bit so we ended up patching for the "insecure" option both in engine and
agent templates (btw: very nice that the agent can be installed via
cloud-init -- I really didn't want to manage a second set of images just to
have the agent pre-installed).

Thank you again,
Joe

On Thu, Sep 22, 2016 at 10:13 PM, Serg Melikyan 
wrote:

> Hi Joe,
>
> I can share some details on how murano is configured as part of the
> default Mirantis OpenStack configuration and try to explain why it's
> done the way it's done; I hope it helps you in your case.
>
> As part of Mirantis OpenStack, a second instance of RabbitMQ is
> deployed specially for murano, but its configuration is
> different than for the RabbitMQ instance used by the other OpenStack
> components.
>
> Why use a separate instance of RabbitMQ?
>  1. Prevent the possibility of getting access to the RabbitMQ supporting
> the whole cloud infrastructure by limiting access on the networking level
> rather than relying on authentication/authorization
>  2. Prevent the possibility of DDoS by limiting access on the
> networking level to the infrastructure RabbitMQ
>
> Given that the second RabbitMQ instance is used only for the murano-agent
> <-> murano-engine communications and murano-agent is running on the
> VMs, we had to make a couple of changes in the deployment of the RabbitMQ
> (below I am referencing RabbitMQ as the RabbitMQ instance used by Murano
> for m-agent <-> m-engine communications):
>
> 1. RabbitMQ is not clustered, just separate instance running on each
> controller node
> 2. RabbitMQ is exposed on the Public VIP where all OpenStack APIs are
> exposed
> 3. It has a different port number than the default
> 4. HAProxy is used, RabbitMQ is hidden behind it and HAProxy is always
> pointing to the RabbitMQ on the current primary controller
>
> Note: How does murano-agent work? Murano-engine creates a queue with a
> unique name and puts configuration tasks into that queue, which are later
> picked up by murano-agent when the VM is booted; murano-agent
> is configured to use the created queue through cloud-init.
>
> #1 Clustering
>
> * Per app deployment we create 1-N VMs and send 1-M
> configuration tasks, where in most of the cases N and M are less than
> 3.
> * Even if an app deployment fails due to cluster failover, it
> can always be re-deployed by the user.
> * Controller-node failover will most probably lead to limited
> accessibility of the Heat, Nova & Neutron APIs, and the application
> deployment will fail regardless of the configuration
> task not executing on the VM.
>
> #2 Exposure on the Public VIP
>
> One of the reasons behind choosing RabbitMQ as transport for
> murano-agent communications was connectivity from the VM - it's much
> easier to implement connectivity *from* the VM than *to* the VM.
>
> But even in the case when you are connecting to the broker from the VM,
> you still need connectivity, and the public interface where all other
> OpenStack APIs are exposed is the most natural way to provide that.
>
> #3 Different from the default port number
>
> Just to avoid confusion with the RabbitMQ used for the infrastructure,
> even given that they are on different networks.
>
> #4 HAProxy
>
> In the case of the default Mirantis OpenStack configuration, it is used mostly
> to support the non-clustered RabbitMQ setup and exposure on the Public
> VIP, but it is also helpful in more complicated setups.
>
> P.S. I hope my answers helped, let me know if I can cover something in
> more details.
> --
> Serg Melikyan, Development Manager at Mirantis, Inc.
> http://mirantis.com | smelik...@mirantis.com
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Murano in Production

2016-09-18 Thread Joe Topjian
Good call.

I think Matt bringing up Trove is worthwhile, too. If we were to consider
deploying Trove in the future, and now that I've learned it also has an
agent/rabbit setup, there's definitely more weight behind a second
agent-only Rabbit cluster.

On Sun, Sep 18, 2016 at 9:15 PM, Sam Morrison <sorri...@gmail.com> wrote:

> You could also use https://www.rabbitmq.com/maxlength.html to mitigate
> overflowing on the trove vhost side.
>
>
> Sam
>
>
> On 19 Sep 2016, at 1:07 PM, Joe Topjian <j...@topjian.net> wrote:
>
> Thanks for everyone's input. I think I'm going to go with a single Rabbit
> cluster and separate by vhosts. Our environment is nowhere as large as
> NeCTAR or TWC, so I can definitely understand concern about Rabbit blowing
> the cloud up. We can be a little bit more flexible.
>
> As a precaution, though, I'm going to route everything through a new
> HAProxy frontend. At first, it'll just point to the same Rabbit cluster,
> but if we need to create a separate cluster, we'll swap the backend out.
> That should enable existing Murano agents to continue working.
>
> If this crashes and burns on us, I'll be more than happy to report
> failure. :)
>
> On Sun, Sep 18, 2016 at 7:38 PM, Silence Dogood <m...@nycresistor.com>
> wrote:
>
>> I'd love to see your results on this .  Very interesting stuff.
>>
>> On Sep 17, 2016 1:37 AM, "Joe Topjian" <j...@topjian.net> wrote:
>>
>>> Hi all,
>>>
>>> We're planning to deploy Murano to one of our OpenStack clouds and I'm
>>> debating the RabbitMQ setup.
>>>
>>> For background: the Murano agent that runs on instances requires access
>>> to RabbitMQ. Murano is able to be configured with two RabbitMQ services:
>>> one for traditional OpenStack communication and one for the Murano/Agent
>>> communication.
>>>
>>> From a security/segregation point of view, would vhost separation on our
>>> existing RabbitMQ cluster be sufficient? Or is it recommended to have an
>>> entirely separate cluster?
>>>
>>> As you can imagine, I'd like to avoid having to manage *two* RabbitMQ
>>> clusters. :)
>>>
>>> Thanks,
>>> Joe
>>>
>>> ___
>>> OpenStack-operators mailing list
>>> OpenStack-operators@lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>
>>>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Murano in Production

2016-09-18 Thread Joe Topjian
Thanks for everyone's input. I think I'm going to go with a single Rabbit
cluster and separate by vhosts. Our environment is nowhere as large as
NeCTAR or TWC, so I can definitely understand concern about Rabbit blowing
the cloud up. We can be a little bit more flexible.

As a precaution, though, I'm going to route everything through a new
HAProxy frontend. At first, it'll just point to the same Rabbit cluster,
but if we need to create a separate cluster, we'll swap the backend out.
That should enable existing Murano agents to continue working.
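
Roughly what I have in mind is the following (a sketch only -- vhost, user,
VIP, and port are placeholders):

# dedicated vhost/user for the agents on the existing cluster
rabbitmqctl add_vhost murano
rabbitmqctl add_user murano-agent <password>
rabbitmqctl set_permissions -p murano murano-agent ".*" ".*" ".*"

# haproxy frontend we can later repoint at a separate cluster
listen murano_agent_rabbit
  bind <public-vip>:55572
  mode tcp
  option tcpka
  timeout client 48h
  timeout server 48h
  server rabbit1 <rabbit-node-1>:5672 check inter 5000 rise 2 fall 3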

If this crashes and burns on us, I'll be more than happy to report failure.
:)

On Sun, Sep 18, 2016 at 7:38 PM, Silence Dogood <m...@nycresistor.com>
wrote:

> I'd love to see your results on this .  Very interesting stuff.
>
> On Sep 17, 2016 1:37 AM, "Joe Topjian" <j...@topjian.net> wrote:
>
>> Hi all,
>>
>> We're planning to deploy Murano to one of our OpenStack clouds and I'm
>> debating the RabbitMQ setup.
>>
>> For background: the Murano agent that runs on instances requires access
>> to RabbitMQ. Murano is able to be configured with two RabbitMQ services:
>> one for traditional OpenStack communication and one for the Murano/Agent
>> communication.
>>
>> From a security/segregation point of view, would vhost separation on our
>> existing RabbitMQ cluster be sufficient? Or is it recommended to have an
>> entirely separate cluster?
>>
>> As you can imagine, I'd like to avoid having to manage *two* RabbitMQ
>> clusters. :)
>>
>> Thanks,
>> Joe
>>
>> ___
>> OpenStack-operators mailing list
>> OpenStack-operators@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] Murano in Production

2016-09-16 Thread Joe Topjian
Hi all,

We're planning to deploy Murano to one of our OpenStack clouds and I'm
debating the RabbitMQ setup.

For background: the Murano agent that runs on instances requires access to
RabbitMQ. Murano is able to be configured with two RabbitMQ services: one
for traditional OpenStack communication and one for the Murano/Agent
communication.

From a security/segregation point of view, would vhost separation on our
existing RabbitMQ cluster be sufficient? Or is it recommended to have an
entirely separate cluster?

As you can imagine, I'd like to avoid having to manage *two* RabbitMQ
clusters. :)

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] cURL call at the end of provisioning and deprovisioning

2016-08-15 Thread Joe Topjian
Hi Will,

What about notification events? I'm not sure what the best source of
documentation is for events, but googling "OpenStack Notification Events"
yields a bunch of information that should lead you in the right direction.

With events, you would write a custom script / daemon that polls rabbit
for the latest events and acts upon certain ones (instance create, instance
destroy).
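
As a rough sketch of the plumbing (the option names below are the
oslo.messaging ones from around this era -- double-check them against your
release):

# nova.conf on the services you want events from
[DEFAULT]
notification_driver = messagingv2
notification_topics = notifications

# then confirm the notifications queue exists and is receiving messages
rabbitmqctl list_queues name messages | grep notifications

Your script would then consume the compute.instance.create.end and
compute.instance.delete.end messages from that queue and fire the cURL calls.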

Thanks,
Joe

On Mon, Aug 15, 2016 at 8:08 AM, William Josefsson <
william.josef...@gmail.com> wrote:

> Yes thanks Tomas. I thought of the nova boot --user-data myscript.sh
> option, but I'm not sure how that would manage deprovisioning. With nova
> delete myvm, at this point I also wanna run a delete script to
> delete the A-record, or is that possible? thx! Will
>
> On Mon, Aug 15, 2016 at 7:59 PM, Tomáš  Vondra 
> wrote:
> > Hi Will!
> > You may want one of two things:
> > a) OpenStack Designate, which is a DNS as a Service system integrated
> with Nova. It will add a record for every instance in a zone you delegate
> to it.
> > b) Cloud-init. It runs in nearly every OpenStack Linux OS image and is
> configured by /etc/cloud/cloud.cfg. There you will see a module that does a
> curl call at the end of the boot process. You don't have to modify the
> image, it takes configuration as a user-data script.
> > Tomas
> >
> >
> >  Původní zpráva 
> > Odesílatel: William Josefsson 
> > Odesláno: 14. srpna 2016 17:54:45 SELČ
> > Komu: openstack-operators@lists.openstack.org
> > Předmět: [Openstack-operators] cURL call at the end of provisioning and
> deprovisioning
> >
> > Hi list,
> >
> > I wanted to make a cURL call once an instance provisioning finished,
> > and also when an instance gets deprovisioned. I will make a http call
> > to PowerDNS for registration and deregistration of A record.
> >
> > Is there any way to run a command for an instance once it has finished
> > provisioning and when it gets deprovisioned?
> >
> > thx, will
> >
> > ___
> > OpenStack-operators mailing list
> > OpenStack-operators@lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
> >
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] PCI Passthrough issues

2016-07-19 Thread Joe Topjian
Hi Blair,

We only updated qemu. We're running the version of libvirt from the Kilo
cloudarchive.

We've been in production with our K80s for around two weeks now and have
had several users report success.

Thanks,
Joe

On Tue, Jul 19, 2016 at 5:06 PM, Blair Bethwaite 
wrote:

> Hilariously (or not!) we finally hit the same issue last week once
> folks actually started trying to do something (other than build and
> load drivers) with the K80s we're passing through. This
>
> https://devtalk.nvidia.com/default/topic/850833/pci-passthrough-kvm-for-cuda-usage/
> is the best discussion of the issue I've found so far, haven't tracked
> down an actual bug yet though. I wonder whether it has something to do
> with the memory size of the device, as we've been happy for a long
> time with other NVIDIA GPUs (GRID K1, K2, M2070, ...).
>
> Jon, when you grabbed Mitaka Qemu, did you also update libvirt? We're
> just working through this and have tried upgrading both but are
> hitting some issues with Nova and Neutron on the compute nodes,
> thinking it may libvirt related but debug isn't helping much yet.
>
> Cheers,
>
> On 8 July 2016 at 00:54, Jonathan Proulx  wrote:
> > On Thu, Jul 07, 2016 at 11:13:29AM +1000, Blair Bethwaite wrote:
> > :Jon,
> > :
> > :Awesome, thanks for sharing. We've just run into an issue with SRIOV
> > :VF passthrough that sounds like it might be the same problem (device
> > :disappearing after a reboot), but haven't yet investigated deeply -
> > :this will help with somewhere to start!
> >
> > :By the way, the nouveau mention was because we had missed it on some
> > :K80 hypervisors recently and seen passthrough apparently work, but
> > :then the NVIDIA drivers would not build in the guest as they claimed
> > :they could not find a supported device (despite the GPU being visible
> > :on the PCI bus).
> >
> > Definitely sage advice!
> >
> > :I have also heard passing mention of requiring qemu
> > :2.3+ but don't have any specific details of the related issue.
> >
> > I didn't do a bisection but with qemu 2.2 (from ubuntu cloudarchive
> > kilo) I was sad and with 2.5 (from ubuntu cloudarchive mitaka but
> > installed on a kilo hypervisor) I am working.
> >
> > Thanks,
> > -Jon
> >
> >
> > :Cheers,
> > :
> > :On 7 July 2016 at 08:13, Jonathan Proulx  wrote:
> > :> On Wed, Jul 06, 2016 at 12:32:26PM -0400, Jonathan D. Proulx wrote:
> > :> :
> > :> :I do have an odd remaining issue where I can run cuda jobs in the vm
> > :> :but snapshots fail and after pause (for snapshotting) the pci device
> > :> :can't be reattached (which is where i think it deletes the snapshot
> > :> :it took).  Got same issue with 3.16 and 4.4 kernels.
> > :> :
> > :> :Not very well categorized yet, but I'm hoping it's because the VM I
> > :> :was hacking on had it's libvirt.xml written out with the older qemu
> > :> :maybe?  It had been through a couple reboots of the physical system
> > :> :though.
> > :> :
> > :> :Currently building a fresh instance and bashing more keys...
> > :>
> > :> After an ugly bout of bashing I've solve my failing snapshot issue
> > :> which I'll post here in hopes of saving someonelse
> > :>
> > :> Short version:
> > :>
> > :> add "/dev/vfio/vfio rw," to  /etc/apparmor.d/abstractions/libvirt-qemu
> > :> add "ulimit -l unlimited" to /etc/init/libvirt-bin.conf
> > :>
> > :> Longer version:
> > :>
> > :> What was happening.
> > :>
> > :> * send snapshot request
> > :> * instance pauses while snapshot is pending
> > :> * instance attempt to resume
> > :> * fails to reattach pci device
> > :>   * nova-compute.log
> > :> Exception during message handling: internal error: unable to
> execute QEMU command 'device_add': Device initialization failedcompute.log
> > :>
> > :>   * qemu/.log
> > :> vfio: failed to open /dev/vfio/vfio: Permission denied
> > :> vfio: failed to setup container for group 48
> > :> vfio: failed to get group 48
> > :> * snapshot disappears
> > :> * instance resumes but without passed through device (hard reboot
> > :> reattaches)
> > :>
> > :> seeing permsission denied I though would be an easy fix but:
> > :>
> > :> # ls -l /dev/vfio/vfio
> > :> crw-rw-rw- 1 root root 10, 196 Jul  6 14:05 /dev/vfio/vfio
> > :>
> > :> so I'm guessing I'm in apparmor hell, I try adding "/dev/vfio/vfio
> > :> rw," to  /etc/apparmor.d/abstractions/libvirt-qemu rebooting the
> > :> hypervisor and trying again which gets me a different libvirt error
> > :> set:
> > :>
> > :> VFIO_MAP_DMA: -12
> > :> vfio_dma_map(0x5633a5fa69b0, 0x0, 0xa, 0x7f4e7be0) = -12
> (Cannot allocate memory)
> > :>
> > :> kern.log (and thus dmesg) showing:
> > :> vfio_pin_pages: RLIMIT_MEMLOCK (65536) exceeded
> > :>
> > :> Getting rid of this one required inserting 'ulimit -l unlimited' into
> > :> /etc/init/libvirt-bin.conf in the 'script' section:
> > :>
> > :> 
> > :> script
> > :> [ -r /etc/default/libvirt-bin ] && . 

Re: [Openstack-operators] How to create floating ip pool use nova network? thanks

2016-07-07 Thread Joe Topjian
In Kilo (I haven't verified Liberty or Mitaka), you can manage nova-network
floating IP pools with:

nova-manage floating --help
nova-manage floating create --help
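
For example, something along these lines creates a pool (flag spelling has
shifted between releases, so do confirm against the --help output above):

nova-manage floating create --ip_range 192.0.2.0/24 --pool public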

Hope that helps!
Joe


On Wed, Jul 6, 2016 at 8:23 PM, 云淡风轻 <821696...@qq.com> wrote:

> hi everyone,
>
> How to create floating ip pool use nova network? thanks
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] PCI Passthrough issues

2016-07-06 Thread Joe Topjian
Hi Jon,

We were also running into issues with the K80s.

For our GPU nodes, we've gone with a 4.2 or 4.4 kernel. PCI Passthrough
works much better in those releases. (I ran into odd issues with 4.4 and
NFS, downgraded to 4.2 after a few hours of banging my head, problems went
away, not a scientific solution :)

After that, make sure vfio is loaded:

$ lsmod | grep vfio

Then start with the "deviceQuery" CUDA sample. We've found deviceQuery to
be a great check to see if the instance has full/correct access to the
card. If deviceQuery prints a report within 1-2 seconds, all is well. If
there is a lag, something is off.
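
With the CUDA 7.5 toolkit installed in the guest, that check is roughly
(paths will differ for other CUDA versions):

cd /usr/local/cuda-7.5/samples/1_Utilities/deviceQuery
make
./deviceQuery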

In our case for the K80s, that final "something" was qemu. We came across
this[1] wiki page (search for K80) and started digging into qemu. tl;dr:
upgrading to the qemu packages found in the Ubuntu Mitaka cloud archive
solved our issues.

Hope that helps,
Joe

1: https://pve.proxmox.com/wiki/Pci_passthrough


On Wed, Jul 6, 2016 at 7:27 AM, Jonathan D. Proulx 
wrote:

> Hi All,
>
> Trying to pass through some Nvidia K80 GPUs to some instances and have
> gotten to the place where Nova seems to be doing the right thing: gpu
> instances are scheduled on the 1 gpu hypervisor I have, and inside the
> VM I see:
>
> root@gpu-x1:~# lspci | grep -i k80
> 00:06.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
>
> And I can install nvdia-361 driver and get
>
> # ls /dev/nvidia*
> /dev/nvidia0  /dev/nvidiactl  /dev/nvidia-uvm  /dev/nvidia-uvm-tools
>
> Once I load up cuda-7.5 and build the examples, none of them run,
> claiming there's no cuda device.
>
> # ./matrixMul
> [Matrix Multiply Using CUDA] - Starting...
> cudaGetDevice returned error no CUDA-capable device is detected (code 38),
> line(396)
> cudaGetDeviceProperties returned error no CUDA-capable device is detected
> (code 38), line(409)
> MatrixA(160,160), MatrixB(320,160)
> cudaMalloc d_A returned error no CUDA-capable device is detected (code
> 38), line(164)
>
> I'm not familiar with cuda really but I did get some example code
> running on the physical system for burn-in over the weekend (since
> reinstalled, so no nvidia driver on the hypervisor).
>
> Following various online examples  for setting up pass through I set
> the kernel boot line on the hypervisor to:
>
> # cat /proc/cmdline
> BOOT_IMAGE=/boot/vmlinuz-3.13.0-87-generic
> root=UUID=d9bc9159-fedf-475b-b379-f65490c71860 ro console=tty0
> console=ttyS1,115200 intel_iommu=on iommu=pt rd.modules-load=vfio-pci
> nosplash nomodeset intel_iommu=on iommu=pt rd.modules-load=vfio-pci
> nomdmonddf nomdmonisw
>
> Puzzled that I apparently have the device but it is apparently
> nonfunctional, where do I even look from here?
>
> -Jon
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] How are folks providing GPU instance types?

2016-05-11 Thread Joe Topjian
Just wanted to add a few notes (I apologize for the brevity):

* The wiki page is indeed the best source of information to get started.
* I found that I didn't have to use EFI-based images. I wonder why that is?
* PCI devices and IDs can be found by running the following on a compute
node:

$ lspci -nn | grep -i nvidia
84:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK107GL [GRID
K1] [10de:0ff2] (rev a1)
85:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK107GL [GRID
K1] [10de:0ff2] (rev a1)
86:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK107GL [GRID
K1] [10de:0ff2] (rev a1)
87:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK107GL [GRID
K1] [10de:0ff2] (rev a1)

In which 10de becomes the vendor ID and 0ff2 becomes the product ID.

* My nova.conf looks like this:

pci_alias={"vendor_id":"10de", "product_id":"0ff2", "name":"gpu"}
scheduler_driver=nova.scheduler.filter_scheduler.FilterScheduler
scheduler_available_filters=nova.scheduler.filters.all_filters
scheduler_available_filters=nova.scheduler.filters.pci_passthrough_filter.PciPassthroughFilter
scheduler_default_filters=RamFilter,ComputeFilter,AvailabilityZoneFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,PciPassthroughFilter

* My /etc/default/grub on the compute node has the following entries:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on iommu=pt
rd.modules-load=vfio-pci"
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt rd.modules-load=vfio-pci"

* I use the following to create a flavor with access to a single GPU:

nova flavor-create g1.large auto 8192 20 4 --ephemeral 20 --swap 2048
nova flavor-key g1.large set "pci_passthrough:alias"="gpu:1"
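
A quick end-to-end check is then just booting with that flavor (image and
keypair names here are placeholders):

nova boot --flavor g1.large --image <cuda-capable-image> --key-name <keypair> gpu-test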

For NVIDIA cards in particular, it might take a few attempts to install the
correct driver version, CUDA tools version, etc to get things working
correctly. NVIDIA has a bundle of CUDA examples, one of which is
"/usr/local/cuda-7.5/samples/1_Utilities/deviceQuery". Running this will
confirm if the instance can successfully access the GPU.

Hope this helps!
Joe


On Tue, May 10, 2016 at 8:58 AM, Tomas Vondra  wrote:

> Nordquist, Peter L  writes:
>
> > You will also have to enable iommu on your hypervisors to have libvirt
> expose the capability to Nova for PCI
> > passthrough.  I use Centos 7 and had to set 'iommu=pt intel_iommu=on' for
> my kernel parameters.  Along with
> > this, you'll have to start using EFI for your VMs by installing OVMF on
> your Hypervisors and configuring
> > your images appropriately.  I don't have a link handy for this but the
> gist is that Legacy bootloaders have a
> > much more complicated process to initialize the devices being passed to
> the VM where EFI is much easier.
>
> Hi!
> What I found out the hard way under the Xen hypervisor is that the GPU you
> are passing through must not be the primary GPU of the system. Otherwise,
> you get memory corruption as soon as something appears on the console. If
> not sooner :-). Test if your motherboards are capable of running on the
> integrated VGA even if some other graphics card is connected. Or blacklist
> it for the kernel.
> Tomas
>
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] RAID / stripe block storage volumes

2016-03-07 Thread Joe Topjian
On Mon, Mar 7, 2016 at 12:33 AM, Tim Bell <tim.b...@cern.ch> wrote:

> From: joe <j...@topjian.net>
> Date: Monday 7 March 2016 at 07:53
> To: openstack-operators <openstack-operators@lists.openstack.org>
> Subject: Re: [Openstack-operators] RAID / stripe block storage volumes
>
> We ($work) have been researching this topic for the past few weeks and I
> wanted to give an update on what we've found.
>
> First, we've found that both Rackspace and Azure advocate the use of
> RAID'ing block storage volumes from within an instance for both performance
> and resilience [1][2][3]. I only mention this to add to the earlier Amazon
> AWS information and not to imply that more people should share this view.
>
> Second, we discovered virtio-scsi [4]. By adding the following properties
> to an image, the disks will now appear as SCSI disks, including the more
> common /dev/sdx naming:
>
> hw_disk_bus_model=virtio-scsi
> hw_scsi_model=virtio-scsi
> hw_disk_bus=scsi
>
> What's notable is that, in our testing, ZFS pools and Gluster replicas are
> more likely to see the volume disconnect/fail with virtio-scsi. mdadm has
> always been fairly dependable, so there hasn't been a change there. We're
> still testing, but virtio-scsi looks promising.
>
>
> We found virtio SCSI significantly slower (~20%) on bonnie++. I
> had been thinking it would be better.
>
> What were your performance experiences ?
>
> Tim
>

That's one area we're still testing. We're seeing a 15% increase in reads
for 4k - 1m blocks but anywhere from 3-20% decrease in all types of writing
activity. Something seems off... or at least that there should be a reason.


>
> 1:
> https://support.rackspace.com/how-to/configuring-a-software-raid-on-a-linux-general-purpose-cloud-server/
> 2: https://support.rackspace.com/how-to/cloud-block-storage-faq/
> 3:
> https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-configure-raid/
> 4: https://wiki.openstack.org/wiki/LibvirtVirtioScsi
>
> On Mon, Feb 8, 2016 at 7:18 PM, Joe Topjian <j...@topjian.net> wrote:
>
>> Yep. Don't get me wrong -- I agree 100% with everything you've said
>> throughout this thread. Applications that have native replication are
>> awesome. Swift is crazy awesome. :)
>>
>> I understand that some may see the use of mdadm, Cinder-assisted
>> replication, etc as supporting "pet" environments, and I agree to some
>> extent. But I do think there are applicable use-cases where those services
>> could be very helpful.
>>
>> As one example, I know of large cloud-based environments which handle
>> very large data sets and are entirely stood up through configuration
>> management systems. However, due to the sheer size of data being handled,
>> rebuilding or resyncing a portion of the environment could take hours.
>> Failing over to a replicated volume is instant.In addition, being able to
>> both stripe and replicate goes a very long way in making the most out of
>> commodity block storage environments (for example, avoiding packing
>> problems and such).
>>
>> Should these types of applications be reading / writing directly to
>> Swift, HDFS, or handling replication themselves? Sure, in a perfect world.
>> Does Gluster fill all gaps I've mentioned? Kind of.
>>
>> I guess I'm just trying to survey the options available for applications
>> and environments that would otherwise be very flexible and resilient if it
>> wasn't for their awkward use of storage. :)
>>
>> On Mon, Feb 8, 2016 at 6:18 PM, Robert Starmer <rob...@kumul.us> wrote:
>>
>>> Besides, wouldn't it be better to actually do application layer backup
>>> restore, or application level distribution for replication?  That
>>> architecture at least lets the application determine and deal with corrupt
>>> data transmission rather than the DRBD like model where you corrupt one
>>> data-set, you corrupt them all...
>>>
>>> Hence my comment about having some form of object storage (SWIFT is
>>> perhaps even a good example of this architeccture, the proxy replicates,
>>> checks MD5, etc. to verify good data, rather than just replicating blocks
>>> of data).
>>>
>>>
>>>
>>> On Mon, Feb 8, 2016 at 7:15 PM, Robert Starmer <rob...@kumul.us> wrote:
>>>
>>>> I have not run into anyone replicating volumes or creating redundancy
>>>> at the VM level (beyond, as you point out, HDFS, etc.).
>>>>
>>>> R
>>>>
>>>> On Mon, Feb 8, 2016 at 6:54 PM, Joe Topjian <j...@topjian.net> wrote

Re: [Openstack-operators] RAID / stripe block storage volumes

2016-03-06 Thread Joe Topjian
We ($work) have been researching this topic for the past few weeks and I
wanted to give an update on what we've found.

First, we've found that both Rackspace and Azure advocate the use of
RAID'ing block storage volumes from within an instance for both performance
and resilience [1][2][3]. I only mention this to add to the earlier Amazon
AWS information and not to imply that more people should share this view.

Second, we discovered virtio-scsi [4]. By adding the following properties
to an image, the disks will now appear as SCSI disks, including the more
common /dev/sdx naming:

hw_disk_bus_model=virtio-scsi
hw_scsi_model=virtio-scsi
hw_disk_bus=scsi
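
For example, setting them on an existing image looks roughly like this
(glance client syntax varies a bit between the v1 and v2 clients, so treat
this as a sketch):

glance image-update \
  --property hw_disk_bus_model=virtio-scsi \
  --property hw_scsi_model=virtio-scsi \
  --property hw_disk_bus=scsi \
  <image-id>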

What's notable is that, in our testing, ZFS pools and Gluster replicas are
more likely to see the volume disconnect/fail with virtio-scsi. mdadm has
always been fairly dependable, so there hasn't been a change there. We're
still testing, but virtio-scsi looks promising.
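
For reference, the kind of guest-side striping being discussed here is just
standard mdadm usage (device names are illustrative and depend on how the
volumes are attached):

mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.ext4 /dev/md0
mount /dev/md0 /mnt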

1:
https://support.rackspace.com/how-to/configuring-a-software-raid-on-a-linux-general-purpose-cloud-server/
2: https://support.rackspace.com/how-to/cloud-block-storage-faq/
3:
https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-configure-raid/
4: https://wiki.openstack.org/wiki/LibvirtVirtioScsi

On Mon, Feb 8, 2016 at 7:18 PM, Joe Topjian <j...@topjian.net> wrote:

> Yep. Don't get me wrong -- I agree 100% with everything you've said
> throughout this thread. Applications that have native replication are
> awesome. Swift is crazy awesome. :)
>
> I understand that some may see the use of mdadm, Cinder-assisted
> replication, etc as supporting "pet" environments, and I agree to some
> extent. But I do think there are applicable use-cases where those services
> could be very helpful.
>
> As one example, I know of large cloud-based environments which handle very
> large data sets and are entirely stood up through configuration management
> systems. However, due to the sheer size of data being handled, rebuilding
> or resyncing a portion of the environment could take hours. Failing over to
> a replicated volume is instant. In addition, being able to both stripe and
> replicate goes a very long way in making the most out of commodity block
> storage environments (for example, avoiding packing problems and such).
>
> Should these types of applications be reading / writing directly to Swift,
> HDFS, or handling replication themselves? Sure, in a perfect world. Does
> Gluster fill all gaps I've mentioned? Kind of.
>
> I guess I'm just trying to survey the options available for applications
> and environments that would otherwise be very flexible and resilient if it
> wasn't for their awkward use of storage. :)
>
> On Mon, Feb 8, 2016 at 6:18 PM, Robert Starmer <rob...@kumul.us> wrote:
>
>> Besides, wouldn't it be better to actually do application layer backup
>> restore, or application level distribution for replication?  That
>> architecture at least lets the application determine and deal with corrupt
>> data transmission rather than the DRBD like model where you corrupt one
>> data-set, you corrupt them all...
>>
>> Hence my comment about having some form of object storage (SWIFT is
>> perhaps even a good example of this architeccture, the proxy replicates,
>> checks MD5, etc. to verify good data, rather than just replicating blocks
>> of data).
>>
>>
>>
>> On Mon, Feb 8, 2016 at 7:15 PM, Robert Starmer <rob...@kumul.us> wrote:
>>
>>> I have not run into anyone replicating volumes or creating redundancy at
>>> the VM level (beyond, as you point out, HDFS, etc.).
>>>
>>> R
>>>
>>> On Mon, Feb 8, 2016 at 6:54 PM, Joe Topjian <j...@topjian.net> wrote:
>>>
>>>> This is a great conversation and I really appreciate everyone's input.
>>>> Though, I agree, we wandered off the original question and that's my fault
>>>> for mentioning various storage backends.
>>>>
>>>> For the sake of conversation, let's just say the user has no knowledge
>>>> of the underlying storage technology. They're presented with a Block
>>>> Storage service and the rest is up to them. What known, working options
>>>> does the user have to build their own block storage resilience? (Ignoring
>>>> "obvious" solutions where the application has native replication, such as
>>>> Galera, elasticsearch, etc)
>>>>
>>>> I have seen references to Cinder supporting replication, but I'm not
>>>> able to find a lot of information about it. The support matrix[1] lists
>>>> very few drivers that actually implement replication -- is this true or is
>>>> there a trove o

Re: [Openstack-operators] [kolla] Question about how Operators deploy

2016-02-12 Thread Joe Topjian
2 VIPs as well.

On Fri, Feb 12, 2016 at 8:27 AM, Matt Fischer  wrote:

> We also use 2 VIPs. public and internal, with admin being a CNAME for
> internal.
>
> On Fri, Feb 12, 2016 at 7:28 AM, Fox, Kevin M  wrote:
>
>> We usually use two vips.
>>
>> Thanks,
>> Kevin
>>
>> --
>> *From:* Steven Dake (stdake)
>> *Sent:* Friday, February 12, 2016 6:04:45 AM
>> *To:* openstack-operators@lists.openstack.org
>> *Subject:* [Openstack-operators] [kolla] Question about how Operators
>> deploy
>>
>> Hi folks,
>>
>> Unfortunately I won't be able to make it to the Operator midcycle because
>> of budget constraints or I would find the answer to this question there.
>> The Kolla upstream is busy sorting out external ssl termination and a
>> question arose in the Kolla community around operator requirements for
>> publicURL vs internalURL VIP management.
>>
>> At present, Kolla creates 3 Haproxy containers across 3 HA nodes with one
>> VIP managed by keepalived.  The VIP is used for internal communication
>> only.  Our PUBLIC_URL is set to a DNS name, and we expect the Operator to
>> sort out how to map that DNS name to the internal VIP used by Kolla.  The
>> way I do this in my home lab is to use NAT to NAT my public_URL from the
>> internet (hosted by dyndns) to my internal VIP that haproxies to my 3 HA
>> control nodes.  This is secure assuming someone doesn't bust through my NAT.
>>
>> An alternative has been suggested which is to use TWO vips.  One for
>> internal_url, one for public_url.  Then the operator would only be
>> responsible for selecting where to to allocate the public_url endpoint's
>> VIP.  I think this allows more flexibility without necessarily requiring
>> NAT while still delivering a secure solution.
>>
>> Not having ever run an OpenStack cloud in production, how do the
>> Operators want it?  Our deciding factor here is what Operators want, not
>> what is necessarily currently in the code base.  We still have time to make
>> this work differently for Mitaka, but I need feedback/advice quickly.
>>
>> The security guide seems to imply two VIPs are the way to Operate: (big
>> diagram):
>> http://docs.openstack.org/security-guide/networking/architecture.html
>>
>> The IRC discussion is here for reference:
>>
>> http://eavesdrop.openstack.org/irclogs/%23kolla/%23kolla.2016-02-12.log.html#t2016-02-12T12:09:08
>>
>> Thanks in Advance!
>> -steve
>>
>>
>> ___
>> OpenStack-operators mailing list
>> OpenStack-operators@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] User Survey - Deadline Feb 24th

2016-02-09 Thread Joe Topjian
Isn't it similar to asking if you use Fedora, CentOS, or RHEL?

My understanding is that Juniper offers a paid/supported version of
Contrail while OpenContrail is the open source version.

On Tue, Feb 9, 2016 at 2:50 AM, Edgar Magana 
wrote:

> Tom,
>
> For the "Which OpenStack Network (Neutron) drivers are you using?” section.
>
> What is the difference between using Juniper and OpenContrail?. There
> should be only OpenContrail or something like OpenContrail/Juniper.
>
> Thanks,
>
> Edgar
>
>
>
>
> On 2/9/16, 1:33 AM, "Tom Fifield"  wrote:
>
> >Hi all,
> >
> >If you run OpenStack, build apps on it, or have customers with OpenStack
> >deployments, please take a few minutes to respond to the latest User
> >Survey or pass it along to your friends.
> >
> >Since 2013, the user survey has provided significant insight into what
> >people are deploying and how they're using OpenStack. You can see the
> >most recent results in the October 2015 report[1].
> >
> >
> >Please follow the link and instructions below to complete the User
> >Survey by ***February 24, 2016 at 23:00 UTC***. If you already completed
> >the survey, there's no need to start over. You can simply log back in to
> >update your Deployment Profile, as well as take the opportunity to
> >provide additional input. You need to do this to keep your past survey
> >responses active, but we hope you'll do it because we've made the survey
> >shorter and with more interesting questions ;)
> >
> >
> >
> >Take the Survey ( http://www.openstack.org/user-survey )
> >
> >
> >All of the information you provide is confidential to the Foundation and
> >User Committee and will be aggregated anonymously unless you clearly
> >indicate we can publish your organization’s profile.
> >
> >Remember you can hear directly from users and see the aggregate survey
> >findings by attending the next OpenStack Summit, April 25-29 in Austin
> >(http://www.openstack.org/summit).
> >
> >Thank you again for your support.
> >
> >
> >-Tom
> >
> >PS, an RT of https://twitter.com/OpenStack/status/695751431542874112
> >helps too :)
> >
> >[1]
> >
> https://www.openstack.org/user-survey/survey-2016-q1/landing?BackURL=%2Fuser-survey%2Fsurvey-2016-q1%2F
> >
> >___
> >OpenStack-operators mailing list
> >OpenStack-operators@lists.openstack.org
> >http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] User Survey - Deadline Feb 24th

2016-02-09 Thread Joe Topjian
I'm not following. There's a paid-for / enterprise version of Contrail from
Juniper:

http://www.juniper.net/us/en/products-services/sdn/contrail/

So I think two separate entries are valid.

On Tue, Feb 9, 2016 at 9:20 AM, Edgar Magana <edgar.mag...@workday.com>
wrote:

> No, that is not the case. There is no Enterprise version of OpenContrail.
> Therefore, this question could split the responses because it is confusing.
>
> Edgar
>
> From: Joe Topjian <j...@topjian.net>
> Date: Tuesday, February 9, 2016 at 8:17 AM
> To: Edgar Magana <edgar.mag...@workday.com>
> Cc: Tom Fifield <t...@openstack.org>, "openst...@lists.openstack.org" <
> openst...@lists.openstack.org>, OpenStack Operators <
> openstack-operators@lists.openstack.org>, "commun...@lists.openstack.org"
> <commun...@lists.openstack.org>
> Subject: Re: [Openstack-operators] User Survey - Deadline Feb 24th
>
> Isn't it similar to asking if you use Fedora, CentOS, or RHEL?
>
> My understanding is that Juniper offers a paid/supported version of
> Contrail while OpenContrail is the open source version.
>
> On Tue, Feb 9, 2016 at 2:50 AM, Edgar Magana <edgar.mag...@workday.com>
> wrote:
>
>> Tom,
>>
>> For the "Which OpenStack Network (Neutron) drivers are you using?”
>> section.
>>
>> What is the difference between using Juniper and OpenContrail?. There
>> should be only OpenContrail or something like OpenContrail/Juniper.
>>
>> Thanks,
>>
>> Edgar
>>
>>
>>
>>
>> On 2/9/16, 1:33 AM, "Tom Fifield" <t...@openstack.org> wrote:
>>
>> >Hi all,
>> >
>> >If you run OpenStack, build apps on it, or have customers with OpenStack
>> >deployments, please take a few minutes to respond to the latest User
>> >Survey or pass it along to your friends.
>> >
>> >Since 2013, the user survey has provided significant insight into what
>> >people are deploying and how they're using OpenStack. You can see the
>> >most recent results in the October 2015 report[1].
>> >
>> >
>> >Please follow the link and instructions below to complete the User
>> >Survey by ***February 24, 2016 at 23:00 UTC***. If you already completed
>> >the survey, there's no need to start over. You can simply log back in to
>> >update your Deployment Profile, as well as take the opportunity to
>> >provide additional input. You need to do this to keep your past survey
>> >responses active, but we hope you'll do it because we've made the survey
>> >shorter and with more interesting questions ;)
>> >
>> >
>> >
>> >Take the Survey ( http://www.openstack.org/user-survey )
>> >
>> >
>> >All of the information you provide is confidential to the Foundation and
>> >User Committee and will be aggregated anonymously unless you clearly
>> >indicate we can publish your organization’s profile.
>> >
>> >Remember you can hear directly from users and see the aggregate survey
>> >findings by attending the next OpenStack Summit, April 25-29 in Austin
>> >(http://www.openstack.org/summit).
>> >
>> >Thank you again for your support.
>> >
>> >
>> >-Tom
>> >
>> >PS, an RT of https://twitter.com/OpenStack/status/695751431542874112
>> >helps too :)
>> >
>> >[1]
>> >
>> https://www.openstack.org/user-survey/survey-2016-q1/landing?BackURL=%2Fuser-survey%2Fsurvey-2016-q1%2F
>> >
>> >___
>> >OpenStack-operators mailing list
>> >OpenStack-operators@lists.openstack.org
>> >http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>> ___
>> OpenStack-operators mailing list
>> OpenStack-operators@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] RAID / stripe block storage volumes

2016-02-08 Thread Joe Topjian
Yep. Don't get me wrong -- I agree 100% with everything you've said
throughout this thread. Applications that have native replication are
awesome. Swift is crazy awesome. :)

I understand that some may see the use of mdadm, Cinder-assisted
replication, etc as supporting "pet" environments, and I agree to some
extent. But I do think there are applicable use-cases where those services
could be very helpful.

As one example, I know of large cloud-based environments which handle very
large data sets and are entirely stood up through configuration management
systems. However, due to the sheer size of data being handled, rebuilding
or resyncing a portion of the environment could take hours. Failing over to
a replicated volume is instant. In addition, being able to both stripe and
replicate goes a very long way in making the most out of commodity block
storage environments (for example, avoiding packing problems and such).

Should these types of applications be reading / writing directly to Swift,
HDFS, or handling replication themselves? Sure, in a perfect world. Does
Gluster fill all gaps I've mentioned? Kind of.

I guess I'm just trying to survey the options available for applications
and environments that would otherwise be very flexible and resilient if it
wasn't for their awkward use of storage. :)

On Mon, Feb 8, 2016 at 6:18 PM, Robert Starmer <rob...@kumul.us> wrote:

> Besides, wouldn't it be better to actually do application layer backup
> restore, or application level distribution for replication?  That
> architecture at least lets the application determine and deal with corrupt
> data transmission rather than the DRBD like model where you corrupt one
> data-set, you corrupt them all...
>
> Hence my comment about having some form of object storage (SWIFT is
> perhaps even a good example of this architeccture, the proxy replicates,
> checks MD5, etc. to verify good data, rather than just replicating blocks
> of data).
>
>
>
> On Mon, Feb 8, 2016 at 7:15 PM, Robert Starmer <rob...@kumul.us> wrote:
>
>> I have not run into anyone replicating volumes or creating redundancy at
>> the VM level (beyond, as you point out, HDFS, etc.).
>>
>> R
>>
>> On Mon, Feb 8, 2016 at 6:54 PM, Joe Topjian <j...@topjian.net> wrote:
>>
>>> This is a great conversation and I really appreciate everyone's input.
>>> Though, I agree, we wandered off the original question and that's my fault
>>> for mentioning various storage backends.
>>>
>>> For the sake of conversation, let's just say the user has no knowledge
>>> of the underlying storage technology. They're presented with a Block
>>> Storage service and the rest is up to them. What known, working options
>>> does the user have to build their own block storage resilience? (Ignoring
>>> "obvious" solutions where the application has native replication, such as
>>> Galera, elasticsearch, etc)
>>>
>>> I have seen references to Cinder supporting replication, but I'm not
>>> able to find a lot of information about it. The support matrix[1] lists
>>> very few drivers that actually implement replication -- is this true or is
>>> there a trove of replication docs that I just haven't been able to find?
>>>
>>> Amazon AWS publishes instructions on how to use mdadm with EBS[2]. One
>>> might interpret that to mean mdadm is a supported solution within EC2 based
>>> instances.
>>>
>>> There are also references to DRBD and EC2, though I could not find
>>> anything as "official" as mdadm and EC2.
>>>
>>> Does anyone have experience (or know users) doing either? (specifically
>>> with libvirt/KVM, but I'd be curious to know in general)
>>>
>>> Or is it more advisable to create multiple instances where data is
>>> replicated instance-to-instance rather than a single instance with multiple
>>> volumes and have data replicated volume-to-volume (by way of a single
>>> instance)? And if so, why? Is a lack of stable volume-to-volume replication
>>> a limitation of certain hypervisors?
>>>
>>> Or has this area just not been explored in depth within OpenStack
>>> environments yet?
>>>
>>> 1: https://wiki.openstack.org/wiki/CinderSupportMatrix
>>> 2: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/raid-config.html
>>>
>>>
>>> On Mon, Feb 8, 2016 at 4:10 PM, Robert Starmer <rob...@kumul.us> wrote:
>>>
>>>> I'm not against Ceph, but even 2 machines (and really 2 machines with
>>>> enough storage to be meaningful, e.g. not the all blade environments I've
>>>> buil

Re: [Openstack-operators] RAID / stripe block storage volumes

2016-02-08 Thread Joe Topjian
Hi Robert,

Can you elaborate on "multiple underlying storage services"?

The reason I asked the initial question is that historically we've made
our block storage service resilient to failure. We also made our compute
environment resilient to failure, but over time we've seen users become
better educated about coping with compute failure. As a result, we've been
able to become more lenient about building resilient compute environments.

We've been discussing how possible it would be to translate that same idea
to block storage. Rather than have a large HA storage cluster (whether
Ceph, Gluster, NetApp, etc), is it possible to offer simple single LVM
volume servers and push the failure handling on to the user?

Of course, this doesn't work for all types of use cases and environments.
We still have projects which require the cloud to own more of the
responsibility for failure than the users do.

But for environments where we offer general purpose / best effort compute
and storage, what methods are available to help the user be resilient to
block storage failures?

Joe

On Mon, Feb 8, 2016 at 12:09 PM, Robert Starmer <rob...@kumul.us> wrote:

> I've always recommended providing multiple underlying storage services to
> provide this rather than adding the overhead to the VM.  So, not in any of
> my systems or any I've worked with.
>
> R
>
>
>
> On Fri, Feb 5, 2016 at 5:56 PM, Joe Topjian <j...@topjian.net> wrote:
>
>> Hello,
>>
>> Does anyone have users RAID'ing or striping multiple block storage
>> volumes from within an instance?
>>
>> If so, what was the experience? Good, bad, possible but with caveats?
>>
>> Thanks,
>> Joe
>>
>> ___
>> OpenStack-operators mailing list
>> OpenStack-operators@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] RAID / stripe block storage volumes

2016-02-08 Thread Joe Topjian
>> consider everything disposable. The one gap I've seen is that there are
>> plenty of folks who don't deploy SWIFT, and without some form of object
>> store, there's still the question of where you place your datasets so that
>> they can be quickly recovered (and how do you keep them up to date if you
>> do have one).  With VMs, there's the concept that you can recover quickly
>> because the "dataset" e.g. your OS, is already there for you, and in plenty
>> of small environments, that's only as true as the glance repository (guess
>> what's usually backing that when there's no SWIFT around...).
>>
>> So I see the issue as a holistic one. How do you show operators/users
>> that they should consider everything disposable if we only look at the
>> current running instance as the "thing"?  Somewhere you still likely need
>> some form of distributed resilience (and yes, I can see using the
>> distributed Canonical, CentOS, Red Hat, Fedora, Debian, etc. mirrors as your
>> distributed image backup, but what about the database content, etc.).
>>
>> Robert
>>
>> On Mon, Feb 8, 2016 at 1:44 PM, Ned Rhudy (BLOOMBERG/ 731 LEX) <
>> erh...@bloomberg.net> wrote:
>>
>>> In our environments, we offer two types of storage. Tenants can either
>>> use Ceph/RBD and trade speed/latency for reliability and protection against
>>> physical disk failures, or they can launch instances that are realized as
>>> LVs on an LVM VG that we create on top of a RAID 0 spanning all but the OS
>>> disk on the hypervisor. This lets the users elect to go all-in on speed and
>>> sacrifice reliability for applications where replication/HA is handled at
>>> the app level, if the data on the instance is sourced from elsewhere, or if
>>> they just don't care much about the data.
>>>
>>> There are some further changes to our approach that we would like to
>>> make down the road, but in general our users seem to like the current
>>> system and being able to forgo reliability or speed as their circumstances
>>> demand.
>>>
>>> From: j...@topjian.net
>>> Subject: Re: [Openstack-operators] RAID / stripe block storage volumes
>>>
>>> Hi Robert,
>>>
>>> Can you elaborate on "multiple underlying storage services"?
>>>
>>> The reason I asked the initial question is that historically we've made
>>> our block storage service resilient to failure. We also made our compute
>>> environment resilient to failure, but over time we've seen users become
>>> better educated about coping with compute failure. As a result, we've been
>>> able to become more lenient about building resilient compute environments.
>>>
>>> We've been discussing how possible it would be to translate that same
>>> idea to block storage. Rather than have a large HA storage cluster (whether
>>> Ceph, Gluster, NetApp, etc), is it possible to offer simple single LVM
>>> volume servers and push the failure handling on to the user?
>>>
>>> Of course, this doesn't work for all types of use cases and
>>> environments. We still have projects which require the cloud to own more
>>> of the responsibility for failure than the users do.
>>>
>>> But for environments where we offer general purpose / best effort compute
>>> and storage, what methods are available to help the user be resilient to
>>> block storage failures?
>>>
>>> Joe
>>>
>>> On Mon, Feb 8, 2016 at 12:09 PM, Robert Starmer <rob...@kumul.us> wrote:
>>>
>>>> I've always recommended providing multiple underlying storage services
>>>> to provide this rather than adding the overhead to the VM.  So, not in any
>>>> of my systems or any I've worked with.
>>>>
>>>> R
>>>>
>>>>
>>>>
>>>> On Fri, Feb 5, 2016 at 5:56 PM, Joe Topjian <j...@topjian.net> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Does anyone have users RAID'ing or striping multiple block storage
>>>>> volumes from within an instance?
>>>>>
>>>>> If so, what was the experience? Good, bad, possible but with caveats?
>>>>>
>>>>> Thanks,
>>>>> Joe
>>>>>
>>>>> ___
>>>>> OpenStack-operators mailing list
>>>>> OpenStack-operators@lists.openstack.org
>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>>>
>>>>>
>>>>
>>> ___
>>> OpenStack-operators mailing 
>>> listOpenStack-operators@lists.openstack.orghttp://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>
>>>
>>>
>>> ___
>>> OpenStack-operators mailing list
>>> OpenStack-operators@lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>
>>>
>>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] RAID / stripe block storage volumes

2016-02-05 Thread Joe Topjian
Hello,

Does anyone have users RAID'ing or striping multiple block storage volumes
from within an instance?

If so, what was the experience? Good, bad, possible but with caveats?

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Storage backend for glance

2016-01-27 Thread Joe Topjian
Yup, it's definitely possible. All Glance nodes will need to share the same
database as well as the same file system. Common ways of sharing the file
system are to mount /var/lib/glance/images either from NFS (like you
mentioned) or Gluster.

I've done both in the past with no issues. The usual caveats with shared
file systems apply: file permissions, ownership, and such. Other than that,
you shouldn't have any problems.
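
As a rough sketch (the server name and export path below are made up), the
mount on each Glance node would look something like:

# /etc/fstab on each Glance API node
nfs01:/export/glance  /var/lib/glance/images  nfs  defaults,_netdev  0  0

Just make sure the mounted directory ends up owned by the glance user on
every node.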

Hope that helps,
Joe

On Wed, Jan 27, 2016 at 3:23 PM, Sławek Kapłoński 
wrote:

> Hello,
>
> I want to install OpenStack with at least two glance nodes (to have HA)
> but with a local filesystem as the glance storage. Is it possible to use
> something like that in a setup with two glance nodes? Maybe some of you
> already have something like that?
> I'm asking because AFAIK, if an image is stored on one glance server and
> nova-compute asks the other glance host to download it, then the image will
> not be available to download and the instance will end up in ERROR state.
> So maybe someone has used it somehow in a similar setup (maybe with some NFS
> or something like that?). What is your experience with it?
>
> --
> Best regards / Pozdrawiam
> Sławek Kapłoński
> sla...@kaplonski.pl
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Galera setup testing

2015-12-11 Thread Joe Topjian
We do something similar: Instead of McRouter, we use the repcached patches
to replicate data between two memcached nodes. We then use HAProxy as a
single entry point for memcached requests.
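
The HAProxy piece is nothing fancy -- roughly something like the following
(addresses are made up), with the second node marked as a backup so writes
normally land on only one side of the replicated pair:

listen memcached
    bind 192.168.1.10:11211
    mode tcp
    server memcached01 192.168.1.11:11211 check
    server memcached02 192.168.1.12:11211 check backup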

We've been doing this for 6+ months and it's been working great. It's
effectively solved the issue I described in this thread last year:

http://lists.openstack.org/pipermail/openstack-operators/2014-August/004881.html

On Fri, Dec 11, 2015 at 6:06 AM, Bajin, Joseph  wrote:

> At this point, we use Keystone and UUID tokens for our setup, but we don’t
> store the UUID tokens in the database.  We use Memcache to do that.
> Actually we use McRouter and Memcache to make sure any node in our control
> plane can validate that token.
>
> —Joe
>
> From: Ajaya Agrawal 
> Date: Friday, December 11, 2015 at 2:25 AM
> To: Matt Fischer 
> Cc: "openstack-operators@lists.openstack.org" <
> openstack-operators@lists.openstack.org>
> Subject: Re: [Openstack-operators] Galera setup testing
>
> Thanks Matt. That surely is helpful. If you could share some numbers or
> problems you faced when you were storing UUID tokens in the database, it
> would be awesome. In my test setup with Keystone Kilo, Fernet token creation
> and validation were way slower than UUID tokens. But UUID tokens come with a
> huge cost to the database, which is the pain point. I have never run Keystone
> with UUID tokens in a prod setup, so I am looking for perspective on Keystone
> with UUID in prod.
>
> Thanks to other people who also chimed in with advice.
>
> Cheers,
> Ajaya
>
> On Mon, Dec 7, 2015 at 8:34 PM, Matt Fischer  wrote:
>
>> On Mon, Dec 7, 2015 at 3:54 AM, Ajaya Agrawal  wrote:
>>
>>> Hi everyone,
>>>
>>> We are deploying Openstack and planning to run multi-master Galera setup
>>> in production. My team is responsible for running a highly available
>>> Keystone. I have two questions when it comes to Galera with Keystone.
>>>
>>> 1. How do you test if a Galera cluster is setup properly?
>>> 2. Is there any Galera test specific to Keystone which you have found
>>> useful?
>>>
>>>
>> For 1, the clustercheck script which ships with puppet-galera and is
>> forked from https://github.com/olafz/percona-clustercheck is a really
>> simple check that galera is up and the cluster is sync'd. Its main goal,
>> however, is to provide status to haproxy.
>>
>> One thing you want to check is the turnaround time on operations, for
>> example, creating a user on one node and then immediately using that user
>> on another node. We found that this can occasionally (but rarely) fail.
>> The solution is two-fold: first, don't store tokens in mysql. Second,
>> designate one node as the primary in haproxy.
>>
>> Other than that we've gotten good at reading the wsrep_ cluster status
>> info, but to be honest, once we removed tokens from the db, we've been in
>> way better shape.
>>
>>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Cinder API with multiple regions not working.

2015-12-11 Thread Joe Topjian
Hi Salman,

That's awesome news. Glad it's working. :)

Joe

On Fri, Dec 11, 2015 at 3:12 PM, Salman Toor <salman.t...@it.uu.se> wrote:

> Hi,
>
> It is working after setting the exact names of the services.
>
>
> *[root@smog: ~]* # openstack service list
>
> +--+--+--+
>
> | ID   | Name | Type |
>
> +--+--+--+
>
> | 1fcae9bd76304853a3168c39c7fe8e6b | nova | compute  |
>
> | 2c7828120c294d3f82e3a17835babb85 | neutron  | network  |
>
> | 3804fcd8f9494d30b589b55fe6abb811 | nova-hpc2n   | compute  |
>
> | 478eff4e96464ae8a958ba29f750b14c | glance   | image|
>
> | 61d8baeb4ee74c7798a60758b2f4171f | cinderv2 | volumev2 |
>
> | 75f89962c7864507b07055fbfc98053e | cinder   | volume   |
>
> | 7bd5d667ec4b4d65b6c1b0de8b303fe3 | cinder   | volume   |
>
> | 97f977f8a7a04bae89da167fd25dc06c | glance-hpc2n | image|
>
> | 9d2a7ef6b36c45b096e552bf73cb89ae | cinderv2 | volumev2 |
>
> | dccd39b92ab547ddaf9047b38620145a | swift| object-store |
>
> | ebb1660d1d9746759a48de921521bfad | keystone | identity |
>
> +--+--+--+
>
> *[root@smog: ~]* # openstack endpoint list
>
>
> +--+---+--+--+
>
> | ID   | Region| Service Name | Service
> Type |
>
>
> +--+---+--+--+
>
> | 3000cb23c6ab4ee5b68876ee08257338 | regionOne | cinderv2 | volumev2
>   |
>
> | e21e935ed861484e976d1b93e0fda0f0 | regionOne | nova | compute
>   |
>
> | cdfb008c47a0472ab3a93e6ee07e9ba4 | regionOne | neutron  | network
>   |
>
> | e2693bcaf3da4be2810f04acd7995d7f | regionOne | cinder   | volume
>   |
>
> | 1400c42cbf154f63b0c3e8d64352d1f2 | HPC2N | cinderv2 | volumev2
>   |
>
> | 04acd1666704433d991cf9a75957c815 | HPC2N | glance-hpc2n | image
>   |
>
> | cfd1c766c56744309767cd84034f9bfb | regionOne | swift|
> object-store |
>
> | abe46f0d86064056927a2177a705787c | HPC2N | nova-hpc2n   | compute
>   |
>
> | 52bb09199ec84ef09642151348eab695 | HPC2N | cinder   | volume
>   |
>
> | ac522bcd576b4d2b9f5adfb5405730be | regionOne | keystone | identity
>   |
>
> | bad524f18fd74ec3b7fb6647ea661686 | regionOne | glance   | image
>   |
>
> +--+---+--+--+
>
>
>
> But different names are working perfectly fine with glance and nova.
>
> Anyway, thanks for all your time and effort. If I manage to reproduce it
> on devstack, I will report back.
>
> Regards..
> Salman
>
>
>
>
> --
> *From:* Joe Topjian [j...@topjian.net]
> *Sent:* Friday, December 11, 2015 7:16 PM
> *To:* Salman Toor
> *Cc:* openstack-operators@lists.openstack.org
> *Subject:* Re: [Openstack-operators] Cinder API with multiple regions not
> working.
>
> Hi Salman,
>
> I'm stumped.
>
> I was able to confirm that Keystone acts a little differently when you
> have multiple regional endpoints connected to the same "service_name" and
> "service_type" than if you have multiple regional endpoints, each with
> their own "service_name" but same "service_type"'s.
>
> For example, you have:
>
> regionOne, cinder, volume
> regionOne, cinderv2, volumev2
> HPC2N, cinderhpc2n, volume
> HPC2N, cinderv2hpc2n, volumev2
>
> When you do "openstack endpoint show volume", you should not get any
> output. But if you do "openstack endpoint show cinderhpc2n", then you will
> see output. Swap the service_type and service_names and the pattern
> continues.
>
> However, I am still unable to reproduce your issue.
>
> My initial tests used the following format:
>
> regionOne, cinder, volume
> regionOne, cinderv2, volumev2
> HPC2N, cinder, volume
> HPC2N, cinderv2, volumev2
>
> With this format, you will run into the UUID ordering issue I described
> previously. But again, I'm not able to reproduce the error you're seeing.
>
> I really hate to give this answer, but at this point, I'd recommend
> setting up a devstack environment and comparing the configuration files
> under /etc/nova and /etc/cinder. This problem is not occurring for me in
> devstack, so there has to be some difference between the two that is
> causing this issue.
>
> Once you have dev

Re: [Openstack-operators] Cinder API with multiple regions not working.

2015-12-11 Thread Joe Topjian
C2N | cinderhpc2n   | volume
> |
> | cfd1c766c56744309767cd84034f9bfb | regionOne | swift |
> object-store |
> | abe46f0d86064056927a2177a705787c | HPC2N | nova-hpc2n| compute
> |
> | ac522bcd576b4d2b9f5adfb5405730be | regionOne | keystone  |
> identity |
> | bad524f18fd74ec3b7fb6647ea661686 | regionOne | glance| image
> |
>
> +--+---+---+--+
> *[root@controller: ~/openstack]* # systemctl restart
> openstack-nova-api.service
>
> Even with all the services didn’t help.
>
> *[root@controller: ~/openstack]* # nova volume-attach
> 800d8ba0-cc17-4877-894a-89adecfb5eb7 85ab8b8a-c75c-45a1-9f51-44ed75ba3210
> /dev/vdb
> ERROR (ClientException): The server has either erred or is incapable of
> performing the requested operation. (HTTP 500) (Request-ID:
> req-f0d9372d-863b-43f6-8281-5c9196a866af)
> *[root@controller: ~/openstack]* # openstack service delete cinderhpc2n
> *[root@controller: ~/openstack]* # openstack service delete cinderhpc2nv2
> *[root@controller: ~/openstack]* #
> *[root@controller: ~/openstack]* #
> *[root@controller: ~/openstack]* #
> *[root@controller: ~/openstack]* #
> *[root@controller: ~/openstack]* # nova volume-attach
> 800d8ba0-cc17-4877-894a-89adecfb5eb7 85ab8b8a-c75c-45a1-9f51-44ed75ba3210
> /dev/vdb
> +--+--+
> | Property | Value|
> +--+--+
> | device   | /dev/vdb |
> | id   | 85ab8b8a-c75c-45a1-9f51-44ed75ba3210 |
> | serverId | 800d8ba0-cc17-4877-894a-89adecfb5eb7 |
> | volumeId | 85ab8b8a-c75c-45a1-9f51-44ed75ba3210 |
>
> +--+--+
>
>
> —— nova-api.log ——
> 2015-12-11 09:43:08.534 16529 TRACE nova.api.openstack EndpointNotFound:
> internalURL endpoint for volume service named cinder in regionOne region
> not found
> ——
>
> ——nova.conf
> [cinder]
> os_region_name = regionOne
> ——
>
> —— cinder.conf
> [default]
> os_region_name = regionOne
> ——
>
> Regards..
> Salman.
>
>
>
>
>
> PhD, Scientific Computing
> Researcher, IT Department,
> Uppsala University.
> Senior Cloud Architect,
> SNIC.
> Cloud Application Expert,
> UPPMAX.
> Visiting Researcher,
> Helsinki Institute of Physics (HIP).
> salman.t...@it.uu.se
> http://www.it.uu.se/katalog/salto690
>
> On 10 Dec 2015, at 23:35, Joe Topjian <j...@topjian.net> wrote:
>
> Hi Salman,
>
> This has turned into a bit of fun -- I'm seeing a lot of wacky things.
>
> First, I'm pretty sure this issue is local to multi-regions and doesn't have to
> do with having both Cinder v1 and v2 in the Keystone catalog. I changed my
> catalog to only have Cinder v2 and I still see multi-region issues. If I
> had more time, I would figure out how to forcefully make nova-api use
> Cinder v1 to confirm it's not a v2 issue, but I'm pretty confident that it
> is not.
>
> Second, strange things happen depending on the UUID of the endpoints.
> Let's say I create two cinder v2 regions:
>
> openstack endpoint create --region RegionOne volumev2 --publicurl
> http://10.1.0.112:8776/v2/%\(tenant_id\)s --internalurl
> http://10.1.0.112:8776/v2/%\(tenant_id\)s --adminurl
> http://10.1.0.112:8776/v2/%\(tenant_id\)s
>
> +--+-+
> | Field| Value   |
> +--+-+
> | adminurl | http://10.1.0.112:8776/v2/%(tenant_id)s |
> | id   | a46a5f86b0134944b66c25a7802f7b32|
> | internalurl  | http://10.1.0.112:8776/v2/%(tenant_id)s |
> | publicurl| http://10.1.0.112:8776/v2/%(tenant_id)s |
> | region   | RegionOne   |
> | service_id   | 9cb6eeba4ae5484080ef1a5272b03367|
> | service_name | cinderv2|
> | service_type | volumev2|
> +--+-+
>
> openstack endpoint create --region RegionTwo volumev2 --publicurl
> http://10.1.0.113:8776/v2/%\(tenant_id\)s --internalurl
> http://10.1.0.113:8776/v2/%\(tenant_id\)s --adminurl
> http://10.1.0.113:8776/v2/%\(tenant_id\)s
>
> +--+-+
> | Field| Value   |
> +--+-+
> | adminurl | http://10.1.0.113:8776/v2/%(tenant_id)s |
> | id   | e9cd7a3fd8734b12a77154d73990261d|
> | internalurl  | http://10.1.0.113:8776/v2/%(tenan

Re: [Openstack-operators] Cinder API with multiple regions not working.

2015-12-09 Thread Joe Topjian
Hi Salman,

Someone mentioned this same issue yesterday in relation to Terraform (maybe
a colleague of yours?), so given the two occurrences, I thought I'd look
into this.

I have a Liberty environment readily available, so I created a second set
of volume and volumev2 endpoints for a fictional region. Everything worked
as expected, so I started reviewing the config files and saw that
/etc/cinder/cinder.conf had an option

[DEFAULT]
os_region_name = RegionOne

I commented that out, but things still worked.

Then in /etc/nova/nova.conf, I saw:

[cinder]
os_region_name = RegionOne

commenting this out caused volume attachments to hang indefinitely because
nova was trying to contact cinder at RegionTwo (I'm assuming this is the
first catalog entry that was returned).

Given this is a Liberty environment, it's not accurately reproducing your
problem, but could you check and see if you have that option set in
nova.conf?

I have a Kilo environment in the process of building. Once it has finished,
I'll see if I can reproduce your error there.

Thanks,
Joe

On Wed, Dec 9, 2015 at 4:35 AM, Salman Toor  wrote:

> Hi,
>
> I am using Kilo release on CentOS. We have recently enabled multiple
> regions and it seems that Cinder has some problems with multiple
> endpoints.
>
> Things are working fine with nova but cinder is behaving strangely. Here are
> my endpoints:
>
>
> 
> *[root@controller: ~]* # openstack service list
> +--++--+
> | ID   | Name   | Type |
> +--++--+
> | 0a33e6f259794ff2a99e626be37c0c2b | cinderv2-hpc2n | volumev2 |
> | 1fcae9bd76304853a3168c39c7fe8e6b | nova   | compute  |
> | 2c7828120c294d3f82e3a17835babb85 | neutron| network  |
> | 3804fcd8f9494d30b589b55fe6abb811 | nova-hpc2n | compute  |
> | 478eff4e96464ae8a958ba29f750b14c | glance | image|
> | 4a5a771d915e43c28e66538b8bc6e625 | cinder | volume   |
> | 72d1be82b2e5478dbf0f3fb9e7ba969d | cinderv2   | volumev2 |
> | 97f977f8a7a04bae89da167fd25dc06c | glance-hpc2n   | image|
> | a985795b49e2440db82970b81248c86e | cinder-hpc2n   | volume   |
> | dccd39b92ab547ddaf9047b38620145a | swift  | object-store |
> | ebb1660d1d9746759a48de921521bfad | keystone   | identity |
> +--++--+
>
> *[root@controller: ~]* # openstack endpoint
> show a985795b49e2440db82970b81248c86e
> +--+--+
> | Field| Value|
> +--+--+
> | adminurl | http://:8776/v1/%(tenant_id)s |
> | enabled  | True |
> | id   | d4003e91ddf24cfb9fa497da81b01a18 |
> | internalurl  | http://:8776/v1/%(tenant_id)s |
> | publicurl| http://:8776/v1/%(tenant_id)s |
> | region   | HPC2N|
> | service_id   | a985795b49e2440db82970b81248c86e |
> | service_name | cinder-hpc2n |
> | service_type | volume   |
> +--+--+
>
> *[root@controller: ~]* # openstack endpoint
> show 4a5a771d915e43c28e66538b8bc6e625
> +--++
> | Field| Value  |
> +--++
> | adminurl | http://:8776/v1/%(tenant_id)s|
> | enabled  | True   |
> | id   | 5f19c0b535674dbd9e318c7b6d61b3bc   |
> | internalurl  | http://:8776/v1/%(tenant_id)s|
> | publicurl| http://:8776/v1/%(tenant_id)s |
> | region   | regionOne  |
> | service_id   | 4a5a771d915e43c28e66538b8bc6e625   |
> | service_name | cinder |
> | service_type | volume |
> +--++
>
> And same for v2 endpoints
>
> 
>
> ——— nova-api.log ———
>
> achmentController object at 0x598a3d0>>, body: {"volumeAttachment":
> {"device": "", "volumeId": "93d96eab-e3fd-4131-9549-ed51e7299da2"}}
> _process_stack
> /usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py:780
> 2015-12-08 12:52:05.847 3376 INFO
> nova.api.openstack.compute.contrib.volumes
> [req-d6e1380d-c6bc-4911-b2c6-251bc8b4c62c a62c20fdf99c443a924f0d50a51514b1
> 3c9d997982e04c6db0e02b82fa18fdd8 - - -] Attach volume
> 93d96eab-e3fd-4131-9549-ed51e7299da2 to instance
> 3a4c8722-52a7-48f2-beb7-db8938698a0d 

Re: [Openstack-operators] Hypervisor Tuning Guide

2015-12-08 Thread Joe Topjian
Update on the Hypervisor Tuning Guide!

The plan mentioned earlier is still in effect and is in the midst of Step
2. All etherpad notes have been migrated to the OpenStack wiki and I've
recently finished cleaning them up. You can see the current work here[1].

For those who may be wondering "why yet another guide?", I guarantee that 9
out of 10 people who read the guide in its current state will learn
something new. Imagine how much you could learn if it were complete.

If you're interested in contributing, please see the How To Contribute
section[2]. In short: just add what you know to the wiki.

Here's a list of items that would be great to have:

* Information about Hypervisors other than libvirt/KVM.
* Information about operating systems other than Linux.
* Real options, settings, and values that you have found to be successful
in production (an illustrative example follows this list).
* And ongoing: Continue to expand and elaborate on existing areas.
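
To give a sense of the level of detail we're after, an entry might end up
looking something like this (the values below are purely illustrative, not
recommendations):

# example: writeback tuning on a busy KVM hypervisor
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10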

As mentioned before, there is no definitive timeline for this guide.
There's no plan to have formal meetings or anything like that at the
moment, either. Just an occasional poke to add what you know. However, if
you'd like to see this guide fall under a more formal schedule and would
like to lead that effort, please get in contact with me.

Thanks,
Joe

1: https://wiki.openstack.org/wiki/Documentation/HypervisorTuningGuide
2:
https://wiki.openstack.org/wiki/Documentation/HypervisorTuningGuide#How_to_Contribute


On Tue, Oct 27, 2015 at 9:02 PM, Joe Topjian <j...@topjian.net> wrote:

> We had a great Hypervisor Tuning Guide session yesterday!
>
> We agreed on an initial structure to the guide that will include four core
> sections (CPU, Memory, Network, and Disk) and common subsections to each.
> The etherpad[1] has this structure defined and during the session, we went
> through and added some brief notes about what should be included.
>
> Another agreement was that this guide should be detailed. It should have
> specific actions such as "change the following sysctl setting to nnn"
> rather than being more broad and generic such as "make sure you aren't
> swapping". One disadvantage of this is the guide might become out of date
> sooner than if it was more broad. We felt this was an acceptable tradeoff.
>
> Our current plan is the following:
>
> 1. We're going to leave the etherpad active for the next two weeks to
> allow people to continue adding notes at their leisure. I'll send a
> reminder about this a few days before the deadline.
>
> 2. We'll then transfer the etherpad notes to the OpenStack wiki and begin
> creating a rough draft of the guide. Brief notes will be elaborated on and
> supporting documentation will be added. Areas that have no information will
> be highlighted for help. Everyone is encouraged to edit the wiki during
> this time.
>
> 3. Once a decent rough draft has been created, we'll look into creating a
> formal OpenStack document.
>
> We're all very busy, so there are no definitive timelines for completing
> steps 2 and 3. At a minimum, we'll continue to touch base with this during
> the Summits and mid-cycles. If there's enough interest, we could try to
> schedule a large block of time to do a doc sprint during one of these
> events.
>
> Thanks,
> Joe
>
> 1: https://etherpad.openstack.org/p/TYO-ops-hypervisor-tuning-guide
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] Horizon Kilo bug with nova-network and floating IPs

2015-11-27 Thread Joe Topjian
Hi all,

I recently came across this bug and thought I'd share it for anyone else
running a similar environment:

https://bugs.launchpad.net/horizon/+bug/1520071

Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] instances floating IPs not reachable while restarting nova-network

2015-11-26 Thread Joe Topjian
Yup, this is expected. It happens for both single-host and multi-host. With
the former, we have an older environment where it takes around 10 minutes
for all network access to resume. That's with a few hundred tenants, a few
hundred vlans, and a few hundred floating IPs all on one host, though.
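
If you want to watch the recovery happen, one crude way (assuming the usual
nova-network-* chain naming) is to count the NAT rules as they come back:

watch -n1 'iptables -t nat -S | grep -c nova-network'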

From your list of reasons for restarting, we only need to restart
nova-network for config changes. If you're running into odd issues that you
think nova-network might be causing, definitely feel free to describe some
symptoms :)

Joe
On Nov 26, 2015 8:49 AM, "Gustavo Randich" 
wrote:

> Hi everybody, (still using nova-network in production... :)
>
> Using nova-network (icehouse), multi-host, FlatDHCPManager
>
> Is it expected to experience an interruption of several seconds in
> instances' floating IP reachability when nova-network is restarted and
> repopulates iptables' NAT output/prerouting/float-snat tables?  (IP packets
> are not delivered to VMs until the iptables forwarding rules are set up)
>
> We don't restart nova-network often, but we have certain cases when we
> need(ed) to:
>
>   * nova-network not reconnecting to RabbitMQ (latest oslo messaging patch
> mitigates this)
>   * configuration changes in nova.conf from time to time
>   * sanitary periodic (weekly or monthly) restarts to prevent poorly
> understood problems of the past (resource leaks?); will stop doing this due
> to NAT downtime
>
> Thanks!
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Router associated with multiple l3 agents

2015-11-24 Thread Joe Topjian
Hi Matt,

> It's also weird that we've only seen this when the environment has been
> built using terraform. This particular customer re-creates the issue every
> time they rebuild.
>
I work on the OpenStack support for Terraform, so I might be able to help
with this. Could you provide an example Terraform configuration that can be
used to recreate this issue?

When you say "rebuild", do you mean this happens when the user performs
"terraform apply" and then a subsequent "terraform apply"?

If you'd like, please feel free to open an issue on the Terraform Github
page and we can try to debug this over there:

https://github.com/hashicorp/terraform/

Neutron isn't my strong suit, but I can definitely see what I can do to
help.

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [Nova] Question about starting nova as service versus directly

2015-11-20 Thread Joe Topjian
> Yes, most likely is related to permissions. Another good source of
> information for troubleshooting is /var/log/upstart/nova-compute.log
>

Ah yes! Much easier.
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [Nova] Question about starting nova as service versus directly

2015-11-19 Thread Joe Topjian
Hi Adam,

I've seen this happen due to permission issues. Even though you run it with
sudo, upstart drops privileges to the "nova" user.

I usually debug this by setting a shell on the nova user, sudoing/su'ing to
nova, then running nova-compute from there. It should die with an error
message indicating the cause.
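
Concretely, something like the following (paths and the shell vary by distro;
this is just a sketch):

sudo usermod -s /bin/bash nova
sudo su - nova
/usr/bin/nova-compute --config-file /etc/nova/nova.conf
# remember to set the shell back (e.g. to /bin/false) afterwards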

Hope that helps,
Joe
On Nov 19, 2015 3:35 PM, "Adam Lawson"  wrote:

> So I can start Nova on Ubuntu/Icehouse via *$ sudo python
> /usr/bin/nova-compute* and it runs fine and stays online but it does not
> run/stay online if I use *$ sudo service nova-compute start/restart*.
>
> I guessed it might have been related to rootwrap but I ran out of time to
> troubleshoot so I reverted the image to a previously-known good state.
>
> Does anyone have an idea why this happens and how to correct it? I checked,
> and the /etc/nova/rootwrap.conf file looked correct, as did /etc/nova/nova.conf
> via the root_helper parameter.
>
> //adam
>
> *Adam Lawson*
>
> AQORN, Inc.
> 427 North Tatnall Street
> Ste. 58461
> Wilmington, Delaware 19801-2230
> Toll-free: (844) 4-AQORN-NOW ext. 101
> International: +1 302-387-4660
> Direct: +1 916-246-2072
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [openstack-operators][osops] tools-contrib is open for business!

2015-11-19 Thread Joe Topjian
Thanks, JJ!

It looks like David Wahlstrom submitted a script and there's a question
about license.

https://review.openstack.org/#/c/247823/

Though contributions to contrib do not have to follow a certain coding
style and can be very lax on error handling, etc., should they at least
mention a license? Thoughts?


On Wed, Nov 18, 2015 at 2:38 PM, JJ Asghar  wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA512
>
>
> Hey everyone,
>
> I just want to announce that tools-contrib[1] is now open for
> submissions. Please take a moment to read the README[2] to get
> yourself familiar with it. I'm hoping to see many scripts and tools
> start to trickle in.
>
> Remember, by committing to this repository, even a simple bash script
> you wrote, you're helping out your future Operators. This is for your
> future you, and our community, so treat em nice ;)!
>
> [1]: https://github.com/openstack/osops-tools-contrib
> [2]:
> https://github.com/openstack/osops-tools-contrib/blob/master/README.rst
>
> - --
> Best Regards,
> JJ Asghar
> c: 512.619.0722 t: @jjasghar irc: j^2
> -BEGIN PGP SIGNATURE-
> Version: GnuPG/MacGPG2 v2
> Comment: GPGTools - https://gpgtools.org
>
> iQIcBAEBCgAGBQJWTO+/AAoJEDZbxzMH0+jTRxQQAK2DJdCTnihR7YJhJAXgbdIn
> NZizqkK4lEhnfdis0XZJekofAib7NytuAtTuWUQOTLQaFv02UAnMqSyX5ofX42PZ
> mGaLtZ452k+EhdeJprO5254fka8VSaRvFOZUJg0K0QjZrj5qFwtG0T1yqVBBCQmI
> wdUkxBB/cL8M0Ve6LaQNS4vmx03ZC81FLEtVX2O62EV8FrP8sxuXc7XDTCRbLnhR
> rb2HJC7R9/AZtr2gjwr7id714QFEEAgCKca79l+vsaE3VRfy+KbHsKqY9vPrxPVn
> qqXLQOm8ZDgXedjxYraCDBbay/FQqVrsEt/0RiAKrtAIRbLm2ZkiR/XL6J3BtNzi
> 2sNt12m/VkrMv9zWUT/8oqiBb73eg3TbUipVeKmh4TD12KK16EYMSF+mH9T7DY2Z
> eP2AT6XEs+BDohP+I3L7WM5r/AKl9r40ulLEqRR7y+jcn5qwAOEb+UzUpna4wTt/
> mZD5UNNemoN5h2P4eMPpfnZnpNcy4Qe/qoohZdAov4Gvdm3tmbG9jIzUKF3Q9Av5
> Uqpe6gUcp3Qd2EaKYGR47B2f+QRLlTs9Sk5lLBJSyOxpA53KcK9125fS0YM6VMVQ
> wETlxAggnmt4diwSoJt8VSYrqXlieo7eHkjv/s4hSGIcYBqtkCPZnNPliJmvMmfh
> s/wsl6ICrB7oe55ghDbM
> =EWDz
> -END PGP SIGNATURE-
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [openstack-operators][osops] tools-contrib is open for business!

2015-11-19 Thread Joe Topjian
Unless there's a reason why we *can't* do something like that (I have no
idea why, but I never assume anything when it comes to things like
licensing :) then I'm in favor of updating the README to state "Your code
will be licensed under Apache 2 unless you mention otherwise."

On Thu, Nov 19, 2015 at 9:37 PM, Erik McCormick <emccorm...@cirrusseven.com>
wrote:

> +1 for the "unless otherwise stated" bit. I seem to recall some
> non-standard requirements from the likes of HP. Apache should be a good
> default though.
>
> -Erik
> On Nov 19, 2015 11:31 PM, "Matt Fischer" <m...@mattfischer.com> wrote:
>
>> Is there a reason why we can't license the entire repo with Apache2 and
>> if you want to contribute you agree to that? Otherwise it might become a
>> bit of a nightmare.  Or maybe at least do "Apache2 unless otherwise stated"?
>>
>> On Thu, Nov 19, 2015 at 9:17 PM, Joe Topjian <j...@topjian.net> wrote:
>>
>>> Thanks, JJ!
>>>
>>> It looks like David Wahlstrom submitted a script and there's a question
>>> about license.
>>>
>>> https://review.openstack.org/#/c/247823/
>>>
>>> Though contributions to contrib do not have to follow a certain coding
>>> style, can be very lax on error handling, etc, should they at least mention
>>> a license? Thoughts?
>>>
>>>
>>> On Wed, Nov 18, 2015 at 2:38 PM, JJ Asghar <j...@chef.io> wrote:
>>>
>>>> -BEGIN PGP SIGNED MESSAGE-
>>>> Hash: SHA512
>>>>
>>>>
>>>> Hey everyone,
>>>>
>>>> I just want to announce that tools-contrib[1] is now open for
>>>> submissions. Please take a moment to read the README[2] to get
>>>> yourself familiar with it. I'm hoping to see many scripts and tools
>>>> start to trickle in.
>>>>
>>>> Remember, by committing to this repository, even a simple bash script
>>>> you wrote, you're helping out your future Operators. This is for your
>>>> future you, and our community, so treat em nice ;)!
>>>>
>>>> [1]: https://github.com/openstack/osops-tools-contrib
>>>> [2]:
>>>> https://github.com/openstack/osops-tools-contrib/blob/master/README.rst
>>>>
>>>> - --
>>>> Best Regards,
>>>> JJ Asghar
>>>> c: 512.619.0722 t: @jjasghar irc: j^2
>>>> -BEGIN PGP SIGNATURE-
>>>> Version: GnuPG/MacGPG2 v2
>>>> Comment: GPGTools - https://gpgtools.org
>>>>
>>>> iQIcBAEBCgAGBQJWTO+/AAoJEDZbxzMH0+jTRxQQAK2DJdCTnihR7YJhJAXgbdIn
>>>> NZizqkK4lEhnfdis0XZJekofAib7NytuAtTuWUQOTLQaFv02UAnMqSyX5ofX42PZ
>>>> mGaLtZ452k+EhdeJprO5254fka8VSaRvFOZUJg0K0QjZrj5qFwtG0T1yqVBBCQmI
>>>> wdUkxBB/cL8M0Ve6LaQNS4vmx03ZC81FLEtVX2O62EV8FrP8sxuXc7XDTCRbLnhR
>>>> rb2HJC7R9/AZtr2gjwr7id714QFEEAgCKca79l+vsaE3VRfy+KbHsKqY9vPrxPVn
>>>> qqXLQOm8ZDgXedjxYraCDBbay/FQqVrsEt/0RiAKrtAIRbLm2ZkiR/XL6J3BtNzi
>>>> 2sNt12m/VkrMv9zWUT/8oqiBb73eg3TbUipVeKmh4TD12KK16EYMSF+mH9T7DY2Z
>>>> eP2AT6XEs+BDohP+I3L7WM5r/AKl9r40ulLEqRR7y+jcn5qwAOEb+UzUpna4wTt/
>>>> mZD5UNNemoN5h2P4eMPpfnZnpNcy4Qe/qoohZdAov4Gvdm3tmbG9jIzUKF3Q9Av5
>>>> Uqpe6gUcp3Qd2EaKYGR47B2f+QRLlTs9Sk5lLBJSyOxpA53KcK9125fS0YM6VMVQ
>>>> wETlxAggnmt4diwSoJt8VSYrqXlieo7eHkjv/s4hSGIcYBqtkCPZnNPliJmvMmfh
>>>> s/wsl6ICrB7oe55ghDbM
>>>> =EWDz
>>>> -END PGP SIGNATURE-
>>>>
>>>> ___
>>>> OpenStack-operators mailing list
>>>> OpenStack-operators@lists.openstack.org
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>>
>>>
>>>
>>> ___
>>> OpenStack-operators mailing list
>>> OpenStack-operators@lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>
>>>
>>
>> ___
>> OpenStack-operators mailing list
>> OpenStack-operators@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] OPs Midcycle location discussion.

2015-11-16 Thread Joe Topjian
+1 Option 1

On Mon, Nov 16, 2015 at 10:01 AM, Jonathan Proulx  wrote:

>
> Let me restate the question a bit as I think I'm hearing two different
> responses that may be getting conflated.
>
> Option 1:  There's a single Ops Midcycle that shifts around and we
> look at ways to increase remote participation. (obviously this doesn't
> preclude other meetups)
>
> Option 2: There are multiple Ops Meetups around midcycle (presumably
> starting with North America, Asia, and Europe) and we look at ways of
> coordinationg those re reduce duplication of effort any synthesis of
> results.
>
> I was advocating option 1 mostly because I think synthesis of option 2
> is harder than stepping up preparation of etherpads before sessions
> and review of them afterward is which is motly the level of remote
> participation I'd envision in the first case (possibly also running
> some email threads on any reccommendations that come out and seem
> controvertial for any reason)
>
> So far though seems the tide is runiing toward option 2, multiple
> meet-ups. Though wee're still at a very small sample size.
>
> -Jon
>
>
> On Mon, Nov 16, 2015 at 10:50:52AM -0500, Jonathan Proulx wrote:
> :Hi All,
> :
> :1st User Committee IRC meeting will be today at 19:00UTC on
> :#openstack-meeting, we haven't exactly settled on an agenda yet but I
> :hope to raise this issue the...
> :
> :It has been suggested that we make the February 15-16 European Ops
> :Meetup in Manchester UK [1] the 'official' OPs Midcycle.  Previously
> :all mid cycles have been US based.
> :
> :Personally I like the idea of broadening or geographic reach rather
> :than staying concentrated in North America. I particularly like it
> :being 'opposite' the summit location.
> :
> :This would likely trade off some depth of participation as fewer
> :of the same people would be able to travel to all midcycles in person.
> :
> :Discuss...(also come by  #openstack-meeting at 19:00 UTC if you think
> :this needs real time discussion)
> :
> :-Jon
> :
> :
> :--
> :
> :1.
> http://www.eventbrite.com/e/european-openstack-operators-meetup-tickets-19405855436?aff=es2
>
> --
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] OpenStack Tuning Guide

2015-11-04 Thread Joe Topjian
Hi Kevin,

Oops, noticed I didn't reply to all the first time.

I think it's great to see more people who want to collect and distil
knowledge like this. :)

Finally, I hate diverging resources, so if something like this already
> exists please speak up so we can focus our efforts on making sure that's up
> to date and well publicized.
>

These may be complementary / sibling, but thought I'd mention them:

* Hypervisor Tuning Guide[1]
* Performance Team[2]

1:
http://lists.openstack.org/pipermail/openstack-operators/2015-October/008557.html
2:
http://lists.openstack.org/pipermail/openstack-dev/2015-October/078028.html

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Informal Ops Meetup?

2015-10-29 Thread Joe Topjian
We're currently in the Prince room by the projector.

On Fri, Oct 30, 2015 at 10:07 AM, Edgar Magana 
wrote:

> Where are you meeting?
>
> Edgar
>
> From: "Kris G. Lindgren" 
> Date: Thursday, October 29, 2015 at 6:37 AM
> To: Sam Morrison , "
> openstack-operators@lists.openstack.org" <
> openstack-operators@lists.openstack.org>
>
> Subject: Re: [Openstack-operators] Informal Ops Meetup?
>
> We seem to have enough interest… so meeting time will be at 10am in the
> Prince room (if we get an actual room I will send an update).
>
> Does anyone have any ideas about what they want to talk about?  I am
> pretty much open to anything.  I started:
> https://etherpad.openstack.org/p/TYO-informal-ops-meetup  for tracking of
> some ideas/time/meeting place info.
>
> ___
> Kris Lindgren
> Senior Linux Systems Engineer
> GoDaddy
>
> From: Sam Morrison 
> Date: Thursday, October 29, 2015 at 6:14 PM
> To: "openstack-operators@lists.openstack.org" <
> openstack-operators@lists.openstack.org>
> Subject: Re: [Openstack-operators] Informal Ops Meetup?
>
> I’ll be there, talked to Tom too and he said there may be a room we can
> use else there is plenty of space around the dev lounge to use.
>
> See you tomorrow.
>
> Sam
>
>
> On 29 Oct 2015, at 6:02 PM, Xav Paice  wrote:
>
> Suits me :)
>
> On 29 October 2015 at 16:39, Kris G. Lindgren 
> wrote:
>
>> Hello all,
>>
>> I am not sure if you guys have looked at the schedule for Friday… but its
>> all working groups.  I was talking with a few other operators and the idea
>> came up around doing an informal ops meetup tomorrow.  So I wanted to float
>> this idea by the mailing list and see if anyone was interested in trying to
>> do an informal ops meet up tomorrow.
>>
>> ___
>> Kris Lindgren
>> Senior Linux Systems Engineer
>> GoDaddy
>>
>> ___
>> OpenStack-operators mailing list
>> OpenStack-operators@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Informal Ops Meetup?

2015-10-29 Thread Joe Topjian
Hi Kris,

I'll be around and am interested.

On Thu, Oct 29, 2015 at 4:39 PM, Kris G. Lindgren 
wrote:

> Hello all,
>
> I am not sure if you guys have looked at the schedule for Friday… but its
> all working groups.  I was talking with a few other operators and the idea
> came up around doing an informal ops meetup tomorrow.  So I wanted to float
> this idea by the mailing list and see if anyone was interested in trying to
> do an informal ops meet up tomorrow.
>
> ___
> Kris Lindgren
> Senior Linux Systems Engineer
> GoDaddy
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] Hypervisor Tuning Guide

2015-10-27 Thread Joe Topjian
We had a great Hypervisor Tuning Guide session yesterday!

We agreed on an initial structure to the guide that will include four core
sections (CPU, Memory, Network, and Disk) and common subsections to each.
The etherpad[1] has this structure defined and during the session, we went
through and added some brief notes about what should be included.

Another agreement was that this guide should be detailed. It should have
specific actions such as "change the following sysctl setting to nnn"
rather than being more broad and generic such as "make sure you aren't
swapping". One disadvantage of this is the guide might become out of date
sooner than if it was more broad. We felt this was an acceptable tradeoff.

Our current plan is the following:

1. We're going to leave the etherpad active for the next two weeks to allow
people to continue adding notes at their leisure. I'll send a reminder
about this a few days before the deadline.

2. We'll then transfer the etherpad notes to the OpenStack wiki and begin
creating a rough draft of the guide. Brief notes will be elaborated on and
supporting documentation will be added. Areas that have no information will
be highlighted for help. Everyone is encouraged to edit the wiki during
this time.

3. Once a decent rough draft has been created, we'll look into creating a
formal OpenStack document.

We're all very busy, so there are no definitive timelines for completing
steps 2 and 3. At a minimum, we'll continue to touch base with this during
the Summits and mid-cycles. If there's enough interest, we could try to
schedule a large block of time to do a doc sprint during one of these
events.

Thanks,
Joe

1: https://etherpad.openstack.org/p/TYO-ops-hypervisor-tuning-guide
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [openstack-operators][osops] Something other than NOOP in our jenkins tests

2015-09-29 Thread Joe Topjian
+1

I like that idea. I think it also ties in nicely with both the Monitoring
and Tools WGs.

Some projects have a directory called "contrib" that contains contributed
items which might not be up to standard. Would that be a simple solution
for the "dumping ground"?



On Tue, Sep 29, 2015 at 2:29 PM, Kris G. Lindgren <klindg...@godaddy.com>
wrote:

> If we are going to be stringent on formatting – I would also like to see
> us be relatively consistent on arguments/env variables that are needed to
> make a script run.  Some pull in ENV vars, some source a rc file, some just
> say already source your rc file to start with, others accept command
> options.  It would be nice if we had a set of curated scripts that all
> worked in a similar fashion.
>
> Also, to Joe's point. It would be nice if we had two place for scripts.  A
> "dumping ground" that people could share what they had.  And a curated one,
> where everything within the curated repo follows a standard set of
> conventions/guidelines.
>
> _______
> Kris Lindgren
> Senior Linux Systems Engineer
> GoDaddy
>
> From: Joe Topjian
> Date: Tuesday, September 29, 2015 at 1:43 PM
> To: JJ Asghar
> Cc: "openstack-operators@lists.openstack.org"
> Subject: Re: [Openstack-operators] [openstack-operators][osops] Something
> other than NOOP in our jenkins tests
>
> So this will require bash scripts to adhere to bashate before being
> accepted? Is it possible to have the check as non-voting? Does this open
> the door to having other file types be checked?
>
> IMHO, it's more important for the OSOps project to foster collaboration
> and contributions rather than worry about an accepted style.
>
> As an example, yesterday's commits used hard-tabs:
>
> https://review.openstack.org/#/c/228545/
> https://review.openstack.org/#/c/228534/
>
> I think we're going to see a lot of variation of styles coming in.
>
> I don't want to come off as sounding ignorant or disrespectful to other
> projects that have guidelines in place -- I fully understand and respect
> those decisions.
>
> Joe
>
> On Tue, Sep 29, 2015 at 12:52 PM, JJ Asghar <j...@chef.io> wrote:
>
>> Awesome! That works!
>>
>> Best Regards,
>> JJ Asghar
>> c: 512.619.0722 t: @jjasghar irc: j^2
>>
>> On 9/29/15 1:27 PM, Christian Berendt wrote:
>> > On 09/29/2015 07:45 PM, JJ Asghar wrote:
>> >> So this popped up today[1]. This seems like something that should be
>> >> leveraged in our gates/validations?
>> >
>> > I prepared review requests to enable checks on the gates for
>> >
>> > * osops-tools-monitoring: https://review.openstack.org/#/c/229094/
>> > * osops-tools-generic: https://review.openstack.org/#/c/229043/
>> >
>> > Christian.
>> >
>>
>>
>> ___
>> OpenStack-operators mailing list
>> OpenStack-operators@lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Tokyo Summit Ops Design Summit Tracks - Draft Agenda

2015-09-22 Thread Joe Topjian
Hi Tom,

As luck would have it, the sole session I'm doing in the Main Conference
clashes with the current time slot for Hypervisor Tuning. I see that Tim is
also listed as a moderator for that session, so I'm more than happy to let
him run it. However, if it's not asking much, I'd really like to be able to
attend that one, moderator or not. :)

Thanks,
Joe

On Tue, Sep 22, 2015 at 7:22 PM, Tom Fifield  wrote:

> Sorry about that - judicious column widening and wrapping applied!
>
> On 23/09/15 00:32, Matt Fischer wrote:
>
>> Tom,
>>
>>
>> Can you make the columns a bit wider? I don't seem to have permissions
>> to do so and I cant read everything. I've resorted to copying and
>> pasting stuff into another window so I can read it.
>>
>> On Mon, Sep 21, 2015 at 11:04 PM, Tom Fifield > > wrote:
>>
>> Hi all,
>>
>> I've started wrangling things toward a draft agenda.
>>
>> You can watch it live on Google Sheets at:
>>
>>
>> https://docs.google.com/spreadsheets/d/1EUSYMs3GfglnD8yfFaAXWhLe0F5y9hCUKqCYe0Vp1oA/edit#gid=1480678842
>>
>> Comments and feedback welcome!
>>
>> Regards,
>>
>>
>> Tom
>>
>> On 16/09/15 11:45, Tom Fifield wrote:
>>
>> Last chance to provide your ideas for our design summit track.
>>
>> So far we are lacking:
>> * Lightning talks
>> * Working Groups
>> * General Sessions
>>
>> Starting next week we're going to prepare the draft agenda for
>> circulation and discussion. So, get in now. What would you like to
>> discuss with fellow ops and developers?
>>
>>
>> https://etherpad.openstack.org/p/TYO-ops-meetup
>>
>>
>> Regards,
>>
>>
>> Tom
>>
>> On 08/09/15 17:10, Tom Fifield wrote:
>>
>> Ping!
>>
>> This is your chance to provide input on our design summit
>> track for
>> Tokyo. Add your ideas on the etherpad below!
>>
>>
>> https://etherpad.openstack.org/p/TYO-ops-meetup
>>
>>
>> On 03/09/15 03:27, Tom Fifield wrote:
>>
>> Hi all,
>>
>> Thanks for those who made it to the recent meetup in
>> Palo Alto. It was a
>> fantastic couple of days, and many are excited to get
>> started on talking
>> about our ops track in the Tokyo design summit.
>>
>>
>> Recall that this is in addition to the operations and
>> other conference
>> track's presentations. It's aimed at giving us a
>> design-summit-style
>> place to congregate, swap best practices, ideas and give
>> feedback.
>>
>>
>> As usual, we're working to act on the feedback from all
>> past events to
>> make this one better than ever. One that we continue to
>> work on is the
>> need to see action happen as a result of this event, so
>> please - when
>> you are suggesting sessions in the below etherpad please
>> try and phrase
>> them in a way that will probably result in things
>> happening afterward.
>>
>>
>>
>> **
>>
>> Please propose session ideas on:
>>
>> https://etherpad.openstack.org/p/TYO-ops-meetup
>>
>> ensuring each session suggestion will have a result.
>>
>>
>> **
>>
>>
>> The room allocations are still being worked out, but the
>> current
>> thinking is that we will interleave general sessions and
>> working groups
>> across Tuesday and Wednesday, to allow for attendance
>> from ops in the
>> cross-project sessions.
>>
>>
>> More as it comes, and as always, further information
>> about ops meetups
>> and notes from the past can be found on the wiki @:
>>
>> https://wiki.openstack.org/wiki/Operations/Meetups
>>
>> Finally, don't forget to register ASAP!
>>
>> http://www.eventbrite.com/e/openstack-summit-october-2015-tokyo-tickets-17356780598
>>
>>
>>
>>
>> Regards,
>>
>>
>> Tom
>>
>> ___
>> OpenStack-operators mailing list
>> OpenStack-operators@lists.openstack.org
>> 
>>
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>
>>
>>
>> ___
>> OpenStack-operators mailing list
>> 

Re: [Openstack-operators] [openstack-operators] Announcement! We have everything ready to get to Stackforge!

2015-09-02 Thread Joe Topjian
Hi JJ,

Thank you for putting all of this together!

All, one decision that was made during the PAO Ops Meetup was to make the OSOps
github repos "official" repos / projects. If
you've contributed to the existing repos, you may have an interest in this.

As well, if you preferred not to contribute to those repos in the past
because they weren't under the OpenStack/StackForge namespace, this may be
of interest to you, too. :)

Thanks,
Joe

On Wed, Sep 2, 2015 at 1:21 PM, JJ Asghar  wrote:

>
>
>
> To follow up with the previous emails; it seems my formatting got
> screwed up.
>
> I've put the review[1] and it looks like it's pretty good.
>
> I've created a launchpad[2] with both answers[3] and bugs[4] to help
> organize this code base.
>
> I've also created an initial wiki[5] page too to start help
> bootstrapping this.
>
> I've mentioned this in our IRC channel also.
>
> If you have any questions or comments please don't hesitate to reach out.
>
> [1]: https://review.openstack.org/#/c/219760/
> [2]: http://launchpad.net/osops
> [3]: https://answers.launchpad.net/osops
> [4]: https://bugs.launchpad.net/osops
> [5]: https://wiki.openstack.org/wiki/Osops
>
> Oh! If yall can +1 the review to say you're in favor for this that would
> be amazing. The more support we get the more momentum we can get!
>
> Best Regards,
> JJ Asghar
> c: 512.619.0722 t: @jjasghar irc: j^2
>
>
> ___
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Draft Agenda for PAO Ops Meetup (August 18, 19)

2015-08-13 Thread Joe Topjian
Hi Tom,

On Thu, Aug 13, 2015 at 2:08 AM, Tom Fifield t...@openstack.org wrote:

 Hi all,

 We're still lacking moderators for the following sessions - any takers?

 1. Hypervisor Tuning  - General Session (have a backup, but would like a
 primary)


I'm happy to be the primary on this -- not a problem at all.


 2. Burning Issues WG (already have one for the general session)
 3. Upgrades WG
 4. Packaging WG


 Regards,

 Tom


 On 10/08/15 22:35, Tom Fifield wrote:

 Thanks to Geoff and all,

 Here's the schedule diff:

 Large Deployments Team -- Large Deployments Team (inc Public Cloud)

 HPC WG -- ??? (any takers?)


 Current status is: working with those who volunteered to moderate and
 getting someone to take care of every slot.

 Also, don't forget the carpool etherpad if you need a ride or have a car:


 https://etherpad.openstack.org/p/PAO-ops-meetup-carpool



 Other etherpads now up at:

 https://etherpad.openstack.org/p/PAO-ops-burning-issues
 https://etherpad.openstack.org/p/PAO-ops-hypervisor-tuning
 https://etherpad.openstack.org/p/PAO-ops-large-deployments
 https://etherpad.openstack.org/p/PAO-ops-logging
 https://etherpad.openstack.org/p/PAO-ops-upgrades
 https://etherpad.openstack.org/p/PAO-ops-ops-guide-fixing
 https://etherpad.openstack.org/p/PAO-ops-containers-for-deployment
 https://etherpad.openstack.org/p/PAO-ops-lightning-talks

 https://etherpad.openstack.org/p/PAO-ops-cmdb
 https://etherpad.openstack.org/p/PAO-ops-deployment-tips
 https://etherpad.openstack.org/p/PAO-ops-network-model
 https://etherpad.openstack.org/p/PAO-ops-user-committtee
 https://etherpad.openstack.org/p/PAO-ops-tools-mon
 https://etherpad.openstack.org/p/PAO-ops-product-wg
 https://etherpad.openstack.org/p/PAO-ops-packaging
 https://etherpad.openstack.org/p/PAO-ops-tags
 https://etherpad.openstack.org/p/PAO-ops-feedback


 Tuesday (rooms: Med II, Med III, Salon A, Salon B, Bacchus)

 9:00 - 10:00    Registration
 10:00 - 10:30   Introduction
 10:30 - 11:15   Burning Issues
 11:15 - 11:55   Hypervisor Tuning
 11:55 - 12:05   Breakout Explain
 12:05 - 13:30   Lunch
 13:30 - 15:00   Breakouts: Large Deployments Team (inc Public Cloud) /
                 Burning Issues / Logging WG / Upgrades WG / Ops Guide Fixing
 15:00 - 15:30   Coffee
 15:30 - 16:00   Breakout Reports
 16:00 - 17:00   Using Containers for Deployment
 17:00 - 18:00   Lightning Talks

 Wednesday (rooms: Med II, Med III, Salon A, Salon B, Bacchus)

 9:00 - 09:45    CMDB: use cases
 9:45 - 10:30    Deployment Tips - read only slaves? admin-only API servers?
 10:30 - 11:15   What network model are you using? Are you happy?
 11:15 - 11:30   Coffee
 11:30 - 12:15   User Committee Discussion
 12:15 - 12:20   Breakout Explain
 12:20 - 13:30   Lunch
 13:30 - 15:00   Breakouts: Tools and Mon / Product WG / Packaging /
                 Ops Tags Team
 15:00 - 15:30   Coffee
 15:30 - 16:00   Breakout Reports
 16:00 - 17:00   Feedback Session, Tokyo Planning

 Regards,


 Tom

 On 05/08/15 14:11, Tom Fifield wrote:

 Hi all,

 I've received feedback that maybe there won't be enough HPC folks in
 Palo Alto to run a 90 minute working session on it :)

 I would propose to slot in instead one of these three, which are
 currently not well included on the agenda:

  1) apps.openstack.org - What the Ops Community would like from it:
  should we look at it from the application side, i.e. applications that can
  run on your cloud or augment your cloud, or products that can help enhance
  your cloud?

 2) Openstack Personas (validation) - The UX team will have a set of
  roles that we would like to validate with the operator community.

 3) Task Taxonomy - The UX team is creating an inventory of
 standardized tasks that can be used to create scenarios and create a
 common vernacular within the community.

 Any thoughts?

 Regards,


 Tom

 On 03/08/15 18:48, Tom Fifield wrote:

 Hi all,

 Registrations are going well for our meetup in Palo Alto. If you're
 on the fence, hopefully this discussion will get you quickly over the
 line so you don't miss out!


 http://www.eventbrite.com/e/openstack-ops-mid-cycle-meetup-tickets-17703258924

 So, I've taken our suggestions and attempted to wrangle them into
 something that would fit in the space we have over 2 days.

 As a reminder, we have two different kind of sessions - General
 Sessions, which are discussions for the operator community aimed to
 produce actions (eg best practices, feedback on badness),
  and **Working groups** focus on specific topics aiming to make concrete
  progress on tasks in that area.

 As always, some stuff has been munged and mangled in an attempt to
 fit it in. For example, we'd expect to talk about Kolla more
 generally in the context of Using Containers for Deployment,
 because there are some other ways to do that too. Similarly, we'd
 expect the ops project discussion to be rolled into the session on
 the user committee.

 Anyway, take a 

Re: [Openstack-operators] Palo Alto Midcycle - agenda brainstorming

2015-07-18 Thread Joe Topjian
Hi Tom,

The list of General Session ideas is definitely shorter than past meetups,
but maybe that's a good sign! It could be that past burning topics have
been acknowledged and handled.

If that's the case, does anyone have thoughts about extending the length of
Working Group sessions so there's more time to collaborate face-to-face as
a group?

Joe

On Sat, Jul 18, 2015 at 8:42 AM, Tom Fifield t...@openstack.org wrote:

 Hi all,

 If you have some time in the next few days, please contribute to the
 agenda planning. So far it's looking a bit light, and we need to lock in
 moderators soon!


  **
 
  Please propose session ideas on:
 
  https://etherpad.openstack.org/p/PAO-ops-meetup
 
  ensuring you read the new instructions to make sessions 'actionable'.
 
 
  **



 Regards,


 Tom

 On 09/07/15 21:28, Tom Fifield wrote:

 Hi all,

  As you've seen - the Ops mid-cycle will be in Palo Alto, August 18-19,
 and we need your help to work out what should be on the agenda.

 If you're new: note this is aimed at giving us a design-summit-style
 place to congregate, swap best practices, ideas and give feedback, and
 is not a good place to learn about the basics of OpenStack.

 As usual, we're working to act on the feedback from all past events to
 make this one better than ever. One that we continue to work on is the
 need to see action happen as a result of this event, so please - when
 you are suggesting sessions in the below etherpad please try and phrase
 them in a way that will probably result in things happening afterward.


 **

 Please propose session ideas on:

 https://etherpad.openstack.org/p/PAO-ops-meetup

 ensuring you read the new instructions to make sessions 'actionable'.


 **


 The room allocations are still being worked out (all hail Allison!), but
 the current thinking is that the general sessions will all be in the
 morning of both days, and the working groups will be in the afternoon -
 similar to Philadelphia. We probably have a lot more space for smaller
 working groups this time.


 More as it comes, and as always, further information about ops meetups
 and notes from the past can be found on the wiki @:

 https://wiki.openstack.org/wiki/Operations/Meetups

 Finally, don't forget to register ASAP:

 http://www.eventbrite.com/e/openstack-ops-mid-cycle-meetup-tickets-17703258924
 !


 Regards,


 Tom

 ___
 OpenStack-operators mailing list
 OpenStack-operators@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


 ___
 OpenStack-operators mailing list
 OpenStack-operators@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Failed to create instance with openstack nova network

2015-07-13 Thread Joe Topjian
Hello,

According to nova.conf, you're running nova-network in multi-host mode.
Just to be verbose: if your OpenStack installation is an all-in-one or if
you intend for all network traffic to go through the cloud controller, this
setting should be changed to false.

The error message is reporting:

dnsmasq: failed to create listening socket for 192.168.22.1: Cannot assign
requested address

Off of the top of my head, I would check to see if 192.168.22.1 exists on
the server that dnsmasq is trying to run on (ip a | grep 192.168.22.1). As
well, check and see if anything else is listening on 53 or 67 on that
server (perhaps another instance of dnsmasq that has bound itself to all
interfaces?)
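
Concretely, something like this (off the top of my head and untested, so
adjust the IP and interface for your environment):

  # does the gateway IP exist on this host?
  ip addr show | grep 192.168.22.1

  # is anything already bound to the DNS/DHCP ports?
  netstat -lnup | grep -E ':(53|67) '

  # is a stray dnsmasq already running?
  ps aux | grep [d]nsmasq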

Hope that helps,
Joe



On Mon, Jul 13, 2015 at 1:07 PM, pra devOPS siv.dev...@gmail.com wrote:


 Can somebody advise me on the below?

 Thanks,
 Dev

 On Fri, Jul 10, 2015 at 4:32 PM, pra devOPS siv.dev...@gmail.com wrote:

 Hi

 I am running as root, Please find below the nova config file. ( I am
 using nova network)

 http://paste.openstack.org/show/363300/

 Thanks,
 Dev

 On Fri, Jul 10, 2015 at 1:30 PM, matt m...@nycresistor.com wrote:

 root-wrap failed probably a config error.  might want to post your nova
 configs with commenting out of passwords / service tokens.

 dnsmasq --strict-order --bind-interfaces --conf-file= 
 --pid-file=/var/lib/nova/networks/nova-br100.pid 
 --listen-address=192.168.22.1 --except-interface=lo 
 --dhcp-range=set:demo-net,192.168.22.2,static,255.255.255.0,120s 
 --dhcp-lease-max=256 
 --dhcp-hostsfile=/var/lib/nova/networks/nova-br100.conf 
 --dhcp-script=/usr/bin/nova-dhcpbridge --leasefile-ro --domain=novalocal 
 --no-hosts --addn-hosts=/var/lib/nova/networks/nova-br100.hosts
 2015-07-10 15:30:29.753 3044 TRACE oslo.messaging.rpc.dispatcher Exit code: 
 2

 needs to run as root.  exit code 2 is obviously pretty bad.  so that NEEDs 
 to be fixed.



 On Fri, Jul 10, 2015 at 3:25 PM, pra devOPS siv.dev...@gmail.com
 wrote:

 All:

 I get the following error when trying to create an instance in
 openstack icehouse centOS 7 on nova network.

 nova network logs and UI logs are pasted at:
 *http://paste.openstack.org/show/362706/
 http://paste.openstack.org/show/362706/*



  Can somebody give suggestions?
  Thanks, Siva


 ___
 OpenStack-operators mailing list
 OpenStack-operators@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators





 ___
 OpenStack-operators mailing list
 OpenStack-operators@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Scaling the Ops Meetup

2015-06-30 Thread Joe Topjian
Hi Tom,

I think this is a great problem to have. Difficult to solve, but it shows
how popular / important these meetups are.

I'm definitely in favor of a no booths type meetup. I feel if a company
wants to sponsor, they're doing it out of good will and any recognition
would come from that.

I'd love to keep the meetups as inclusive as possible. I found the
Philadelphia meetup to be extremely valuable networking-wise (as well as
extremely valuable in general). A lot of people I talked to may not have
attended if there was some kind of bar placed on entry.

I think the current schedule format is still working: open discussions
bring in a lot of feedback and tips, working groups continue to shape and
produce actions. Open discussions may become unwieldy as attendance grows,
but maybe having two tracks would solve that.

Do you have a good indication that the number of attendees will continue to
grow? Maybe it has hit (or will soon hit) a steady level?

I wouldn't be opposed to having a paid registration for the meetup. Could
the amount be polled if paid registration is considered?

Thanks, Tom. I don't envy your position, but I do very much appreciate the
work that goes into planning this. :)

Joe

On Mon, Jun 29, 2015 at 10:33 PM, Tom Fifield t...@openstack.org wrote:

 Hi all,

 Right now, behind-the-scenes, we're working on getting a venue for next
 ops mid-cycle. It's taking a little longer than normal, but rest assured
 it is happening.

 Why is it so difficult? As you may have noticed, we're reaching the size
 of event where both physically and financially, only the largest
 organisations can host us.

 We thought we might get away with organising this one old-school with a
 single host and sponsor. Then, for the next, start a brainstorming
 discussion with you about how we scale these events into the future -
 since once we get up and beyond a few hundred people, we're looking at
 having to hire a venue as well as make some changes to the format of the
 event.

 However, it seems that even this might be too late. We already had a
 company that proposed to host the meetup at a west coast US hotel
 instead of their place, and wanted to scope out other companies to
 sponsor food.

 This would be a change in the model, so let's commence the discussion of
 how we want to scale this event :)

 So far I've heard things like:
 * my $CORPORATE_BENEFACTOR would be fine to share sponsorship with others
 * I really don't want to get to the point where we want booths at the
 ops meetup

 Which are promising! It seems like we have a shared understanding of
 what to take this forward with.

 So, as the ops meetup grows - what would it look like for you?

 How do you think we can manage the venue selection and financial side of
 things? What about the session layout and the scheduling with the
 growing numbers of attendees?

 Current data can be found at
 https://wiki.openstack.org/wiki/Operations/Meetups#Venue_Selection .

 I would also be interested in your thoughts about how these events have
 only been in a limited geographical area so far, and how we can address
 that issue.


 Regards,


 Tom



 ___
 OpenStack-operators mailing list
 OpenStack-operators@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Instance memory overhead

2015-06-23 Thread Joe Topjian
In addition to what Kris said, here are two other ways to see memory usage
of qemu processes:

The first is with nova diagnostics <uuid>. By default this is an
admin-only command.

The second is by running virsh dommemstat <instance-id> directly on the
compute node.
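
If you want a rough total of what all qemu processes are holding on a compute
node, something like this also works (a sketch; it assumes the processes are
named qemu-system-x86_64 or kvm):

  ps -C qemu-system-x86_64,kvm -o rss= | awk '{sum+=$1} END {print sum/1024/1024 " GB"}'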

Note that it's possible for the used memory (rss) to be greater than the
available memory. When this happens, I believe it is due to the qemu
process consuming more memory than the actual vm itself -- so the instance
has consumed all available memory, plus qemu itself needs some to function
properly. Someone please correct me if I'm wrong.

Hope that helps,
Joe

On Tue, Jun 23, 2015 at 10:12 AM, Kris G. Lindgren klindg...@godaddy.com
wrote:

   Not totally sure I am following - the output of free would help a lot.

  However, the number you should be caring about is free +buffers/cache.
 The reason for you discrepancy is you are including the cached in memory
 file system content that linux does in order to improve performance. On
 boxes with enough ram this can easily be 60+ GB.  When the system comes
 under memory pressure (from applications or the kernel wanting more memory)
 the kernel will remove any cached filesystem items to free up memory for
 processes.  This link [1] has a pretty good description of what I am
 talking about.

  Either way, if you want to test to make sure this is a case of
 filesystem caching you can run:

 echo 3 > /proc/sys/vm/drop_caches

  Which will tell linux to drop all filesystem cache from memory, and I
 bet a ton of your memory will show up.  Note: in doing so - you will affect
 the performance of the box.  Since what use to be an in memory lookup will
 now have to go to the filesystem.  However, over time the cache will
 re-establish.  You can find more examples of how caching interacts with
 other part of the linux memory system here: [2]

  To your question about the qemu processes: if you use ps aux, the
 columns VSZ and RSS will tell you what you are wanting.  VSZ is the virtual size
 (how much memory the process has asked the kernel for).  RSS is the resident
 set size, or the actual amount of non-swapped memory the process is using.

  [1] - http://www.linuxatemyram.com/
  [2] - http://www.linuxatemyram.com/play.html
  

 Kris Lindgren
 Senior Linux Systems Engineer
 GoDaddy, LLC.


   From: Mike Leong leongmzl...@gmail.com
 Date: Tuesday, June 23, 2015 at 9:44 AM
 To: openstack-operators@lists.openstack.org 
 openstack-operators@lists.openstack.org
 Subject: [Openstack-operators] Instance memory overhead

   My instances are using much more memory than expected.  The amount of free
 memory (free + cached) is under 3G on my servers even though the compute
 nodes are configured to reserve 32G.

  Here's my setup:
 Release: Ice House
  Server mem: 256G
 Qemu version: 2.0.0+dfsg-2ubuntu1.1
 Networking: Contrail 1.20
 Block storage: Ceph 0.80.7
 Hypervisor OS: Ubuntu 12.04
 memory over-provisioning is disabled
 kernel version: 3.11.0-26-generic

  On nova.conf
  reserved_host_memory_mb = 32768

  Info on instances:
 - root volume is file backed (qcow2) on the hypervisor local storage
 - each instance has a rbd volume mounted from Ceph
 - no swap file/partition

  I've confirmed, via nova-compute.log, that nova is respecting the
 reserved_host_memory_mb directive and is not over-provisioning.  On some
 hypervisors, nova-compute says there's 4GB available for use even though
 the OS has less than 4G left (free + cached)!

  I've also summed up the memory from /etc/libvirt/qemu/*.xml files and the
 total looks good.

  Each hypervisor hosts about 45-50 instances.

  Is there good way to calculate the actual usage of each QEMU process?

  PS: I've tried free, summing up RSS, and smem but none of them can tell
 me where's the missing mem.

  thx
 mike

 ___
 OpenStack-operators mailing list
 OpenStack-operators@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] 100% CPU and hangs if syslog is restarted

2015-05-28 Thread Joe Topjian
Hello,

Yeah, I ran into it last fall:

http://www.gossamer-threads.com/lists/openstack/operators/41876

Good to know that this issue still exists in Juno (we're still on
Icehouse). Thanks for the note. :)

Joe

On Thu, May 28, 2015 at 10:56 AM, George Shuklin george.shuk...@gmail.com
wrote:

 Hello.

 Today we've discover a very serious bug in juno:
 https://bugs.launchpad.net/nova/+bug/1459726

 In short: if you're using syslog, and restart rsyslog, all APIs processes
 will eventually stuck with 100% CPU usage without doing anything.

 Is anyone hits this bug before? It looks like very nasty.

 ___
 OpenStack-operators mailing list
 OpenStack-operators@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] How do your end users use networking?

2015-05-22 Thread Joe Topjian
Hi Kris,

Busy week! It was good seeing you in Vancouver - even if it was just in
passing on the escalator ;)

  It is always nice to see that other people are doing the same things as
 you or see the same issues as you are and that you are not crazy.


+100


 Would it be accurate to say that most of your end users want almost
 nothing to do with the network?


Yes.


 In my experience what the majority of them (both internal and external)
 want is to consume from Openstack a compute resource, a property of which
 is it that resource has an IP address.  They, at most, care about which
 network they are on.  Where a network is usually an arbitrary
 definition around a set of real networks, that are constrained to a
 location, in which the company has attached some sort of policy.  For
 example, I want to be in the production network vs's the xyz lab network,
 vs's the backup network, vs's the corp network.  I would say for Godaddy,
 99% of our use cases would be defined as: I want a compute resource in the
 production network zone, or I want a compute resource in this other network
 zone.  The end user only cares that the IP the vm receives works in that
 zone, outside of that they don't care any other property of that IP.  They
 do not care what subnet it is in, what vlan it is on, what switch it is
 attached to, what router its attached to, or how data flows in/out of that
 network.  It just needs to work.


Again, yes.


 We have also found that by giving the users a floating ip address that can
 be moved between vm's (but still constrained within a network zone) we
 can solve almost all of our users asks.  Typically, the internal need for a
 floating ip is when a compute resource needs to talk to another protected
 internal or external resource. Where it is painful (read: slow) to have the
 acl's on that protected resource updated. The external need is from our
 hosting customers who have a domain name (or many) tied to an IP address
 and changing IP's/DNS is particularly painful.


Our use of floating IPs has been described as overloaded in other
discussions, and I think that's accurate. Our users use floating IPs both
as a way to move a common IP from one compute resource to another and as a
way to attach to a publicly accessible network for direct access.


 Since the vast majority of our end users don't care about any of the
 technical network stuff, we spend a large amount of time/effort in
 abstracting
 or hiding the technical stuff from the users view. Which has lead to a
 number of patches that we carry on both nova and neutron (and are available
 on our public github).


This is the primary reason we continue to use nova-network. I'm only
mentioning that to describe our implemented solution and not an attempt
start a Neutron v nova-network debate. We're not opposed to Neutron.


 At the same time we also have a *very* small subset of (internal) users
 who are at the exact opposite end of the scale.  They care very much about
 the network details, possibly all the way down to that they want to boot a
 vm to a specific HV, with a specific IP address on a specific network
 segment.  The difference however, is that these users are completely aware
 of the topology of the network and know which HV's map to which network
 segments and are essentially trying to make a very specific ask for
 scheduling.


We see the same thing. We will set up separate, smaller cloud environments
where the focus is on the network and not compute. To date, these clouds
have been lab environments. I'm not sure what we would do if we had users
who wanted advanced network features in our compute-focused clouds.

Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Venom vulnerability

2015-05-13 Thread Joe Topjian
 Hello,

 Looking through the details of the Venom vulnerability,
 https://securityblog.redhat.com/2015/05/13/venom-dont-get-bitten/, it
 would appear that the QEMU processes need to be restarted.



 Our understanding is thus that a soft reboot of the VM is not sufficient
 but a hard one would be OK.



 Some quick tests have shown that a suspend/resume of the VM also causes a
 new process.


The RedHat KB article (linked in the blog post you gave) also mentions that
migrating to a patched server should also be sufficient. If either methods
(suspend or migration) work, I think those are nicer ways of handling this
than hard reboots.

I also found this statement to be curious:

The sVirt and seccomp functionalities used to restrict host's QEMU process
privileges and resource access might mitigate the impact of successful
exploitation of this issue.

So perhaps RedHat already has mechanisms in place to prevent exploits such
as this from being successful? I wonder if Ubuntu has something similar in
place.


   How are others looking to address this vulnerability ?


It looks like RedHat has released updates, but I haven't received an
announcement for Ubuntu yet -- does anyone know the status?

As soon as a fix is released, we'll update our hosts. That will ensure new
instances aren't vulnerable. We'll then figure out some way of coordinating
fixing of older instances.
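
For finding those older instances, something along these lines should do it
on each compute node (a sketch, assuming Ubuntu/KVM with qemu-system-x86_64
as the process name): any qemu process that started before the packages were
upgraded is still running the old code.

  # when was the binary last updated?
  ls -l /usr/bin/qemu-system-x86_64

  # which qemu processes pre-date that?
  ps -C qemu-system-x86_64 -o pid,lstart,args --sort=start_time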

Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Venom vulnerability

2015-05-13 Thread Joe Topjian
Looks like the updated Ubuntu packages are available:

http://www.ubuntu.com/usn/usn-2608-1/

On Wed, May 13, 2015 at 10:44 AM, Matt Van Winkle mvanw...@rackspace.com
wrote:

  Yeah, something like that would be handy.

   From: matt m...@nycresistor.com
 Date: Wednesday, May 13, 2015 10:29 AM
 To: Daniel P. Berrange berra...@redhat.com
 Cc: Matt Van Winkle mvanw...@rackspace.com, 
 openstack-operators@lists.openstack.org 
 openstack-operators@lists.openstack.org
 Subject: Re: [Openstack-operators] Venom vulnerability

honestly that seems like a very useful feature to ask for...
 specifically for upgrading qemu.

  -matt

 On Wed, May 13, 2015 at 11:19 AM, Daniel P. Berrange berra...@redhat.com
 wrote:

 On Wed, May 13, 2015 at 03:08:47PM +, Matt Van Winkle wrote:
  So far, your assessment is spot on from what we've seen.  A migration
  (if you have live migrate that's even better) should net the same result
  for QEMU.  Some have floated the idea of live migrate within the same
  host.  I don't know if nova out of the box would support such a thing.

 Localhost migration (aka migration within the same host) is not something
 that is supported by libvirt/KVM. Various files QEMU has on disk are based
 on the VM name/uuid and you can't have 2 QEMU processes on the host having
 the files at the same time, which precludes localhost migration working.

 Regards,
 Daniel
  --
 |: http://berrange.com  -o-
 http://www.flickr.com/photos/dberrange/ :|
 |: http://libvirt.org  -o-
 http://virt-manager.org :|
 |: http://autobuild.org   -o-
 http://search.cpan.org/~danberr/ :|
 |: http://entangle-photo.org   -o-
 http://live.gnome.org/gtk-vnc :|

 ___
 OpenStack-operators mailing list
 OpenStack-operators@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators



 ___
 OpenStack-operators mailing list
 OpenStack-operators@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] Federation Ops Session at the Vancouver Summit

2015-05-12 Thread Joe Topjian
Hello,

Following suit of the other posts, this is an announcement / reminder of
the Federation Ops Session happening next Tuesday:

http://sched.co/3BBs

The etherpad for the session is here:

https://etherpad.openstack.org/p/YVR-ops-federation

I encourage everyone to add items they'd like to discuss prior to the
Session, but if you don't get to it, just bring it up during the Session --
we'll most likely be able to fit it in.

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [Openstack] [nova] Cleaning up unused images in the cache

2015-04-28 Thread Joe Topjian
Hello,

I've got a similar question about cache-manager and the presence of a
 shared filesystem for instances images.
 I'm currently reading the source code in order to find out how this is
 managed but before I would be curious how you achieve this on production
 servers.

 For example images not used by compute node A will probably be cleaned on
 the shared FS despite the fact that compute B use it, that's the main
 problem.


This used to be a problem, but AFAIK it should not happen any more. If
you're noticing it happening, please raise a flag.


 How do you handle _base guys ?


We configure Nova to not have instances rely on _base files. We found it to
be too dangerous of a single point of failure. For example, we ran into the
scenario you described a few years ago before it was fixed. Bugs are one
thing, but there are a lot of other ways a _base file can become corrupt or
removed. Even if those scenarios are rare, the results are damaging enough
for us to totally forgo reliance on _base files.

Padraig Brady has an awesome article that details the many ways you can
configure _base and instance files:

http://www.pixelbeat.org/docs/openstack_libvirt_images/
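
For anyone curious, the kind of nova.conf knobs involved look roughly like
this (a sketch from memory -- double-check the option names against your
release):

  # flat raw disks per instance, so no qcow2 backing (_base) file at runtime
  use_cow_images = False

  # let the image cache manager prune _base files nothing references anymore
  remove_unused_base_images = True
  remove_unused_original_minimum_age_seconds = 86400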

I'm looping -operators into this thread for input on further ways to handle
_base. You might also be able to find some other methods by searching the
-operators mailing list archive.

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] Windows Instances and Volumes

2015-04-28 Thread Joe Topjian
Hello,

I'm wondering if anyone has best practices for Windows-based instances that
make heavy use of volumes?

I have a user who was running SQL Server off of an iSCSI-based volume. We
did a live-migration of the instance and that seemed to have caused Windows
to drop the drive. Disk Manager showed it as a new drive that needed to be
formatted. Everything was fine upon an explicit detach and reattach through
OpenStack.

Even though the volume backend is iSCSI, the instance, of course, doesn't
see it that way. I'm wondering if things would have been different if
Windows actually saw it as an iSCSI-based drive.

Also, I would have thought that running something such as SQL Server off a
volume would be highly discouraged, but looking into SQL Server on EC2 and
EBS shows the opposite. Amazon's official documentation only mentions EBS
and not whether that means EBS-based instances or independent EBS volumes.
Some forum posts make mention of the latter working out. Though I do
realize that EC2 is not a fair comparison.

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] nova rescue

2015-03-29 Thread Joe Topjian
That's all very useful advice -- thank you. :)

On Sun, Mar 29, 2015 at 8:39 PM, gustavo panizzo (gfa) g...@zumbi.com.ar
wrote:



 On 03/29/2015 11:19 AM, Joe Topjian wrote:

 Hello,

 Without specifying a rescue image, Nova will use the image that the
 instance is based on when performing a rescue.

 I've noticed that this is problematic for cloud-friendly images such
 as the official Ubuntu images and the newer CentOS 7 images. I'm finding
 that /dev/vdb still ends up mounted as /, most likely because of the
 common label name being found. This, of course, defeats the entire
 purpose of rescuing.



 Is there an easy way around this or is the best practice to specify an
 image tailored specifically for rescuing? And if that's the case, can
 anyone recommend a good rescue image that's already in qemu format?



 we have modified the cloud images to mount root fs from /dev/vda1 instead
 of UUID=something, we modified grub and /etc/fstab

 we also added a warning in /etc/rc.local, as some people run servers in
 rescue mode for a long time, and when the server was un-rescued they lost
 data and that's not cool.

 but we need to modify cloud images anyway to change repos, users, keys,
 etc. so is just one more step on our image build process.

 if it is a minimal change, you can download the image and run guestfish -a
 image.qcow2
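
 for example, a rough guestfish session could look like this (just a sketch;
 the exact grub file depends on the distro):

   guestfish -a image.qcow2 -i
   ><fs> edit /etc/fstab          # switch UUID=... to /dev/vda1 for /
   ><fs> edit /boot/grub/grub.cfg # same change for the kernel root= argument
   ><fs> exit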


 --
 1AE0 322E B8F7 4717 BDEA BF1D 44BB 1BA7 9F6C 6333

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] nova rescue

2015-03-28 Thread Joe Topjian
Hello,

Without specifying a rescue image, Nova will use the image that the
instance is based on when performing a rescue.

I've noticed that this is problematic for cloud-friendly images such as
the official Ubuntu images and the newer CentOS 7 images. I'm finding that
/dev/vdb still ends up mounted as /, most likely because of the common
label name being found. This, of course, defeats the entire purpose of
rescuing.

Is there an easy way around this or is the best practice to specify an
image tailored specifically for rescuing? And if that's the case, can
anyone recommend a good rescue image that's already in qemu format?

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] FYI: Rabbit Heartbeat Patch Landed

2015-03-20 Thread Joe Topjian
We have other supporting services that require RabbitMQ and since they only
accept a single host in their connection config, we need a more reliable
way for them to connect. Those services work just fine with
HAProxy/RabbitMQ.

The OpenStack HA guide
http://docs.openstack.org/high-availability-guide/content/_configure_openstack_services_to_use_rabbitmq.html
is the only document I've come across that talks about connecting to
RabbitMQ by way of multiple hosts. Even the RabbitMQ docs
http://www.rabbitmq.com/clustering.html mention the use of a load
balancer, DNS round robin, etc.

So when someone reads up on RabbitMQ outside of the HA guide and sees that
it works well with HAProxy for other services, I think it's understandable
that they would attempt to connect OpenStack to RabbitMQ via HAProxy.
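
For context, the two patterns look roughly like this in nova.conf (a sketch;
the host names and retry values below are just examples):

  # through a load balancer VIP -- what the single-host services force us into
  rabbit_host = rabbitmq.example.com
  rabbit_port = 5672

  # letting oslo.messaging handle failover itself -- what the HA guide suggests
  rabbit_hosts = rabbit1:5672,rabbit2:5672,rabbit3:5672
  rabbit_retry_interval = 1
  rabbit_retry_backoff = 2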


On Fri, Mar 20, 2015 at 12:07 AM, John Dewey j...@dewey.ws wrote:

  Why would anyone want to run rabbit behind haproxy?  I get people did it
 post the 'rabbit_servers' flag. Allowing the client to detect, handle, and
 retry is a far better alternative than load balancer health check
 intervals.

 On Thursday, March 19, 2015 at 9:42 AM, Kris G. Lindgren wrote:

 I have been working with dism and sileht on testing this patch in one of
 our pre-prod environments. There are still issues with rabbitmq behind
 haproxy that we are working through. However, in testing if you are using
 a list of hosts you should see significantly better catching/fixing of
 faults.

 If you are using cells, don't forget to also apply:
 https://review.openstack.org/#/c/152667/
 
 Kris Lindgren
 Senior Linux Systems Engineer
 GoDaddy, LLC.



 On 3/19/15, 10:22 AM, Mark Voelker mvoel...@vmware.com wrote:

 At the Operator's midcycle meetup in Philadelphia recently there was a
 lot of operator interest[1] in the idea behind this patch:

 https://review.openstack.org/#/c/146047/

 Operators may want to take note that it merged yesterday. Happy testing!


 [1] See bottom of https://etherpad.openstack.org/p/PHL-ops-rabbit-queue

 At Your Service,

 Mark T. Voelker
 OpenStack Architect


 ___
 OpenStack-operators mailing list
 OpenStack-operators@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators



 ___
 OpenStack-operators mailing list
 OpenStack-operators@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators



 ___
 OpenStack-operators mailing list
 OpenStack-operators@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] Live migration of instances with iscsi volumes

2015-03-19 Thread Joe Topjian
Hello,

I just resolved an issue where migrating instances with iSCSI volumes would
occasionally fail. There's a bug report here:

https://bugs.launchpad.net/nova/+bug/1423772

The core cause ended up being libvirt transferring the volume paths
verbatim. For example, take the situation where:

compute node 1: has luns 0-5 in use for volumes
compute node 2: has luns 0-2 in use for volumes
instance 1: hosted on compute node 1 with a volume using lun 5

If this instance was moved from node 1 to 2, it would fail with:

Live Migration failure: Failed to open file
'/dev/disk/by-path/ip-192.168.1.1:blah-blah-lun-5': No such file or
directory

Meanwhile, compute node 2 tried creating an iSCSI connection at lun 3,
which is the next lun in line for use.

Two other situations would happen:

* If lun 5 was already in use on the destination node, the migration would
think it's available and try connecting to it.
* If lun 5 was the natural next available lun, the migration would work and
everything would be fine.
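
If you want to check whether you're exposed to this, comparing the luns in
use on the source and destination nodes is a quick test (a rough sketch; the
by-path names vary by backend):

  # luns currently attached on this compute node
  ls /dev/disk/by-path/ | grep -- '-iscsi-'

  # luns the local libvirt domains reference
  grep 'by-path' /etc/libvirt/qemu/*.xml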

I found this blueprint which resolves the problem:

https://blueprints.launchpad.net/nova/+spec/iscsi-live-migration-different-target
https://review.openstack.org/#/c/137466/

The patch will rewrite the libvirt xml file to use the correct luns on the
destination server rather than the luns from the source server.

Since I'm using Icehouse, I backported the changes -- fortunately it was
pretty easy.

This environment jumped from Grizzly to Icehouse. I'm not sure if it was
just dumb luck that I never saw this on Grizzly or something changed that
introduced it between Grizzly and Icehouse.

Anyway, I wanted to post a message because iSCSI-backed volumes are fairly
common.

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] Ops Meetup Monitoring/Tools Session

2015-03-04 Thread Joe Topjian
Hi all,

I'll be moderating the Monitoring/Tools session at next week's Ops Meetup.
The etherpage is here:

https://etherpad.openstack.org/p/PHL-ops-tools-wg

Please add items you'd like to see covered. So far, the general topics will
be:

* Discussion of Monasca, StackTach, and related tools. Members of the
Monasca and StackTach team will be attending, so feel free to ask
questions. They also want to gather feedback on the difficulties operators
are having in the areas that those tools solve.

* Review and focus on the action items on the Monitoring wiki page:

https://wiki.openstack.org/wiki/Operations/Monitoring

See everyone next week,
Joe

ps: A general note for everyone attending that Sunday March 8th marks the
start of Daylight Savings Time in North America. For those who still use a
time-keeping device that does not auto-adjust, the time will be shifted an
hour forward on March 8th at 2am.  :)
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] qemu 1.x to 2.0

2015-02-24 Thread Joe Topjian
Hi Mathieu,

Yeah, luckily I had an old copy of pxe-virtio.rom and just distributed it
via Puppet. I agree that it should be accessible via UCA.
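
The workaround itself is tiny -- essentially just this on each upgraded node,
where old-node is any host still running the 12.04-era packages (we push the
file with Puppet, but any copy mechanism works):

  scp old-node:/usr/share/qemu/pxe-virtio.rom \
      /usr/share/qemu/pxe-virtio.rom.12.04
  service libvirt-bin restart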

As an update from my side: while things were working well for a while, we
had a few users report reboot issues. kvm/libvirt refused to start these
legacy instances. We were unable to determine if this was a coincidental
edge case, so to just get rid of this issue entirely, we ended up doing a
cloud-wide live-migration-shuffle. We have not had any issues since.

Note that if none of the prior work was done, live-migration wouldn't have
even worked and we would have had to do a cloud-wide hard-reboot of
instances. :/

Let me know if you have any more questions with this. :)

Thanks,
Joe

On Tue, Feb 24, 2015 at 11:40 AM, Mathieu Gagné mga...@iweb.com wrote:

 Joe,

 Finally got time to check the QEMU 2.0 upgrade. =)

 On 2014-11-26 11:30 PM, Joe Topjian wrote:


 Upon trying a migration again, I got an error about a
 missing pxe-virtio.rom.12.04. This was also mentioned in this bug report:

 https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1291321

 So I copied the /usr/share/qemu/pxe-virtio.rom file from a compute node
 that wasn't yet upgraded as /usr/share/qemu/pxe-virtio.rom.12.04 on the
 upgraded node (and the destination node of the migration). I rebooted
 libvirt-bin and sure enough the migration worked!


 I found out that the /usr/share/qemu/pxe-virtio.rom.12.04 file is
 packaged in the kvm-ipxe-precise package.

 Unfortunately, it is missing from UCA for Icehouse (Ubuntu 12.04).
 The package is only available in the Ubuntu 14.04 repository.

 How did you fix it on your side? Did you automate the copy of
 pxe-virtio.rom.12.04?

 Shouldn't the kvm-ipxe-precise package be made available in UCA?

 --
 Mathieu


 ___
 OpenStack-operators mailing list
 OpenStack-operators@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] State of Juno in Production

2015-02-17 Thread Joe Topjian
Cool, thanks, Jon. I've been following the thread on your scheduling issue
on the OpenStack list. I can't see our users hitting that issue, but it's
always good to keep in mind. :)

On Tue, Feb 17, 2015 at 1:17 PM, Jonathan Proulx j...@jonproulx.com wrote:

 Recently (4 weeks?) moved from Icehouse to Juno. It was pretty smooth
 (neutron has been much more well behaved though I know that's not
 relevant to you).

 One negative difference I noticed, but haven't really dug into yet
 since it's not a common pattern here:

 If I schedule 20 instances in one API call I get conductor timeouts
 and zero launches.  If I make many parallel scheduling calls for 20
 instances each response is good and scaling out to several hundred
 parallel launches is much faster and with out the neutron timeout
 errors that plagued me since I switched over to quantum in Grizzly.

 As I said I've not looked deeply at this so it may be a local config
 issue rather than something systemic with Juno, but if it's an
 important use case for you be sure to take a good look at it.

 -Jon

 On Tue, Feb 17, 2015 at 12:56 PM, Joe Topjian j...@topjian.net wrote:
  Nice - thanks, Jesse. :)
 
  On Tue, Feb 17, 2015 at 10:35 AM, Jesse Keating j...@bluebox.net wrote:
 
  On 2/17/15 8:46 AM, Joe Topjian wrote:
 
 
  The only issue I'm aware of is that live snapshotting is disabled. Has
  anyone re-enabled this and seen issues? What was the procedure to
  re-enable?
 
 
  We've re-enabled it. Live snapshots take more system resources, which
  meant I had to dial back down my Rally test to validate how it could
  perform.
 
  To re-enable it, we reverted the upstream commit that disabled it.
 
 
 
 https://github.com/blueboxgroup/nova/commit/fa3a9208ea366489410b4828bd20a74a571287a6
 
  Once that was clear, and we had an upgraded version of libvirt in place,
  live snapshots just happened.
 
  As for the rest of your mail, we're going from Havana to Juno, and we
 have
  neutron, so some of our experience won't necessarily apply to you.
 
  --
  -jlk
 
  ___
  OpenStack-operators mailing list
  OpenStack-operators@lists.openstack.org
  http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
 
 
 
  ___
  OpenStack-operators mailing list
  OpenStack-operators@lists.openstack.org
  http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
 

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] State of Juno in Production

2015-02-17 Thread Joe Topjian
Hello,

I'm beginning to plan for a Juno upgrade and wanted to get some feedback
from anyone else who has gone through the upgrade and has been running Juno
in production.

The environment that will be upgraded is pretty basic: nova-network, no
cells, Keystone v2. We run a RabbitMQ cluster, though, and per other recent
discussions, see the same reported issues.

The only issue I'm aware of is that live snapshotting is disabled. Has
anyone re-enabled this and seen issues? What was the procedure to re-enable?

Any other gotchas or significant differences seen from running Icehouse?

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [Ceilometer] Real world experience with Ceilometer deployments - Feedback requested

2015-02-12 Thread Joe Topjian
Hi Tim,

Does anyone have any proposals regarding

  - Possible replacements for Ceilometer that you have used instead

 It seems that many sites have written their own systems.


Sorry - I should have appended this at the end of my last post.

I need to preface this with: I have never used Ceilometer, nor do our
environments require billing. But we're already collecting a lot of
information that could be used for billing.

The `nova usage-list` command reports a tenant's compute resource
allocation per 24 hour period.

For per-instance metrics, I've posted a script that will collect them here:

https://github.com/osops/tools-generic/blob/master/libvirt/instance_metrics.rb

I recently discovered that the `nova diagnostics` command reports almost
the same information, minus the CPU usage that I'm polling via `ps`. This
might not be needed for most environments, though, and so `nova
diagnostics` alone should be fine.
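
Concretely, the polling is not much more than this (the dates and UUID below
are placeholders):

  nova usage-list --start 2015-01-01 --end 2015-02-01
  nova diagnostics <instance-uuid>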

So between all of this information, we're able to create a good picture of
a tenant's compute usage. Of course, if we were to do billing, this would
all need fed into a billing system of some sort. Plus, the 24 hour
resolution might be too large.

But hopefully it gives a good indication that polling some basic metrics of
compute usage doesn't require a lot of resources. :)
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] How to handle updates of public images?

2015-02-05 Thread Joe Topjian
We do exactly this.

Public images are named very generically like Ubuntu 14.04. Not even
14.04.1 or something like that. Old images are renamed and made private.
Existing instances continue to run, but, as others have mentioned, if a
user is using a UUID to launch instances, that will break for them. This
is an acceptable trade-off for us. Our documentation makes mention of this
and tells users to use the names.
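
The rotation itself is just a couple of glance calls (v1 CLI; the names and
dates below are only examples):

  glance image-update $OLD_IMAGE_ID \
      --name "Ubuntu 14.04 (retired 2015-02-05)" --is-public False
  glance image-create --name "Ubuntu 14.04" --disk-format qcow2 \
      --container-format bare --is-public True --file ubuntu-14.04.qcow2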

The OpenStack CLI tools as well as Vagrant (the two most used non-Dashboard
tools that are used) both support image names, so we haven't run into a
UUID-only issue.

We have a modified MOTD that lists some different scripts that the user can
run, such as:

* Using our local apt-cache server (Ubuntu only)
* Enabling automatic updates
* Install the openstack command-line tools

We had a few debates about turning on automatic updates in the images we
provide. Ultimately we chose to not enable them and instead go with the
MOTD message. There are several reasons why having automatic updates
enabled is a benefit, but the single reason that made us not do it is
simply if an automatic update breaks the user's instance, it's our fault.
It's a very debatable argument.

Also, we use Packer to bundle all of this. We have most of it available
here:

https://github.com/cybera/openstack-images

In addition to all of this, we allow users to upload their own images. So
if the core set of images we provide doesn't meet their needs, they're free
to create their own solution.

On Thu, Feb 5, 2015 at 7:02 AM, Abel Lopez alopg...@gmail.com wrote:

 I always recommend the following:
 All public images are named generically enough that they can be replaced
 with a new version of the same name. This helps new instances booting.
 The prior image is renamed with -OLD-$date. This lets users know that
 their image has been deprecated. This image is made private so no new
 instances can be launched.
 All images include an updated motd that indicates available security
 updates.

 We're discussing baking the images with automatic updates, but still
 haven't reached an agreement.


 On Thursday, February 5, 2015, Tim Bell tim.b...@cern.ch wrote:

  -Original Message-
  From: George Shuklin [mailto:george.shuk...@gmail.com]
  Sent: 05 February 2015 14:10
  To: openstack-operators@lists.openstack.org
  Subject: [Openstack-operators] How to handle updates of public images?
 
  Hello everyone.
 
  We are updating our public images regularly (to provide them to
 customers in
  up-to-date state). But there is a problem: If some instance starts from
 image it
  becomes 'used'. That means:
  * That image is used as _base for nova
  * If instance is reverted this image is used to recreate instance's disk
  * If instance is rescued this image is used as rescue base
  * It is redownloaded during resize/migration (on a new compute node)
 
  One more (our specific):
  We're using raw disks with _base on slow SATA drives (in comparison to
 fast SSD
  for disks), and if that SATA fails, we replace it (and nova redownloads
 stuff in
  _base).
 
  If image is deleted, it causes problems with nova (nova can't download
 _base).
 
  The second part of the problem: glance disallows to update image
 (upload new
  image with same ID), so we're forced to upload updated image with new
 ID and
  to remove the old one. This causes problems described above.
  And if tenant boots from own snapshot and removes snapshot without
 removing
  instance, it causes same problem even without our activity.
 
  How do you handle public image updates in your case?
 

 We have a similar problem. For the Horizon based end users, we've defined
 a panel using image meta data. Details are at
 http://openstack-in-production.blogspot.ch/2015/02/choosing-right-image.html
 .

 For the CLI users, we propose to use the sort options from Glance to find
 the latest image of a particular OS.

 It would be good if there was a way of marking an image as hidden so that
 it can still be used for snapshots/migration but would not be shown in
 image list operations.

  Thanks!
 
  ___
  OpenStack-operators mailing list
  OpenStack-operators@lists.openstack.org
  http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

 ___
 OpenStack-operators mailing list
 OpenStack-operators@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


 ___
 OpenStack-operators mailing list
 OpenStack-operators@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] How to handle updates of public images?

2015-02-05 Thread Joe Topjian
I'm curious: are you using _base files? We're not and we're able to block
migrate instances based on deleted images or images that were public but
are now private.

On Thu, Feb 5, 2015 at 2:42 PM, Belmiro Moreira 
moreira.belmiro.email.li...@gmail.com wrote:

 We don't delete public images from Glance because it breaks migrate/resize
 and block live migration. Not tested with upstream Kilo, though.
 As consequence, our public image list has been growing over time...

 In order to manage image releases we use glance image properties to tag
 them.

 Some relevant reviews:
 https://review.openstack.org/#/c/150337/
 https://review.openstack.org/#/c/90321/

 Belmiro
 CERN

 On Thu, Feb 5, 2015 at 8:16 PM, Kris G. Lindgren klindg...@godaddy.com
 wrote:

 In the case of a raw backed qcow2 image (pretty sure that's the default)
 the instance's root disk as seen inside the vm is made up of changes made
 on the instance disk (qcow2 layer) + the base image (raw).  Also, remember
 that as currently coded a resize migration will almost always be a
 migrate.  However, since the vm is successfully running on the old compute
 node it *should* be a trivial change that if the backing image is no
 longer available via glance - copy that over to the new host as well.
 

 Kris Lindgren
 Senior Linux Systems Engineer
 GoDaddy, LLC.




 On 2/5/15, 11:55 AM, Clint Byrum cl...@fewbar.com wrote:

 Excerpts from George Shuklin's message of 2015-02-05 05:09:51 -0800:
  Hello everyone.
 
  We are updating our public images regularly (to provide them to
  customers in up-to-date state). But there is a problem: If some
 instance
  starts from image it becomes 'used'. That means:
  * That image is used as _base for nova
  * If instance is reverted this image is used to recreate instance's
 disk
  * If instance is rescued this image is used as rescue base
  * It is redownloaded during resize/migration (on a new compute node)
 
 
 Some thoughts:
 
 * All of the operations described should be operating on an image ID. So
 the other suggestions of renaming seems the right way to go. Ubuntu
 14.04 becomes Ubuntu 14.04 02052015 and the ID remains in the system
 for a while. If something inside Nova doesn't work with IDs, it seems
 like a bug.
 
 * rebuild, revert, rescue, and resize, are all very _not_ cloud things
 that increase the complexity of Nova. Perhaps we should all reconsider
 their usefulness and encourage our users to spin up new resources, use
 volumes and/or backup/restore methods, and then tear down old instances.
 
 One way to encourage them is to make it clear that these operations will
 only work for X amount of time before old versions images will be
 removed.
 So if you spin up Ubuntu 14.04 today, reverts and resizes and rescues
 are only guaranteed to work for 6 months. Then aggressively clean up >6
 month old image ids. To make this practical, you might even require
 a role, something like reverter, rescuer, resizer and only allow
 those roles to do these operations, and then before purging images,
 notify those users in those roles of instances they won't be able to
 resize/rescue/revert anymore.
 
 It also makes no sense to me why migrating an instance requires its
 original image. The instance root disk is all that should matter.
 


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] RHEL 7 / CentOS 7 instances losing their network gateway

2015-01-28 Thread Joe Topjian
I'm pretty sure I've resolved this issue. Since this seems to happen
randomly, it might just be a coincidence that this is by far the longest
streak that it hasn't happened. :)

I noticed that CentOS 7 and RHEL 7 are setting a `valid_lft` and
`preferred_lft` timeout on the IPv4 address. You can see this by doing an
`ip a` on CentOS 7/RHEL 7 and comparing with either CentOS 6 or Ubuntu. This
is the first time I've seen this used on IPv4. It's usually used for IPv6
privacy addresses. The timeout is set to something larger than the lease
renewal time.

What happens, though, is that it is occasionally taking a little longer to
receive the DHCP renewal. Then the `valid_lft` hits zero and the IP is
removed from the interface. When this happens, the kernel will clean up any
routes used by the removed IP (in this case, the default gateway).

A few seconds later, the late DHCP renewal is finally received and the IP
is added back to the interface. But due to how CentOS/RHEL7 is handling the
renewal in /usr/sbin/dhclient-script, the gateway is never re-added.

My guess as to why a newer version of dnsmasq does not exhibit this issue
is that it advertises renewals a little differently: enough to trigger
the part of dhclient-script that re-adds the gateway. I have not verified this
theory, though.

What I've done for now is modify dhclient-script and remove any portion
that sets a valid_lft and preferred_lft, so now they are set to forever,
just like on other distros.
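
For reference, a rough sketch of what to look for and a stop-gap workaround
(the interface, address, and lifetime values below are just placeholders):

# watch the lifetimes count down between DHCP renewals
ip -4 addr show eth0
#   inet 10.0.0.5/24 brd 10.0.0.255 scope global dynamic eth0
#      valid_lft 3042sec preferred_lft 3042sec

# stop the kernel from expiring the address (until the next renewal
# re-applies the lifetimes); the dhclient-script edit above is the durable fix
ip addr change 10.0.0.5/24 dev eth0 valid_lft forever preferred_lft forever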

And so far, so good (crossing fingers).

Thanks,
Joe

On Tue, Jan 27, 2015 at 1:53 PM, Joe Topjian j...@topjian.net wrote:

 Hi George,

 All instances have only a single interface.

 Thanks,
 Joe

 On Tue, Jan 27, 2015 at 1:38 PM, George Shuklin george.shuk...@gmail.com
 wrote:

   How many network interfaces does your instance have? If more than one, check
  the settings for the second network (subnet). It can have its own DHCP settings,
  which may mess with the routes for the main network.


 On 01/27/2015 06:08 PM, Joe Topjian wrote:

 Hello,

  I have run into two different OpenStack clouds where instances running
 either RHEL 7 or CentOS 7 images are randomly losing their network gateway.

  There's nothing in the logs that show any indication of why. There's no
 DHCP hiccup or anything like that. The gateway has just disappeared.

  If I log into the instance via another instance (so on the same subnet
 since there's no gateway), I can manually re-add the gateway and everything
 works... until it loses it again.

  One cloud is running Havana and the other is running Icehouse. Both are
 using nova-network and both are Ubuntu 12.04.

  On the Havana cloud, we decided to install the dnsmasq package from
 Ubuntu 14.04. This looks to have resolved the issue as this was back in
 November and I haven't heard an update since.

  However, we don't want to do that just yet on the Icehouse cloud. We'd
 like to understand exactly why this is happening and why updating dnsmasq
 resolves an issue that only one specific type of image is having.

  I can make my way around CentOS, but I'm not as familiar with it as I
 am with Ubuntu (especially CentOS 7). Does anyone know what change in
 RHEL7/CentOS7 might be causing this? Or does anyone have any other ideas on
 how to troubleshoot the issue?

  I currently have access to two instances in this state, so I'd be happy
 to act as remote hands and eyes. :)

  Thanks,
 Joe





___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] RHEL 7 / CentOS 7 instances losing their network gateway

2015-01-27 Thread Joe Topjian
Thanks, Kris. I'm going to see if there are any oddities between the version
of dnsmasq packaged with 12.04/Icehouse and systemd-dhcp.

On Tue, Jan 27, 2015 at 9:25 AM, Kris G. Lindgren klindg...@godaddy.com
wrote:

  I can't help as we use config-drive to set networking and are just
 starting to roll out Cent7 vm's.  However, a huge change from Cent6 to
 Cent7 was the switch from upstart/dhclient to systemd/systemd-dhcp.
  

 Kris Lindgren
 Senior Linux Systems Engineer
 GoDaddy, LLC.



   From: Joe Topjian j...@topjian.net
 Date: Tuesday, January 27, 2015 at 9:08 AM
 To: openstack-operators@lists.openstack.org 
 openstack-operators@lists.openstack.org
 Subject: [Openstack-operators] RHEL 7 / CentOS 7 instances losing their
 network gateway

   Hello,

  I have run into two different OpenStack clouds where instances running
 either RHEL 7 or CentOS 7 images are randomly losing their network gateway.

  There's nothing in the logs that show any indication of why. There's no
 DHCP hiccup or anything like that. The gateway has just disappeared.

  If I log into the instance via another instance (so on the same subnet
 since there's no gateway), I can manually re-add the gateway and everything
 works... until it loses it again.

  One cloud is running Havana and the other is running Icehouse. Both are
 using nova-network and both are Ubuntu 12.04.

  On the Havana cloud, we decided to install the dnsmasq package from
 Ubuntu 14.04. This looks to have resolved the issue as this was back in
 November and I haven't heard an update since.

  However, we don't want to do that just yet on the Icehouse cloud. We'd
 like to understand exactly why this is happening and why updating dnsmasq
 resolves an issue that only one specific type of image is having.

  I can make my way around CentOS, but I'm not as familiar with it as I am
 with Ubuntu (especially CentOS 7). Does anyone know what change in
 RHEL7/CentOS7 might be causing this? Or does anyone have any other ideas on
 how to troubleshoot the issue?

  I currently have access to two instances in this state, so I'd be happy
 to act as remote hands and eyes. :)

  Thanks,
 Joe

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] RHEL 7 / CentOS 7 instances losing their network gateway

2015-01-27 Thread Joe Topjian
Hi George,

All instances have only a single interface.

Thanks,
Joe

On Tue, Jan 27, 2015 at 1:38 PM, George Shuklin george.shuk...@gmail.com
wrote:

   How many network interfaces does your instance have? If more than one, check
  the settings for the second network (subnet). It can have its own DHCP settings,
  which may mess with the routes for the main network.


 On 01/27/2015 06:08 PM, Joe Topjian wrote:

 Hello,

  I have run into two different OpenStack clouds where instances running
 either RHEL 7 or CentOS 7 images are randomly losing their network gateway.

  There's nothing in the logs that show any indication of why. There's no
 DHCP hiccup or anything like that. The gateway has just disappeared.

  If I log into the instance via another instance (so on the same subnet
 since there's no gateway), I can manually re-add the gateway and everything
 works... until it loses it again.

  One cloud is running Havana and the other is running Icehouse. Both are
 using nova-network and both are Ubuntu 12.04.

  On the Havana cloud, we decided to install the dnsmasq package from
 Ubuntu 14.04. This looks to have resolved the issue as this was back in
 November and I haven't heard an update since.

  However, we don't want to do that just yet on the Icehouse cloud. We'd
 like to understand exactly why this is happening and why updating dnsmasq
 resolves an issue that only one specific type of image is having.

  I can make my way around CentOS, but I'm not as familiar with it as I am
 with Ubuntu (especially CentOS 7). Does anyone know what change in
 RHEL7/CentOS7 might be causing this? Or does anyone have any other ideas on
 how to troubleshoot the issue?

  I currently have access to two instances in this state, so I'd be happy
 to act as remote hands and eyes. :)

  Thanks,
 Joe




___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] RHEL 7 / CentOS 7 instances losing their network gateway

2015-01-27 Thread Joe Topjian
Hello,

I have run into two different OpenStack clouds where instances running
either RHEL 7 or CentOS 7 images are randomly losing their network gateway.

There's nothing in the logs that show any indication of why. There's no
DHCP hiccup or anything like that. The gateway has just disappeared.

If I log into the instance via another instance (so on the same subnet
since there's no gateway), I can manually re-add the gateway and everything
works... until it loses it again.

One cloud is running Havana and the other is running Icehouse. Both are
using nova-network and both are Ubuntu 12.04.

On the Havana cloud, we decided to install the dnsmasq package from Ubuntu
14.04. This looks to have resolved the issue as this was back in November
and I haven't heard an update since.

However, we don't want to do that just yet on the Icehouse cloud. We'd like
to understand exactly why this is happening and why updating dnsmasq
resolves an issue that only one specific type of image is having.

I can make my way around CentOS, but I'm not as familiar with it as I am
with Ubuntu (especially CentOS 7). Does anyone know what change in
RHEL7/CentOS7 might be causing this? Or does anyone have any other ideas on
how to troubleshoot the issue?

I currently have access to two instances in this state, so I'd be happy to
act as remote hands and eyes. :)

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] Ops Monitoring

2014-12-17 Thread Joe Topjian
Hi all,

After far too long, I have compiled most of what was discussed at the
previous two Ops Monitoring sessions into this wiki page:

https://wiki.openstack.org/wiki/Operations/Monitoring

The wiki page has some action items. The intention is that if anyone wants
to contribute monitoring knowledge to the OpenStack Operators community,
that'd be an excellent place to start.

Further, since it's on the wiki, anyone is more than welcome to add content
-- you don't have to wait for a Meetup session. :)

I think it'd be really cool if one outcome was an actual OpenStack
Monitoring Guide that detailed each OpenStack component, what to monitor on
that component, and examples of how.

I was very close to adding Monitoring to the Working Groups wiki page, but
after reviewing the other groups, I decided not to. From reviewing the
other groups, it's obvious that I would be a horrible person to lead an
official working group. To be frank, I just don't have the time to give it
the attention it would deserve.

With that said, if anyone wants to jump in and lead an official Monitoring
WG, go for it! I'm happy to help where I can, including continuing to
moderate the Ops Meetup sessions and contributing knowledge where and when
I can.

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] qemu 1.x to 2.0

2014-10-29 Thread Joe Topjian
Hi all,

I had some more fun with this this morning.

libvirt 1.2.2-0ubuntu13.1.6 became available in the Ubuntu cloud archive
precise-updates repository. Installing it on compute nodes removed
apparmor. When trying to restart an instance, libvirt complained that it
couldn't find the apparmor profile.

When I attempted to reinstall apparmor, apt told me it would also remove
nova-* and libvirt...

I ended up digging up a cached copy of the previous libvirt-bin and
libvirt0 packages (1.2.2-0ubuntu13.1.2~cloud0) and installing that, then
reinstalling apparmor. Everything is now happy.
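
For anyone hitting the same thing, the downgrade was roughly the following
(version strings as above; adjust to whatever your apt cache or mirror still
has):

# pin back to the previous cloud-archive build, then put apparmor back
apt-get install libvirt-bin=1.2.2-0ubuntu13.1.2~cloud0 libvirt0=1.2.2-0ubuntu13.1.2~cloud0
apt-get install apparmor
# optionally hold the packages so unattended upgrades don't pull 13.1.6 back in
echo libvirt-bin hold | dpkg --set-selections
echo libvirt0 hold | dpkg --set-selections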

I opened a bug report about it here:

https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1387251

Optimistically, though, this new version of libvirt attempts to bridge the
gap between Ubuntu 12.04 and 14.04. Information can be found here:

https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1374622
https://wiki.ubuntu.com/QemuPTMigration

Though I have a feeling that this new version should not actually be in the
precise repo...

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Guaranteed Resources

2014-10-24 Thread Joe Topjian
Thanks, Simon. That's one idea that we were thinking of -- sort of a DIY
reservation system that the users can handle on their own.
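
A rough sketch of what that DIY approach could look like with the plain nova
CLI (the flavor, image, and instance names below are just placeholders):

# pre-create the "reserved" instance, then shelve it so it keeps holding the
# project's quota (and, per the Blazar approach described below, stays
# tracked) while not occupying a hypervisor once it is offloaded
nova boot --flavor m1.large --image centos7 reserved-worker-1
nova shelve reserved-worker-1
# ...later, when the capacity is actually needed:
nova unshelve reserved-worker-1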

On Fri, Oct 24, 2014 at 1:38 AM, Simon Pasquier spasqu...@mirantis.com
wrote:

 Hello Joe,
 I would have recommended to have a look at Blazar but since you already
 did... Maybe your users could mimic how Blazar accomplishes resource
 reservation? IIUC Blazar will spawn the reserved instances but in shelved
 mode [1] so they won't consume any cloud resources but they will still be
 accounted by the resource tracker. When the lease starts, Blazar will
 unshelve the instances.
 HTH
 Simon
 [1] https://wiki.openstack.org/wiki/Blazar#Virtual_instance_reservation

 On Thu, Oct 23, 2014 at 5:20 PM, Joe Topjian j...@topjian.net wrote:

 Hello,

 I'm sure some of you have run into this situation before and I'm
 wondering how you've dealt with it:

 A user requests that they must have access to a certain amount of
 resources at all times. This is to prevent them from being unable to launch
 instances in the cloud when the cloud is at full capacity.

 I've always seen the nova.reservations table, so I thought there was some
 simple reservation system in OpenStack but never got around to looking into
 it. I think I was totally wrong about what that table does -- it looks like
 it's just used to assist in deducting resources from a user's quota when
 they launch an instance.

 There are also projects like Climate/Blazar, but a cursory look says it
 requires Keystone v3, which we're not using right now.

 Curiously, the quotas table has a column called hard_limit which would
 make one think that there was such a thing as a soft_limit, but that's
 not the case, either. I see a few blueprints about adding soft limits, but
 nothing in place.

 Has anyone cooked up their own solution for this?

 Thanks,
 Joe




___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Console security question when using nova-novncproxy to access console

2014-10-22 Thread Joe Topjian
Hi Niall,

It looks like vnc password support was removed from the vmware driver last
October:

https://github.com/openstack/nova/commit/058ea40e7b7fb2181a2058e6118dce3f051e1ff3

For libvirt, there is an option in qemu.conf for vnc_password, but I'm
not sure how it would work with OpenStack.
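
For reference, that option is a plain per-host libvirt setting; a minimal
sketch of what it looks like (whether and how novncproxy would pass such a
password through is exactly the open question):

# /etc/libvirt/qemu.conf on each compute node
# applies to every VNC-enabled guest started by this libvirtd
vnc_password = "XYZ12345"
# libvirtd needs a restart, and guests a power cycle, for it to take effect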

Thanks,
Joe


On Tue, Oct 21, 2014 at 9:30 PM, Niall Power niall.po...@oracle.com wrote:

 Hi all,

 I have a question about a security consideration on a compute node when
 using nova-novncproxy for console access.

 Is there any existing mechanism within Nova to automatically authenticate
 against the VNC console of an instance
 (I'm talking about plain old VNC authentication), or to generally prevent
 unauthorized local user accounts on the compute-node from accessing the VNC
 console of an instance?

 I understand that nova-novnc proxy and websockify bridge between the
 public network and the private internal/infrastructure network of the
 compute-node using wss:// to secure and encrypt the connection over the
 public network. I also understand that VNC authentication is comparatively
 very weak.

 This is perhaps only an issue when the compute-node is also permitting
 traditional Unix type user logins.
 Let's say we have an instance running on the compute-node and the
 hypervisor or container manager serves out the console over VNC on a known
 port and the tenant has authenticated and logged in on the console using
 Horizon, perhaps as the administrator. A local user on the compute node, if
 they specified the correct port, could in theory then access the console
 and the administrative account of that instance without needing to
 authenticate.

 VNC authentication using password (and optionally username) would seem
 like the traditional way to prevent such unauthorized access. I can't find
 anything within the Nova code base that seems to cater for password
 authentication with the VNC server. For example the vmware nova driver
 returns the following dictionary
 of parameters for an instance console in vmops.py:get_vnc_console():
{'host': CONF.vmware.host_ip,
 'port': self._get_vnc_port(vm_ref),
 'internal_access_path': None}

 No suggestion of a password to authenticate with the VNC server. Is this
 intentionally not supported, lacking, or is there perhaps simply a better
 way to address this problem?

 Thanks in advance!
 Niall Power



___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] qemu 1.x to 2.0

2014-10-19 Thread Joe Topjian
Hello,

We recently upgraded an OpenStack Grizzly environment to Icehouse (doing a
quick stop-over at Havana). This environment is still running Ubuntu 12.04.

The Ubuntu 14.04 release notes
https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes#Ubuntu_Server mention
incompatibilities when moving from 12.04 to 14.04 and qemu 2.0. I didn't
think this would apply to upgrades staying on 12.04, but it indeed does.

We found that existing instances could not be live migrated (as per the
release notes). Additionally, instances that were hard-rebooted and had the
libvirt xml file rebuilt could no longer start, either.

The exact error message we saw was:

Length mismatch: vga.vram: 100 in != 80

I found a few bugs that are related to this, but I don't think they're
fully relevant to the issue I ran into:

https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1308756
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1291321
https://bugs.launchpad.net/nova/+bug/1312133

We ended up downgrading to the stock Ubuntu 12.04 qemu 1.0 packages and
everything is working nicely.

I'm wondering if anyone else has run into this issue and how they dealt
with it or plan to deal with it.

Also, I'm curious as to why exactly qemu 1.x and 2.0 are incompatible with
each other. Is this just an Ubuntu issue? Or is this native to qemu?

Unless I'm missing something, this seems like a big deal. If we continue to
use Ubuntu's OpenStack packages, we're basically stuck at 12.04 and
Icehouse unless we have all users snapshot their instance and re-launch in
a new cloud.

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] rsyslog update caused services to break?

2014-10-13 Thread Joe Topjian
That's really interesting - thanks for the link.

We've been able to narrow down why we didn't see this with Cinder or Swift: in
short, Swift isn't using Oslo (AFAIK), and we had some previous logging
issues with Cinder in Havana, so we altered the logging setup a bit.


On Mon, Oct 13, 2014 at 3:05 AM, Francois Deppierraz franc...@ctrlaltdel.ch
 wrote:

 Hi Joe,

 Yes, same problem here running Ubuntu 14.04.

 The symptom is nova-api, nova-conductor and glance-api eating all CPU
 without responding to API requests anymore.

 It is possible to reproduce it thanks to the following script.

 https://gist.github.com/dbishop/7a2e224f3aafea1a1fc3

 François

 On 11. 10. 14 00:40, Joe Topjian wrote:
  Hello,
 
  This morning we noticed various nova, glance, and keystone services (not
  cinder or swift) not working in two different clouds and required a
 restart.
 
  We thought it was a network issue since one of the only commonalities
  between the two clouds was that they are on the same network.
 
  Then later in the day I logged into a test cloud on a totally separate
  network and had the same problem.
 
  Looking at all three environments, the commonality is now that they have
  Ubuntu security updates automatically applied in the morning and this
  morning rsyslog was patched and restarted.
 
  I found this oslo bug that kind of sounds like the issue we saw:
 
  https://bugs.launchpad.net/oslo.log/+bug/1076466
 
  Doing further investigation, log files do indeed show a lack of entries
  for various services/daemons until they were restarted.
 
  Has anyone else run into this? Maybe even this morning, too? :)
 
  Thanks,
  Joe
 
 
 
 

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] rsyslog update caused services to break?

2014-10-10 Thread Joe Topjian
Hello,

This morning we noticed various nova, glance, and keystone services (not
cinder or swift) not working in two different clouds; they required a restart.

We thought it was a network issue since one of the only commonalities
between the two clouds was that they are on the same network.

Then later in the day I logged into a test cloud on a totally separate
network and had the same problem.

Looking at all three environments, the commonality is now that they have
Ubuntu security updates automatically applied in the morning and this
morning rsyslog was patched and restarted.

I found this oslo bug that kind of sounds like the issue we saw:

https://bugs.launchpad.net/oslo.log/+bug/1076466

Doing further investigation, log files do indeed show a lack of entries for
various services/daemons until they were restarted.

Has anyone else run into this? Maybe even this morning, too? :)

Thanks,
Joe
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] [glance] how to update the contents of an image

2014-10-08 Thread Joe Topjian
We just ran some tests on our production Icehouse environment and can
confirm that:

* Snapshotting an instance based on a deleted image works
* Snapshotting an instance based on a public-turned-private image works
* Block migration of an instance based on a deleted image works

This environment does not utilize _base images.

We didn't do any resizing tests as we do not have our environment
configured to allow it. At the moment, if a user tries resizing they
receive an error message. It's not a friendly way to disable an action so
we plan on just removing the option from Horizon altogether.

We also tested the following with an older Grizzly environment that
supports live migration:

* Snapshotting an instance based on a deleted image (and base image) works
* Live migrating an instance based on a deleted image (and base image) works

Resizing is not supported in this environment as well.

I'm curious: what does your environment look like where these actions are failing for you?

Thanks,
Joe

On Tue, Oct 7, 2014 at 3:03 PM, George Shuklin george.shuk...@gmail.com
wrote:

 Yep, this bug is still present. Resize, migration and so on do not work
 if the original image is deleted. And 'I will never remove any public image'
 will not help, because if a user restores an instance from a snapshot and
 removes the snapshot, it will cause the error too. That looks stupid in the
 light of (our) raw disk format, where the 'base copy' is just never used.

 Error is harmless and can be fixed by

 nova reset-state --active UUID
 (optionally)
 nova stop UUID
 nova start UUID

 But it's still annoying because a user cannot fix their own instance without
 administrator intervention.

 We have plans to fix it for Havana (it causes disruption for us and
 inconvenience for our customers), but the fix will not be accepted by upstream
 (they drop support for older releases ASAP), so I think we should switch to an
 old-style mailing-list-based patch RFC.

 On 10/07/2014 10:16 PM, Jan van Eldik wrote:

 Hi,

 Please note the issue reported in
 https://bugs.launchpad.net/nova/+bug/1160773: Cannot resize instance if
 base image is not available

 AFAIK it is still the case that instances cannot be resized or migrated
 once the image from which they were created has been deleted.

   cheers, Jan

 On 10/07/2014 09:01 PM, Abel Lopez wrote:

 Right, and I think the best thing about marking a deprecated image
 private is that
 new instances can’t select that image unless the tenant is an
 image-member of it.
 So if a specific tenant has some “real valid” need to use the old
 version (I can’t imagine why), they could find it in “Project Images”
 instead of “Public”.

 On Oct 7, 2014, at 11:57 AM, Sławek Kapłoński sla...@kaplonski.pl
 wrote:

  Hello,

 Yes, I agree that this is not a big problem when there is Image not found
 info in Horizon, but I saw this discussion and I thought that I would ask
 about it :) It would be nice to have some other info, like for example: Image 1
 (archived) or something like that :)

 ---
 Best regards
 Sławek Kapłoński
 sla...@kaplonski.pl

 On Tuesday, 7 October 2014 at 18:21:13, you wrote:

 I've never worried about Image not Found, as it's only a UI concern. IMO
 it lets the users know something has changed. Totally optional, and the
 same effect can be gained by just renaming it -OLD and leaving it public.
 At some point, it still needs to be removed.

 On Tuesday, October 7, 2014, Sławek Kapłoński sla...@kaplonski.pl
 wrote:

 Hello,

 I use your solution and I made the old images private in such a change, but
 then there is one more problem: all instances spawned from those old images
 show image: not found info in Horizon.
 Do you maybe know how to solve that?
 in horizon info about image: not found.
 Do You maybe know how to solve that?

 ---
 Best regards
 Sławek Kapłoński
 sla...@kaplonski.pl

 On Tuesday, 7 October 2014 at 10:05:57, Abel Lopez wrote:

 You are correct, deleted images are not deleted from the DB, rather their
 row has ‘deleted=1’, so specifying the UUID of another image already in
 glance for a new image being uploaded will end in tears.

 What I was trying to convey was, when Christian is uploading a new image
 of the same name as an existing image, the UUID will be different. IMO, the
 correct process should be:
 1. Make desired changes to your image.
 2. Rename the existing image (e.g. Fedora-20-OLD)
 3. (optional) Make the old image private ( is-public 0 )
 4. Upload the new image using the desired name (e.g. Fedora-20 or like
 Fedora-20-LATEST)
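
 (Not from Abel's mail, but steps 2-4 with the glance v1 CLI of that era
 could look roughly like the following; the names and file path are examples:)

 glance image-update --name Fedora-20-OLD <OLD_IMAGE_UUID>
 glance image-update --is-public False <OLD_IMAGE_UUID>
 glance image-create --name Fedora-20 --disk-format qcow2 --container-format bare --is-public True --file Fedora-20.qcow2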

 Obviously I assume there was testing for viability of the image
 before
 it
 was uploaded to glance.

 For more information, be sure to catch my talk on Tuesday 9am at the summit.

  On Oct 7, 2014, at 9:58 AM, George Shuklin george.shuk...@gmail.com wrote:

  As far as I know, it is not possible to assign the UUID from a deleted
  image to the new one, because deleted images keep their metadata in the DB.

  On 09/26/2014 04:43 PM, Abel Lopez wrote:

  Glance images are immutable. In order to update it, you should do as


 

Re: [Openstack-operators] Problem creating resizable CentOS 6.5 image

2014-10-06 Thread Joe Topjian
Does this cover the scenario of a user launching CentOS 6.x, updating the
kernel, snapshotting, and having the relaunched instance resized?


On Mon, Oct 6, 2014 at 12:38 PM, Regan McDonald re...@wavefunction.org
wrote:

 Seconded. This is what I use with my CentOS images, and it works great.


 On Mon, Oct 6, 2014 at 1:54 PM, Robert Plestenjak 
 robert.plesten...@xlab.si wrote:

 Try this:

 https://github.com/flegmatik/linux-rootfs-resize

 - Robert

 - Original Message -
 From: Antonio Messina antonio.s.mess...@gmail.com
 To: Robert van Leeuwen robert.vanleeu...@spilgames.com
 Cc: openstack-operators@lists.openstack.org
 Sent: Friday, October 3, 2014 2:50:44 PM
 Subject: Re: [Openstack-operators] Problem creating resizable CentOS 6.5
   image

 I use this snippet in my %post section. I don't find it particularly
 elegant, but it works just fine:

 # Set up to grow root in initramfs
 cat << EOF > 05-grow-root.sh
 #!/bin/sh

 /bin/echo
 /bin/echo Resizing root filesystem

 # feed fdisk the keystrokes to delete and recreate partition 1 over the whole disk
 /bin/echo "d
 n
 p
 1


 w
 " | /sbin/fdisk -c -u /dev/vda
 /sbin/e2fsck -f /dev/vda1
 /sbin/resize2fs /dev/vda1
 EOF

 chmod +x 05-grow-root.sh

 # build an extra initramfs that carries the script plus the tools it needs
 dracut --force --include 05-grow-root.sh /mount --install 'echo fdisk e2fsck resize2fs' /boot/initramfs-grow_root-$(ls /boot/|grep initramfs|sed s/initramfs-//g) $(ls /boot/|grep vmlinuz|sed s/vmlinuz-//g)
 rm -f 05-grow-root.sh

 # append a boot entry that uses the grow_root initramfs
 tail -4 /boot/grub/grub.conf | sed s/initramfs/initramfs-grow_root/g | sed s/CentOS/ResizePartition/g | sed s/crashkernel=auto/crashkernel=0@0/g >> /boot/grub/grub.conf

 It only works if the root filesystem is `/dev/vda1` (which is a very
 common setup anyway) but can be adapted.

 I only tested it with CentOS 5 and 6. The full script is available at
 https://github.com/gc3-uzh-ch/openstack-tools/

 .a.


 --
 antonio.s.mess...@gmail.com
 antonio.mess...@uzh.ch +41 (0)44 635 42 22
 S3IT: Service and Support for Science IT   http://www.s3it.uzh.ch/
 University of Zurich
 Winterthurerstrasse 190
 CH-8057 Zurich Switzerland



___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Nodes and configurations management in Puppet

2014-09-25 Thread Joe Topjian
Hi Mathieu,

My setup is very similar to yours. Node definitions are in site.pp and
Hiera is used for all configuration. The Hiera hierarchies are also very
similar.

Overall, I have a love/hate relationship with the setup. I could go on in
detail, but it'd be all Puppet-specific rather than OpenStack. I'd be happy
to discuss off-list.

Or if there's enough interest, I can post it here. I just don't want to
muddy up this list with non-OpenStack things.

Thanks,
Joe


On Thu, Sep 25, 2014 at 8:40 AM, Mathieu Gagné mga...@iweb.com wrote:

 Hi,

 Some of you use Puppet to manage your OpenStack infrastructure.

 - How do you manage your node definitions?
   Do you have an external ENC?
   Or plain site.pp, Puppet Enterprise, theforeman, etc. ?

 - How about your configuration?
   Do you use Hiera? Or do you rely on the ENC to manage them?


 My question is related to the complexity that managing multiple OpenStack
 environments (staging/production), regions and cells involves over time.

  Is there a magical way to manage node definitions and *especially*
  configurations so you guys don't have a heart attack each time you have to dig
  into them? How about versioning?


 To answer my own questions and start the discussion:

  I don't use an external ENC. The site.pp manifest has been the one used
  since day one. Since we have a strong host naming convention, I haven't hit
  the limits of this model (yet). Regex has been a good friend so far.

  As for configurations, Hiera is used to organize them with a hierarchy to
  manage environment- and region-specific configurations:

   - environments/%{::environment}/regions/%{::openstack_region}/common
   - environments/%{::environment}/common
   - common
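
 For what it's worth, that hierarchy translates into roughly this hiera.yaml
 (Hiera 1.x syntax; the datadir is only an example):

 ---
 :backends:
   - yaml
 :yaml:
   :datadir: /etc/puppet/hieradata
 :hierarchy:
   - "environments/%{::environment}/regions/%{::openstack_region}/common"
   - "environments/%{::environment}/common"
   - common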

 I'm still exploring solutions for cells.

 How about you guys?

 --
 Mathieu


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators