Re: [openstack-dev] [nova] Running large instances with CPU pinning and OOM

2017-09-28 Thread Premysl Kouril
>
> Only the memory mapped for the guest is striclty allocated from the
> NUMA node selected. The QEMU overhead should float on the host NUMA
> nodes. So it seems that the "reserved_host_memory_mb" is enough.
>

Even if that were true and the overhead memory could float across host
NUMA nodes, it generally doesn't prevent us from running into OOM
trouble. No matter where (on which NUMA node) the overhead memory gets
allocated, it is not included in the available-memory calculation for
that NUMA node when a new instance is provisioned, and thus it can
cause an OOM once the guest operating system of the newly provisioned
instance actually starts allocating memory, which can only come from
its assigned NUMA node.
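
For illustration, something like the following can be used to see where
the QEMU overhead actually lands (numastat comes from the numactl
package; the process and instance names are just examples and may
differ per distro/deployment):

# per-NUMA-node memory breakdown of one QEMU process
numastat -p $(pgrep -f instance-00000001)
# per-node used/free memory on the whole host
numastat -m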

Prema

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Running large instances with CPU pinning and OOM

2017-09-27 Thread Premysl Kouril
> Lastly, qemu has overhead that varies depending on what you're doing in the
> guest.  In particular, there are various IO queues that can consume
> significant amounts of memory.  The company that I work for put in a good
> bit of effort engineering things so that they work more reliably, and part
> of that was determining how much memory to reserve for the host.
>
> Chris

Hi, I work with Jakub (the OP of this thread) and here are my two
cents: I think what is critical to realize is that KVM virtual
machines can have a substantial memory overhead - up to 25% of the
memory allocated to the virtual machine itself. This overhead memory
is not considered in the nova code when calculating whether the
instance being provisioned actually fits into the host's available
resources (only the memory configured in the instance's flavor is
considered). This is especially a problem when CPU pinning is used,
because the memory allocation is then bounded by the limits of a
specific NUMA node (due to the strict memory allocation mode). This
renders the global reservation parameter reserved_host_memory_mb
useless, as it doesn't take NUMA into account.

This KVM virtual machine overhead is what is causing the OOMs in our
infrastructure and that's what we need to fix.
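
As a rough illustration only (the numbers below are hypothetical; the
overhead range just reflects the up-to-25% figure mentioned above):

  NUMA node size:                           64 GB
  flavor memory, strictly pinned to node:   60 GB  (all that nova accounts for)
  QEMU/KVM overhead (IO queues, buffers):   3-15 GB (not accounted per node)
  => the node can run out of memory even though nova's calculation says
     the instance fits, and the kernel OOM killer fires on that node

And reserved_host_memory_mb in nova.conf only subtracts one host-wide
amount, so there is no way to express "reserve N MB per NUMA node".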

Regards,
Prema

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Monasca] alarms based on events

2016-01-23 Thread Premysl Kouril
Hi Roland,

> I don't think it would be difficult to add support for non-periodic 
> metrics/alarms. There are a couple of approaches we could take, so a design 
> discussion would be good to have if you are interested in implementing this. 
> This is feature that we are not working on right now, but it is on our list 
> to implement in the near future.

Definitely interested, so if there is a discussion I am happy to join
and outline our use cases.

Regards,
Prema

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] sponsor some LVM development

2016-01-22 Thread Premysl Kouril
On 22 Jan 2016 17:43, "James Bottomley" <
james.bottom...@hansenpartnership.com> wrote:
>
> On Fri, 2016-01-22 at 14:58 +0100, Premysl Kouril wrote:
> > Hi Matt, James,
> >
> > any thoughts on the below notes?
>
> To be honest, not really.  You've repeated stage two of the Oracle
> argument: wheel out benchmarks and attack alleged "complexity".

Sorry for that; it is true that this is just our conclusion and it is
not based on any deep analysis.

> don't really have a great interest in repeating a historical argument.
>  Oracle didn't get it either until they released the feature and ran
> into the huge management complexity of raw devices in the field, so if
> you have the resources to repeat the experiment and see if you get
> different results, be my guest.
>

> The lesson I took from the Oracle affair all those years ago is that
> it's far harder to replace well understood and functional file
> interfaces with new ones (mainly because of the tooling and historical
> understanding that comes with the old ones) than it is to gain
> performance in existing interfaces.

Ok, point taken. I agree this can be a problem.

>
> The 3x difference in the benchmarks would seem to indicate a local
> tuning or configuration problem, because it's not what most people see.
>  What the current benchmarks seem to show is about a 1-5% difference
> between the directio and the direct to block paths depending on fstype,
> how its tuned, ioscheduler and underlying device.

I will try to find the problem in our configuration and re-run our
benchmarks. Do you by any chance have more information (and possibly
the configuration) about the benchmarks you are mentioning?

Thanks,
Prema
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Monasca] alarms based on events

2016-01-22 Thread Premysl Kouril
Hi Roland,

I had a chat with people on IRC about it, and I understood that in order
for my use case to work, Monasca needs to implement "non-periodic"
metrics (because the monitored box sends out the event only on a health
change, not periodically), and that this enhancement is currently being
designed.

Is that correct?
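
(For completeness, my reading of Roland's suggestion quoted below,
sketched with the python-monascaclient CLI - the metric name, value and
dimensions are made up and the exact syntax may differ:)

# a trap handler posts 1 on a "box unhealthy" trap and 0 on "box healthy"
monasca metric-create snmp.box_health 1 --dimensions hostname=box01
# a stateful alarm definition that goes ALARM while the last value is > 0
monasca alarm-definition-create "box01 unhealthy" "max(snmp.box_health{hostname=box01}) > 0"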

Cheers,
Prema
On 22 Jan 2016 01:04, "Hochmuth, Roland M" <roland.hochm...@hpe.com> wrote:

> Hi Prema, SNMP isn't handled in Monasca and I have little experience in
> that area. This would be new development.
>
> It is possible to map binary data, such as health/status of a system or
> component. The usual way is to use the value 0 for up/OK and 1 for
> down/NOT_OK. A component would need to be developed to handle SNMP traps,
> then translate and send them to the Monasca API as binary data. Possibly,
> this component could be added to the Agent.
>
> Using the Monasca Alarm API, an alarm could be defined, such as
> max(snmp{}) > 0.
>
> The latency for a min/max alarm expression in Monasca is very low.
>
> Regards --Roland
>
>
> On 1/18/16, 9:07 AM, "Premysl Kouril" <premysl.kou...@gmail.com> wrote:
>
> >Hello,
> >
> >we are just evaluating Monasca for our new cloud infrastructure and I
> >would like to ask if there are any possibilities in current Monasca or
> >some development plans to address following use case:
> >
> >We have a box which we need to monitor and when something goes wrong
> >with the box, it sends out and SNMP trap indicating that it is in bad
> >condition and when the box is fixed it sends out SNMP trap indicating
> >that it is OK and operational again (in other words: the box is
> >indicating health state transitions by sending events - in this case
> >SNMP traps).
> >
> >Is it possible in Monasca to define such alarm which would work on top
> >of such events? In other words - Is it possible to have a Monasca
> >alarm which would go red on some external event go back green on some
> >other external event? By alarm I really mean a stateful entity in
> >monasca database not some notification to administrator.
> >
> >Best regards.
> >Prema
> >
> >__
> >OpenStack Development Mailing List (not for usage questions)
> >Unsubscribe:
> openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> >http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] sponsor some LVM development

2016-01-22 Thread Premysl Kouril
Hi Matt, James,

any thoughts on the below notes?

Best Regards,
Prema
On 19 Jan 2016 20:47, "Premysl Kouril" <premysl.kou...@gmail.com> wrote:

> Hi James,
>
>
> >
> > You still haven't answered Anita's question: when you say "sponsor" do
> > you mean provide resources to existing developers to work on your
> > feature or provide new developers.
> >
>
> I did, I am copy-pasting my response to Anita here again:
>
> Both. We are first trying the "Are you asking for current Nova
> developers to work on this feature?" option, and if we don't find
> anybody we will move on to "Is your company interested in having your
> developers interact with Nova developers?"
>
>
> >
> > Heh, this is history repeating itself from over a decade ago when
> > Oracle would have confidently told you that Linux had to have raw
> > devices because that's the only way a database will perform.  Fast
> > forward to today and all oracle databases use file backends.
> >
> > Simplicity is also in the eye of the beholder.  LVM has a very simple
> > naming structure whereas filesystems have complex hierarchical ones.
> >  Once you start trying to scale to millions of instances, you'll find
> > there's quite a management penalty for the LVM simplicity.
>
> We definitely won't have millions of instances on a hypervisor, but we
> can certainly have applications demanding a million IOPS (in sum) from
> a hypervisor in the near future.
>
> >
> >>  It seems from our benchmarks that LVM behavior when
> >> processing many IOPs (10s of thousands) is more stable than if
> >> filesystem is used as backend.
> >
> > It sounds like you haven't enabled directio here ... that was the
> > solution to the oracle issue.
>
>
> If you mean O_DIRECT mode, then we did have that enabled during our benchmarks.
> Here is our benchmark setup and results:
>
> testing box configuration:
>
>   CPU: 4x E7-8867 v3 (total of 64 physical cores)
>   RAM: 1TB
>   Storage: 12x enterprise class SSD disks (each disk 140 000/120 000
> IOPS read/write)
> disks connected via 12Gb/s SAS3 lanes
>
>   So we are using big boxes which can run quite a lot of VMs.
>
>   Out of the disks we create linux md raid (we did raid5 and raid10)
> and do some fine tuning:
>
> 1) echo 8 > /sys/block/md127/md/group_thread_cnt - this increases
> parallelism for raid5
> 2) we boot the kernel with scsi_mod.use_blk_mq=Y to activate block IO
> multi-queueing
> 3) we increase size of caching (for raid5)
>
>  On that raid we either create LVM group or filesystem depending if we
> are testing LVM nova backend or file-based nova backend.
>
>
> On this hypervisor we run nova/kvm and we provision 10-20 VMs and we
> run benchmark tests from these VMs and we are trying to saturate IO on
> hypervisor.
>
> We use following command running inside the VMs:
>
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
> --name=test1 --bs=4k --iodepth=256 --size=20G --numjobs=1
> --readwrite=randwrite
>
> So you can see that in the guest OS we use --direct=1 which causes the
> test file to be opened with O_DIRECT. Actually I am now not sure but
> if using file-based backend then I hope that the virtual disk is
> automatically opened with O_DIRECT and that it is done by libvirt/qemu
> by default without any explicit configuration.
>
> Anyway, with this we have following results:
>
> If we use file-based backend in Nova, ext4 filesystem and RAID5 then
> in 8 parallel VMs we were able to achieve ~3000 IOPS per machine which
> means in total about 32000 IOPS.
>
> If we use the LVM-based backend, RAID5 and 8 parallel VMs, we achieve
> ~11000 IOPS per machine, roughly 90 000 IOPS in total.
>
> This is a significant difference.
>
> This test was done about half a year ago by one of our engineers who
> no longer works for us but we still do have the box and everything, so
> if community is interested I can re-run the tests, again validate
> results, do any reconfiguration etc.
>
>
>
> > And this was precisely the Oracle argument.  The reason it foundered is
> > that most FS complexity goes to manage the data structures ... the I/O
> > path can still be made short and fast, as DirectIO demonstrates.  Then
> > the management penalty you pay (having to manage all the data
> > structures that the filesystem would have managed for you) starts to
> > outweigh any minor performance advantages.
>
> The only thing O_DIRECT does is that it instructs the kernel to skip
> filesystem cache for the file opened in this mode. Rest of the
> filesystem complexity remains in the IO's datapath. Note for

Re: [openstack-dev] [Nova] sponsor some LVM development

2016-01-19 Thread Premysl Kouril
> I'm not a Nova developer. I am interesting in clarifying what you are
> asking.
>
> Are you asking for current Nova developers to work on this feature? Or
> s your company interested in having your developers interact with Nova
> developers?
>
> Thank you,
> Anita.


Both. We are first trying the "Are you asking for current Nova
developers to work on this feature?" option, and if we don't find
anybody we will move on to "Is your company interested in having your
developers interact with Nova developers?"

Thanks,
Prema

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] sponsor some LVM development

2016-01-19 Thread Premysl Kouril
Hi James,


>
> You still haven't answered Anita's question: when you say "sponsor" do
> you mean provide resources to existing developers to work on your
> feature or provide new developers.
>

I did, I am copy-pasting my response to Anita here again:

Both. We are first trying the "Are you asking for current Nova
developers to work on this feature?" option, and if we don't find
anybody we will move on to "Is your company interested in having your
developers interact with Nova developers?"


>
> Heh, this is history repeating itself from over a decade ago when
> Oracle would have confidently told you that Linux had to have raw
> devices because that's the only way a database will perform.  Fast
> forward to today and all oracle databases use file backends.
>
> Simplicity is also in the eye of the beholder.  LVM has a very simple
> naming structure whereas filesystems have complex hierarchical ones.
>  Once you start trying to scale to millions of instances, you'll find
> there's quite a management penalty for the LVM simplicity.

We definitely won't have millions of instances on a hypervisor, but we
can certainly have applications demanding a million IOPS (in sum) from
a hypervisor in the near future.

>
>>  It seems from our benchmarks that LVM behavior when
>> processing many IOPs (10s of thousands) is more stable than if
>> filesystem is used as backend.
>
> It sounds like you haven't enabled directio here ... that was the
> solution to the oracle issue.


If you mean O_DIRECT mode, then we did have that enabled during our benchmarks.
Here is our benchmark setup and results:

testing box configuration:

  CPU: 4x E7-8867 v3 (total of 64 physical cores)
  RAM: 1TB
  Storage: 12x enterprise class SSD disks (each disk 140 000/120 000
IOPS read/write)
disks connected via 12Gb/s SAS3 lanes

  So we are using big boxes which can run quite a lot of VMs.

  Out of the disks we create linux md raid (we did raid5 and raid10)
and do some fine tuning:

1) echo 8 > /sys/block/md127/md/group_thread_cnt - this increases
parallelism for raid5
2) we boot the kernel with scsi_mod.use_blk_mq=Y to activate block IO multi-queueing
3) we increase size of caching (for raid5)
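
For reference, a hedged sketch of what items 2) and 3) look like in
practice (the device name is an example, and we assume the "caching"
in 3) refers to the raid5/raid6 stripe cache):

# 2) kernel command line parameter
#    scsi_mod.use_blk_mq=Y
# 3) enlarge the md stripe cache (the default is 256)
echo 8192 > /sys/block/md127/md/stripe_cache_size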

 On that raid we create either an LVM volume group or a filesystem,
depending on whether we are testing the LVM nova backend or the
file-based nova backend.


On this hypervisor we run nova/kvm, provision 10-20 VMs, run benchmark
tests from these VMs, and try to saturate the IO on the hypervisor.

We use the following command running inside the VMs:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
--name=test1 --bs=4k --iodepth=256 --size=20G --numjobs=1
--readwrite=randwrite

So you can see that in the guest OS we use --direct=1, which causes the
test file to be opened with O_DIRECT. Actually, I am not sure now, but
when using the file-based backend I hope that the virtual disk itself is
also opened with O_DIRECT on the host, and that libvirt/qemu does this
by default without any explicit configuration.
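
(As far as I know this depends on the libvirt disk cache mode: with
cache='none' QEMU opens the image with O_DIRECT, while the
writeback/writethrough modes go through the host page cache. A quick
way to check on a running instance - the domain name is just an example:

virsh dumpxml instance-00000001 | grep "cache="
)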

Anyway, with this we have the following results:

If we use the file-based backend in Nova, an ext4 filesystem and RAID5,
then with 8 parallel VMs we were able to achieve ~3000 IOPS per machine,
which means about 32000 IOPS in total.

If we use the LVM-based backend, RAID5 and 8 parallel VMs, we achieve
~11000 IOPS per machine, roughly 90 000 IOPS in total.

This is a significant difference.

This test was done about half a year ago by one of our engineers who no
longer works for us, but we still have the box and everything, so if the
community is interested I can re-run the tests, validate the results
again, do any reconfiguration, etc.



> And this was precisely the Oracle argument.  The reason it foundered is
> that most FS complexity goes to manage the data structures ... the I/O
> path can still be made short and fast, as DirectIO demonstrates.  Then
> the management penalty you pay (having to manage all the data
> structures that the filesystem would have managed for you) starts to
> outweigh any minor performance advantages.

The only thing O_DIRECT does is instruct the kernel to skip the page
cache for the file opened in this mode. The rest of the filesystem
complexity remains in the IO datapath. Note, for example, that we did a
test of the file-based backend with BTRFS - the results were absolutely
horrible - there is just too much work the filesystem has to do when
processing IOs, and we believe a lot of it is simply not necessary when
the storage is only used to hold virtual disks.

Anyway, I am really glad that you brought in these views; we are happy
to reconsider our decisions, so let's have a discussion - I am sure we
missed many things when we were evaluating both backends.

One more question: what about Cinder? I think it uses LVM for storing
volumes, right? Why doesn't it use files?

Thanks,
Prema

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: 

Re: [openstack-dev] [Nova] sponsor some LVM development

2016-01-19 Thread Premysl Kouril
Hi Matt,

thanks for letting me know - we will definitely reach out to you if we
start some activity in this area.

To answer your question: the main reasons for LVM are simplicity and
performance. It seems from our benchmarks that LVM behavior when
processing many IOPS (tens of thousands) is more stable than when a
filesystem is used as the backend. Also, a filesystem is generally a
heavier and more complex technology than LVM, and we wanted to stay as
simple as possible on the IO datapath - to make everything
(maintaining, tuning, configuring) easier.
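
(For context, the LVM backend we mean is the libvirt imagebackend
selected in nova.conf roughly like this - the volume group name is just
an example:

[libvirt]
images_type = lvm
images_volume_group = nova-instances
)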

Do you see this as reasonable argumentation? Do you see some major
benefits of the file-based backend over the LVM one?

Cheers,
Prema

On Tue, Jan 19, 2016 at 12:18 PM, Matthew Booth  wrote:
> Hello, Premysl,
>
> I'm not working on these features, however I am working in this area of code
> implementing the libvirt storage pools spec. If anybody does start working
> on this, please reach out to coordinate as I have a bunch of related
> patches. My work should also make your features significantly easier to
> implement.
>
> Out of curiosity, can you explain why you want to use LVM specifically over
> the file-based backends?
>
> Matt

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [Monasca] alarms based on events

2016-01-18 Thread Premysl Kouril
Hello,

we are just evaluating Monasca for our new cloud infrastructure and I
would like to ask whether there are any possibilities in current
Monasca, or any development plans, to address the following use case:

We have a box which we need to monitor. When something goes wrong with
the box, it sends out an SNMP trap indicating that it is in a bad
condition, and when the box is fixed it sends out an SNMP trap
indicating that it is OK and operational again (in other words: the box
indicates health state transitions by sending events - in this case
SNMP traps).

Is it possible in Monasca to define an alarm which would work on top of
such events? In other words: is it possible to have a Monasca alarm
which would go red on some external event and go back green on some
other external event? By alarm I really mean a stateful entity in the
Monasca database, not just a notification to an administrator.

Best regards.
Prema

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [Nova] sponsor some LVM development

2016-01-18 Thread Premysl Kouril
Hello everybody,

we are a Europe-based operator and we have a case for LVM-based nova
instances in our new cloud infrastructure. We are currently considering
contributing to OpenStack Nova to implement some features which are
currently not supported for LVM-based instances (they are only
supported for raw/qcow2 file-based instances). Examples of such
features are nova block live migration and thin provisioning - these
currently don't work with LVM-based instances (they do work for
file-based ones).
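
(To be concrete about the thin provisioning part: at the LVM level this
is just thin pools, roughly as below - the names are examples; the
missing piece is Nova creating instance disks this way instead of fully
allocated LVs:

# create a 500G thin pool in volume group "nova"
lvcreate --size 500G --thinpool instances nova
# create a thinly provisioned 40G volume for an instance disk
lvcreate --virtualsize 40G --thin --name instance-00000001_disk nova/instances
)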

Before actually diving into development here internally, we wanted to
check on the possibility of sponsoring this development within the
existing community. So if there is someone who would be interested in
this work, please drop me an email.

Regards,
Prema

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev