Re: [openstack-dev] [nova][cinder] Limits on volume read throughput?

2016-03-09 Thread Preston L. Bannister
In my case CentOS 7 is using QEMU 1.5.3 ... which is *ancient*. This is on
a node with a packstack install of OpenStack. If you have a different
result, I would like to know why...

Got a bit further in my reading and testing. Also got my raw volume read
performance in an instance from ~300MB/s (with some tweaking) up to the
current ~800MB/s. Given the host raw volume read rate is ~1.2GB/s, and
there are substantial improvements in the software stack (Linux, iSCSI,
QEMU) in later versions ... this is a good result.
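
For reference, the measure itself is just a sequential scan of the raw device,
run first on the host and then inside the instance - roughly this shape (the
device paths are illustrative for my setup):

    # on the physical host, against the LVM-backed Cinder volume
    sync && echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/cinder-volumes/volume-XXXX of=/dev/null bs=1M

    # inside the instance, against the attached volume
    sync && echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/vdb of=/dev/null bs=1M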

Found the main bottleneck was the iSCSI target in the physical Linux host.
(Not, in my current test case, in QEMU.) From online presentations on
QEMU/iSCSI/Linux, there are known large improvements in more recent
versions. Punt. Need to re-test on top of Ubuntu Trusty LTS (what most
customers seem headed toward). Will re-base my testing at some point.

My testing (for simplicity) is on an all-in-one node.

Curious what other folk are getting with very-fast iSCSI targets. What is
the upper range?

On Mon, Mar 7, 2016 at 7:59 AM, Chris Friesen <chris.frie...@windriver.com>
wrote:

> Just a heads-up that the 3.10 kernel in CentOS/RHEL is *not* a stock 3.10
> kernel.  It has had many things backported from later kernels, though they
> may not have backported the specific improvements you're looking for.
>
> I think CentOS is using qemu 2.3, which is pretty new.  Not sure how new
> their libiscsi is though.
>
> Chris
>
> On 03/07/2016 12:25 AM, Preston L. Bannister wrote:
>
>> Should add that the physical host of the moment is Centos 7 with a
>> packstack
>> install of OpenStack. The instance is Ubuntu Trusty. Centos 7 has a
>> relatively
>> old 3.10 Linux kernel.
>>
>>  From the last week (or so) of digging, I found there were substantial
>> claimed
>> improvements in /both/ flash support in Linux and the block I/O path in
>> QEMU -
>> in more recent versions. How much that impacts the current measures, I do
>> not
>> (yet) know.
>>
>> Which suggests a bit of tension. Redhat folk are behind much of these
>> improvements, but RHEL (and Centos) are rather far behind. Existing RHEL
>> customers want and need careful, conservative changes. Folk deploying
>> OpenStack
>> need more aggressive release content, for which Ubuntu is currently the
>> best base.
>>
>> Will we see a "Redhat Cloud Base" as an offering with RHEL support
>> levels, and
>> more aggressive QEMU and Linux kernel inclusion?
>>
>> At least for now, building OpenStack clouds on Ubuntu might be a much
>> better bet.
>>
>>
>> Are those claimed improvements in QEMU and the Linux kernel going to make
>> a
>> difference in my measured result? I do not know. Still reading, building
>> tests,
>> and collecting measures...
>>
>>
>>
>>
>> On Thu, Mar 3, 2016 at 11:28 AM, Chris Friesen <chris.frie...@windriver.com>
>> wrote:
>>
>>     On 03/03/2016 01:13 PM, Preston L. Bannister wrote:
>>
>>      > Scanning the same volume from within the instance still gets the same
>>      > ~450MB/s that I saw before.
>>
>>      Hmmm, with iSCSI in between that could be the TCP memcpy limitation.
>>
>>
>>     Measuring iSCSI in isolation is next on my list. Both on the physical
>>     host, and in the instance. (Now to find that link to the iSCSI test,
>>     again...)
>>
>>
>> Based on earlier comments it appears that you're using the qemu built-in
>> iSCSI initiator.
>>
>> Assuming that's the case, maybe it would make sense to do a test run with
>> the in-kernel iSCSI code and take qemu out of the picture?
>>
>> Chris
>>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][cinder] Limits on volume read throughput?

2016-03-07 Thread Chris Friesen
Just a heads-up that the 3.10 kernel in CentOS/RHEL is *not* a stock 3.10 
kernel.  It has had many things backported from later kernels, though they may 
not have backported the specific improvements you're looking for.


I think CentOS is using qemu 2.3, which is pretty new.  Not sure how new their 
libiscsi is though.
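
If it helps to confirm, something like this on the host should show what is
actually installed (on CentOS the qemu binary usually lives under
/usr/libexec, if memory serves):

    uname -r
    rpm -qa | grep -E 'qemu-kvm|libiscsi'
    /usr/libexec/qemu-kvm --version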


Chris

On 03/07/2016 12:25 AM, Preston L. Bannister wrote:

Should add that the physical host of the moment is Centos 7 with a packstack
install of OpenStack. The instance is Ubuntu Trusty. Centos 7 has a relatively
old 3.10 Linux kernel.

 From the last week (or so) of digging, I found there were substantial claimed
improvements in /both/ flash support in Linux and the block I/O path in QEMU -
in more recent versions. How much that impacts the current measures, I do not
(yet) know.

Which suggests a bit of tension. Redhat folk are behind much of these
improvements, but RHEL (and Centos) are rather far behind. Existing RHEL
customers want and need careful, conservative changes. Folk deploying OpenStack
need more aggressive release content, for which Ubuntu is currently the best 
base.

Will we see a "Redhat Cloud Base" as an offering with RHEL support levels, and
more aggressive QEMU and Linux kernel inclusion?

At least for now, building OpenStack clouds on Ubuntu might be a much better 
bet.


Are those claimed improvements in QEMU and the Linux kernel going to make a
difference in my measured result? I do not know. Still reading, building tests,
and collecting measures...




On Thu, Mar 3, 2016 at 11:28 AM, Chris Friesen <chris.frie...@windriver.com> wrote:

On 03/03/2016 01:13 PM, Preston L. Bannister wrote:

 > Scanning the same volume from within the instance still gets the same
 > ~450MB/s that I saw before.

 Hmmm, with iSCSI in between that could be the TCP memcpy limitation.


Measuring iSCSI in isolation is next on my list. Both on the physical
host, and
in the instance. (Now to find that link to the iSCSI test, again...)


Based on earlier comments it appears that you're using the qemu built-in
iSCSI initiator.

Assuming that's the case, maybe it would make sense to do a test run with
the in-kernel iSCSI code and take qemu out of the picture?

Chris


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe

http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][cinder] Limits on volume read throughput?

2016-03-06 Thread Preston L. Bannister
Should add that the physical host of the moment is CentOS 7 with a
packstack install of OpenStack. The instance is Ubuntu Trusty. CentOS 7 has
a relatively old 3.10 Linux kernel.

From the last week (or so) of digging, I found there were substantial
claimed improvements in *both* flash support in Linux and the block I/O
path in QEMU - in more recent versions. How much that impacts the current
measures, I do not (yet) know.

Which suggests a bit of tension. Red Hat folk are behind much of these
improvements, but RHEL (and CentOS) are rather far behind. Existing RHEL
customers want and need careful, conservative changes. Folk deploying
OpenStack need more aggressive release content, for which Ubuntu is
currently the best base.

Will we see a "Red Hat Cloud Base" as an offering with RHEL support levels,
and more aggressive QEMU and Linux kernel inclusion?

At least for now, building OpenStack clouds on Ubuntu might be a much
better bet.


Are those claimed improvements in QEMU and the Linux kernel going to make a
difference in my measured result? I do not know. Still reading, building
tests, and collecting measures...




On Thu, Mar 3, 2016 at 11:28 AM, Chris Friesen <chris.frie...@windriver.com>
wrote:

> On 03/03/2016 01:13 PM, Preston L. Bannister wrote:
>
> > Scanning the same volume from within the instance still gets the same
> > ~450MB/s that I saw before.
>
> Hmmm, with iSCSI in between that could be the TCP memcpy limitation.
>
>
> Measuring iSCSI in isolation is next on my list. Both on the physical
> host, and in the instance. (Now to find that link to the iSCSI test,
> again...)
>
> Based on earlier comments it appears that you're using the qemu built-in
> iSCSI initiator.
>
> Assuming that's the case, maybe it would make sense to do a test run with
> the in-kernel iSCSI code and take qemu out of the picture?
>
> Chris
>
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][cinder] Limits on volume read throughput?

2016-03-03 Thread Chris Friesen

On 03/03/2016 01:13 PM, Preston L. Bannister wrote:


> Scanning the same volume from within the instance still gets the same
> ~450MB/s that I saw before.

Hmmm, with iSCSI in between that could be the TCP memcpy limitation.


Measuring iSCSI in isolation is next on my list. Both on the physical host, and
in the instance. (Now to find that link to the iSCSI test, again...)


Based on earlier comments it appears that you're using the qemu built-in iSCSI 
initiator.


Assuming that's the case, maybe it would make sense to do a test run with the 
in-kernel iSCSI code and take qemu out of the picture?
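
Roughly something like this, assuming the target is listening on the loopback
(the IQN and the resulting /dev/sdX name below are placeholders - take them
from the discovery output and from dmesg):

    # discover and log in with the in-kernel initiator (open-iscsi)
    iscsiadm -m discovery -t sendtargets -p 127.0.0.1:3260
    iscsiadm -m node -T iqn.2010-10.org.openstack:volume-XXXX -p 127.0.0.1:3260 --login

    # the volume shows up as a new SCSI disk; read it directly, no qemu involved
    dd if=/dev/sdX of=/dev/null bs=1M

    # log back out when done
    iscsiadm -m node -T iqn.2010-10.org.openstack:volume-XXXX -p 127.0.0.1:3260 --logout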


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][cinder] Limits on volume read throughput?

2016-03-03 Thread Preston L. Bannister
Note that my end goal is to benchmark an application that runs in an
instance that does primarily large sequential full-volume-reads.

On this path I ran into unexpectedly poor performance within the instance.
If this is a common characteristic of OpenStack, then this becomes a
question of concern to OpenStack developers.

Until recently, ~450MB/s would be (and still is for many cases)
*outstanding* throughput. Most similar questions on the web are happy with
saturating a couple of gigabit links, or a few spinning disks. So it is not
a surprise that few(?) folk have asked questions at this level of
performance, until now.

But with flash displacing spinning disks, much higher throughput is
possible. If there is an unnecessary bottleneck, this might be a good time
to call attention to it.


From general notions to current specifics... :)


On Wed, Mar 2, 2016 at 10:10 PM, Philipp Marek wrote:

> > The benchmark scripts are in:
> >   https://github.com/pbannister/openstack-bootstrap



> in case that might help, here are a few notes and hints about doing
> benchmarks for the DRBD block device driver:
>
> http://blogs.linbit.com/p/897/benchmarking-drbd/
>
> Perhaps there's something interesting for you.
>

Found this earlier. :)


> > Found that if I repeatedly scanned the same 8GB volume from the physical
> > host (with 1/4TB of memory), the entire volume was cached in (host) memory
> > (very fast scan times).
>


> If the iSCSI target (or QEMU, for direct access) is set up to use buffer
> cache, yes.
> Whether you really want that is up to discussion - it might be much more
> beneficial to move that RAM from the Hypervisor to the VM, which should
> then be able to do more efficient caching of the filesystem contents that
> it should operate on.
>

You are right, but my aim was a bit different. Doing a bit of
divide-and-conquer.

In essence, this test was to see if reducing the host-side volume-read time
to (practically) zero would have *any* impact on performance. Given the
*huge* introduced latency (somewhere), I did not expect a notable
difference - and that is what the measure shows. This further supports the
theory that host-side Linux is *not* the issue.



> > Scanning the same volume from within the instance still gets the same
> > ~450MB/s that I saw before.
>


> Hmmm, with iSCSI in between that could be the TCP memcpy limitation.


Measuring iSCSI in isolation is next on my list. Both on the physical host,
and in the instance. (Now to find that link to the iSCSI test, again...)




> > The "iostat" numbers from the instance show ~44 %iowait, and ~50 %idle.
> > (Which to my reading might explain the ~50% loss of performance.) Why so
> > much idle/latency?
> >
> > The in-instance "dd" CPU use is ~12%. (Not very interesting.)
>


> Because your "dd" testcase will be single-threaded, io-depth 1.
> And that means synchronous access, each IO has to wait for the preceding
> one to finish...
>

Given the Linux kernel read-ahead parameter has a noticeable impact on
performance, I believe that "dd" does not need to wait (much?) for I/O. Note
also the large difference between host and instance with "dd".



> > Not sure from where the (apparent) latency comes. The host iSCSI target?
> > The QEMU iSCSI initiator? Onwards...
>


> Thread scheduling, inter-CPU cache thrashing (if the iSCSI target is on
> a different physical CPU package/socket than the VM), ...
>
> Benchmarking is a dark art.
>

This physical host has an absurd number of CPUs (at 40), so what you
mention is possible. At these high rates, if only losing 10-20% of the
throughput, I might consider such causes. But losing 60% ... my guess ...
the cause is much less esoteric.
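
(If it does come to that, a quick check of where things landed is easy
enough - the NUMA layout from lscpu, and which CPU the qemu and iSCSI target
threads last ran on:)

    lscpu | grep -i numa
    ps -eLo pid,psr,comm | grep -E 'qemu-kvm|iscsi'
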
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][cinder] Limits on volume read throughput?

2016-03-02 Thread Philipp Marek
Hi Preston,

 
> The benchmark scripts are in:
> 
>   https://github.com/pbannister/openstack-bootstrap
in case that might help, here are a few notes and hints about doing
benchmarks for the DRBD block device driver:

http://blogs.linbit.com/p/897/benchmarking-drbd/

Perhaps there's something interesting for you.


> Found that if I repeatedly scanned the same 8GB volume from the physical
> host (with 1/4TB of memory), the entire volume was cached in (host) memory
> (very fast scan times).
If the iSCSI target (or QEMU, for direct access) is set up to use buffer 
cache, yes.
Whether you really want that is up to discussion - it might be much more 
beneficial to move that RAM from the Hypervisor to the VM, which should 
then be able to do more efficient caching of the filesystem contents that 
it should operate on.
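
If you want to see which mode your deployment ended up with, the cache
attribute on the disk element in the libvirt domain XML is the thing to look
at (the instance name below is a placeholder); cache=none keeps the host
buffer cache out of the data path, cache=writeback pulls it back in:

    virsh dumpxml instance-00000001 | grep -A 4 '<disk'
    # look for something like:
    #   <driver name='qemu' type='raw' cache='none'/>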


> Scanning the same volume from within the instance still gets the same
> ~450MB/s that I saw before. 
Hmmm, with iSCSI in between that could be the TCP memcpy limitation.

> The "iostat" numbers from the instance show ~44 %iowait, and ~50 %idle.
> (Which to my reading might explain the ~50% loss of performance.) Why so
> much idle/latency?
> 
> The in-instance "dd" CPU use is ~12%. (Not very interesting.)
Because your "dd" testcase will be single-threaded, io-depth 1.
And that means synchronous access, each IO has to wait for the preceding
one to finish...
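
If you want to see how much of the gap is just queue depth, a quick fio run
with a deeper queue against the same device should show it (the device path
is a placeholder, and fio needs to be installed in the guest):

    fio --name=seqread --filename=/dev/vdb --rw=read --bs=1M \
        --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --time_based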


> Not sure from where the (apparent) latency comes. The host iSCSI target?
> The QEMU iSCSI initiator? Onwards...
Thread scheduling, inter-CPU cache thrashing (if the iSCSI target is on
a different physical CPU package/socket than the VM), ...


Benchmarking is a dark art.


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][cinder] Limits on volume read throughput?

2016-03-02 Thread Preston L. Bannister
First, my degree from school is in Physics. So I know something about
designing experiments. :)

The benchmark script runs "dd" 218 times, against different volumes (of
differing sizes), with differing "bs". Measures are collected both from the
physical host, and from within the instance. Linux is told to drop caches
before the start.
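
The inner loop is essentially this shape - the real script in the repository
varies more dimensions; the device path and block-size list here are
illustrative:

    for bs in 64k 256k 1M 2M 4M; do
        sync
        echo 3 > /proc/sys/vm/drop_caches   # drop the page cache
        dd if=/dev/vdb of=/dev/null bs=$bs
    done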

The benchmark scripts are in:

  https://github.com/pbannister/openstack-bootstrap

(Very much a work in progress! Not complete or properly documented.)


Second, went through the exercise of collecting hints from the web as to
parameters for tuning iSCSI performance. (I did not expect changing Linux
TCP parameters to change the result for iSCSI over loopback, but measured
to be certain.) Followed all the hints, with no change in performance (as
expected).
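
For the record, these were the sort of knobs the hints pointed at (values
illustrative); none of them moved the loopback-iSCSI numbers:

    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"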

Found that if I repeatedly scanned the same 8GB volume from the physical
host (with 1/4TB of memory), the entire volume was cached in (host) memory
(very fast scan times).

Scanning the same volume from within the instance still gets the same
~450MB/s that I saw before. The difference is that "iostat" in the host is
~93% idle. In the host, *iscsi_ttx* is using ~58% of a CPU (sound high?),
and *qemu-kvm* is using ~30% of a CPU. (The physical host is a fairly new
box - with 40(!) CPUs.)

The "iostat" numbers from the instance show ~44 %iowait, and ~50 %idle.
(Which to my reading might explain the ~50% loss of performance.) Why so
much idle/latency?

The in-instance "dd" CPU use is ~12%. (Not very interesting.)


Not sure from where the (apparent) latency comes. The host iSCSI target?
The QEMU iSCSI initiator? Onwards...





On Tue, Mar 1, 2016 at 5:13 PM, Rick Jones wrote:

> On 03/01/2016 04:29 PM, Preston L. Bannister wrote:
>
> Running "dd" in the physical host against the Cinder-allocated volumes
>> nets ~1.2GB/s (roughly in line with expectations for the striped flash
>> volume).
>>
>> Running "dd" in an instance against the same volume (now attached to the
>> instance) got ~300MB/s, which was pathetic. (I was expecting 80-90% of
>> the raw host volume numbers, or better.) Upping read-ahead in the
>> instance via "hdparm" boosted throughput to ~450MB/s. Much better, but
>> still sad.
>>
>> In the second measure the volume data passes through iSCSI and then the
>> QEMU hypervisor. I expected to lose some performance, but not more than
>> half!
>>
>> Note that as this is an all-in-one OpenStack node, iSCSI is strictly
>> local and not crossing a network. (I did not want network latency or
>> throughput to be a concern with this first measure.)
>>
>
> Well, not crossing a physical network :)  You will be however likely
> crossing the loopback network on the node.
>

Well ... yes. I suspect the latency and bandwidth numbers for loopback are
rather better. :)

For the purposes of this experiment, I wanted to eliminate the physical
network limits as a consideration.



> What sort of per-CPU utilizations do you see when running the test to the
> instance?  Also, out of curiosity, what block size are you using in dd?  I
> wonder how well that "maps" to what iSCSI will be doing.
>

First, this measure was collected via a script that tried a moderately
exhaustive number of variations. Yes, I had the same question. Kernel host
read-ahead is 6MB (automatically set). Did not see notable gain past
"bs=2M". (Was expecting a bigger gain for larger reads, but that is not what
the measures showed.)
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][cinder] Limits on volume read throughput?

2016-03-01 Thread Rick Jones

On 03/01/2016 04:29 PM, Preston L. Bannister wrote:


Running "dd" in the physical host against the Cinder-allocated volumes
nets ~1.2GB/s (roughly in line with expectations for the striped flash
volume).

Running "dd" in an instance against the same volume (now attached to the
instance) got ~300MB/s, which was pathetic. (I was expecting 80-90% of
the raw host volume numbers, or better.) Upping read-ahead in the
instance via "hdparm" boosted throughput to ~450MB/s. Much better, but
still sad.

In the second measure the volume data passes through iSCSI and then the
QEMU hypervisor. I expected to lose some performance, but not more than
half!

Note that as this is an all-in-one OpenStack node, iSCSI is strictly
local and not crossing a network. (I did not want network latency or
throughput to be a concern with this first measure.)


Well, not crossing a physical network :)  You will be however likely 
crossing the loopback network on the node.


What sort of per-CPU utilizations do you see when running the test to 
the instance?  Also, out of curiosity, what block size are you using in 
dd?  I wonder how well that "maps" to what iSCSI will be doing.
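
Something like the following on the host, while the in-instance dd is
running, would show whether a single CPU (say, the one running the iSCSI
target thread or the qemu I/O thread) is the one that is pegged (both tools
are in the sysstat package):

    mpstat -P ALL 1                              # per-CPU utilization
    pidstat -t -p $(pgrep -d, -f qemu-kvm) 1     # per-thread CPU use of qemu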


rick jones
http://www.netperf.org/

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][cinder] Limits on volume read throughput?

2016-03-01 Thread Preston L. Bannister
I have need to benchmark volume-read performance of an application running
in an instance, assuming extremely fast storage.

To simulate fast storage, I have an AIO install of OpenStack, with local
flash disks. Cinder LVM volumes are striped across three flash drives (what
I have in the present setup).
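
(The stripe layout can be confirmed on the host with something like the
following; cinder-volumes is the default volume group name for the LVM
backend, adjust if yours differs:)

    lvs -o +stripes,stripe_size,devices cinder-volumes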

Since I am only interested in sequential-read performance, the "dd" utility
is sufficient as a measure.

Running "dd" in the physical host against the Cinder-allocated volumes nets
~1.2GB/s (roughly in line with expectations for the striped flash volume).

Running "dd" in an instance against the same volume (now attached to the
instance) got ~300MB/s, which was pathetic. (I was expecting 80-90% of the
raw host volume numbers, or better.) Upping read-ahead in the instance via
"hdparm" boosted throughput to ~450MB/s. Much better, but still sad.

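(The read-ahead change was along these lines, inside the instance; the device
path is illustrative, and the value is in 512-byte sectors, so 8192 is 4MB:)

    blockdev --getra /dev/vdb        # current read-ahead
    blockdev --setra 8192 /dev/vdb   # bump it up
    # or, equivalently, via hdparm:
    hdparm -a 8192 /dev/vdb
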
In the second measure the volume data passes through iSCSI and then the
QEMU hypervisor. I expected to lose some performance, but not more than
half!

Note that as this is an all-in-one OpenStack node, iSCSI is strictly local
and not crossing a network. (I did not want network latency or throughput
to be a concern with this first measure.)

I do not see any prior mention of performance of this sort on the web or in
the mailing list. Possibly I missed something.

What sort of numbers are you seeing out of high performance storage?

Is the huge drop in read-rate within an instance something others have seen?

Is the default iSCSI configuration used by Nova and Cinder optimal?
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev