Re: Linux 3.0+ Disk performance problem - wrong pdflush behaviour

2013-03-07 Thread Jan Kara
On Wed 06-03-13 20:05:58, Harshana Ranmuthu wrote:
> >   I don't know how to fix the issue without reverting the patch. Sorry.
> 
> with reference to post URL
> http://marc.info/?l=linux-fsdevel&m=134997043820759&w=2
> 
> I was going through this post, as we are also having problems with the
> same commit. In our case, we are appending to a file rather than
> updating.
> 
> Let me explain the issue as I understand and the solution I think
> which can fix the issue.
> 
> If a write() writes to a page (for the second time) while that page is
> being flushed (because the page was dirty from the first write), the data
> written to disk may contain part of the first write and the rest from the
> second write. Although the page will have the correct data after the
> second write, the disk will have corrupted data, i.e. part from the first
> write and the rest from the second. After the flush, the dirty flag on the
> page will be cleared.
  This isn't true. The dirty bit gets cleared *before* the page is submitted
for IO. So the second write sets the dirty bit again, and that ensures
writeback will happen again in the future. So what you propose below is
indeed happening. So my question is: what problems are you really observing?
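(A minimal userspace sketch of the ordering described above. This is a
simplified model, not kernel code: the struct and flag names below are
illustrative stand-ins for the page flags the kernel actually uses.)

/* Simplified model of the race: writeback clears the dirty flag before
 * the IO is submitted, so a write() landing during writeback re-dirties
 * the page and a later flush writes the new contents out again. */
#include <stdbool.h>
#include <stdio.h>

struct page_model {
    bool dirty;      /* page contents newer than what is on disk */
    bool writeback;  /* page IO currently in flight */
};

int main(void)
{
    struct page_model p = { .dirty = true, .writeback = false };

    /* writeback starts: dirty bit cleared first, then IO submitted */
    p.dirty = false;
    p.writeback = true;

    /* a second write() lands while the IO is still in flight */
    p.dirty = true;

    /* IO completes */
    p.writeback = false;

    printf("page dirty after the race: %s\n",
           p.dirty ? "yes - it will be flushed again" : "no");
    return 0;
}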

> If there has been a third write to the same page (this
> time assuming a clean write), then accurate data will be written to
> disk on the next flush. If there is no third write, the file on disk
> will keep the corrupted data. If we can make sure that, after the
> second write(), the page remains dirty (because the flush and the write()
> happened at the same time), then even if there is no third write, that
> page will be flushed again with accurate data, correcting the corruption
> on disk.
> In summary, the solution is to make sure that the conflicting pages (where
> a write and a flush happen at the same time) are kept dirty after the
> flush. This will make sure these pages are flushed again even if there are
> no subsequent write()s to them.
> I don't know how easy / difficult this change is. Hope you'll consider this.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.0+ Disk performance problem - wrong pdflush behaviour

2013-03-06 Thread Harshana Ranmuthu
>   I don't know how to fix the issue without reverting the patch. Sorry.

with reference to post URL
http://marc.info/?l=linux-fsdevel&m=134997043820759&w=2

I was going through this post, as we are also having problems with the
same commit. In our case, we are appending to a file rather than
updating.

Let me explain the issue as I understand it, and the solution that I
think can fix it.

If a write() writes to a page (for the second time) while that page is
being flushed (because the page was dirty from the first write), the data
written to disk may contain part of the first write and the rest from the
second write. Although the page will have the correct data after the
second write, the disk will have corrupted data, i.e. part from the first
write and the rest from the second. After the flush, the dirty flag on the
page will be cleared. If there has been a third write to the same page
(this time assuming a clean write), then accurate data will be written to
disk on the next flush. If there is no third write, the file on disk will
keep the corrupted data. If we can make sure that, after the second
write(), the page remains dirty (because the flush and the write()
happened at the same time), then even if there is no third write, that
page will be flushed again with accurate data, correcting the corruption
on disk.
In summary, the solution is to make sure that the conflicting pages (where
a write and a flush happen at the same time) are kept dirty after the
flush. This will make sure these pages are flushed again even if there are
no subsequent write()s to them.

I don't know how easy / difficult this change is. Hope you'll consider this.

Harshana
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.0+ Disk performance problem - wrong pdflush behaviour

2012-10-11 Thread Jan Kara
On Thu 11-10-12 13:58:00, Viktor Nagy wrote:
> >The regression you observe is caused by commit 3d08bcc8 "mm: Wait for
> >writeback when grabbing pages to begin a write". At first sight I was
> >somewhat surprised when I saw that code path in the traces, but when I
> >did some math it became clear. What the commit does is that when a page is just
> >being written out to disk, we don't allow its contents to be changed and
> >wait for the IO to finish before letting the next write proceed. Now if you have
> >a 1 GB file, that's 256000 pages. By the observation from my test machine,
> >the writeback code keeps around 10000 pages in flight to disk at any moment
> >(this number fluctuates a lot but the average is around that number). Your
> >program dirties about 25600 pages per second. So the probability that one of
> >the dirtied pages is a page under writeback is equal to 1 for all practical
> >purposes (precisely, it is 1-(1-10000/256000)^25600). Actually, on average
> >you are going to hit about 1000 pages under writeback per second, which
> >clearly has a noticeable impact (even a single page can). Pity I didn't
> >do the math when we were considering those patches.
> >
> >There were plans to avoid waiting if the underlying storage doesn't need it, but
> >I'm not sure how far those plans got (added a couple of relevant CCs).
> >Anyway, yours is about the second or third real workload that sees a regression
> >due to "stable pages", so we have to fix that sooner rather than later... Thanks
> >for your detailed report!
> We develop a game server which gets a very high load in some
> countries. We are trying to serve as many players as possible with
> one server.
> Currently the CPU usage is below 50% at peak times. And with
> the old kernel it runs smoothly. The pdflush runs non-stop on the
> database disk with ~3 MByte/s of writes (minimal reads).
> This is at 43000 active sockets, 18000 rq/s, ~40000 packets/s.
> I think we are still below the theoretical limits of this server...
> but only if the disk writes are never done in sync.
> 
> I will try the 3.2.31 kernel without the problematic commit
> (3d08bcc8 "mm: Wait for writeback when grabbing pages to begin a
> write").
> Is it a good idea? Will it be worse than 2.6.32?
> >>>   Running without that commit should work just fine unless you use
> >>>something exotic like DIF/DIX or similar. Whether things will be worse than
> >>>in 2.6.32 I cannot say. For me, your test program behaves fine without that
> >>>commit but whether your real workload won't hit some other problem is
> >>>always a question. But if you hit another regression I'm interested in
> >>>hearing about it :).
> >>I've just tested it. After I've set dirty_bytes above the file
> >>size, the writes are never blocked.
> >>So it works nicely without the mentioned commit.
> >>
> >>The problem is that if you read the kernel's documentation about
> >>dirty page handling, it does not work that way (with the commit). It
> >>works unpredictably.
> >   Which documentation do you mean exactly? The process won't be throttled
> >because of dirtying too much memory, but we can still block it for other
> >reasons - e.g. because we decide to evict its code from memory and have to
> >reload it again when the process gets scheduled. Or we can block during
> >memory allocation (which may be needed to allocate a page you write to) if
> >we find it necessary. There are no promises really...
> >
> Ok, it is very hard to get an overview of this whole thing.
> I thought I understood the behaviour after checking the file
> Documentation/sysctl/vm.txt:
> 
> "
> dirty_bytes
> 
> Contains the amount of dirty memory at which a process generating
> disk writes will itself start writeback.
> ...
> "
> 
> Ok, it does not say explicitly that other things can influence this too.
> 
> Several people are trying to get over the problem caused by the
> commit by setting the value of /sys/block/sda/queue/nr_requests to
> 4 (from 128).
> This helped a lot but was not enough for us.
  Yes, that reduces the amount of IO in flight at any moment, so it reduces
the chances that you will wait in grab_cache_page_write_begin(). But it also
reduces throughput...
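(For reference, a small sketch that prints the tunables discussed in this
thread; /sys/block/sda is only an assumed example device, adjust the
device name as needed.)

/* Print the writeback and queue tunables mentioned above. */
#include <stdio.h>

static void show(const char *path)
{
    char buf[64];
    FILE *f = fopen(path, "r");

    if (!f) {
        printf("%s: <not available>\n", path);
        return;
    }
    if (fgets(buf, sizeof(buf), f))
        printf("%s: %s", path, buf);
    fclose(f);
}

int main(void)
{
    show("/proc/sys/vm/dirty_bytes");
    show("/proc/sys/vm/dirty_ratio");
    show("/proc/sys/vm/dirty_background_ratio");
    show("/sys/block/sda/queue/nr_requests");
    return 0;
}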

> I attach two performance graphs which show our own CPU usage
> measurement (red). One-minute averages; the blue line is the SQL
> time %.
> 
> And a nice question: without reverting the patch, is it possible to
> get smooth performance (in our case)?
  I don't know how to fix the issue without reverting the patch. Sorry.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Linux 3.0+ Disk performance problem - wrong pdflush behaviour

2012-10-11 Thread Jan Kara
On Thu 11-10-12 11:52:54, Viktor Nagy wrote:
> On 2012.10.10. 22:27, Jan Kara wrote:
> >On Wed 10-10-12 22:44:41, Viktor Nagy wrote:
> >>On 10/10/2012 06:57 PM, Jan Kara wrote:
> >>>   Hello,
> >>>
> >>>On Tue 09-10-12 11:41:16, Viktor Nagy wrote:
> Since kernel version 3.0, pdflush blocks writes even when the dirty bytes
> are well below /proc/sys/vm/dirty_bytes or /proc/sys/vm/dirty_ratio.
> Kernel 2.6.39 works nicely.
> 
> How this hurts us in real life: We have a very high performance
> game server where MySQL has to do many writes alongside the reads.
> All writes and reads are very simple and have to be very quick. If
> we run the system with Linux 3.2 we get unacceptable performance.
> Now we are stuck with the 2.6.32 kernel here because of this problem.
> 
> I attach a test program written by me which shows the problem. The
> program just writes blocks continuously to random positions in a given
> big file. The write rate is limited to 100 MByte/s. On a well-working
> kernel it should run at a constant 100 MByte/s for an indefinitely
> long time. The test has to be run on a simple HDD.
> 
> Test steps:
> 1. You have to use an XFS, EXT2 or ReiserFS partition for the test;
> Ext4 forces flushes periodically. I recommend using XFS.
> 2. Create a big file on the test partition. For 8 GByte RAM you can
> create a 2 GByte file. For 2 GB RAM I recommend creating a 500 MByte
> file. File creation can be done with this command:  dd if=/dev/zero
> of=bigfile2048M.bin bs=1M count=2048
> 3. Compile pdflushtest.c: (gcc -o pdflushtest pdflushtest.c)
> 4. Run pdflushtest: ./pdflushtest --file=/where/is/the/bigfile2048M.bin
> 
> In the beginning there can be some slowness even on well-working
> kernels. If you create the bigfile in the same run then it usually
> runs smoothly from the beginning.
> 
> I don't know a setting of the /proc/sys/vm variables which makes this
> test run smoothly on a 3.2.29 (3.0+) kernel. I think this is a kernel
> bug, because if /proc/sys/vm/dirty_bytes is much larger than the
> testfile size the test program should never be blocked.
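(The pdflushtest.c attachment itself is not part of this archive. The
following is only a rough sketch of a load generator along the lines
described above - random 4 KB writes into an existing big file,
rate-limited to roughly 100 MByte/s - using pwrite() and off_t so the
'int' offset issue mentioned below does not apply.)

/* Rough sketch, not the original attachment: write random 4 KB blocks
 * into an existing file at roughly 100 MByte/s and report the rate. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE  4096
#define RATE_LIMIT  (100 * 1024 * 1024)    /* bytes per second */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <bigfile>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    off_t size = lseek(fd, 0, SEEK_END);
    long long nblocks = size / BLOCK_SIZE;
    if (nblocks <= 0) {
        fprintf(stderr, "file too small\n");
        return 1;
    }

    char block[BLOCK_SIZE];
    memset(block, 0xab, sizeof(block));

    long long written = 0;
    time_t second = time(NULL);
    srand((unsigned)second);

    for (;;) {
        /* crude randomness; good enough for files up to RAND_MAX blocks */
        off_t off = (off_t)(rand() % nblocks) * BLOCK_SIZE;

        if (pwrite(fd, block, BLOCK_SIZE, off) != BLOCK_SIZE) {
            perror("pwrite");
            return 1;
        }
        written += BLOCK_SIZE;

        /* once the per-second budget is used up, wait for the next second */
        if (written >= RATE_LIMIT) {
            while (time(NULL) == second)
                usleep(1000);
        }
        if (time(NULL) != second) {
            printf("%lld MB written in the last second\n",
                   written / (1024 * 1024));
            written = 0;
            second = time(NULL);
        }
    }
}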
> >>>   I've run your program and I can confirm your results. As a side note,
> >>>your test program has a bug: it uses 'int' for offset arithmetic, so when
> >>>the file is larger than 2 GB you can hit some problems, but for our case
> >>>that's not really important.
> >>Sorry for the bug and maybe the poor implementation. I am much
> >>better in Pascal than in C.
> >>(You cannot make such a mistake in Pascal (FreePascal). Is there a
> >>way (compiler switch) in C/C++ to get a warning there?)
> >   Actually I somewhat doubt that even FreePascal is able to give you a
> >warning that arithmetic can overflow...
> Well, you get a hint at least (FPC 2.6).
> 
> program inttest;
> 
> var
>   i,j : integer;
> 
> procedure Test(x : int64);
> begin
>   Writeln('x=',x);
> end;
> 
> begin
>   i := 1000000;
>   j := 1000000;
>   Test(1000000*1000000);
>   Test(int64(i)*j);
>   Test(i*j);  // result is wrong, but you get a hint here
  You get a hint about automatic conversion from 'integer' to 'int64'? I
don't have an FPC compiler at hand to check that, but I'd be surprised,
because that tends to be rather common.  I imagine you get the warning if
the compiler can figure out the numbers in advance. But in your test
program the situation was more like:
ReadLn(i);
j := 4096;
Test(i*j);

And there the compiler knows nothing about the resulting value...
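(On the C/C++ side of that question: as far as I know gcc does not warn at
compile time when the overflowing operands are runtime values, but building
with -fsanitize=signed-integer-overflow makes gcc/clang report the overflow
at runtime, and the bug itself disappears once the offset arithmetic is done
in a 64-bit type. A small illustration, with made-up numbers:)

/* Illustration of the 'int' offset bug and one way to fix it.
 * Try: gcc -fsanitize=signed-integer-overflow overflow.c && ./a.out */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int block_size = 4096;
    int block_index = 600000;   /* a block somewhere past the 2 GB mark */

    /* 32-bit multiply: 4096 * 600000 does not fit in int (undefined
     * behaviour; in practice it usually wraps to a bogus value) */
    int bad = block_size * block_index;

    /* promote to 64 bits first; for real file offsets one would use
     * off_t together with -D_FILE_OFFSET_BITS=64 */
    int64_t good = (int64_t)block_size * block_index;

    printf("int result:     %d\n", bad);
    printf("64-bit result:  %lld\n", (long long)good);
    return 0;
}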

> >>>The regression you observe is caused by commit 3d08bcc8 "mm: Wait for
> >>>writeback when grabbing pages to begin a write". At first sight I was
> >>>somewhat surprised when I saw that code path in the traces, but when I
> >>>did some math it became clear. What the commit does is that when a page is just
> >>>being written out to disk, we don't allow its contents to be changed and
> >>>wait for the IO to finish before letting the next write proceed. Now if you have
> >>>a 1 GB file, that's 256000 pages. By the observation from my test machine,
> >>>the writeback code keeps around 10000 pages in flight to disk at any moment
> >>>(this number fluctuates a lot but the average is around that number). Your
> >>>program dirties about 25600 pages per second. So the probability that one of
> >>>the dirtied pages is a page under writeback is equal to 1 for all practical
> >>>purposes (precisely, it is 1-(1-10000/256000)^25600). Actually, on average
> >>>you are going to hit about 1000 pages under writeback per second, which
> >>>clearly has a noticeable impact (even a single page can). Pity I didn't
> >>>do the math when we were considering those patches.
> >>>
> >>>There were plans to avoid waiting if the underlying storage doesn't need it, but
> >>>I'm not sure how far those plans got (added a couple of relevant CCs).
> >>>Anyway, yours is about the second or third real workload that sees a regression
> >>>due to "stable pages", so we 

Re: Linux 3.0+ Disk performance problem - wrong pdflush behaviour

2012-10-11 Thread Viktor Nagy

Hi,

On 2012.10.10. 22:27, Jan Kara wrote:

On Wed 10-10-12 22:44:41, Viktor Nagy wrote:

On 10/10/2012 06:57 PM, Jan Kara wrote:

   Hello,

On Tue 09-10-12 11:41:16, Viktor Nagy wrote:

Since kernel version 3.0, pdflush blocks writes even when the dirty bytes
are well below /proc/sys/vm/dirty_bytes or /proc/sys/vm/dirty_ratio.
Kernel 2.6.39 works nicely.

How this hurts us in real life: We have a very high performance
game server where MySQL has to do many writes alongside the reads.
All writes and reads are very simple and have to be very quick. If
we run the system with Linux 3.2 we get unacceptable performance.
Now we are stuck with the 2.6.32 kernel here because of this problem.

I attach a test program written by me which shows the problem. The
program just writes blocks continuously to random positions in a given
big file. The write rate is limited to 100 MByte/s. On a well-working
kernel it should run at a constant 100 MByte/s for an indefinitely
long time. The test has to be run on a simple HDD.

Test steps:
1. You have to use an XFS, EXT2 or ReiserFS partition for the test;
Ext4 forces flushes periodically. I recommend using XFS.
2. Create a big file on the test partition. For 8 GByte RAM you can
create a 2 GByte file. For 2 GB RAM I recommend creating a 500 MByte
file. File creation can be done with this command:  dd if=/dev/zero
of=bigfile2048M.bin bs=1M count=2048
3. Compile pdflushtest.c: (gcc -o pdflushtest pdflushtest.c)
4. Run pdflushtest: ./pdflushtest --file=/where/is/the/bigfile2048M.bin

In the beginning there can be some slowness even on well-working
kernels. If you create the bigfile in the same run then it usually
runs smoothly from the beginning.

I don't know a setting of the /proc/sys/vm variables which makes this
test run smoothly on a 3.2.29 (3.0+) kernel. I think this is a kernel
bug, because if /proc/sys/vm/dirty_bytes is much larger than the
testfile size the test program should never be blocked.

   I've run your program and I can confirm your results. As a side note,
your test program has a bug: it uses 'int' for offset arithmetic, so when
the file is larger than 2 GB you can hit some problems, but for our case
that's not really important.

Sorry for the bug and maybe the poor implementation. I am much
better in Pascal than in C.
(You cannot make such a mistake in Pascal (FreePascal). Is there a
way (compiler switch) in C/C++ to get a warning there?)

   Actually I somewhat doubt that even FreePascal is able to give you a
warning that arithmetic can overflow...

Well, you get a hint at least (FPC 2.6).

program inttest;

var
  i,j : integer;

procedure Test(x : int64);
begin
  Writeln('x=',x);
end;

begin
  i := 1000000;
  j := 1000000;
  Test(1000000*1000000);
  Test(int64(i)*j);
  Test(i*j);  // result is wrong, but you get a hint here
  Write('Press enter to continue...'); readln;
end.




The regression you observe is caused by commit 3d08bcc8 "mm: Wait for
writeback when grabbing pages to begin a write". At first sight I was
somewhat surprised when I saw that code path in the traces, but when I
did some math it became clear. What the commit does is that when a page is just
being written out to disk, we don't allow its contents to be changed and
wait for the IO to finish before letting the next write proceed. Now if you have
a 1 GB file, that's 256000 pages. By the observation from my test machine,
the writeback code keeps around 10000 pages in flight to disk at any moment
(this number fluctuates a lot but the average is around that number). Your
program dirties about 25600 pages per second. So the probability that one of
the dirtied pages is a page under writeback is equal to 1 for all practical
purposes (precisely, it is 1-(1-10000/256000)^25600). Actually, on average
you are going to hit about 1000 pages under writeback per second, which
clearly has a noticeable impact (even a single page can). Pity I didn't
do the math when we were considering those patches.

There were plans to avoid waiting if the underlying storage doesn't need it, but
I'm not sure how far those plans got (added a couple of relevant CCs).
Anyway, yours is about the second or third real workload that sees a regression
due to "stable pages", so we have to fix that sooner rather than later... Thanks
for your detailed report!

Honza

Thank you for your response!

I'm very happy that I've found the right people.

We develop a game server which gets a very high load in some
countries. We are trying to serve as many players as possible with
one server.
Currently the CPU usage is below 50% at peak times. And with
the old kernel it runs smoothly. The pdflush runs non-stop on the
database disk with ~3 MByte/s of writes (minimal reads).
This is at 43000 active sockets, 18000 rq/s, ~40000 packets/s.
I think we are still below the theoretical limits of this server...
but only if the disk writes are never done in sync.

I will try the 3.2.31 kernel without the problematic commit
(3d08bcc8 "mm: Wait 

Re: Linux 3.0+ Disk performance problem - wrong pdflush behaviour

2012-10-10 Thread Jan Kara
On Wed 10-10-12 22:44:41, Viktor Nagy wrote:
> On 10/10/2012 06:57 PM, Jan Kara wrote:
> >   Hello,
> >
> >On Tue 09-10-12 11:41:16, Viktor Nagy wrote:
> >>Since kernel version 3.0, pdflush blocks writes even when the dirty bytes
> >>are well below /proc/sys/vm/dirty_bytes or /proc/sys/vm/dirty_ratio.
> >>Kernel 2.6.39 works nicely.
> >>
> >>How this hurts us in real life: We have a very high performance
> >>game server where MySQL has to do many writes alongside the reads.
> >>All writes and reads are very simple and have to be very quick. If
> >>we run the system with Linux 3.2 we get unacceptable performance.
> >>Now we are stuck with the 2.6.32 kernel here because of this problem.
> >>
> >>I attach a test program written by me which shows the problem. The
> >>program just writes blocks continuously to random positions in a given
> >>big file. The write rate is limited to 100 MByte/s. On a well-working
> >>kernel it should run at a constant 100 MByte/s for an indefinitely
> >>long time. The test has to be run on a simple HDD.
> >>
> >>Test steps:
> >>1. You have to use an XFS, EXT2 or ReiserFS partition for the test;
> >>Ext4 forces flushes periodically. I recommend using XFS.
> >>2. Create a big file on the test partition. For 8 GByte RAM you can
> >>create a 2 GByte file. For 2 GB RAM I recommend creating a 500 MByte
> >>file. File creation can be done with this command:  dd if=/dev/zero
> >>of=bigfile2048M.bin bs=1M count=2048
> >>3. Compile pdflushtest.c: (gcc -o pdflushtest pdflushtest.c)
> >>4. Run pdflushtest: ./pdflushtest --file=/where/is/the/bigfile2048M.bin
> >>
> >>In the beginning there can be some slowness even on well-working
> >>kernels. If you create the bigfile in the same run then it usually
> >>runs smoothly from the beginning.
> >>
> >>I don't know a setting of the /proc/sys/vm variables which makes this
> >>test run smoothly on a 3.2.29 (3.0+) kernel. I think this is a kernel
> >>bug, because if /proc/sys/vm/dirty_bytes is much larger than the
> >>testfile size the test program should never be blocked.
> >   I've run your program and I can confirm your results. As a side note,
> >your test program has a bug: it uses 'int' for offset arithmetic, so when
> >the file is larger than 2 GB you can hit some problems, but for our case
> >that's not really important.
> Sorry for the bug and maybe the poor implementation. I am much
> better in Pascal than in C.
> (You cannot make such a mistake in Pascal (FreePascal). Is there a
> way (compiler switch) in C/C++ to get a warning there?)
  Actually I somewhat doubt that even FreePascal is able to give you a
warning that arithmetic can overflow...

> >The regression you observe is caused by commit 3d08bcc8 "mm: Wait for
> >writeback when grabbing pages to begin a write". At first sight I was
> >somewhat surprised when I saw that code path in the traces, but when I
> >did some math it became clear. What the commit does is that when a page is just
> >being written out to disk, we don't allow its contents to be changed and
> >wait for the IO to finish before letting the next write proceed. Now if you have
> >a 1 GB file, that's 256000 pages. By the observation from my test machine,
> >the writeback code keeps around 10000 pages in flight to disk at any moment
> >(this number fluctuates a lot but the average is around that number). Your
> >program dirties about 25600 pages per second. So the probability that one of
> >the dirtied pages is a page under writeback is equal to 1 for all practical
> >purposes (precisely, it is 1-(1-10000/256000)^25600). Actually, on average
> >you are going to hit about 1000 pages under writeback per second, which
> >clearly has a noticeable impact (even a single page can). Pity I didn't
> >do the math when we were considering those patches.
> >
> >There were plans to avoid waiting if the underlying storage doesn't need it, but
> >I'm not sure how far those plans got (added a couple of relevant CCs).
> >Anyway, yours is about the second or third real workload that sees a regression
> >due to "stable pages", so we have to fix that sooner rather than later... Thanks
> >for your detailed report!
> >
> > Honza
> Thank you for your response!
> 
> I'm very happy that I've found the right people.
> 
> We develop a game server which gets a very high load in some
> countries. We are trying to serve as many players as possible with
> one server.
> Currently the CPU usage is below 50% at peak times. And with
> the old kernel it runs smoothly. The pdflush runs non-stop on the
> database disk with ~3 MByte/s of writes (minimal reads).
> This is at 43000 active sockets, 18000 rq/s, ~40000 packets/s.
> I think we are still below the theoretical limits of this server...
> but only if the disk writes are never done in sync.
> 
> I will try the 3.2.31 kernel without the problematic commit
> (3d08bcc8 "mm: Wait for writeback when grabbing pages to begin a
> write").
> Is it a good idea? Will it be worse than 2.6.32?
  Running without 

Re: Linux 3.0+ Disk performance problem - wrong pdflush behaviour

2012-10-10 Thread Viktor Nagy

Hi,

On 10/10/2012 06:57 PM, Jan Kara wrote:

   Hello,

On Tue 09-10-12 11:41:16, Viktor Nagy wrote:

Since kernel version 3.0, pdflush blocks writes even when the dirty bytes
are well below /proc/sys/vm/dirty_bytes or /proc/sys/vm/dirty_ratio.
Kernel 2.6.39 works nicely.

How this hurts us in real life: We have a very high performance
game server where MySQL has to do many writes alongside the reads.
All writes and reads are very simple and have to be very quick. If
we run the system with Linux 3.2 we get unacceptable performance.
Now we are stuck with the 2.6.32 kernel here because of this problem.

I attach a test program written by me which shows the problem. The
program just writes blocks continuously to random positions in a given
big file. The write rate is limited to 100 MByte/s. On a well-working
kernel it should run at a constant 100 MByte/s for an indefinitely
long time. The test has to be run on a simple HDD.

Test steps:
1. You have to use an XFS, EXT2 or ReiserFS partition for the test;
Ext4 forces flushes periodically. I recommend using XFS.
2. Create a big file on the test partition. For 8 GByte RAM you can
create a 2 GByte file. For 2 GB RAM I recommend creating a 500 MByte
file. File creation can be done with this command:  dd if=/dev/zero
of=bigfile2048M.bin bs=1M count=2048
3. Compile pdflushtest.c: (gcc -o pdflushtest pdflushtest.c)
4. Run pdflushtest: ./pdflushtest --file=/where/is/the/bigfile2048M.bin

In the beginning there can be some slowness even on well-working
kernels. If you create the bigfile in the same run then it usually
runs smoothly from the beginning.

I don't know a setting of the /proc/sys/vm variables which makes this
test run smoothly on a 3.2.29 (3.0+) kernel. I think this is a kernel
bug, because if /proc/sys/vm/dirty_bytes is much larger than the
testfile size the test program should never be blocked.

   I've run your program and I can confirm your results. As a side note,
your test program has a bug: it uses 'int' for offset arithmetic, so when
the file is larger than 2 GB you can hit some problems, but for our case
that's not really important.
Sorry for the bug and maybe the poor implementation. I am much better in
Pascal than in C.
(You cannot make such a mistake in Pascal (FreePascal). Is there a way
(compiler switch) in C/C++ to get a warning there?)


The regression you observe is caused by commit 3d08bcc8 "mm: Wait for
writeback when grabbing pages to begin a write". At first sight I was
somewhat surprised when I saw that code path in the traces, but when I
did some math it became clear. What the commit does is that when a page is just
being written out to disk, we don't allow its contents to be changed and
wait for the IO to finish before letting the next write proceed. Now if you have
a 1 GB file, that's 256000 pages. By the observation from my test machine,
the writeback code keeps around 10000 pages in flight to disk at any moment
(this number fluctuates a lot but the average is around that number). Your
program dirties about 25600 pages per second. So the probability that one of
the dirtied pages is a page under writeback is equal to 1 for all practical
purposes (precisely, it is 1-(1-10000/256000)^25600). Actually, on average
you are going to hit about 1000 pages under writeback per second, which
clearly has a noticeable impact (even a single page can). Pity I didn't
do the math when we were considering those patches.

There were plans to avoid waiting if the underlying storage doesn't need it, but
I'm not sure how far those plans got (added a couple of relevant CCs).
Anyway, yours is about the second or third real workload that sees a regression
due to "stable pages", so we have to fix that sooner rather than later... Thanks
for your detailed report!

Honza

Thank you for your response!

I'm very happy that I've found the right people.

We develop a game server which gets a very high load in some countries. We
are trying to serve as many players as possible with one server.
Currently the CPU usage is below 50% at peak times. And with the
old kernel it runs smoothly. The pdflush runs non-stop on the database
disk with ~3 MByte/s of writes (minimal reads).

This is at 43000 active sockets, 18000 rq/s, ~40000 packets/s.
I think we are still below the theoretical limits of this server... but
only if the disk writes are never done in sync.


I will try the 3.2.31 kernel without the problematic commit (3d08bcc8 
"mm: Wait for writeback when grabbing pages to begin a write").

Is it a good idea? Will it be worse than 2.6.32?

Thank you very much again!

Viktor
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.0+ Disk performance problem - wrong pdflush behaviour

2012-10-10 Thread Jan Kara
  Hello,

On Tue 09-10-12 11:41:16, Viktor Nagy wrote:
> Since kernel version 3.0, pdflush blocks writes even when the dirty bytes
> are well below /proc/sys/vm/dirty_bytes or /proc/sys/vm/dirty_ratio.
> Kernel 2.6.39 works nicely.
> 
> How this hurts us in real life: We have a very high performance
> game server where MySQL has to do many writes alongside the reads.
> All writes and reads are very simple and have to be very quick. If
> we run the system with Linux 3.2 we get unacceptable performance.
> Now we are stuck with the 2.6.32 kernel here because of this problem.
> 
> I attach a test program written by me which shows the problem. The
> program just writes blocks continuously to random positions in a given
> big file. The write rate is limited to 100 MByte/s. On a well-working
> kernel it should run at a constant 100 MByte/s for an indefinitely
> long time. The test has to be run on a simple HDD.
> 
> Test steps:
> 1. You have to use an XFS, EXT2 or ReiserFS partition for the test;
> Ext4 forces flushes periodically. I recommend using XFS.
> 2. Create a big file on the test partition. For 8 GByte RAM you can
> create a 2 GByte file. For 2 GB RAM I recommend creating a 500 MByte
> file. File creation can be done with this command:  dd if=/dev/zero
> of=bigfile2048M.bin bs=1M count=2048
> 3. Compile pdflushtest.c: (gcc -o pdflushtest pdflushtest.c)
> 4. Run pdflushtest: ./pdflushtest --file=/where/is/the/bigfile2048M.bin
> 
> In the beginning there can be some slowness even on well-working
> kernels. If you create the bigfile in the same run then it usually
> runs smoothly from the beginning.
> 
> I don't know a setting of the /proc/sys/vm variables which makes this
> test run smoothly on a 3.2.29 (3.0+) kernel. I think this is a kernel
> bug, because if /proc/sys/vm/dirty_bytes is much larger than the
> testfile size the test program should never be blocked.
  I've run your program and I can confirm your results. As a side note,
your test program has a bug: it uses 'int' for offset arithmetic, so when
the file is larger than 2 GB you can hit some problems, but for our case
that's not really important.

The regression you observe is caused by commit 3d08bcc8 "mm: Wait for
writeback when grabbing pages to begin a write". At first sight I was
somewhat surprised when I saw that code path in the traces, but when I
did some math it became clear. What the commit does is that when a page is just
being written out to disk, we don't allow its contents to be changed and
wait for the IO to finish before letting the next write proceed. Now if you have
a 1 GB file, that's 256000 pages. By the observation from my test machine,
the writeback code keeps around 10000 pages in flight to disk at any moment
(this number fluctuates a lot but the average is around that number). Your
program dirties about 25600 pages per second. So the probability that one of
the dirtied pages is a page under writeback is equal to 1 for all practical
purposes (precisely, it is 1-(1-10000/256000)^25600). Actually, on average
you are going to hit about 1000 pages under writeback per second, which
clearly has a noticeable impact (even a single page can). Pity I didn't
do the math when we were considering those patches.
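(The arithmetic above is easy to double-check; a tiny sketch, using the
quoted figures of 256000 file pages, roughly 10000 pages in flight and
25600 pages dirtied per second:)

/* Sanity check of the numbers above; compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double file_pages      = 256000.0;
    double pages_in_flight = 10000.0;
    double dirtied_per_sec = 25600.0;

    /* probability that at least one page dirtied within a second is
     * currently under writeback */
    double p_hit = 1.0 - pow(1.0 - pages_in_flight / file_pages,
                             dirtied_per_sec);

    /* expected number of dirtied pages per second that hit writeback */
    double hits_per_sec = dirtied_per_sec * pages_in_flight / file_pages;

    printf("P(at least one hit per second) ~ %.6f\n", p_hit);
    printf("expected hits per second       ~ %.0f\n", hits_per_sec);
    return 0;
}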

There were plans to avoid waiting if the underlying storage doesn't need it, but
I'm not sure how far those plans got (added a couple of relevant CCs).
Anyway, yours is about the second or third real workload that sees a regression
due to "stable pages", so we have to fix that sooner rather than later... Thanks
for your detailed report!

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.0+ Disk performance problem - wrong pdflush behaviour

2012-10-10 Thread Jan Kara
  Hello,

On Tue 09-10-12 11:41:16, Viktor Nagy wrote:
 Since Kernel version 3.0 pdflush blocks writes even the dirty bytes
 are well below /proc/sys/vm/dirty_bytes or /proc/sys/vm/dirty_ratio.
 The kernel 2.6.39 works nice.
 
 How this hurt us in the real life: We have a very high performance
 game server where the MySQL have to do many writes along the reads.
 All writes and reads are very simple and have to be very quick. If
 we run the system with Linux 3.2 we get unacceptable performance.
 Now we are stuck with 2.6.32 kernel here because this problem.
 
 I attach the test program wrote by me which shows the problem. The
 program just writes blocks continously to random position to a given
 big file. The write rate limited to 100 MByte/s. In a well-working
 kernel it have to run with constant 100 MBit/s speed for indefinite
 long. The test have to be run on a simple HDD.
 
 Test steps:
 1. You have to use an XFS, EXT2 or ReiserFS partition for the test,
 Ext4 forces flushes periodically. I recommend to use XFS.
 2. create a big file on the test partiton. For 8 GByte RAM you can
 create a 2 GByte file. For 2 GB RAM I recommend to create 500MByte
 file. File creation can be done with this command:  dd if=/dev/zero
 of=bigfile2048M.bin bs=1M count=2048
 3. compile pdflushtest.c: (gcc -o pdflushtest pdflushtest.c)
 4. run pdflushtest: ./pdflushtest --file=/where/is/the/bigfile2048M.bin
 
 In the beginning there can be some slowness even on well-working
 kernels. If you create the bigfile in the same run then it runs
 usually smootly from the beginning.
 
 I don't know a setting of /proc/sys/vm variables which runs this
 test smootly on a 3.2.29 (3.0+) kernel. I think this is a kernel
 bug, because if I have much more /proc/sys/vm/dirty_bytes than the
 testfile size the test program should never be blocked.
  I've run your program and I can confirm your results. As a side note,
your test program as a bug as it uses 'int' for offset arithmetics so when
the file is larger than 2 GB, you can hit some problems but for our case
that's not really important.

The regression you observe is caused by commit 3d08bcc8 mm: Wait for
writeback when grabbing pages to begin a write. At the first sight I was
somewhat surprised when I saw that code path in the traces but later when I
did some math it's clear. What the commit does is that when a page is just
being written out to disk, we don't allow it's contents to be changed and
wait for IO to finish before letting next write to proceed. Now if you have
1 GB file, that's 256000 pages. By the observation from my test machine,
writeback code keeps around 1 pages in flight to disk at any moment
(this number fluctuates a lot but average is around that number). Your
program dirties about 25600 pages per second. So the probability one of
dirtied pages is a page under writeback is equal to 1 for all practical
purposes (precisely it is 1-(1-1/256000)^25600). Actually, on average
you are going to hit about 1000 pages under writeback per second which
clearly has a noticeable impact (even single page can have). Pity I didn't
do the math when we were considering those patches.

There were plans to avoid waiting if underlying storage doesn't need it but
I'm not sure how far that plans got (added a couple of relevant CCs).
Anyway you are about second or third real workload that sees regression due
to stable pages so we have to fix that sooner rather than later... Thanks
for your detailed report!

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.0+ Disk performance problem - wrong pdflush behaviour

2012-10-10 Thread Viktor Nagy

Hi,

On 10/10/2012 06:57 PM, Jan Kara wrote:

   Hello,

On Tue 09-10-12 11:41:16, Viktor Nagy wrote:

Since kernel version 3.0, pdflush blocks writes even when the dirty bytes
are well below /proc/sys/vm/dirty_bytes or /proc/sys/vm/dirty_ratio.
Kernel 2.6.39 works nicely.

How this hurts us in real life: we have a very high performance
game server where MySQL has to do many writes along with the reads.
All writes and reads are very simple and have to be very quick. If
we run the system with Linux 3.2 we get unacceptable performance.
Now we are stuck with the 2.6.32 kernel here because of this problem.

I attach the test program written by me which shows the problem. The
program just writes blocks continuously to random positions in a given
big file. The write rate is limited to 100 MByte/s. On a well-working
kernel it has to run at a constant 100 MByte/s for indefinitely
long. The test has to be run on a simple HDD.

Test steps:
1. You have to use an XFS, EXT2 or ReiserFS partition for the test;
Ext4 forces flushes periodically. I recommend using XFS.
2. Create a big file on the test partition. For 8 GByte RAM you can
create a 2 GByte file. For 2 GB RAM I recommend creating a 500 MByte
file. File creation can be done with this command:  dd if=/dev/zero
of=bigfile2048M.bin bs=1M count=2048
3. Compile pdflushtest.c: (gcc -o pdflushtest pdflushtest.c)
4. Run pdflushtest: ./pdflushtest --file=/where/is/the/bigfile2048M.bin

In the beginning there can be some slowness even on well-working
kernels. If you create the bigfile in the same run then it usually runs
smoothly from the beginning.

I don't know a setting of the /proc/sys/vm variables which runs this
test smoothly on a 3.2.29 (3.0+) kernel. I think this is a kernel
bug, because if I have much more /proc/sys/vm/dirty_bytes than the
testfile size the test program should never be blocked.

   I've run your program and I can confirm your results. As a side note,
your test program has a bug as it uses 'int' for offset arithmetic, so when
the file is larger than 2 GB you can hit some problems, but for our case
that's not really important.
Sorry for the bug and maybe the poor implementation. I am much better in
Pascal than in C.
(You cannot make such a mistake in Pascal (FreePascal). Is there a way,
a compiler switch, in C/C++ to get a warning for this?)
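
To make the 'int' offset problem concrete, here is an illustrative snippet
(not taken from pdflushtest.c) showing the silent truncation and the usual
fix of widening before the multiplication. gcc -Wall stays quiet about it;
about the best you can get is a runtime check such as -fsanitize=undefined
in newer GCC/Clang builds.

#include <stdio.h>
#include <sys/types.h>

int main(void)
{
	int blocksize  = 4096;
	int blockindex = 600000;                     /* ~2.3 GB into the file            */

	off_t wrong = blockindex * blocksize;        /* int * int overflows, then widens */
	off_t right = (off_t)blockindex * blocksize; /* widen first, then multiply       */

	/* On 32-bit builds add -D_FILE_OFFSET_BITS=64 so off_t itself is 64-bit. */
	printf("wrong: %lld, right: %lld\n", (long long)wrong, (long long)right);
	return 0;
}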


The regression you observe is caused by commit 3d08bcc8 "mm: Wait for
writeback when grabbing pages to begin a write". At first sight I was
somewhat surprised when I saw that code path in the traces, but later when I
did some math it became clear. What the commit does is that when a page is
just being written out to disk, we don't allow its contents to be changed
and wait for the IO to finish before letting the next write proceed. Now if
you have a 1 GB file, that's 256000 pages. From what I observed on my test
machine, the writeback code keeps around 10000 pages in flight to disk at
any moment (this number fluctuates a lot but the average is around that
number). Your program dirties about 25600 pages per second. So the
probability that one of the dirtied pages is a page under writeback is equal
to 1 for all practical purposes (precisely it is 1-(1-10000/256000)^25600).
Actually, on average you are going to hit about 1000 pages under writeback
per second, which clearly has a noticeable impact (even a single page can).
Pity I didn't do the math when we were considering those patches.

There were plans to avoid waiting if the underlying storage doesn't need it,
but I'm not sure how far those plans got (added a couple of relevant CCs).
Anyway, yours is about the second or third real workload that sees a
regression due to stable pages, so we have to fix that sooner rather than
later... Thanks
for your detailed report!

Honza

Thank you for your response!

I'm very happy that I've found the right people.

We develop a game server which gets a very high load in some countries. We
are trying to serve as many players as possible with one server.
Currently the CPU usage is below 50% at peak times. And with the
old kernel it runs smoothly. The pdflush runs non-stop on the database
disk with ~3 MByte/s of writes (minimal reads).

This is at 43000 active sockets, 18000 rq/s, ~4 packets/s.
I think we are still below the theoretical limits of this server... but
only if the disk writes are never done in sync.


I will try the 3.2.31 kernel without the problematic commit (3d08bcc8
"mm: Wait for writeback when grabbing pages to begin a write").

Is it a good idea? Will it be worse than 2.6.32?

Thank you very much again!

Viktor
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.0+ Disk performance problem - wrong pdflush behaviour

2012-10-10 Thread Jan Kara
On Wed 10-10-12 22:44:41, Viktor Nagy wrote:
 On 10/10/2012 06:57 PM, Jan Kara wrote:
Hello,
 
 On Tue 09-10-12 11:41:16, Viktor Nagy wrote:
 Since kernel version 3.0, pdflush blocks writes even when the dirty bytes
 are well below /proc/sys/vm/dirty_bytes or /proc/sys/vm/dirty_ratio.
 Kernel 2.6.39 works nicely.
 
 How this hurts us in real life: we have a very high performance
 game server where MySQL has to do many writes along with the reads.
 All writes and reads are very simple and have to be very quick. If
 we run the system with Linux 3.2 we get unacceptable performance.
 Now we are stuck with the 2.6.32 kernel here because of this problem.
 
 I attach the test program written by me which shows the problem. The
 program just writes blocks continuously to random positions in a given
 big file. The write rate is limited to 100 MByte/s. On a well-working
 kernel it has to run at a constant 100 MByte/s for indefinitely
 long. The test has to be run on a simple HDD.
 
 Test steps:
 1. You have to use an XFS, EXT2 or ReiserFS partition for the test;
 Ext4 forces flushes periodically. I recommend using XFS.
 2. Create a big file on the test partition. For 8 GByte RAM you can
 create a 2 GByte file. For 2 GB RAM I recommend creating a 500 MByte
 file. File creation can be done with this command:  dd if=/dev/zero
 of=bigfile2048M.bin bs=1M count=2048
 3. Compile pdflushtest.c: (gcc -o pdflushtest pdflushtest.c)
 4. Run pdflushtest: ./pdflushtest --file=/where/is/the/bigfile2048M.bin
 
 In the beginning there can be some slowness even on well-working
 kernels. If you create the bigfile in the same run then it usually runs
 smoothly from the beginning.
 
 I don't know a setting of the /proc/sys/vm variables which runs this
 test smoothly on a 3.2.29 (3.0+) kernel. I think this is a kernel
 bug, because if I have much more /proc/sys/vm/dirty_bytes than the
 testfile size the test program should never be blocked.
I've run your program and I can confirm your results. As a side note,
 your test program has a bug as it uses 'int' for offset arithmetic, so when
 the file is larger than 2 GB you can hit some problems, but for our case
 that's not really important.
 Sorry for the bug and maybe the poor implementation. I am much
 better in Pascal than in C.
 (You cannot make such a mistake in Pascal (FreePascal). Is there a way,
 a compiler switch, in C/C++ to get a warning for this?)
  Actually I somewhat doubt that even FreePascal is able to give you a
warning that arithmetic can overflow...

 The regression you observe is caused by commit 3d08bcc8 "mm: Wait for
 writeback when grabbing pages to begin a write". At first sight I was
 somewhat surprised when I saw that code path in the traces, but later when I
 did some math it became clear. What the commit does is that when a page is
 just being written out to disk, we don't allow its contents to be changed
 and wait for the IO to finish before letting the next write proceed. Now if
 you have a 1 GB file, that's 256000 pages. From what I observed on my test
 machine, the writeback code keeps around 10000 pages in flight to disk at
 any moment (this number fluctuates a lot but the average is around that
 number). Your program dirties about 25600 pages per second. So the
 probability that one of the dirtied pages is a page under writeback is equal
 to 1 for all practical purposes (precisely it is 1-(1-10000/256000)^25600).
 Actually, on average you are going to hit about 1000 pages under writeback
 per second, which clearly has a noticeable impact (even a single page can).
 Pity I didn't do the math when we were considering those patches.
 
 There were plans to avoid waiting if the underlying storage doesn't need it,
 but I'm not sure how far those plans got (added a couple of relevant CCs).
 Anyway, yours is about the second or third real workload that sees a
 regression due to stable pages, so we have to fix that sooner rather than
 later... Thanks
 for your detailed report!
 
  Honza
 Thank you for your response!
 
 I'm very happy that I've found the right people.
 
 We develop a game server which gets a very high load in some
 countries. We are trying to serve as many players as possible with
 one server.
 Currently the CPU usage is below 50% at peak times. And with
 the old kernel it runs smoothly. The pdflush runs non-stop on the
 database disk with ~3 MByte/s of writes (minimal reads).
 This is at 43000 active sockets, 18000 rq/s, ~4 packets/s.
 I think we are still below the theoretical limits of this server...
 but only if the disk writes are never done in sync.
 
 I will try the 3.2.31 kernel without the problematic commit
 (3d08bcc8 "mm: Wait for writeback when grabbing pages to begin a
 write").
 Is it a good idea? Will it be worse than 2.6.32?
  Running without that commit should work just fine unless you use
something exotic like DIF/DIX or similar. Whether things will be worse than
in 2.6.32 I cannot say. For me, your test program behaves fine 

Linux 3.0+ Disk performance problem - wrong pdflush behaviour

2012-10-09 Thread Viktor Nagy

Hello,

Since kernel version 3.0, pdflush blocks writes even when the dirty bytes
are well below /proc/sys/vm/dirty_bytes or /proc/sys/vm/dirty_ratio.

Kernel 2.6.39 works nicely.

How this hurts us in real life: we have a very high performance game
server where MySQL has to do many writes along with the reads. All
writes and reads are very simple and have to be very quick. If we run
the system with Linux 3.2 we get unacceptable performance. Now we are
stuck with the 2.6.32 kernel here because of this problem.

I attach the test program written by me which shows the problem. The
program just writes blocks continuously to random positions in a given big
file. The write rate is limited to 100 MByte/s. On a well-working kernel it
has to run at a constant 100 MByte/s for indefinitely long. The test
has to be run on a simple HDD.

Test steps:
1. You have to use an XFS, EXT2 or ReiserFS partition for the test; Ext4
forces flushes periodically. I recommend using XFS.
2. Create a big file on the test partition. For 8 GByte RAM you can
create a 2 GByte file. For 2 GB RAM I recommend creating a 500 MByte file.
File creation can be done with this command:  dd if=/dev/zero
of=bigfile2048M.bin bs=1M count=2048
3. Compile pdflushtest.c: (gcc -o pdflushtest pdflushtest.c)
4. Run pdflushtest: ./pdflushtest --file=/where/is/the/bigfile2048M.bin

In the beginning there can be some slowness even on well-working
kernels. If you create the bigfile in the same run then it usually runs
smoothly from the beginning.

I don't know a setting of the /proc/sys/vm variables which runs this test
smoothly on a 3.2.29 (3.0+) kernel. I think this is a kernel bug, because
if I have much more "/proc/sys/vm/dirty_bytes" than the testfile size
the test program should never be blocked.
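
For completeness, a trivial helper to dump the writeback sysctls being
discussed, so their values can be pasted next to a test run (this is only
an illustration; reading the files with cat works just as well):

#include <stdio.h>

int main(void)
{
	const char *knobs[] = {
		"/proc/sys/vm/dirty_bytes",
		"/proc/sys/vm/dirty_ratio",
		"/proc/sys/vm/dirty_background_bytes",
		"/proc/sys/vm/dirty_background_ratio",
		"/proc/sys/vm/dirty_expire_centisecs",
		"/proc/sys/vm/dirty_writeback_centisecs",
	};
	unsigned i;

	for (i = 0; i < sizeof(knobs) / sizeof(knobs[0]); i++) {
		long value = -1;
		FILE *f = fopen(knobs[i], "r");

		if (f) {
			if (fscanf(f, "%ld", &value) != 1)
				value = -1;   /* unreadable or empty */
			fclose(f);
		}
		printf("%-40s %ld\n", knobs[i], value);
	}
	return 0;
}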


A sample result:
11:20:05: speed:   99.994 MiB/s, time usage:  4.60 %, avg. block time:   1.8 us, max. block time: 18 us
11:20:06: speed:   99.994 MiB/s, time usage:  4.60 %, avg. block time:   1.8 us, max. block time: 10 us
11:20:07: speed:   99.994 MiB/s, time usage:  4.62 %, avg. block time:   1.8 us, max. block time: 11 us
11:20:08: speed:   99.989 MiB/s, time usage:  4.59 %, avg. block time:   1.8 us, max. block time: 58 us
11:20:09: speed:   99.997 MiB/s, time usage:  4.55 %, avg. block time:   1.8 us, max. block time: 13 us
11:20:10: speed:   28.840 MiB/s, time usage: 96.47 %, avg. block time: 130.7 us, max. block time: 114076 us
11:20:11: speed:   30.505 MiB/s, time usage: 98.14 %, avg. block time: 125.7 us, max. block time: 135008 us
11:20:12: speed:   25.956 MiB/s, time usage: 99.71 %, avg. block time: 150.1 us, max. block time: 129839 us
11:20:13: speed:   25.088 MiB/s, time usage: 96.43 %, avg. block time: 150.1 us, max. block time: 149649 us
11:20:14: speed:   32.438 MiB/s, time usage: 98.64 %, avg. block time: 118.8 us, max. block time: 145649 us
11:20:15: speed:   22.765 MiB/s, time usage: 99.11 %, avg. block time: 170.1 us, max. block time: 159749 us


At 11:20:10 the pdflush started its work, based on
"dirty_expire_centisecs". The test file was 2 GByte, the
/proc/sys/vm/dirty_bytes was 40. The system (i5, 3.4 GHz, 8 GB RAM,
kernel 3.2.29-amd64) was booted to run this test only, so it had
2.2 GByte RAM for cache and 5.1 GByte RAM free (totally unused).


Sorry if I have not found the right place to report this; please advise me
where to send it.


Best regards
Viktor Nagy

#include <stdio.h>
#include <string.h>
#include <getopt.h>
#include <inttypes.h>
#include <sys/time.h>
#include <time.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>

// parameters

char * filename = NULL;
enum testermode {READ, WRITE} mode = WRITE;
float speedlimit = 100; // MiB / s
int burstcount = 10;

// internals

int blocksize = 4096;
int blockcount = 0;

int fh = 0;

char filebuffer[65536];

static struct option long_options[] = 
{
	{"mode", required_argument, 0,  0 },
	{"m", required_argument, 0,  0 },
	{"file",required_argument, 0,  0 },
	{"f",required_argument, 0,  0 },
	{"limit",required_argument, 0,  0 },
	{"l",required_argument, 0,  0 },
	{"burst",required_argument, 0,  0 },
	{"b",required_argument, 0,  0 },
	{0, 0, 0,  0 }
};

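// Parse the long options declared above (--file, --mode, --limit, --burst
// and their one-letter aliases) and fill in the matching globals.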
void parse_commandline(int argc, char **argv)
{
	int r;
	int option_index = 0;

	while (0 <= (r = getopt_long_only(argc, argv, "", long_options, &option_index)))
	{
		const char * optname = long_options[option_index].name;
		if (r == 0)
		{
			if (strcmp("file", optname) == 0 || strcmp("f", optname) == 0)
			{
				filename = strdup(optarg);
			}
			else if (strcmp("mode", optname) == 0 || strcmp("m", optname) == 0)
			{
				if (strcmp("reader", optarg) == 0 || strcmp("read", optarg) == 0)
				{
					mode = READ;
				}
				else if (strcmp("writer", optarg) == 0 || strcmp("write", optarg) == 0)
				{
					mode = WRITE;
				}
				else
				{
					printf("Invalid mode: \"%s\"\n", optarg);
				}
			}
			else if (strcmp("burst", optname) == 0 || strcmp("b", optname) == 0)
			{
				burstcount = 
