Re: [PATCH] sched: avoid large irq-latencies in smp-balancing

2007-11-07 Thread Eric St-Laurent

On Wed, 2007-11-07 at 17:10 -0500, Steven Rostedt wrote:
> > 
> > It would be nice if sched_nr_migrate didn't exist, really.  It's hard to
> > imagine anyone wanting to tweak it, apart from developers.
> 
> I'm not so sure about that. It is a tunable for RT. That is we can tweak
> this value to be smaller if we don't like the latencies it gives us.
> 
> This is one of those things that sacrifices performance for latency.
> The higher the number, the better it can spread tasks around, but it
> also causes large latencies.
> 
> I've just included this patch into 2.6.23.1-rt11 and it brought down an
> unbounded latency to just 42us. (previously we got into the
> milliseconds!).
> 
> Perhaps when this feature matures, we can come to a good defined value
> that would be good for all. But until then, I recommend keeping this a
> tunable.


Why not use the latency-expectation infrastructure?

Iterate under the lock only for as long as the system-wide latency
expectation is still respected (or, better, stop just before it is
exceeded).
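
A kernel-style sketch of that idea, assuming a hypothetical time budget
(sched_clock() is real; move_tasks_budgeted(), pull_one_task() and
latency_budget_ns are made-up names, not part of the patch):

/*
 * Illustrative only: bound the balancing loop by a time budget instead
 * of a fixed sched_nr_migrate count.  Caller holds both rq locks.
 */
static int move_tasks_budgeted(struct rq *this_rq, struct rq *busiest,
			       u64 latency_budget_ns)
{
	u64 start = sched_clock();
	int moved = 0;

	while (busiest->nr_running > 1) {
		/* stop before the global latency expectation is blown */
		if (sched_clock() - start >= latency_budget_ns)
			break;
		if (!pull_one_task(this_rq, busiest))	/* hypothetical helper */
			break;
		moved++;
	}
	return moved;
}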


- Eric


Re: [PATCH 1/4] stringbuf: A string buffer implementation

2007-10-23 Thread Eric St-Laurent

On Tue, 2007-10-23 at 20:35 -0600, Matthew Wilcox wrote:

[...]

> > Multiple string objects can share the same data, by increasing the nrefs
> > count, a new data is allocated if the string is modified and nrefs > 1.
> 
> If we were trying to get rid of char * throughout the kernel, that might
> make some sense; stringbufs have a more limited target though.
> 

[...]

No contest; my suggestions only make sense for a general-purpose string
library.

I suspect most in-kernel string manipulations are limited to preparing
buffers to be copied to (and read from) user space.


- Eric


Re: [PATCH 1/4] stringbuf: A string buffer implementation

2007-10-23 Thread Eric St-Laurent

On Tue, 2007-10-23 at 17:12 -0400, Matthew Wilcox wrote:
> Consecutive calls to printk are non-atomic, which leads to various
> implementations for accumulating strings which can be printed in one call.
> This is a generic string buffer which can also be used for non-printk
> purposes.  There is no sb_scanf implementation yet as I haven't identified
> a user for it.
> 
> +
> +struct stringbuf {
> + char *s;
> + int alloc;
> + int len;
> +};
> +

I don't know if copy-on-write semantics are really useful for current
in-kernel uses, but I've coded and used a C++ string class like this in
the past:

struct string_data
{
int nrefs;
unsigned len;
unsigned capacity;
//char data[capacity];  /* allocated along with string_data */
};

struct string   /* or typedef in C... */
{
struct string_data *data;
};

[ struct string_data is a hidden implementation detail, only struct
string is exposed ]

Multiple string objects can share the same data by increasing the nrefs
count; new data is allocated if the string is modified while nrefs > 1.

Not having to iterate over the string to calculate its length,
over-allocating the buffer to avoid re-allocations, and copy-on-write
semantics make a string like this a vast performance improvement over a
normal C string, for a minimal memory cost (about 3 ints per data
buffer).

Used correctly, it can also prevent buffer overflows.

The string stored in data is still always null-terminated, so it can be
used directly as a normal C string.

You can also statically allocate one empty string_data which is shared
by all "uninitialized" or empty strings.


Even without copy-on-write semantics and reference counting, I think
this approach is better because it uses one less "object" and one less
allocation:

struct string - "handle" (pointer really) to string data
struct string_data - string data

versus:

struct stringbuf *sb - pointer to string object
struct stringbuf - string object
char *s (member of stringbuf) - string data



Best regards,

- Eric


Re: Linux 2.6.23-rc9 and a heads-up for the 2.6.24 series..

2007-10-02 Thread Eric St-Laurent

On Tue, 2007-10-02 at 11:17 +0200, Thomas Gleixner wrote:

[...]

> I have uploaded an update of the arch/x86 tree based on -rc9 to
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-x86.git x86
> 

[...]

> If there is anything we can help with the transition, please do not
> hesitate to ask.
> 
> Thanks,
> 
>   Thomas, Ingo

Hi Thomas,

This latest x86 branch builds and boots without problem with my usual
x86_64 config.

If you remember our conversation one month ago, I was unable to build
your tree.

I've upgraded my Ubuntu distribution from 7.04 to 7.10 beta this week;
maybe that fixed it.

But I still had to do some manual fixes to get the packaging steps
working:

mkdir arch/x86_64/boot/
ln -s ../../../arch/x86/boot/bzImage arch/x86_64/boot/bzImage


Best regards,

- Eric


Re: yield API

2007-10-02 Thread Eric St-Laurent

On Tue, 2007-10-02 at 08:46 +0200, Ingo Molnar wrote:

[...]

> APIs that are not in any real, meaningful use, despite a decade of 
> presence are not really interesting to me personally. (especially in 
> this case where we know exactly _why_ the API is used so rarely.) Sure 
> we'll continue to support it in the best possible way, with the usual 
> kernel maintainance policy: without hurting other, more commonly used 
> APIs. That was the principle we followed in previous schedulers too. And 
> if anyone has a patch to make sched_yield() better than it is today, i'm 
> of course interested in it.

Do you still intend to add a directed-yield API?  I remember seeing it
in the earlier CFS patches.


- Eric


Re: [RFC/PATCH] Add sysfs control to modify a user's cpu share

2007-10-02 Thread Eric St-Laurent

On Mon, 2007-10-01 at 16:44 +0200, Ingo Molnar wrote:
> > Adds tunables in sysfs to modify a user's cpu share.
> > 
> > A directory is created in sysfs for each new user in the system.
> > 
> > /sys/kernel/uids/<uid>/cpu_share
> > 
> > Reading this file returns the cpu shares granted for the user.
> > Writing into this file modifies the cpu share for the user. Only an 
> > administrator is allowed to modify a user's cpu share.
> > 
> > Ex: 
> > # cd /sys/kernel/uids/
> > # cat 512/cpu_share
> > 1024
> > # echo 2048 > 512/cpu_share
> > # cat 512/cpu_share
> > 2048
> > #
> 
> looks good to me! I think this API is pretty straightforward. I've put 
> this into my tree and have updated the sched-devel git tree:
> 

While a sysfs interface is OK and somewhat orthogonal to the interface
proposed by the containers patches, I think maybe a new syscall should
be considered.

Since we now have a fair-share cpu scheduler, maybe an interface to
specify the cpu share directly (as an alternative to priority) makes
sense.

For processes, it may be more intuitive (and precise) to set the
processing share directly than to set a priority which is converted to
a share.

Maybe something similar to the ioprio_set() and ioprio_get() syscalls,
covering (a rough sketch follows the list):

- per user cpu share
- per user group cpu share
- per process cpu share
- per process group cpu share
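
A hypothetical prototype, purely for illustration (no such syscalls
exist; the which/who encoding is only borrowed from ioprio_set() for
the example):

/* 'which' selects the kind of scheduling entity 'who' refers to */
enum {
	CPUSHARE_WHO_PROCESS,	/* who is a pid */
	CPUSHARE_WHO_PGRP,	/* who is a process group */
	CPUSHARE_WHO_USER,	/* who is a uid */
	CPUSHARE_WHO_UGROUP,	/* who is a gid (user group) */
};

long sys_cpushare_set(int which, int who, unsigned long share);
long sys_cpushare_get(int which, int who);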


Best regards,

- Eric


Re: 2.6.23-rc1-mm2 (vm-dont-run-touch_buffer-during-buffercache-lookups.patch)

2007-08-01 Thread Eric St-Laurent
On Wed, 2007-01-08 at 00:46 -0700, Andrew Morton wrote:

> Or you could do something more real-worldly like start up OO, firefox and
> friends, then run /etc/cron.daily/everything and see what the
> before-and-after effects are.  The aggregate info we're looking for is
> captured in /proc/meminfo: swapped, Mapped, Cached, Buffers.

IMO it will be harder to come up with reproducible numbers; everyone's
desktop is different, as are their filesystem contents.

Anyway I will cook up something and post it.  It might be useful for
others to understand the updatedb problem.

I intend to try only this specific patch, not the full -mm; is there any
other patch I need to apply too?


- Eric


Re: 2.6.23-rc1-mm2 (vm-dont-run-touch_buffer-during-buffercache-lookups.patch)

2007-08-01 Thread Eric St-Laurent
On Tue, 2007-31-07 at 23:09 -0700, Andrew Morton wrote:

> +vm-dont-run-touch_buffer-during-buffercache-lookups.patch
> 
>  A little VM experiment.  See changelog for details.

> We don't have any tests to determine the effects of this, and nobody will
> bother setting one up, so ho hum, this remains in -mm for ever.

> I don't think there's any point in doing this until we have some decent
> testcases.


Hi Andrew,


Which problem was this patch coded for?  Is it a potential fix for the
updatedb problem?

Is the patch effective without the filesystem-dependent change you talk
about?  (I use reiserfs)

I've been thinking about a test case for the updatedb problem:

1. A script or program that creates a large number of directories and
zero-sized files (a sketch follows the list), so everyone has the same
setup and reproducible results.

2. Run updatedb on those.

3. Observe the effects (with vmstat, slabinfo and meminfo) before,
during and after the updatedb run.

4. Do something to trigger some reclaim like copying a large file.

5. See the effects.
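
A minimal sketch of step 1, assuming nothing beyond POSIX (the counts
and layout are arbitrary):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define DIRS	1000	/* arbitrary sizes; scale to taste */
#define FILES	100

int main(void)
{
	char path[64];
	int d, f, fd;

	for (d = 0; d < DIRS; d++) {
		snprintf(path, sizeof(path), "dir%04d", d);
		if (mkdir(path, 0755) == -1 && errno != EEXIST) {
			perror("mkdir");
			return EXIT_FAILURE;
		}
		for (f = 0; f < FILES; f++) {
			snprintf(path, sizeof(path), "dir%04d/file%04d", d, f);
			fd = open(path, O_CREAT | O_WRONLY, 0644);
			if (fd == -1) {
				perror("open");
				return EXIT_FAILURE;
			}
			close(fd);	/* zero-sized file: create and close */
		}
	}
	return EXIT_SUCCESS;
}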


What do you think? What would be the ideal test case for the problem in
your opinion?


Best regards,

- Eric


[BUG] fadvise POSIX_FADV_NOREUSE does nothing

2007-07-29 Thread Eric St-Laurent
Related to my other bug report today, calling posix_fadvise (which uses
fadvise64) with the POSIX_FADV_NOREUSE flag does nothing.  The pages are
not dropped behind.

I also tried calling fadvise with POSIX_FADV_SEQUENTIAL first.

This is expected, as POSIX_FADV_NOREUSE is a no-op in recent kernels.

Also, POSIX_FADV_SEQUENTIAL only adjusts the readahead window.  It
doesn't hint the VM in any way to possibly drop the pages behind.
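
For what it's worth, the one fadvise hint that does drop page cache
today is POSIX_FADV_DONTNEED.  A sketch of the copy loop using it
instead (same structure as the attached fadvise_cp.c; note DONTNEED
only discards pages that are already written back, so the dirty output
pages need a flush first, and doing it per chunk like this is expensive):

		/* sketch: drop-behind via DONTNEED instead of NOREUSE */
		fdatasync(out);		/* write back dirty pages first */
		posix_fadvise(in, pos, count, POSIX_FADV_DONTNEED);
		posix_fadvise(out, pos, count, POSIX_FADV_DONTNEED);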

(See the previous bug report for more details of the test case)

Relevant numbers:

Copying (using fadvise_cp) a large file test:

1st run: 0m9.018s
2nd run: 0m3.444s
Copying large file...
3rd run: 0m14.024s  <<< page cache trashed
4th run: 0m3.449s

Test programs and batch files are attached.


- Eric

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int in;
	int out;
	int pagesize; 
	void *buf;
	off_t pos;

	if (argc != 3) {
		printf("Usage: %s  \n", argv[0]); 
		return EXIT_FAILURE;
	}

	in = open(argv[1], O_RDONLY, 0);
	out = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, 0666);

	posix_fadvise(in, 0, 0, POSIX_FADV_SEQUENTIAL);
	posix_fadvise(out, 0, 0, POSIX_FADV_SEQUENTIAL);

	pagesize = getpagesize();
	buf = malloc(pagesize);

	pos = 0;

	for (;;) {
		ssize_t count;

		count = read(in, buf, pagesize);
		if (!count || count == -1)
			break;

		write(out, buf, count);

		/* right usage pattern? */
		posix_fadvise(in, pos, count, POSIX_FADV_NOREUSE);
		posix_fadvise(out, pos, count, POSIX_FADV_NOREUSE);

		pos += count;
	}

	free(buf);
	close(in);
	close(out);

	return EXIT_SUCCESS;
}
all:
	gcc fadvise_cp.c -o fadvise_cp
	gcc working_set_simul.c -o working_set_simul



use-once-test.sh
Description: application/shellscript
#include <fcntl.h>
#include <memory.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h> 

int main(int argc, char *argv[])
{
	int fd;
	off_t size;
	char *mapping;
	unsigned r;
	unsigned i;

	if (argc != 2) {
		printf("Usage: %s \n", argv[0]); 
		return EXIT_FAILURE;
	}

	fd = open(argv[1], O_RDONLY, 0);
	size = lseek(fd, 0, SEEK_END); 

	mapping = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);

	/* access (read) the file a couple of times*/
	for (r = 0; r < 4; r++) {
		for (i = 0; i < size; i++) {
			char t = mapping[i];
		}
	}

	munmap(mapping, size);
	close(fd);

	return EXIT_SUCCESS;
}


[BUG] Linux VM use-once mechanisms don't work (test case with numbers included)

2007-07-29 Thread Eric St-Laurent
Linux VM use-once mechanisms don't seem to work.  A simple scenario like
streaming a file much larger than physical RAM should be identified, to
avoid trashing the page cache with useless data.

I know the VM cannot predict the future or assume anything about the
user's intent.  But this workload is simple and common; it should be
detected and handled better.

Test case:

Linux 2.6.20-16-lowlatency SMP PREEMPT x86_64 (also tried on 2.6.23-rc1)

- A file of 1/3 the RAM size is created, mapped and frequently accessed
(4 times).
- The test is run multiple times (4 total) to time its execution.
- After the first run, the other runs take much less time, because the
file is cached.
- A previously created file, 4 times the size of the RAM, is read or
copied.
- The test is re-run (2 times) to time its execution.

To test:

$ make
# ./use-once-test.sh

Some big files will be created in your /tmp.  They don't get erased
after the test, to speed up multiple runs.

Results:

- The test execution time greatly increases after reading or copying the
large file.
- Frequently used data gets kicked out of the page cache and replaced
with useless read-once data.
- Neither the read-only nor the copy (read + write) case works.

I believe this clearly illustrates the slowdowns I experience after I
copy large files around my system.  All applications on my desktop are
jerky for some moments after that.  Watching a DVD is another example.

Base test:

1st run: 0m8.958s
2nd run: 0m3.442s
3rd run: 0m3.452s
4th run: 0m3.443s

Reading a large file test:

1st run: 0m8.997s
2nd run: 0m3.522s
`/tmp/large_file' -> `/dev/null'
3rd run: 0m8.999s  <<< page cache trashed
4th run: 0m3.440s

Copying (using cp) a large file test:

1st run: 0m8.979s
2nd run: 0m3.442s
`/tmp/large_file' -> `/tmp/large_file.copy'
3rd run: 0m13.814s  <<< page cache trashed
4th run: 0m3.455s

Copying (using fadvise_cp) a large file test:

1st run: 0m9.018s
2nd run: 0m3.444s
Copying large file...
3rd run: 0m14.024s  <<< page cache trashed
4th run: 0m3.449s

Copying (using splice-cp) a large file test:

1st run: 0m8.977s
2nd run: 0m3.442s
Copying large file...
3rd run: 0m14.118s  <<< page cache trashed
4th run: 0m3.456s

Possible solutions:

Various patches to fix the use-once mechanisms were discussed in the
past.  Some more than 6 years ago, some more recently.

http://lwn.net/2001/0726/a/2q.php3
http://lkml.org/lkml/2005/5/3/6
http://lkml.org/lkml/2006/7/17/192
http://lkml.org/lkml/2007/7/9/340
http://lkml.org/lkml/2007/7/21/219 (*1)

(*1) I have tested Peter's patch with some success.  It fixes the read
case, but not the copy case.  Results: http://lkml.org/lkml/2007/7/24/527

Test programs and batch files are attached.


- Eric

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int in;
	int out;
	int pagesize; 
	void *buf;
	off_t pos;

	if (argc != 3) {
		printf("Usage: %s  \n", argv[0]); 
		return EXIT_FAILURE;
	}

	in = open(argv[1], O_RDONLY, 0);
	out = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, 0666);

	posix_fadvise(in, 0, 0, POSIX_FADV_SEQUENTIAL);
	posix_fadvise(out, 0, 0, POSIX_FADV_SEQUENTIAL);

	pagesize = getpagesize();
	buf = malloc(pagesize);

	pos = 0;

	for (;;) {
		ssize_t count;

		count = read(in, buf, pagesize);
		if (!count || count == -1)
			break;

		write(out, buf, count);

		/* right usage pattern? */
		posix_fadvise(in, pos, count, POSIX_FADV_NOREUSE);
		posix_fadvise(out, pos, count, POSIX_FADV_NOREUSE);

		pos += count;
	}

	free(buf);
	close(in);
	close(out);

	return EXIT_SUCCESS;
}
all:
	gcc fadvise_cp.c -o fadvise_cp
	gcc working_set_simul.c -o working_set_simul



use-once-test.sh
Description: application/shellscript
#include <fcntl.h>
#include <memory.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h> 

int main(int argc, char *argv[])
{
	int fd;
	off_t size;
	char *mapping;
	unsigned r;
	unsigned i;

	if (argc != 2) {
		printf("Usage: %s \n", argv[0]); 
		return EXIT_FAILURE;
	}

	fd = open(argv[1], O_RDONLY, 0);
	size = lseek(fd, 0, SEEK_END); 

	mapping = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);

	/* access (read) the file a couple of times*/
	for (r = 0; r < 4; r++) {
		for (i = 0; i < size; i++) {
			char t = mapping[i];
		}
	}

	munmap(mapping, size);
	close(fd);

	return EXIT_SUCCESS;
}


Re: [PATCH 0/3] readahead drop behind and size adjustment

2007-07-29 Thread Eric St-Laurent
On Wed, 2007-25-07 at 17:09 +1000, Nick Piggin wrote:
> Eric St-Laurent wrote:
> > I test this on my main system, so patches with basic testing and
> > reasonable stability are preferred. I just want to avoid data corruption
> > bugs. FYI, I used to run the -rt tree most of the time.
> 
> OK here is one which just changes the rate that the active and inactive
> lists get scanned. Data corruption bugs should be minimal ;)
> 

Nick,

I have tried your patch with my test case; unfortunately, it doesn't
help.

Numbers did vary a little bit more, and it seemed drop_caches (used
between the runs) was not working as well as usual.

Also, overall the runs took about 0.1s more to complete.


Linux 2.6.23-rc1-nick PREEMPT x86_64

Base test:

1st run: 0m9.123s
2nd run: 0m3.565s
3rd run: 0m3.553s
4th run: 0m3.565s

Reading a large file test:

1st run: 0m9.146s
2nd run: 0m3.560s
`/tmp/large_file' -> `/dev/null'
3rd run: 0m19.759s
4th run: 0m3.515s

Copying (using cp) a large file test:

1st run: 0m9.085s
2nd run: 0m3.522s
`/tmp/large_file' -> `/tmp/large_file.copy'
3rd run: 0m9.977s
4th run: 0m3.518s


Anyway, what is the theory behind the patch?


- Eric


Re: [patch] sched: make cpu_clock() not use the rq clock

2007-07-26 Thread Eric St-Laurent
On Thu, 2007-26-07 at 11:00 +0200, Ingo Molnar wrote:
> Subject: sched: make cpu_clock() not use the rq clock
> From: Ingo Molnar <[EMAIL PROTECTED]>
> 
> it is enough to disable interrupts to get the precise rq-clock
> of the local CPU.

Hi Ingo,

Those new fast nanosecond-resolution clock APIs are nice, but it seems
to me that their naming, and _where_ they are implemented in the tree,
are a little odd.

We have:

1. sched_clock() is in kernel/sched.c (weak implementation)
2. sched_clock() is in arch/i386/kernel/tsc.c (architecture override)
3. rq_clock() is in kernel/sched.c
4. cpu_clock() is in kernel/sched.c

I would suggest:

1. rename sched_clock() (remove sched_, as it's not sched-specific
anymore) and place it in kernel/time/...
2. rename the architecture-specific version of it too.

This first function is the basic fast ns clock

3. base your rq_clock() on cpu_clock() (#4), or use the latter directly.

This is local to sched.c

4. move cpu_clock() in kernel/time/...

This is the per-cpu monotonic version.


See my point?  Base the scheduler clock on a general kernel API, not
the other way around.
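
A rough sketch of the suggested layering (fast_ns_clock() is a made-up
name for the renamed sched_clock(); cpu_clock() and rq_clock() are the
existing functions, shown only to illustrate the dependency direction):

/* kernel/time/... : the basic fast ns clock, today's sched_clock() */
u64 fast_ns_clock(void);	/* weak default; arch code may override */

/* kernel/time/... : the per-cpu monotonic version, today's cpu_clock() */
u64 cpu_clock(int cpu);

/* kernel/sched.c : scheduler-local, built on the general API */
static u64 rq_clock(struct rq *rq)
{
	return cpu_clock(cpu_of(rq));
}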

Just a suggestion.


Best regards,

- Eric


Re: [PATCH 0/3] readahead drop behind and size adjustment

2007-07-25 Thread Eric St-Laurent
On Wed, 2007-25-07 at 17:09 +1000, Nick Piggin wrote:

> 
> A new list could be a possibility. One problem with adding lists is just
> trying to work out how to balance scanning rates between them, another
> problem is CPU overhead of moving pages from one to another... 

Disk sizes seem to increase more rapidly than the ability to read them
quickly.  Fortunately, processing power increases greatly too.

It may be a good idea to spend more CPU cycles to better decide how the
VM should juggle this data.  We've got to keep those multi-core CPUs
busy.


> but don't
> let me stop you if you want to jump in and try something :)
> 

Well I might try a few things along the way.

But I prefer the thorough approach versus tinkering... 

- Read all research, check competition
- Build test virtual machines, with benchmarks and typical workloads
- Add (or use) some instrumentation to the pagecache
- Code a simulator
- Try all algorithms, tune them

This is way overkill for a part-time hobby.

If we don't see much work in this area, it's surely because it's really
not a problem anymore for most workloads.  Databases have their own
cache management and disk scheduling, file servers just add more RAM or
processors, etc.


> OK here is one which just changes the rate that the active and inactive
> lists get scanned. Data corruption bugs should be minimal ;)
> 

Will test.


- Eric


Re: [ck] Re: -mm merge plans for 2.6.23

2007-07-25 Thread Eric St-Laurent
On Wed, 2007-25-07 at 08:47 +0200, Mike Galbraith wrote:

> Heh.  Here we have a VM developer expressing his interest in the problem
> space, and you offer him a steaming jug of STFU because he doesn't say
> what you want to hear.  I wonder how many killfiles you just entered.
> 

Agreed.

(a bit OT)

People should understand that it's not (I think) about a desktop
workload vs enterprise workloads war.

I see it mostly as a progression-versus-regressions trade-off.  And
adding potentially useless or unmaintained code is a regression from
the maintainers' POV.

The best way to justify a patch and have it integrated is to have a
scientific testing method with repeatable numbers.

Con has done so for his patch, his benchmark demonstrated good
improvements.

But I feel some of his supporters have indirectly harmed his cause by
their comments.  Also, the fact that Con recently stopped maintaining
his work out of frustration doesn't help getting his patch merged.

Again I'm not personally pushing this patch, I don't need it.

Con has worked for many years on two areas that still cause problems for
desktop users: scheduler interactivity and pagecache trashing.  Now that
the scheduler has been fixed, let's have the VM fixed too.

Sorry for the slightly OT post, and please don't start a flame war...


- Eric


Re: -mm merge plans for 2.6.23

2007-07-25 Thread Eric St-Laurent
On Wed, 2007-25-07 at 15:37 +1000, Nick Piggin wrote:

> OK, this is where I start to worry. Swap prefetch AFAIKS doesn't fix
> the updatedb problem very well, because if updatedb has caused swapout
> then it has filled memory, and swap prefetch doesn't run unless there
> is free memory (not to mention that updatedb would have paged out other
> files as well).
> 
> And drop behind doesn't fix your usual problem where you are downloading
> from a server, because that is use-once write(2) data which is the
> problem. And this readahead-based drop behind also doesn't help if data
> you were reading happened to be a sequence of small files, or otherwise
> not in good readahead order.
> 
> Not to say that neither fix some problems, but for such conceptually
> big changes, it should take a little more effort than a constructed test
> case and no consideration of the alternatives to get it merged.


Sorry for the confusion.

For swap prefetch I should have said "some people claim that it fixes
their problem".  I didn't want to hurt anybody's feelings; some people
are tired of hearing others speak hypothetically about this patch, as
it works-for-them (TM).

I don't experience the problem. Can't help.

For drop behind, it fixes half the problem.  The read case is handled
perfectly by Peter's patch, and the copy (read+write) case is unchanged.
My test case demonstrates this very easily; just look at the numbers.

So, I agree with you that drop behind doesn't fix the write() case.
Peter has said so himself when I offered to test his patch.

As I do experience this problem, I have written a small test program and
batch file to help push the patch for acceptance.  I'm very willing to
help improve the test cases, test patches and write code, time
permitting.

About this very subject: earlier this year Andrew suggested that I come
up with a test case to demonstrate my problem; well, finally I've done
so.

http://lkml.org/lkml/2007/3/3/164
http://lkml.org/lkml/2007/3/3/166

Lastly, I would go as far as to say that the use-once read-then-copy fix
must also work with copies over NFS.  I don't know if NFS changes the
workload on the client station versus the local case, and I don't know
if it's still possible to consider data copied this way as use-once.


- Eric


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] readahead drop behind and size adjustment

2007-07-25 Thread Eric St-Laurent
On Wed, 2007-25-07 at 15:19 +1000, Nick Piggin wrote:

> What *I* think is supposed to happen is that newly read in pages get
> put on the inactive list, and unless they get accessed again before
> being reclaimed, they are allowed to fall off the end of the list
> without disturbing active data too much.
> 
> I think there is a missing piece here, that we used to ease the reclaim
> pressure off the active list when the inactive list grows relatively
> much larger than it (which could indicate a lot of use-once pages in
> the system).

Maybe a new list should be added to hold newly read pages.  If they are
unused, or used only once, after a certain period, they can be moved to
the inactive list (or whatever); a rough sketch follows the list below.

Newly read pages...

- ... not used after this period are excessive readahead; we discard
them immediately.
- ... used only once after this period, we discard soon.
- ... used many times/frequently are moved to the active list.
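
A sketch of the aging decision at the end of that period (illustrative
C only; no such list, enum or helper exists in vmscan.c):

/* fate of a page leaving the hypothetical newly-read list */
enum newly_read_fate {
	DISCARD_NOW,	/* excessive readahead, never touched */
	DISCARD_SOON,	/* use-once data */
	PROMOTE_ACTIVE,	/* frequently used */
};

static enum newly_read_fate age_newly_read_page(unsigned long accesses)
{
	if (accesses == 0)
		return DISCARD_NOW;
	if (accesses == 1)
		return DISCARD_SOON;
	return PROMOTE_ACTIVE;
}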

Surely the scan rate (do I make sense?) should be different for this
newly-read list and the inactive list. 

I also remember your split mapped/unmapped active list patches from a
while ago.

Can someone point me to up-to-date documentation about the Linux VM?
The books and documents I've seen are outdated.

> I think I've been banned from touching vmscan.c, but if you're keen to
> try a patch, I might be convinced to come out of retirement :)

I'm more than willing!  Now that CFS is merged, redirect your energies
from nicksched to nick-vm ;)

Patches against any tree (stable, linus, mm, rt) are good. But I prefer
the last stable release because it narrows down the possible problems
that a moving target like the development tree may have.

I test this on my main system, so patches with basic testing and
reasonable stability are preferred. I just want to avoid data corruption
bugs. FYI, I used to run the -rt tree most of the time.

> One man's trash is another's treasure: some people will want the
> files to remain in cache because they'll use them again (copy it
> somewhere else, or start editing it after being copied or whatever).
> 
> But yeah, we can probably do better at the sequential read/write
> case.

Sure, but there are many hints to detect this: *large* (> most of the
RAM), *streaming*, *used once*

But if a program mmap()s an area 3/4 the size of RAM and "plays" in it,
that's a good sign that the streaming code shouldn't be active.



- Eric


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] readahead drop behind and size adjustment

2007-07-25 Thread Eric St-Laurent
On Wed, 2007-25-07 at 15:19 +1000, Nick Piggin wrote:

 What *I* think is supposed to happen is that newly read in pages get
 put on the inactive list, and unless they get accessed againbefore
 being reclaimed, they are allowed to fall off the end of the list
 without disturbing active data too much.
 
 I think there is a missing piece here, that we used to ease the reclaim
 pressure off the active list when the inactive list grows relatively
 much larger than it (which could indicate a lot of use-once pages in
 the system).

Maybe a new list should be added to put newly read pages in it. If they
are not used or used once after a certain period, they can be moved to
the inactive list (or whatever).

Newly read pages...

- ... not used after this period are excessive readahead, we discard
immediately.
- ... used only once after this period, we discard soon.
- ... used many/frequently are moved to active list.

Surely the scan rate (do I make sense?) should be different for this
newly-read list and the inactive list. 

I also remember your split mapped/unmapped active list patches from a
while ago.

Can someone point me to a up-to-date documentation about the Linux VM?
The books and documents I've seen are outdated.

 I think I've been banned from touching vmscan.c, but if you're keen to
 try a patch, I might be convinced to come out of retirement :)

I'm more than willing!  Now that CFS is merged, redirect your energies
from nicksched to nick-vm ;)

Patches against any tree (stable, linus, mm, rt) are good. But I prefer
the last stable release because it narrows down the possible problems
that a moving target like the development tree may have.

I test this on my main system, so patches with basic testing and
reasonable stability are preferred. I just want to avoid data corruption
bugs. FYI, I used to run the -rt tree most of the time.

 One man's trash is another's treasure: some people will want the
 files to remain in cache because they'll use them again (copy it
 somewhere else, or start editing it after being copied or whatever).
 
 But yeah, we can probably do better at the sequential read/write
 case.

Sure, but there are many hints to detect this: *large* ( most of the
RAM), *streaming*, *used once*

But if a program mmap() a 3/4 of the RAM area and play in it, it's a
good sign that the streaming code shouldn't be active.



- Eric


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: -mm merge plans for 2.6.23

2007-07-25 Thread Eric St-Laurent
On Wed, 2007-25-07 at 15:37 +1000, Nick Piggin wrote:

 OK, this is where I start to worry. Swap prefetch AFAIKS doesn't fix
 the updatedb problem very well, because if updatedb has caused swapout
 then it has filled memory, and swap prefetch doesn't run unless there
 is free memory (not to mention that updatedb would have paged out other
 files as well).
 
 And drop behind doesn't fix your usual problem where you are downloading
 from a server, because that is use-once write(2) data which is the
 problem. And this readahead-based drop behind also doesn't help if data
 you were reading happened to be a sequence of small files, or otherwise
 not in good readahead order.
 
 Not to say that neither fix some problems, but for such conceptually
 big changes, it should take a little more effort than a constructed test
 case and no consideration of the alternatives to get it merged.


Sorry for the confusion.

For swap prefetch I should have said some people claim that it fix
their problem. I didn't want to hurt anybody feelings, some people are
tired to hear others speak hypothetically about this patch, as it
work-for-them (TM).

I don't experience the problem. Can't help.

For drop behind it fix half the problem. The read case is handled
perfectly by Peter's patch. And the copy (read+write) is unchanged. My
test case demonstrate it very easily, just look at the numbers.

So, I agree with you that drop behind doesn't fix the write() case.
Peter has said so himself when I offered to test his patch.

As I do experience this problem, I have written a small test program and
batch file to help push the patch for acceptance.  I'm very willing to
help improve the test cases, test patches and write code, time
permitting.

About this very subject, earlier this year this Andrew suggested me to
came up with a test case to demonstrate my problem, well finally I've
done so.

http://lkml.org/lkml/2007/3/3/164
http://lkml.org/lkml/2007/3/3/166

Lastly, I would go as far to say that the use-once read then copy fix
must also work with copies over NFS. I don't know if NFS change the
workload on the client station versus the local case, and I don't know
if it's still possible to consider data copied this way as use-once.


- Eric


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ck] Re: -mm merge plans for 2.6.23

2007-07-25 Thread Eric St-Laurent
On Wed, 2007-25-07 at 08:47 +0200, Mike Galbraith wrote:

 Heh.  Here we have a VM developer expressing his interest in the problem
 space, and you offer him a steaming jug of STFU because he doesn't say
 what you want to hear.  I wonder how many killfiles you just entered.
 

Agreed.

(a bit OT)

People should understand that it's not (I think) about a desktop
workload vs enterprise workloads war.

I see it mostly as a progression versus regressions trade-off. And
adding potentially useless or unmaintained code is a regression from the
maintainers POV.

The best way to justify a patch and have it integrated is to have a
scientific testing method with repeatable numbers.

Con has done so for his patch, his benchmark demonstrated good
improvements.

But I feel some of his supporters have indirectly harmed his cause by
their comments.  Also, the fact that Con recently stopped maintaining
his work out of frustration also don't help having his patch merged. 

Again I'm not personally pushing this patch, I don't need it.

Con has worked for many years on two area that still cause problems for
desktop users: scheduler interactivity and pagecache trashing.  Now that
the scheduler has been fixed, let's have the VM fixed too.

Sorry for the slightly OT post, and please don't start a flame war...


- Eric


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] readahead drop behind and size adjustment

2007-07-25 Thread Eric St-Laurent
On Wed, 2007-25-07 at 17:09 +1000, Nick Piggin wrote:

 
> A new list could be a possibility. One problem with adding lists is just
> trying to work out how to balance scanning rates between them, another
> problem is CPU overhead of moving pages from one to another...

Disk sizes seem to increase more rapidly than the ability to read them
quickly.  Fortunately, processing power increases greatly too.

It may be a good idea to spend more CPU cycles to better decide how the
VM should juggle this data. We've got to keep those multi-core CPUs
busy.


> but don't
> let me stop you if you want to jump in and try something :)

Well I might try a few things along the way.

But I prefer the thorough approach over tinkering...

- Read all research, check competition
- Build test virtual machines, with benchmarks and typical workloads
- Add (or use) some instrumentation to the pagecache
- Code a simulator
- Try all algorithms, tune them

This is way overkill for a part-time hobby.

If we don't see much work in this area, it's surely because it's no
longer a real problem for most workloads. Databases have their own cache
management and disk scheduling, file servers just add more RAM or
processors, etc.


> OK here is one which just changes the rate that the active and inactive
> lists get scanned. Data corruption bugs should be minimal ;)

Will test.


- Eric




Re: -mm merge plans for 2.6.23

2007-07-24 Thread Eric St-Laurent
On Wed, 2007-25-07 at 06:55 +0200, Rene Herman wrote:

> It certainly doesn't run for me ever. Always kind of a "that's not the 
> point" comment but I just keep wondering whenever I see anyone complain 
> about updatedb why the _hell_ they are running it in the first place. If 
> anyone who never uses "locate" for anything simply disable updatedb, the 
> problem will for a large part be solved.
> 
> This not just meant as a cheap comment; while I can think of a few similar 
> loads even on the desktop (scanning a browser cache, a media player indexing 
> a large amount of media files, ...) I've never heard of problems _other_ 
> than updatedb. So just junk that crap and be happy.

From my POV there are two different problems discussed recently:

- updatedb-type workloads that add tons of inodes and dentries to the
slab caches, which of course use the pagecache.

- streaming large files (reading or copying), which fills the pagecache
with useless used-once data

Swap prefetch fixes the first case; drop-behind fixes the second.

Both have the same symptoms but the cause is different.

Personally, updatedb doesn't really hurt me, but I don't have that many
files on my desktop.  I've tried the swap prefetch patch in the past and
it was not so noticeable for me. (I don't doubt it's helpful for others.)

But every time I read or copy a large file around (usually from a
server) the slowdown is noticeable for some moments.

I just wanted to point this out, in case it wasn't clear enough for everyone.
I hope both problems get fixed.


Best regards,

- Eric




Re: [PATCH 0/3] readahead drop behind and size adjustment

2007-07-24 Thread Eric St-Laurent
On Mon, 2007-23-07 at 19:00 +1000, Nick Piggin wrote:

> I don't like this kind of conditional information going from something
> like readahead into page reclaim. Unless it is for readahead _specific_
> data such as "I got these all wrong, so you can reclaim them" (which
> this isn't).
> 
> But I don't like it as a use-once thing. The VM should be able to get
> that right.
> 


Question: how does the use-once code work in the current kernel? Is
there any? It doesn't quite work for me...

See my previous email today; I've done a small test case to demonstrate
the problem and the effectiveness of Peter's patch.  The only piece
missing is the copy case (read once + write once).

Regardless of how it's implemented, I think a similar mechanism must be
added. This is a long standing issue.

In the end, I think it's a pagecache resource allocation problem: the
VM lacks fair-share limits between processes. The kernel doesn't have
enough information to make the right decisions.

You can refine or use more advanced page reclaim, but some fair-share
splitting (like the CPU scheduler does) between the processes must be
present.  Of course some processes, like databases, should have large
or unlimited VM limits.

Maybe the "containers" patchset and memory controller can help.  With
some specific configuration and/or a userspace daemon to adjust the
limits on the fly.

Independently, the basic large-file streaming read (or copy) once cases
should not thrash the pagecache. Can we agree on that?

I say, let's add some code to fix the problem.  If we hear about any
regression in some workloads, we can add a tunable to limit or disable
its effects, _if_ a better compromise solution cannot be found.

Surely it's possible to have an acceptable solution.

Best regards,

- Eric




Re: [PATCH 1/3] readahead: drop behind

2007-07-24 Thread Eric St-Laurent
On Sat, 2007-21-07 at 23:00 +0200, Peter Zijlstra wrote:

> Use the read-ahead code to provide hints to page reclaim.
> 
> This patch has the potential to solve the streaming-IO trashes my
> desktop problem.
> 
> It tries to aggressively reclaim pages that were loaded in a strong
> sequential pattern and have been consumed. Thereby limiting the damage
> to the current resident set.
> 
> Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>

(sorry for the delay)

Ok, I've done some tests with your patches,

I came up with a test program that should approximate my use case. It
simply mmap()s and scans (reads) a 375M file, which represents the
usual used memory on my desktop system.  This data is frequently used,
and should stay cached as much as possible in preference over the "used
once" data read into the page cache when copying large files. I don't
claim that the test program is perfect or even correct; I'm open for
suggestions.
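
For reference, here is a minimal sketch of what such a test program can
look like (illustrative only; the real program is in the attachment and
may differ in details):

/* large_app_load_simul.c -- illustrative sketch.  mmap() a file and
 * touch one byte per page, simulating an application's resident set
 * living in the page cache. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct stat st;
	volatile char sum = 0;
	char *p;
	off_t i;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;
	p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	for (i = 0; i < st.st_size; i += 4096)	/* one read per page */
		sum += p[i];
	munmap(p, st.st_size);
	close(fd);
	return 0;
}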

Test system:

- Linux x86_64 2.6.23-rc1
- 1G of RAM
- I use the basic drop behind and sysctl patches. The readahead size
patch is _not_ included.


Setting up:

dd if=/dev/zero of=/tmp/375M_file bs=1M count=375
dd if=/dev/zero of=/tmp/5G_file bs=1M count=5120

Tests with stock kernel (drop behind disabled):

echo 0 >/proc/sys/vm/drop_behind

Base test:

sync; echo 1 >/proc/sys/vm/drop_caches
time ./large_app_load_simul /tmp/375M_file
time ./large_app_load_simul /tmp/375M_file
time ./large_app_load_simul /tmp/375M_file
time ./large_app_load_simul /tmp/375M_file

1st execution: 0m7.146s
2nd execution: 0m1.119s
3rd execution: 0m1.109s
4th execution: 0m1.105s

Reading a large file test:

sync; echo 1 >/proc/sys/vm/drop_caches
time ./large_app_load_simul /tmp/375M_file
time ./large_app_load_simul /tmp/375M_file
cp /tmp/5G_file /dev/null
time ./large_app_load_simul /tmp/375M_file
time ./large_app_load_simul /tmp/375M_file

1st execution: 0m7.224s
2nd execution: 0m1.114s
3rd execution: 0m7.178s <<< Much slower
4th execution: 0m1.115s

Copying (read+write) a large file test:

sync; echo 1 >/proc/sys/vm/drop_caches
time ./large_app_load_simul /tmp/375M_file
time ./large_app_load_simul /tmp/375M_file
cp /tmp/5G_file /tmp/copy_of_5G_file
time ./large_app_load_simul /tmp/375M_file
time ./large_app_load_simul /tmp/375M_file
rm /tmp/copy_of_5G_file

1st execution: 0m7.203s
2nd execution: 0m1.147s
3rd execution: 0m7.238s <<< Much slower
4th execution: 0m1.129s

Tests with drop behind enabled:

echo 1 >/proc/sys/vm/drop_behind

Base test:

[same tests as above]

1st execution: 0m7.206s
2nd execution: 0m1.110s
3rd execution: 0m1.102s
4th execution: 0m1.106s

Reading a large file test:

[same tests as above]

1st execution: 0m7.197s
2nd execution: 0m1.116s
3rd execution: 0m1.114s <<< Great!!!
4th execution: 0m1.111s

Copying (read+write) a large file test:

[same tests as above]

1st execution: 0m7.186s
2nd execution: 0m1.111s
3rd execution: 0m7.339s <<< Not fixed
4th execution: 0m1.121s


Conclusion:

- The drop-behind patch works and really prevents the page cache
content from being filled with useless read-once data.

- It doesn't help the copy (read+write) case. This should also be fixed,
as it's a common workload.

Tested-By: Eric St-Laurent ([EMAIL PROTECTED])



Best regards,

- Eric

(*) Test program and batch file are attached.

diff -urN linux-2.6/include/linux/swap.h linux-2.6-drop-behind/include/linux/swap.h
--- linux-2.6/include/linux/swap.h	2007-07-21 18:26:00.0 -0400
+++ linux-2.6-drop-behind/include/linux/swap.h	2007-07-22 16:22:48.0 -0400
@@ -180,6 +180,7 @@
 /* linux/mm/swap.c */
 extern void FASTCALL(lru_cache_add(struct page *));
 extern void FASTCALL(lru_cache_add_active(struct page *));
+extern void FASTCALL(lru_demote(struct page *));
 extern void FASTCALL(activate_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
diff -urN linux-2.6/kernel/sysctl.c linux-2.6-drop-behind/kernel/sysctl.c
--- linux-2.6/kernel/sysctl.c	2007-07-21 18:26:01.0 -0400
+++ linux-2.6-drop-behind/kernel/sysctl.c	2007-07-22 16:20:27.0 -0400
@@ -163,6 +163,7 @@
 
 extern int prove_locking;
 extern int lock_stat;
+extern int sysctl_dropbehind;
 
 /* The default sysctl tables: */
 
@@ -1048,6 +1049,14 @@
		.extra1		= &zero,
 	},
 #endif
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "drop_behind",
+		.data		= &sysctl_dropbehind,
+		.maxlen		= sizeof(sysctl_dropbehind),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
diff -urN linux-2.6/mm/readahead.c linux-2.6-drop-behind/mm/readahead.c
--- linux-2.6/mm/readahead.c	2007-07-21 18:26:01.0 -0400
+++ linux-2.6-drop-behind/mm/readahead.c	2007-07-22 16:41:47.0 -0400
@@ -15,6 +15,7 @@
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/pagevec.h>
+#include <linux/swap.h>
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct



Re: [PATCH 1/3] readahead: drop behind

2007-07-21 Thread Eric St-Laurent
On Sat, 2007-21-07 at 23:00 +0200, Peter Zijlstra wrote:
> plain text document attachment (readahead-useonce.patch)
> Use the read-ahead code to provide hints to page reclaim.
> 
> This patch has the potential to solve the streaming-IO trashes my
> desktop problem.
> 
> It tries to aggressively reclaim pages that were loaded in a strong
> sequential pattern and have been consumed. Thereby limiting the damage
> to the current resident set.
> 
> Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>

With the fadvise change, it looks like the right solution to me.

Which kernel are the patches against? They don't apply cleanly to
2.6.22.1.

It would be useful to have a temporary /proc tunable to enable/disable
the heuristic to help test the effects.


- Eric




Re: [PATCH 1/3] readahead: drop behind

2007-07-21 Thread Eric St-Laurent

> They are against git of a few hours ago and the latest readahead patches
> from Wu (which don't apply cleanly either, but the rejects are trivial).
> 
> > It would be useful to have a temporary /proc tunable to enable/disable
> > the heuristic to help test the effects.
> 
> Right, I had such a patch somewhere,.. won't apply cleanly but should be
> obvious..

Thanks, I will merge these and report back with some results.

After copying large files, I find my system sluggish. I hope your
changes will help.

- Eric






RE: v2.6.21.4-rt11

2007-06-12 Thread Eric St-Laurent
On Tue, 2007-12-06 at 06:00 -0700, Pallipadi, Venkatesh wrote:
> 
> >-Original Message-

> Yes. Force_hpet part is should have worked..
> Eric: Can you send me the output of 'lspci -n on your system.
> We need to double check we are covering all ICH7 ids.

Here it is:

00:00.0 0600: 8086:2770 (rev 02)
00:02.0 0300: 8086:2772 (rev 02)
00:1b.0 0403: 8086:27d8 (rev 01)
00:1c.0 0604: 8086:27d0 (rev 01)
00:1c.1 0604: 8086:27d2 (rev 01)
00:1d.0 0c03: 8086:27c8 (rev 01)
00:1d.1 0c03: 8086:27c9 (rev 01)
00:1d.2 0c03: 8086:27ca (rev 01)
00:1d.3 0c03: 8086:27cb (rev 01)
00:1d.7 0c03: 8086:27cc (rev 01)
00:1e.0 0604: 8086:244e (rev e1)
00:1f.0 0601: 8086:27b8 (rev 01)
00:1f.1 0101: 8086:27df (rev 01)
00:1f.2 0101: 8086:27c0 (rev 01)
00:1f.3 0c05: 8086:27da (rev 01)
01:0a.0 0604: 3388:0021 (rev 11)
02:0c.0 0c03: 1033:0035 (rev 41)
02:0c.1 0c03: 1033:0035 (rev 41)
02:0c.2 0c03: 1033:00e0 (rev 02)
02:0d.0 0c00: 1106:3044 (rev 46)
03:00.0 0200: 8086:109a

Adding the id for PCI_DEVICE_ID_INTEL_ICH7_0 (27b8) should do the trick.

I've patched my kernel and was ready to test it, but in the meantime I
did a BIOS upgrade (bad idea...) and with the new version the HPET timer
is detected via ACPI.

Unfortunately it seems that downgrading the BIOS is a lot more trouble
than upgrading it. So I cannot easily test the force enable anymore.

Anyway it works now. Here is my patch if it's any use to you:


diff -uprN linux-2.6.21.4.orig/arch/i386/kernel/quirks.c linux-2.6.21.4/arch/i386/kernel/quirks.c
--- linux-2.6.21.4.orig/arch/i386/kernel/quirks.c	Tue Jun 12 10:03:18 2007
+++ linux-2.6.21.4/arch/i386/kernel/quirks.c	Tue Jun 12 10:08:02 2007
@@ -149,6 +149,8 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_I
 			 ich_force_enable_hpet);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH6_1,
 			 ich_force_enable_hpet);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_0,
+			 ich_force_enable_hpet);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_1,
 			 ich_force_enable_hpet);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_ICH7_31,


Best regards,

- Eric




Re: v2.6.21.4-rt11

2007-06-12 Thread Eric St-Laurent
On Sat, 2007-09-06 at 23:05 +0200, Ingo Molnar wrote:
> i'm pleased to announce the v2.6.21.4-rt11 kernel, which can be 
> downloaded from the usual place:
>  

I'm running 2.6.21.4-rt12-cfs-v17 (x86_64), so far no problems. I like
this kernel a lot; it feels quite smooth.

One little thing: no HPET timer is detected. Looking at the patch, even
the force-detect code is there, so it should work.

The hpet timer is not available as a clocksource and only one hpet
related message is present in dmesg:

PM: Adding info for No Bus:hpet

This is on an Asus P5LD2-VM motherboard (ICH7).

Relevant config bits:

CONFIG_HPET_TIMER=y
# CONFIG_HPET_EMULATE_RTC is not set
CONFIG_HPET=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_HPET_MMAP=y

Should I enable one of the two other options? Any ideas?


Best regards,

- Eric






Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-20 Thread Eric St-Laurent
On Tue, 2007-20-03 at 10:15 +0100, Arjan van de Ven wrote:

> disabling that is a BAD idea. I'm no fan of SMM myself, but it's there,
> and we have to live with it. Disabling it without knowing what it does
> on your system is madness.
> 

Like Lee said, for "debugging", mainly trying to resolve unexplained
long latencies.

I've had a laptop that caused latency spikes when the CPU fan turned
on. I tried disabling SMIs to diagnose the problem, with no success.

My current system has a BIOS feature to control fan speed according to
temperature. I presume this must use an SMI to work, right?  In that
case it should be possible to find and disable the related SMI and
replace the fan control with user-space software.

Of course it's not wise to blindly disable SMIs as we don't precisely
know what they do. 


- Eric






Re: [BUG] 2.6.21-rc1,2,3 regressions on my system that I found so far

2007-03-19 Thread Eric St-Laurent
On Tue, 2007-20-03 at 01:04 -0400, Lee Revell wrote:

> I think CONFIG_TRY_TO_DISABLE_SMI would be excellent for debugging,
> not to mention people trying to spec out hardware for RT
> applications...

There is an SMI-disabling module in RTAI; check smi-module.c in this:

https://www.rtai.org/RTAI/rtai-3.5.tar.bz2

More info:

http://www.captain.at/rtai-smi-high-latency.php
http://www.captain.at/xenomai-smi-high-latency.php

It might make sense to merge this code, at least in the -rt tree.


- Eric






Re: userspace pagecache management tool

2007-03-03 Thread Eric St-Laurent
On Sat, 2007-03-03 at 12:29 -0800, Andrew Morton wrote:


> There is much more which could be done to make this code smarter, but I
> think the lesson here is that we can produce a far, far better result doing
> this work in userspace than we could ever hope to do with an in-kernel
> implementation.  There are some enhancement suggestions in the
> documentation file.

While I think that more user space applications should use fadvise() to
avoid polluting the page cache with unneeded data, I still think the
kernel should be more fair in regard to page cache management.
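
For example, a minimal sketch of the kind of hint I mean (the helper
name is mine, not from Andrew's tool): after streaming through a file,
a program can tell the kernel its cached pages won't be needed again:

/* Sketch: drop this file's pages from the page cache once we are
 * done streaming through it. */
#define _XOPEN_SOURCE 600	/* for posix_fadvise() */
#include <fcntl.h>
#include <unistd.h>

static void drop_cached_pages(int fd)
{
	fsync(fd);	/* dirty pages must be written before they can go */
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);	/* len 0 == whole file */
}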

Personally, I've experienced some sluggish performance after copying
large files around, even more so when using NFS. It's difficult to file
a bug report for "interactive feel"; I don't know how to measure it. I
just feel it's a weak aspect of the OS.

Surely it's possible to make the kernel a little bit better at
protecting the page cache from abuse by simple or badly designed
applications.

Fairness is provided by the process scheduler with good results, yet it
is somewhat easy for a process to cause slowdowns through page cache
usage.

My personal opinion is that the VM seems tuned for database-type
workloads. Of course, making the page cache fairer, to prevent one
process from using most of it, will most likely slow down database-type
applications.

Maybe the situation should be reversed, much like with the process
scheduler: fairness by default, and the possibility to request more
system resources with the necessary privileges, much like the
SCHED_FIFO policy.


- Eric







Re: [PATCH] i386: Selectable Frequency of the Timer Interrupt

2005-07-15 Thread Eric St-Laurent
On Fri, 2005-07-15 at 12:58 -0700, Stephen Pollei wrote:
> But If I understand Linus's points he wants jiffies to remain a memory
> fetch, and make sure it doesn't turn into a singing dancing christmas
> tree.

It seems relatively easy to support dynamic tick; the ARM architecture
has it. But with the numerous users of jiffies throughout the code, it
seems to me that it's hard to ensure that every one of them will
continue to work correctly if jiffies_increment is changed at runtime.

As Linus noted, the current tick code is flexible and powerful, but it
can be hard to get it right in all cases.

WinCE developers have similar problems/concerns:

http://blogs.msdn.com/ce_base/archive/2005/06/08/426762.aspx

With the previous cleanups like time_after()/time_before(), msleep()
and friends, unit-conversion helpers, etc., it's a step in the right
direction.

I just wanted to point out that while it's good to preserve the current
efficient tick implementation, it may be worthwhile to add a relative
timeout API like Alan Cox proposed a year ago to better hide the
implementation details.


- Eric St-Laurent






Re: [PATCH] i386: Selectable Frequency of the Timer Interrupt

2005-07-14 Thread Eric St-Laurent
On Thu, 2005-07-14 at 17:24 -0700, Linus Torvalds wrote:
> 
> On Thu, 14 Jul 2005, Lee Revell wrote:
> 
> Trust me. When I say that the right thing to do is to just have a fixed 
> (but high) HZ value, and just changing the timer rate, I'm -right-.
> 
> I'm always right. This time I'm just even more right than usual.

Of course you are, jiffies are simple and efficient.

But it may be worthwhile to provide a better/simpler API for relative
timeouts and also to better hide the implementation details of the tick
system.


If I sum up the discussion from my POV:

- use a 32-bit tick counter on 32-bit platforms and use a 64-bit counter
on 64-bit platforms

- keep the constant HZ=1000 (mS resolution) on 32-bit platforms

- remove the assumption that timer interrupts and jiffies are a 1:1
thing (jiffies may be incremented by more than one tick per timer
interrupt; see the sketch after this list)

- determine jiffies_increment at boot

- have a slow clock mode to help power management (adjust
jiffies_increment by the slowdown factor)

- it may be useful to bump up HZ to 1e6 (uS res.) or 1e9 (nS res.) on
64-bit platforms, if there are benefits such as better accuracy during
time units conversions or if a higher frequency timer hardware is
available/viable.

- it may also be useful to bump HZ on -RT (real-time) kernels, or with
-HRT (high-resolution timers support). Users of those kernels are
willing to pay the overhead to get better resolution

- avoid direct usage of the jiffies variable, instead use jiffies()
(inline or MACRO), IMO monotonic_clock() would be a better name

- provide a relative timeout API (see my previous post, or Alan's
suggestions)

- remove most of the direct use of jiffies through the code and replace
them with msleep(), relative timer, etc

- use human units for those APIs


- Eric St-Laurent




Re: [PATCH] i386: Selectable Frequency of the Timer Interrupt

2005-07-14 Thread Eric St-Laurent
On Thu, 2005-07-14 at 23:37 +0100, Alan Cox wrote:

> In actual fact you also want to fix users of
> 
>   while(time_before(foo, jiffies)) { whack(mole); }
> 
> to become
> 
>   init_timeout(&timeout);
>   timeout.expires = jiffies + n
>   add_timeout(&timeout);
>   while(!timeout_expired(&timeout)) {}
> 
> Which is a trivial wrapper around timers as we have them now

Or something like this:

struct timeout_timer {
unsigned long expires;
};

static inline void timeout_set(struct timeout_timer *timer,
unsigned int msecs)
{
timer->expires = jiffies + msecs_to_jiffies(msecs);
}

static inline int timeout_expired(struct timeout_timer *timer)
{
return (time_after(jiffies, timer->expires));
}

It provides a nice API for relative timeouts without adding overhead.
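
Typical usage would then look like this (a sketch; poll_the_thing() is
a made-up placeholder for the actual work):

	struct timeout_timer t;

	timeout_set(&t, 100);		/* expire 100 msecs from now */
	while (!timeout_expired(&t)) {
		if (poll_the_thing())
			break;
		cpu_relax();
	}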


- Eric St-Laurent






Re: [PATCH] i386: Selectable Frequency of the Timer Interrupt

2005-07-11 Thread Eric St-Laurent
On Mon, 2005-07-11 at 16:08 +0200, Arjan van de Ven wrote:

> Alan: you worked on this before, where did you end up with ?
> 

The last patch I've seen is 1 year old.

http://www.ussg.iu.edu/hypermail/linux/kernel/0407.3/0643.html

Eric St-Laurent






Re: [PATCH] Dynamic tick, version 050127-1

2005-02-01 Thread Eric St-Laurent
On Tue, 2005-02-01 at 15:20 -0500, Lee Revell wrote:

> I was wondering how Windows handles high res timers, if at all.  The
> reason I ask is because I have been reverse engineering a Windows ASIO
> driver, and I find that if the latency is set below about 5ms, by

By default, Windows "multimedia" timers have 10ms resolution (this
depends on the exact version of Windows used...).  You can call the
timeBeginPeriod() function to lower the resolution to 1ms.

This resolution seems related to the task scheduler timeslice.  After
you call this function, the Sleep() call also has a resolution of 1ms
instead of 10ms.
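
For illustration, the usual calling pattern looks like this (a minimal
sketch; timeBeginPeriod()/timeEndPeriod() are the winmm.lib multimedia
timer calls):

#include <windows.h>

void timing_sensitive_work(void)
{
	timeBeginPeriod(1);	/* request 1ms timer resolution */
	Sleep(1);		/* now sleeps ~1ms instead of ~10ms */
	timeEndPeriod(1);	/* must match the timeBeginPeriod() call */
}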

I remember reading that the multimedia timers are implemented as a high
priority thread.

You can find more details on this site:

http://www.geisswerks.com/ryan/FAQS/timing.html

Best regards,

Eric St-Laurent






Re: [patch 1/13] Qsort

2005-01-24 Thread Eric St-Laurent
On Mon, 2005-01-24 at 21:43 -0300, Horst von Brand wrote:
> AFAICS, this is just a badly implemented Shellsort (the 10/13 increment
> sequence starting with the number of elements is probably not very good,
> besides swapping stuff is inefficient (just juggling like Shellsort does
> gives you almost a third less copies)).
> 
> Have you found a proof for the O(n log n) claim?

"Why a Comb Sort is NOT a Shell Sort

A shell sort completely sorts the data for each gap size. A comb sort
takes a more optimistic approach and doesn't require data be completely
sorted at a gap size. The comb sort assumes that out-of-order data will
be cleaned-up by smaller gap sizes as the sort proceeds. "

Reference:

http://world.std.com/~jdveale/combsort.htm

Another good reference:

http://yagni.com/combsort/index.php

Personally, I've used it in the past because of its small size.  With
C++ templates you can have a copy of the routine generated for a
specific datatype, thus skipping the costly function call used for each
compare.  With some C macro magic, I presume something similar can be
done for time-critical applications.
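
For reference, a minimal comb sort over an int array looks like this
(a sketch using the 10/13 shrink factor discussed above; it degenerates
into a plain bubble pass once the gap reaches 1):

/* Comb sort sketch: shrink the gap each pass and keep going until a
 * full gap-1 pass makes no swaps. */
void combsort(int *a, int n)
{
	int gap = n, swapped = 1, i, tmp;

	while (gap > 1 || swapped) {
		gap = gap * 10 / 13;	/* the 10/13 shrink factor */
		if (gap < 1)
			gap = 1;
		swapped = 0;
		for (i = 0; i + gap < n; i++) {
			if (a[i] > a[i + gap]) {
				tmp = a[i];
				a[i] = a[i + gap];
				a[i + gap] = tmp;
				swapped = 1;
			}
		}
	}
}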

Best regards,

Eric St-Laurent



