date:20080116

Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-16 Thread Frans Pop

On Thursday 17 January 2008, David Miller wrote:
> From: "Brandeburg, Jesse" <[EMAIL PROTECTED]>
>
> > We spent Wednesday trying to reproduce (without the patch) these issues
> > without much luck, and have applied the patch cleanly and will continue
> > testing it.  Given the simplicity of the changes, and the community
> > testing, I'll give my ack and we will continue testing.
>
> You need a slow CPU, and you need to make sure you do actually
> trigger the TX limiting code there.

Hmmm. Is a dual core Pentium D 3.20GHz considered slow these days?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 4/10] ACPI: register ACPI Fan as generic thermal cooling device

2008-01-16 Thread Zhang Rui

From: Zhang Rui <[EMAIL PROTECTED]>

Register ACPI Fan as thermal cooling device.

Signed-off-by: Zhang Rui <[EMAIL PROTECTED]>
Signed-off-by: Thomas Sujith <[EMAIL PROTECTED]>
---
 drivers/acpi/fan.c |   90 -
 1 file changed, 83 insertions(+), 7 deletions(-)

Index: linux-2.6/drivers/acpi/fan.c
===
--- linux-2.6.orig/drivers/acpi/fan.c
+++ linux-2.6/drivers/acpi/fan.c
@@ -30,7 +30,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 #include 
 
@@ -64,9 +64,55 @@ static struct acpi_driver acpi_fan_drive
},
 };
 
+/* thermal cooling device callbacks */
+static int fan_get_max_state(struct thermal_cooling_device *cdev, char *buf)
+{
+   /* ACPI fan device only support two states: ON/OFF */
+   return sprintf(buf, "1\n");
+}
+
+static int fan_get_cur_state(struct thermal_cooling_device *cdev, char *buf)
+{
+   struct acpi_device *device = cdev->devdata;
+   int state;
+   int result;
+
+   if (!device)
+   return -EINVAL;
+
+   result = acpi_bus_get_power(device->handle, &state);
+   if (result)
+   return result;
+
+   return sprintf(buf, "%s\n", state == ACPI_STATE_D3 ? "0" :
+(state == ACPI_STATE_D0 ? "1" : "unknown"));
+}
+
+static int
+fan_set_cur_state(struct thermal_cooling_device *cdev, unsigned int state)
+{
+   struct acpi_device *device = cdev->devdata;
+   int result;
+
+   if (!device || (state != 0 && state != 1))
+   return -EINVAL;
+
+   result = acpi_bus_set_power(device->handle,
+   state ? ACPI_STATE_D0 : ACPI_STATE_D3);
+
+   return result;
+}
+
+static struct thermal_cooling_device_ops fan_cooling_ops = {
+   .get_max_state = fan_get_max_state,
+   .get_cur_state = fan_get_cur_state,
+   .set_cur_state = fan_set_cur_state,
+};
+
 /* --
   FS Interface (/proc)
-- 
*/
+#ifdef CONFIG_ACPI_PROCFS
 
 static struct proc_dir_entry *acpi_fan_dir;
 
@@ -167,7 +213,17 @@ static int acpi_fan_remove_fs(struct acp
 
return 0;
 }
+#else
+static int acpi_fan_add_fs(struct acpi_device *device)
+{
+   return 0;
+}
 
+static int acpi_fan_remove_fs(struct acpi_device *device)
+{
+   return 0;
+}
+#endif
 /* --
  Driver Interface
-- 
*/
@@ -175,9 +231,8 @@ static int acpi_fan_remove_fs(struct acp
 static int acpi_fan_add(struct acpi_device *device)
 {
int result = 0;
-   struct acpi_fan *fan = NULL;
int state = 0;
-
+   struct thermal_cooling_device *cdev;
 
if (!device)
return -EINVAL;
@@ -191,6 +246,25 @@ static int acpi_fan_add(struct acpi_devi
goto end;
}
 
+   cdev = thermal_cooling_device_register("Fan", device,
+   &fan_cooling_ops);
+   if (cdev)
+   printk(KERN_INFO PREFIX
+   "%s is registered as cooling_device%d\n",
+   device->dev.bus_id, cdev->id);
+   else
+   goto end;
+   acpi_driver_data(device) = cdev;
+   result = sysfs_create_link(&device->dev.kobj, &cdev->device.kobj,
+   "thermal_cooling");
+   if (result)
+   return result;
+
+   result = sysfs_create_link(&cdev->device.kobj, &device->dev.kobj,
+   "device");
+if (result)
+return result;
+
result = acpi_fan_add_fs(device);
if (result)
goto end;
@@ -200,18 +274,20 @@ static int acpi_fan_add(struct acpi_devi
   !device->power.state ? "on" : "off");
 
   end:
-   if (result)
-   kfree(fan);
-
return result;
 }
 
 static int acpi_fan_remove(struct acpi_device *device, int type)
 {
-   if (!device || !acpi_driver_data(device))
+   struct thermal_cooling_device *cdev = acpi_driver_data(device);
+
+   if (!device || !cdev)
return -EINVAL;
 
acpi_fan_remove_fs(device);
+   sysfs_remove_link(&device->dev.kobj, "thermal_cooling");
+   sysfs_remove_link(&cdev->device.kobj, "device");
+   thermal_cooling_device_unregister(cdev);
 
return 0;
 }



[PATCH 3/10] ACPI: ACPI thermal zone handle notification correctly

2008-01-16 Thread Zhang Rui

From: Zhang Rui <[EMAIL PROTECTED]>

Change the ACPI thermal action upon notification 0x81 and 0x82.

According to the ACPI spec, we should:
re-evaluate _PSV and _ACx methods upon notification 0x81
re-evaluate _PSL and _ALx and _TZD upon notification 0x82.
But the current code re-evaluates all the trip points for 0x81 and
re-evaluates only _TZD for 0x82.

Fix this violation of ACPI spec.

TODO: devices in _PSL, _ALx and _TZD may change after a notification 0x82.
  At this time, we need to re-bind the cooling devices with the thermal zone.
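
For illustration only, the intended dispatch can be sketched as below. The
flag values are the ACPI_TRIPS_* defines from this patch; the enum names and
the helper trips_to_refresh() are hypothetical and are not part of the driver.

```c
/* Flag values as defined by this patch. */
#define ACPI_TRIPS_CRITICAL	0x01
#define ACPI_TRIPS_HOT		0x02
#define ACPI_TRIPS_PASSIVE	0x04
#define ACPI_TRIPS_ACTIVE	0x08
#define ACPI_TRIPS_DEVICES	0x10

#define ACPI_TRIPS_REFRESH_THRESHOLDS	(ACPI_TRIPS_PASSIVE | ACPI_TRIPS_ACTIVE)
#define ACPI_TRIPS_REFRESH_DEVICES	ACPI_TRIPS_DEVICES

/* Hypothetical names for the two notification values discussed above. */
enum {
	NOTIFY_THRESHOLDS = 0x81,	/* _PSV and _ACx must be re-evaluated */
	NOTIFY_DEVICES    = 0x82,	/* _PSL, _ALx and _TZD must be re-evaluated */
};

/*
 * Hypothetical helper: return the set of ACPI_TRIPS_* flags to refresh
 * for a given notification, or 0 if the event needs no refresh.
 */
static int trips_to_refresh(unsigned int event)
{
	switch (event) {
	case NOTIFY_THRESHOLDS:
		return ACPI_TRIPS_REFRESH_THRESHOLDS;
	case NOTIFY_DEVICES:
		return ACPI_TRIPS_REFRESH_DEVICES;
	default:
		return 0;
	}
}
```

The result would then be passed as the flag argument of
acpi_thermal_trips_update(), so that 0x81 no longer touches the device lists
and 0x82 no longer skips _PSL and _ALx.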

Signed-off-by: Zhang Rui <[EMAIL PROTECTED]>
Signed-off-by: Thomas Sujith <[EMAIL PROTECTED]> 
---
 drivers/acpi/thermal.c |  323 +++--
 1 file changed, 183 insertions(+), 140 deletions(-)

Index: linux-2.6/drivers/acpi/thermal.c
===
--- linux-2.6.orig/drivers/acpi/thermal.c
+++ linux-2.6/drivers/acpi/thermal.c
@@ -323,173 +323,221 @@ static int acpi_thermal_set_cooling_mode
return 0;
 }
 
-static int acpi_thermal_get_trip_points(struct acpi_thermal *tz)
+#define ACPI_TRIPS_CRITICAL0x01
+#define ACPI_TRIPS_HOT 0x02
+#define ACPI_TRIPS_PASSIVE 0x04
+#define ACPI_TRIPS_ACTIVE  0x08
+#define ACPI_TRIPS_DEVICES 0x10
+
+#define ACPI_TRIPS_REFRESH_THRESHOLDS  (ACPI_TRIPS_PASSIVE | ACPI_TRIPS_ACTIVE)
+#define ACPI_TRIPS_REFRESH_DEVICES ACPI_TRIPS_DEVICES
+
+#define ACPI_TRIPS_INIT  (ACPI_TRIPS_CRITICAL | ACPI_TRIPS_HOT |   \
+ ACPI_TRIPS_PASSIVE | ACPI_TRIPS_ACTIVE |  \
+ ACPI_TRIPS_DEVICES)
+
+/*
+ * This exception is thrown out in two cases:
+ * 1.An invalid trip point becomes valid or a valid trip point becomes invalid
+ *   when re-evaluating the AML code.
+ * 2.TODO: Devices listed in _PSL, _ALx, _TZD may change.
+ *   We need to re-bind the cooling devices of a thermal zone when this occurs.
+ */
+#define ACPI_THERMAL_TRIPS_EXCEPTION(flags, str)   \
+do {   \
+   if (flags != ACPI_TRIPS_INIT)   \
+   ACPI_EXCEPTION((AE_INFO, AE_ERROR,  \
+   "ACPI thermal trip point %s changed\n"  \
+   "Please send acpidump to [EMAIL PROTECTED]", str)); \
+} while (0)
+
+static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag)
 {
acpi_status status = AE_OK;
-   int i = 0;
-
-
-   if (!tz)
-   return -EINVAL;
+   struct acpi_handle_list devices;
+   int valid = 0;
+   int i;
 
/* Critical Shutdown (required) */
-
-   status = acpi_evaluate_integer(tz->device->handle, "_CRT", NULL,
-  &tz->trips.critical.temperature);
-   if (ACPI_FAILURE(status)) {
-   tz->trips.critical.flags.valid = 0;
-   ACPI_EXCEPTION((AE_INFO, status, "No critical threshold"));
-   return -ENODEV;
-   } else {
-   tz->trips.critical.flags.valid = 1;
-   ACPI_DEBUG_PRINT((ACPI_DB_INFO,
- "Found critical threshold [%lu]\n",
- tz->trips.critical.temperature));
-   }
-
-   if (tz->trips.critical.flags.valid == 1) {
-   if (crt == -1) {
+   if (flag & ACPI_TRIPS_CRITICAL) {
+   status = acpi_evaluate_integer(tz->device->handle,
+   "_CRT", NULL, &tz->trips.critical.temperature);
+   if (ACPI_FAILURE(status)) {
tz->trips.critical.flags.valid = 0;
-   } else if (crt > 0) {
-   unsigned long crt_k = CELSIUS_TO_KELVIN(crt);
-
-   /*
-* Allow override to lower critical threshold
-*/
-   if (crt_k < tz->trips.critical.temperature)
-   tz->trips.critical.temperature = crt_k;
+   ACPI_EXCEPTION((AE_INFO, status,
+   "No critical threshold"));
+   return -ENODEV;
+   } else {
+   tz->trips.critical.flags.valid = 1;
+   ACPI_DEBUG_PRINT((ACPI_DB_INFO,
+   "Found critical threshold [%lu]\n",
+   tz->trips.critical.temperature));
+   }
+   if (tz->trips.critical.flags.valid == 1) {
+   if (crt == -1) {
+   tz->trips.critical.flags.valid = 0;
+   } else if (crt > 0) {
+   unsigned long crt_k = CELSIUS_TO_KELVIN(crt);
+   /*
+* Allow override to lower critical threshold
+*/
+   if (crt_k < tz->trips.critical.temperature)
+

[PATCH 6/10] ACPI: register ACPI Video LCD as generic thermal cooling device

2008-01-16 Thread Zhang Rui

From: Zhang Rui <[EMAIL PROTECTED]>

Register ACPI video device as thermal cooling devices as they may be listed
in _TZD method and the backlight control can be used for throttling.

Signed-off-by: Zhang Rui <[EMAIL PROTECTED]>
Signed-off-by: Thomas Sujith <[EMAIL PROTECTED]>
---
 drivers/acpi/video.c |   78 ++-
 1 file changed, 77 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/acpi/video.c
===
--- linux-2.6.orig/drivers/acpi/video.c
+++ linux-2.6/drivers/acpi/video.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -179,6 +180,7 @@ struct acpi_video_device {
struct acpi_device *dev;
struct acpi_video_device_brightness *brightness;
struct backlight_device *backlight;
+   struct thermal_cooling_device *cdev;
struct output_device *output_dev;
 };
 
@@ -334,6 +336,54 @@ static struct output_properties acpi_out
.set_state = acpi_video_output_set,
.get_status = acpi_video_output_get,
 };
+
+
+/* thermal cooling device callbacks */
+static int video_get_max_state(struct thermal_cooling_device *cdev, char *buf)
+{
+   struct acpi_device *device = cdev->devdata;
+   struct acpi_video_device *video = acpi_driver_data(device);
+
+   return sprintf(buf, "%d\n", video->brightness->count - 3);
+}
+
+static int video_get_cur_state(struct thermal_cooling_device *cdev, char *buf)
+{
+   struct acpi_device *device = cdev->devdata;
+   struct acpi_video_device *video = acpi_driver_data(device);
+   unsigned long level;
+   int state;
+
+   acpi_video_device_lcd_get_level_current(video, &level);
+   for (state = 2; state < video->brightness->count; state++)
+   if (level == video->brightness->levels[state])
+   return sprintf(buf, "%d\n",
+  video->brightness->count - state - 1);
+
+   return -EINVAL;
+}
+
+static int
+video_set_cur_state(struct thermal_cooling_device *cdev, unsigned int state)
+{
+   struct acpi_device *device = cdev->devdata;
+   struct acpi_video_device *video = acpi_driver_data(device);
+   int level;
+
+   if ( state >= video->brightness->count - 2)
+   return -EINVAL;
+
+   state = video->brightness->count - state;
+   level = video->brightness->levels[state -1];
+   return acpi_video_device_lcd_set_level(video, level);
+}
+
+static struct thermal_cooling_device_ops video_cooling_ops = {
+   .get_max_state = video_get_max_state,
+   .get_cur_state = video_get_cur_state,
+   .set_cur_state = video_set_cur_state,
+};
+
 /* --
Video Management
-- 
*/
@@ -653,6 +703,7 @@ static void acpi_video_device_find_cap(s
 
if (device->cap._BCL && device->cap._BCM && device->cap._BQC && 
max_level > 0){
unsigned long tmp;
+   int result;
static int count = 0;
char *name;
name = kzalloc(MAX_NAME_LEN, GFP_KERNEL);
@@ -666,8 +717,25 @@ static void acpi_video_device_find_cap(s
device->backlight->props.max_brightness = max_level;
device->backlight->props.brightness = (int)tmp;
backlight_update_status(device->backlight);
-
kfree(name);
+
+   device->cdev = thermal_cooling_device_register("LCD",
+   device->dev, &video_cooling_ops);
+   if (device->cdev) {
+   printk(KERN_INFO PREFIX
+   "%s is registered as cooling_device%d\n",
+   device->dev->dev.bus_id, device->cdev->id);
+   result = sysfs_create_link(&device->dev->dev.kobj,
+ &device->cdev->device.kobj,
+ "thermal_cooling");
+   if (result)
+   printk(KERN_ERR PREFIX "Create sysfs link\n");
+   result = sysfs_create_link(&device->cdev->device.kobj,
+ &device->dev->dev.kobj,
+ "device");
+if (result)
+   printk(KERN_ERR PREFIX "Create sysfs link\n");
+   }
}
if (device->cap._DCS && device->cap._DSS){
static int count = 0;
@@ -1729,6 +1797,14 @@ static int acpi_video_bus_put_one_device
ACPI_DEVICE_NOTIFY,
acpi_video_device_notify);
backlight_device_unregister(device->backlight);
+   if (device->cdev) {
+   sysfs_remove_link(&device->dev->dev.kobj,
+

[PATCH 5/10] ACPI: register ACPI Processor as generic thermal cooling device

2008-01-16 Thread Zhang Rui

From: Zhang Rui <[EMAIL PROTECTED]>

Register ACPI processor as thermal cooling devices.
A combination of processor T-states and P-states is used for thermal throttling.
The processor reduces the frequency first and then sets the T-state.

We use cpufreq_thermal_reduction_pctg to calculate the cpufreq limit,
and call cpufreq_verify_within_limits to set the cpufreq limit.
If a cpufreq driver is loaded, then we have four cooling states for cpufreq
control:
cooling state 0: cpufreq limit == max_freq
cooling state 1: cpufreq limit == max_freq * 80%
cooling state 2: cpufreq limit == max_freq * 60%
cooling state 3: cpufreq limit == max_freq * 40%

After the cpufreq limit is set to 40 percent of max_freq,
we use T-states for cooling.

E.g. a processor has P-state support and 8 T-states (T0-T7), so
the max_state of the processor is 10:

state   cpufreq-limit   T-state
0:      max_freq        T0
1:      max_freq * 80%  T0
2:      max_freq * 60%  T0
3:      max_freq * 40%  T0
4:      max_freq * 40%  T1
5:      max_freq * 40%  T2
6:      max_freq * 40%  T3
7:      max_freq * 40%  T4
8:      max_freq * 40%  T5
9:      max_freq * 40%  T6
10:     max_freq * 40%  T7
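
The table above can be reproduced with a small stand-alone sketch. This is
not the code in the patch: struct cooling_split and both helpers are
hypothetical names, and only CPUFREQ_THERMAL_MAX_STEP and the 20% step size
are taken from the patch itself.

```c
#define CPUFREQ_THERMAL_MAX_STEP 3	/* from this patch: cpufreq steps 0..3 */

/* Hypothetical result type: how one cooling state is split up. */
struct cooling_split {
	int freq_pct;	/* cpufreq limit as a percentage of max_freq */
	int tstate;	/* throttling state index, 0 means T0 */
};

/* Highest cooling state for a CPU with n_tstates T-states. */
static int max_cooling_state(int has_cpufreq, int n_tstates)
{
	int max = has_cpufreq ? CPUFREQ_THERMAL_MAX_STEP : 0;

	if (n_tstates > 1)
		max += n_tstates - 1;	/* T0 is shared with the cpufreq steps */
	return max;
}

/*
 * Split a cooling state into a cpufreq limit and a T-state.
 * Frequency is reduced first; T-states are used only once the
 * cpufreq limit has reached its lowest step.
 * Returns 0 on success, -1 if the state is out of range.
 */
static int split_cooling_state(int state, int has_cpufreq, int n_tstates,
			       struct cooling_split *out)
{
	int freq_steps = has_cpufreq ? CPUFREQ_THERMAL_MAX_STEP : 0;

	if (state < 0 || state > max_cooling_state(has_cpufreq, n_tstates))
		return -1;

	if (state <= freq_steps) {
		out->freq_pct = 100 - 20 * state;
		out->tstate = 0;
	} else {
		out->freq_pct = 100 - 20 * freq_steps;
		out->tstate = state - freq_steps;
	}
	return 0;
}
```

With cpufreq available and 8 T-states, max_cooling_state() gives 10, state 4
maps to a 40% limit with T1, and state 10 maps to a 40% limit with T7, which
matches the table.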

Signed-off-by: Zhang Rui <[EMAIL PROTECTED]>
Signed-off-by: Zhao Yakui <[EMAIL PROTECTED]>
Signed-off-by: Thomas Sujith <[EMAIL PROTECTED]>
---
 drivers/acpi/processor_core.c|   23 ++
 drivers/acpi/processor_thermal.c |  134 +--
 include/acpi/processor.h |6 -
 3 files changed, 155 insertions(+), 8 deletions(-)

Index: linux-2.6/drivers/acpi/processor_thermal.c
===
--- linux-2.6.orig/drivers/acpi/processor_thermal.c
+++ linux-2.6/drivers/acpi/processor_thermal.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -93,6 +94,9 @@ static int acpi_processor_apply_limit(st
  * _any_ cpufreq driver and not only the acpi-cpufreq driver.
  */
 
+#define CPUFREQ_THERMAL_MIN_STEP 0
+#define CPUFREQ_THERMAL_MAX_STEP 3
+
 static unsigned int cpufreq_thermal_reduction_pctg[NR_CPUS];
 static unsigned int acpi_thermal_cpufreq_is_init = 0;
 
@@ -109,8 +113,9 @@ static int acpi_thermal_cpufreq_increase
if (!cpu_has_cpufreq(cpu))
return -ENODEV;
 
-   if (cpufreq_thermal_reduction_pctg[cpu] < 60) {
-   cpufreq_thermal_reduction_pctg[cpu] += 20;
+   if (cpufreq_thermal_reduction_pctg[cpu] <
+   CPUFREQ_THERMAL_MAX_STEP) {
+   cpufreq_thermal_reduction_pctg[cpu]++;
cpufreq_update_policy(cpu);
return 0;
}
@@ -123,8 +128,9 @@ static int acpi_thermal_cpufreq_decrease
if (!cpu_has_cpufreq(cpu))
return -ENODEV;
 
-   if (cpufreq_thermal_reduction_pctg[cpu] > 20)
-   cpufreq_thermal_reduction_pctg[cpu] -= 20;
+   if (cpufreq_thermal_reduction_pctg[cpu] >
+   (CPUFREQ_THERMAL_MIN_STEP + 1))
+   cpufreq_thermal_reduction_pctg[cpu]--;
else
cpufreq_thermal_reduction_pctg[cpu] = 0;
cpufreq_update_policy(cpu);
@@ -143,7 +149,7 @@ static int acpi_thermal_cpufreq_notifier
 
max_freq =
(policy->cpuinfo.max_freq *
-(100 - cpufreq_thermal_reduction_pctg[policy->cpu])) / 100;
+(100 - cpufreq_thermal_reduction_pctg[policy->cpu] * 20)) / 100;
 
cpufreq_verify_within_limits(policy, 0, max_freq);
 
@@ -155,6 +161,32 @@ static struct notifier_block acpi_therma
.notifier_call = acpi_thermal_cpufreq_notifier,
 };
 
+static int cpufreq_get_max_state(unsigned int cpu)
+{
+   if (!cpu_has_cpufreq(cpu))
+   return 0;
+
+   return CPUFREQ_THERMAL_MAX_STEP;
+}
+
+static int cpufreq_get_cur_state(unsigned int cpu)
+{
+   if (!cpu_has_cpufreq(cpu))
+   return 0;
+
+   return cpufreq_thermal_reduction_pctg[cpu];
+}
+
+static int cpufreq_set_cur_state(unsigned int cpu, int state)
+{
+   if (!cpu_has_cpufreq(cpu))
+   return 0;
+
+   cpufreq_thermal_reduction_pctg[cpu] = state;
+   cpufreq_update_policy(cpu);
+   return 0;
+}
+
 void acpi_thermal_cpufreq_init(void)
 {
int i;
@@ -179,6 +211,20 @@ void acpi_thermal_cpufreq_exit(void)
 }
 
 #else  /* ! CONFIG_CPU_FREQ */
+static int cpufreq_get_max_state(unsigned int cpu)
+{
+   return 0;
+}
+
+static int cpufreq_get_cur_state(unsigned int cpu)
+{
+   return 0;
+}
+
+static int cpufreq_set_cur_state(unsigned int cpu, int state)
+{
+   return 0;
+}
 
 static int acpi_thermal_cpufreq_increase(unsigned int cpu)
 {
@@ -310,6 +356,84 @@ int acpi_processor_get_limit_info(struct
return 0;
 }
 
+/* thermal cooling device callbacks */
+static int acpi_processor_max_state(struct acpi_processor *pr)
+{
+   int max_state = 0;
+
+   /*
+* There exist four states according to
+* cpufreq_thermal_reduction_pctg: 0, 1, 2, 3
+

[PATCH 8/10] ACPI: CELSIUS_TO_KELVIN fixup

2008-01-16 Thread Zhang Rui

From: Zhang Rui <[EMAIL PROTECTED]>

Fix an imprecision in CELSIUS_TO_KELVIN (0 C is 273.2 K, not 273 K) and
move the KELVIN_TO_CELSIUS and CELSIUS_TO_KELVIN macros to
include/linux/thermal.h.
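
A stand-alone sketch of the difference, for illustration. ACPI temperatures
are in tenths of a kelvin. The _OLD/_NEW macro names and both helper
functions are hypothetical; the two formulas are the ones removed and added
by this patch.

```c
/* Old formula, removed from drivers/acpi/thermal.c: treats 0 C as 273.0 K. */
#define CELSIUS_TO_KELVIN_OLD(t)	(((t) + 273) * 10)

/* New formula, added to include/linux/thermal.h: treats 0 C as 273.2 K. */
#define CELSIUS_TO_KELVIN_NEW(t)	((t) * 10 + 2732)

/* Same rounding as KELVIN_TO_CELSIUS in the patch, written as a function. */
static long kelvin_to_celsius(long t)
{
	long d = t - 2732;

	return d >= 0 ? (d + 5) / 10 : (d - 5) / 10;
}

/* Difference between the two formulas, in tenths of a kelvin. */
static long celsius_to_kelvin_error(long celsius)
{
	return CELSIUS_TO_KELVIN_NEW(celsius) - CELSIUS_TO_KELVIN_OLD(celsius);
}
```

For 25 C the old macro yields 2980 and the new one 2982, so the old value is
always 0.2 K too low; the new value converts back to exactly 25 C.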

Signed-off-by: Zhang Rui <[EMAIL PROTECTED]>
Signed-off-by: Thomas Sujith <[EMAIL PROTECTED]>
---
 drivers/acpi/thermal.c  |3 ---
 include/linux/thermal.h |4 
 2 files changed, 4 insertions(+), 3 deletions(-)

Index: linux-2.6/drivers/acpi/thermal.c
===
--- linux-2.6.orig/drivers/acpi/thermal.c
+++ linux-2.6/drivers/acpi/thermal.c
@@ -65,9 +65,6 @@
 #define ACPI_THERMAL_MAX_ACTIVE10
 #define ACPI_THERMAL_MAX_LIMIT_STR_LEN 65
 
-#define KELVIN_TO_CELSIUS(t)    (long)(((long)t-2732>=0) ? ((long)t-2732+5)/10 : ((long)t-2732-5)/10)
-#define CELSIUS_TO_KELVIN(t)   ((t+273)*10)
-
 #define _COMPONENT ACPI_THERMAL_COMPONENT
 ACPI_MODULE_NAME("thermal");
 
Index: linux-2.6/include/linux/thermal.h
===
--- linux-2.6.orig/include/linux/thermal.h
+++ linux-2.6/include/linux/thermal.h
@@ -61,6 +61,10 @@ struct thermal_cooling_device {
struct list_head node;
 };
 
+#define KELVIN_TO_CELSIUS(t)   (long)(((long)t-2732 >= 0) ?\
+   ((long)t-2732+5)/10 : ((long)t-2732-5)/10)
+#define CELSIUS_TO_KELVIN(t)   ((t)*10+2732)
+
 struct thermal_zone_device {
int id;
char type[THERMAL_NAME_LENGTH];



[PATCH 7/10] ACPI: attach thermal zone info

2008-01-16 Thread Zhang Rui

From: Zhang Rui <[EMAIL PROTECTED]>

The Intel menlow driver needs a pointer to the thermal_zone_device
structure of an ACPI thermal zone.
Attach it to each ACPI thermal zone device object.

Signed-off-by: Zhang Rui <[EMAIL PROTECTED]>
Signed-off-by: Thomas Sujith <[EMAIL PROTECTED]>
---
 drivers/acpi/bus.c  |   25 +
 drivers/acpi/thermal.c  |   11 +++
 include/acpi/acpi_bus.h |2 ++
 3 files changed, 38 insertions(+)

Index: linux-2.6/drivers/acpi/bus.c
===
--- linux-2.6.orig/drivers/acpi/bus.c
+++ linux-2.6/drivers/acpi/bus.c
@@ -122,6 +122,31 @@ int acpi_bus_get_status(struct acpi_devi
 
 EXPORT_SYMBOL(acpi_bus_get_status);
 
+void acpi_bus_private_data_handler(acpi_handle handle,
+  u32 function, void *context)
+{
+   return;
+}
+EXPORT_SYMBOL(acpi_bus_private_data_handler);
+
+int acpi_bus_get_private_data(acpi_handle handle, void **data)
+{
+   acpi_status status = AE_OK;
+
+   if (!*data)
+   return -EINVAL;
+
+   status = acpi_get_data(handle, acpi_bus_private_data_handler, data);
+   if (ACPI_FAILURE(status) || !*data) {
+   ACPI_DEBUG_PRINT((ACPI_DB_INFO, "No context for object [%p]\n",
+   handle));
+   return -ENODEV;
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL(acpi_bus_get_private_data);
+
 /* --
  Power Management
-- 
*/
Index: linux-2.6/drivers/acpi/thermal.c
===
--- linux-2.6.orig/drivers/acpi/thermal.c
+++ linux-2.6/drivers/acpi/thermal.c
@@ -1101,6 +1101,7 @@ static int acpi_thermal_register_thermal
 {
int trips = 0;
int result;
+   acpi_status status;
int i;
 
if (tz->trips.critical.flags.valid)
@@ -1129,6 +1130,15 @@ static int acpi_thermal_register_thermal
if (result)
return result;
 
+   status = acpi_attach_data(tz->device->handle,
+ acpi_bus_private_data_handler,
+ tz->thermal_zone);
+   if (ACPI_FAILURE(status)) {
+   ACPI_DEBUG_PRINT((ACPI_DB_ERROR,
+   "Error attaching device data\n"));
+   return -ENODEV;
+   }
+
tz->tz_enabled = 1;
 
printk(KERN_INFO PREFIX "%s is registered as thermal_zone%d\n",
@@ -1142,6 +1152,7 @@ static void acpi_thermal_unregister_ther
sysfs_remove_link(&tz->thermal_zone->device.kobj, "device");
thermal_zone_device_unregister(tz->thermal_zone);
tz->thermal_zone = NULL;
+   acpi_detach_data(tz->device->handle, acpi_bus_private_data_handler);
 }
 
 
Index: linux-2.6/include/acpi/acpi_bus.h
===
--- linux-2.6.orig/include/acpi/acpi_bus.h
+++ linux-2.6/include/acpi/acpi_bus.h
@@ -320,6 +320,8 @@ struct acpi_bus_event {
 
 extern struct kset acpi_subsys;
 extern int acpi_bus_generate_netlink_event(const char*, const char*, u8, int);
+void acpi_bus_private_data_handler(acpi_handle, u32, void *);
+int acpi_bus_get_private_data(acpi_handle, void **);
 /*
  * External Functions
  */



[PATCH 9/10] introduce intel_menlow platform specific driver

2008-01-16 Thread Zhang Rui

From: Thomas Sujith <[EMAIL PROTECTED]>

Intel menlow platform specific driver for thermal management.

Signed-off-by: Thomas Sujith <[EMAIL PROTECTED]>
Signed-off-by: Zhang Rui <[EMAIL PROTECTED]>
---
 drivers/misc/Kconfig|   10 
 drivers/misc/Makefile   |1 
 drivers/misc/intel_menlow.c |  527 
 3 files changed, 538 insertions(+)

Index: linux-2.6/drivers/misc/intel_menlow.c
===
--- /dev/null
+++ linux-2.6/drivers/misc/intel_menlow.c
@@ -0,0 +1,527 @@
+/*
+*  intel_menlow.c - Intel menlow Driver for thermal management extension
+*
+*  Copyright (C) 2008 Intel Corp
+*  Copyright (C) 2008 Sujith Thomas <[EMAIL PROTECTED]>
+*  Copyright (C) 2008 Zhang Rui <[EMAIL PROTECTED]>
+*  ~~
+*
+*  This program is free software; you can redistribute it and/or modify
+*  it under the terms of the GNU General Public License as published by
+*  the Free Software Foundation; version 2 of the License.
+*
+*  This program is distributed in the hope that it will be useful, but
+*  WITHOUT ANY WARRANTY; without even the implied warranty of
+*  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+*  General Public License for more details.
+*
+*  You should have received a copy of the GNU General Public License along
+*  with this program; if not, write to the Free Software Foundation, Inc.,
+*  59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
+*
+* ~~
+*
+*  This driver creates the sys I/F for programming the sensors.
+*  It also implements the driver for intel menlow memory controller (hardware
+*  id is INT0002) which makes use of the platform specific ACPI methods
+*  to get/set bandwidth.
+*/
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+MODULE_AUTHOR("Thomas Sujith");
+MODULE_AUTHOR("Zhang Rui");
+MODULE_DESCRIPTION("Intel Menlow platform specific driver");
+MODULE_LICENSE("GPL");
+
+/*
+ * Memory controller device control
+ */
+
+#define MEMORY_GET_BANDWIDTH "GTHS"
+#define MEMORY_SET_BANDWIDTH "STHS"
+#define MEMORY_ARG_CUR_BANDWIDTH 1
+#define MEMORY_ARG_MAX_BANDWIDTH 0
+
+static int
+memory_get_int_max_bandwidth(struct thermal_cooling_device *cdev,
+unsigned long *max_state)
+{
+   struct acpi_device *device = cdev->devdata;
+   acpi_handle handle = device->handle;
+   unsigned long value;
+   struct acpi_object_list arg_list;
+   union acpi_object arg;
+   acpi_status status = AE_OK;
+
+   arg_list.count = 1;
+   arg_list.pointer = &arg;
+   arg.type = ACPI_TYPE_INTEGER;
+   arg.integer.value = MEMORY_ARG_MAX_BANDWIDTH;
+   status = acpi_evaluate_integer(handle, MEMORY_GET_BANDWIDTH,
+  &arg_list, &value);
+   if (ACPI_FAILURE(status))
+   return -EFAULT;
+
+   *max_state = value - 1;
+   return 0;
+}
+
+static int
+memory_get_max_bandwidth(struct thermal_cooling_device *cdev, char *buf)
+{
+   unsigned long value;
+   if (memory_get_int_max_bandwidth(cdev, &value))
+   return -EINVAL;
+
+   return sprintf(buf, "%ld\n", value);
+}
+
+static int
+memory_get_cur_bandwidth(struct thermal_cooling_device *cdev, char *buf)
+{
+   struct acpi_device *device = cdev->devdata;
+   acpi_handle handle = device->handle;
+   unsigned long value;
+   struct acpi_object_list arg_list;
+   union acpi_object arg;
+   acpi_status status = AE_OK;
+
+   arg_list.count = 1;
+   arg_list.pointer = &arg;
+   arg.type = ACPI_TYPE_INTEGER;
+   arg.integer.value = MEMORY_ARG_CUR_BANDWIDTH;
+   status = acpi_evaluate_integer(handle, MEMORY_GET_BANDWIDTH,
+  &arg_list, &value);
+   if (ACPI_FAILURE(status))
+   return -EFAULT;
+
+   return sprintf(buf, "%ld\n", value);
+}
+
+static int
+memory_set_cur_bandwidth(struct thermal_cooling_device *cdev,
+unsigned int state)
+{
+   struct acpi_device *device = cdev->devdata;
+   acpi_handle handle = device->handle;
+   struct acpi_object_list arg_list;
+   union acpi_object arg;
+   acpi_status status;
+   int temp;
+   unsigned long max_state;
+
+   if (memory_get_int_max_bandwidth(cdev, &max_state))
+   return -EFAULT;
+
+   if (max_state < 0 || state > max_state)
+   return -EINVAL;
+
+   arg_list.count = 1;
+   arg_list.pointer = &arg;
+   arg.type = ACPI_TYPE_INTEGER;
+   arg.integer.value = state;
+
+   status =
+   acpi_evaluate_integer(handle, MEMORY_SET_BANDWIDTH, &arg_list,
+ (unsigned long *)&temp);
+
+   printk(KERN_INFO
+   "Bandwidth value was %d: status is %d\n", state, status);
+   if

[PATCH 10/10] ACPI: thermal fixup

2008-01-16 Thread Zhang Rui

From: Zhang Rui <[EMAIL PROTECTED]>

The alias name may be used in _PSL, _ALx and _TZD,
so we bind the cooling device only if the acpi_device node matches.
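
The idea can be sketched with toy types, assuming that an alias shows up as
a second handle that resolves to the same device. struct toy_device, struct
toy_node and both functions are hypothetical stand-ins, not ACPI APIs; in
the patch the lookup is done with acpi_bus_get_device().

```c
#include <stddef.h>

/* Toy stand-in for struct acpi_device. */
struct toy_device {
	const char *name;
};

/* Toy stand-in for the namespace node an acpi_handle points to. */
struct toy_node {
	struct toy_device *dev;	/* device attached to this node, may be NULL */
};

/* Toy lookup: resolve a handle to its device. Returns 0 on success. */
static int toy_get_device(const struct toy_node *handle,
			  struct toy_device **dev)
{
	if (!handle || !handle->dev)
		return -1;
	*dev = handle->dev;
	return 0;
}

/*
 * Compare by resolved device instead of by raw handle, so that an
 * alias handle still matches the cooling device it refers to.
 */
static int handle_refers_to(const struct toy_node *handle,
			    const struct toy_device *device)
{
	struct toy_device *dev;

	return toy_get_device(handle, &dev) == 0 && dev == device;
}
```

A plain handle comparison would reject the alias node even though it names
the same fan; comparing the resolved device accepts both.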

Signed-off-by: Zhang Rui <[EMAIL PROTECTED]>
Signed-off-by: Thomas Sujith <[EMAIL PROTECTED]>
---
 drivers/acpi/thermal.c |   42 --
 1 file changed, 24 insertions(+), 18 deletions(-)

Index: linux-2.6/drivers/acpi/thermal.c
===
--- linux-2.6.orig/drivers/acpi/thermal.c
+++ linux-2.6/drivers/acpi/thermal.c
@@ -1015,7 +1015,9 @@ static int acpi_thermal_cooling_device_c
 {
struct acpi_device *device = cdev->devdata;
struct acpi_thermal *tz = thermal->devdata;
-   acpi_handle handle = device->handle;
+   struct acpi_device *dev;
+   acpi_status status;
+   acpi_handle handle;
int i;
int j;
int trip = -1;
@@ -1031,12 +1033,13 @@ static int acpi_thermal_cooling_device_c
trip++;
for (i = 0; i < tz->trips.passive.devices.count;
i++) {
-   if (tz->trips.passive.devices.handles[i] !=
-   handle)
-   continue;
-   result = action(thermal, trip, cdev);
-   if (result)
-   goto failed;
+   handle = tz->trips.passive.devices.handles[i];
+   status = acpi_bus_get_device(handle, &dev);
+   if (ACPI_SUCCESS(status) && (dev == device)) {
+   result = action(thermal, trip, cdev);
+   if (result)
+   goto failed;
+   }
}
}
 
@@ -1047,21 +1050,24 @@ static int acpi_thermal_cooling_device_c
for (j = 0;
j < tz->trips.active[i].devices.count;
j++) {
-   if (tz->trips.active[i].devices.
-   handles[j] != handle)
-   continue;
-   result = action(thermal, trip, cdev);
-   if (result)
-   goto failed;
+   handle = tz->trips.active[i].devices.handles[j];
+   status = acpi_bus_get_device(handle, &dev);
+   if (ACPI_SUCCESS(status) && (dev == device)) {
+   result = action(thermal, trip, cdev);
+   if (result)
+   goto failed;
+   }
}
}
 
for (i = 0; i < tz->devices.count; i++) {
-   if (tz->devices.handles[i] != handle)
-   continue;
-   result = action(thermal, -1, cdev);
-   if (result)
-   goto failed;
+   handle = tz->devices.handles[i];
+   status = acpi_bus_get_device(handle, &dev);
+   if (ACPI_SUCCESS(status) && (dev == device)) {
+   result = action(thermal, -1, cdev);
+   if (result)
+   goto failed;
+   }
}
 
 failed:



[PATCH 2/10] ACPI: register ACPI thermal zone as generic thermal zone devices

2008-01-16 Thread Zhang Rui

From: Zhang Rui <[EMAIL PROTECTED]>

Register ACPI thermal zone as thermal zone device.

The new sys I/F for ACPI thermal zone will be like this:

/sys/class/thermal:
|thermal_zone1:
|-type:                 "ACPI thermal zone". RO
|-temp:                 the current temperature. RO
|-mode:                 the current working mode. RW.
                        the default value is "kernel", which means thermal
                        management is done by the ACPI thermal driver.
                        "echo user > mode" prevents all the ACPI thermal
                        driver actions upon any trip points.
|-trip_point_0_temp:    the threshold of trip point 0. RO.
|-trip_point_0_type:    "critical". RO.
                        the type of trip point 0
                        This may be one of critical/hot/passive/active[x]
                        for an ACPI thermal zone.
...
|-trip_point_3_temp:
|-trip_point_3_type:    "active[1]"
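
From user space, the attributes above could be read roughly as follows. This
is a sketch only: the path layout is taken from the listing above, and the
three helper functions are hypothetical names, not part of this patch.

```c
#include <stdio.h>
#include <string.h>

/* Build the path of a zone attribute such as "temp" or "mode". */
static int thermal_attr_path(char *buf, size_t len, int zone, const char *attr)
{
	int n = snprintf(buf, len, "/sys/class/thermal/thermal_zone%d/%s",
			 zone, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Build the path of a trip point attribute; field is "temp" or "type". */
static int thermal_trip_path(char *buf, size_t len, int zone, int trip,
			     const char *field)
{
	int n = snprintf(buf, len,
			 "/sys/class/thermal/thermal_zone%d/trip_point_%d_%s",
			 zone, trip, field);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read the first line of an attribute into buf, without the newline.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
static int thermal_read_attr(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

A monitoring tool would build the path for "temp" on zone 1 and pass it to
thermal_read_attr(); writing "user" to the "mode" attribute is what hands
control over to user space.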

Signed-off-by: Zhang Rui <[EMAIL PROTECTED]>
Signed-off-by: Thomas Sujith <[EMAIL PROTECTED]>
---
 drivers/acpi/Kconfig   |1 
 drivers/acpi/thermal.c |  301 +++--
 2 files changed, 292 insertions(+), 10 deletions(-)

Index: linux-2.6/drivers/acpi/thermal.c
===
--- linux-2.6.orig/drivers/acpi/thermal.c
+++ linux-2.6/drivers/acpi/thermal.c
@@ -43,7 +43,7 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 #include 
 
@@ -195,6 +195,8 @@ struct acpi_thermal {
struct acpi_thermal_trips trips;
struct acpi_handle_list devices;
struct timer_list timer;
+   struct thermal_zone_device *thermal_zone;
+   int tz_enabled;
struct mutex lock;
 };
 
@@ -732,6 +734,9 @@ static void acpi_thermal_check(void *dat
if (result)
goto unlock;
 
+   if (!tz->tz_enabled)
+   goto unlock;
+
memset(&tz->state, 0, sizeof(tz->state));
 
/*
@@ -825,6 +830,273 @@ static void acpi_thermal_check(void *dat
mutex_unlock(&tz->lock);
 }
 
+/* sys I/F for generic thermal sysfs support */
+static int thermal_get_temp(struct thermal_zone_device *thermal, char *buf)
+{
+   struct acpi_thermal *tz = thermal->devdata;
+
+   if (!tz)
+   return -EINVAL;
+
+   return sprintf(buf, "%ld\n", KELVIN_TO_CELSIUS(tz->temperature));
+}
+
+static const char enabled[] = "kernel";
+static const char disabled[] = "user";
+static int thermal_get_mode(struct thermal_zone_device *thermal,
+   char *buf)
+{
+   struct acpi_thermal *tz = thermal->devdata;
+
+   if (!tz)
+   return -EINVAL;
+
+   return sprintf(buf, "%s\n", tz->tz_enabled ?
+   enabled : disabled);
+}
+
+static int thermal_set_mode(struct thermal_zone_device *thermal,
+   const char *buf)
+{
+   struct acpi_thermal *tz = thermal->devdata;
+   int enable;
+
+   if (!tz)
+   return -EINVAL;
+
+   /*
+* enable/disable thermal management from ACPI thermal driver
+*/
+   if (!strncmp(buf, enabled, sizeof enabled - 1))
+   enable = 1;
+   else if (!strncmp(buf, disabled, sizeof disabled - 1))
+   enable = 0;
+   else
+   return -EINVAL;
+
+   if (enable != tz->tz_enabled) {
+   tz->tz_enabled = enable;
+   ACPI_DEBUG_PRINT((ACPI_DB_INFO,
+   "%s ACPI thermal control\n",
+   tz->tz_enabled ? enabled : disabled));
+   acpi_thermal_check(tz);
+   }
+   return 0;
+}
+
+static int thermal_get_trip_type(struct thermal_zone_device *thermal,
+int trip, char *buf)
+{
+   struct acpi_thermal *tz = thermal->devdata;
+   int i;
+
+   if (!tz || trip < 0)
+   return -EINVAL;
+
+   if (tz->trips.critical.flags.valid) {
+   if (!trip)
+   return sprintf(buf, "critical\n");
+   trip--;
+   }
+
+   if (tz->trips.hot.flags.valid) {
+   if (!trip)
+   return sprintf(buf, "hot\n");
+   trip--;
+   }
+
+   if (tz->trips.passive.flags.valid) {
+   if (!trip)
+   return sprintf(buf, "passive\n");
+   trip--;
+   }
+
+   for (i = 0; i < ACPI_THERMAL_MAX_ACTIVE &&
+   tz->trips.active[i].flags.valid; i++) {
+   if (!trip)
+   return sprintf(buf, "active%d\n", i);
+   trip--;
+   }
+
+   return -EINVAL;
+}
+
+static int thermal_get_trip_temp(struct thermal_zone_device *thermal,
+

[PATCH 1/10] the generic thermal sysfs driver

2008-01-16 Thread Zhang Rui

From: Zhang Rui <[EMAIL PROTECTED]>

The Generic Thermal sysfs driver for thermal management.

Signed-off-by: Zhang Rui <[EMAIL PROTECTED]>
Signed-off-by: Thomas Sujith <[EMAIL PROTECTED]>
---
 Documentation/thermal/sysfs-api.txt |  247
 drivers/Kconfig                     |    2
 drivers/Makefile                    |    1
 drivers/thermal/Kconfig             |   16
 drivers/thermal/Makefile            |    6
 drivers/thermal/thermal.c           |  714
 include/linux/thermal.h             |   90
 7 files changed, 1076 insertions(+)

Index: linux-2.6/drivers/thermal/thermal.c
===
--- /dev/null
+++ linux-2.6/drivers/thermal/thermal.c
@@ -0,0 +1,714 @@
+/*
+ *  thermal.c - Generic Thermal Management Sysfs support.
+ *
+ *  Copyright (C) 2008 Intel Corp
+ *  Copyright (C) 2008 Zhang Rui <[EMAIL PROTECTED]>
+ *  Copyright (C) 2008 Sujith Thomas <[EMAIL PROTECTED]>
+ *
+ *  ~~
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; version 2 of the License.
+ *
+ *  This program is distributed in the hope that it will be useful, but
+ *  WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ *  General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License along
+ *  with this program; if not, write to the Free Software Foundation, Inc.,
+ *  59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * ~~
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_AUTHOR("Zhang Rui");
+MODULE_DESCRIPTION("Generic thermal management sysfs support");
+MODULE_LICENSE("GPL");
+
+#define PREFIX "Thermal: "
+
+struct thermal_cooling_device_instance {
+   int id;
+   char name[THERMAL_NAME_LENGTH];
+   struct thermal_zone_device *tz;
+   struct thermal_cooling_device *cdev;
+   int trip;
+   char attr_name[THERMAL_NAME_LENGTH];
+   struct device_attribute attr;
+   struct list_head node;
+};
+
+static DEFINE_IDR(thermal_tz_idr);
+static DEFINE_IDR(thermal_cdev_idr);
+static DEFINE_MUTEX(thermal_idr_lock);
+
+static LIST_HEAD(thermal_tz_list);
+static LIST_HEAD(thermal_cdev_list);
+static DEFINE_MUTEX(thermal_list_lock);
+
+static int get_idr(struct idr *idr, struct mutex *lock, int *id)
+{
+   int err;
+
+  again:
+   if (unlikely(idr_pre_get(idr, GFP_KERNEL) == 0))
+   return -ENOMEM;
+
+   if (lock)
+   mutex_lock(lock);
+   err = idr_get_new(idr, NULL, id);
+   if (lock)
+   mutex_unlock(lock);
+   if (unlikely(err == -EAGAIN))
+   goto again;
+   else if (unlikely(err))
+   return err;
+
+   *id = *id & MAX_ID_MASK;
+   return 0;
+}
+
+static void release_idr(struct idr *idr, struct mutex *lock, int id)
+{
+   if (lock)
+   mutex_lock(lock);
+   idr_remove(idr, id);
+   if (lock)
+   mutex_unlock(lock);
+}
+
+/* sys I/F for thermal zone */
+
+#define to_thermal_zone(_dev) \
+   container_of(_dev, struct thermal_zone_device, device)
+
+static ssize_t
+type_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+   struct thermal_zone_device *tz = to_thermal_zone(dev);
+
+   return sprintf(buf, "%s\n", tz->type);
+}
+
+static ssize_t
+temp_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+   struct thermal_zone_device *tz = to_thermal_zone(dev);
+
+   if (!tz->ops->get_temp)
+   return -EPERM;
+
+   return tz->ops->get_temp(tz, buf);
+}
+
+static ssize_t
+mode_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+   struct thermal_zone_device *tz = to_thermal_zone(dev);
+
+   if (!tz->ops->get_mode)
+   return -EPERM;
+
+   return tz->ops->get_mode(tz, buf);
+}
+
+static ssize_t
+mode_store(struct device *dev, struct device_attribute *attr,
+  const char *buf, size_t count)
+{
+   struct thermal_zone_device *tz = to_thermal_zone(dev);
+   int result;
+
+   if (!tz->ops->set_mode)
+   return -EPERM;
+
+   result = tz->ops->set_mode(tz, buf);
+   if (result)
+   return result;
+
+   return count;
+}
+
+static ssize_t
+trip_point_type_show(struct device *dev, struct device_attribute *attr,
+char *buf)
+{
+   struct thermal_zone_device *tz = to_thermal_zone(dev);
+   int trip;
+
+   if (!tz->ops->get_trip_type)
+   return -EPERM;
+
+   if (!sscanf(attr->attr.name, "trip_point_%d_type", &trip))
+

[PATCH 0/10] generic thermal management

2008-01-16 Thread Zhang Rui

Hi, all,

This patch series introduces a new generic thermal sysfs driver
which provides a set of interfaces for thermal zone devices (sensors)
and thermal cooling devices (fan, processor...) to register with the
thermal management solution and to be a part of it.

It also includes the implementation for the ACPI thermal zone.
A standard sysfs I/F should be available for all ACPI thermal zones
with this patch series applied.

Patch 01 creates the new generic thermal sysfs driver.
 It defines two kinds of devices, thermal zone device and
 thermal cooling device.
 A thermal zone device usually contains a sensor to monitor the
 temperature, several trip points and a bunch of cooling devices
 associated with them.
 A thermal cooling device is a device that can be throttled
 to cool the system.
 The generic thermal sysfs driver creates the standard sysfs I/F
 for any registered thermal zone and thermal cooling device,
 and binds the cooling devices to thermal zones if possible.

Patch 02 registers ACPI thermal zone as thermal zone device.

Patch 03 fixes violations of the ACPI spec in the ACPI thermal driver.

Patch 04 registers ACPI Fan as thermal cooling device.

Patch 05 registers ACPI Processor as thermal cooling device.

Patch 06 registers ACPI Video LCD as thermal cooling device,
 because throttling the LCD backlight can cool the system as well.

Patch 09 creates a new platform-specific driver, intel_menlow,
 which is the thermal enhancement driver for the Intel Menlow platform.
 It programs the sensor of each thermal zone and registers the
 Intel memory controller (hardware id INT0002) as thermal cooling device.

Patches 07, 08 and 10 are minor fixes; please refer to the changelog of each patch.

I've tested them and they work well on several systems.
I'd like to get some feedback from the list. Any comments are appreciated. :)

Thanks,
Rui


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Andreas Dilger

On Jan 15, 2008  22:05 -0500, Rik van Riel wrote:
> With a filesystem that is compartmentalized and checksums metadata,
> I believe that an online fsck is absolutely worth having.
> 
> Instead of the filesystem resorting to mounting the whole volume
> read-only on certain errors, part of the filesystem can be offlined
> while an fsck runs.  This could even be done automatically in many
> situations.

In ext4 we store per-group state flags in each group, and the group
descriptor is checksummed (to detect spurious flags), so it should
be relatively straightforward to store an "error" flag in a single
group and have it become read-only.

As a starting point, it would be worthwhile to check instances of
ext4_error() to see how many of them can be targeted at a specific
group.  I'd guess most of them could be (corrupt inodes, directory
and indirect blocks, incorrect bitmaps).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


[PATCH] ACPI: EC: add leading zeros to debug messages

2008-01-16 Thread Németh Márton

From: Márton Németh <[EMAIL PROTECTED]>

Add leading zeros to pr_debug() calls. For example, if x=0x0a, the format
"0x%2x" yields the string "0x a", while the format "0x%2.2x" yields "0x0a".

Signed-off-by: Márton Németh <[EMAIL PROTECTED]>
---
--- linux-2.6.24-rc8/drivers/acpi/ec.c.orig 2008-01-16 07:25:33.0 +0100
+++ linux-2.6.24-rc8/drivers/acpi/ec.c  2008-01-17 07:15:10.0 +0100
@@ -138,26 +138,26 @@ static struct acpi_ec {
 static inline u8 acpi_ec_read_status(struct acpi_ec *ec)
 {
u8 x = inb(ec->command_addr);
-   pr_debug(PREFIX "---> status = 0x%2x\n", x);
+   pr_debug(PREFIX "---> status = 0x%2.2x\n", x);
return x;
 }

 static inline u8 acpi_ec_read_data(struct acpi_ec *ec)
 {
u8 x = inb(ec->data_addr);
-   pr_debug(PREFIX "---> data = 0x%2x\n", x);
+   pr_debug(PREFIX "---> data = 0x%2.2x\n", x);
return inb(ec->data_addr);
 }

 static inline void acpi_ec_write_cmd(struct acpi_ec *ec, u8 command)
 {
-   pr_debug(PREFIX "<--- command = 0x%2x\n", command);
+   pr_debug(PREFIX "<--- command = 0x%2.2x\n", command);
outb(command, ec->command_addr);
 }

 static inline void acpi_ec_write_data(struct acpi_ec *ec, u8 data)
 {
-   pr_debug(PREFIX "<--- data = 0x%2x\n", data);
+   pr_debug(PREFIX "<--- data = 0x%2.2x\n", data);
outb(data, ec->data_addr);
 }


[PATCH] ACPI: EC: "DEBUG" needs to be defined earlier

2008-01-16 Thread Németh Márton

From: Márton Németh <[EMAIL PROTECTED]>

The "DEBUG" symbol needs to be defined before #including  to
get the pr_debug() working.

Signed-off-by: Márton Németh <[EMAIL PROTECTED]>
---
--- linux-2.6.24-rc8/drivers/acpi/ec.c.orig 2008-01-16 07:25:33.0 +0100
+++ linux-2.6.24-rc8/drivers/acpi/ec.c  2008-01-16 19:41:24.0 +0100
@@ -26,6 +26,9 @@
  * ~~
  */

+/* Uncomment next line to get verbose print outs*/
+/* #define DEBUG */
+
 #include 
 #include 
 #include 
@@ -47,9 +50,6 @@
 #undef PREFIX
 #define PREFIX "ACPI: EC: "

-/* Uncomment next line to get verbose print outs*/
-/* #define DEBUG */
-
 /* EC status register */
 #define ACPI_EC_FLAG_OBF   0x01/* Output buffer full */
 #define ACPI_EC_FLAG_IBF   0x02/* Input buffer full */

Re: [Bluez-devel] Oops involving RFCOMM and sysfs

2008-01-16 Thread Dave Young

On Jan 17, 2008 7:06 AM, Gabor Gombas <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On Wed, Jan 16, 2008 at 09:02:05AM +0800, Dave Young wrote:
>
> > The rfcomm tty device will possibly retain even when conn is down,
> > and sysfs doesn't support zombie device moving, so this patch
> > move the tty device before conn device is destroyed.
> >
> > Signed-off-by: Dave Young <[EMAIL PROTECTED]>
>
> This seems to work, both the oops and the hang are gone. I get these
> messages in syslog when the Bluetooth link hangs and I want to kill pppd
> with "poff":
>
> Jan 16 23:55:59 twister kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1
> Jan 16 23:56:09 twister kernel: unregister_netdevice: waiting for ppp0 to become free. Usage count = 1
>
> But a "killall -9 pppd" seems to help and then the re-connect (after the
> phone got power-cycled) works.

Weird. I guess calling device_move(dev, NULL) twice causes the problem.

Anyway, device_move should compare old_parent and new_parent, and if
they are equal it should just return.

Am I right?

>
>
> Gabor
>
> --
>  -
>  MTA SZTAKI Computer and Automation Research Institute
> Hungarian Academy of Sciences
>  -
>

Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-16 Thread David Miller

From: "Brandeburg, Jesse" <[EMAIL PROTECTED]>
Date: Wed, 16 Jan 2008 23:09:47 -0800

> We spent Wednesday trying to reproduce (without the patch) these issues
> without much luck, and have applied the patch cleanly and will continue
> testing it.  Given the simplicity of the changes, and the community
> testing, I'll give my ack and we will continue testing.

You need a slow CPU, and you need to make sure you do actually
trigger the TX limiting code there.

I bet your cpus are fast enough that it simply never triggers.
:-)

> Acked-by: Jesse Brandeburg <[EMAIL PROTECTED]>

Thanks for reviewing Jesse.

RE: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang

2008-01-16 Thread Brandeburg, Jesse

David Miller wrote:
> From: "Brandeburg, Jesse" <[EMAIL PROTECTED]>
> Date: Tue, 15 Jan 2008 13:53:43 -0800
> 
>> The tx code has an "early exit" that tries to limit the amount of tx
>> packets handled in a single poll loop and requires napi or interrupt
>> rescheduling based on the return value from e1000_clean_tx_irq.
> 
> That explains everything, thanks Jesse.
> 
> Ok, here is the patch I'll propose to fix this.  The goal is to make
> it as simple as possible without regressing the thing we were trying
> to fix.

We spent Wednesday trying to reproduce (without the patch) these issues
without much luck, and have applied the patch cleanly and will continue
testing it.  Given the simplicity of the changes, and the community
testing, I'll give my ack and we will continue testing.

I think we should fix Robert's (unrelated, but in this thread) reported
issue before 2.6.24 final if we can, and I'll look at that tonight and
tomorrow.

Thanks for your work on this Dave,
 Jesse

Acked-by: Jesse Brandeburg <[EMAIL PROTECTED]>

[PATCH] BUG_ON() bad input to request_irq

2008-01-16 Thread Rusty Russell

Is there any reason why these bugs should be treated gently?  The
caller might not want to check NR_IRQS and IRQ_NOREQUEST cases, but
a NULL handler or NULL dev_id w/ shared are coding bugs.

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
---
 kernel/irq/manage.c |7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff -r c2eb8ef5a0be kernel/irq/manage.c
--- a/kernel/irq/manage.c   Thu Jan 17 15:48:03 2008 +1100
+++ b/kernel/irq/manage.c   Thu Jan 17 15:49:33 2008 +1100
@@ -532,13 +532,12 @@ int request_irq(unsigned int irq, irq_ha
 * which interrupt is which (messes up the interrupt freeing
 * logic etc).
 */
-   if ((irqflags & IRQF_SHARED) && !dev_id)
-   return -EINVAL;
+   BUG_ON((irqflags & IRQF_SHARED) && !dev_id);
+   BUG_ON(!handler);
+
if (irq >= NR_IRQS)
return -EINVAL;
if (irq_desc[irq].status & IRQ_NOREQUEST)
-   return -EINVAL;
-   if (!handler)
return -EINVAL;
 
action = kmalloc(sizeof(struct irqaction), GFP_ATOMIC);

Re: [PATCH 0 of 8] x86: refactored paravirt mmu_ops

2008-01-16 Thread Ingo Molnar


* Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:

> Hi Ingo,
> 
> I refactored the paravirt.h mmu_ops patch into a number of smaller 
> ones.

thanks, applied.

Ingo

[PATCH] request_irq() always returns -EINVAL with a NULL handler.

2008-01-16 Thread Rusty Russell

I assume that these ancient network drivers were trying to find out if
an irq is available.  eepro.c expecting +EBUSY was doubly wrong.

I'm not sure that can_request_irq() is the right thing, but these drivers
are definitely wrong.

request_irq should BUG() on bad input, and these would have been found
earlier.

Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
---
 drivers/net/3c503.c |    2 +-
 drivers/net/e2100.c |    2 +-
 drivers/net/eepro.c |    2 +-
 drivers/net/hp.c    |    2 +-
 kernel/irq/manage.c |    1 +
 5 files changed, 5 insertions(+), 4 deletions(-)

diff -r 0b7e4fbb6238 drivers/net/3c503.c
--- a/drivers/net/3c503.c   Thu Jan 17 15:49:34 2008 +1100
+++ b/drivers/net/3c503.c   Thu Jan 17 16:40:28 2008 +1100
@@ -379,7 +379,7 @@ el2_open(struct net_device *dev)
 
outb(EGACFR_NORM, E33G_GACFR);  /* Enable RAM and interrupts. */
do {
-   if (request_irq (*irqp, NULL, 0, "bogus", dev) != -EBUSY) {
+   if (can_request_irq(*irqp, 0)) {
/* Twinkle the interrupt, and check if it's seen. */
unsigned long cookie = probe_irq_on();
outb_p(0x04 << ((*irqp == 9) ? 2 : *irqp), E33G_IDCFR);
diff -r 0b7e4fbb6238 drivers/net/e2100.c
--- a/drivers/net/e2100.c   Thu Jan 17 15:49:34 2008 +1100
+++ b/drivers/net/e2100.c   Thu Jan 17 16:40:28 2008 +1100
@@ -202,7 +202,7 @@ static int __init e21_probe1(struct net_
if (dev->irq < 2) {
int irqlist[] = {15,11,10,12,5,9,3,4}, i;
for (i = 0; i < 8; i++)
-   if (request_irq (irqlist[i], NULL, 0, "bogus", NULL) != -EBUSY) {
+   if (can_request_irq(irqlist[i], 0)) {
dev->irq = irqlist[i];
break;
}
diff -r 0b7e4fbb6238 drivers/net/eepro.c
--- a/drivers/net/eepro.c   Thu Jan 17 15:49:34 2008 +1100
+++ b/drivers/net/eepro.c   Thu Jan 17 16:40:28 2008 +1100
@@ -914,7 +914,7 @@ static int  eepro_grab_irq(struct net_dev
 
eepro_sw2bank0(ioaddr); /* Switch back to Bank 0 */
 
-   if (request_irq (*irqp, NULL, IRQF_SHARED, "bogus", dev) != EBUSY) {
+   if (can_request_irq(*irqp, IRQF_SHARED)) {
unsigned long irq_mask;
/* Twinkle the interrupt, and check if it's seen */
irq_mask = probe_irq_on();
diff -r 0b7e4fbb6238 drivers/net/hp.c
--- a/drivers/net/hp.c  Thu Jan 17 15:49:34 2008 +1100
+++ b/drivers/net/hp.c  Thu Jan 17 16:40:28 2008 +1100
@@ -170,7 +170,7 @@ static int __init hp_probe1(struct net_d
int *irqp = wordmode ? irq_16list : irq_8list;
do {
int irq = *irqp;
-   if (request_irq (irq, NULL, 0, "bogus", NULL) != -EBUSY) {
+   if (can_request_irq(irq, 0)) {
unsigned long cookie = probe_irq_on();
/* Twinkle the interrupt, and check if it's seen. */
outb_p(irqmap[irq] | HP_RUN, ioaddr + HP_CONFIGURE);
diff -r 0b7e4fbb6238 kernel/irq/manage.c
--- a/kernel/irq/manage.c   Thu Jan 17 15:49:34 2008 +1100
+++ b/kernel/irq/manage.c   Thu Jan 17 16:40:28 2008 +1100
@@ -252,6 +252,7 @@ int can_request_irq(unsigned int irq, un
 
return !action;
 }
+EXPORT_SYMBOL(can_request_irq);
 
 void compat_irq_chip_set_default_handler(struct irq_desc *desc)
 {

Re: Bitops source problem

2008-01-16 Thread KOSAKI Motohiro

Hi

> If that is indeed the source of your change_bit function then there is
> a problem.  However in my kernel tree there is a LOCK_PREFIX in the
> definition of the atomic version.  I don't have your exact source tree
> handy, but on a local RHEL4 system, the LOCK_PREFIX is still there:
> 
> static inline void change_bit(int nr, volatile unsigned long * addr)
> {
> __asm__ __volatile__( LOCK_PREFIX
> "btcl %1,%0"
> :"=m" (ADDR)
> :"Ir" (nr));
> }

2.6.24-rc6-mm1 has LOCK_PREFIX too :)


static inline void change_bit(int nr, volatile void *addr)
{
asm volatile(LOCK_PREFIX "btc %1,%0"
 : ADDR : "Ir" (nr));
}


Re: Bitops source problem

2008-01-16 Thread Roland Dreier

 > Then, I think there is a problem with the function written below which is 
 > meant to be atomic.
 > 
 > static __inline__ void change_bit(int nr, volatile void * addr)
 > {
 > __asm__ __volatile__(
 > "btcl %1,%0"
 > :"=m" (ADDR)
 > :"Ir" (nr));
 > }

If that is indeed the source of your change_bit function then there is
a problem.  However in my kernel tree there is a LOCK_PREFIX in the
definition of the atomic version.  I don't have your exact source tree
handy, but on a local RHEL4 system, the LOCK_PREFIX is still there:

static inline void change_bit(int nr, volatile unsigned long * addr)
{
__asm__ __volatile__( LOCK_PREFIX
"btcl %1,%0"
:"=m" (ADDR)
:"Ir" (nr));
}

 - R.

Re: x86: remove casts

2008-01-16 Thread Kyle McMartin

On Wed, Jan 16, 2008 at 11:57:54PM +0100, Jan Engelhardt wrote:
> 
> On Jan 16 2008 17:20, Kyle McMartin wrote:
> >On Wed, Jan 16, 2008 at 10:15:39PM +0100, Jan Engelhardt wrote:
> >> parent a9f7faa5fd229a65747f02ab0f2d45ee35856760
> >> commit 
> >
> >^- did you just make that up? ;-)
> 
> Yes. git does not care anyway.
> 

right. the point was if you had generated a sha1 of all ones. that would
be kind of neat.

RE: Bitops source problem

2008-01-16 Thread Pravin Nanaware

Thanks for the reply John. 

Then, I think there is a problem with the function written below, which is
meant to be atomic.

static __inline__ void change_bit(int nr, volatile void * addr)
{
__asm__ __volatile__(
"btcl %1,%0"
:"=m" (ADDR)
:"Ir" (nr));
}

Regards,
Pravin


-Original Message-
From: John Hubbard [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 17, 2008 11:17 AM
To: Pravin Nanaware
Cc: LKML
Subject: Re: Bitops source problem


Pravin Nanaware wrote:
> Hi,
> 
> I was just going through the include file in the /usr/include/asm/bitops.h
> 
> The function description describes it as non-atomic but it seems it is not. 
> 
> static __inline__ void __change_bit(int nr, volatile void * addr)
> {
> __asm__ __volatile__(
> "btcl %1,%0"
> :"=m" (ADDR)
> :"Ir" (nr));
> }
> 
> The kernel version I am using is 2.6.9-42. Is it right or am I missing 
> something ?  
> 
> Thanks,
> Pravin
> 

The bitops.h comments are correct: the btc IA-32 instruction is only 
atomic if used with the lock prefix. The function above does not use the 
lock prefix, so it is not atomic.

thanks,
John Hubbard



2.6.24-rc7: memory leak?

2008-01-16 Thread CaT

Not sure where to begin so here goes anyway. Today I did an rsync backup
of a server with 2million+ files. Before doing so the used memory on the
server this was initiated from was under 200meg (excluding buffers and
cache). During the rsync the memory used grew to just shy of 1.6gig and
now, about 2 hours after the rsync has well and truly finished, the used
memory is at 1.23gig. This is what free reports:

             total       used       free     shared    buffers     cached
Mem:       2058128    1994468      63660          0     688604      11432
-/+ buffers/cache:    1294432     763696
Swap:      1048568          0    1048568

There are 75 processes on the box of which almost 47 are kernel
processes + init. Of the rest, the top 3 have an RSS of 9.4meg, 6.2meg
and 4.8meg, 7 are at around 2meg and the rest are below 2meg with the
majority below 1. So unless I'm misunderstanding something, processes
alone do not account for the amount of used memory.

The destination of the rsync was an ext3 filesystem over raid5 over ahci
sata.

I've included /proc/meminfo, /proc/slabinfo and config.gz. If there's
anything else please shout.

-- 
"To the extent that we overreact, we proffer the terrorists the
greatest tribute."
- High Court Judge Michael Kirby
slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ip_fib_alias                   10      59     64   59    1 : tunables  120   60    8 : slabdata      1      1      0
ip_fib_hash                    10      59     64   59    1 : tunables  120   60    8 : slabdata      1      1      0
raid5-md3                     256     261    824    9    2 : tunables   54   27    8 : slabdata     29     29      0
UNIX                            5      22    704   11    2 : tunables   54   27    8 : slabdata      2      2      0
xt_hashlimit                    0       0     88   44    1 : tunables  120   60    8 : slabdata      0      0      0
flow_cache                      0       0    128   30    1 : tunables  120   60    8 : slabdata      0      0      0
dm_snap_pending_exception     128     136    112   34    1 : tunables  120   60    8 : slabdata      4      4      0
dm_snap_exception               0       0     32  112    1 : tunables  120   60    8 : slabdata      0      0      0
dm_crypt_io                     0       0     56   67    1 : tunables  120   60    8 : slabdata      0      0      0
dm_uevent                       0       0   2608    3    2 : tunables   24   12    8 : slabdata      0      0      0
dm_target_io                  827     864     24  144    1 : tunables  120   60    8 : slabdata      6      6      0
dm_io                         826     828     40   92    1 : tunables  120   60    8 : slabdata      9      9      0
scsi_cmd_cache                 50      50    384   10    1 : tunables   54   27    8 : slabdata      5      5      0
cfq_io_context                 92     225    152   25    1 : tunables  120   60    8 : slabdata      9      9      0
cfq_queue                      98     224    136   28    1 : tunables  120   60    8 : slabdata      8      8      0
bsg_cmd                         0       0    312   12    1 : tunables   54   27    8 : slabdata      0      0      0
mqueue_inode_cache              1       4    896    4    1 : tunables   54   27    8 : slabdata      1      1      0
udf_inode_cache                 0       0    656    6    1 : tunables   54   27    8 : slabdata      0      0      0
isofs_inode_cache               0       0    632    6    1 : tunables   54   27    8 : slabdata      0      0      0
fat_inode_cache                 0       0    664    6    1 : tunables   54   27    8 : slabdata      0      0      0
fat_cache                       0       0     32  112    1 : tunables  120   60    8 : slabdata      0      0      0
ext2_inode_cache                0       0    752    5    1 : tunables   54   27    8 : slabdata      0      0      0
journal_handle                 32     144     24  144    1 : tunables  120   60    8 : slabdata      1      1      0
journal_head                  129     200     96   40    1 : tunables  120   60    8 : slabdata      5      5      0
revoke_table                   10     202     16  202    1 : tunables  120   60    8 : slabdata      1      1      0
revoke_record                   0       0     32  112    1 : tunables  120   60    8 : slabdata      0      0      0
ext3_inode_cache          1235577 1240565    768    5    1 : tunables   54   27    8 : slabdata 248113 248113      0
dnotify_cache                   0       0     40   92    1 : tunables  120   60    8 : slabdata      0      0      0
inotify_event_cache             0       0     40   92    1 : tunables  120   60    8 : slabdata      0      0      0
inotify_watch_cache             0       0     72   53    1 : tunables  120   60    8 : slabdata      0      0      0
kioctx                          0       0    320   12    1 : tunables   54   27    8 : slabdata      0      0      0
kiocb                           0       0    256   15    1 : tunables  120   60    8 : slabdata      0      0      0
fasync_cache                    0       0     24  144    1 : tunables  120   60    8 : slabdata      0      0      0
shmem_inode_cache               6      15    776    5    1 : tunables   54

Re: [rfc] lockless get_user_pages for dio (and more)

2008-01-16 Thread Nick Piggin

On Thursday 17 January 2008 06:58, Dave Kleikamp wrote:
> On Wed, 2007-12-12 at 16:40 +1100, Nick Piggin wrote:
> > On Wednesday 12 December 2007 16:11, Dave Kleikamp wrote:
> > > On Wed, 2007-12-12 at 15:57 +1100, Nick Piggin wrote:
> > > > Anyway, I am hoping that someone will one day and test if this and
> > > > find it helps their workload, but on the other hand, if it doesn't
> > > > help anyone then we don't have to worry about adding it to the
> > > > kernel ;) I don't have any real setups that hammers DIO with threads.
> > > > I'm guessing DB2 and/or Oracle does?
> > >
> > > I'll try to get someone to run a DB2 benchmark and see what it looks
> > > like.
> >
> > That would be great if you could.
>
> We weren't able to get in any runs before the holidays, but we finally
> have some good news from our performance team:
>
> "To test the effects of the patch, an OLTP workload was run on an IBM
> x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
> 2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel. Comparing
> runs with and without the patch resulted in an overall performance
> benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
> __up_read and __down_read routines that is seen during thread contention
> for system resources was reduced from 2.8% down to .05%. Monitoring
> the /proc/vmstat output from the patched run showed that the counter for
> fast_gup contained a very high number while the fast_gup_slow value was
> zero."
>
> Great work, Nick!

Ah, excellent. Thanks for getting those numbers Dave. This will
be a great help towards getting the patch merged.

I'm just working on the final required piece for this thing (the
pte_special pte bit, required to distinguish whether or not we
can refcount a page without looking at the vma). It is strictly
just a correctness/security measure, which is why you were able
to run tests without it. And it won't add any significant cost to
the fastpaths, so the numbers remain valid.

FWIW, I cc'ed linux-arch: the lockless get_user_pages patch has
architecture specific elements, so it will need some attention
there. If other architectures are interested (eg. powerpc or
ia64), then I will be happy to work with maintainers to help
try to devise a way of fitting it into their tlb flushing scheme.
Ping me if you'd like to take up the offer.

Thanks,
Nick


Re: [linux-kernel] Re: [PATCH] x86: provide a DMI based port 0x80 I/O delay override.

2008-01-16 Thread David Newall

Alan Cox wrote:
>> If the hardware required an intermediate junk I/O, that would be a
>> reason to do one, but it doesn't, does it?  It requires a delay.  It's
>> written thus in all of the application notes.
>> 
>
> And the only instruction that is synchronized to the bus in question is
> an I/O instruction.
>   

This is a timing issue, isn't it?  How are we synchronising, other than
by delaying for a (bus-dependent) period?  The characteristics of each
bus are known so a number can be assigned for "one bus cycle", without
having to use the bus.

>> Wrong again.  Of course one knows how long the delay should be.  The bus
>> speed is known. 
>> 
>
> Wrong again. ISA bus speed is neither defined precisely, nor visible in a
> system portable fashion.
>   

You say, "system portable," but I think you mean, "automatically
determined."  We don't have to define this value automatically, if
that's so hard to do.  We can use a tunable kernel-parameter.

> I'm so glad you have nothing better to do than troll

I'm not trolling.  You know this is true because many people perceive
this to be a problem.  I'm working on fixing it.  Not all Linux problems
are solvable by diving into code, and there is anecdotal evidence to
believe this one has big performance considerations.  I don't understand
why you are opposed to even talking about it.

> if you
> actually wrote code I'd be worried it might get into something people
> used.

Speaking of writing code: I remember working on a bluetooth Oops. 
Lacking the hardware, I went to you for advice on how to get it before
someone for testing.  You never replied.

echo mem > /sys/power/state

2008-01-16 Thread Andrew Morton


So I take everyone's latest and greatest product and injudiciously type the
above command.  The result five minutes later is at
http://userweb.kernel.org/~akpm/borkage.jpg.  See if you can count all the bugs.

Sorry, but I've had it with this stuff and I'm tired of fixing everyone else's
stuff.  I'm just going to ship it.  Good luck.

Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-16 Thread Erez Zadok

In message <[EMAIL PROTECTED]>, Al Viro writes:
> After grep for locking-related things:
[...]

Thanks.  I'll start looking at these issues asap.

Erez.

Re: [UNIONFS] 00/29 Unionfs and related patches pre-merge review (v2)

2008-01-16 Thread Al Viro

After grep for locking-related things:

* lock_parent(): who said that you won't get dentry moved
before managing to grab i_mutex on parent?  While we are at it,
who said that you won't get dentry moved between fetching d_parent
and doing dget()?  In that case parent could've been _freed_ before
you get to dget().

* in create_parents():
+       struct inode *inode = lower_dentry->d_inode;
+       /*
+        * If we get here, it means that we created a new
+        * dentry+inode, but copying permissions failed.
+        * Therefore, we should delete this inode and dput
+        * the dentry so as not to leave cruft behind.
+        */
+       if (lower_dentry->d_op && lower_dentry->d_op->d_iput)
+               lower_dentry->d_op->d_iput(lower_dentry,
+                                          inode);
+       else
+               iput(inode);
+       lower_dentry->d_inode = NULL;
+       dput(lower_dentry);
+       lower_dentry = ERR_PTR(err);
+       goto out;
Really?  So what happens if it had become positive after your test and
somebody had looked it up in lower layer and just now happens to be
in the middle of operations on it?  Will be thucking frilled by that...

* __unionfs_rename():
+   lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
+   err = vfs_rename(lower_old_dir_dentry->d_inode, lower_old_dentry,
+lower_new_dir_dentry->d_inode, lower_new_dentry);
+   unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);

Uh-huh...  To start with, what guarantees that your lower_old_dentry
is still a child of your lower_old_dir_dentry?  What's more, you are
not checking the result of lock_rename(), i.e. asking for serious trouble.

* revalidation stuff: err...  how the devil can it work for
directories, when there's nothing to prevent changes in underlying
layers between ->d_revalidate() and operation itself?  For the upper
layer (unionfs itself) everything's more or less fine, but the rest
of that...

Re: Bitops source problem

2008-01-16 Thread John Hubbard


Pravin Nanaware wrote:
> Hi,
>
> I was just going through the include file in the /usr/include/asm/bitops.h
>
> The function description describes it as non-atomic but it seems it is not.
>
> static __inline__ void __change_bit(int nr, volatile void * addr)
> {
>         __asm__ __volatile__(
>                 "btcl %1,%0"
>                 :"=m" (ADDR)
>                 :"Ir" (nr));
> }
>
> The kernel version I am using is 2.6.9-42. Is it right or am I missing something ?
>
> Thanks,
> Pravin

The bitops.h comments are correct: the btc IA-32 instruction is only 
atomic if used with the lock prefix. The function above does not use the 
lock prefix, so it is not atomic.
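
As a portable illustration of the same distinction, here is a minimal sketch using C11 atomics rather than inline assembly. The helper names are hypothetical and this is not the kernel's implementation; on x86 a compiler will typically emit a lock-prefixed instruction for the atomic variant, but that choice belongs to the compiler.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Hypothetical helper, analogous in spirit to __change_bit(): a plain
 * read-modify-write. Two CPUs toggling bits in the same word at the same
 * time can lose one of the updates. */
static inline void toggle_bit_nonatomic(unsigned int nr, unsigned int *word)
{
    assert(nr < sizeof(*word) * CHAR_BIT);
    *word ^= 1u << nr;
}

/* Hypothetical helper, analogous in spirit to change_bit(): the whole
 * read-modify-write is one atomic operation. */
static inline void toggle_bit_atomic(unsigned int nr, atomic_uint *word)
{
    assert(nr < sizeof(unsigned int) * CHAR_BIT);
    atomic_fetch_xor(word, 1u << nr);
}
```

The non-atomic form is fine when the caller already holds a lock protecting the word, which is the usual reason the double-underscore variants exist.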


thanks,
John Hubbard


Bitops source problem

2008-01-16 Thread Pravin Nanaware

Hi,

I was just going through the include file in the /usr/include/asm/bitops.h

The function description describes it as non-atomic but it seems it is not. 

static __inline__ void __change_bit(int nr, volatile void * addr)
{
__asm__ __volatile__(
"btcl %1,%0"
:"=m" (ADDR)
:"Ir" (nr));
}

The kernel version I am using is 2.6.9-42. Is it right or am I missing something?

Thanks,
Pravin


Re: 2.6.24-rc8-rt1

2008-01-16 Thread Mark Knecht

On Jan 16, 2008 8:27 PM, Steven Rostedt <[EMAIL PROTECTED]> wrote:
> We are pleased to announce the 2.6.24-rc8-rt1 tree, which can be
> downloaded from the location:
>
>   http://rt.et.redhat.com/download/
>

Up and running fine here:

[EMAIL PROTECTED] ~ $ uname -a
Linux lightning 2.6.24-rc8-rt1 #1 PREEMPT RT Wed Jan 16 21:11:05 PST 2008 x86_64 AMD Athlon(tm) 64 Processor 3000+ AuthenticAMD GNU/Linux
[EMAIL PROTECTED] ~ $

Cheers,
Mark

Re: [patch] Converting writeback linked lists to a tree based data structure

2008-01-16 Thread David Chinner

On Thu, Jan 17, 2008 at 11:16:00AM +0800, Fengguang Wu wrote:
> On Thu, Jan 17, 2008 at 09:35:10AM +1100, David Chinner wrote:
> > On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
> > > On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> > > > > Then to do better ordering by adopting radix tree(or rbtree
> > > > > if radix tree is not enough),
> > > > 
> > > > ordering of what?
> > > 
> > > Switch from time to location.
> > 
> > Note that data writeback may be adversely affected by location
> > based writeback rather than time based writeback - think of
> > the effect of location based data writeback on an app that
> > creates lots of short term (<30s) temp files and then removes
> > them before they are written back.
> 
> A small(e.g. 5s) time window can still be enforced, but...

Yes, you could, but that will then result in non-deterministic
performance for repeated workloads because the order of file
writeback will not be consistent.

e.g.  the first run is fast because the output file is at lower
offset than the temp file meaning the temp file gets deleted
without being written.

The second run is slow because the location of the files is
reversed and the temp file is written to disk before the
final output file and hence the run is much slower because
it writes much more.

The third run is also slow, but the files are like the first
fast run. However, pdflush tries to write the temp file back
within 5s of it being dirtied so it skips it and writes
the output file first.

The difference between the first+second case can be found by
knowing that inode number determines writeback order, but
there is no obvious clue as to why the first+third runs are
different.

This is exactly the sort of non-deterministic behaviour we 
want to avoid in a writeback algorithm.

> > Hmmmm - I'm wondering if we'd do better to split data writeback from
> > inode writeback. i.e. we do two passes.  The first pass writes all
> > the data back in time order, the second pass writes all the inodes
> > back in location order.
> > 
> > Right now we interleave data and inode writeback, (i.e.  we do data,
> > inode, data, inode, data, inode, ...). I'd much prefer to see all
> > data written out first, then the inodes. ->writepage often dirties
> > the inode and hence if we need to do multiple do_writepages() calls
> > on an inode to flush all the data (e.g. congestion, large amounts of
> > data to be written, etc), we really shouldn't be calling
> > write_inode() after every do_writepages() call. The inode
> > should not be written until all the data is written
> 
> That may do good to XFS. Another case is documented as follows:
> "the write_inode() function of a typical fs will perform no I/O, but
> will mark buffers in the blockdev mapping as dirty."

Yup, but in that situation ->write_inode() does not do any I/O, so
it will work with any high level inode writeback ordering or timing
scheme equally well.  As a result, that's not the case we need to
optimise at all.

FWIW, the NFS client is likely to work better with split data/
inode writeback as it also has to mark the inode dirty on async
write completion (to get ->write_inode called to issue a commit
RPC). Hence delaying the inode write until after all the data
is written makes sense there as well
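
The two-pass idea quoted above can be sketched in plain C as follows. This is a toy model under stated assumptions, not kernel code: struct toy_inode, its fields, and two_pass_writeback() are hypothetical, the inode number stands in for on-disk location, and the two loops only record the order in which do_writepages() and write_inode() would be called.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a dirty inode; not the kernel's struct inode. */
struct toy_inode {
    unsigned long ino;          /* used here as a proxy for disk location */
    unsigned long dirtied_when; /* time the data was first dirtied */
};

static int by_time(const void *a, const void *b)
{
    const struct toy_inode *x = a, *y = b;
    return (x->dirtied_when > y->dirtied_when) -
           (x->dirtied_when < y->dirtied_when);
}

static int by_location(const void *a, const void *b)
{
    const struct toy_inode *x = a, *y = b;
    return (x->ino > y->ino) - (x->ino < y->ino);
}

/* Pass 1 visits all data in time order; pass 2 visits all inodes in
 * location order. The visit order of each pass is recorded by inode
 * number in data_order[] and inode_order[]. */
static void two_pass_writeback(struct toy_inode *v, size_t n,
                               unsigned long *data_order,
                               unsigned long *inode_order)
{
    size_t i;

    assert(v && data_order && inode_order);

    qsort(v, n, sizeof(*v), by_time);
    for (i = 0; i < n; i++)
        data_order[i] = v[i].ino;   /* stands in for do_writepages() */

    qsort(v, n, sizeof(*v), by_location);
    for (i = 0; i < n; i++)
        inode_order[i] = v[i].ino;  /* stands in for write_inode() */
}
```

Because every inode is written only after the whole data pass, an inode redirtied by its own data writeout is written once rather than after each do_writepages() call.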

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

Re: [TOMOYO #6 retry 02/21] Add struct vfsmount to struct task_struct.

2008-01-16 Thread Kentaro Takeda

Serge:
> Right, but one will be preferred by the community - and while I have my
> own preference, I wouldn't put too much faith on that, rather talk with
> the apparmor folks, look over the lkml logs for previous submissions,
> and then decide.
Thanks for your advice.
We got the same advice from [EMAIL PROTECTED] at Embedded Linux Conference 2007,
and contacted the AppArmor folks, but nothing came of it. We'll try contacting them again.

John Johansen:
Both AppArmor and TOMOYO need vfsmount in LSM hooks. Although we suggested
another solution in [TOMOYO #6], we can use AppArmor's approach.
How about submitting only vfsmount patches before submitting AppArmor/TOMOYO
main module?

We think the patches relate to not only LSM folks but also fsdevel folks.
So we are going to post the brief description of the patches to fsdevel.

Regards,
Kentaro Takeda


Re: Linux 2.6.23.14

2008-01-16 Thread Greg KH

On Wed, Jan 16, 2008 at 03:27:41PM +0100, markus reichelt wrote:
> * Greg Kroah-Hartman <[EMAIL PROTECTED]> wrote:
> 
> > It contains a single fix for a problem that could cause a local
> > user to cause file system corruption on some types of filesystems.
> 
> Some types of filesystems? Which ones? 

Lots of them, but not all :)

2.6.24-rc8-rt1

2008-01-16 Thread Steven Rostedt

We are pleased to announce the 2.6.24-rc8-rt1 tree, which can be
downloaded from the location:

  http://rt.et.redhat.com/download/

Information on the RT patch can be found at:

  http://rt.wiki.kernel.org/index.php/Main_Page

Changes since 2.6.24-rc7-rt3

  - ported to 2.6.24-rc8

  - PPC bootup notrace added for function trace (Luotao Fu)

  - MIPS remove duplicate Kconfig (Frank Rowand)

to build a 2.6.24-rc8-rt1 tree, the following patches should be applied:

  http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.23.tar.bz2
  http://www.kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.24-rc8.bz2
  http://rt.et.redhat.com/download/patch-2.6.24-rc8-rt1.bz2


And like always, my RT version of Matt Mackall's ketchup will get this
for you nicely:

  http://people.redhat.com/srostedt/rt/tools/ketchup-0.9.8-rt3


The broken out patches are also available.

-- Steve




Re: [PATCH 09/13] writeback: requeue_io() on redirtied inode

2008-01-16 Thread Fengguang Wu

On Wed, Jan 16, 2008 at 07:13:07PM +1100, David Chinner wrote:
> On Tue, Jan 15, 2008 at 08:36:46PM +0800, Fengguang Wu wrote:
> > Redirtied inodes could be seen in really fast writes.
> > They should really be synced as soon as possible.
> > 
> > redirty_tail() could delay the inode for up to 30s.
> > Kill the delay by using requeue_io() instead.
> 
> That's actually bad for anything that does delayed allocation
> or updates state on data I/o completion.
> 
> e.g. XFS when writing past EOF doing delalloc dirties the inode
> during writeout (allocation) and then updates the file size on data
> I/o completion hence dirtying the inode again.
> 
> With this change, writing the last pages out would result
> in hitting this code and causing the inode to be flushed very
> soon after the data write. Then, after the inode write is issued,
> we get data I/o completion which dirties the inode again,
> resulting in needing to write the inode again to clean it.
> i.e. it introduces a potential new and useless inode write
> I/O.
> 
> Also, the immediate inode write may be useless for XFS because the
> inode may be pinned in memory due to async transactions
> still in flight (e.g. from delalloc) so we've got two
> situations where flushing the inode immediately is suboptimal.
> 
> Hence I don't think this is an optimisation that should be made
> in the generic writeback code.

Thanks for the explanation.
I can confirm that many requeue_io() happened for the same XFS inode:
[  158.794562] requeue_io 328: inode 5243009 size 34647 at 03:03(hda3)
[  158.794827] mm/page-writeback.c 668 wb_kupdate: pdflush(183) 14209 global 486 10 0 wc _M tw 1013 sk 0
[  158.795293] requeue_io 328: inode 5243009 size 34647 at 03:03(hda3)
[  158.795313] mm/page-writeback.c 668 wb_kupdate: pdflush(183) 14198 global 486 10 0 wc _M tw 1024 sk 0
...
[  170.713900] requeue_io 328: inode 5243009 size 34647 at 03:03(hda3)
[  170.713925] mm/page-writeback.c 668 wb_kupdate: pdflush(183) 14198 global 1875 0 0 wc _M tw 1024 sk 0
[  170.813584] mm/page-writeback.c 668 wb_kupdate: pdflush(183) 14198 global 2855 0 0 wc __ tw 1024 sk 0


Re: [PATCH 2.6.24-rc7 2/2] sysfs: fix bugs in sysfs_rename/move_dir()

2008-01-16 Thread Al Viro

On Wed, Jan 16, 2008 at 04:23:13PM +0900, Tejun Heo wrote:

> The two posted patches are bug fixes for apparent bugs which can be
> triggered by the current two users of the interface.  AFAICS, locking
> there is weird but correct for the current two users.  If you can find
> any problem there, please lemme know.

How about "what happens after that move-to-NULL if you have a cwd inside
the subtree", for starters?

>  We shouldn't hold this type of
> fixes for future clean ups.

No, but I'd rather see the rules for callers of sysfs/kobject primitives
spelled out - before cleanups or review become even possible.
 
> > As it is, I'm more than inclined
> > to propose ripping kobject_move() out, especially since it has only two
> > users - something s390-specific and rfcomm, with its shitloads of problems
> > beyond just sysfs interaction.
> 
> Can you please elaborate?  All sysfs problems discovered by the rfcomm
> are fixed by the posted patches.  Dave Young has a patch waiting for
> verification by the tester.

Umm...  IIRC, there'd been a lot of fun with tty and procfs sides of that;
will check.

> Furthermore, even if we rip out
> kobject_move() in the future, I don't think -rc7 is the right time to do it.

OK...  You do have a point, but at this stage I'm not convinced that this
thing is safe and usable.  I agree that patches do not make things worse,
but I suspect that the real problem with kobject_move() is that it's a
fundamentally broken interface.

Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles

2008-01-16 Thread Mathieu Desnoyers

* Paul Mackerras ([EMAIL PROTECTED]) wrote:
> Mathieu Desnoyers writes:
> 
> > Sorry for self-reply, but I thought, in the past, of a way to make this
> > possible.
> > 
> > It would imply the creation of a new vsyscall : vgetschedperiod
> > 
> > It would read a counter that would increment each time the thread is
> > scheduled out (or in). It would be a per thread counter
> 
> It's very hard to do a per-thread counter in the VDSO, since threads
> in the same process see the same memory, by definition.  You'd have to
> have an array of counters and have some way for each thread to know
> which entry to read.  Also you'd have to find space for tens or
> hundreds of thousands of counters, since there can be that many
> threads in a process sometimes.
> 
> Paul.
> 

Crazy ideas :

Could we do something along the lines of the thread local storage ?

Or could we map a per-thread page that would contradict this
"definition" ?

Or can we move the beginning of the user-space thread stack down by 4
bytes (it's already put at a random address anyway) and use these 32
bits to put our variable ? We don't care if userspace also modifies it;
the kernel would blindly increment it, so there would be no security
concerns involved.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

Re: [PATCH 1/2] Extend sys_clone and sys_unshare system calls API

2008-01-16 Thread Al Viro

On Wed, Jan 16, 2008 at 07:23:40AM -0700, Jonathan Corbet wrote:
> Hi, Pavel,
> 
> [Adding Ulrich]
> 
> > I use the last bit in the clone_flags for CLONE_LONGARG. When set it
> > will denote that the child_tidptr is not a pointer to a tid storage,
> > but the pointer to the struct long_clone_struct which currently 
> > looks like this:
> 
> I'm probably just totally off the deep end, but something did occur to
> me: this looks an awful lot like a special version of the sys_indirect()
> idea.  Unless it has been somehow decided that sys_indirect() is the
> wrong idea, might it not be better to look at making that interface
> solve the extended clone() problem as well?

Nah, just put an XML parser into the kernel to have the form match the
contents...

Al "perhaps we should newgroup alt.tasteless.api for all that stuff" Viro

Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles

2008-01-16 Thread Steven Rostedt


On Thu, 17 Jan 2008, Paul Mackerras wrote:
>
> It's very hard to do a per-thread counter in the VDSO, since threads
> in the same process see the same memory, by definition.  You'd have to
> have an array of counters and have some way for each thread to know
> which entry to read.  Also you'd have to find space for tens or
> hundreds of thousands of counters, since there can be that many
> threads in a process sometimes.

I was thinking about this. What would also work is just the ability to
read the schedule counter for the current cpu. Now this would require that
the task had a way to know which CPU it was currently on.
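
A rough user-space sketch of that idea, assuming a hypothetical read-only array of per-CPU schedule counters (nothing in this thread says such an array exists) and using glibc's sched_getcpu() to find the current CPU:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* 'counters' is a hypothetical per-CPU schedule counter array that the
 * kernel would have to export; only sched_getcpu() is a real interface.
 * Returns the CPU whose counter was read, or -1 on failure. */
static long read_cpu_sched_counter(const volatile unsigned long *counters,
                                   size_t ncpus, unsigned long *value)
{
    assert(counters && value);

    for (;;) {
        int cpu = sched_getcpu();
        unsigned long v;

        if (cpu < 0 || (size_t)cpu >= ncpus)
            return -1;

        v = counters[cpu];

        /* If the task migrated while reading, the value may belong to a
         * CPU it is no longer running on, so try again. */
        if (sched_getcpu() == cpu) {
            *value = v;
            return cpu;
        }
    }
}
```

The recheck narrows the migration race but does not close it: the task can still move between the second sched_getcpu() call and the caller's use of the value, so a caller would compare the (cpu, counter) pair before and after its critical section.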

-- Steve


Re: [patch] VFS: extend /proc/mounts

2008-01-16 Thread Al Viro

On Wed, Jan 16, 2008 at 04:09:30PM -0800, Andrew Morton wrote:
> On Thu, 17 Jan 2008 00:58:06 +0100 (CET) Jan Engelhardt <[EMAIL PROTECTED]> wrote:
> 
> > 
> > On Jan 17 2008 00:43, Karel Zak wrote:
> > >> 
> > >> Seems like a plain bad idea to me.  There will be any number of home-made
> > >> /proc/mounts parsers and we don't know what they do.
> > >
> > > So, let's use /proc/mounts_v2  ;-)
> > 
> > Was not it like "don't use /proc for new things"?
> 
> Well yeah.  If we're going to do a brand new mechanism to expose
> per-mount data then we should hunker down and get it right.

Which automatically means "no sysfs".  We are NOT converting vfsmounts
to kobject-based lifetime rules.

Re: [patch] VFS: extend /proc/mounts

2008-01-16 Thread Al Viro

On Wed, Jan 16, 2008 at 11:12:31PM +0100, Miklos Szeredi wrote:
> The alternative (and completely safe) solution is to add another file
> to proc.  Me no likey.

Since we need saner layout, I would strongly suggest exactly that.

> major:minor -- is the major minor number of the device hosting the filesystem

Bad description.  "Value of st_dev for files on that filesystem", please -
there might be no such thing as "the device hosting the filesystem" _and_
the value here may bloody well be unrelated to device actually holding
all data (for things like ext2meta, etc.).

> 1) The mount is a shared mount.
> 2) It is a peer mount of the mount with id 20
> 3) It is also a slave mount of the master-mount with the id  19
> 4) The filesystem on device with major/minor number 98:0 and subdirectory
>   mnt/1/abc makes the root directory of this mount.
> 5) And finally the mount with id 16 is its parent.

I'd suggest doing a new file that would *not* try to imitate /etc/mtab.
Another thing is, how much of propagation information do we want to
be exposed and what do we intend to do with it?  Note that "entire
propagation tree" is out of question - it spans many namespaces and
contains potentially sensitive information.  So we won't see all nodes.

What do we want to *do* with the information about propagation?

Re: [patch] Converting writeback linked lists to a tree based data structure

2008-01-16 Thread Fengguang Wu

On Wed, Jan 16, 2008 at 10:55:28AM -0800, Michael Rubin wrote:
> On Jan 15, 2008 7:01 PM, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > Basically I think rbtree is an overkill to do time based ordering.
> > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> > provides fair queuing between small/large files, and s_more_io_wait
> > provides waiting mechanism for blocked inodes.
> 
> I think the flush_tree (which is a little more than just an rbtree)
> provides the same queuing mechanisms that the three or four lists
> heads do and manages to do it in one structure. The i_flushed_when
> provides the ability to have blocked inodes wait their turn so to
> speak.
> 
> Another motivation behind the rbtree patch is to unify the data
> structure that handles the priority and mechanism of how we write out
> the pages of the inodes. There are some ideas about introducing
> priority schemes for QOS and such in the future. I am not saying this
> patch is about making that happen, but the idea is to if possible
> unify the four stages of lists into a single structure to facilitate
> efforts like that.

Yeah, rbtree is better than list_heads after all. Let's make it happen.


Re: [PATCH] [3/7] Use shorter addresses in i386 segfault printks

2008-01-16 Thread H. Peter Anvin


Harvey Harrison wrote:
>> Casting to (void *) and using %p is probably your best bet.  That's what
>> it really is anyway.
>>
>> Note: in the kernel right now, %p doesn't have the leading 0x prefix,
>> which it probably should...
>
> Well, that won't exactly be the nicest looking solution in places, maybe
> a shorthand could be developed for this, or could another format
> specifier be added that implicitly does the (void *) cast? (%P perhaps)

Not without losing the ability of gcc to type-check printk arguments.
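
A small user-space sketch of both points, assuming a hypothetical printk-like helper (log_fmt and log_segfault are illustrative names, not kernel functions). The format attribute is what enables gcc's compile-time checking, and the cast to (void *) with %p gives one format string for 32-bit and 64-bit builds:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical printk-like helper. The format attribute tells gcc to
 * check every argument against the conversions in 'fmt'. */
__attribute__((format(printf, 3, 4)))
static int log_fmt(char *buf, size_t len, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(buf && fmt);
    va_start(ap, fmt);
    n = vsnprintf(buf, len, fmt, ap);
    va_end(ap);
    return n;
}

/* Same source on 32-bit and 64-bit: cast the address and print with %p,
 * so no per-architecture %08lx / %lx variants are needed. */
static int log_segfault(char *buf, size_t len, unsigned long address)
{
    return log_fmt(buf, len, "segfault at %p", (void *)address);
}
```

Passing the unsigned long without the cast would draw a -Wformat warning here. gcc's checker only understands the conversions it knows about, which is why a home-grown specifier would fall outside that checking.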

-hpa

Re: 2.6.24-rc7-rt2 [PATCH] latency tracer fix for ppc32

2008-01-16 Thread Steven Rostedt



On Wed, 16 Jan 2008, Luotao Fu wrote:
>
> I found out that the tracer got stuck on ppc32 platforms because some early
> functions call _mcount before mcount_enabled is initialized at all. I made a
> patch, which marks these functions as notrace to solve this problem. With this
> patch I can successfully boot up our mpc5200b platform and make latency trace.
> (tested with -b switch in cyclictest). Please comment.
>
> I made my patch against the -rt2 tree since the dummy call early_printk() in
> -rt3 conflicts with our implementation of a functional early_printk(). It
> should also work with -rt3 though.
>

Thanks, applied.

But for future reference, if you attach your patch please name it with the
ending of .patch and not .diff. Also do it at a -p1 level and not -p0.

-- Steve


Re: [RFC][PATCH 4/5] memory_pressure_notify() caller

2008-01-16 Thread KOSAKI Motohiro

Hi Daniel

> > Thank you for good point out!
> > Could you please post your test program and reproduced method?
> 
> Sure:
> 
> 1. Fill almost all available memory with page cache in a system without swap.
> 2. Run attached alloc-test program.
> 3. Notification fires when page cache is reclaimed.

Unfortunately, I can't reproduce it.

my machine
CPU:Pentium4 2.8GHz with HT
memory: 512M


1. I suspect ZONE_DMA. Please try the patch below, which ignores ZONE_DMA.
2. Could you please send your .config and /etc/sysctl.conf?
   I would like to keep trying to reproduce it.

thanks.

- kosaki




Signed-off-by: KOSAKI Motohiro <[EMAIL PROTECTED]>

---
 include/linux/mem_notify.h |3 +++
 mm/page_alloc.c|6 +-
 2 files changed, 8 insertions(+), 1 deletion(-)

Index: linux-2.6.24-rc6-mm1-memnotify/include/linux/mem_notify.h
===
--- linux-2.6.24-rc6-mm1-memnotify.orig/include/linux/mem_notify.h 2008-01-16 21:31:09.0 +0900
+++ linux-2.6.24-rc6-mm1-memnotify/include/linux/mem_notify.h 2008-01-16 21:34:24.0 +0900
@@ -22,6 +22,9 @@ static inline void memory_pressure_notif
unsigned long target;
unsigned long pages_high, pages_free, pages_reserve;

+   if (unlikely(zone->mem_notify_status == -1))
+   return;
+
if (pressure) {
target = atomic_long_read(_mem_notify) + MEM_NOTIFY_FREQ;
if (likely(time_before(jiffies, target)))
Index: linux-2.6.24-rc6-mm1-memnotify/mm/page_alloc.c
===
--- linux-2.6.24-rc6-mm1-memnotify.orig/mm/page_alloc.c 2008-01-13 19:50:27.0 +0900
+++ linux-2.6.24-rc6-mm1-memnotify/mm/page_alloc.c 2008-01-16 21:41:58.0 +0900
@@ -3467,7 +3467,11 @@ static void __meminit free_area_init_cor
zone->zone_pgdat = pgdat;

zone->prev_priority = DEF_PRIORITY;
-   zone->mem_notify_status = 0;
+
+   if (zone->present_pages < (pgdat->node_present_pages / 10))
+   zone->mem_notify_status = -1;
+   else
+   zone->mem_notify_status = 0;

zone_pcp_init(zone);
INIT_LIST_HEAD(>active_list);
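
To make the threshold in the page_alloc.c hunk concrete, here is the same comparison as a standalone C function. The function name is hypothetical; only the comparison itself comes from the patch.

```c
#include <assert.h>

/* Mirrors the check added to free_area_init_core(): a zone holding less
 * than a tenth of its node's present pages starts with status -1, which
 * makes memory_pressure_notify() return early for that zone. */
static int initial_mem_notify_status(unsigned long zone_present_pages,
                                     unsigned long node_present_pages)
{
    assert(zone_present_pages <= node_present_pages);

    if (zone_present_pages < node_present_pages / 10)
        return -1;
    return 0;
}
```

As a worked example, assuming 4 KiB pages: a 512M machine has 131072 pages, so the cutoff is 13107 pages. A 16 MiB ZONE_DMA is at most 4096 pages and falls below it, while the remaining zone with roughly 126976 pages stays enabled.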




Re: [RFC] mmaped copy too slow?

2008-01-16 Thread KOSAKI Motohiro

Hi

> > One thing you could also try is to pass MAP_POPULATE to mmap so that the 
> > page tables are filled in at the time of the mmap, avoiding a lot of 
> > page faults later.
> > 
> 
> OK, I will test your idea and report about tomorrow.
> but I don't think page fault is major performance impact.

I got a more interesting result :)
MAP_POPULATE is harmful for a large copy.


1G copy
                               elapsed (sec)

mmap                             71.54
mmap + madvise                   69.63
mmap + populate                 100.87
mmap + populate + madvise       101.16


more detail:
time command output of mmap copy
0.50user 3.59system 1:11.54elapsed 5%CPU (0avgtext+0avgdata 0maxresident)k
2101192inputs+2097160outputs (32776major+491573minor)pagefaults 0swaps

time command output of mmap+populate copy
0.53user 5.13system 1:40.87elapsed 5%CPU (0avgtext+0avgdata 0maxresident)k
4200808inputs+2097160outputs (49164major+737340minor)pagefaults 0swaps


The number of input blocks roughly doubles.
In fact, mmap(MAP_POPULATE) reads the file from disk into memory, the pages
are dropped soon after, and so they have to be read again.


Of course, when the file being copied is small enough, MAP_POPULATE is effective.


100M copy
                               elapsed (sec)

mmap                              7.38
mmap + madvise                    7.29
mmap + populate                   7.13
mmap + populate + madvise         6.65
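
For readers who want to try the four variants measured above, here is a minimal sketch of an mmap-based copy. It is not the test program used for these numbers: the function name, the choice of MADV_SEQUENTIAL as the madvise hint, and the MAP_PRIVATE/MAP_SHARED flags are all assumptions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy src to dst through two mappings. 'populate' adds MAP_POPULATE to
 * the source mapping; 'advise' adds an madvise() hint (MADV_SEQUENTIAL is
 * an assumption). Returns 0 on success, -1 on error. */
static int mmap_copy(const char *src, const char *dst, int populate, int advise)
{
    int ret = -1, in = -1, out = -1;
    void *s = MAP_FAILED, *d = MAP_FAILED;
    struct stat st;
    size_t len = 0;

    assert(src && dst);

    in = open(src, O_RDONLY);
    if (in < 0 || fstat(in, &st) < 0)
        goto done;
    out = open(dst, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (out < 0)
        goto done;

    len = (size_t)st.st_size;
    if (len == 0) {             /* mmap() rejects a zero length */
        ret = 0;
        goto done;
    }
    if (ftruncate(out, st.st_size) < 0)
        goto done;

    s = mmap(NULL, len, PROT_READ,
             MAP_PRIVATE | (populate ? MAP_POPULATE : 0), in, 0);
    d = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, out, 0);
    if (s == MAP_FAILED || d == MAP_FAILED)
        goto done;

    if (advise)
        madvise(s, len, MADV_SEQUENTIAL);   /* hint only; errors ignored */

    memcpy(d, s, len);
    ret = 0;
done:
    if (s != MAP_FAILED)
        munmap(s, len);
    if (d != MAP_FAILED)
        munmap(d, len);
    if (in >= 0)
        close(in);
    if (out >= 0)
        close(out);
    return ret;
}
```

Timing calls such as mmap_copy(src, dst, 0, 0) against mmap_copy(src, dst, 1, 0) on files smaller and larger than RAM would be one way to check whether the doubling of input blocks reported above shows up on other machines.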


- kosaki



Re: [PATCH] [3/7] Use shorter addresses in i386 segfault printks

2008-01-16 Thread Harvey Harrison

On Wed, 2008-01-16 at 22:11 -0500, H. Peter Anvin wrote:
> Harvey Harrison wrote:
> > On Wed, 2008-01-16 at 23:27 +0100, Andi Kleen wrote:
> >> Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>
> >>
> >> ---
> >>  arch/x86/mm/fault_32.c |2 +-
> > 
> > Could use exactly the same in fault_64.c
> > 
> >>  #ifdef CONFIG_X86_32
> >> -  "%s%s[%d]: segfault at %08lx ip %08lx sp %08lx error 
> >> %lx\n",
> >> +  "%s%s[%d]: segfault at %lx ip %08lx sp %08lx error 
> >> %lx\n",
> >>  #else
> >>"%s%s[%d]: segfault at %lx ip %lx sp %lx error %lx\n",
> >>  #endif
> > 
> > With the ongoing unification work, it would be nice if we could come
> > up with a way to unify printks like this.  Anyone have any bright ideas
> > on a format that will keep the current alignment on 32 and 64 bit with
> > the same syntax, or will these tiny ifdefs keep sprouting?
> > 
> 
> Casting to (void *) and using %p is probably your best bet.  That's what 
> it really is anyway.
> 
> Note: in the kernel right now, %p doesn't have the leading 0x prefix, 
> which it probably should...

Well, that won't exactly be the nicest-looking solution in places. Maybe
a shorthand could be developed for this, or another format specifier
could be added that implicitly does the (void *) cast (%P perhaps)?
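
Another option for keeping the zero-padded alignment on both word sizes with a
single format string is to pass the field width as an argument. This is a
userspace sketch only: format_addr() is a hypothetical helper, and whether the
"*" width form is acceptable in the printk in question is an assumption.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: print an address as zero-padded hex, using as many
 * digits as the native unsigned long needs (8 on 32-bit, 16 on 64-bit).
 * Returns what snprintf() returns.
 */
static int format_addr(char *buf, size_t len, unsigned long addr)
{
	/* "%0*lx": the width is taken from the int argument before addr. */
	return snprintf(buf, len, "%0*lx",
			(int)(2 * sizeof(unsigned long)), addr);
}
```

The same source line then yields %08lx-style output on 32-bit and %016lx-style
output on 64-bit, without an #ifdef around the format string.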

Harvey


Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles

2008-01-16 Thread Paul Mackerras

Mathieu Desnoyers writes:

> Sorry for self-reply, but I thought, in the past, of a way to make this
> possible.
> 
> It would imply the creation of a new vsyscall : vgetschedperiod
> 
> It would read a counter that would increment each time the thread is
> scheduled out (or in). It would be a per thread counter

It's very hard to do a per-thread counter in the VDSO, since threads
in the same process see the same memory, by definition.  You'd have to
have an array of counters and have some way for each thread to know
which entry to read.  Also you'd have to find space for tens or
hundreds of thousands of counters, since there can be that many
threads in a process sometimes.

Paul.

Re: [PATCH] [3/7] Use shorter addresses in i386 segfault printks

2008-01-16 Thread H. Peter Anvin


Harvey Harrison wrote:
> On Wed, 2008-01-16 at 23:27 +0100, Andi Kleen wrote:
>> Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>
>>
>> ---
>>  arch/x86/mm/fault_32.c |2 +-
>
> Could use exactly the same in fault_64.c
>
>>  #ifdef CONFIG_X86_32
>> -   "%s%s[%d]: segfault at %08lx ip %08lx sp %08lx error %lx\n",
>> +   "%s%s[%d]: segfault at %lx ip %08lx sp %08lx error %lx\n",
>>  #else
>>     "%s%s[%d]: segfault at %lx ip %lx sp %lx error %lx\n",
>>  #endif
>
> With the ongoing unification work, it would be nice if we could come
> up with a way to unify printks like this.  Anyone have any bright ideas
> on a format that will keep the current alignment on 32 and 64 bit with
> the same syntax, or will these tiny ifdefs keep sprouting?
>

Casting to (void *) and using %p is probably your best bet.  That's what 
it really is anyway.

Note: in the kernel right now, %p doesn't have the leading 0x prefix, 
which it probably should...


-hpa

Re: [patch] Converting writeback linked lists to a tree based data structure

2008-01-16 Thread Fengguang Wu

On Thu, Jan 17, 2008 at 09:35:10AM +1100, David Chinner wrote:
> On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
> > On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> > > > Then to do better ordering by adopting radix tree (or rbtree
> > > > if radix tree is not enough),
> > > 
> > > ordering of what?
> > 
> > Switch from time to location.
> 
> Note that data writeback may be adversely affected by location
> based writeback rather than time based writeback - think of
> the effect of location based data writeback on an app that
> creates lots of short term (<30s) temp files and then removes
> them before they are written back.

A small (e.g. 5s) time window can still be enforced, but...

> Also, data writeback location cannot be easily derived from
> the inode number in pretty much all cases. "near" in terms
> of XFS means the same AG which means the data could be up to
> a TB away from the inode, and if you have >1TB filesystems
> using the default inode32 allocator, file data is *never*
> placed near the inode - the inodes are in the first TB of
> the filesystem, the data is rotored around the rest of the
> filesystem.
> 
> And with delayed allocation, you don't know where the data is even
> going to be written ahead of the filesystem ->writepage call, so you
> can't do optimal location ordering for data in this case.

Agreed.

> Hmm - I'm wondering if we'd do better to split data writeback from
> inode writeback. i.e. we do two passes.  The first pass writes all
> the data back in time order, the second pass writes all the inodes
> back in location order.
> 
> Right now we interleave data and inode writeback, (i.e.  we do data,
> inode, data, inode, data, inode, ...). I'd much prefer to see all
> data written out first, then the inodes. ->writepage often dirties
> the inode and hence if we need to do multiple do_writepages() calls
> on an inode to flush all the data (e.g. congestion, large amounts of
> data to be written, etc), we really shouldn't be calling
> write_inode() after every do_writepages() call. The inode
> should not be written until all the data is written

That may be good for XFS. Another case is documented as follows:
"the write_inode() function of a typical fs will perform no I/O, but
will mark buffers in the blockdev mapping as dirty."


Re: Is try_module_get buggy?

2008-01-16 Thread Rusty Russell

On Saturday 12 January 2008 15:35:27 rmingming wrote:
> Hi,
>  I have a problem about the try_module_get function, I don't know if
> someone removed the module just AFTER line 372, then what happens? Because
> in this situation, the variable module will be incorrect, and
> module_is_live function will lead to unpredictable behaviour.
>
> 368 static inline int try_module_get(struct module *module)
> 369 {
> 370 int ret = 1;
> 371
> 372 if (module) {
> 373 unsigned int cpu = get_cpu();
> 374 if (likely(module_is_live(module)))
> 375 local_inc(&module->ref[cpu].count);
> 376 else
> 377 ret = 0;
> 378 put_cpu();
> 379 }
> 380 return ret;
> 381 }

Hi rmingming,

try_module_get is designed to ensure that you don't call a function inside a 
module without a reference.  Like any reference function, it cannot handle 
the case where the argument is invalid (or invalidated partway through the 
call).

In this case, the module pointer is usually inside a registered structure.  
The pointer will be valid until the structure is unregistered, which the 
calling code presumably prevents while it's doing a lookup.
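
As an illustration of that pattern, here is a userspace sketch in which the
owner pointer is only dereferenced while a registry lock keeps the structure
from being unregistered. None of these names exist in the kernel: struct
owner, struct entry, registry_lock and lookup_and_get() are hypothetical
stand-ins, and a pthread mutex stands in for whatever the calling code uses.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct owner {			/* stand-in for struct module */
	bool live;
	int refs;
};

struct entry {			/* stand-in for a registered structure */
	const char *name;
	struct owner *owner;
};

#define MAX_ENTRIES 8
static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static struct entry *registry[MAX_ENTRIES];

/* Mirrors the quoted logic: a NULL owner always succeeds. */
static bool try_get(struct owner *o)
{
	if (!o)
		return true;
	if (!o->live)
		return false;
	o->refs++;		/* only touched under registry_lock in this sketch */
	return true;
}

static int register_entry(struct entry *e)
{
	int i, ret = -1;

	pthread_mutex_lock(&registry_lock);
	for (i = 0; i < MAX_ENTRIES; i++) {
		if (!registry[i]) {
			registry[i] = e;
			ret = 0;
			break;
		}
	}
	pthread_mutex_unlock(&registry_lock);
	return ret;
}

static void unregister_entry(struct entry *e)
{
	int i;

	pthread_mutex_lock(&registry_lock);
	for (i = 0; i < MAX_ENTRIES; i++)
		if (registry[i] == e)
			registry[i] = NULL;
	pthread_mutex_unlock(&registry_lock);
}

/*
 * The owner pointer is read and passed to try_get() while the lock is held,
 * so unregister_entry() cannot run in between.
 */
static struct entry *lookup_and_get(const char *name)
{
	struct entry *found = NULL;
	int i;

	pthread_mutex_lock(&registry_lock);
	for (i = 0; i < MAX_ENTRIES; i++) {
		if (registry[i] && strcmp(registry[i]->name, name) == 0) {
			if (try_get(registry[i]->owner))
				found = registry[i];
			break;
		}
	}
	pthread_mutex_unlock(&registry_lock);
	return found;
}
```

The sketch relies on the owner unregistering its entries before it goes away,
which is the assumption described above.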

Hope that clarifies,
Rusty.

Re: [RFC][PATCH 3/5] add /dev/mem_notify device

2008-01-16 Thread KOSAKI Motohiro

Hi

> > I'd read mem_notify as "tell me when new memory is unplugged" or
> > something. /dev/oom_notify? Plus, /dev/ names usually do not have "_"
> > in them.
> 
> I don't think we should use oom in the name, since the notification is
> sent long before oom.

OK, I won't change the name.
Of course, I will change it soon if anyone proposes a better name.

thanks

- kosaki



Re: [patch] VFS: extend /proc/mounts

2008-01-16 Thread H. Peter Anvin


Andrew Morton wrote:
> Seems like a plain bad idea to me.  There will be any number of home-made
> /proc/mounts parsers and we don't know what they do.

There is a lot of precedent for adding fields at the end.  Since the 
last fields in current /proc/*/mounts are dummy fields anyway, it 
doesn't matter if the homegrown parsers concatenate the additional 
information to those.


-hpa

Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles

2008-01-16 Thread Steven Rostedt


On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:

> It would imply the creation of a new vsyscall : vgetschedperiod
>
> It would read a counter that would increment each time the thread is
> scheduled out (or in). It would be a per thread counter (not a per cpu
> counter) so we can deal appropriately with a stopped thread that would
> happen to come back running a long time afterward (if we do per-cpu
> counters, we could get the same 32 bits counter value falsely if it is
> shared with other thread activity).
>
> Then, the clocksource read code would look like :
>
> int period;
>
> do {
>   period = vgetschedperiod();
>
>   perform the clocksource read..
>
> } while (period != vgetschedperiod());
>
> Therefore, we would be sure that we have not been scheduled out while
> reading the value. I think this new vsyscall could be useful for others.
> Actually, it would make implementation of RCU in user-space possible (as
> long as the read-side can retry the read operation).


This is something that I would agree is useful.

-- Steve


Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles

2008-01-16 Thread Mathieu Desnoyers

* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
> * john stultz ([EMAIL PROTECTED]) wrote:
> > 
> > On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote:
> > > Thanks John for doing this!
> > > 
> > > (comments embedded)
> > > 
> > > On Wed, 16 Jan 2008, john stultz wrote:
> > > > +   int num = !cs->base_num;
> > > > +   cycle_t offset = (now - cs->base[!num].cycle_base_last);
> > > > +   offset &= cs->mask;
> > > > +   cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
> > > > +   cs->base[num].cycle_base_last = now;
> > > 
> > > I would think that we would need some sort of barrier here. Otherwise,
> > > base_num could be updated before all the cycle_base. I'd expect a smp_wmb
> > > is needed.
> > 
> > Hopefully addressed in the current version.
> > 
> > 
> > > > Index: monotonic-cleanup/kernel/time/timekeeping.c
> > > > ===
> > > > --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 12:21:46.0 -0800
> > > > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 14:15:31.0 -0800
> > > > @@ -71,10 +71,12 @@
> > > >   */
> > > >  static inline s64 __get_nsec_offset(void)
> > > >  {
> > > > -   cycle_t cycle_delta;
> > > > +   cycle_t now, cycle_delta;
> > > > s64 ns_offset;
> > > >
> > > > -   cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
> > > > +   now = clocksource_read(clock);
> > > > +   cycle_delta = (now - clock->cycle_last) & clock->mask;
> > > > +   cycle_delta += clock->cycle_accumulated;
> > > 
> > > Is the above just to decouple the two methods?
> > 
> > Yep. clocksource_get_cycles() ended up not being as useful as an helper
> > function (I was hoping the arch vsyscall implementations could use it,
> > but they've done too much optimization - although that may reflect a
> > need up the chain to the clocksource structure).
> > 
> 
> The problem with vsyscall is that we will have a hard time disabling
> preemption :( Therefore, ensuring that the read of the data is done in a
> timely manner is hard to do.
> 

Sorry for self-reply, but I thought, in the past, of a way to make this
possible.

It would imply the creation of a new vsyscall : vgetschedperiod

It would read a counter that would increment each time the thread is
scheduled out (or in). It would be a per thread counter (not a per cpu
counter) so we can deal appropriately with a stopped thread that would
happen to come back running a long time afterward (if we do per-cpu
counters, we could get the same 32 bits counter value falsely if it is
shared with other thread activity).

Then, the clocksource read code would look like :

int period;

do {
  period = vgetschedperiod();

  perform the clocksource read..

} while (period != vgetschedperiod());

Therefore, we would be sure that we have not been scheduled out while
reading the value. I think this new vsyscall could be useful for others.
Actually, it would make implementation of RCU in user-space possible (as
long as the read-side can retry the read operation).
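
To make the retry idea concrete, here is a small userspace simulation. Nothing
in it is an existing interface: vgetschedperiod() is only proposed above, so
the sketch replaces it with an ordinary atomic counter, and the fake hardware
read bumps that counter to play the role of being scheduled out in the middle
of a read.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Stand-in for the proposed per-thread counter. In the proposal the kernel
 * would increment it on every schedule-out (or in); here the test harness
 * does it.
 */
static _Atomic unsigned int sim_sched_period;

static unsigned int sim_vgetschedperiod(void)
{
	return atomic_load_explicit(&sim_sched_period, memory_order_acquire);
}

/* Hypothetical clocksource snapshot that the reader combines. */
struct sim_clock {
	uint64_t base;		/* accumulated cycles */
	uint64_t last;		/* hardware value at last accumulation */
	uint64_t mask;		/* hardware counter width */
};

static unsigned int sim_preempt_budget;	/* simulated preemptions left */
static unsigned int sim_hw_reads;	/* how often the hardware was read */

/* Fake hardware read that "gets preempted" while the budget lasts. */
static uint64_t sim_hw_read(void)
{
	sim_hw_reads++;
	if (sim_preempt_budget > 0) {
		sim_preempt_budget--;
		atomic_fetch_add_explicit(&sim_sched_period, 1,
					  memory_order_release);
	}
	return 1000 + sim_hw_reads;
}

/* The loop from the mail: retry until no scheduling event was observed. */
static uint64_t sim_read_cycles(const struct sim_clock *cs)
{
	unsigned int period;
	uint64_t now, result;

	do {
		period = sim_vgetschedperiod();
		now = sim_hw_read();
		result = cs->base + ((now - cs->last) & cs->mask);
	} while (period != sim_vgetschedperiod());

	return result;
}
```

The read is simply thrown away and repeated whenever the counter moved, so the
read side never has to block or disable anything.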

Mathieu

> 
> > thanks
> > -john
> > 
> > 
> 
> -- 
> Mathieu Desnoyers
> Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

Re: [PATCH] [3/7] Use shorter addresses in i386 segfault printks

2008-01-16 Thread Harvey Harrison

On Wed, 2008-01-16 at 23:27 +0100, Andi Kleen wrote:
> Signed-off-by: Andi Kleen <[EMAIL PROTECTED]>
> 
> ---
>  arch/x86/mm/fault_32.c |2 +-

Could use exactly the same in fault_64.c

>  #ifdef CONFIG_X86_32
> - "%s%s[%d]: segfault at %08lx ip %08lx sp %08lx error %lx\n",
> + "%s%s[%d]: segfault at %lx ip %08lx sp %08lx error %lx\n",
>  #else
>   "%s%s[%d]: segfault at %lx ip %lx sp %lx error %lx\n",
>  #endif

With the ongoing unification work, it would be nice if we could come
up with a way to unify printks like this.  Anyone have any bright ideas
on a format that will keep the current alignment on 32 and 64 bit with
the same syntax, or will these tiny ifdefs keep sprouting?

Harvey


Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles

2008-01-16 Thread Steven Rostedt


On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> >
> > Yep. clocksource_get_cycles() ended up not being as useful as an helper
> > function (I was hoping the arch vsyscall implementations could use it,
> > but they've done too much optimization - although that may reflect a
> > need up the chain to the clocksource structure).
> >
>
> The problem with vsyscall is that we will have a hard time disabling
> preemption :( Therefore, ensuring that the read of the data is done in a
> timely manner is hard to do.

You'll have more than a hard time disabling preemption for vsyscall. We'll
need to come up with a better solution then. vsyscall can not modify any
kernel memory, nor can it disable preemption.

-- Steve


Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles

2008-01-16 Thread Mathieu Desnoyers

* john stultz ([EMAIL PROTECTED]) wrote:
> 
> On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote:
> > Thanks John for doing this!
> > 
> > (comments embedded)
> > 
> > On Wed, 16 Jan 2008, john stultz wrote:
> > > + int num = !cs->base_num;
> > > + cycle_t offset = (now - cs->base[!num].cycle_base_last);
> > > + offset &= cs->mask;
> > > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
> > > + cs->base[num].cycle_base_last = now;
> > 
> > I would think that we would need some sort of barrier here. Otherwise,
> > base_num could be updated before all the cycle_base. I'd expect a smp_wmb
> > is needed.
> 
> Hopefully addressed in the current version.
> 
> 
> > > Index: monotonic-cleanup/kernel/time/timekeeping.c
> > > ===
> > > --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 12:21:46.0 -0800
> > > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 14:15:31.0 -0800
> > > @@ -71,10 +71,12 @@
> > >   */
> > >  static inline s64 __get_nsec_offset(void)
> > >  {
> > > - cycle_t cycle_delta;
> > > + cycle_t now, cycle_delta;
> > >   s64 ns_offset;
> > >
> > > - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
> > > + now = clocksource_read(clock);
> > > + cycle_delta = (now - clock->cycle_last) & clock->mask;
> > > + cycle_delta += clock->cycle_accumulated;
> > 
> > Is the above just to decouple the two methods?
> 
> Yep. clocksource_get_cycles() ended up not being as useful as an helper
> function (I was hoping the arch vsyscall implementations could use it,
> but they've done too much optimization - although that may reflect a
> need up the chain to the clocksource structure).
> 

The problem with vsyscall is that we will have a hard time disabling
preemption :( Therefore, ensuring that the read of the data is done in a
timely manner is hard to do.

Mathieu

> thanks
> -john
> 
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles

2008-01-16 Thread Mathieu Desnoyers

* john stultz ([EMAIL PROTECTED]) wrote:
> On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote:
> > I would disable preemption in clocksource_get_basecycles. We would not
> > want to be scheduled out while we hold a pointer to the old array
> > element.
> > 
> > > + int num = cs->base_num;
> > 
> > Since you deal with base_num in a shared manner (not per cpu), you will
> > need a smp_read_barrier_depends() here after the cs->base_num read.
> > 
> > You should think about reading the cs->base_num first, and _after_ that
> > read the real clocksource. Here, the clocksource value is passed as
> > parameter. It means that the read clocksource may have been read in the
> > previous RCU window.
> 
> Here's an updated version of the patch w/ the suggested memory barrier
> changes and favored (1-x) inversion change. ;)  Let me know if you see
> any other holes, or have any other suggestions or ideas.
> 
> Still un-tested (my test box will free up soon, I promise!), but builds.
> 
> Signed-off-by: John Stultz <[EMAIL PROTECTED]>
> 
> Index: monotonic-cleanup/include/linux/clocksource.h
> ===
> --- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 12:22:04.0 -0800
> +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 18:12:53.0 -0800
> @@ -87,9 +87,17 @@
>* more than one cache line.
>*/
>   struct {
> - cycle_t cycle_last, cycle_accumulated, cycle_raw;
> - } cacheline_aligned_in_smp;
> + cycle_t cycle_last, cycle_accumulated;
>  
> + /* base structure provides lock-free read
> +  * access to a virtualized 64bit counter
> +  * Uses RCU-like update.
> +  */
> + struct {
> + cycle_t cycle_base_last, cycle_base;
> + } base[2];
> + int base_num;
> + } cacheline_aligned_in_smp;
>   u64 xtime_nsec;
>   s64 error;
>  
> @@ -175,19 +183,29 @@
>  }
>  
>  /**
> - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
> + * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
>   * @cs:  pointer to clocksource being read
>   * @now: current cycle value
>   *
>   * Uses the clocksource to return the current cycle_t value.
>   * NOTE!!!: This is different from clocksource_read, because it
> - * returns the accumulated cycle value! Must hold xtime lock!
> + * returns a 64bit wide accumulated value.
>   */
>  static inline cycle_t
> -clocksource_get_cycles(struct clocksource *cs, cycle_t now)
> +clocksource_get_basecycles(struct clocksource *cs)
>  {
> - cycle_t offset = (now - cs->cycle_last) & cs->mask;
> - offset += cs->cycle_accumulated;
> + int num;
> + cycle_t now, offset;
> +
> + preempt_disable();
> + num = cs->base_num;
> + smp_read_barrier_depends();
> + now = clocksource_read(cs);
> + offset = (now - cs->base[num].cycle_base_last);
> + offset &= cs->mask;
> + offset += cs->base[num].cycle_base;
> + preempt_enable();
> +
>   return offset;
>  }
>  
> @@ -197,14 +215,26 @@
>   * @now: current cycle value
>   *
>   * Used to avoids clocksource hardware overflow by periodically
> - * accumulating the current cycle delta. Must hold xtime write lock!
> + * accumulating the current cycle delta. Uses RCU-like update, but
> + * ***still requires the xtime_lock is held for writing!***
>   */
>  static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
>  {
> - cycle_t offset = (now - cs->cycle_last) & cs->mask;
> + /* First update the monotonic base portion.
> +  * The dual array update method allows for lock-free reading.
> +  */
> + int num = 1 - cs->base_num;

(nitpick)
right here, you could probably express 1-num with cs->base_num, since we
are the only ones supposed to touch it.

> + cycle_t offset = (now - cs->base[1-num].cycle_base_last);
> + offset &= cs->mask;

here too.

> + cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset;
> + cs->base[num].cycle_base_last = now;
> + wmb();

As I just emailed: smp_wmb() *should* be enough. I don't see which
architecture could reorder writes wrt local interrupts? (please tell me
if I am grossly mistaken)

Mathieu

> + cs->base_num = num;
> +
> + /* Now update the cycle_accumulated portion */
> + offset = (now - cs->cycle_last) & cs->mask;
>   cs->cycle_last = now;
>   cs->cycle_accumulated += offset;
> - cs->cycle_raw += offset;
>  }
>  
>  /**
> Index: monotonic-cleanup/kernel/time/timekeeping.c
> ===
> --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 12:21:46.0 -0800
> +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 17:51:50.0 -0800
> @@ -71,10 +71,12 @@
>
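
The dual-array update under review can be modelled in userspace with C11
atomics, which makes the ordering requirement easier to see. This is a sketch
under assumptions, not the patch: the sketch_* names and the fake hardware
counter are hypothetical, a release store stands in for the write barrier, and
an acquire load stands in for smp_read_barrier_depends(). It does not model
preempt_disable(), so a reader stalled across two updates is not covered.

```c
#include <stdatomic.h>
#include <stdint.h>

struct sketch_base {
	_Atomic uint64_t cycle_base_last;	/* hardware value at last update */
	_Atomic uint64_t cycle_base;		/* accumulated 64-bit value */
};

struct sketch_clock {
	uint64_t mask;				/* hardware counter width */
	struct sketch_base base[2];
	_Atomic int base_num;			/* index readers should use */
};

/* Fake hardware counter so the sketch can run without a real clocksource. */
static uint64_t sketch_hw_counter;

static uint64_t sketch_read_hw(void)
{
	return sketch_hw_counter;
}

/* Writer. Callers must serialize, as the patch does with xtime_lock. */
static void sketch_accumulate(struct sketch_clock *cs, uint64_t now)
{
	int cur = atomic_load_explicit(&cs->base_num, memory_order_relaxed);
	int num = 1 - cur;		/* slot readers are NOT using */
	uint64_t last = atomic_load_explicit(&cs->base[cur].cycle_base_last,
					     memory_order_relaxed);
	uint64_t base = atomic_load_explicit(&cs->base[cur].cycle_base,
					     memory_order_relaxed);
	uint64_t offset = (now - last) & cs->mask;

	atomic_store_explicit(&cs->base[num].cycle_base, base + offset,
			      memory_order_relaxed);
	atomic_store_explicit(&cs->base[num].cycle_base_last, now,
			      memory_order_relaxed);

	/* Publish: the slot contents must be visible before the index flips. */
	atomic_store_explicit(&cs->base_num, num, memory_order_release);
}

/* Reader: pick the slot first, read the hardware only afterwards. */
static uint64_t sketch_get_basecycles(struct sketch_clock *cs)
{
	int num = atomic_load_explicit(&cs->base_num, memory_order_acquire);
	uint64_t now = sketch_read_hw();
	uint64_t last = atomic_load_explicit(&cs->base[num].cycle_base_last,
					     memory_order_relaxed);
	uint64_t offset = (now - last) & cs->mask;

	return offset + atomic_load_explicit(&cs->base[num].cycle_base,
					     memory_order_relaxed);
}
```

Reading the hardware after base_num is loaded follows the review comment above
about not using a counter value taken in a previous update window.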

[PATCH 5 of 8] x86/paravirt: common implementation for pmd value ops

2008-01-16 Thread Jeremy Fitzhardinge

Remove duplicate __pmd/pmd_val functions.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
---
 include/asm-x86/paravirt.h |   33 ++---
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -978,18 +978,37 @@ static inline pgdval_t pgd_val(pgd_t pgd
return ret;
 }
 
-#ifdef CONFIG_X86_PAE
-static inline pmd_t __pmd(unsigned long long val)
+#if PAGETABLE_LEVELS >= 3
+static inline pmd_t __pmd(pmdval_t val)
 {
-   return (pmd_t) { PVOP_CALL2(unsigned long long, pv_mmu_ops.make_pmd,
-   val, val >> 32) };
+   pmdval_t ret;
+
+   if (sizeof(pmdval_t) > sizeof(long))
+   ret = PVOP_CALL2(pmdval_t, pv_mmu_ops.make_pmd,
+val, (u64)val >> 32);
+   else
+   ret = PVOP_CALL1(pmdval_t, pv_mmu_ops.make_pmd,
+val);
+
+   return (pmd_t) { ret };
 }
 
-static inline unsigned long long pmd_val(pmd_t x)
+static inline pmdval_t pmd_val(pmd_t pmd)
 {
-   return PVOP_CALL2(unsigned long long, pv_mmu_ops.pmd_val,
- x.pmd, x.pmd >> 32);
+   pmdval_t ret;
+
+   if (sizeof(pmdval_t) > sizeof(long))
+   ret =  PVOP_CALL2(pmdval_t, pv_mmu_ops.pmd_val,
+ pmd.pmd, (u64)pmd.pmd >> 32);
+   else
+   ret =  PVOP_CALL1(pmdval_t, pv_mmu_ops.pmd_val,
+ pmd.pmd);
+
+   return ret;
 }
+#endif /* PAGETABLE_LEVELS >= 3 */
+
+#ifdef CONFIG_X86_PAE
 
 static inline void set_pte(pte_t *ptep, pte_t pteval)
 {
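
The sizeof() test used in the new __pmd()/pmd_val() is a compile-time dispatch:
when the value type is wider than a register-sized long, it is passed as two
words, otherwise as one. A standalone sketch of the same idea follows. The
names entry_val_t, backend_call1/2 and make_entry are hypothetical and merely
stand in for pmdval_t, the PVOP_CALL1/PVOP_CALL2 macros and __pmd.

```c
#include <stdint.h>

typedef uint64_t entry_val_t;	/* stand-in for pmdval_t under PAE or 64-bit */

/* Recorded by the fake backends so the chosen path can be inspected. */
static unsigned long last_lo, last_hi;
static int last_nargs;

static void backend_call1(unsigned long val)
{
	last_nargs = 1;
	last_lo = val;
	last_hi = 0;
}

static void backend_call2(unsigned long lo, unsigned long hi)
{
	last_nargs = 2;
	last_lo = lo;
	last_hi = hi;
}

static void make_entry(entry_val_t val)
{
	/*
	 * sizeof() is a constant expression, so the compiler can discard the
	 * branch that does not apply to the build at hand.
	 */
	if (sizeof(entry_val_t) > sizeof(long))
		backend_call2((unsigned long)val,
			      (unsigned long)((uint64_t)val >> 32));
	else
		backend_call1((unsigned long)val);
}
```

The (uint64_t) cast before the shift mirrors the (u64) cast in the patch: it
keeps "val >> 32" well defined even in builds where the value type is 32 bits
wide and that branch is dead code.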



[PATCH 2 of 8] x86/paravirt: rearrange common mmu_ops

2008-01-16 Thread Jeremy Fitzhardinge

Rearrange the various pagetable mmu_ops to remove duplication.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
---
 include/asm-x86/paravirt.h |   30 +-
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -230,28 +230,32 @@ struct pv_mmu_ops {
void (*pte_update_defer)(struct mm_struct *mm,
 unsigned long addr, pte_t *ptep);
 
+   pteval_t (*pte_val)(pte_t);
+   pte_t (*make_pte)(pteval_t pte);
+
+   pgdval_t (*pgd_val)(pgd_t);
+   pgd_t (*make_pgd)(pgdval_t pgd);
+
+#if PAGETABLE_LEVELS >= 3
 #ifdef CONFIG_X86_PAE
void (*set_pte_atomic)(pte_t *ptep, pte_t pteval);
void (*set_pte_present)(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte);
-   void (*set_pud)(pud_t *pudp, pud_t pudval);
void (*pte_clear)(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
void (*pmd_clear)(pmd_t *pmdp);
 
-   unsigned long long (*pte_val)(pte_t);
-   unsigned long long (*pmd_val)(pmd_t);
-   unsigned long long (*pgd_val)(pgd_t);
+#endif /* CONFIG_X86_PAE */
 
-   pte_t (*make_pte)(unsigned long long pte);
-   pmd_t (*make_pmd)(unsigned long long pmd);
-   pgd_t (*make_pgd)(unsigned long long pgd);
-#else
-   unsigned long (*pte_val)(pte_t);
-   unsigned long (*pgd_val)(pgd_t);
+   void (*set_pud)(pud_t *pudp, pud_t pudval);
 
-   pte_t (*make_pte)(unsigned long pte);
-   pgd_t (*make_pgd)(unsigned long pgd);
-#endif
+   pmdval_t (*pmd_val)(pmd_t);
+   pmd_t (*make_pmd)(pmdval_t pmd);
+
+#if PAGETABLE_LEVELS == 4
+   pudval_t (*pud_val)(pud_t);
+   pud_t (*make_pud)(pudval_t pud);
+#endif /* PAGETABLE_LEVELS == 4 */
+#endif /* PAGETABLE_LEVELS >= 3 */
 
 #ifdef CONFIG_HIGHPTE
void *(*kmap_atomic_pte)(struct page *page, enum km_type type);



[PATCH 6 of 8] x86/paravirt: make set_pte operations common

2008-01-16 Thread Jeremy Fitzhardinge

Remove duplicate set_pte* operations.  PAE still needs to have special
variants of some of these because it can't atomically update a 64-bit
pte, so there's still some duplication.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
---
 include/asm-x86/paravirt.h |  116 ++--
 1 file changed, 60 insertions(+), 56 deletions(-)

diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -978,6 +978,66 @@ static inline pgdval_t pgd_val(pgd_t pgd
return ret;
 }
 
+static inline void set_pte(pte_t *ptep, pte_t pte)
+{
+   if (sizeof(pteval_t) > sizeof(long))
+   PVOP_VCALL3(pv_mmu_ops.set_pte, ptep,
+   pte.pte, (u64)pte.pte >> 32);
+   else
+   PVOP_VCALL2(pv_mmu_ops.set_pte, ptep,
+   pte.pte);
+}
+
+static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte)
+{
+   if (sizeof(pteval_t) > sizeof(long))
+   /* 5 arg words */
+   pv_mmu_ops.set_pte_at(mm, addr, ptep, pte);
+   else
+   PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
+}
+
+#ifdef CONFIG_X86_PAE
+/* Special-case pte-setting operations for PAE, which can't update a
+   64-bit pte atomically */
+static inline void set_pte_atomic(pte_t *ptep, pte_t pte)
+{
+   PVOP_VCALL3(pv_mmu_ops.set_pte_atomic, ptep,
+   pte.pte, pte.pte >> 32);
+}
+
+static inline void set_pte_present(struct mm_struct *mm, unsigned long addr,
+  pte_t *ptep, pte_t pte)
+{
+   /* 5 arg words */
+   pv_mmu_ops.set_pte_present(mm, addr, ptep, pte);
+}
+
+static inline void pte_clear(struct mm_struct *mm, unsigned long addr,
+pte_t *ptep)
+{
+   PVOP_VCALL3(pv_mmu_ops.pte_clear, mm, addr, ptep);
+}
+#else  /* !CONFIG_X86_PAE */
+static inline void set_pte_atomic(pte_t *ptep, pte_t pte)
+{
+   set_pte(ptep, pte);
+}
+
+static inline void set_pte_present(struct mm_struct *mm, unsigned long addr,
+  pte_t *ptep, pte_t pte)
+{
+   set_pte(ptep, pte);
+}
+
+static inline void pte_clear(struct mm_struct *mm, unsigned long addr,
+pte_t *ptep)
+{
+   set_pte_at(mm, addr, ptep, __pte(0));
+}
+#endif /* CONFIG_X86_PAE */
+
 #if PAGETABLE_LEVELS >= 3
 static inline pmd_t __pmd(pmdval_t val)
 {
@@ -1010,31 +1070,6 @@ static inline pmdval_t pmd_val(pmd_t pmd
 
 #ifdef CONFIG_X86_PAE
 
-static inline void set_pte(pte_t *ptep, pte_t pteval)
-{
-   PVOP_VCALL3(pv_mmu_ops.set_pte, ptep, pteval.pte_low, pteval.pte_high);
-}
-
-static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pteval)
-{
-   /* 5 arg words */
-   pv_mmu_ops.set_pte_at(mm, addr, ptep, pteval);
-}
-
-static inline void set_pte_atomic(pte_t *ptep, pte_t pteval)
-{
-   PVOP_VCALL3(pv_mmu_ops.set_pte_atomic, ptep,
-   pteval.pte_low, pteval.pte_high);
-}
-
-static inline void set_pte_present(struct mm_struct *mm, unsigned long addr,
-  pte_t *ptep, pte_t pte)
-{
-   /* 5 arg words */
-   pv_mmu_ops.set_pte_present(mm, addr, ptep, pte);
-}
-
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmdval)
 {
PVOP_VCALL3(pv_mmu_ops.set_pmd, pmdp,
@@ -1047,28 +1082,12 @@ static inline void set_pud(pud_t *pudp, 
pudval.pgd.pgd, pudval.pgd.pgd >> 32);
 }
 
-static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
-{
-   PVOP_VCALL3(pv_mmu_ops.pte_clear, mm, addr, ptep);
-}
-
 static inline void pmd_clear(pmd_t *pmdp)
 {
PVOP_VCALL1(pv_mmu_ops.pmd_clear, pmdp);
 }
 
 #else  /* !CONFIG_X86_PAE */
-
-static inline void set_pte(pte_t *ptep, pte_t pteval)
-{
-   PVOP_VCALL2(pv_mmu_ops.set_pte, ptep, pteval.pte_low);
-}
-
-static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pteval)
-{
-   PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pteval.pte_low);
-}
 
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmdval)
 {
@@ -1080,21 +1099,6 @@ static inline void pmd_clear(pmd_t *pmdp
set_pmd(pmdp, __pmd(0));
 }
 
-static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
-{
-   set_pte_at(mm, addr, ptep, __pte(0));
-}
-
-static inline void set_pte_atomic(pte_t *ptep, pte_t pte)
-{
-   set_pte(ptep, pte);
-}
-
-static inline void set_pte_present(struct mm_struct *mm, unsigned long addr,
-  pte_t *ptep, pte_t pte)
-{
-   set_pte(ptep, pte);
-}
 #endif /* CONFIG_X86_PAE */
 
 /* Lazy mode for batching updates / context switch */



[PATCH 4 of 8] x86/paravirt: common implementation for pgd value ops

2008-01-16 Thread Jeremy Fitzhardinge

Remove duplicate __pgd/pgd_val functions.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
---
 include/asm-x86/paravirt.h |   50 
 1 file changed, 28 insertions(+), 22 deletions(-)

diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -950,6 +950,34 @@ static inline pteval_t pte_val(pte_t pte
return ret;
 }
 
+static inline pgd_t __pgd(pgdval_t val)
+{
+   pgdval_t ret;
+
+   if (sizeof(pgdval_t) > sizeof(long))
+   ret = PVOP_CALL2(pgdval_t, pv_mmu_ops.make_pgd,
+val, (u64)val >> 32);
+   else
+   ret = PVOP_CALL1(pgdval_t, pv_mmu_ops.make_pgd,
+val);
+
+   return (pgd_t) { ret };
+}
+
+static inline pgdval_t pgd_val(pgd_t pgd)
+{
+   pgdval_t ret;
+
+   if (sizeof(pgdval_t) > sizeof(long))
+   ret =  PVOP_CALL2(pgdval_t, pv_mmu_ops.pgd_val,
+ pgd.pgd, (u64)pgd.pgd >> 32);
+   else
+   ret =  PVOP_CALL1(pgdval_t, pv_mmu_ops.pgd_val,
+ pgd.pgd);
+
+   return ret;
+}
+
 #ifdef CONFIG_X86_PAE
 static inline pmd_t __pmd(unsigned long long val)
 {
@@ -957,22 +985,10 @@ static inline pmd_t __pmd(unsigned long 
val, val >> 32) };
 }
 
-static inline pgd_t __pgd(unsigned long long val)
-{
-   return (pgd_t) { PVOP_CALL2(unsigned long long, pv_mmu_ops.make_pgd,
-   val, val >> 32) };
-}
-
 static inline unsigned long long pmd_val(pmd_t x)
 {
return PVOP_CALL2(unsigned long long, pv_mmu_ops.pmd_val,
  x.pmd, x.pmd >> 32);
-}
-
-static inline unsigned long long pgd_val(pgd_t x)
-{
-   return PVOP_CALL2(unsigned long long, pv_mmu_ops.pgd_val,
- x.pgd, x.pgd >> 32);
 }
 
 static inline void set_pte(pte_t *ptep, pte_t pteval)
@@ -1023,16 +1039,6 @@ static inline void pmd_clear(pmd_t *pmdp
 }
 
 #else  /* !CONFIG_X86_PAE */
-
-static inline pgd_t __pgd(unsigned long val)
-{
-   return (pgd_t) { PVOP_CALL1(unsigned long, pv_mmu_ops.make_pgd, val) };
-}
-
-static inline unsigned long pgd_val(pgd_t x)
-{
-   return PVOP_CALL1(unsigned long, pv_mmu_ops.pgd_val, x.pgd);
-}
 
 static inline void set_pte(pte_t *ptep, pte_t pteval)
 {
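
The common __pgd()/pgd_val() above choose between a one-word and a two-word
call by comparing sizeof(pgdval_t) with sizeof(long), a test the compiler can
resolve at build time. A minimal userspace sketch of that pattern follows.
The names wideval_t, backend1(), backend2() and make_val() are hypothetical
stand-ins for illustration only and are not part of the kernel API.

```c
#include <stdint.h>

/* Hypothetical stand-in for pgdval_t when entries are 64 bits wide. */
typedef uint64_t wideval_t;

/* Stand-in for a backend that takes the value as one native word. */
static inline uint64_t backend1(unsigned long val)
{
	return (uint64_t)val;
}

/* Stand-in for a backend that takes the value as two 32-bit halves. */
static inline uint64_t backend2(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}

/*
 * The sizeof() comparison is a compile-time constant, so only one branch
 * survives in the generated code. Casting to a 64-bit type before the
 * shift keeps ">> 32" well defined even when the value type is 32 bits.
 */
static inline uint64_t make_val(wideval_t val)
{
	if (sizeof(wideval_t) > sizeof(long))
		return backend2((uint32_t)val, (uint32_t)((uint64_t)val >> 32));
	else
		return backend1((unsigned long)val);
}
```

On both 32-bit and 64-bit builds make_val() should hand back the value it was
given; only the route through the backend differs.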



[PATCH 0 of 8] x86: refactored paravirt mmu_ops

2008-01-16 Thread Jeremy Fitzhardinge

Hi Ingo,

I refactored the paravirt.h mmu_ops patch into a number of smaller ones.

Thanks,
J



[PATCH 7 of 8] x86/paravirt: make set_pmd operation common

2008-01-16 Thread Jeremy Fitzhardinge

Remove duplicate set_pmd()s.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
---
 include/asm-x86/paravirt.h |   43 ---
 1 file changed, 20 insertions(+), 23 deletions(-)

diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -998,6 +998,16 @@ static inline void set_pte_at(struct mm_
PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
 }
 
+static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
+{
+   pmdval_t val = native_pmd_val(pmd);
+
+   if (sizeof(pmdval_t) > sizeof(long))
+   PVOP_VCALL3(pv_mmu_ops.set_pmd, pmdp, val, (u64)val >> 32);
+   else
+   PVOP_VCALL2(pv_mmu_ops.set_pmd, pmdp, val);
+}
+
 #ifdef CONFIG_X86_PAE
 /* Special-case pte-setting operations for PAE, which can't update a
64-bit pte atomically */
@@ -1019,6 +1029,11 @@ static inline void pte_clear(struct mm_s
 {
PVOP_VCALL3(pv_mmu_ops.pte_clear, mm, addr, ptep);
 }
+
+static inline void pmd_clear(pmd_t *pmdp)
+{
+   PVOP_VCALL1(pv_mmu_ops.pmd_clear, pmdp);
+}
 #else  /* !CONFIG_X86_PAE */
 static inline void set_pte_atomic(pte_t *ptep, pte_t pte)
 {
@@ -1035,6 +1050,11 @@ static inline void pte_clear(struct mm_s
 pte_t *ptep)
 {
set_pte_at(mm, addr, ptep, __pte(0));
+}
+
+static inline void pmd_clear(pmd_t *pmdp)
+{
+   set_pmd(pmdp, __pmd(0));
 }
 #endif /* CONFIG_X86_PAE */
 
@@ -1070,33 +1090,10 @@ static inline pmdval_t pmd_val(pmd_t pmd
 
 #ifdef CONFIG_X86_PAE
 
-static inline void set_pmd(pmd_t *pmdp, pmd_t pmdval)
-{
-   PVOP_VCALL3(pv_mmu_ops.set_pmd, pmdp,
-   pmdval.pmd, pmdval.pmd >> 32);
-}
-
 static inline void set_pud(pud_t *pudp, pud_t pudval)
 {
PVOP_VCALL3(pv_mmu_ops.set_pud, pudp,
pudval.pgd.pgd, pudval.pgd.pgd >> 32);
-}
-
-static inline void pmd_clear(pmd_t *pmdp)
-{
-   PVOP_VCALL1(pv_mmu_ops.pmd_clear, pmdp);
-}
-
-#else  /* !CONFIG_X86_PAE */
-
-static inline void set_pmd(pmd_t *pmdp, pmd_t pmdval)
-{
-   PVOP_VCALL2(pv_mmu_ops.set_pmd, pmdp, pmdval.pud.pgd.pgd);
-}
-
-static inline void pmd_clear(pmd_t *pmdp)
-{
-   set_pmd(pmdp, __pmd(0));
 }
 
 #endif /* CONFIG_X86_PAE */



[PATCH 3 of 8] x86/paravirt: common implementation for pte value ops

2008-01-16 Thread Jeremy Fitzhardinge

Remove duplicate __pte/pte_val functions.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
---
 include/asm-x86/paravirt.h |   48 
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -920,15 +920,37 @@ static inline void pte_update_defer(stru
PVOP_VCALL3(pv_mmu_ops.pte_update_defer, mm, addr, ptep);
 }
 
-#ifdef CONFIG_X86_PAE
-static inline pte_t __pte(unsigned long long val)
+static inline pte_t __pte(pteval_t val)
 {
-   unsigned long long ret = PVOP_CALL2(unsigned long long,
-   pv_mmu_ops.make_pte,
-   val, val >> 32);
+   pteval_t ret;
+
+   if (sizeof(pteval_t) > sizeof(long))
+   ret = PVOP_CALL2(pteval_t,
+pv_mmu_ops.make_pte,
+val, (u64)val >> 32);
+   else
+   ret = PVOP_CALL1(pteval_t,
+pv_mmu_ops.make_pte,
+val);
+
return (pte_t) { .pte = ret };
 }
 
+static inline pteval_t pte_val(pte_t pte)
+{
+   pteval_t ret;
+
+   if (sizeof(pteval_t) > sizeof(long))
+   ret = PVOP_CALL2(pteval_t, pv_mmu_ops.pte_val,
+pte.pte, (u64)pte.pte >> 32);
+   else
+   ret = PVOP_CALL1(pteval_t, pv_mmu_ops.pte_val,
+pte.pte);
+
+   return ret;
+}
+
+#ifdef CONFIG_X86_PAE
 static inline pmd_t __pmd(unsigned long long val)
 {
return (pmd_t) { PVOP_CALL2(unsigned long long, pv_mmu_ops.make_pmd,
@@ -939,12 +961,6 @@ static inline pgd_t __pgd(unsigned long 
 {
return (pgd_t) { PVOP_CALL2(unsigned long long, pv_mmu_ops.make_pgd,
val, val >> 32) };
-}
-
-static inline unsigned long long pte_val(pte_t x)
-{
-   return PVOP_CALL2(unsigned long long, pv_mmu_ops.pte_val,
- x.pte_low, x.pte_high);
 }
 
 static inline unsigned long long pmd_val(pmd_t x)
@@ -1008,19 +1024,9 @@ static inline void pmd_clear(pmd_t *pmdp
 
 #else  /* !CONFIG_X86_PAE */
 
-static inline pte_t __pte(unsigned long val)
-{
-   return (pte_t) { PVOP_CALL1(unsigned long, pv_mmu_ops.make_pte, val) };
-}
-
 static inline pgd_t __pgd(unsigned long val)
 {
return (pgd_t) { PVOP_CALL1(unsigned long, pv_mmu_ops.make_pgd, val) };
-}
-
-static inline unsigned long pte_val(pte_t x)
-{
-   return PVOP_CALL1(unsigned long, pv_mmu_ops.pte_val, x.pte_low);
 }
 
 static inline unsigned long pgd_val(pgd_t x)



[PATCH 1 of 8] add native_pud_val and _pmd_val for 2 and 3 level pagetables

2008-01-16 Thread Jeremy Fitzhardinge

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
---
 include/asm-x86/page.h |   10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/asm-x86/page.h b/include/asm-x86/page.h
--- a/include/asm-x86/page.h
+++ b/include/asm-x86/page.h
@@ -91,6 +91,11 @@ static inline pudval_t native_pud_val(pu
 }
 #else  /* PAGETABLE_LEVELS == 3 */
 #include 
+
+static inline pudval_t native_pud_val(pud_t pud)
+{
+   return native_pgd_val(pud.pgd);
+}
 #endif /* PAGETABLE_LEVELS == 4 */
 
 typedef struct { pmdval_t pmd; } pmd_t;
@@ -106,6 +111,11 @@ static inline pmdval_t native_pmd_val(pm
 }
 #else  /* PAGETABLE_LEVELS == 2 */
 #include 
+
+static inline pmdval_t native_pmd_val(pmd_t pmd)
+{
+   return native_pgd_val(pmd.pud.pgd);
+}
 #endif /* PAGETABLE_LEVELS >= 3 */
 
 static inline pte_t native_make_pte(pteval_t val)
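
The two helpers added by this patch work because, with fewer page table
levels, the upper-level types are folded into wrappers around pgd_t. A
simplified userspace model of the 2-level case is sketched below; the type
definitions are assumptions that mirror the field names used in the patch
(pud.pgd, pmd.pud.pgd) and are not copied from the kernel headers.

```c
#include <stdint.h>

/* Assumed value types for a 2-level, 32-bit layout. */
typedef uint32_t pgdval_t;
typedef pgdval_t pudval_t;
typedef pgdval_t pmdval_t;

/* Folded levels: each upper-level type just wraps the one above it. */
typedef struct { pgdval_t pgd; } pgd_t;
typedef struct { pgd_t pgd; } pud_t;
typedef struct { pud_t pud; } pmd_t;

static inline pgdval_t native_pgd_val(pgd_t pgd)
{
	return pgd.pgd;
}

/* With the pud folded away, its value is the value of the wrapped pgd. */
static inline pudval_t native_pud_val(pud_t pud)
{
	return native_pgd_val(pud.pgd);
}

/* Likewise, a folded pmd unwraps through the pud down to the pgd. */
static inline pmdval_t native_pmd_val(pmd_t pmd)
{
	return native_pgd_val(pmd.pud.pgd);
}
```

Having the accessors defined for every level count is what lets the later
patches in the series write a single set_pmd()/set_pud() body.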



[PATCH 8 of 8] x86/paravirt: make set_pud operation common

2008-01-16 Thread Jeremy Fitzhardinge

Remove duplicate set_pud()s.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
---
 include/asm-x86/paravirt.h |   22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -1086,17 +1086,19 @@ static inline pmdval_t pmd_val(pmd_t pmd
 
return ret;
 }
+
+static inline void set_pud(pud_t *pudp, pud_t pud)
+{
+   pudval_t val = native_pud_val(pud);
+
+   if (sizeof(pudval_t) > sizeof(long))
+   PVOP_VCALL3(pv_mmu_ops.set_pud, pudp,
+   val, (u64)val >> 32);
+   else
+   PVOP_VCALL2(pv_mmu_ops.set_pud, pudp,
+   val);
+}
 #endif /* PAGETABLE_LEVELS >= 3 */
-
-#ifdef CONFIG_X86_PAE
-
-static inline void set_pud(pud_t *pudp, pud_t pudval)
-{
-   PVOP_VCALL3(pv_mmu_ops.set_pud, pudp,
-   pudval.pgd.pgd, pudval.pgd.pgd >> 32);
-}
-
-#endif /* CONFIG_X86_PAE */
 
 /* Lazy mode for batching updates / context switch */
 enum paravirt_lazy_mode {



Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles

2008-01-16 Thread john stultz


On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote:
> Thanks John for doing this!
> 
> (comments imbedded)
> 
> On Wed, 16 Jan 2008, john stultz wrote:
> > +   int num = !cs->base_num;
> > +   cycle_t offset = (now - cs->base[!num].cycle_base_last);
> > +   offset &= cs->mask;
> > +   cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
> > +   cs->base[num].cycle_base_last = now;
> 
> I would think that we would need some sort of barrier here. Otherwise,
> base_num could be updated before all the cycle_base. I'd expect a smp_wmb
> is needed.

Hopefully addressed in the current version.


> > Index: monotonic-cleanup/kernel/time/timekeeping.c
> > ===
> > --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 12:21:46.0 -0800
> > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 14:15:31.0 -0800
> > @@ -71,10 +71,12 @@
> >   */
> >  static inline s64 __get_nsec_offset(void)
> >  {
> > -   cycle_t cycle_delta;
> > +   cycle_t now, cycle_delta;
> > s64 ns_offset;
> >
> > -   cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
> > +   now = clocksource_read(clock);
> > +   cycle_delta = (now - clock->cycle_last) & clock->mask;
> > +   cycle_delta += clock->cycle_accumulated;
> 
> Is the above just to decouple the two methods?

Yep. clocksource_get_cycles() ended up not being as useful as a helper
function (I was hoping the arch vsyscall implementations could use it,
but they've done too much optimization - although that may reflect a
need up the chain to the clocksource structure).

thanks
-john
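
The open-coded delta quoted above relies on masking an unsigned subtraction,
which gives the right answer even when the hardware counter has wrapped since
the last read. A small sketch of that arithmetic follows; cycle_t is modelled
as a 64-bit unsigned type, and the helper names cycle_delta() and
nsec_offset_cycles() are assumptions made for illustration only.

```c
#include <stdint.h>

typedef uint64_t cycle_t;

/*
 * Cycles elapsed between two reads of a counter that is only as wide as
 * `mask`. Unsigned subtraction wraps modulo 2^64, and the mask reduces
 * that to modulo the counter width, so one wrap is handled correctly.
 */
static inline cycle_t cycle_delta(cycle_t now, cycle_t last, cycle_t mask)
{
	return (now - last) & mask;
}

/* Same shape as the quoted hunk: fresh delta plus what was accumulated. */
static inline cycle_t nsec_offset_cycles(cycle_t now, cycle_t last,
					 cycle_t mask, cycle_t accumulated)
{
	return cycle_delta(now, last, mask) + accumulated;
}
```

For a 24-bit counter (mask 0xFFFFFF), a previous read of 0xFFFFF0 and a
current read of 0x000010 give a delta of 0x20. The result is only meaningful
if fewer than one full counter period has passed between the two reads.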



Re: 2.6.24-rc7-rt2

2008-01-16 Thread Steven Rostedt


Thomas, can you look at this. He's getting APIC errors on bootup. I'm
wondering if this isn't another strange anomaly of this controller.

He also states that he doesn't get this with the non-rt kernel.

>
> -rt3 on top of 2.6.24-rc8 works fine without that sysfs problem (acpi warnings
> still there and full dmesg can be found from [1]), whatever causes this seems
> solved :)
>
> [1] http://cekirdek.pardus.org.tr/~caglar/dmesg.rt3

-- Steve


Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles

2008-01-16 Thread Mathieu Desnoyers

* john stultz ([EMAIL PROTECTED]) wrote:
> 
> On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote:
> > * john stultz ([EMAIL PROTECTED]) wrote:
> > > 
> > > On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote:
> > > > On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[EMAIL PROTECTED]> wrote:
> > > > > If you really want an seqlock free algorithm (I _do_ want this for
> > > > > tracing!) :) maybe going in the RCU direction could help (I refer to my
> > > > > RCU-based 32-to-64 bits lockless timestamp counter extension, which
> > > > > could be turned into the clocksource updater).
> > > > 
> > > > Yea. After our earlier discussion and talking w/ Steven, I'm taking a
> > > > swing at this now.  The lock-free method still doesn't apply to the
> > > > update_wall_time function, but does work fine for the monotonic cycle
> > > > uses.  I'll send a patch for review as soon as I get things building.
> > > 
> > > So here's my first attempt at adding Mathieu's lock-free method to
> > > Steven's get_monotonic_cycles() interface. 
> > > 
> > > Completely un-tested, but it builds, so I figured I'd send it out for
> > > review.
> > > 
> > > I'm not super sure the update or the read doesn't need something
> > > additional to force a memory access, but as I didn't see anything 
> > > special in Mathieu's implementation, I'm going to guess this is ok.
> > > 
> > > Mathieu, Let me know if this isn't what you're suggesting.
> > > 
> > > Signed-off-by: John Stultz <[EMAIL PROTECTED]>
> > > 
> > > Index: monotonic-cleanup/include/linux/clocksource.h
> > > ===
> > > --- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 12:22:04.0 -0800
> > > +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 14:41:31.0 -0800
> > > @@ -87,9 +87,17 @@
> > >* more than one cache line.
> > >*/
> > >   struct {
> > > - cycle_t cycle_last, cycle_accumulated, cycle_raw;
> > > - } cacheline_aligned_in_smp;
> > 
> > Shouldn't the cycle_last and cycle_accumulated be in the array too?
> 
> No, we're leaving cycle_last and cycle_accumulated alone. They're
> relating to the update_wall_time conversion of cycles to xtime. 
> 
> > > + cycle_t cycle_last, cycle_accumulated;
> > >  
> > > + /* base structure provides lock-free read
> > > +  * access to a virtualized 64bit counter
> > > +  * Uses RCU-like update.
> > > +  */
> > > + struct {
> > 
> > We had cycle_raw before, why do we need the following two ?
> >
> > > + cycle_t cycle_base_last, cycle_base;
> > 
> > I'm not quite sure why you need both cycle_base_last and cycle_base...
> 
> So on my first shot at this, I tried to layer the concepts. Using the
> lock-free method to create an abstracted 64bit counter, as provided by
> get_monotonic_cycles(). Then I tried to use that abstraction directly in
> the update_wall_time() code, reading the abstracted 64bit counter and
> using it to update time.
> 
> However, then we start keeping cycle_last in 64bit cycles, rather than
> an actual counter read. This then caused changes to be needed in the
> arch vsyscall implementations, and that started to get ugly, as we had
> to also re-implement the abstracted 64bit counter w/ the lock free
> method as well. 
> 
> So I just backed off and tried to make it simple: We have two sets of
> data that counts cycles from the clocksource. One for timekeeping and
> one for get_monotonic_cycles(). It is a little redundant, but I don't
> think you can escape that (the layering method above also has
> redundancy, but it's just hidden until you implement the vsyscall gtod
> methods).
> 
> 
> > I think I'll need a bit of an explanation of what you are trying to
> > achieve here to see what to expect from the clock source. Are you trying
> > to deal with non-synchronized TSCs across CPUs in a way that will
> > generate a monotonic (sometimes stalling) clock ?
> 
> No no no.. I'm not touching the non-synced TSC issue. I'm just trying to
> take clocksource counters, which may be of different bit-widths (ACPI PM
> is 24bits, for instance), and create a lock-free method to translate that
> into a virtual 64bit wide counter (using an accumulation bucket,
> basically).
> 
> > What I am trying to say is : I know you are trying to make a virtual
> > clock source where time cannot go backward, but what are your
> > assumptions about the "real" clock source ?
> 
> The assumptions about the real clocksource are the same ones we keep in the
> timekeeping core. It counts forward, at a constant rate and only wraps
> after the mask value has been reached. 
> 
> > Is the intent to deal with an HPET suddenly reset to 0 or something
> > like this ?
> 
> Well, dealing with clocksources wrapping short of 64bits.
> 

Ah ok, then the problem is clearer :) The main difference between the
approach I use and yours is that, let's say your clocksource

Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles

2008-01-16 Thread john stultz

On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote:
> I would disable preemption in clocksource_get_basecycles. We would not
> want to be scheduled out while we hold a pointer to the old array
> element.
> 
> > +   int num = cs->base_num;
> 
> Since you deal with base_num in a shared manner (not per cpu), you will
> need a smp_read_barrier_depends() here after the cs->base_num read.
> 
> You should think about reading the cs->base_num first, and _after_ that
> read the real clocksource. Here, the clocksource value is passed as
> parameter. It means that the read clocksource may have been read in the
> previous RCU window.

Here's an updated version of the patch w/ the suggested memory barrier
changes and favored (1-x) inversion change. ;)  Let me know if you see
any other holes, or have any other suggestions or ideas.

Still un-tested (my test box will free up soon, I promise!), but builds.

Signed-off-by: John Stultz <[EMAIL PROTECTED]>

Index: monotonic-cleanup/include/linux/clocksource.h
===
--- monotonic-cleanup.orig/include/linux/clocksource.h  2008-01-16 12:22:04.0 -0800
+++ monotonic-cleanup/include/linux/clocksource.h   2008-01-16 18:12:53.0 -0800
@@ -87,9 +87,17 @@
 * more than one cache line.
 */
struct {
-   cycle_t cycle_last, cycle_accumulated, cycle_raw;
-   } cacheline_aligned_in_smp;
+   cycle_t cycle_last, cycle_accumulated;
 
+   /* base structure provides lock-free read
+* access to a virtualized 64bit counter
+* Uses RCU-like update.
+*/
+   struct {
+   cycle_t cycle_base_last, cycle_base;
+   } base[2];
+   int base_num;
+   } cacheline_aligned_in_smp;
u64 xtime_nsec;
s64 error;
 
@@ -175,19 +183,29 @@
 }
 
 /**
- * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
+ * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
  * @cs:pointer to clocksource being read
  * @now:   current cycle value
  *
  * Uses the clocksource to return the current cycle_t value.
  * NOTE!!!: This is different from clocksource_read, because it
- * returns the accumulated cycle value! Must hold xtime lock!
+ * returns a 64bit wide accumulated value.
  */
 static inline cycle_t
-clocksource_get_cycles(struct clocksource *cs, cycle_t now)
+clocksource_get_basecycles(struct clocksource *cs)
 {
-   cycle_t offset = (now - cs->cycle_last) & cs->mask;
-   offset += cs->cycle_accumulated;
+   int num;
+   cycle_t now, offset;
+
+   preempt_disable();
+   num = cs->base_num;
+   smp_read_barrier_depends();
+   now = clocksource_read(cs);
+   offset = (now - cs->base[num].cycle_base_last);
+   offset &= cs->mask;
+   offset += cs->base[num].cycle_base;
+   preempt_enable();
+
return offset;
 }
 
@@ -197,14 +215,26 @@
  * @now:   current cycle value
  *
  * Used to avoids clocksource hardware overflow by periodically
- * accumulating the current cycle delta. Must hold xtime write lock!
+ * accumulating the current cycle delta. Uses RCU-like update, but
+ * ***still requires the xtime_lock is held for writing!***
  */
 static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
 {
-   cycle_t offset = (now - cs->cycle_last) & cs->mask;
+   /* First update the monotonic base portion.
+* The dual array update method allows for lock-free reading.
+*/
+   int num = 1 - cs->base_num;
+   cycle_t offset = (now - cs->base[1-num].cycle_base_last);
+   offset &= cs->mask;
+   cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset;
+   cs->base[num].cycle_base_last = now;
+   wmb();
+   cs->base_num = num;
+
+   /* Now update the cycle_accumulated portion */
+   offset = (now - cs->cycle_last) & cs->mask;
cs->cycle_last = now;
cs->cycle_accumulated += offset;
-   cs->cycle_raw += offset;
 }
 
 /**
Index: monotonic-cleanup/kernel/time/timekeeping.c
===
--- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 12:21:46.0 -0800
+++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 17:51:50.0 -0800
@@ -71,10 +71,12 @@
  */
 static inline s64 __get_nsec_offset(void)
 {
-   cycle_t cycle_delta;
+   cycle_t now, cycle_delta;
s64 ns_offset;
 
-   cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
+   now = clocksource_read(clock);
+   cycle_delta = (now - clock->cycle_last) & clock->mask;
+   cycle_delta += clock->cycle_accumulated;
ns_offset = cyc2ns(clock, cycle_delta);
 
return ns_offset;
@@ -105,35 +107,7 @@
 
 cycle_t notrace get_monotonic_cycles(void)
 {
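
The dual-array update in the patch above can be modelled in userspace with
C11 atomics. This is only a sketch under stated assumptions, not the kernel
code: struct counter, counter_accumulate() and counter_read() are
hypothetical names, a release store stands in for wmb() followed by the
base_num assignment, and an acquire load stands in for the base_num read
followed by smp_read_barrier_depends(). The reader takes `now` as a
parameter purely so the arithmetic can be checked; the patch instead reads
the clocksource after loading base_num, as suggested in the review.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t cycle_t;

struct base {
	cycle_t cycle_base_last;	/* raw counter value at last update */
	cycle_t cycle_base;		/* accumulated 64-bit cycles at that point */
};

struct counter {
	cycle_t mask;			/* width of the hardware counter */
	struct base base[2];		/* one live slot, one being prepared */
	atomic_int base_num;		/* index of the live slot */
};

/* Writer side. A single writer is assumed (the patch holds xtime_lock). */
static inline void counter_accumulate(struct counter *c, cycle_t now)
{
	int cur = atomic_load_explicit(&c->base_num, memory_order_relaxed);
	int next = 1 - cur;
	cycle_t offset = (now - c->base[cur].cycle_base_last) & c->mask;

	/* Fill in the idle slot from the live one. */
	c->base[next].cycle_base = c->base[cur].cycle_base + offset;
	c->base[next].cycle_base_last = now;

	/* Publish: the slot contents become visible before the new index. */
	atomic_store_explicit(&c->base_num, next, memory_order_release);
}

/* Reader side: no lock, just pick the live slot and extend the counter. */
static inline cycle_t counter_read(struct counter *c, cycle_t now)
{
	int num = atomic_load_explicit(&c->base_num, memory_order_acquire);
	cycle_t offset = (now - c->base[num].cycle_base_last) & c->mask;

	return c->base[num].cycle_base + offset;
}
```

The sketch does not address a reader that is delayed across two consecutive
updates and so ends up looking at a slot that is being rewritten. That is the
case the preempt_disable() in clocksource_get_basecycles() is meant to narrow.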

Re: [PATCH 0 of 4] x86: some more patches

2008-01-16 Thread Andi Kleen

On Wednesday 16 January 2008 22:06:46 Ingo Molnar wrote:
> 
> * Andi Kleen <[EMAIL PROTECTED]> wrote:
> 
> > > truly stuck, or just an annoying message?
> > 
> > Just annoying message; system works after that for simple login etc. 
> > (haven't run anything complicated)
> 
> ok, if you see no other failures and if you have some time it would be 
> nice to figure out why that message is happening. Is NOHZ enabled 
> perhaps?

I updated now to latest git-x86 and the problem seems to have gone away.

But yes, NO_HZ was enabled, and the system also had a very unsynchronized
TSC (dual K8 opteron with cpufreq) 

Also I should say my patches were applied, but I didn't change
anything in this area.

-Andi


Re: [PATCH 1/4] RT: remove duplicate time/Kconfig

2008-01-16 Thread Frank Rowand


On Wed, 2008-01-16 at 20:29 -0500, Steven Rostedt wrote:
> On Wed, 16 Jan 2008, Frank Rowand wrote:
> 
> >
> > time/Kconfig added by preempt-realtime-mips.patch duplicates other entry,
> > resulting in kernel make error:
> >
> > Signed-off-by: Frank Rowand <[EMAIL PROTECTED]>
> > ---
> >  arch/mips/Kconfig |2   0 + 2 - 0 !
> >  1 files changed, 2 deletions(-)
> >
> > Index: linux-2.6.24-rc7/arch/mips/Kconfig
> > ===
> > --- linux-2.6.24-rc7.orig/arch/mips/Kconfig
> > +++ linux-2.6.24-rc7/arch/mips/Kconfig
> > @@ -1001,8 +1001,6 @@ config BOOT_ELF64
> >
> >  menu "CPU selection"
> >
> > -source "kernel/time/Kconfig"
> > -
> >  choice
> > prompt "CPU type"
> > default CPU_R4X00
> 
> heh, This doesn't apply either. Or is this to be done before the patches
> are added?

Hmmm, I don't seem to be doing too well here.  It was created after the
-rt3 patch was applied.  Thanks for your patience with me here!!

> 
> Anyway, I did find the two Kconfig references:
> 
> ...
> 
>   This is purely to save memory - each supported CPU adds
>   approximately eight kilobytes to the kernel image.  For best
>   performance should round up your number of processors to the next
>   power of two.
> 
> source "kernel/time/Kconfig"
> 
> #
> # Timer Interrupt Frequency Configuration
> #
> 
> ...
> 
> config GENERIC_TIME
> bool
> default y
> 
> source "kernel/time/Kconfig"
> 
> config CPU_SPEED
> int "CPU speed used for clocksource/clockevent calculations"
> default 600
> endmenu
> 
> ...
> 
> 
> I'll apply your patch in quilt and then make the proper change. Which
> Kconfig do you want gone?

It would be good to get rid of the first one, but I'm especially picky.


> 
> -- Steve
> 
> 



Re: 2.6.24-rc7-rt2

2008-01-16 Thread Steven Rostedt


On Tue, 15 Jan 2008, Mariusz Kozlowski wrote:
> Ok. It works.
>
> I found this in dmesg:
>
> BUG: swapper:0 task might have lost a preemption check!
> Pid: 0, comm: swapper Not tainted 2.6.24-rc7-rt2 #3
>  [] show_trace_log_lvl+0x1d/0x3b
>  [] show_trace+0x12/0x14
>  [] dump_stack+0x6a/0x70
>  [] preempt_enable_no_resched+0x5c/0x5e

This is really really strange. cpu_idle calls __preempt_enable_no_resched
and not preempt_enable_no_resched (notice the prefixed underscores).
So I don't know how you got that output. Did you get any strange rejects
in applying this patch?

-- Steve


>  [] cpu_idle+0x6d/0x82
>  [] rest_init+0x66/0x68
>  [] start_kernel+0x20c/0x276
>  [<>] 0x0
>  ===
> ---
> | preempt count:  ]
> | 0-level deep critical section nesting:
> 
>
> Box runs fine though.
>
> Regards,
>
>   Mariusz
> -
> To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Daniel Phillips

On Jan 16, 2008 2:06 PM, Bryan Henderson <[EMAIL PROTECTED]> wrote:
> >The "disk motor as a generator" tale may not be purely folklore.  When
> >an IDE drive is not in writeback mode, something special needs to be done
> >to ensure the last write to media is not a scribble.
>
> No it doesn't.  The last write _is_ a scribble.

Have you observed that in the wild?  A former engineer of a disk drive
company suggests to me that the capacitors on the board provide enough
power to complete the last sector, even to park the head.

Regards,

Daniel

Re: [PATCH] Revert "local_t Documentation update"

2008-01-16 Thread Li Zefan

Mathieu Desnoyers wrote:
> * Li Zefan ([EMAIL PROTECTED]) wrote:
>> This reverts commit e1265205c0ee3919c3f2c750662630154c8faab2.
>>
>> It's a duplicate commit of commit 74beb9db77930be476b267ec8518a642f39a04bf,
>> resulting in a duplicate section.
>>
>> Signed-off-by: Li Zefan <[EMAIL PROTECTED]>
>>
> 
> Thanks, I guess it's been merged twice somehow :S
> 
> Acked-by: Mathieu Desnoyers <[EMAIL PROTECTED]>
> 

Yeap, I guess so.

Re: [PATCH] Revert "local_t Documentation update"

2008-01-16 Thread Mathieu Desnoyers

* Li Zefan ([EMAIL PROTECTED]) wrote:
> This reverts commit e1265205c0ee3919c3f2c750662630154c8faab2.
> 
> It's a duplicate commit of commit 74beb9db77930be476b267ec8518a642f39a04bf,
> resulting in a duplicate section.
> 
> Signed-off-by: Li Zefan <[EMAIL PROTECTED]>
> 

Thanks, I guess it's been merged twice somehow :S

Acked-by: Mathieu Desnoyers <[EMAIL PROTECTED]>

> ---
>  Documentation/local_ops.txt |   23 ---
>  1 files changed, 0 insertions(+), 23 deletions(-)
> 
> diff --git a/Documentation/local_ops.txt b/Documentation/local_ops.txt
> index 1a45f11..4269a11 100644
> --- a/Documentation/local_ops.txt
> +++ b/Documentation/local_ops.txt
> @@ -68,29 +68,6 @@ typedef struct { atomic_long_t a; } local_t;
>variable can be read when reading some _other_ cpu's variables.
>  
>  
> -* Rules to follow when using local atomic operations
> -
> -- Variables touched by local ops must be per cpu variables.
> -- _Only_ the CPU owner of these variables must write to them.
> -- This CPU can use local ops from any context (process, irq, softirq, nmi, 
> ...)
> -  to update its local_t variables.
> -- Preemption (or interrupts) must be disabled when using local ops in
> -  process context to   make sure the process won't be migrated to a
> -  different CPU between getting the per-cpu variable and doing the
> -  actual local op.
> -- When using local ops in interrupt context, no special care must be
> -  taken on a mainline kernel, since they will run on the local CPU with
> -  preemption already disabled. I suggest, however, to explicitly
> -  disable preemption anyway to make sure it will still work correctly on
> -  -rt kernels.
> -- Reading the local cpu variable will provide the current copy of the
> -  variable.
> -- Reads of these variables can be done from any CPU, because updates to
> -  "long", aligned, variables are always atomic. Since no memory
> -  synchronization is done by the writer CPU, an outdated copy of the
> -  variable can be read when reading some _other_ cpu's variables.
> -
> -
>  * How to use local atomic operations
>  
>  #include 
> -- 
> 1.5.3.rc7
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

[PATCH] Revert "local_t Documentation update"

2008-01-16 Thread Li Zefan

This reverts commit e1265205c0ee3919c3f2c750662630154c8faab2.

It's a duplicate commit of commit 74beb9db77930be476b267ec8518a642f39a04bf,
resulting in a duplicate section.

Signed-off-by: Li Zefan <[EMAIL PROTECTED]>

---
 Documentation/local_ops.txt |   23 ---
 1 files changed, 0 insertions(+), 23 deletions(-)

diff --git a/Documentation/local_ops.txt b/Documentation/local_ops.txt
index 1a45f11..4269a11 100644
--- a/Documentation/local_ops.txt
+++ b/Documentation/local_ops.txt
@@ -68,29 +68,6 @@ typedef struct { atomic_long_t a; } local_t;
   variable can be read when reading some _other_ cpu's variables.
 
 
-* Rules to follow when using local atomic operations
-
-- Variables touched by local ops must be per cpu variables.
-- _Only_ the CPU owner of these variables must write to them.
-- This CPU can use local ops from any context (process, irq, softirq, nmi, ...)
-  to update its local_t variables.
-- Preemption (or interrupts) must be disabled when using local ops in
-  process context to   make sure the process won't be migrated to a
-  different CPU between getting the per-cpu variable and doing the
-  actual local op.
-- When using local ops in interrupt context, no special care must be
-  taken on a mainline kernel, since they will run on the local CPU with
-  preemption already disabled. I suggest, however, to explicitly
-  disable preemption anyway to make sure it will still work correctly on
-  -rt kernels.
-- Reading the local cpu variable will provide the current copy of the
-  variable.
-- Reads of these variables can be done from any CPU, because updates to
-  "long", aligned, variables are always atomic. Since no memory
-  synchronization is done by the writer CPU, an outdated copy of the
-  variable can be read when reading some _other_ cpu's variables.
-
-
 * How to use local atomic operations
 
 #include 
-- 
1.5.3.rc7


Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles

2008-01-16 Thread Mathieu Desnoyers

* Linus Torvalds ([EMAIL PROTECTED]) wrote:
> 
> 
> On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> >
> > > + int num = !cs->base_num;
> > > + cycle_t offset = (now - cs->base[!num].cycle_base_last);
> > 
> > !0 is not necessarily 1.
> 
> Incorrect.
> 

Hrm, *digging in my mailbox*, ah, here it is :

http://listserv.shafik.org/pipermail/ltt-dev/2006-June/001548.html

Richard Purdie reviewed my code back in 2006 and made this modification.
Maybe he will have something to add.


> !0 _is_ necessarily 1. It's how all C logical operators work. If you find 
> a compiler that turns !x into anything but 0/1, you found a compiler for 
> another language than C.
> 
> It's true that any non-zero value counts as "true", but that does not 
> mean that a logical operator can return any non-zero value for true. As a 
> return value of the logical operations in C, true is *always* 1.
> 
> So !, ||, &&, when used as values, will *always* return either 0 or 1 (but 
> when used as part of a conditional, the compiler will often optimize out 
> unnecessary stuff, so the CPU may not actually ever see a 0/1 value, if 
> the value itself was never used, only branched upon).
> 
> So doing "!cs->base_num" to turn 0->1 and 1->0 is perfectly fine.
> 
> That's not to say it's necessarily the *best* way.
> 
> If you *know* that you started with 0/1 in the first place, the best way 
> to flip it tends to be to do (1-x) (or possibly (x^1)).
> 
> And if you can't guarantee that, !x is probably better than x ? 0 : 1, 
> but you might also decide to use ((x+1)&1) for example.
> 
> And obviously, the compiler may sometimes surprise you, and if *it* also 
> knows it's always 0/1 (for something like the source being a single-bit 
> bitfield for example), it may end up doing something else than you coded 
> that is equivalent. And the particular choice of operation the compiler 
> chooses may well depend on the code _around_ that sequence.
> 
> (One reason to potentially prefer (1-x) over (x^1) is that it's often 
> easier to combine a subtraction with other operations, while an xor seldom 
> combines with anything around it)
> 

Ok, I'll adopt (1-x) then. Thanks!

Mathieu

>   Linus

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

Re: [PATCH 1/4] RT: remove duplicate time/Kconfig

2008-01-16 Thread Steven Rostedt

On Wed, 16 Jan 2008, Frank Rowand wrote:

>
> time/Kconfig added by preempt-realtime-mips.patch duplicates other entry,
> resulting in kernel make error:
>
> Signed-off-by: Frank Rowand <[EMAIL PROTECTED]>
> ---
>  arch/mips/Kconfig |2 0 + 2 - 0 !
>  1 files changed, 2 deletions(-)
>
> Index: linux-2.6.24-rc7/arch/mips/Kconfig
> ===
> --- linux-2.6.24-rc7.orig/arch/mips/Kconfig
> +++ linux-2.6.24-rc7/arch/mips/Kconfig
> @@ -1001,8 +1001,6 @@ config BOOT_ELF64
>
>  menu "CPU selection"
>
> -source "kernel/time/Kconfig"
> -
>  choice
>   prompt "CPU type"
>   default CPU_R4X00

heh, This doesn't apply either. Or is this to be done before the patches
are added?

Anyway, I did find the two Kconfig references:

...

  This is purely to save memory - each supported CPU adds
  approximately eight kilobytes to the kernel image.  For best
  performance should round up your number of processors to the next
  power of two.

source "kernel/time/Kconfig"

#
# Timer Interrupt Frequency Configuration
#

...

config GENERIC_TIME
bool
default y

source "kernel/time/Kconfig"

config CPU_SPEED
int "CPU speed used for clocksource/clockevent calculations"
default 600
endmenu

...

I'll apply your patch in quilt and then make the proper change. Which
Kconfig do you want gone?

-- Steve


Re: [PATCH] SH/Dreamcast - add support for GD-Rom CDROM drive on SEGA Dreamcast

2008-01-16 Thread Paul Mundt

On Wed, Jan 16, 2008 at 11:57:57PM +, Adrian McMenamin wrote:
> 
> On Mon, 2008-01-14 at 23:17 +, Adrian McMenamin wrote:
> > On Mon, 2008-01-14 at 23:00 +, Adrian McMenamin wrote:
> > > From: Adrian McMenamin <[EMAIL PROTECTED]>
> > >  
> > > This patch adds support for the GD-Rom drive, SEGA's proprietary
> > > implementation of an IDE CD Rom for the SEGA Dreamcast. This driver
> > > implements Sega's Packet Interface (SPI) - at least partially. It will
> > > > also read disks in SEGA's proprietary GD format.
> > > 
> > > Unlike previous drivers (which were never in mainline) this uses DMA and
> > > not PIO to read disks. It is a new driver, not a refactoring of old
> > > drivers. 
> > > 
> > > Hopefully this patch addresses some issues that have been raised by
> > > Andrew. Jens previously ack'ed this, but I've left that off (though the
> > > IO code hasn't really been touched).
> > >  
> > > Signed-off by: Adrian McMenamin <[EMAIL PROTECTED]>
> > >  
> > > (Jens - this awaits your ack before going in Paul's queue for the 2.6.25
> > > window)
> > >  
> > 
> I think I have now removed all the whitespace issues.
> 
> Signed-off by: Adrian McMenamin <[EMAIL PROTECTED]>
> 
Looks better. __devexit/__devexit_p()?

Re: [PATCH 7/7] driver-core : convert semaphore to mutex in struct class

2008-01-16 Thread Dave Young

On Jan 16, 2008 11:27 PM, Alan Stern <[EMAIL PROTECTED]> wrote:
> On Wed, 16 Jan 2008, Dave Young wrote:
>
> > The lockdep warning was posted in the below thread, actually, I have
> > built and run this patched kernel for several days, there's no more
> > warnings.
> > http://lkml.org/lkml/2008/1/3/2
>
> Your meaning isn't clear.  Do you mean that your patch doesn't generate
> any lockdep warnings at all?  Or do you mean that it generates a single
> lockdep warning at boot time and then no more warnings afterward?

I mean the latter.

>
> Alan Stern
>
>

Re: [PATCH 7/7] driver-core : convert semaphore to mutex in struct class

2008-01-16 Thread Dave Young

On Jan 16, 2008 4:34 PM, Jarek Poplawski <[EMAIL PROTECTED]> wrote:
> On Wed, Jan 16, 2008 at 09:03:03AM +0800, Dave Young wrote:
> ...
> > The lockdep warning was posted in the below thread, actually, I have
> > built and run this patched kernel for several days, there's no more
> > warnings.
> > http://lkml.org/lkml/2008/1/3/2
>
> Right... But, with something like this:
>
> ... have_some_fun(... cls)
> {
> mutex_lock_nested(&cls->mutex, SINGLE_DEPTH_NESTING);
> have_other_fun(cls);
> mutex_unlock(&cls->mutex);
>
> }
>
> ... have_more_fun(...)
> {
> ...
>
> mutex_init(&cls->mutex);
>
> mutex_lock(&cls->mutex);
> have_some_fun(cls);
> mutex_unlock(&cls->mutex);
> }
>
> probably you wouldn't get any lockdep warning too...

Sorry for the late reply.
Actually, I don't know much about lockdep. Could you tell how to use
it properly in this scenario?

>
> Of course, if we know all the locking is right such proper lockdep
> annotating shouldn't matter too much. (And of course this could be
> improved later.)
>
> Regards,
> Jarek P.
>

Re: [patch 3/3] Add bug/warn marker to generic report_bug()

2008-01-16 Thread Arjan van de Ven

On Wed, 16 Jan 2008 19:10:00 -0600
Olof Johansson <[EMAIL PROTECTED]> wrote:

> Powerpc uses the generic report_bug() from lib/bug.c to report
> warnings, and I'm guessing other arches do as well.
> 
> Add the module list as well as the end-of-trace marker to the output.
> This required making print_oops_end_marker() nonstatic.
> 
> 
> Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>
> 
> 

All three are
Acked-by: Arjan van de Ven <[EMAIL PROTECTED]>

Re: [patch] VFS: extend /proc/mounts

2008-01-16 Thread Jan Engelhardt


On Jan 17 2008 11:33, Neil Brown wrote:
>On Thursday January 17, [EMAIL PROTECTED] wrote:
>> 
>> On Jan 17 2008 00:43, Karel Zak wrote:
>> >> 
>> >> Seems like a plain bad idea to me.  There will be any number of home-made
>> >> /proc/mounts parsers and we don't know what they do.
>> >
>> > So, let's use /proc/mounts_v2  ;-)
>> 
>> Was not it like "don't use /proc for new things"?
>
>I thought it was "don't use /proc for new things that aren't process
>related".
>
>And as the mount table is per process..

You are right. I'm still in the world where CLONE_NEWNS is not used all
that much in the daily routine, either by the distro or by me.

>In the tradition of stat, statm, status, maybe the former should be
> /proc/$PID/mountm

What next - /proc/pid/mountus? :)

Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles

2008-01-16 Thread Linus Torvalds

On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
>
> > +   int num = !cs->base_num;
> > +   cycle_t offset = (now - cs->base[!num].cycle_base_last);
> 
> !0 is not necessarily 1.

Incorrect.

!0 _is_ necessarily 1. It's how all C logical operators work. If you find 
a compiler that turns !x into anything but 0/1, you found a compiler for 
another language than C.

It's true that any non-zero value counts as "true", but that does not 
mean that a logical operator can return any non-zero value for true. As a 
return value of the logical operations in C, true is *always* 1.

So !, ||, &&, when used as values, will *always* return either 0 or 1 (but 
when used as part of a conditional, the compiler will often optimize out 
unnecessary stuff, so the CPU may not actually ever see a 0/1 value, if 
the value itself was never used, only branched upon).

So doing "!cs->base_num" to turn 0->1 and 1->0 is perfectly fine.

That's not to say it's necessarily the *best* way.

If you *know* that you started with 0/1 in the first place, the best way 
to flip it tends to be to do (1-x) (or possibly (x^1)).

And if you can't guarantee that, !x is probably better than x ? 0 : 1, 
but you might also decide to use ((x+1)&1) for example.

And obviously, the compiler may sometimes surprise you, and if *it* also 
knows it's always 0/1 (for something like the source being a single-bit 
bitfield for example), it may end up doing something else than you coded 
that is equivalent. And the particular choice of operation the compiler 
chooses may well depend on the code _around_ that sequence.

(One reason to potentially prefer (1-x) over (x^1) is that it's often 
easier to combine a subtraction with other operations, while an xor seldom 
combines with anything around it)

Linus
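
A minimal C sketch of the point made above, for illustration only. The
function names are made up here and do not come from the patch under
discussion. It checks that !x always yields 0 or 1, that (1-x) and (x^1)
agree with it only when x is already 0 or 1, and that ((x+1)&1) always
yields 0 or 1 but looks only at the low bit of x.

```c
#include <assert.h>

/* Logical NOT: the result is always exactly 0 or 1, whatever x was. */
int flip_not(int x)
{
	return !x;
}

/* These two are only a 0<->1 flip if x is already 0 or 1. */
int flip_sub(int x)
{
	return 1 - x;
}

int flip_xor(int x)
{
	return x ^ 1;
}

/* Always 0 or 1, but it toggles the low bit rather than testing for zero. */
int flip_mask(int x)
{
	return (x + 1) & 1;
}

void check_flips(void)
{
	int x;

	/* For inputs 0 and 1 all four forms agree. */
	for (x = 0; x <= 1; x++) {
		assert(flip_not(x) == 1 - x);
		assert(flip_sub(x) == flip_not(x));
		assert(flip_xor(x) == flip_not(x));
		assert(flip_mask(x) == flip_not(x));
	}

	/* Any non-zero input is "true", so !x is 0. */
	assert(flip_not(42) == 0);
	assert(flip_not(3) == 0);

	/* The arithmetic forms do not normalise a non-0/1 input. */
	assert(flip_sub(42) == -41);
	assert(flip_xor(42) == 43);

	/* The mask form stays in 0/1 but depends on parity, not on zero-ness. */
	assert(flip_mask(42) == 1);
	assert(flip_mask(3) == 0);
}
```

Calling check_flips() from a small main() runs every assertion. Which
form the compiler actually emits for any of these depends on the
surrounding code, as noted above.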

Re: The SX4 challenge

2008-01-16 Thread Mark Lord


Jeff Garzik wrote:
..
> Thus, the "SX4 challenge" is a challenge to developers to figure out the
> most optimal configuration for this hardware, given the existing MD and
> DM work going on.

..

This sort of RAID optimization hardware is not unique to the SX4,
so hopefully we can work out a way to take advantage of similar/different
RAID throughput features of other chipsets too (eventually).

This could be a good topic for discussion/beer in San Jose next month..

Cheers

Re: [PATCH] Set pnp_init_resource_table, pnp_resource_change, pnp_manual_config_dev deprecated

2008-01-16 Thread Len Brown

Thomas,
If you send me a checkpatch.pl clean version, I'll apply it for 2.6.25.

Also, how is the dynamic allocation of pnp resources coming along?
We'll be wanting to do that on day1 of 2.6.25 integration window.

thanks,
-Len

> I don't know how many externally built drivers, which are making use of
> this, could still be out there?
> What is the general policy for removing such old, rarely used and "being
> more a workaround than an interface" exported symbols?
> 
> This should be 2.6.24 material:
> 
> Mark pnp_init_resource_table, pnp_resource_change, pnp_manual_config_dev 
> deprecated
> 
> Thanks to Rene Herman, the remaining calls to those functions got eliminated
> in the sound/isa layer recently.
> Those functions are a workaround for wrong BIOS pnp information and give
> drivers the possibility to override BIOS exported PNP resources.
> This can be done through sysfs since 2.6, therefore these functions should
> vanish rather soon, as dynamic allocation for PNP resources is depending
> on it.
> 
> Signed-off-by: Thomas Renninger <[EMAIL PROTECTED]>
> 
> ---
>  include/linux/pnp.h |   14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> Index: linux-2.6.24-rc3-mm2/include/linux/pnp.h
> ===
> --- linux-2.6.24-rc3-mm2.orig/include/linux/pnp.h
> +++ linux-2.6.24-rc3-mm2/include/linux/pnp.h
> @@ -387,8 +387,8 @@ int pnp_register_dma_resource(struct pnp
>  int pnp_register_port_resource(struct pnp_option *option,
>  struct pnp_port *data);
>  int pnp_register_mem_resource(struct pnp_option *option, struct pnp_mem 
> *data);
> -void pnp_init_resource_table(struct pnp_resource_table *table);
> -int pnp_manual_config_dev(struct pnp_dev *dev, struct pnp_resource_table 
> *res,
> +void __deprecated pnp_init_resource_table(struct pnp_resource_table *table);
> +int __deprecated pnp_manual_config_dev(struct pnp_dev *dev, struct 
> pnp_resource_table *res,
> int mode);
>  int pnp_auto_config_dev(struct pnp_dev *dev);
>  int pnp_validate_config(struct pnp_dev *dev);
> @@ -396,8 +396,8 @@ int pnp_start_dev(struct pnp_dev *dev);
>  int pnp_stop_dev(struct pnp_dev *dev);
>  int pnp_activate_dev(struct pnp_dev *dev);
>  int pnp_disable_dev(struct pnp_dev *dev);
> -void pnp_resource_change(struct resource *resource, resource_size_t start,
> -  resource_size_t size);
> +void __deprecated pnp_resource_change(struct resource *resource, 
> resource_size_t start,
> +   resource_size_t size);
>  
>  /* protocol helpers */
>  int pnp_is_active(struct pnp_dev *dev);
> @@ -436,15 +436,15 @@ static inline int pnp_register_irq_resou
>  static inline int pnp_register_dma_resource(struct pnp_option *option, 
> struct pnp_dma *data) { return -ENODEV; }
>  static inline int pnp_register_port_resource(struct pnp_option *option, 
> struct pnp_port *data) { return -ENODEV; }
>  static inline int pnp_register_mem_resource(struct pnp_option *option, 
> struct pnp_mem *data) { return -ENODEV; }
> -static inline void pnp_init_resource_table(struct pnp_resource_table *table) 
> { }
> -static inline int pnp_manual_config_dev(struct pnp_dev *dev, struct 
> pnp_resource_table *res, int mode) { return -ENODEV; }
> +static inline void __deprecated pnp_init_resource_table(struct 
> pnp_resource_table *table) { }
> +static inline int __deprecated pnp_manual_config_dev(struct pnp_dev *dev, 
> struct pnp_resource_table *res, int mode) { return -ENODEV; }
>  static inline int pnp_auto_config_dev(struct pnp_dev *dev) { return -ENODEV; 
> }
>  static inline int pnp_validate_config(struct pnp_dev *dev) { return -ENODEV; 
> }
>  static inline int pnp_start_dev(struct pnp_dev *dev) { return -ENODEV; }
>  static inline int pnp_stop_dev(struct pnp_dev *dev) { return -ENODEV; }
>  static inline int pnp_activate_dev(struct pnp_dev *dev) { return -ENODEV; }
>  static inline int pnp_disable_dev(struct pnp_dev *dev) { return -ENODEV; }
> -static inline void pnp_resource_change(struct resource *resource, 
> resource_size_t start, resource_size_t size) { }
> +static inline void __deprecated pnp_resource_change(struct resource 
> *resource, resource_size_t start, resource_size_t size) { }
>  
>  /* protocol helpers */
>  static inline int pnp_is_active(struct pnp_dev *dev) { return 0; }
> 
> 

[patch 3/3] Add bug/warn marker to generic report_bug()

2008-01-16 Thread Olof Johansson

Powerpc uses the generic report_bug() from lib/bug.c to report warnings,
and I'm guessing other arches do as well.

Add the module list as well as the end-of-trace marker to the output. This
required making print_oops_end_marker() nonstatic.


Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>


diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 94bc996..88d1aa3 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -133,6 +133,7 @@ NORET_TYPE void panic(const char * fmt, ...)
 extern void oops_enter(void);
 extern void oops_exit(void);
 extern int oops_may_print(void);
+extern void print_oops_end_marker(void);
 fastcall NORET_TYPE void do_exit(long error_code)
ATTRIB_NORET;
 NORET_TYPE void complete_and_exit(struct completion *, long)
diff --git a/kernel/panic.c b/kernel/panic.c
index d9e90cf..0269a7f 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -281,7 +281,7 @@ static int init_oops_id(void)
 }
 late_initcall(init_oops_id);
 
-static void print_oops_end_marker(void)
+void print_oops_end_marker(void)
 {
init_oops_id();
printk(KERN_WARNING "---[ end trace %016llx ]---\n",
diff --git a/lib/bug.c b/lib/bug.c
index 530f38f..3aa60a5 100644
--- a/lib/bug.c
+++ b/lib/bug.c
@@ -148,7 +148,9 @@ enum bug_trap_type report_bug(unsigned long bugaddr, struct pt_regs *regs)
   "[verbose debug info unavailable]\n",
   (void *)bugaddr);
 
+   print_modules();
show_regs(regs);
+   print_oops_end_marker();
return BUG_TRAP_TYPE_WARN;
}
 

[patch 2/3] [POWERPC] switch to generic WARN_ON / BUG_ON

2008-01-16 Thread Olof Johansson

Not using the ppc-specific WARN_ON/BUG_ON constructs actually saves about
4K text on a ppc64_defconfig.  The main reason seems to be that prepping
the arguments to the conditional trap instructions is more work than just
doing a compare and branch.

Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Cc: Paul Mackerras <[EMAIL PROTECTED]>
---

 include/asm-powerpc/bug.h |   37 -
 1 file changed, 37 deletions(-)

Index: linux-2.6.24-rc6/include/asm-powerpc/bug.h
===
--- linux-2.6.24-rc6.orig/include/asm-powerpc/bug.h
+++ linux-2.6.24-rc6/include/asm-powerpc/bug.h
@@ -54,12 +54,6 @@
".previous\n"
 #endif
 
-/*
- * BUG_ON() and WARN_ON() do their best to cooperate with compile-time
- * optimisations. However depending on the complexity of the condition
- * some compiler versions may not produce optimal results.
- */
-
 #define BUG() do { \
__asm__ __volatile__(   \
"1: twi 31,0,0\n"   \
@@ -69,20 +63,6 @@
for(;;) ;   \
 } while (0)
 
-#define BUG_ON(x) do { \
-   if (__builtin_constant_p(x)) {  \
-   if (x)  \
-   BUG();  \
-   } else {\
-   __asm__ __volatile__(   \
-   "1: "PPC_TLNEI" %4,0\n" \
-   _EMIT_BUG_ENTRY \
-   : : "i" (__FILE__), "i" (__LINE__), "i" (0),\
- "i" (sizeof(struct bug_entry)),   \
- "r" ((__force long)(x))); \
-   }   \
-} while (0)
-
 #define __WARN() do {  \
__asm__ __volatile__(   \
"1: twi 31,0,0\n"   \
@@ -92,23 +72,6 @@
  "i" (sizeof(struct bug_entry)));  \
 } while (0)
 
-#define WARN_ON(x) ({  \
-   int __ret_warn_on = !!(x);  \
-   if (__builtin_constant_p(__ret_warn_on)) {  \
-   if (__ret_warn_on)  \
-   __WARN();   \
-   } else {\
-   __asm__ __volatile__(   \
-   "1: "PPC_TLNEI" %4,0\n" \
-   _EMIT_BUG_ENTRY \
-   : : "i" (__FILE__), "i" (__LINE__), \
- "i" (BUGFLAG_WARNING),\
- "i" (sizeof(struct bug_entry)),   \
- "r" (__ret_warn_on)); \
-   }   \
-   unlikely(__ret_warn_on);\
-})
-
 #endif /* __ASSEMBLY __ */
 #endif /* CONFIG_BUG */
 



[PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-16 Thread Anton Salikhmetov

http://bugzilla.kernel.org/show_bug.cgi?id=2645

Changes for updating the ctime and mtime fields for memory-mapped files:

1) a new flag triggering update of the inode data;
2) a new field in the address_space structure for saving modification time;
3) a new helper function to update ctime and mtime when needed;
4) updating time stamps for mapped files in sys_msync() and do_fsync();
5) implementing lazy ctime and mtime update.
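
The five points above can be sketched in plain userspace C. This is only
an illustration under stated assumptions: every demo_* name and structure
below is hypothetical and stands in for the kernel objects touched by the
patch, the block-device case is left out, and the comparison helper is
assumed to follow the usual "negative, zero, positive" convention.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel structures, not the real layouts. */
struct demo_time {
	long sec;
	long nsec;
};

struct demo_inode {
	struct demo_time mtime;
	struct demo_time ctime;
	bool dirty;
};

struct demo_mapping {
	struct demo_inode *host;
	struct demo_time mtime;		/* point 2: saved modification time */
	bool mctime_pending;		/* point 1: "update needed" flag */
};

/* Returns <0, 0 or >0 depending on how a orders against b. */
int demo_time_compare(const struct demo_time *a, const struct demo_time *b)
{
	if (a->sec != b->sec)
		return a->sec < b->sec ? -1 : 1;
	if (a->nsec != b->nsec)
		return a->nsec < b->nsec ? -1 : 1;
	return 0;
}

/* Dirtying a page only records the time and raises the flag. */
void demo_page_dirtied(struct demo_mapping *m, struct demo_time now)
{
	m->mtime = now;
	m->mctime_pending = true;
}

/* Point 3: move the inode stamps forward, never backward. */
void demo_inode_update_time(struct demo_inode *inode,
			    const struct demo_time *ts)
{
	if (demo_time_compare(&inode->mtime, ts) < 0) {
		inode->mtime = *ts;
		inode->dirty = true;
	}
	if (demo_time_compare(&inode->ctime, ts) < 0) {
		inode->ctime = *ts;
		inode->dirty = true;
	}
}

/* Points 4 and 5: at sync time, apply the saved stamp once. */
void demo_mapping_update_time(struct demo_mapping *m)
{
	if (!m->mctime_pending)
		return;
	m->mctime_pending = false;
	demo_inode_update_time(m->host, &m->mtime);
}

void demo_check_lazy_update(void)
{
	struct demo_inode inode = { { 100, 0 }, { 100, 0 }, false };
	struct demo_mapping map = { &inode, { 0, 0 }, false };

	/* Nothing was dirtied, so a sync leaves the inode alone. */
	demo_mapping_update_time(&map);
	assert(!inode.dirty && inode.mtime.sec == 100);

	/* Two writes through the mapping: the inode is not touched yet. */
	demo_page_dirtied(&map, (struct demo_time){ 105, 0 });
	demo_page_dirtied(&map, (struct demo_time){ 107, 500 });
	assert(inode.mtime.sec == 100);

	/* The sync applies the latest saved stamp and clears the flag. */
	demo_mapping_update_time(&map);
	assert(inode.dirty);
	assert(inode.mtime.sec == 107 && inode.mtime.nsec == 500);
	assert(inode.ctime.sec == 107 && inode.ctime.nsec == 500);
	assert(!map.mctime_pending);
}
```

The sketch keeps the inode untouched on every write and pays the update
cost once per sync, which is the "lazy" part of the description.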

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 fs/buffer.c |3 ++
 fs/fs-writeback.c   |2 +
 fs/inode.c  |   43 +++--
 fs/sync.c   |2 +
 include/linux/fs.h  |   13 +-
 include/linux/pagemap.h |3 +-
 mm/msync.c  |   61 +-
 mm/page-writeback.c |   54 ++---
 8 files changed, 124 insertions(+), 57 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..3967aa7 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -701,6 +701,9 @@ static int __set_page_dirty(struct page *page,
if (unlikely(!mapping))
return !TestSetPageDirty(page);
 
+   mapping->mtime = CURRENT_TIME;
+   set_bit(AS_MCTIME, &mapping->flags);
+
if (TestSetPageDirty(page))
return 0;
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 300324b..affd291 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
 
spin_unlock(&inode_lock);
 
+   mapping_update_time(mapping);
+
ret = do_writepages(mapping, wbc);
 
/* Don't write the inode if only I_DIRTY_PAGES was set */
diff --git a/fs/inode.c b/fs/inode.c
index ed35383..edd5bf4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1243,8 +1243,10 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
 EXPORT_SYMBOL(touch_atime);
 
 /**
- * file_update_time-   update mtime and ctime time
- * @file: file accessed
+ * inode_update_time   -   update mtime and ctime time
+ * @inode: inode accessed
+ * @ts: time when inode was accessed
+ * @sync: whether to do synchronous update
  *
  * Update the mtime and ctime members of an inode and mark the inode
  * for writeback.  Note that this function is meant exclusively for
@@ -1253,11 +1255,8 @@ EXPORT_SYMBOL(touch_atime);
  * S_NOCTIME inode flag, e.g. for network filesystem where these
  * timestamps are handled by the server.
  */
-
-void file_update_time(struct file *file)
+void inode_update_time(struct inode *inode, struct timespec *ts)
 {
-   struct inode *inode = file->f_path.dentry->d_inode;
-   struct timespec now;
int sync_it = 0;
 
if (IS_NOCMTIME(inode))
@@ -1265,22 +1264,41 @@ void file_update_time(struct file *file)
if (IS_RDONLY(inode))
return;
 
-   now = current_fs_time(inode->i_sb);
-   if (!timespec_equal(&inode->i_mtime, &now)) {
-   inode->i_mtime = now;
+   if (timespec_compare(&inode->i_mtime, ts) < 0) {
+   inode->i_mtime = *ts;
sync_it = 1;
}
 
-   if (!timespec_equal(&inode->i_ctime, &now)) {
-   inode->i_ctime = now;
+   if (timespec_compare(&inode->i_ctime, ts) < 0) {
+   inode->i_ctime = *ts;
sync_it = 1;
}
 
if (sync_it)
mark_inode_dirty_sync(inode);
 }
+EXPORT_SYMBOL(inode_update_time);
 
-EXPORT_SYMBOL(file_update_time);
+/*
+ * Update the ctime and mtime stamps after checking if they are to be updated.
+ */
+void mapping_update_time(struct address_space *mapping)
+{
+   if (test_and_clear_bit(AS_MCTIME, &mapping->flags)) {
+   struct inode *inode = mapping->host;
+   struct timespec *ts = &mapping->mtime;
+
+   if (S_ISBLK(inode->i_mode)) {
+   struct block_device *bdev = inode->i_bdev;
+
+   mutex_lock(&bdev->bd_mutex);
+   list_for_each_entry(inode, &bdev->bd_inodes, i_devices)
+   inode_update_time(inode, ts);
+   mutex_unlock(&bdev->bd_mutex);
+   } else
+   inode_update_time(inode, ts);
+   }
+}
 
 int inode_needs_sync(struct inode *inode)
 {
@@ -1290,7 +1308,6 @@ int inode_needs_sync(struct inode *inode)
return 1;
return 0;
 }
-
 EXPORT_SYMBOL(inode_needs_sync);
 
 int inode_wait(void *word)
diff --git a/fs/sync.c b/fs/sync.c
index 7cd005e..5561464 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -87,6 +87,8 @@ long do_fsync(struct file *file, int datasync)
goto out;
}
 
+   mapping_update_time(mapping);
+
ret = filemap_fdatawrite(mapping);
 
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..f0d3ced 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -511,6 +511,7 @@ struct address_space {

[patch 1/3] bug.h: Remove HAVE_ARCH_BUG and HAVE_ARCH_WARN

2008-01-16 Thread Olof Johansson

No need to have the HAVE_ARCH_BUG.* / HAVE_ARCH_WARN.* defines, when
the generic implementation can just use #ifndef on the macros themselves.
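
A small standalone sketch of the idea, assuming nothing beyond the C
preprocessor. The counters and the demo_* names are hypothetical and
exist only to show which definition ends up in use. The first part
stands in for an arch header that supplies its own BUG_ON(), the second
for the generic header that fills in whatever is still undefined.

```c
#include <assert.h>

/* Stand-in for an arch header: it provides its own BUG_ON() only. */
int demo_arch_hits;	/* hypothetical counter, for the demonstration */
#define BUG_ON(cond) do { if (cond) demo_arch_hits++; } while (0)

/* Stand-in for the generic header, which is included afterwards. */
int demo_generic_hits;	/* hypothetical counter, for the demonstration */

#ifndef BUG_ON		/* skipped: the arch already defined the macro */
#define BUG_ON(cond) do { if (cond) demo_generic_hits += 100; } while (0)
#endif

#ifndef WARN_ON		/* taken: the arch did not define WARN_ON() */
#define WARN_ON(cond) ((cond) ? (demo_generic_hits++, 1) : 0)
#endif

void demo_check_fallback(void)
{
	int warned;

	BUG_ON(1);		/* expands to the arch version */
	warned = WARN_ON(1);	/* expands to the generic fallback */

	assert(demo_arch_hits == 1);
	assert(demo_generic_hits == 1);
	assert(warned == 1);
}
```

No separate HAVE_ARCH_* symbol is needed in this arrangement because the
macro name itself records whether the arch provided an implementation.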

Signed-off-by: Olof Johansson <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>

---

 include/asm-alpha/bug.h   |1 -
 include/asm-arm/bug.h |1 -
 include/asm-avr32/bug.h   |3 ---
 include/asm-frv/bug.h |1 -
 include/asm-generic/bug.h |   10 +-
 include/asm-ia64/bug.h|1 -
 include/asm-m68k/bug.h|1 -
 include/asm-mips/bug.h|4 
 include/asm-parisc/bug.h  |2 --
 include/asm-powerpc/bug.h |3 ---
 include/asm-s390/bug.h|2 --
 include/asm-sparc/bug.h   |1 -
 include/asm-sparc64/bug.h |1 -
 include/asm-v850/bug.h|1 -
 include/asm-x86/bug.h |1 -
 15 files changed, 5 insertions(+), 28 deletions(-)

Index: linux-2.6.24-rc6/include/asm-alpha/bug.h
===
--- linux-2.6.24-rc6.orig/include/asm-alpha/bug.h
+++ linux-2.6.24-rc6/include/asm-alpha/bug.h
@@ -10,7 +10,6 @@
   __asm__ __volatile__("call_pal %0  # bugchk\n\t"".long %1\n\t.8byte %2" \
   : : "i" (PAL_bugchk), "i"(__LINE__), "i"(__FILE__))
 
-#define HAVE_ARCH_BUG
 #endif
 
 #include 
Index: linux-2.6.24-rc6/include/asm-arm/bug.h
===
--- linux-2.6.24-rc6.orig/include/asm-arm/bug.h
+++ linux-2.6.24-rc6/include/asm-arm/bug.h
@@ -16,7 +16,6 @@ extern void __bug(const char *file, int 
 
 #endif
 
-#define HAVE_ARCH_BUG
 #endif
 
 #include 
Index: linux-2.6.24-rc6/include/asm-avr32/bug.h
===
--- linux-2.6.24-rc6.orig/include/asm-avr32/bug.h
+++ linux-2.6.24-rc6/include/asm-avr32/bug.h
@@ -63,9 +63,6 @@
unlikely(__ret_warn_on);\
})
 
-#define HAVE_ARCH_BUG
-#define HAVE_ARCH_WARN_ON
-
 #endif /* CONFIG_BUG */
 
 #include 
Index: linux-2.6.24-rc6/include/asm-frv/bug.h
===
--- linux-2.6.24-rc6.orig/include/asm-frv/bug.h
+++ linux-2.6.24-rc6/include/asm-frv/bug.h
@@ -32,7 +32,6 @@ do {  \
asm volatile("nop");\
 } while(0)
 
-#define HAVE_ARCH_BUG
 #define BUG()  \
 do {   \
_debug_bug_printk();\
Index: linux-2.6.24-rc6/include/asm-generic/bug.h
===
--- linux-2.6.24-rc6.orig/include/asm-generic/bug.h
+++ linux-2.6.24-rc6/include/asm-generic/bug.h
@@ -20,14 +20,14 @@ struct bug_entry {
 #define BUGFLAG_WARNING(1<<0)
 #endif /* CONFIG_GENERIC_BUG */
 
-#ifndef HAVE_ARCH_BUG
+#ifndef BUG
 #define BUG() do { \
printk("BUG: failure at %s:%d/%s()!\n", __FILE__, __LINE__, 
__FUNCTION__); \
panic("BUG!"); \
 } while (0)
 #endif
 
-#ifndef HAVE_ARCH_BUG_ON
+#ifndef BUG_ON
 #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while(0)
 #endif
 
@@ -49,15 +49,15 @@ extern void warn_on_slowpath(const char 
 #endif
 
 #else /* !CONFIG_BUG */
-#ifndef HAVE_ARCH_BUG
+#ifndef BUG
 #define BUG()
 #endif
 
-#ifndef HAVE_ARCH_BUG_ON
+#ifndef BUG_ON
 #define BUG_ON(condition) do { if (condition) ; } while(0)
 #endif
 
-#ifndef HAVE_ARCH_WARN_ON
+#ifndef WARN_ON
 #define WARN_ON(condition) ({  \
int __ret_warn_on = !!(condition);  \
unlikely(__ret_warn_on);\
Index: linux-2.6.24-rc6/include/asm-ia64/bug.h
===
--- linux-2.6.24-rc6.orig/include/asm-ia64/bug.h
+++ linux-2.6.24-rc6/include/asm-ia64/bug.h
@@ -6,7 +6,6 @@
 #define BUG() do { printk("kernel BUG at %s:%d!\n", __FILE__, __LINE__); 
ia64_abort(); } while (0)
 
 /* should this BUG be made generic? */
-#define HAVE_ARCH_BUG
 #endif
 
 #include 
Index: linux-2.6.24-rc6/include/asm-m68k/bug.h
===
--- linux-2.6.24-rc6.orig/include/asm-m68k/bug.h
+++ linux-2.6.24-rc6/include/asm-m68k/bug.h
@@ -21,7 +21,6 @@
 } while (0)
 #endif
 
-#define HAVE_ARCH_BUG
 #endif
 
 #include 
Index: linux-2.6.24-rc6/include/asm-mips/bug.h
===
--- linux-2.6.24-rc6.orig/include/asm-mips/bug.h
+++ linux-2.6.24-rc6/include/asm-mips/bug.h
@@ -12,8 +12,6 @@ do {  
\
__asm__ __volatile__("break %0" : : "i" (BRK_BUG)); \
 } while (0)
 
-#define HAVE_ARCH_BUG
-
 #if (_MIPS_ISA > _MIPS_ISA_MIPS1)
 
 #define BUG_ON(condition)  \
@@ -22,8 +20,6 @@ do {  
\

[PATCH -v5 1/2] Massive code cleanup of sys_msync()

2008-01-16 Thread Anton Salikhmetov

Substantial code cleanup of the sys_msync() function:

1) using the PAGE_ALIGN() macro instead of "manual" alignment;
2) improved readability of the loop traversing the process memory regions.
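
To illustrate point 1, here is a userspace sketch showing that the
open-coded rounding removed by the patch and the usual shape of a
page-align macro give the same result. The DEMO_* constants are
assumptions (a 4096-byte page); the kernel takes the real values from the
architecture, and its PAGE_ALIGN() definition is only approximated here.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values only; not taken from any kernel header. */
#define DEMO_PAGE_SHIFT	12
#define DEMO_PAGE_SIZE	((size_t)1 << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK	(~(DEMO_PAGE_SIZE - 1))

/* The "manual" form: ~mask is page_size - 1, so this rounds up. */
size_t demo_align_manual(size_t len)
{
	return (len + ~DEMO_PAGE_MASK) & DEMO_PAGE_MASK;
}

/* The assumed shape of a page-align macro. */
size_t demo_align_macro(size_t len)
{
	return (len + DEMO_PAGE_SIZE - 1) & DEMO_PAGE_MASK;
}

void demo_check_alignment(void)
{
	static const size_t samples[] = { 0, 1, 4095, 4096, 4097, 8192, 12345 };
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		size_t len = samples[i];
		size_t aligned = demo_align_macro(len);

		/* Both spellings agree. */
		assert(demo_align_manual(len) == aligned);
		/* The result is a multiple of the page size ... */
		assert(aligned % DEMO_PAGE_SIZE == 0);
		/* ... and is the smallest such multiple not below len. */
		assert(aligned >= len && aligned - len < DEMO_PAGE_SIZE);
	}
}
```

Lengths close to the top of the address range can wrap when rounded up,
which is why the code that follows still checks for end < start.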

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/msync.c |   74 +--
 1 files changed, 36 insertions(+), 38 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..44997bf 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,24 +1,22 @@
 /*
- * linux/mm/msync.c
+ * The msync() system call.
  *
- * Copyright (C) 1994-1999  Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*
- * The msync() system call.
- */
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
+#include 
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
  * The application may now run fsync() to
@@ -33,8 +31,7 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
unsigned long end;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   int error = -EINVAL, unmapped_error = 0;
 
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
@@ -42,62 +39,63 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
goto out;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
goto out;
-   error = -ENOMEM;
-   len = (len + ~PAGE_MASK) & PAGE_MASK;
+
+   len = PAGE_ALIGN(len);
end = start + len;
-   if (end < start)
+   if (end < start) {
+   error = -ENOMEM;
goto out;
+   }
+
error = 0;
+
if (end == start)
goto out;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start < end. */
-   error = -ENOMEM;
-   if (!vma)
-   goto out_unlock;
-   /* Here start < vma->vm_end. */
+   if (!vma) {
+   error = -ENOMEM;
+   break;
+   }
if (start < vma->vm_start) {
start = vma->vm_start;
-   if (start >= end)
-   goto out_unlock;
+   if (start >= end) {
+   error = -ENOMEM;
+   break;
+   }
unmapped_error = -ENOMEM;
}
-   /* Here vma->vm_start <= start < vma->vm_end. */
-   if ((flags & MS_INVALIDATE) &&
-   (vma->vm_flags & VM_LOCKED)) {
+   if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED)) {
error = -EBUSY;
-   goto out_unlock;
+   break;
}
-   file = vma->vm_file;
start = vma->vm_end;
-   if ((flags & MS_SYNC) && file &&
-   (vma->vm_flags & VM_SHARED)) {
+
+   file = vma->vm_file;
+   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
get_file(file);
up_read(&mm->mmap_sem);
error = do_fsync(file, 0);
fput(file);
-   if (error || start >= end)
+   if (error)
goto out;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
-   } else {
-   if (start >= end) {
-   error = 0;
-   goto out_unlock;
-   }
-   vma = vma->vm_next;
+   continue;
}
-   }
-out_unlock:
+
+   vma = vma->vm_next;
+   } while (start < end);
up_read(&mm->mmap_sem);
+
 out:
-   return error ? : unmapped_error;
+   return error ? error : unmapped_error;
 }
-- 
1.4.4.4
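
For readers following the two cleanup items, here is a rough user-space sketch, not kernel code. It assumes a 4096-byte page, and every name in it (DEMO_PAGE_SIZE, DEMO_PAGE_ALIGN, struct region, manual_align, walk_range) is made up for illustration. It shows that the old open-coded rounding and a PAGE_ALIGN-style macro agree, and it mimics the rewritten do/while walk over mapped regions with the MS_INVALIDATE and fsync parts left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed page size; the real value is architecture-dependent. */
#define DEMO_PAGE_SIZE     4096UL
#define DEMO_PAGE_MASK     (~(DEMO_PAGE_SIZE - 1))
/* Same shape as the kernel's PAGE_ALIGN(): round up to a page boundary. */
#define DEMO_PAGE_ALIGN(x) (((x) + DEMO_PAGE_SIZE - 1) & DEMO_PAGE_MASK)

/* Hypothetical stand-in for a VMA: covers [start, end). */
struct region {
	unsigned long start;
	unsigned long end;
};

/* The "manual" alignment that the patch removes. */
unsigned long manual_align(unsigned long len)
{
	return (len + ~DEMO_PAGE_MASK) & DEMO_PAGE_MASK;
}

/*
 * Walk [start, start + len) over a sorted, non-overlapping region array.
 * Returns 0 if the range is fully covered, -ENOMEM if any part is not.
 */
int walk_range(const struct region *r, size_t n,
	       unsigned long start, unsigned long len)
{
	unsigned long end;
	int unmapped_error = 0;
	size_t i = 0;

	assert(r != NULL || n == 0);

	len = DEMO_PAGE_ALIGN(len);
	end = start + len;
	if (end < start)		/* address wrap-around */
		return -ENOMEM;
	if (end == start)
		return 0;

	/* Like find_vma(): first region that ends above start. */
	while (i < n && r[i].end <= start)
		i++;

	do {
		if (i == n)
			return -ENOMEM;	/* ran out of regions */
		if (start < r[i].start) {
			start = r[i].start;
			if (start >= end)
				return -ENOMEM;
			unmapped_error = -ENOMEM; /* hole, keep going */
		}
		start = r[i].end;
		i++;
	} while (start < end);

	return unmapped_error;
}
```

With regions {0x1000,0x3000} and {0x4000,0x6000}, a range lying inside the first region is reported as fully mapped, while a range that crosses the hole at 0x3000 is walked to the end and then reported as -ENOMEM, which is the "ignore them, but return -ENOMEM at the end" behaviour described in the comment in the patch.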


[PATCH -v5 0/2] Updating ctime and mtime for memory-mapped files

2008-01-16 Thread Anton Salikhmetov

This is the fifth version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

New since the previous version:

1) the case of retouching an already-dirty page pointed out
   by Miklos Szeredi has been correctly addressed;

2) a few cosmetic changes according to the latest feedback;

3) fixed the error of calling a possibly sleeping function
   from an atomic context.

The design for the first item above was suggested by Peter Zijlstra:

> It would require scanning the PTEs and marking them read-only again on
> MS_ASYNC, and some more logic in set_page_dirty() because that currently
> bails out early if the page in question is already dirty.

Miklos' test program now produces the following output for
the repeated calls to msync() with the MS_ASYNC flag:

debian:~/miklos# ./miklos_test file
begin    1200529196  1200529196  1200528798
write    1200529197  1200529197  1200528798
mmap     1200529197  1200529197  1200529198
b        1200529197  1200529197  1200529198
msync b  1200529199  1200529199  1200529198
c        1200529199  1200529199  1200529198
msync c  1200529201  1200529201  1200529198
d        1200529201  1200529201  1200529198
munmap   1200529201  1200529201  1200529198
close    1200529201  1200529201  1200529198
sync     1200529204  1200529204  1200529198
debian:~/miklos#

Miklos' test program can be found using the following link:

http://lkml.org/lkml/2008/1/14/104
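
For orientation only, a test of this kind might look roughly like the sketch below. This is not Miklos' program: the function names, the column order (ctime, mtime, atime) and the delay parameter are assumptions made for illustration. Only standard POSIX calls are used (open, ftruncate, mmap, msync, fstat, munmap).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Print one row: step label, then ctime, mtime, atime (assumed order). */
static int show_times(int fd, const char *step)
{
	struct stat st;

	if (fstat(fd, &st) != 0)
		return -1;
	printf("%-9s%ld  %ld  %ld\n", step,
	       (long)st.st_ctime, (long)st.st_mtime, (long)st.st_atime);
	return 0;
}

/*
 * Dirty a shared mapping of 'path' twice, calling msync(MS_ASYNC) after
 * each store and printing the file times at every step. 'delay' is the
 * number of seconds to sleep between steps so that changes are visible.
 * Returns 0 on success, -1 on any failure.
 */
int msync_async_probe(const char *path, unsigned int delay)
{
	long page = sysconf(_SC_PAGESIZE);
	char label[16];
	char *p;
	char c;
	int fd, rc = -1;

	assert(path != NULL);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (page <= 0 || ftruncate(fd, (off_t)page) != 0)
		goto out_close;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		goto out_close;
	if (show_times(fd, "mmap") != 0)
		goto out_unmap;

	for (c = 'b'; c <= 'c'; c++) {
		sleep(delay);
		p[0] = c;			/* touch the mapped page */
		snprintf(label, sizeof(label), "%c", c);
		if (show_times(fd, label) != 0)
			goto out_unmap;

		sleep(delay);
		if (msync(p, (size_t)page, MS_ASYNC) != 0)
			goto out_unmap;
		snprintf(label, sizeof(label), "msync %c", c);
		if (show_times(fd, label) != 0)
			goto out_unmap;
	}
	rc = 0;

out_unmap:
	munmap(p, (size_t)page);
out_close:
	close(fd);
	return rc;
}
```

Called with a delay of a second or two, the rows to watch are the "msync b" and "msync c" ones: in the output above, those are the steps at which the first two columns move forward.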

The SX4 challenge

2008-01-16 Thread Jeff Garzik



Promise just gave permission to post the docs for their PDC20621 (i.e. 
SX4) hardware:

http://gkernel.sourceforge.net/specs/promise/pdc20621-pguide-1.2.pdf.bz2

joining the existing PDC20621 DIMM and PLL docs:
http://gkernel.sourceforge.net/specs/promise/pdc20621-pguide-dimm-1.6.pdf.bz2
http://gkernel.sourceforge.net/specs/promise/pdc20621-pguide-pll-ata-timing-1.2.pdf.bz2


So, the SX4 is now open.  Yay :)  I am hoping to talk Mikael into 
becoming the sata_sx4 maintainer, and finally integrating my 'new-eh' 
conversion in libata-dev.git.


But now is a good time to remind people how lame the sata_sx4 driver 
software really is -- and I should know, I wrote it.


The SX4 hardware, simplified, is three pieces:  XOR engine (for raid5), 
host<->board memcpy engine, and several ATA engines (and some helpful 
transaction sequencing features).  Data for each WRITE command is first 
copied to the board RAM, then the ATA engines DMA to/from the board RAM. 
 Data for each READ command is copied to board RAM via the ATA engines, 
then DMA'd across PCI to your host memory.


Therefore, while it is not hardware RAID, the SX4 provides all the 
pieces necessary to offload RAID1 and RAID5, and handle other RAID 
levels optimally.  RAID1 and 5 copies can be offloaded (provided all 
copies go to SX4-attached devices of course).  RAID5 XOR gen and 
checking can be offloaded, allowing the OS to see a single request, 
while the hardware processes a sequence of low-level requests sent in a 
batch.
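
As a reminder of what "XOR gen and checking" amounts to, here is a minimal host-side sketch of RAID5 parity in plain C. It says nothing about how the SX4 XOR engine is programmed; the function names and the flat stripe layout (nblocks data blocks of len bytes, back to back) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Generate parity: byte-wise XOR of all data blocks in the stripe. */
void raid5_parity(const uint8_t *stripe, size_t nblocks, size_t len,
		  uint8_t *parity)
{
	size_t b, i;

	memset(parity, 0, len);
	for (b = 0; b < nblocks; b++)
		for (i = 0; i < len; i++)
			parity[i] ^= stripe[b * len + i];
}

/*
 * Rebuild one missing data block: XOR the parity with every surviving
 * block. The contents of the missing block in 'stripe' are never read.
 */
void raid5_rebuild(const uint8_t *stripe, size_t nblocks, size_t len,
		   const uint8_t *parity, size_t missing, uint8_t *out)
{
	size_t b, i;

	assert(missing < nblocks);

	memcpy(out, parity, len);
	for (b = 0; b < nblocks; b++) {
		if (b == missing)
			continue;
		for (i = 0; i < len; i++)
			out[i] ^= stripe[b * len + i];
	}
}
```

Checking is the same operation: recompute the parity over the data blocks and compare it with the stored parity block. Offloading means these loops run on the board instead of on the host CPU.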


This hardware presents an interesting challenge:  it does not really fit 
into software RAID (i.e. no RAID) /or/ hardware RAID categories.  The 
sata_sx4 driver presents the no-RAID configuration, which is terribly 
inefficient:


WRITE:
submit host DMA (copy to board)
host DMA completion via interrupt
submit ATA command
ATA command completion via interrupt
READ:
submit ATA command
ATA command completion via interrupt
submit host DMA (copy from board)
host DMA completion via interrupt
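
To make the cost concrete, the two sequences above can be modelled as a tiny per-command state machine. This is only a sketch with made-up names (struct sx4_cmd, sx4_cmd_start, sx4_cmd_irq); it is not taken from the sata_sx4 source. The point is that every command goes through two submissions and two completion interrupts.

```c
#include <assert.h>
#include <stddef.h>

enum sx4_dir  { SX4_WRITE, SX4_READ };
enum sx4_step { SX4_HOST_DMA, SX4_ATA_CMD, SX4_DONE };

/* Hypothetical per-command bookkeeping. */
struct sx4_cmd {
	enum sx4_dir dir;
	enum sx4_step step;	/* which engine is currently busy */
	unsigned int irqs;	/* completion interrupts taken so far */
};

/* WRITE starts with the host DMA (copy to board), READ with the ATA command. */
void sx4_cmd_start(struct sx4_cmd *cmd, enum sx4_dir dir)
{
	assert(cmd != NULL);

	cmd->dir = dir;
	cmd->irqs = 0;
	cmd->step = (dir == SX4_WRITE) ? SX4_HOST_DMA : SX4_ATA_CMD;
}

/*
 * Handle one completion interrupt. After the first interrupt the other
 * engine is started; after the second the command is finished.
 * Returns 1 once the whole command is done, 0 otherwise.
 */
int sx4_cmd_irq(struct sx4_cmd *cmd)
{
	assert(cmd != NULL);

	if (cmd->step == SX4_DONE)
		return 1;

	cmd->irqs++;
	if (cmd->irqs < 2) {
		cmd->step = (cmd->step == SX4_HOST_DMA) ? SX4_ATA_CMD
							 : SX4_HOST_DMA;
		return 0;
	}

	cmd->step = SX4_DONE;
	return 1;
}
```

In this model a single READ or WRITE always reports irqs == 2 when it completes, which is the inefficiency in question: a plain ATA controller needs one interrupt per command.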

Thus, the "SX4 challenge" is a challenge to developers to figure out the 
best configuration for this hardware, given the existing MD and 
DM work going on.


Now, it must be noted that the SX4 is not current-gen technology.  Most 
vendors have moved towards an "IOP" model, where the hw vendor puts most 
of their hard work into an ARM/MIPS firmware, running on an embedded 
chip specially tuned for storage purposes.  (ref "hptiop" and "stex" 
drivers, very very small SCSI drivers)


I know Dan Williams @ Intel is working on very similar issues on the IOP 
-- async memcpy, XOR offload, etc. -- and I am hoping that, due to that 
current work, some of the good ideas can be reused with the SX4.


Anyway...  it's open, it's interesting, even if it's not current-gen 
tech anymore.  You can probably find them on Ebay or in an 
out-of-the-way computer shop somewhere.


Jeff



