Re: [RFC PATCH 0/7] Configurable fair allocation zone policy v2r6

2013-12-21 Thread Zlatko Calusic

On 17.12.2013 22:23, Mel Gorman wrote:

On Tue, Dec 17, 2013 at 04:07:35PM +0100, Zlatko Calusic wrote:

On 13.12.2013 15:10, Mel Gorman wrote:

Kicked this another bit today. It's still a bit half-baked but it restores
the historical performance and leaves the door open at the end for playing
nice with distributing file pages between nodes. Finishing this series
depends on whether we are going to make the remote node behaviour of the
fair zone allocation policy configurable or redefine MPOL_LOCAL. I'm in
favour of the configurable option because the default can be redefined and
tested while giving users a "compat" mode if we discover the new default
behaviour sucks for some workload.



I'll start a 5-day test of this patchset in a few hours, unless you
can send an updated one in the meantime. I intend to test it on a
rather boring 4GB x86_64 machine that before Johannes' work had lots
of trouble balancing zones. Would you recommend using the default
settings, i.e. not messing with the tunables at this point?



For me at least, I would prefer you tested v3 of the series with the
default settings, i.e. not interleaving file-backed pages on remote
nodes. Johannes might request testing with that knob enabled if the
machine is NUMA, although I doubt it is with 4GB of RAM.



Tested v3 on a UMA machine, with the default settings. I see no regression,
no issues whatsoever. From what I understand, this whole series is about
fixing issues noticed on NUMA, so I wish you good luck with that (no
such hardware here). Just be extra careful not to disturb the finally very
well balanced MM on more common machines (and especially those equipped
with 4GB of RAM). And once again, thank you, Johannes, for your work; you
did a great job.


Tested-by: Zlatko Calusic 
--
Zlatko



Re: [RFC PATCH 0/7] Configurable fair allocation zone policy v2r6

2013-12-17 Thread Zlatko Calusic

On 13.12.2013 15:10, Mel Gorman wrote:

Kicked this another bit today. It's still a bit half-baked but it restores
the historical performance and leaves the door open at the end for playing
nice with distributing file pages between nodes. Finishing this series
depends on whether we are going to make the remote node behaviour of the
fair zone allocation policy configurable or redefine MPOL_LOCAL. I'm in
favour of the configurable option because the default can be redefined and
tested while giving users a "compat" mode if we discover the new default
behaviour sucks for some workload.



I'll start a 5-day test of this patchset in a few hours, unless you can 
send an updated one in the meantime. I intend to test it on a rather 
boring 4GB x86_64 machine that before Johannes' work had lots of trouble 
balancing zones. Would you recommend using the default settings, i.e. 
not messing with the tunables at this point?


Regards,
--
Zlatko



Re: [patch 0/3] mm: improve page aging fairness between zones/nodes

2013-07-31 Thread Zlatko Calusic

On 24.07.2013 13:18, Zlatko Calusic wrote:

On 22.07.2013 18:48, Zlatko Calusic wrote:

On 19.07.2013 22:55, Johannes Weiner wrote:

The way the page allocator interacts with kswapd creates aging
imbalances, where the amount of time a userspace page gets in memory
under reclaim pressure is dependent on which zone, which node the
allocator took the page frame from.

#1 fixes missed kswapd wakeups on NUMA systems, which lead to some
nodes falling behind for a full reclaim cycle relative to the other
nodes in the system

#3 fixes an interaction where kswapd and a continuous stream of page
allocations keep the preferred zone of a task between the high and
low watermark (allocations succeed + kswapd does not go to sleep)
indefinitely, completely underutilizing the lower zones and
thrashing on the preferred zone

These patches are the aging fairness part of the thrash-detection
based file LRU balancing.  Andrea recommended to submit them
separately as they are bugfixes in their own right.



I have the patch applied and under testing. So far, so good. It looks
like it could finally fix the bug that I was chasing a few months ago
(nicely described in your bullet #3). But a few more days of testing will
be needed before I can reach a quality verdict.



Well, only 2 days later it's already obvious that the patch is perfect! :)



Additionally, on the patched kernel, kswapd burns 30% fewer CPU cycles. 
Nice to see that the restored balance also eases kswapd's job, but that 
was to be expected. Measured on the real workload, twice, to be sure.


Regards,
--
Zlatko



Re: [patch 0/3] mm: improve page aging fairness between zones/nodes

2013-07-24 Thread Zlatko Calusic

On 24.07.2013 14:46, Hush Bensen wrote:

On 2013/7/24 19:18, Zlatko Calusic wrote:

On 22.07.2013 18:48, Zlatko Calusic wrote:

On 19.07.2013 22:55, Johannes Weiner wrote:

The way the page allocator interacts with kswapd creates aging
imbalances, where the amount of time a userspace page gets in memory
under reclaim pressure is dependent on which zone, which node the
allocator took the page frame from.

#1 fixes missed kswapd wakeups on NUMA systems, which lead to some
nodes falling behind for a full reclaim cycle relative to the other
nodes in the system

#3 fixes an interaction where kswapd and a continuous stream of page
allocations keep the preferred zone of a task between the high and
low watermark (allocations succeed + kswapd does not go to sleep)
indefinitely, completely underutilizing the lower zones and
thrashing on the preferred zone

These patches are the aging fairness part of the thrash-detection
based file LRU balancing. Andrea recommended to submit them
separately as they are bugfixes in their own right.



I have the patch applied and under testing. So far, so good. It looks
like it could finally fix the bug that I was chasing a few months ago
(nicely described in your bullet #3). But a few more days of testing will
be needed before I can reach a quality verdict.



Well, only 2 days later it's already obvious that the patch is
perfect! :)

In the attached image, in the left column are the graphs covering the last
day and a half. It can be observed that the zones are really balanced, and
that aging is practically perfect. Graphs in the right column cover the
last 10-day period, and the left side of the upper graph shows how it
would look with the stock kernel after about 20 days of uptime (although
only a few days are enough to reach such an imbalance). File pages in the
Normal zone are an extinct species (red) and the zone is chock-full of
anon pages (blue). Having seen a lot of these graphs, I'm certain that
it won't happen anymore with your patch applied. The balance is
restored! Thank you for your work. Feel free to add:

Tested-by: Zlatko Calusic 


Thanks for your testing, Zlatko. Could you tell me which benchmark or
workload you are using? Btw, which tool is used to draw these nice
pictures? ;-)



The workload is mixed (various services, light load). The biggest I/O 
load comes from the backup procedure that runs every evening. The graphs 
are home-made, a little bit of rrd, a little bit of perl, nothing too 
complex. I'm actually slowly getting rid of these extra graphs, because 
I used them only for debugging this specific problem, which is now fixed 
thanks to Johannes.


--
Zlatko



Re: [patch 0/3] mm: improve page aging fairness between zones/nodes

2013-07-22 Thread Zlatko Calusic

On 22.07.2013 19:01, Johannes Weiner wrote:

Hi Zlatko,

On Mon, Jul 22, 2013 at 06:48:52PM +0200, Zlatko Calusic wrote:

On 19.07.2013 22:55, Johannes Weiner wrote:

The way the page allocator interacts with kswapd creates aging
imbalances, where the amount of time a userspace page gets in memory
under reclaim pressure is dependent on which zone, which node the
allocator took the page frame from.

#1 fixes missed kswapd wakeups on NUMA systems, which lead to some
nodes falling behind for a full reclaim cycle relative to the other
nodes in the system

#3 fixes an interaction where kswapd and a continuous stream of page
allocations keep the preferred zone of a task between the high and
low watermark (allocations succeed + kswapd does not go to sleep)
indefinitely, completely underutilizing the lower zones and
thrashing on the preferred zone

These patches are the aging fairness part of the thrash-detection
based file LRU balancing.  Andrea recommended to submit them
separately as they are bugfixes in their own right.



I have the patch applied and under testing. So far, so good. It
looks like it could finally fix the bug that I was chasing a few
months ago (nicely described in your bullet #3). But a few more days
of testing will be needed before I can reach a quality verdict.


I should have remembered that you talked about this problem... Thanks
a lot for testing!

May I ask for the zone layout of your test machine(s)?  I.e. how many
nodes if NUMA, how big Normal and DMA32 (on Node 0) are.



I have been reading about NUMA hw for at least a decade, but I guess 
another one will pass before I actually see one. ;) Find /proc/zoneinfo 
attached.


If your patchset fails my case, then nr_{in,}active_file in the Normal 
zone will drop close to zero in a matter of days. If it fixes this 
particular imbalance, and I have faith it will, then those two counters 
will stay in relative balance with nr_{in,}active_anon in the same zone. 
I also applied Konstantin's excellent lru-milestones-timestamps-and-ages 
patch, plus graphing of the interesting numbers on top of that, which is 
why I already have faith in your patchset. I can see a much better 
balance between zones already. But, let's give it some more time...
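
For anyone who wants to watch for the same signature, here is a minimal,
hypothetical C sketch (not Zlatko's actual rrd/perl tooling) that prints
the relevant per-zone LRU counters out of /proc/zoneinfo:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256], zone[32];
	unsigned long v;
	FILE *f = fopen("/proc/zoneinfo", "r");

	if (!f) {
		perror("/proc/zoneinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* zone headers look like: "Node 0, zone   Normal" */
		if (sscanf(line, "Node %*d, zone %31s", zone) == 1)
			printf("zone %s\n", zone);
		/* the four LRU counters whose balance matters here */
		else if (sscanf(line, " nr_inactive_anon %lu", &v) == 1 ||
			 sscanf(line, " nr_active_anon %lu", &v) == 1 ||
			 sscanf(line, " nr_inactive_file %lu", &v) == 1 ||
			 sscanf(line, " nr_active_file %lu", &v) == 1)
			printf("  %s", line + strspn(line, " \t"));
	}
	fclose(f);
	return 0;
}

If the bug bites, nr_{in,}active_file under "zone Normal" heads toward
zero while the anon counters stay put.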


--
Zlatko
Node 0, zone  DMA
  pages free 3975
min  132
low  165
high 198
scanned  0
spanned  4095
present  3998
managed  3977
nr_free_pages 3975
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 0
nr_active_file 0
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 0
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 2
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable  0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
nr_dirtied   0
nr_written   0
nr_anon_transparent_hugepages 0
nr_free_cma  0
protection: (0, 3236, 3933, 3933)
  pagesets
cpu: 0
  count: 0
  high:  0
  batch: 1
  vm stats threshold: 4
cpu: 1
  count: 0
  high:  0
  batch: 1
  vm stats threshold: 4
  all_unreclaimable: 1
  start_pfn: 1
  inactive_ratio:1
  avg_age_inactive_anon: 0
  avg_age_active_anon:   0
  avg_age_inactive_file: 0
  avg_age_active_file:   0
Node 0, zoneDMA32
  pages free 83177
min  27693
low  34616
high 41539
scanned  0
spanned  1044480
present  847429
managed  829295
nr_free_pages 83177
nr_inactive_anon 2061
nr_active_anon 313380
nr_inactive_file 199460
nr_active_file 207097
nr_unevictable 0
nr_mlock 0
nr_anon_pages 239688
nr_mapped 3
nr_file_pages 424978
nr_dirty 87
nr_writeback 0
nr_slab_reclaimable 9119
nr_slab_unreclaimable 2054
nr_page_table_pages 1795
nr_kernel_stack 144
nr_unstable  0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 18421
nr_dirtied   725414
nr_written   768505
nr_anon_transparent_hugepages 112
nr_free_cma  0
protection: (0, 0, 697, 697)
  pagesets
cpu: 0
  count: 132
  high:  186
  batch: 31
  vm stats threshold: 24
cpu: 1
  count: 146
  high:  186
  batch: 31
  vm stats threshold: 24
  all_unreclaimable: 0
  start_pfn: 4096
  inactive_ratio:5
  avg_age_inactive_anon: 5467648
  avg_age_active_anon:   5467648
  avg_age_inactive_file: 3184128
  avg_age_active_file:   5467648
Node 0, zone   Normal
  pages free 17164
min  5965
low  7456
hig
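
(Side note: the min/low/high triples above follow the kernel's per-zone
watermark ratios, low = min + min/4 and high = min + min/2, as set up by
__setup_per_zone_wmarks(); e.g. for DMA32, 27693 + 6923 = 34616 and
27693 + 13846 = 41539.)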

Re: [patch 0/3] mm: improve page aging fairness between zones/nodes

2013-07-22 Thread Zlatko Calusic

On 19.07.2013 22:55, Johannes Weiner wrote:

The way the page allocator interacts with kswapd creates aging
imbalances, where the amount of time a userspace page gets in memory
under reclaim pressure is dependent on which zone, which node the
allocator took the page frame from.

#1 fixes missed kswapd wakeups on NUMA systems, which lead to some
nodes falling behind for a full reclaim cycle relative to the other
nodes in the system

#3 fixes an interaction where kswapd and a continuous stream of page
allocations keep the preferred zone of a task between the high and
low watermark (allocations succeed + kswapd does not go to sleep)
indefinitely, completely underutilizing the lower zones and
thrashing on the preferred zone

These patches are the aging fairness part of the thrash-detection
based file LRU balancing.  Andrea recommended to submit them
separately as they are bugfixes in their own right.



I have the patch applied and under testing. So far, so good. It looks 
like it could finally fix the bug that I was chasing a few months ago 
(nicely described in your bullet #3). But a few more days of testing will 
be needed before I can reach a quality verdict.


Good job!
--
Zlatko



Re: [PATCH 0/9] Reduce system disruption due to kswapd V4

2013-05-18 Thread Zlatko Calusic

On 15.05.2013 22:37, Andrew Morton wrote:


                            3.10.0-rc1      3.10.0-rc1
                               vanilla  lessdisrupt-v4
Page Ins                       1234608          101892
Page Outs                     12446272        11810468
Swap Ins                        283406               0
Swap Outs                       698469           27882
Direct pages scanned                 0          136480
Kswapd pages scanned           6266537         5369364
Kswapd pages reclaimed         1088989          930832
Direct pages reclaimed               0          120901
Kswapd efficiency                  17%             17%
Kswapd velocity               5398.371        4635.115
Direct efficiency                 100%             88%
Direct velocity                  0.000         117.817
Percentage direct scans             0%              2%
Page writes by reclaim         1655843         4009929
Page writes file                957374         3982047
Page writes anon                698469           27882
Page reclaim immediate            5245            1745
Page rescued immediate               0               0
Slabs scanned                    33664           25216
Direct inode steals                  0               0
Kswapd inode steals              19409             778
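
(For orientation: the efficiency rows here appear to be pages reclaimed
per page scanned, e.g. 1088989 / 6266537 = ~17% for vanilla kswapd and
120901 / 136480 = ~88.6% for direct reclaim with the series; velocity
appears to be pages scanned per second of test runtime.)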


The reduction in inode steals might be a significant thing?
prune_icache_sb() does invalidate_mapping_pages() and can have the bad
habit of shooting down a vast number of pagecache pages (for a large
file) in a single hit.  Did this workload use large (and clean) files?
Did you run any test which would expose this effect?



I did not run specific tests, but I believe I observed exactly this 
issue on the real workload, where even at a moderate load sudden frees 
of pagecache happen quite often. I've attached a small graph where it 
can be easily seen. The snapshot was taken while the server was running 
an unpatched Linus kernel. After Mel's patch series is applied, I 
can't see anything similar. So it seems that this issue is completely 
gone; Mel's done a wonderful job.


And BTW, V4 continues to be rock stable, running here on many different 
machines, so I look forward to seeing this code merged in 3.11.

--
Zlatko
[graph attachment omitted]

Re: [PATCH 0/10] Reduce system disruption due to kswapd V2

2013-04-21 Thread Zlatko Calusic

On 22.04.2013 08:43, Simon Jeons wrote:

Hi Zlatko,
On 04/22/2013 02:37 PM, Zlatko Calusic wrote:

On 12.04.2013 22:07, Zlatko Calusic wrote:

On 12.04.2013 21:40, Mel Gorman wrote:

On Thu, Apr 11, 2013 at 10:55:13PM +0200, Zlatko Calusic wrote:

On 09.04.2013 13:06, Mel Gorman wrote:


- The only slightly negative thing I observed is that with the patch
applied kswapd burns 10x - 20x more CPU. So instead of about 15
seconds, it has now spent more than 4 minutes on one particular
machine with a quite steady load (after about 12 days of uptime).
Admittedly, that's still nothing too alarming, but...



Would you happen to know what circumstances trigger the higher CPU
usage?



Really nothing special. The server is lightly loaded, but it does enough
reading from the disk so that pagecache is mostly populated and page
reclaiming is active. So, kswapd is no doubt using CPU time gradually,
nothing extraordinary.

When I sent my reply yesterday, the server uptime was 12 days, and
kswapd had accumulated 4:28 CPU time. Now, approx 24 hours later (13
days uptime):

root23  0.0  0.0  0 0 ?SMar30   4:52
[kswapd0]

I will apply your v3 series soon and see if there's any improvement wrt
CPU usage, although as I said I don't see that as a big issue. It's
still only 0.013% of available CPU resources (dual core CPU).



JFTR, v3 kswapd uses about 15% more CPU time than v2. 2:50 kswapd CPU
time after 6 days 14h uptime.

And find attached another debugging graph that shows how ANON pages
are privileged in ZONE_NORMAL on a 4GB machine. Note that
the number of pages in ZONE_DMA32 is scaled (/5) to fit the graph
nicely.



Could you tell me how you drew this picture?



It's a home-made server monitoring system. I just added the code needed 
to graph the size of the active + inactive LRU lists, per zone and per 
type. Check out http://oss.oetiker.ch/rrdtool/


--
Zlatko



Re: [PATCH 0/10] Reduce system disruption due to kswapd V2

2013-04-21 Thread Zlatko Calusic

On 12.04.2013 22:07, Zlatko Calusic wrote:

On 12.04.2013 21:40, Mel Gorman wrote:

On Thu, Apr 11, 2013 at 10:55:13PM +0200, Zlatko Calusic wrote:

On 09.04.2013 13:06, Mel Gorman wrote:


- The only slightly negative thing I observed is that with the patch
applied kswapd burns 10x - 20x more CPU. So instead of about 15
seconds, it has now spent more than 4 minutes on one particular
machine with a quite steady load (after about 12 days of uptime).
Admittedly, that's still nothing too alarming, but...



Would you happen to know what circumstances trigger the higher CPU
usage?



Really nothing special. The server is lightly loaded, but it does enough
reading from the disk so that pagecache is mostly populated and page
reclaiming is active. So, kswapd is no doubt using CPU time gradually,
nothing extraordinary.

When I sent my reply yesterday, the server uptime was 12 days, and
kswapd had accumulated 4:28 CPU time. Now, approx 24 hours later (13
days uptime):

root23  0.0  0.0  0 0 ?SMar30   4:52 [kswapd0]

I will apply your v3 series soon and see if there's any improvement wrt
CPU usage, although as I said I don't see that as a big issue. It's
still only 0.013% of available CPU resources (dual core CPU).



JFTR, v3 kswapd uses about 15% more CPU time than v2. 2:50 kswapd CPU 
time after 6 days 14h uptime.


And find attached another debugging graph that shows how ANON pages are 
privileged in ZONE_NORMAL on a 4GB machine. Note that the number of 
pages in ZONE_DMA32 is scaled (/5) to fit the graph nicely.


--
Zlatko
[graph attachment omitted]

Re: [PATCH 0/10] Reduce system disruption due to kswapd V2

2013-04-12 Thread Zlatko Calusic

On 12.04.2013 22:41, Mel Gorman wrote:

On Fri, Apr 12, 2013 at 10:07:54PM +0200, Zlatko Calusic wrote:

On 12.04.2013 21:40, Mel Gorman wrote:

On Thu, Apr 11, 2013 at 10:55:13PM +0200, Zlatko Calusic wrote:

On 09.04.2013 13:06, Mel Gorman wrote:


- The only slightly negative thing I observed is that with the patch
applied kswapd burns 10x - 20x more CPU. So instead of about 15
seconds, it has now spent more than 4 minutes on one particular
machine with a quite steady load (after about 12 days of uptime).
Admittedly, that's still nothing too alarming, but...



Would you happen to know what circumstances trigger the higher CPU
usage?



Really nothing special. The server is lightly loaded, but it does
enough reading from the disk so that pagecache is mostly populated
and page reclaiming is active. So, kswapd is no doubt using CPU time
gradually, nothing extraordinary.

When I sent my reply yesterday, the server uptime was 12 days, and
kswapd had accumulated 4:28 CPU time. Now, approx 24 hours later (13
days uptime):

root23  0.0  0.0  0 0 ?SMar30   4:52 [kswapd0]



Ok, that's not too crazy.



Certainly.


I will apply your v3 series soon and see if there's any improvement
wrt CPU usage, although as I said I don't see that as a big issue.
It's still only 0.013% of available CPU resources (dual core CPU).



Excellent, thanks very much for testing and reporting back.


The pleasure is all mine. I really admire your work.


I read your
mail on the zone balancing and FWIW I would not have expected this series
to have any impact on it.


Good to know. At first I thought that your changes to the anon/file 
balance could make a difference; obviously not.



I do not have a good theory yet as to what the
problem is but I'll give it some thought and see what I come up with. I'll
be at LSF/MM next week so it might take me a while.



Yeah, that's definitely not something to be solved quickly; let it wait 
until you have more time, and I'll also continue to test various things 
after a slight break.


It's quite a subtle issue, although the solution will probably be simple 
and obvious. But I also think it'll take a lot of time to find it. I 
tried to develop an artificial test case to speed up debugging, but 
failed horribly. It seems that the issue can be seen only on real workloads.


--
Zlatko



Re: [PATCH 0/10] Reduce system disruption due to kswapd V2

2013-04-12 Thread Zlatko Calusic

On 12.04.2013 21:40, Mel Gorman wrote:

On Thu, Apr 11, 2013 at 10:55:13PM +0200, Zlatko Calusic wrote:

On 09.04.2013 13:06, Mel Gorman wrote:


- The only slightly negative thing I observed is that with the patch
applied kswapd burns 10x - 20x more CPU. So instead of about 15
seconds, it has now spent more than 4 minutes on one particular
machine with a quite steady load (after about 12 days of uptime).
Admittedly, that's still nothing too alarming, but...



Would you happen to know what circumstances trigger the higher CPU
usage?



Really nothing special. The server is lightly loaded, but it does enough 
reading from the disk so that pagecache is mostly populated and page 
reclaiming is active. So, kswapd is no doubt using CPU time gradually, 
nothing extraordinary.


When I sent my reply yesterday, the server uptime was 12 days, and 
kswapd had accumulated 4:28 CPU time. Now, approx 24 hours later (13 
days uptime):


root23  0.0  0.0  0 0 ?SMar30   4:52 [kswapd0]

I will apply your v3 series soon and see if there's any improvement wrt 
CPU usage, although as I said I don't see that as a big issue. It's 
still only 0.013% of available CPU resources (dual core CPU).
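
(For scale: 4:52 of CPU time is about 292 seconds over roughly 13 days
on two cores, i.e. 292 / (13 * 86400 * 2) = ~0.013%.)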


--
Zlatko



Re: [PATCH 0/10] Reduce system disruption due to kswapd V2

2013-04-11 Thread Zlatko Calusic

On 09.04.2013 13:06, Mel Gorman wrote:

Posting V2 of this series got delayed due to trying to pin down an unrelated
regression in 3.9-rc where interactive performance is shot to hell. That
problem still has not been identified as it's resisting attempts to be
reproducible by a script for the purposes of bisection.

For those that looked at V1, the most important difference in this version
is how patch 2 preserves the proportional scanning of anon/file LRUs.

The series is against 3.9-rc6.

Changelog since V1
o Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY  (andi)
o Reformat comment in shrink_page_list  (andi)
o Clarify some comments (dhillf)
o Rework how the proportional scanning is preserved
o Add PageReclaim check before kswapd starts writeback
o Reset sc.nr_reclaimed on every full zone scan



I believe this is what you had in your tree as the kswapd-v2r9 branch? If 
I'm right, then I've had this series under test for about 2 weeks on two 
different machines (one server, one desktop). Here's what I've found:


- while the series looks overwhelming, with a lot of intricate changes 
(at least from my POV), it proved completely stable and robust. I had 
ZERO issues with it. I'd encourage everybody to test it, even in 
production!


- I've just sent to you and to the linux-mm list a longish report on the 
issue I've tracked over the last few months that is unfortunately NOT 
solved by this patch series (although at first it looked like it would 
be). Occasionally I still see large parts of memory freed for no good 
reason, though I explained in the report how it happens. What I still 
don't know is what the real cause of the heavy imbalance in the pagecache 
utilization between the DMA32/NORMAL zones is. Seen only on 4GB RAM 
machines, but I suppose that is a quite popular configuration these days.


- The only slightly negative thing I observed is that with the patch 
applied kswapd burns 10x - 20x more CPU. So instead of about 15 seconds, 
it has now spent more than 4 minutes on one particular machine with a 
quite steady load (after about 12 days of uptime). Admittedly, that's 
still nothing too alarming, but...


- I like VERY much how you cleaned up the code so it is more readable 
now. I'd like to see it in the Linus tree as soon as possible. Very good 
job there!


Regards,
--
Zlatko



Re: kswapd craziness round 2

2013-03-07 Thread Zlatko Calusic

On 08.03.2013 07:42, Hillf Danton wrote:

On Fri, Mar 8, 2013 at 3:37 AM, Jiri Slaby  wrote:

On 03/01/2013 03:02 PM, Hillf Danton wrote:

On Fri, Mar 1, 2013 at 1:02 AM, Jiri Slaby  wrote:


Ok, no difference, kswapd is still crazy. I'm attaching the output of
"grep -vw '0' /proc/vmstat" if you see something there.


Thanks to you for test and data.

Lets try to restore the deleted nap, then.


Oh, it seems to be nice now:
root   579  0.0  0.0  0 0 ?SMar04   0:13 [kswapd0]


Double thanks.

But Mel does not like it, probably.
Let's try the nap in another way.

Hillf

--- a/mm/vmscan.c   Thu Feb 21 20:01:02 2013
+++ b/mm/vmscan.c   Fri Mar  8 14:36:10 2013
@@ -2793,6 +2793,10 @@ loop_again:
 * speculatively avoid congestion waits
 */
zone_clear_flag(zone, ZONE_CONGESTED);
+
+   else if (sc.priority > 2 &&
+sc.priority < DEF_PRIORITY - 2)
+   wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
}

/*
--



There's another bug in there, which I'm still chasing. Artificial sleeps 
like this just mask the real bug and introduce new problems (on my 4GB 
server kswapd spends all its time in those congestion wait calls). The 
problem is that the bug needs about 5 days of uptime to rear its ugly 
head. So far I can only tell that it was introduced somewhere between 
3.1 & 3.4.


Also, check shrink_inactive_list(), it already sleeps if really needed:

if (nr_writeback && nr_writeback >=
(nr_taken >> (DEF_PRIORITY - sc->priority)))
wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
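
(With DEF_PRIORITY = 12, at sc->priority = 10 the shift is 2, so this
naps once a quarter of the pages taken off the LRU are under writeback;
each further priority drop halves that threshold, so the nap triggers
sooner as reclaim gets more desperate.)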

Regards,
--
Zlatko



Re: Inactive memory keep growing and how to release it?

2013-03-04 Thread Zlatko Calusic

On 04.03.2013 10:52, Lenky Gao wrote:

Hi,

When I run a test on CentOS 6.2 as follows:

#!/bin/bash

while true
do

file="/tmp/filetest"

echo $file

dd if=/dev/zero of=${file} bs=512 count=204800 &> /dev/null

sleep 5
done
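
(Each pass rewrites a 512 B * 204800 = 100 MiB file, so the loop dirties
roughly 100 MiB of page cache every five seconds.)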

the inactive memory keeps growing:

#cat /proc/meminfo | grep Inactive\(fi
Inactive(file):   420144 kB
...
#cat /proc/meminfo | grep Inactive\(fi
Inactive(file):   911912 kB
...
#cat /proc/meminfo | grep Inactive\(fi
Inactive(file):  1547484 kB
...

and I cannot reclaim it:

# cat /proc/meminfo | grep Inactive\(fi
Inactive(file):  1557684 kB
# echo 3 > /proc/sys/vm/drop_caches
# cat /proc/meminfo | grep Inactive\(fi
Inactive(file):  1520832 kB

I have tested other kernel versions, such as 2.6.30 and .6.11; the
problem also exists.

In the final state, I cannot kmalloc a large contiguous
region, especially in interrupt context.
Can you give some tips to avoid this?



The drop_caches mechanism doesn't free dirty page cache pages. And your 
bash script is creating a lot of dirty pages. Run it like this and see 
if it helps your case:


sync; echo 3 > /proc/sys/vm/drop_caches
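
For reference, a minimal illustrative C equivalent; per
Documentation/sysctl/vm.txt, writing "3" drops both the page cache (1)
and slab objects (2), and root is required:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd;

	/* flush dirty pages first; drop_caches only drops clean ones */
	sync();

	fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
	if (fd < 0) {
		perror("drop_caches");
		return 1;
	}
	/* "1" = page cache, "2" = slab, "3" = both */
	if (write(fd, "3", 1) != 1)
		perror("write");
	close(fd);
	return 0;
}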

Regards,
--
Zlatko


Re: mmotm 2013-01-18-15-48 uploaded

2013-01-19 Thread Zlatko Calusic

On 19.01.2013 00:49, a...@linux-foundation.org wrote:

The mm-of-the-moment snapshot 2013-01-18-15-48 has been uploaded to

http://www.ozlabs.org/~akpm/mmotm/



WARNING: vmlinux.o(.text+0x43f025): Section mismatch in reference from 
the function release_firmware_map_entry() to the function 
.meminit.text:firmware_map_find_entry_in_list()

The function release_firmware_map_entry() references
the function __meminit firmware_map_find_entry_in_list().
This is often because release_firmware_map_entry lacks a __meminit
annotation or the annotation of firmware_map_find_entry_in_list is wrong.

--
Zlatko


Re: [PATCH] lockdep, rwsem: fix down_write_nest_lock() if !CONFIG_DEBUG_LOCK_ALLOC

2013-01-15 Thread Zlatko Calusic

On 15.01.2013 20:12, Jiri Kosina wrote:

Commit 1b963c81b1 ("lockdep, rwsem: provide down_write_nest_lock()")
contains a bug in a codepath when CONFIG_DEBUG_LOCK_ALLOC is disabled,
which causes down_read() to be called instead of down_write() by mistake
on such configurations. Fix that.

Reported-by: Andrew Clayton 
Reported-by: Zlatko Calusic 
Tested-by: Andrew Clayton 
Signed-off-by: Jiri Kosina 


Reported-and-tested-by: Zlatko Calusic 

kvm starts just fine now. Thanks Jiri!

--
Zlatko


[PATCH] mm: don't wait on congested zones in balance_pgdat()

2013-01-14 Thread Zlatko Calusic
From: Zlatko Calusic 

Commit 92df3a72 (mm: vmscan: throttle reclaim if encountering too many
dirty pages under writeback) introduced waiting on congested zones
based on a sane algorithm in shrink_inactive_list(). What this means
is that there's no more need for throttling and additional heuristics
in balance_pgdat(). So, let's remove it and tidy up the code.

Signed-off-by: Zlatko Calusic 
---
 include/linux/vm_event_item.h |  1 -
 mm/vmscan.c   | 29 +
 mm/vmstat.c   |  1 -
 3 files changed, 1 insertion(+), 30 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index e84a25e..d4b7a18 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -36,7 +36,6 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #endif
PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL,
KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
-   KSWAPD_SKIP_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
 #ifdef CONFIG_NUMA_BALANCING
NUMA_PTE_UPDATES,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 32fbfdb..fea5a0b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2619,7 +2619,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int *classzone_idx)
 {
bool pgdat_is_balanced = false;
-   struct zone *unbalanced_zone;
int i;
int end_zone = 0;   /* Inclusive.  0 = ZONE_DMA */
unsigned long total_scanned;
@@ -2650,9 +2649,6 @@ loop_again:
 
do {
unsigned long lru_pages = 0;
-   int has_under_min_watermark_zone = 0;
-
-   unbalanced_zone = NULL;
 
/*
 * Scan in the highmem->dma direction for the highest
@@ -2792,17 +2788,7 @@ loop_again:
continue;
}
 
-   if (!zone_balanced(zone, testorder, 0, end_zone)) {
-   unbalanced_zone = zone;
-   /*
-* We are still under min water mark.  This
-* means that we have a GFP_ATOMIC allocation
-* failure risk. Hurry up!
-*/
-   if (!zone_watermark_ok_safe(zone, order,
-   min_wmark_pages(zone), end_zone, 0))
-   has_under_min_watermark_zone = 1;
-   } else {
+   if (zone_balanced(zone, testorder, 0, end_zone))
/*
 * If a zone reaches its high watermark,
 * consider it to be no longer congested. It's
@@ -2811,8 +2797,6 @@ loop_again:
 * speculatively avoid congestion waits
 */
zone_clear_flag(zone, ZONE_CONGESTED);
-   }
-
}
 
/*
@@ -2830,17 +2814,6 @@ loop_again:
}
 
/*
-* OK, kswapd is getting into trouble.  Take a nap, then take
-* another pass across the zones.
-*/
-   if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) {
-   if (has_under_min_watermark_zone)
-   count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
-   else if (unbalanced_zone)
-   wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10);
-   }
-
-   /*
 * We do this so kswapd doesn't build up large priorities for
 * example when it is freeing in parallel with allocators. It
 * matches the direct reclaim path behaviour in terms of impact
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 58e3da5..bb492b5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -769,7 +769,6 @@ const char * const vmstat_text[] = {
"kswapd_inodesteal",
"kswapd_low_wmark_hit_quickly",
"kswapd_high_wmark_hit_quickly",
-   "kswapd_skip_congestion_wait",
"pageoutrun",
"allocstall",
 
-- 
1.8.1

-- 
Zlatko


Re: [PATCH] mm: wait for congestion to clear on all zones

2013-01-14 Thread Zlatko Calusic
On 13.01.2013 01:46, Simon Jeons wrote:
> On Fri, 2013-01-11 at 12:25 +0100, Zlatko Calusic wrote:
>> On 11.01.2013 02:25, Simon Jeons wrote:
>>> On Wed, 2013-01-09 at 22:41 +0100, Zlatko Calusic wrote:
>>>> From: Zlatko Calusic 
>>>>
>>>> Currently we take a short nap (HZ/10) and wait for congestion to clear
>>>> before taking another pass with lower priority in balance_pgdat(). But
>>>> we do that only for the highest zone that we encounter is unbalanced
>>>> and congested.
>>>>
>>>> This patch changes that to wait on all congested zones in a single
>>>> pass in the hope that it will save us some scanning that way. Also we
>>>> take a nap as soon as congested zone is encountered and sc.priority <
>>>> DEF_PRIORITY - 2 (aka kswapd in trouble).
>>>
>>> But you still didn't explain what's the problem you meat and what
>>> scenario can get benefit from your change.
>>>
>>
>> I did in my reply to Andrew. Here's the relevant part:
>>
>>> I have an observation that without it, under some circumstances that
>>> are VERY HARD to repeat (many days need to pass and some stars to align
>>> to see the effect), the page cache gets hit hard, 2/3 of it evicted in
>>> a split second. And it's not even under high load! So, I'm still
>>> monitoring it, but so far the memory utilization really seems better
>>> with the patch applied (no more mysterious page cache shootdowns).
>>
>> The scenario that should benefit is the everyday one. I observed problems during
>> light but constant reading from disk (< 10MB/s), while sending that data
>> over the network at the same time. Think backup that compresses data on the
>> fly before pushing it over the network (so it's not very fast).
>>
>> The trouble is that you can't just fix up a quick benchmark and measure the
>> impact, because many days need to pass for the bug to show up in all its
>> beauty.
>>
>> Is there anybody out there who'd like to comment on the patch logic? I.e. do
>> you think that waiting on every congested zone is the more correct solution
>> than waiting on only one (only the highest one, and ignoring the fact that
>> there may be other even more congested zones)?
> 
> What's the benefit of waiting on every congested zone than waiting on
> only one against your scenario?
> 

The good:

Actually, we are _already_ waiting on every congested zone. And have
been for more than a year. So, all this discussion is... moot.

Andrew, ignore this patch, I'll send you a much better one in a minute.
There shouldn't be nearly so many questions about that one. ;)

The bad:

Obviously then, this patch didn't fix my issue. It just took a little
bit longer for it to appear again.

The ugly:

Here's what I observe on one of my machines:

Node 0, zone  DMA
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
Node 0, zoneDMA32
nr_vmscan_write 23164
nr_vmscan_immediate_reclaim 582038
Node 0, zone   Normal
nr_vmscan_write 16584344  <-- ugh!
nr_vmscan_immediate_reclaim 1118415

But that's just a sneak peek; I'll open a proper thread to discuss this
when I collect a little bit more data. BTW, that Normal zone with an
extraordinary amount of writebacks under memory pressure is 4 times
smaller than the DMA32 zone; that's why I consider it ugly. :P
-- 
Zlatko


Re: mmotm 2013-01-11-15-47 (trouble starting kvm)

2013-01-12 Thread Zlatko Calusic

On 12.01.2013 00:48, a...@linux-foundation.org wrote:

A git tree which contains the memory management portion of this tree is
maintained at git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
by Michal Hocko.  It contains the patches which are between the


The last commit I see in this tree is:

commit a0d271cbfed1dd50278c6b06bead3d00ba0a88f9
Author: Linus Torvalds 
Date:   Sun Sep 30 16:47:46 2012 -0700

Linux 3.6

Is it dead? Or am I doing something wrong?


A full copy of the full kernel tree with the linux-next and mmotm patches
already applied is available through git within an hour of the mmotm
release.  Individual mmotm releases are tagged.  The master branch always
points to the latest release, so it's constantly rebasing.

http://git.cmpxchg.org/?p=linux-mmotm.git;a=summary

This mmotm tree contains the following patches against 3.8-rc3:
(patches marked "*" will be included in linux-next)

* lockdep-rwsem-provide-down_write_nest_lock.patch
* mm-mmap-annotate-vm_lock_anon_vma-locking-properly-for-lockdep.patch


Had to revert the above two patches to start KVM (win7) successfully. 
Otherwise it would livelock on some semaphore, it seems. Couldn't kill 
it, ps output would get stuck, and even reboot didn't work (had to use 
SysRq).


--
Zlatko


Re: [PATCH] mm: wait for congestion to clear on all zones

2013-01-11 Thread Zlatko Calusic
On 11.01.2013 02:25, Simon Jeons wrote:
> On Wed, 2013-01-09 at 22:41 +0100, Zlatko Calusic wrote:
>> From: Zlatko Calusic 
>>
>> Currently we take a short nap (HZ/10) and wait for congestion to clear
>> before taking another pass with lower priority in balance_pgdat(). But
>> we do that only for the highest zone that we encounter is unbalanced
>> and congested.
>>
>> This patch changes that to wait on all congested zones in a single
>> pass in the hope that it will save us some scanning that way. Also we
>> take a nap as soon as congested zone is encountered and sc.priority <
>> DEF_PRIORITY - 2 (aka kswapd in trouble).
> 
> But you still didn't explain what's the problem you meat and what
> scenario can get benefit from your change.
> 

I did in my reply to Andrew. Here's the relevant part:

> I have an observation that without it, under some circumstances that 
> are VERY HARD to repeat (many days need to pass and some stars to align
> to see the effect), the page cache gets hit hard, 2/3 of it evicted in
> a split second. And it's not even under high load! So, I'm still
> monitoring it, but so far the memory utilization really seems better
> with the patch applied (no more mysterious page cache shootdowns). 

The scenario that should benefit is the everyday one. I observed problems during
light but constant reading from disk (< 10MB/s), while sending that data
over the network at the same time. Think backup that compresses data on the
fly before pushing it over the network (so it's not very fast).

The trouble is that you can't just fix up a quick benchmark and measure the
impact, because many days need to pass for the bug to show up in all its
beauty.

Is there anybody out there who'd like to comment on the patch logic? I.e. do
you think that waiting on every congested zone is the more correct solution
than waiting on only one (only the highest one, and ignoring the fact that
there may be other even more congested zones)?

Regards,
-- 
Zlatko


Re: [PATCH] mm: wait for congestion to clear on all zones

2013-01-09 Thread Zlatko Calusic

On 09.01.2013 22:48, Andrew Morton wrote:

On Wed, 09 Jan 2013 22:41:48 +0100
Zlatko Calusic  wrote:


Currently we take a short nap (HZ/10) and wait for congestion to clear
before taking another pass with lower priority in balance_pgdat(). But
we do that only for the highest zone that we encounter is unbalanced
and congested.

This patch changes that to wait on all congested zones in a single
pass in the hope that it will save us some scanning that way. Also we
take a nap as soon as congested zone is encountered and sc.priority <
DEF_PRIORITY - 2 (aka kswapd in trouble).

...

The patch is against the mm tree. Make sure that
mm-avoid-calling-pgdat_balanced-needlessly.patch is applied first (not
yet in the mmotm tree). Tested on half a dozen systems with different
workloads for the last few days, working really well!


But what are the user-observable effects of this change?  Less kernel
CPU consumption, presumably?  Did you quantify it?



And I forgot to answer all the questions... :(

Actually, I did record kswapd CPU usage after 5 days of uptime and I 
intend to compare it with the new data (after a few more days pass). I 
expect maybe slightly better results.


But I think it's obvious from my first reply that my primary goal with 
this patch is correctness, not optimization. So, I won't be disappointed 
one little bit if kswapd CPU usage stays the same, so long as the memory 
utilization remains this smooth. ;)


--
Zlatko


Re: [PATCH] mm: wait for congestion to clear on all zones

2013-01-09 Thread Zlatko Calusic

On 09.01.2013 22:48, Andrew Morton wrote:

On Wed, 09 Jan 2013 22:41:48 +0100
Zlatko Calusic  wrote:


Currently we take a short nap (HZ/10) and wait for congestion to clear
before taking another pass with lower priority in balance_pgdat(). But
we do that only for the highest zone that we encounter is unbalanced
and congested.

This patch changes that to wait on all congested zones in a single
pass in the hope that it will save us some scanning that way. Also we
take a nap as soon as congested zone is encountered and sc.priority <
DEF_PRIORITY - 2 (aka kswapd in trouble).

...

The patch is against the mm tree. Make sure that
mm-avoid-calling-pgdat_balanced-needlessly.patch is applied first (not
yet in the mmotm tree). Tested on half a dozen systems with different
workloads for the last few days, working really well!


But what are the user-observable effects of this change?  Less kernel
CPU consumption, presumably?  Did you quantify it?



I have an observation that without it, under some circumstances that are 
VERY HARD to repeat (many days need to pass and some stars to align to 
see the effect), the page cache gets hit hard, 2/3 of it evicted in a 
split second. And it's not even under high load! So, I'm still 
monitoring it, but so far the memory utilization really seems better 
with the patch applied (no more mysterious page cache shootdowns).


Other than that, it just seems more correct to wait on all congested 
zones, not just the highest one. When I sent my first patch that 
replaced congestion_wait() I didn't have much time to do elaborate 
analysis (3.7.0 was released in a matter of hours). So, I just plugged 
the hole and continued working on the proper solution.


I do think that this is my last patch in this particular area 
(balance_pgdat() & friends). But I'll continue investigating the root 
cause of this interesting imbalance that happens only on this particular 
system, because I think balance_pgdat() behaviour was just revealing it, 
and the real problem is somewhere else.

--
Zlatko


[PATCH] mm: wait for congestion to clear on all zones

2013-01-09 Thread Zlatko Calusic
From: Zlatko Calusic 

Currently we take a short nap (HZ/10) and wait for congestion to clear
before taking another pass with lower priority in balance_pgdat(). But
we do that only for the highest zone that we encounter is unbalanced
and congested.

This patch changes that to wait on all congested zones in a single
pass in the hope that it will save us some scanning that way. Also we
take a nap as soon as congested zone is encountered and sc.priority <
DEF_PRIORITY - 2 (aka kswapd in trouble).

Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Signed-off-by: Zlatko Calusic 
---
The patch is against the mm tree. Make sure that
mm-avoid-calling-pgdat_balanced-needlessly.patch is applied first (not
yet in the mmotm tree). Tested on half a dozen systems with different
workloads for the last few days, working really well!

 mm/vmscan.c | 35 ---
 1 file changed, 12 insertions(+), 23 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 002ade6..1c5d38a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2565,7 +2565,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int *classzone_idx)
 {
bool pgdat_is_balanced = false;
-   struct zone *unbalanced_zone;
int i;
int end_zone = 0;   /* Inclusive.  0 = ZONE_DMA */
unsigned long total_scanned;
@@ -2596,9 +2595,6 @@ loop_again:
 
do {
unsigned long lru_pages = 0;
-   int has_under_min_watermark_zone = 0;
-
-   unbalanced_zone = NULL;
 
/*
 * Scan in the highmem->dma direction for the highest
@@ -2739,15 +2735,20 @@ loop_again:
}
 
if (!zone_balanced(zone, testorder, 0, end_zone)) {
-   unbalanced_zone = zone;
-   /*
-* We are still under min water mark.  This
-* means that we have a GFP_ATOMIC allocation
-* failure risk. Hurry up!
-*/
+   if (total_scanned && sc.priority < DEF_PRIORITY - 2) {
+   /* OK, kswapd is getting into trouble. */
if (!zone_watermark_ok_safe(zone, order,
min_wmark_pages(zone), end_zone, 0))
-   has_under_min_watermark_zone = 1;
+   /*
+* We are still under min water mark.
+* This means that we have a GFP_ATOMIC
+* allocation failure risk. Hurry up!
+*/
+   count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
+   else
+   /* Take a nap if a zone is congested. */
+   wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+   }
} else {
/*
 * If a zone reaches its high watermark,
@@ -2758,7 +2759,6 @@ loop_again:
 */
zone_clear_flag(zone, ZONE_CONGESTED);
}
-
}
 
/*
@@ -2776,17 +2776,6 @@ loop_again:
}
 
/*
-* OK, kswapd is getting into trouble.  Take a nap, then take
-* another pass across the zones.
-*/
-   if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) {
-   if (has_under_min_watermark_zone)
-   count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
-   else if (unbalanced_zone)
-   wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10);
-   }
-
-   /*
 * We do this so kswapd doesn't build up large priorities for
 * example when it is freeing in parallel with allocators. It
 * matches the direct reclaim path behaviour in terms of impact
-- 
1.8.1

-- 
Zlatko


Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion

2012-12-29 Thread Zlatko Calusic

On 29.12.2012 08:25, Hillf Danton wrote:

On Thu, Dec 27, 2012 at 11:42 PM, Zlatko Calusic
 wrote:

On 21.12.2012 12:51, Hillf Danton wrote:


On Thu, Dec 20, 2012 at 7:25 AM, Zlatko Calusic 
wrote:


   static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
  int *classzone_idx)
   {
-   int all_zones_ok;
+   struct zone *unbalanced_zone;



nit: fewer hunks if that mark is not erased

Hillf



This one was left unanswered and forgotten because I didn't understand what
you meant. Could you elaborate?


Sure, the patch looks simpler (and nicer) if we don't
erase all_zones_ok.



Ah, yes. I gave it a good thought. But, when I introduced 
unbalanced_zone it just didn't make much sense to me to have two 
variables with very similar meaning. If I decided to keep all_zones_ok, 
it would be either:


all_zones_ok = true
unbalanced_zone = NULL
(meaning: if no zone in unbalanced, then all zones must be ok)

or

all_zones_ok = false
unbalanced_zone = struct zone *
(meaning: if there's an unbalanced zone, then certainly not all zones 
are ok)


So I decided to use only unbalanced_zone (because I had to!), and remove 
all_zones_ok to avoid redundancy. I hope it makes sense.


If you check my latest (and still queued) optimization (mm: avoid 
calling pgdat_balanced() needlessly), there again popped up a need for a 
boolean, but I called it pgdat_is_balanced this time, just to match the 
name of two other functions. It could've also been called all_zones_ok 
if you prefer the name? Of course, I have no strong feelings about the 
name, both are OK, so if you want me to redo the patch, just say.


Generally speaking, while I always attempt to make a smaller patch (fewer 
hunks and fewer changes = easier to review), before that I'll always try 
to make the code that results from the commit cleaner, simpler, more 
readable.


For example, I'll always check that I don't mess with whitespace 
needlessly, unless I think it's actually desirable, here's just one example:


"mm: avoid calling pgdat_balanced() needlessly" changes

---
} while (--sc.priority >= 0);
out:

if (!pgdat_balanced(pgdat, order, *classzone_idx)) {
---

to

---
} while (--sc.priority >= 0);

out:
if (!pgdat_is_balanced) {
---

because I find the latter more correct place for the label "out".

Thanks for the comment.
--
Zlatko


Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

2012-12-28 Thread Zlatko Calusic

On 28.12.2012 10:01, Zhouping Liu wrote:

On 12/28/2012 10:45 AM, Zhouping Liu wrote:

Thank you for the report Zhouping!

Would you be so kind to test the following patch and report results?
Apply the patch to the latest mainline.

Hello Zlatko,

I have tested the patch below (applied it on mainline directly),
but IMO it may not fix the issue completely.


Hi Zlatko,

I re-tested it on another machine, which has 60+ GB RAM and 4 NUMA nodes.
Without your patch, it's easy to reproduce the 'NULL pointer' error;
after applying your patch, I couldn't reproduce the issue any more.

Based on the above, it appears that your patch fixed the issue.



Yes, that's exactly what I expected. Just wanted to double-check this 
time. Live and learn. ;)



but in my last mail, I tested it on two machines, which hit hung tasks
with your patch, so I'm confused: does your patch block some oom-killer
activity? If not, your patch is good for me.



From what I know, the patch shouldn't have much influence on the oom 
killer, if any. But, as all those subsystems are closely interconnected, 
both oom & vmscan code is mm after all, there could be some 
interference. Is the hung-task issue repeatable?

--
Zlatko


Re: [PATCH] mm: fix null pointer dereference in wait_iff_congested()

2012-12-28 Thread Zlatko Calusic

On 28.12.2012 03:49, Minchan Kim wrote:

Hello Zlatko,

On Fri, Dec 28, 2012 at 03:16:38AM +0100, Zlatko Calusic wrote:

From: Zlatko Calusic 

The unintended consequence of commit 4ae0a48b is that
wait_iff_congested() can now be called with a NULL struct zone*,
producing a kernel oops like this:


For a good description, it would be better to write a simple pseudo-code
flow showing how a NULL zone passes into wait_iff_congested(), because the
kswapd code flow is too complex.

As I see the code, we have following line above wait_iff_congested.

if (!unbalanced_zone || blah blah)
 break;

How can NULL unbalanced_zone reach wait_iff_congested?



Hello Minchan, and thanks for the comment.

That line was there before commit 4ae0a48b got in, and you're right, 
it's what was protecting wait_iff_congested() from being called with a 
NULL zone*. But then all that logic got collapsed into a simple 
pgdat_balanced() call, and that's when I introduced the bug: I lost the 
protection.


What I _think_ is happening (pseudo code following...) is that after 
scanning the zones in the dma->highmem direction, and concluding that all 
zones are balanced (unbalanced_zone remains NULL!), 
wake_up(&pgdat->pfmemalloc_wait) wakes up a lot of memory-hungry 
processes (especially true in various aggressive tests/benchmarks) that 
immediately drain and unbalance one or more zones. Then the 
pgdat_balanced() call which immediately follows will be false, but we 
still have unbalanced_zone = NULL, remember? Oops...
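
A contrived userspace mock of that sequence (illustrative names only, not
the kernel code) crashes the same way:

#include <stdio.h>

struct zone { int congested; };

/* stays NULL because the scan saw every zone balanced */
static struct zone *unbalanced_zone;

static void wait_iff_congested(struct zone *z)
{
	if (z->congested)	/* NULL dereference when z == NULL */
		puts("napping");
}

int main(void)
{
	int balanced = 1;	/* verdict at the end of the scan */

	/* wake_up(&pgdat->pfmemalloc_wait) analogue: woken tasks
	 * allocate aggressively and drain the zones again */
	balanced = 0;

	if (!balanced)		/* pgdat_balanced() re-check now fails... */
		wait_iff_congested(unbalanced_zone);	/* ...oops */
	return 0;
}

Run it and it segfaults at the same spot the kernel oopsed, inside
wait_iff_congested().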


But, all that is a speculation that I can't prove atm. Of course, if 
anybody thinks that's a credible explanation, I could add it as a commit 
comment, or even as a code comment, but I didn't want to be overly 
imaginative. The fix itself is simple and real.


Regards,
--
Zlatko


Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

2012-12-28 Thread Zlatko Calusic

On 28.12.2012 03:45, Zhouping Liu wrote:


Thank you for the report Zhouping!

Would you be so kind to test the following patch and report results?
Apply the patch to the latest mainline.


Hello Zlatko,

I have tested the below patch (applied it on mainline directly),
but IMO it may not fix the issue completely.

I ran the reproducer[1] on two machines, one with 2 NUMA nodes (8GB RAM), 
the other with 4 NUMA nodes (8GB RAM); the system then hung all the time, 
as shown in this dmesg log:

[  713.066937] Killed process 6085 (oom01) total-vm:18880768kB, 
anon-rss:7915612kB, file-rss:4kB
[  959.555269] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
[  959.562144] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1079.382018] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
[ 1079.388872] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1199.209709] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
[ 1199.216562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1319.036939] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
[ 1319.043794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1438.864797] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
[ 1438.871649] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1558.691611] INFO: task kworker/13:2:147 blocked for more than 120 seconds.
[ 1558.698466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
..

I'm not sure whether it's your patch triggering the hung task or not, but 
with cda73a10eb3 reverted, the reproducer (oom01) can PASS without either 
the 'NULL pointer dereference at 0000000000000500' or the hung task issues.

But sometimes the reproducer (oom01) can cause a hung task even on a box 
with large RAM (100GB+), so I can't judge it...



Thanks for the test.

Yes, close to OOM things get quite unstable and it's hard to get 
reliable test results. Maybe you could run it a few times, and see if 
you can get any meaningful statistics out of a few runs. I need to check 
oom.c myself and see what it's doing. Thanks for the link.


Regards,
--
Zlatko


[PATCH] mm: fix null pointer dereference in wait_iff_congested()

2012-12-27 Thread Zlatko Calusic
From: Zlatko Calusic 

The unintended consequence of commit 4ae0a48b is that
wait_iff_congested() can now be called with a NULL struct zone*,
producing a kernel oops like this:

BUG: unable to handle kernel NULL pointer dereference
IP: [] wait_iff_congested+0x59/0x140

This trivial patch fixes it.

Reported-by: Zhouping Liu 
Reported-and-tested-by: Sedat Dilek 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Signed-off-by: Zlatko Calusic 
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 02bcfa3..e55ce55 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2782,7 +2782,7 @@ loop_again:
if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) {
if (has_under_min_watermark_zone)
count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
-   else
+   else if (unbalanced_zone)
wait_iff_congested(unbalanced_zone, 
BLK_RW_ASYNC, HZ/10);
}
 
-- 
1.8.1.rc3

-- 
Zlatko


Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

2012-12-27 Thread Zlatko Calusic

On 28.12.2012 01:37, Sedat Dilek wrote:

On Fri, Dec 28, 2012 at 1:33 AM, Zlatko Calusic  wrote:

On 28.12.2012 01:24, Sedat Dilek wrote:


On Fri, Dec 28, 2012 at 12:51 AM, Zlatko Calusic
 wrote:


On 28.12.2012 00:42, Sedat Dilek wrote:



On Fri, Dec 28, 2012 at 12:39 AM, Zlatko Calusic
 wrote:



On 28.12.2012 00:30, Sedat Dilek wrote:




Hi Zlatko,

I am not sure if I hit the same problem as described in this thread.

Under heavy load, while building a customized toolchain for the Freetz
router project I got a BUG || NULL pointer dereference || kswapd ||
zone_balanced || pgdat_balanced() etc. (see my screenshot for details).

I will try your patch from [1] ***only*** on top of my last
Linux-v3.8-rc1 GIT setup (post-v3.8-rc1 mainline + some net-fixes).



Yes, that's the same bug. It should be fixed with my latest patch, so
I'd
appreciate you testing it, to be on the safe side this time. There
should
be
no difference if you apply it to anything newer than 3.8-rc1, so go for
it.
Thanks!



Not sure how I can really reproduce this bug as one build worked fine
within my last v3.8-rc1 kernel.
I increased the parallel-make-jobs-number from "4" to "8" to stress a
bit harder.
Just building right now... and will report.

If you have any test-case (script or whatever), please let me/us know.



Unfortunately not, I haven't reproduced it yet on my machines. But it
seems
that bug will hit only under heavy memory pressure. When close to OOM, or
possibly with lots of writing to disk. It's also possible that
fragmentation
of memory zones could provoke it, that means testing it for a longer
time.



I tested successfully by doing simultaneously...
- building Freetz with 8 parallel make-jobs
- building Linux GIT with 1 make-job
- 9 tabs open in firefox
- In one tab I ran YouTube music video
- etc.

I am reading [1] and [2] where another user reports success by reverting
this...

commit cda73a10eb3f493871ed39f468db50a65ebeddce
"mm: do not sleep in balance_pgdat if there's no i/o congestion"

BTW, this machine has also 4GiB RAM (Ubuntu/precise AMD64).

Feel free to add a "Reported-by/Tested-by" if you think this is a
positive report.



Thanks for the testing! And keep running it in case something interesting
pops up. ;)

No need to revert cda73a10eb because it fixes another bug. And the patch
you're now running fixes the new bug I introduced with a combination of my
latest 2 patches. Nah, it gets complicated... :)

But, at least I found the culprit and as soon as Linus applies the fix,
everything will be hunky dory again, at least on this front. :P



I am not subscribed to LKML and linux-mm,,,
Do you have a patch with a proper subject and descriptive text? URL?



Soon to follow. I'd appreciate Zhouping Liu testing it too, though.
--
Zlatko


Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

2012-12-27 Thread Zlatko Calusic

On 28.12.2012 01:24, Sedat Dilek wrote:

On Fri, Dec 28, 2012 at 12:51 AM, Zlatko Calusic
 wrote:

On 28.12.2012 00:42, Sedat Dilek wrote:


On Fri, Dec 28, 2012 at 12:39 AM, Zlatko Calusic
 wrote:


On 28.12.2012 00:30, Sedat Dilek wrote:



Hi Zlatko,

I am not sure if I hit the same problem as described in this thread.

Under heavy load, while building a customized toolchain for the Freetz
router project I got a BUG || NULL pointer dereference || kswapd ||
zone_balanced || pgdat_balanced() etc. (see my screenshot for details).

I will try your patch from [1] ***only*** on top of my last
Linux-v3.8-rc1 GIT setup (post-v3.8-rc1 mainline + some net-fixes).



Yes, that's the same bug. It should be fixed with my latest patch, so I'd
appreciate you testing it, to be on the safe side this time. There should
be
no difference if you apply it to anything newer than 3.8-rc1, so go for
it.
Thanks!



Not sure how I can really reproduce this bug as one build worked fine
within my last v3.8-rc1 kernel.
I increased the parallel-make-jobs-number from "4" to "8" to stress a
bit harder.
Just building right now... and will report.

If you have any test-case (script or whatever), please let me/us know.



Unfortunately not, I haven't reproduced it yet on my machines. But it seems
that bug will hit only under heavy memory pressure. When close to OOM, or
possibly with lots of writing to disk. It's also possible that fragmentation
of memory zones could provoke it, that means testing it for a longer time.



I tested successfully by doing simultaneously...
- building Freetz with 8 parallel make-jobs
- building Linux GIT with 1 make-job
- 9 tabs open in firefox
- In one tab I ran YouTube music video
- etc.

I am reading [1] and [2] where another user reports success by reverting this...

commit cda73a10eb3f493871ed39f468db50a65ebeddce
"mm: do not sleep in balance_pgdat if there's no i/o congestion"

BTW, this machine has also 4GiB RAM (Ubuntu/precise AMD64).

Feel free to add a "Reported-by/Tested-by" if you think this is a
positive report.



Thanks for the testing! And keep running it in case something 
interesting pops up. ;)


No need to revert cda73a10eb because it fixes another bug. And the patch 
you're now running fixes the new bug I introduced with a combination of 
my latest 2 patches. Nah, it gets complicated... :)


But, at least I found the culprit and as soon as Linus applies the fix, 
everything will be hunky dory again, at least on this front. :P


Thanks,
--
Zlatko


Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

2012-12-27 Thread Zlatko Calusic

On 28.12.2012 00:55, David R. Piegdon wrote:

Hi,

NOTE to everyone debugging this: reproduced quickly with X + firefox +
youtube (adobe flash plugin)


Would you be so kind to test the following patch and report results?
Apply the patch to the latest mainline.


I've had probably the same problem (dmesg below) and currently am trying
your patch applied to current mainline (101e5c7470eb7f). So far it looks
very good. (before: bug after 5-30 minutes, right now 1h and counting)



That's good news, except the oops you've attached belongs to another 
bug, it seems. :P


People report good results when applying Hillf Danton's suggestion to 
revert 5a505085f0 and 4fc3f1d66b1. So, if the bug reappears, you could 
help testing with the same procedure.
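
(Concretely, something like the following; the exact invocation is an
assumption on my side, adjust as needed:

	git revert 4fc3f1d66b1 5a505085f0

on top of the tree being tested, resolving conflicts if any.)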


[Cc: linux-mm list]


thanks!


[  105.164610] [ cut here ]
[  105.164614] kernel BUG at mm/huge_memory.c:1798!
[  105.164617] invalid opcode:  [#1] PREEMPT SMP
[  105.164621] Modules linked in: fuse sha256_generic xt_owner xt_LOG xt_limit 
xt_recent xt_conntrack xt_multiport iptable_mangle xt_DSCP iptable_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack fbcon font 
bitblit softcursor fb fbdev hwmon_vid btrfs zlib_deflate zlib_inflate xfs 
libcrc32c snd_usb_audio uvcvideo snd_usbmidi_lib videobuf2_core snd_rawmidi 
videobuf2_vmalloc videobuf2_memops hid_kensington iTCO_wdt joydev gpio_ich 
iTCO_vendor_support raid1 fglrx(PO) coretemp kvm_intel kvm skge acpi_cpufreq 
lpc_ich serio_raw asus_atk0110 snd_hda_codec_hdmi intel_agp snd_hda_intel mperf 
intel_gtt processor snd_hda_codec sky2 agpgart snd_hwdep [last unloaded: 
iTCO_wdt]
[  105.164672] CPU 1
[  105.164677] Pid: 4091, comm: XPCOM CC Tainted: P   O 3.8.0-rc1+ #43 
System manufacturer System Product Name/P5B-Deluxe
[  105.164679] RIP: 0010:[]  [] 
__split_huge_page+0x216/0x240
[  105.164688] RSP: 0018:880091511c48  EFLAGS: 00010297
[  105.164690] RAX: 0001 RBX: 8800a210c000 RCX: 0042
[  105.164692] RDX: 00cb RSI: 0046 RDI: 81b28a20
[  105.164694] RBP: 880091511ca8 R08:  R09: 
[  105.164696] R10: 043d R11: 0001 R12: 8800a2295c60
[  105.164698] R13: ea00021e R14:  R15: 0007f5134600
[  105.164701] FS:  7f514991e700() GS:8800bfc8() 
knlGS:
[  105.164703] CS:  0010 DS:  ES:  CR0: 8005003b
[  105.164705] CR2: 7f5123bff000 CR3: 9531b000 CR4: 07e0
[  105.164707] DR0:  DR1:  DR2: 
[  105.164709] DR3:  DR6: 0ff0 DR7: 0400
[  105.164712] Process XPCOM CC (pid: 4091, threadinfo 88009151, task 
8800953616b0)
[  105.164713] Stack:
[  105.164715]  8800 8800b9c834b0 7f513480 
8158c4a5
[  105.164719]  8800a210c064 7f513460 880091511ca8 
ea00021e
[  105.164723]  8800b9c83480 8800a210c000 88009fdc1d18 
8800a210c064
[  105.164727] Call Trace:
[  105.164732]  [] split_huge_page+0x68/0xb0
[  105.164736]  [] __split_huge_page_pmd+0x1a8/0x220
[  105.164740]  [] unmap_page_range+0x1b6/0x2d0
[  105.164744]  [] unmap_single_vma+0x5b/0xe0
[  105.164747]  [] zap_page_range+0xbc/0x120
[  105.164752]  [] ? futex_wake+0x116/0x130
[  105.164756]  [] ? pick_next_task_fair+0x36/0xb0
[  105.164760]  [] madvise_vma+0xf7/0x140
[  105.164764]  [] ? find_vma_prev+0x12/0x60
[  105.164767]  [] sys_madvise+0x23d/0x330
[  105.164772]  [] system_call_fastpath+0x16/0x1b
[  105.164774] Code: 48 89 df e8 ed 10 ff ff e9 ab fe ff ff 0f 0b 41 8b 55 18 8b 75 
bc ff c2 48 c7 c7 38 0e 7d 81 31 c0 e8 13 9b 46 00 e9 15 ff ff ff <0f> 0b 41 8b 
4d 18 89 da ff c1 8b 75 bc 48 c7 c7 58 0e 7d 81 31
[  105.164814] RIP  [] __split_huge_page+0x216/0x240
[  105.164818]  RSP 
[  105.164823] ---[ end trace 00c060fd7d17a3d4 ]---




--
Zlatko


Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

2012-12-27 Thread Zlatko Calusic

On 28.12.2012 00:42, Sedat Dilek wrote:

On Fri, Dec 28, 2012 at 12:39 AM, Zlatko Calusic
 wrote:

On 28.12.2012 00:30, Sedat Dilek wrote:


Hi Zlatko,

I am not sure if I hit the same problem as described in this thread.

Under heavy load, while building a customized toolchain for the Freetz
router project I got a BUG || NULL pointer dereference || kswapd ||
zone_balanced || pgdat_balanced() etc. (see my screenshot for details).

I will try your patch from [1] ***only*** on top of my last
Linux-v3.8-rc1 GIT setup (post-v3.8-rc1 mainline + some net-fixes).



Yes, that's the same bug. It should be fixed with my latest patch, so I'd
appreciate you testing it, to be on the safe side this time. There should be
no difference if you apply it to anything newer than 3.8-rc1, so go for it.
Thanks!



Not sure how I can really reproduce this bug as one build worked fine
within my last v3.8-rc1 kernel.
I increased the parallel-make-jobs-number from "4" to "8" to stress a
bit harder.
Just building right now... and will report.

If you have any test-case (script or whatever), please let me/us know.



Unfortunately not, I haven't reproduced it yet on my machines. But it 
seems that bug will hit only under heavy memory pressure. When close to 
OOM, or possibly with lots of writing to disk. It's also possible that 
fragmentation of memory zones could provoke it, that means testing it 
for a longer time.


--
Zlatko


Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

2012-12-27 Thread Zlatko Calusic

On 28.12.2012 00:30, Sedat Dilek wrote:

Hi Zlatko,

I am not sure if I hit the same problem as described in this thread.

Under heavy load, while building a customized toolchain for the Freetz
router project I got a BUG || NULL pointer dereference || kswapd ||
zone_balanced || pgdat_balanced() etc. (see my screenshot for details).

I will try your patch from [1] ***only*** on top of my last
Linux-v3.8-rc1 GIT setup (post-v3.8-rc1 mainline + some net-fixes).



Yes, that's the same bug. It should be fixed with my latest patch, so 
I'd appreciate you testing it, to be on the safe side this time. There 
should be no difference if you apply it to anything newer than 3.8-rc1, 
so go for it. Thanks!


Regards,
--
Zlatko


Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion

2012-12-27 Thread Zlatko Calusic

On 21.12.2012 12:51, Hillf Danton wrote:

On Thu, Dec 20, 2012 at 7:25 AM, Zlatko Calusic  wrote:

  static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 int *classzone_idx)
  {
-   int all_zones_ok;
+   struct zone *unbalanced_zone;


nit: less hunks if not erase that mark

Hillf


This one left unanswered and forgotten because I didn't understand what 
you meant. Could you elaborate?


--
Zlatko


Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

2012-12-27 Thread Zlatko Calusic
On 26.12.2012 12:22, Zhouping Liu wrote:
> Hello everyone,
> 
> The latest mainline (637704cbc95c) triggers the following error when the 
> system is under some pressure (in my testing, I used the oom01 case from 
> the LTP test suite to trigger the issue):
> 
> [ 5462.920151] BUG: unable to handle kernel NULL pointer dereference at 
> 0500
> [ 5462.927991] IP: [] wait_iff_congested+0x59/0x140
> [ 5462.934176] PGD 0
> [ 5462.936191] Oops:  [#2] SMP
> [ 5462.939428] Modules linked in: lockd sunrpc iptable_mangle ipt_REJECT 
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter 
> ebtables ip6table_filter ip6_tables iptable_filter ip_tabled
> [ 5462.984261] CPU 13
> [ 5462.986184] Pid: 117, comm: kswapd3 Tainted: G  D  3.8.0-rc1+ #1 
> Dell Inc. PowerEdge M905/0D413F
> [ 5462.995814] RIP: 0010:[]  [] 
> wait_iff_congested+0x59/0x140
> [ 5463.004411] RSP: 0018:88007c97fd48  EFLAGS: 00010202
> [ 5463.009701] RAX: 0001 RBX: 0064 RCX: 
> 0001
> [ 5463.016818] RDX: 0064 RSI:  RDI: 
> 
> [ 5463.023926] RBP: 88007c97fd98 R08:  R09: 
> 88022ffd9d80
> [ 5463.031033] R10: 3189 R11:  R12: 
> 0001004ee87e
> [ 5463.038140] R13: 0002 R14:  R15: 
> 88022ffd9000
> [ 5463.045258] FS:  7f3e570de740() GS:88022fcc() 
> knlGS:
> [ 5463.053317] CS:  0010 DS:  ES:  CR0: 8005003b
> [ 5463.059041] CR2: 0500 CR3: 018dc000 CR4: 
> 07e0
> [ 5463.066157] DR0:  DR1:  DR2: 
> 
> [ 5463.073276] DR3:  DR6: 0ff0 DR7: 
> 0400
> [ 5463.080400] Process kswapd3 (pid: 117, threadinfo 88007c97e000, task 
> 88007c981970)
> [ 5463.088633] Stack:
> [ 5463.090646]  88007c97fd98  88007c981970 
> 81086080
> [ 5463.098090]  88007c97fd68 88007c97fd68 88022ffd9d80 
> 0002
> [ 5463.105527]  0002  88007c97feb8 
> 8114b0e3
> [ 5463.112998] Call Trace:
> [ 5463.115446]  [] ? wake_up_bit+0x40/0x40
> [ 5463.120826]  [] kswapd+0x6c3/0xa50
> [ 5463.125775]  [] ? zone_reclaim+0x270/0x270
> [ 5463.131415]  [] kthread+0xc0/0xd0
> [ 5463.136278]  [] ? kthread_create_on_node+0x120/0x120
> [ 5463.142786]  [] ret_from_fork+0x7c/0xb0
> [ 5463.148166]  [] ? kthread_create_on_node+0x120/0x120
> [ 5463.154668] Code: 4e 6d 88 00 48 c7 45 b8 00 00 00 00 48 83 c0 18 48 c7 45 
> c8 80 60 08 81 48 89 45 d0 48 89 45 d8 8b 04 b5 a0 9a cd 81 85 c0 74 0f <48> 
> 8b 87 00 05 00 00 a8 04 0f 85 98 00 00 00 e8 b3 c3
> [ 5463.174097] RIP  [] wait_iff_congested+0x59/0x140
> [ 5463.180352]  RSP 
> [ 5463.183824] CR2: 0500
> [ 5463.203717] ---[ end trace 9ff4ff9087c13a36 ]---
> 
> I attached the config file, hope it can make some help.
> 
> Thanks,
> Zhouping
> 

Thank you for the report Zhouping!

Would you be so kind to test the following patch and report results? Apply the 
patch to the latest mainline.

Thanks,

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 23291b9..e55ce55 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2564,6 +2564,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int 
order, long remaining,
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int *classzone_idx)
 {
+   bool pgdat_is_balanced = false;
struct zone *unbalanced_zone;
int i;
int end_zone = 0;   /* Inclusive.  0 = ZONE_DMA */
@@ -2638,8 +2639,11 @@ loop_again:
zone_clear_flag(zone, ZONE_CONGESTED);
}
}
-   if (i < 0)
+
+   if (i < 0) {
+   pgdat_is_balanced = true;
goto out;
+   }
 
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
@@ -2766,8 +2770,11 @@ loop_again:
pfmemalloc_watermark_ok(pgdat))
wake_up(&pgdat->pfmemalloc_wait);
 
-   if (pgdat_balanced(pgdat, order, *classzone_idx))
+   if (pgdat_balanced(pgdat, order, *classzone_idx)) {
+   pgdat_is_balanced = true;
break;  /* kswapd: all done */
+   }
+
/*
 * OK, kswapd is getting into trouble.  Take a nap, then take
 * another pass across the zones.
@@ -2775,7 +2782,7 @@ loop_again:
if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) {
if (has_under_min_watermark_zone)
count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
-   else
+   else if (unbalanced_zone)
wait_iff_congested(unbalanced_zone, BLK_RW_ASYNC, HZ/10);
}

[PATCH] mm: avoid calling pgdat_balanced() needlessly

2012-12-26 Thread Zlatko Calusic
Now that balance_pgdat() is slightly tidied up, thanks to more capable
pgdat_balanced(), it's become obvious that pgdat_balanced() is called
to check the status, then break the loop if pgdat is balanced, just to
be immediately called again. The second call is completely unnecessary,
of course.

The patch introduces pgdat_is_balanced boolean, which helps resolve the
above suboptimal behavior, with the added benefit of slightly better
documenting one other place in the function where we jump and skip lots
of code.

Signed-off-by: Zlatko Calusic 
Cc: Andrew Morton 
Cc: Mel Gorman 
Cc: Hugh Dickins 
---
 mm/vmscan.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 23291b9..02bcfa3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2564,6 +2564,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int 
order, long remaining,
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int *classzone_idx)
 {
+   bool pgdat_is_balanced = false;
struct zone *unbalanced_zone;
int i;
int end_zone = 0;   /* Inclusive.  0 = ZONE_DMA */
@@ -2638,8 +2639,11 @@ loop_again:
zone_clear_flag(zone, ZONE_CONGESTED);
}
}
-   if (i < 0)
+
+   if (i < 0) {
+   pgdat_is_balanced = true;
goto out;
+   }
 
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
@@ -2766,8 +2770,11 @@ loop_again:
pfmemalloc_watermark_ok(pgdat))
wake_up(&pgdat->pfmemalloc_wait);
 
-   if (pgdat_balanced(pgdat, order, *classzone_idx))
+   if (pgdat_balanced(pgdat, order, *classzone_idx)) {
+   pgdat_is_balanced = true;
break;  /* kswapd: all done */
+   }
+
/*
 * OK, kswapd is getting into trouble.  Take a nap, then take
 * another pass across the zones.
@@ -2788,9 +2795,9 @@ loop_again:
if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
break;
} while (--sc.priority >= 0);
-out:
 
-   if (!pgdat_balanced(pgdat, order, *classzone_idx)) {
+out:
+   if (!pgdat_is_balanced) {
cond_resched();
 
try_to_freeze();
-- 
1.8.1.rc0

-- 
Zlatko


Re: BUG: unable to handle kernel NULL pointer dereference at 0000000000000500

2012-12-26 Thread Zlatko Calusic

On 26.12.2012 12:22, Zhouping Liu wrote:

Hello everyone,

The latest mainline (637704cbc95c) triggers the following error when the 
system is under some pressure (in my testing, I used the oom01 case from the 
LTP test suite to trigger the issue):

[ 5462.920151] BUG: unable to handle kernel NULL pointer dereference at 
0500
[ 5462.927991] IP: [] wait_iff_congested+0x59/0x140
[ 5462.934176] PGD 0
[ 5462.936191] Oops:  [#2] SMP
[ 5462.939428] Modules linked in: lockd sunrpc iptable_mangle ipt_REJECT 
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter 
ebtables ip6table_filter ip6_tables iptable_filter ip_tabled
[ 5462.984261] CPU 13
[ 5462.986184] Pid: 117, comm: kswapd3 Tainted: G  D  3.8.0-rc1+ #1 
Dell Inc. PowerEdge M905/0D413F
[ 5462.995814] RIP: 0010:[]  [] 
wait_iff_congested+0x59/0x140
[ 5463.004411] RSP: 0018:88007c97fd48  EFLAGS: 00010202
[ 5463.009701] RAX: 0001 RBX: 0064 RCX: 0001
[ 5463.016818] RDX: 0064 RSI:  RDI: 
[ 5463.023926] RBP: 88007c97fd98 R08:  R09: 88022ffd9d80
[ 5463.031033] R10: 3189 R11:  R12: 0001004ee87e
[ 5463.038140] R13: 0002 R14:  R15: 88022ffd9000
[ 5463.045258] FS:  7f3e570de740() GS:88022fcc() 
knlGS:
[ 5463.053317] CS:  0010 DS:  ES:  CR0: 8005003b
[ 5463.059041] CR2: 0500 CR3: 018dc000 CR4: 07e0
[ 5463.066157] DR0:  DR1:  DR2: 
[ 5463.073276] DR3:  DR6: 0ff0 DR7: 0400
[ 5463.080400] Process kswapd3 (pid: 117, threadinfo 88007c97e000, task 
88007c981970)
[ 5463.088633] Stack:
[ 5463.090646]  88007c97fd98  88007c981970 
81086080
[ 5463.098090]  88007c97fd68 88007c97fd68 88022ffd9d80 
0002
[ 5463.105527]  0002  88007c97feb8 
8114b0e3
[ 5463.112998] Call Trace:
[ 5463.115446]  [] ? wake_up_bit+0x40/0x40
[ 5463.120826]  [] kswapd+0x6c3/0xa50
[ 5463.125775]  [] ? zone_reclaim+0x270/0x270
[ 5463.131415]  [] kthread+0xc0/0xd0
[ 5463.136278]  [] ? kthread_create_on_node+0x120/0x120
[ 5463.142786]  [] ret_from_fork+0x7c/0xb0
[ 5463.148166]  [] ? kthread_create_on_node+0x120/0x120
[ 5463.154668] Code: 4e 6d 88 00 48 c7 45 b8 00 00 00 00 48 83 c0 18 48 c7 45 c8 80 
60 08 81 48 89 45 d0 48 89 45 d8 8b 04 b5 a0 9a cd 81 85 c0 74 0f <48> 8b 87 00 
05 00 00 a8 04 0f 85 98 00 00 00 e8 b3 c3
[ 5463.174097] RIP  [] wait_iff_congested+0x59/0x140
[ 5463.180352]  RSP 
[ 5463.183824] CR2: 0500
[ 5463.203717] ---[ end trace 9ff4ff9087c13a36 ]---

I attached the config file, hope it can make some help.

Thanks,
Zhouping



If I'm decoding it properly, this translates to:

0x811542e9 is in wait_iff_congested (/usr/src/linux/arch/x86/include/asm/bitops.h:321).

316 }
317 
318 static __always_inline int constant_test_bit(unsigned int nr, const volatile unsigned long *addr)
319 {
320 return ((1UL << (nr % BITS_PER_LONG)) &
321 (addr[nr / BITS_PER_LONG])) != 0;
322 }
323 
324 static inline int variable_test_bit(int nr, volatile const unsigned long *addr)
325 {
0x811542e8 is in wait_iff_congested (mm/backing-dev.c:815).
810 /*
811  * If there is no congestion, or heavy congestion is not being
812  * encountered in the current zone, yield if necessary instead
813  * of sleeping on the congestion queue
814  */
815 if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
816 !zone_is_reclaim_congested(zone)) {
817 cond_resched();
818 
819 /* In case we scheduled, work out time remaining */

All code

   0:   4e 6d                   rex.WRX insl (%dx),%es:(%rdi)
   2:   88 00                   mov    %al,(%rax)
   4:   48 c7 45 b8 00 00 00    movq   $0x0,-0x48(%rbp)
   b:   00
   c:   48 83 c0 18             add    $0x18,%rax
  10:   48 c7 45 c8 80 60 08    movq   $0x81086080,-0x38(%rbp)
  17:   81
  18:   48 89 45 d0             mov    %rax,-0x30(%rbp)
  1c:   48 89 45 d8             mov    %rax,-0x28(%rbp)
  20:   8b 04 b5 a0 9a cd 81    mov    -0x7e326560(,%rsi,4),%eax
  27:   85 c0                   test   %eax,%eax
  29:   74 0f                   je     0x3a
  2b:*  48 8b 87 00 05 00 00    mov    0x500(%rdi),%rax    <-- trapping instruction
  32:   a8 04                   test   $0x4,%al
  34:   0f 85 98 00 00 00       jne    0xd2
  3a:   e8                      .byte 0xe8
  3b:   b3 c3                   mov    $0xc3,%bl

I remember when I was instrumenting vmscan.c to see which of the 
congestion_wait() calls was making trouble, the only place that really 
called it was this one in balance_pgdat().

[PATCH v2] mm: modify pgdat_balanced() so that it also handles order=0

2012-12-23 Thread Zlatko Calusic
On 22.12.2012 19:54, Zlatko Calusic wrote:
> On 20.12.2012 21:58, Andrew Morton wrote:
>> There seems to be some complexity/duplication here between the new
>> unbalanced_zone() and pgdat_balanced().
>>
>> Can we modify pgdat_balanced() so that it also handles order=0, then do
>>
>> -if (!unbalanced_zone || (order && pgdat_balanced(pgdat, 
>> balanced, *classzone_idx)))
>> +if (!pgdat_balanced(...))
>>
>> ?
>>
> 
> Makes sense, I like the idea! Took me some time to wrap my mind around
> all the logic in balance_pgdat(), while writing my previous patch. Also had
> to revert one if-case logic to avoid double negation, which would be even
> harder to grok. But unbalanced_zone (var, not a function!) has to stay because
> wait_iff_congested() needs a struct zone* param. Here's my take on the 
> subject:
> 

And now also with 3 unused variables removed, no other changes.
prepare_kswapd_sleep() now looks so beautiful. ;)

I've been testing the patch on 3 different machines, with no problems at 
all. One of those I pushed hard, and it survived.

---8<---
mm: modify pgdat_balanced() so that it also handles order=0

Teach pgdat_balanced() about order-0 allocations so that we can simplify
code in a few places in vmscan.c.

Signed-off-by: Zlatko Calusic 
---
 mm/vmscan.c | 105 ++--
 1 file changed, 45 insertions(+), 60 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index adc7e90..23291b9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2452,12 +2452,16 @@ static bool zone_balanced(struct zone *zone, int order,
 }
 
 /*
- * pgdat_balanced is used when checking if a node is balanced for high-order
- * allocations. Only zones that meet watermarks and are in a zone allowed
- * by the callers classzone_idx are added to balanced_pages. The total of
- * balanced pages must be at least 25% of the zones allowed by classzone_idx
- * for the node to be considered balanced. Forcing all zones to be balanced
- * for high orders can cause excessive reclaim when there are imbalanced zones.
+ * pgdat_balanced() is used when checking if a node is balanced.
+ *
+ * For order-0, all zones must be balanced!
+ *
+ * For high-order allocations only zones that meet watermarks and are in a
+ * zone allowed by the callers classzone_idx are added to balanced_pages. The
+ * total of balanced pages must be at least 25% of the zones allowed by
+ * classzone_idx for the node to be considered balanced. Forcing all zones to
+ * be balanced for high orders can cause excessive reclaim when there are
+ * imbalanced zones.
  * The choice of 25% is due to
  *   o a 16M DMA zone that is balanced will not balance a zone on any
  * reasonable sized machine
@@ -2467,17 +2471,43 @@ static bool zone_balanced(struct zone *zone, int order,
  * Similarly, on x86-64 the Normal zone would need to be at least 1G
  * to balance a node on its own. These seemed like reasonable ratios.
  */
-static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
-   int classzone_idx)
+static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 {
unsigned long present_pages = 0;
+   unsigned long balanced_pages = 0;
int i;
 
-   for (i = 0; i <= classzone_idx; i++)
-   present_pages += pgdat->node_zones[i].present_pages;
+   /* Check the watermark levels */
+   for (i = 0; i <= classzone_idx; i++) {
+   struct zone *zone = pgdat->node_zones + i;
 
-   /* A special case here: if zone has no page, we think it's balanced */
-   return balanced_pages >= (present_pages >> 2);
+   if (!populated_zone(zone))
+   continue;
+
+   present_pages += zone->present_pages;
+
+   /*
+* A special case here:
+*
+* balance_pgdat() skips over all_unreclaimable after
+* DEF_PRIORITY. Effectively, it considers them balanced so
+* they must be considered balanced here as well!
+*/
+   if (zone->all_unreclaimable) {
+   balanced_pages += zone->present_pages;
+   continue;
+   }
+
+   if (zone_balanced(zone, order, 0, i))
+   balanced_pages += zone->present_pages;
+   else if (!order)
+   return false;
+   }
+
+   if (order)
+   return balanced_pages >= (present_pages >> 2);
+   else
+   return true;
 }
 
 /*
@@ -2489,10 +2519,6 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned 
long balanced_pages,
 static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 

[PATCH] mm: modify pgdat_balanced() so that it also handles order=0

2012-12-22 Thread Zlatko Calusic
On 20.12.2012 21:58, Andrew Morton wrote:
> There seems to be some complexity/duplication here between the new
> unbalanced_zone() and pgdat_balanced().
> 
> Can we modify pgdat_balanced() so that it also handles order=0, then do
> 
> - if (!unbalanced_zone || (order && pgdat_balanced(pgdat, 
> balanced, *classzone_idx)))
> + if (!pgdat_balanced(...))
> 
> ?
> 

Makes sense, I like the idea! Took me some time to wrap my mind around
all the logic in balance_pgdat(), while writing my previous patch. Also had
to invert one if condition to avoid double negation, which would be even
harder to grok. But unbalanced_zone (var, not a function!) has to stay because
wait_iff_congested() needs a struct zone* param. Here's my take on the subject:

---8<---
mm: modify pgdat_balanced() so that it also handles order=0

Teach pgdat_balanced() about order-0 allocations so that we can simplify
code in a few places in vmscan.c.

Signed-off-by: Zlatko Calusic 
---
 mm/vmscan.c | 101 +++-
 1 file changed, 45 insertions(+), 56 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index adc7e90..0d15d99 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2452,12 +2452,16 @@ static bool zone_balanced(struct zone *zone, int order,
 }
 
 /*
- * pgdat_balanced is used when checking if a node is balanced for high-order
- * allocations. Only zones that meet watermarks and are in a zone allowed
- * by the callers classzone_idx are added to balanced_pages. The total of
- * balanced pages must be at least 25% of the zones allowed by classzone_idx
- * for the node to be considered balanced. Forcing all zones to be balanced
- * for high orders can cause excessive reclaim when there are imbalanced zones.
+ * pgdat_balanced() is used when checking if a node is balanced.
+ *
+ * For order-0, all zones must be balanced!
+ *
+ * For high-order allocations only zones that meet watermarks and are in a
+ * zone allowed by the callers classzone_idx are added to balanced_pages. The
+ * total of balanced pages must be at least 25% of the zones allowed by
+ * classzone_idx for the node to be considered balanced. Forcing all zones to
+ * be balanced for high orders can cause excessive reclaim when there are
+ * imbalanced zones.
  * The choice of 25% is due to
  *   o a 16M DMA zone that is balanced will not balance a zone on any
  * reasonable sized machine
@@ -2467,17 +2471,43 @@ static bool zone_balanced(struct zone *zone, int order,
  * Similarly, on x86-64 the Normal zone would need to be at least 1G
  * to balance a node on its own. These seemed like reasonable ratios.
  */
-static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
-   int classzone_idx)
+static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 {
unsigned long present_pages = 0;
+   unsigned long balanced_pages = 0;
int i;
 
-   for (i = 0; i <= classzone_idx; i++)
-   present_pages += pgdat->node_zones[i].present_pages;
+   /* Check the watermark levels */
+   for (i = 0; i <= classzone_idx; i++) {
+   struct zone *zone = pgdat->node_zones + i;
+
+   if (!populated_zone(zone))
+   continue;
+
+   present_pages += zone->present_pages;
+
+   /*
+* A special case here:
+*
+* balance_pgdat() skips over all_unreclaimable after
+* DEF_PRIORITY. Effectively, it considers them balanced so
+* they must be considered balanced here as well!
+*/
+   if (zone->all_unreclaimable) {
+   balanced_pages += zone->present_pages;
+   continue;
+   }
+
+   if (zone_balanced(zone, order, 0, i))
+   balanced_pages += zone->present_pages;
+   else if (!order)
+   return false;
+   }
 
-   /* A special case here: if zone has no page, we think it's balanced */
-   return balanced_pages >= (present_pages >> 2);
+   if (order)
+   return balanced_pages >= (present_pages >> 2);
+   else
+   return true;
 }
 
 /*
@@ -2511,39 +2541,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int 
order, long remaining,
return false;
}
 
-   /* Check the watermark levels */
-   for (i = 0; i <= classzone_idx; i++) {
-   struct zone *zone = pgdat->node_zones + i;
-
-   if (!populated_zone(zone))
-   continue;
-
-   /*
-* balance_pgdat() skips over all_unreclaimable after
-* DEF_PRIORITY. Effectively, it considers them balanced so
-  

Re: [PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion

2012-12-19 Thread Zlatko Calusic
On a 4GB RAM machine, where Normal zone is much smaller than
DMA32 zone, the Normal zone gets fragmented in time. This requires
relatively more pressure in balance_pgdat to get the zone above the
required watermark. Unfortunately, the congestion_wait() call in there
slows it down for a completely wrong reason, expecting that there's
a lot of writeback/swapout, even when there's none (much more common).
After a few days, when fragmentation progresses, this flawed logic
translates to a very high CPU iowait times, even though there's no
I/O congestion at all. If THP is enabled, the problem occurs sooner,
but I was able to see it even on !THP kernels, just by giving it a bit
more time to occur.

The proper way to deal with this is to not wait, unless there's
congestion. Thanks to Mel Gorman, we already have the function that
perfectly fits the job. The patch was tested on a machine which
nicely revealed the problem after only 1 day of uptime, and it's been
working great.

Signed-off-by: Zlatko Calusic 
---
 mm/vmscan.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b7ed376..4588d1d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2546,7 +2546,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int 
order, long remaining,
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int *classzone_idx)
 {
-   int all_zones_ok;
+   struct zone *unbalanced_zone;
unsigned long balanced;
int i;
int end_zone = 0;   /* Inclusive.  0 = ZONE_DMA */
@@ -2580,7 +2580,7 @@ loop_again:
unsigned long lru_pages = 0;
int has_under_min_watermark_zone = 0;
 
-   all_zones_ok = 1;
+   unbalanced_zone = NULL;
balanced = 0;
 
/*
@@ -2719,7 +2719,7 @@ loop_again:
}
 
if (!zone_balanced(zone, testorder, 0, end_zone)) {
-   all_zones_ok = 0;
+   unbalanced_zone = zone;
/*
 * We are still under min water mark.  This
 * means that we have a GFP_ATOMIC allocation
@@ -2752,7 +2752,7 @@ loop_again:
pfmemalloc_watermark_ok(pgdat))
wake_up(&pgdat->pfmemalloc_wait);
 
-   if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, 
*classzone_idx)))
+   if (!unbalanced_zone || (order && pgdat_balanced(pgdat, 
balanced, *classzone_idx)))
break;  /* kswapd: all done */
/*
 * OK, kswapd is getting into trouble.  Take a nap, then take
@@ -2762,7 +2762,7 @@ loop_again:
if (has_under_min_watermark_zone)
count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
else
-   congestion_wait(BLK_RW_ASYNC, HZ/10);
+   wait_iff_congested(unbalanced_zone, 
BLK_RW_ASYNC, HZ/10);
}
 
/*
@@ -2781,7 +2781,7 @@ out:
 * high-order: Balanced zones must make up at least 25% of the node
 * for the node to be balanced
 */
-   if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced, 
*classzone_idx)))) {
+   if (unbalanced_zone && (!order || !pgdat_balanced(pgdat, balanced, 
*classzone_idx))) {
cond_resched();
 
try_to_freeze();
-- 1.7.10.4
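
P.S. For anyone wondering what the behavioural difference is, here is a
simplified sketch of wait_iff_congested() (not the exact mm/backing-dev.c
source; the real function also measures and returns the time spent):

	long wait_iff_congested(struct zone *zone, int sync, long timeout)
	{
		/*
		 * No congested BDI, or the zone is not flagged as
		 * reclaim-congested: don't sleep at all, just yield
		 * the CPU if needed.
		 */
		if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
		    !zone_is_reclaim_congested(zone)) {
			cond_resched();
			return 0;
		}

		/* Real congestion: sleep, as congestion_wait() would. */
		return congestion_wait(sync, timeout);
	}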

-- 
Zlatko (this time with proper Signed-off-by line)


[PATCH] mm: do not sleep in balance_pgdat if there's no i/o congestion

2012-12-19 Thread Zlatko Calusic
On a 4GB RAM machine, where Normal zone is much smaller than
DMA32 zone, the Normal zone gets fragmented in time. This requires
relatively more pressure in balance_pgdat to get the zone above the
required watermark. Unfortunately, the congestion_wait() call in there
slows it down for a completely wrong reason, expecting that there's
a lot of writeback/swapout, even when there's none (much more common).
After a few days, when fragmentation progresses, this flawed logic
translates to very high CPU iowait times, even though there's no
I/O congestion at all. If THP is enabled, the problem occurs sooner,
but I was able to see it even on !THP kernels, just by giving it a bit
more time to occur.

The proper way to deal with this is to not wait, unless there's
congestion. Thanks to Mel Gorman, we already have the function that
perfectly fits the job. The patch was tested on a machine which
nicely revealed the problem after only 1 day of uptime, and it's been
working great.
---
 mm/vmscan.c |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b7ed376..4588d1d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2546,7 +2546,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int 
order, long remaining,
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int *classzone_idx)
 {
-   int all_zones_ok;
+   struct zone *unbalanced_zone;
unsigned long balanced;
int i;
int end_zone = 0;   /* Inclusive.  0 = ZONE_DMA */
@@ -2580,7 +2580,7 @@ loop_again:
unsigned long lru_pages = 0;
int has_under_min_watermark_zone = 0;
 
-   all_zones_ok = 1;
+   unbalanced_zone = NULL;
balanced = 0;
 
/*
@@ -2719,7 +2719,7 @@ loop_again:
}
 
if (!zone_balanced(zone, testorder, 0, end_zone)) {
-   all_zones_ok = 0;
+   unbalanced_zone = zone;
/*
 * We are still under min water mark.  This
 * means that we have a GFP_ATOMIC allocation
@@ -2752,7 +2752,7 @@ loop_again:
pfmemalloc_watermark_ok(pgdat))
wake_up(&pgdat->pfmemalloc_wait);
 
-   if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, 
*classzone_idx)))
+   if (!unbalanced_zone || (order && pgdat_balanced(pgdat, 
balanced, *classzone_idx)))
break;  /* kswapd: all done */
/*
 * OK, kswapd is getting into trouble.  Take a nap, then take
@@ -2762,7 +2762,7 @@ loop_again:
if (has_under_min_watermark_zone)
count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
else
-   congestion_wait(BLK_RW_ASYNC, HZ/10);
+   wait_iff_congested(unbalanced_zone, 
BLK_RW_ASYNC, HZ/10);
}
 
/*
@@ -2781,7 +2781,7 @@ out:
 * high-order: Balanced zones must make up at least 25% of the node
 * for the node to be balanced
 */
-   if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced, 
*classzone_idx)))) {
+   if (unbalanced_zone && (!order || !pgdat_balanced(pgdat, balanced, 
*classzone_idx))) {
cond_resched();
 
try_to_freeze();
-- 
1.7.10.4

-- 
Zlatko


Re: kswapd craziness in 3.7

2012-12-19 Thread Zlatko Calusic

On 11.12.2012 01:19, Zlatko Calusic wrote:

On 10.12.2012 20:13, Linus Torvalds wrote:


It's worth giving this as much testing as is at all possible, but at
the same time I really don't think I can delay 3.7 any more without
messing up the holiday season too much. So unless something obvious
pops up, I will do the release tonight. So testing will be minimal -
but it's not like we haven't gone back-and-forth on this several times
already, and we revert to *mostly* the same old state as 3.6 anyway,
so it should be fairly safe.



So, here's what I found. In short: close, but no cigar!

Kswapd is certainly no more CPU pig, and memory seems to be utilized
properly (the kernel still likes to keep 400MB free, somebody else can
confirm if that's to be expected on a 4GB THP-enabled machine). So it
looks very decent, and much better than anything I run in last 10 days,
barring !THP kernel.

What remains a mystery is that kswapd occassionaly still likes to get
stuck in a D state, only now it recovers faster than before (sometimes
in a matter of seconds, but sometimes it takes a few minutes). Now, I
admit it's a small, maybe even cosmetic issue. But, it could also be a
warning sign of a bigger problem that will reveal itself on a more
loaded machine.



Ha, I nailed it!

The cigar, aka the explanation, together with a patch, will follow shortly 
in a separate topic.


It's a genuine bug that has been with us for a long long time.
--
Zlatko


Re: kswapd craziness in 3.7

2012-12-11 Thread Zlatko Calusic

On 11.12.2012 01:19, Zlatko Calusic wrote:


I will now make one last attempt: I've just reverted two of Johannes' commits
that were also applied in an attempt to fix the breakage that removing
GFP_NO_KSWAPD introduced, namely ed23ec4 & c702418. For various reasons
the results of this test will be available tomorrow, so it's your call
Linus.



To be honest, I don't see any difference with those two commits 
reverted. It's like those lines never did much anyway, so it's probably good 
we got rid of them. :P


--
Zlatko


Re: kswapd craziness in 3.7

2012-12-10 Thread Zlatko Calusic

On 10.12.2012 22:54, Borislav Petkov wrote:

On Mon, Dec 10, 2012 at 01:47:23PM -0800, Linus Torvalds wrote:

On Mon, Dec 10, 2012 at 1:42 PM, Borislav Petkov  wrote:


Aren't we gonna consider the out-of-tree vbox modules being loaded and
causing some corruptions like maybe the single-bit error above?

I'm also thinking of this here: https://lkml.org/lkml/2011/10/6/317


Yup, that looks more likely, I agree.


@Zlatko: can your daughter try to retrigger the freeze without the vbox
modules loaded?



Sure thing! :)

Although the vbox modules were only loaded, no VM was running at the 
time the lockup happened. But I've just read the whole thread you mention 
above and I understand the concern. I'll make sure the vbox modules are 
unloaded when not really needed (most of the time on that machine), in 
case lockup happens again.


Next time my daughter plays online games, I'll tell her she's actually 
serving a greater purpose, and let her take her time. :)

--
Zlatko


Re: kswapd craziness in 3.7

2012-12-10 Thread Zlatko Calusic
On 10.12.2012 20:13, Linus Torvalds wrote:
> 
> It's worth giving this as much testing as is at all possible, but at
> the same time I really don't think I can delay 3.7 any more without
> messing up the holiday season too much. So unless something obvious
> pops up, I will do the release tonight. So testing will be minimal -
> but it's not like we haven't gone back-and-forth on this several times
> already, and we revert to *mostly* the same old state as 3.6 anyway,
> so it should be fairly safe.
> 

It compiles and boots without a hitch, so it must be perfect. :)

Seriously, a few more hours need to pass until I can provide more convincing 
data. That's how long it takes on this particular machine for memory pressure 
to build up and memory fragmentation to ensue. Only then I'll be able to tell 
how it really behaves. I promise to get back as soon as I can.

And funny thing that you mention i915, because yesterday my daughter managed to 
lock up our laptop hard (that was a first), and this is what I found in 
kern.log after restart:

Dec  9 21:29:42 titan vmunix: general protection fault:  [#1] PREEMPT SMP 
Dec  9 21:29:42 titan vmunix: Modules linked in: vboxpci(O) vboxnetadp(O) 
vboxnetflt(O) vboxdrv(O) [last unloaded: microcode]
Dec  9 21:29:42 titan vmunix: CPU 2 
Dec  9 21:29:42 titan vmunix: Pid: 2523, comm: Xorg Tainted: G   O 
3.7.0-rc8 #1 Hewlett-Packard HP Pavilion dv7 Notebook PC/144B
Dec  9 21:29:42 titan vmunix: RIP: 0010:[]  
[] find_get_page+0x3c/0x90
Dec  9 21:29:42 titan vmunix: RSP: 0018:88014d9f7928  EFLAGS: 00010246
Dec  9 21:29:42 titan vmunix: RAX: 880052594bc8 RBX: 0200 RCX: 
fffa
Dec  9 21:29:42 titan vmunix: RDX: 0001 RSI: 880052594bc8 RDI: 

Dec  9 21:29:42 titan vmunix: RBP: 88014d9f7948 R08: 0200 R09: 
880052594b18
Dec  9 21:29:42 titan vmunix: R10: 57ffe4cbb74d1280 R11:  R12: 
88011c959a90
Dec  9 21:29:42 titan vmunix: R13: 0053 R14:  R15: 
0053
Dec  9 21:29:42 titan vmunix: FS:  7fcd8d413880() 
GS:880157c8() knlGS:
Dec  9 21:29:42 titan vmunix: CS:  0010 DS:  ES:  CR0: 80050033
Dec  9 21:29:42 titan vmunix: CR2: ff600400 CR3: 00014d937000 CR4: 
07e0
Dec  9 21:29:42 titan vmunix: DR0:  DR1:  DR2: 

Dec  9 21:29:42 titan vmunix: DR3:  DR6: 0ff0 DR7: 
0400
Dec  9 21:29:42 titan vmunix: Process Xorg (pid: 2523, threadinfo 
88014d9f6000, task 88014d9c1260)
Dec  9 21:29:42 titan vmunix: Stack:
Dec  9 21:29:42 titan vmunix:  88014d9f7958 88011c959a88 
0053 88011c959a88
Dec  9 21:29:42 titan vmunix:  88014d9f7978 81090e21 
0001 ea00014d1280
Dec  9 21:29:42 titan vmunix:  88011c959960 0001 
88014d9f7a28 810a1b60
Dec  9 21:29:42 titan vmunix: Call Trace:
Dec  9 21:29:42 titan vmunix:  [] find_lock_page+0x21/0x80
Dec  9 21:29:42 titan vmunix:  [] shmem_getpage_gfp+0xa0/0x620
Dec  9 21:29:42 titan vmunix:  [] 
shmem_read_mapping_page_gfp+0x2c/0x50
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_object_get_pages_gtt+0xe1/0x270
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_object_get_pages+0x4f/0x90
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_object_bind_to_gtt+0xc3/0x4c0
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_object_pin+0x123/0x190
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_execbuffer_reserve_object.isra.13+0x77/0x190
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_execbuffer_reserve.isra.14+0x2c1/0x320
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_do_execbuffer.isra.17+0x5e2/0x11b0
Dec  9 21:29:42 titan vmunix:  [] 
i915_gem_execbuffer2+0x94/0x280
Dec  9 21:29:42 titan vmunix:  [] drm_ioctl+0x493/0x530
Dec  9 21:29:42 titan vmunix:  [] ? 
i915_gem_execbuffer+0x480/0x480
Dec  9 21:29:42 titan vmunix:  [] do_vfs_ioctl+0x8f/0x530
Dec  9 21:29:42 titan vmunix:  [] sys_ioctl+0x4b/0x90
Dec  9 21:29:42 titan vmunix:  [] ? sys_read+0x4d/0xa0
Dec  9 21:29:42 titan vmunix:  [] 
system_call_fastpath+0x16/0x1b
Dec  9 21:29:42 titan vmunix: Code: 63 08 48 83 ec 08 e8 84 9c fb ff 4c 89 ee 
4c 89 e7 e8 89 b7 15 00 48 85 c0 48 89 c6 74 41 48 8b 18 48 85 db 74 1f f6 c3 
03 75 3c <8b> 53 1c 85 d2 74 d9 8d 7a 01 89 d0 f0 0f b1 7b 1c 39 c2 75 23 
Dec  9 21:29:42 titan vmunix: RIP  [] find_get_page+0x3c/0x90
Dec  9 21:29:42 titan vmunix:  RSP 

It seems that whenever (if ever?) GFP_NO_KSWAPD removal is attempted again, the 
i915 driver will need to be taken better care of.
-- 
Zlatko


Re: kswapd craziness in 3.7

2012-12-10 Thread Zlatko Calusic

On 10.12.2012 19:01, Mel Gorman wrote:

In this last-minute disaster, I'm not thinking properly at all any more. The
shrink slab disabling should have happened before the loop_again but even
then it's wrong because it's just covering over the problem.

The way order and testorder interact with how balanced is calculated means
that we potentially call shrink_slab() multiple times and that thing is
global in nature and basically uncontrolled. You could argue that we should
only call shrink_slab() if order-0 watermarks are not met but that will
not necessarily prevent kswapd reclaiming too much. It keeps going back
to balance_pgdat needing its list of requirements drawn up and receive
some major surgery and we're not going to do that as a quick hack.



I was about to apply the patch that you sent, and reboot the server, but 
it seems there's no point because the patch is flawed?


Anyway, if and when you have a proper one, I'll be glad to test it for 
you and report results.

--
Zlatko


Re: kswapd craziness in 3.7

2012-12-10 Thread Zlatko Calusic

On 10.12.2012 12:03, Mel Gorman wrote:

There is a big difference between a direct reclaim/compaction for THP
and kswapd doing the same work. Direct reclaim/compaction will try once,
give up quickly and defer requests in the near future to avoid impacting
the system heavily for THP. The same applies for khugepaged.

kswapd is different. It can keep going until it meets its watermarks for
a THP allocation are met. Two reasons why it might keep going for a long
time are that compaction is being inefficient which we know it may be due
to crap like this

end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);

and the second reason is if the highest zone is relatively small, because
compaction_suitable will keep saying that allocations are failing due to
insufficient amounts of memory in the highest zone. It'll reclaim a little
from this highest zone and then shrink_slab() potentially dumping a large
amount of memory. This may be the case for Zlatko as with a 4G machine
his ZONE_NORMAL could be small depending on how the 32-bit address space
is used by his hardware.
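
(For illustration, with assumed numbers: take pageblock_nr_pages = 512 and
low_pfn = 1000. The pageblock holding pfn 1000 ends at pfn 1024, yet
end_pfn = ALIGN(1000 + 512, 512) = 1536, so the scan window crosses
pageblock boundaries whenever low_pfn is not block-aligned.)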



The kernel is 64-bit, if it makes any difference (userspace, though, is 
still 32-bit). There's no swap (swap support not even compiled in). The 
zones are as follows:


On node 0 totalpages: 1048019
  DMA zone: 64 pages used for memmap
  DMA zone: 6 pages reserved
  DMA zone: 3913 pages, LIFO batch:0
  DMA32 zone: 16320 pages used for memmap
  DMA32 zone: 831109 pages, LIFO batch:31
  Normal zone: 3072 pages used for memmap
  Normal zone: 193535 pages, LIFO batch:31

If I understand correctly, you think that because the 193535 pages in 
ZONE_NORMAL are relatively few compared to the 831109 pages of ZONE_DMA32, 
the system has a hard time balancing itself?
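
(For scale, assuming 4 KiB pages: 193535 pages is roughly 756 MB of Normal
versus 831109 pages, or roughly 3.2 GB, of DMA32; Normal is under a quarter
of DMA32.)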


Is there any way I could force and test a different memory layout? I'm 
slightly lost among all the memory models (if I have a choice at all), so 
if you have any suggestions, I'm all ears.


Maybe I could limit available memory and thus have only the DMA32 zone, 
just to prove your theory (a quick sketch of that follows below)? I remember 
doing tuning like that many years ago when I had more time to play with 
Linux MM. Unfortunately I haven't had much time lately, so I'm a bit rusty, 
but I'm willing to help test and resolve this issue.
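
For example (an assumption on my side that this is the simplest knob, not
something I've tried on this box yet): booting with a hard memory limit on
the kernel command line, say

	mem=3G

should keep all remaining RAM below the 32-bit boundary, leaving the Normal
zone empty and DMA32 as the only big zone.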


--
Zlatko


Re: kswapd craziness in 3.7

2012-12-08 Thread Zlatko Calusic

On 08.12.2012 13:06, Zlatko Calusic wrote:

On 06.12.2012 20:31, Linus Torvalds wrote:

Ok, people seem to be reporting success.

I've applied Johannes' last patch with the new tested-by tags.



I've been testing this patch since it was applied, and it certainly
fixes the kswapd craziness issue, good work Johannes!

But, it's still not perfect yet, because I see that the system keeps
lots of memory unused (free), where it previously used it all for the
page cache (there's enough fs activity to warrant it).

I'm now testing the last piece of Johannes' changes (still not in git
tree), and can report results in 24-48 hours.

Regards,


Or sooner... in short: nothing's changed!

On a 4GB RAM system, where applications use close to 2GB, kswapd likes 
to keep around 1GB free (unused), leaving only 1GB for page/buffer 
cache. If I force a bigger page cache by reading a big file and thus use 
the unused 1GB of RAM, kswapd will soon (in a matter of minutes) evict 
those (or other) pages out and once again keep unused memory close to 1GB.


I guess it's not a showstopper, but it still counts as very bad memory 
management, wasting lots of RAM.


As an additional data point, if memory pressure is slightly higher (say 
a backup kicks in, keeping the page cache mostly full), kswapd gets stuck 
in the D (uninterruptible sleep) state (function: congestion_wait) and the 
load average goes up by 1. It recovers only when it successfully throws out 
half of the page cache again.


Hope it helps.
--
Zlatko


Re: kswapd craziness in 3.7

2012-12-08 Thread Zlatko Calusic

On 06.12.2012 20:31, Linus Torvalds wrote:

Ok, people seem to be reporting success.

I've applied Johannes' last patch with the new tested-by tags.



I've been testing this patch since it was applied, and it certainly 
fixes the kswapd craziness issue, good work Johannes!


But, it's still not perfect yet, because I see that the system keeps 
lots of memory unused (free), where it previously used it all for the 
page cache (there's enough fs activity to warrant it).


I'm now testing the last piece of Johannes' changes (still not in git 
tree), and can report results in 24-48 hours.


Regards,
--
Zlatko


Re: High context switch rate, ksoftirqd's chewing cpu

2012-12-01 Thread Zlatko Calusic

On 01.12.2012 20:13, Tejun Heo wrote:

Hello,

On Sat, Dec 01, 2012 at 06:11:10PM +0100, Zlatko Calusic wrote:

Sure. Please clarify, should I apply it on top of the previous one
or standalone?


It's a replacement, so by itself.

Thanks!



I have good news, again. The kernel with the patch applied has been 
running flawlessly for the last hour. No excess context switching.


Regards,
--
Zlatko


Re: High context switch rate, ksoftirqd's chewing cpu

2012-12-01 Thread Zlatko Calusic

On 01.12.2012 15:38, Tejun Heo wrote:

Hello,

On Sat, Dec 01, 2012 at 12:06:41PM +0100, Zlatko Calusic wrote:

I have good news. The patch fixes the regression!

To double-check and provide you with additional data, I updated to the latest Linus
kernel (commit 7c17e48), recompiled (WITHOUT the patch), rebooted and this is
what vmstat 1 looks like:


Awesome, can you please test the following patch too?  Thanks!



Sure. Please clarify, should I apply it on top of the previous one or 
standalone?


--
Zlatko


Re: High context switch rate, ksoftirqd's chewing cpu

2012-12-01 Thread Zlatko Calusic
On 30.11.2012 23:55, Tejun Heo wrote:
> Hello, again.
> 
> Can you please try this patch?  Thanks!
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 042d221..26368ef 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1477,7 +1477,10 @@ bool mod_delayed_work_on(int cpu, struct 
> workqueue_struct *wq,
>   } while (unlikely(ret == -EAGAIN));
>   
>   if (likely(ret >= 0)) {
> - __queue_delayed_work(cpu, wq, dwork, delay);
> + if (!delay)
> + __queue_work(cpu, wq, &dwork->work);
> + else
> + __queue_delayed_work(cpu, wq, dwork, delay);
>   local_irq_restore(flags);
>   }
>   
> 

I have good news. The patch fixes the regression!

To double-check and provide you with additional data, I updated to the latest Linus
kernel (commit 7c17e48), recompiled (WITHOUT the patch), rebooted and this is
what vmstat 1 looks like:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0  0 2957924  43676 65584000   868   460 1446 38112  1  1 75 23
 0  0  0 2957436  43684 65567200 8  1290  941 30743  1  2 96  1
 0  0  0 2957416  43708 65612000   300   501  755 23642  1  2 89  9
 1  0  0 2957648  43740 65659200   632   162  946 19837  1  3 81 17
 0  0  0 2950104  43740 66466000  8192 0  550  326  1  1 91  8
 0  0  0 2950600  43772 66409200   148   289  691 39594  0  1 90  9
 0  0  0 2939580  43924 67461200  5568   115 2424 38662  5  4 77 15
 0  1  0 2944888  43932 6695800056   869 1095 20062  6  1 89  4
 0  0  0 2945812  43936 67052400   82492  824 49790  0  2 93  5
 1  0  0 2945724  44084 67065600   34891  650 26455  1  1 89 10
 0  2  0 2945356  44380 67084800   536   161  718 18824  1  2 76 22
 0  0  0 2944432  44400 67121600   156   534  684 16232  2  1 81 17
 0  0  0 2943660  44412 67154400   292   120  562 49618  1  3 87 10
 0  0  0 2943740  44412 67152000 0 9  393 7247  0  0 100  0
 0  0  0 2943608  44420 67181200   27642  507 36329  1  1 96  3
 0  0  0 2943704  44420 67199600   176 0  405  269  0  0 100  0
 0  0  0 2943548  44428 67196400 0   238  534 14823  0  1 99  0
 0  0  0 2943136  4 67215600   212   692  698 29321  1  2 86 11

Then I applied the patch and this is how vmstat 1 looks now (WITH the patch):

 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0  0 2736628  35160 89770800 8   293 1172 2493 17  3 79  1
 0  0  0 2737336  35172 89794400   15643  883  971  3  0 95  3
 0  0  0 2736500  35212 89825200   292  5564 1267 2168 14  2 79  5
 0  0  0 2736056  35356 89886400   59651 1029 1344  2  3 90  6
 0  0  0 2735732  35504 89928400   51682 1225 1495  2  2 85 11
 0  1  0 2734052  35508 90032400   51291 1149 1225  2  2 93  4
 0  0  0 2733980  35508 89982000 017  918 1164  2  1 96  1
 0  2  0 2733988  35812 89994000   656  1764 1097 1549  3  2 79 17
 0  0  0 2733792  35820 900348004076 1303 1299  2  3 83 13
 0  0  0 2733888  35820 90034400 017  914 1085  2  1 97  0
 0  0  0 2733316  35952 90036400   144   235 1062 1316  1  2 95  3
 0  0  0 2733012  36092 90041200   17611 1112 1469  3  1 92  4
 0  0  0 2732732  36236 90044400   160   709  932 1022  2  1 93  5
 1  0  0 2732128  36384 90040000   156  8987 1491 2519 12  3 82  3
 0  0  0 2732128  36384 90041600 034  927 1376  5  2 93  0
 0  0  0 2732044  36540 90078800   44482  963 1278  3  1 87  8
 1  0  0 2732020  36680 90079600   168 2  883 1041  1  2 94  2
 0  0  0 2731228  36700 90145600   324   196  882 1125  2  0 94  4

Observe the difference in the cs column!

I hope this gets in before 3.7.0. Good work Tejun!

Best regards,

-- 
Zlatko


Re: High context switch rate, ksoftirqd's chewing cpu

2012-11-30 Thread Zlatko Calusic

On 30.11.2012 23:52, Tejun Heo wrote:

Hello, Zlatko.

Sorry about the delay.  Your message was in my spam folder.  The
attachment seems to have confused the filter.

On Sat, Nov 17, 2012 at 02:01:29PM +0100, Zlatko Calusic wrote:

This week I spent some hours tracking a regression in the 3.7 kernel
that was producing a high context switch rate on one of my machines. I
carefully bisected between 3.6 and 3.7-rc1 and eventually found this
commit to be the culprit:

commit e7c2f967445dd2041f0f8e3179cca22bb8bb7f79
Author: Tejun Heo 
Date:   Tue Aug 21 13:18:24 2012 -0700

 workqueue: use mod_delayed_work() instead of __cancel + queue

...


Then I carefully reverted it chunk by chunk to find out exactly which
change is responsible for the regression. You can find it attached
as wq.patch (to preserve whitespace). A very simple modification with
wildly different behavior on only one of my machines - weird. I'm
also attaching a ctxt/s graph that shows the impact nicely. I'll
gladly provide any additional info that could help you resolve this.

Please Cc: on reply (not subscribed to lkml).

Regards,
--
Zlatko



diff --git a/block/blk-core.c b/block/blk-core.c
index 4b4dbdf..4b8b606 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -319,10 +319,8 @@ EXPORT_SYMBOL(__blk_run_queue);
   */
  void blk_run_queue_async(struct request_queue *q)
  {
-   if (likely(!blk_queue_stopped(q))) {
-   __cancel_delayed_work(&q->delay_work);
-   queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
-   }
+   if (likely(!blk_queue_stopped(q)))
+   mod_delayed_work(kblockd_workqueue, &q->delay_work, 0);
  }
  EXPORT_SYMBOL(blk_run_queue_async);


That's interesting.  Is there anything else noticeably different from
the ctxsw counts?  e.g. CPU usage, IO throughput / latency, etc...
Also, can you please post the kernel boot log from the machine?  I
assume that the issue is readily reproducible?  Are you up for trying
some debug patches?

Thanks.



Hey Tejun! Thanks for replying.

It's an older C2D machine; I've attached the kernel boot log. The funny 
thing is that on the other half a dozen machines I don't observe any 
problems, only on this one. And it's reproducible every time. I don't 
see any other anomalies besides the two I already mentioned: the high 
context switch rate and the ksoftirqd daemons eating more CPU, probably 
as a consequence.


I'll gladly try your patch and send my observations tomorrow, as I've 
just started md resync on the machine, which will take couple of hours.


Regards,
--
Zlatko
Linux version 3.7.0-rc7 (root@ps) (gcc version 4.7.2 (Debian 4.7.2-4) ) #1 SMP 
Fri Nov 30 23:13:52 CET 2012
Command line: root=/dev/md0 rootfstype=ext4 profile=2 ro 
BOOT_IMAGE=/boot/vmlinuz-3.7.0-rc7 
e820: BIOS-provided physical RAM map:
BIOS-e820: [mem 0x-0x0009fbff] usable
BIOS-e820: [mem 0x0009fc00-0x0009] reserved
BIOS-e820: [mem 0x000e-0x000f] reserved
BIOS-e820: [mem 0x0010-0xcf239fff] usable
BIOS-e820: [mem 0xcf23a000-0xcf27cfff] ACPI NVS
BIOS-e820: [mem 0xcf27d000-0xcf354fff] reserved
BIOS-e820: [mem 0xcf355000-0xcf364fff] ACPI NVS
BIOS-e820: [mem 0xcf365000-0xcf3e0fff] reserved
BIOS-e820: [mem 0xcf3e1000-0xcf3e6fff] ACPI data
BIOS-e820: [mem 0xcf3e7000-0xcf3e7fff] ACPI NVS
BIOS-e820: [mem 0xcf3e8000-0xcf3e9fff] ACPI data
BIOS-e820: [mem 0xcf3ea000-0xcf3ebfff] ACPI NVS
BIOS-e820: [mem 0xcf3ec000-0xcf3ecfff] reserved
BIOS-e820: [mem 0xcf3ed000-0xcf3f0fff] ACPI NVS
BIOS-e820: [mem 0xcf3f1000-0xcf3f] reserved
BIOS-e820: [mem 0xcf40-0xcf5f] usable
BIOS-e820: [mem 0xfed1c000-0xfed1] reserved
BIOS-e820: [mem 0xffa0-0xffbf] reserved
BIOS-e820: [mem 0xffe0-0x] reserved
BIOS-e820: [mem 0x0001-0x00012fff] usable
NX (Execute Disable) protection: active
DMI 2.4 present.
DMI:  /DG31PR, BIOS PRG3110H.86A.0065.2009.0421.1559 04/21/2009
e820: update [mem 0x-0x] usable ==> reserved
e820: remove [mem 0x000a-0x000f] usable
No AGP bridge found
e820: last_pfn = 0x13 max_arch_pfn = 0x4
MTRR default type: uncachable
MTRR fixed ranges enabled:
  0-9 write-back
  A-E7FFF uncachable
  E8000-F write-protect
MTRR variable ranges enabled:
  0 base 0 mask F write-back
  1 base 1 mask FE000 write-back
  2 base 12000 mask FF000 write-back
  3 base 0CF60 mask FFFE0 uncachable
  4 base 0CF80 mask FFF80 uncachable
  5 base 0D000 mask FF000 uncachable
  6 base 0E000 mask FE000 uncachable
  7 disabled
x86 PAT enabled: cpu 0, old 0x7040600070406

High context switch rate, ksoftirqd's chewing cpu

2012-11-17 Thread Zlatko Calusic

Hello Tejun et al.

This week I spent some hours tracking a regression in the 3.7 kernel 
that was producing a high context switch rate on one of my machines. I 
carefully bisected between 3.6 and 3.7-rc1 and eventually found this 
commit to be the culprit:


commit e7c2f967445dd2041f0f8e3179cca22bb8bb7f79
Author: Tejun Heo 
Date:   Tue Aug 21 13:18:24 2012 -0700

workqueue: use mod_delayed_work() instead of __cancel + queue

Now that mod_delayed_work() is safe to call from IRQ handlers,
__cancel_delayed_work() followed by queue_delayed_work() can be
replaced with mod_delayed_work().

Most conversions are straight-forward except for the following.

* net/core/link_watch.c: linkwatch_schedule_work() was doing a quite
  elaborate dancing around its delayed_work.  Collapse it such that
  linkwatch_work is queued for immediate execution if LW_URGENT and
  existing timer is kept otherwise.

Signed-off-by: Tejun Heo 
Cc: "David S. Miller" 
Cc: Tomi Valkeinen 

Then I carefully reverted it chunk by chunk to find out exactly which 
change is responsible for the regression. You can find it attached as 
wq.patch (to preserve whitespace). A very simple modification with wildly 
different behavior on only one of my machines - weird. I'm also attaching 
a ctxt/s graph that shows the impact nicely. I'll gladly provide any 
additional info that could help you resolve this.


Please Cc: on reply (not subscribed to lkml).

Regards,
--
Zlatko
diff --git a/block/blk-core.c b/block/blk-core.c
index 4b4dbdf..4b8b606 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -319,10 +319,8 @@ EXPORT_SYMBOL(__blk_run_queue);
  */
 void blk_run_queue_async(struct request_queue *q)
 {
-	if (likely(!blk_queue_stopped(q))) {
-		__cancel_delayed_work(&q->delay_work);
-		queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
-	}
+	if (likely(!blk_queue_stopped(q)))
+		mod_delayed_work(kblockd_workqueue, &q->delay_work, 0);
 }
 EXPORT_SYMBOL(blk_run_queue_async);
 

Problems with reboot/poweroff on SMP machine

2005-07-19 Thread Zlatko Calusic
Hi Eric and all!

In the last few weeks or so I started having problems with reboot/poweroff
on my aging SMP desktop (dual PIII, Apollo Pro 266 chipset). The machine
does all the steps till the very end, where it stops (hangs) before the
actual reboot or poweroff. The problem doesn't happen every time (only
occasionally). Alt-SysRq-B/O doesn't work at the point of the hang.

I did a little bit of investigation and I believe that this patch:

 
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=dd2a13054ffc25783a74afb5e4a0f2115e45f9cd

is the primary suspect for the regression (reboots and poweroffs had
been working fine for the last few years on this particular
machine). But now I need expert help. :) I'm willing to help decipher
this, so don't hesitate to ask for more details! I don't even know
what info is useful to provide at this point (the kernel is virgin 2.6.12,
ACPI is compiled in, I don't use any boot-time reboot= parameter - what
else?). And please Cc: me 'cause I'm not on the list.

Thanks for any info!
-- 
Zlatko


Re: VM Report was:Re: Break 2.4 VM in five easy steps

2001-06-09 Thread Zlatko Calusic


Mike Galbraith <[EMAIL PROTECTED]> writes:

> On Fri, 8 Jun 2001, John Stoffel wrote:
> 
> > Mike> OK, riddle me this.  If this test is a crummy test, just how is
> > Mike> it that I was able to warn Rik in advance that when 2.4.5 was
> > Mike> released, he should expect complaints?  How did I _know_ that?
> > Mike> The answer is that I fiddle with Rik's code a lot, and I test
> > Mike> with this test because it tells me a lot.  It may not tell you
> > Mike> anything, but it does me.
> >
> > I never said it was a crummy test, please do not read more into my
> > words than was written.  What I was trying to get across is that just
> > one test (such as a compile of the kernel) isn't perfect at showing
> > where the problems are with the VM sub-system.
> 
> Hmm...
> 
> Tobias> Could you please explain what is good about this test?  I
> Tobias> understand that it will stress the VM, but will it do so in a
> Tobias> realistic and relevant way?
> 
> I agree, this isn't really a good test case.  I'd rather see what
> 
> happens when you fire up a gimp session to edit an image which is
> *almost* the size of RAM, or even just 50% the size of ram.  Then how
> does that affect your other processes that are running at the same
> time?
> 
> ...but anyway, yes it just one test from any number of possibles.

One great test that I'm using regularly to see what's goin' on is the
one at http://lxr.linux.no/. It is a cool utility to cross-reference
your Linux kernel source tree, and in the meantime eat gobs of memory,
do lots of I/O, and burn many CPU cycles (all at the same time). An
ideal test, if you ask me, and if anybody has the time, it would be nice
to see different timing numbers when run on different kernels. Just make
sure you run it on the same kernel tree to get reproducible results.
It has three passes, and the third one is the most interesting
(use vmstat 1 to see why). When run with a 64MB RAM configuration, it
will swap heavily, with 128MB somewhat, and at 192MB maybe not at all
(depending on the other applications running at the same time).

Try it, it is a nice utility, and a great test. :)
-- 
Zlatko



Re: Comment on patch to remove nr_async_pages limit

2001-06-05 Thread Zlatko Calusic

Ed Tomlinson <[EMAIL PROTECTED]> writes:

[snip]
> Maybe we can have the best of both worlds.  Is it possible to allocate the BH
> early and then defer the IO?  The idea being to make IO possible without having
> to allocate.  This would let us remove the async page limit but would ensure
> we could still free.
> 

Yes, this is a good idea if you ask me. Basically, to remove as many
limits as we can, and also to protect us from deadlocks. With just
a few pages of extra memory for the reserved buffer heads, I think
it's a fair trade. Still, pending further analysis...
-- 
Zlatko



Re: Comment on patch to remove nr_async_pages limit

2001-06-05 Thread Zlatko Calusic

Mike Galbraith <[EMAIL PROTECTED]> writes:

> On Mon, 4 Jun 2001, Marcelo Tosatti wrote:
> 
> > Zlatko,
> >
> > I've read your patch to remove nr_async_pages limit while reading an
> > archive on the web. (I have to figure out why lkml is not being delivered
> > correctly to me...)
> >
> > Quoting your message:
> >
> > "That artificial limit hurts both swap out and swap in path as it
> > introduces synchronization points (and/or weakens swapin readahead),
> > which I think are not necessary."
> >
> > If we are under low memory, we cannot simply writeout a whole bunch of
> > swap data. Remember the writeout operations will potentially allocate
> > buffer_head's for the swapcache pages before doing real IO, which takes
> > _more memory_: OOM deadlock.
> 
> What's the point of creating swapcache pages, and then avoiding doing
> the IO until it becomes _dangerous_ to do so?  That's what we're doing
> right now.  This is a problem because we guarantee it will become one.
> We guarantee that the pagecache will become almost pure swapcache by
> delaying the writeout so long that everything else is consumed.
> 

Huh, this looks just like my argument, just put in different words. I
should have read this sooner. :)
-- 
Zlatko



Re: Comment on patch to remove nr_async_pages limit

2001-06-05 Thread Zlatko Calusic

Marcelo Tosatti <[EMAIL PROTECTED]> writes:

[snip]
> Exactly. And when we reach a low watermark of memory, we start writting
> out the anonymous memory.
>

Hm, my observations are a little bit different. I find that writeouts
happen sooner than the moment we reach the low watermark, and many times
just in time to interact badly with some read I/O workload that created
the virtual shortage of memory in the first place. The net effect is poor
performance and too much stuff in swap.

> > In experiments, speeding swapcache pages on their way helps.  Special
> > handling (swapcache bean counting) also helps. (was _really ugly_ code..
> > putting them on a seperate list would be a lot easier on the stomach:)
> 
> I agree that the current way of limiting on-flight swapout can be changed
> to perform better. 
> 
> Removing the limit on the amount of data being written to disk when we
> have a memory shortage is not nice. 
> 

OK, then we basically agree that there is room for improvement, and
you also agree that we must be careful while trying to achieve that.

I'll admit that my patch is mostly experimental, and its best effect
is this discussion, which I enjoy very much. :)
-- 
Zlatko



Re: Comment on patch to remove nr_async_pages limit

2001-06-05 Thread Zlatko Calusic

Marcelo Tosatti <[EMAIL PROTECTED]> writes:

> Zlatko, 
> 
> I've read your patch to remove nr_async_pages limit while reading an
> archive on the web. (I have to figure out why lkml is not being delivered
> correctly to me...)
> 
> Quoting your message: 
> 
> "That artificial limit hurts both swap out and swap in path as it
> introduces synchronization points (and/or weakens swapin readahead),
> which I think are not necessary."
> 
> If we are under low memory, we cannot simply writeout a whole bunch of
> swap data. Remember the writeout operations will potentially allocate
> buffer_head's for the swapcache pages before doing real IO, which takes
> _more memory_: OOM deadlock. 
> 

My question is: if we defer writing and in a way "lose" those 4096
bytes of memory (because we decide to keep the page in memory for
some more time), how can a much smaller buffer_head be a problem?

I think we could always keep a bigger reserve of buffer heads just for
this purpose, to make swapout more robust, and then not impose any
limit on the number of outstanding async IO pages in flight.

Does this make any sense?
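
To illustrate the idea, a minimal sketch (mine, not the 2.4 code;
single-threaded, locking omitted): preallocate a small pool of buffer
heads that only the swapout path may dip into, and top it back up at IO
completion, so writeout can always make progress without allocating:

#define RESERVE_NR 64

struct buffer_head;                     /* opaque in this sketch */

static struct buffer_head *reserve[RESERVE_NR];
static int reserve_top;

/* Fill the pool once, while memory is plentiful. */
int bh_reserve_init(struct buffer_head *(*alloc_bh)(void))
{
        while (reserve_top < RESERVE_NR) {
                struct buffer_head *bh = alloc_bh();
                if (!bh)
                        return -1;
                reserve[reserve_top++] = bh;
        }
        return 0;
}

/* Swapout path: fall back to the reserve instead of failing. */
struct buffer_head *bh_reserve_get(struct buffer_head *(*alloc_bh)(void))
{
        struct buffer_head *bh = alloc_bh();

        if (!bh && reserve_top > 0)
                bh = reserve[--reserve_top];
        return bh;
}

/* IO completion: refill the reserve before really freeing. */
void bh_reserve_put(struct buffer_head *bh,
                    void (*free_bh)(struct buffer_head *))
{
        if (reserve_top < RESERVE_NR)
                reserve[reserve_top++] = bh;
        else
                free_bh(bh);
}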

-- 
Zlatko



[PATCH] Remove nr_async_pages limit

2001-06-04 Thread Zlatko Calusic

This patch removes the limit on the number of async pages in the
flight.

That artificial limit hurts both the swapout and the swapin path, as it
introduces synchronization points (and/or weakens swapin readahead)
which I think are not necessary.

I also took the opportunity to clean up the code a little bit. The patch
practically only removes code. Linus will like it (if and when it's
submitted). :)

Still, it needs some more testing on various workloads, so I'm posting
it on the lists only. So far, it's been completely stable.


Index: 5.9/mm/page_io.c
--- 5.9/mm/page_io.c Sat, 28 Apr 2001 13:16:05 +0200 zcalusic (linux24/j/10_page_io.c 
1.1.3.1 644)
+++ 5.8/mm/page_io.c Sat, 02 Jun 2001 19:54:40 +0200 zcalusic (linux24/j/10_page_io.c 
+1.1.3.1.1.1 644)
@@ -20,7 +20,6 @@
 
 /*
  * Reads or writes a swap page.
- * wait=1: start I/O and wait for completion. wait=0: start asynchronous I/O.
  *
  * Important prevention of race condition: the caller *must* atomically 
  * create a unique swap cache entry for this swap page before calling
@@ -41,12 +40,6 @@
kdev_t dev = 0;
int block_size;
struct inode *swapf = 0;
-   int wait = 0;
-
-   /* Don't allow too many pending pages in flight.. */
-   if ((rw == WRITE) && atomic_read(&nr_async_pages) >
-   pager_daemon.swap_cluster * (1 << page_cluster))
-   wait = 1;
 
if (rw == READ) {
ClearPageUptodate(page);
@@ -75,26 +68,11 @@
} else {
return 0;
}
-   if (!wait) {
-   SetPageDecrAfter(page);
-   atomic_inc(&nr_async_pages);
-   }
-
/* block_size == PAGE_SIZE/zones_used */
brw_page(rw, page, dev, zones, block_size);
 
-   /* Note! For consistency we do all of the logic,
-* decrementing the page count, and unlocking the page in the
-* swap lock map - in the IO completion handler.
-*/
-   if (!wait)
-   return 1;
-
-   wait_on_page(page);
-   /* This shouldn't happen, but check to be sure. */
-   if (page_count(page) == 0)
-   printk(KERN_ERR "rw_swap_page: page unused while waiting!\n");
-
+   /* Note! For consistency, we decrement the page count and
+  unlock the page in the IO completion handler. */
return 1;
 }
 
@@ -121,11 +99,6 @@
UnlockPage(page);
 }
 
-/*
- * The swap lock map insists that pages be in the page cache!
- * Therefore we can't use it.  Later when we can remove the need for the
- * lock map and we can reduce the number of functions exported.
- */
 void rw_swap_page_nolock(int rw, swp_entry_t entry, char *buf)
 {
struct page *page = virt_to_page(buf);
Index: 5.9/mm/page_alloc.c
--- 5.9/mm/page_alloc.c Sat, 26 May 2001 20:44:49 +0200 zcalusic 
(linux24/j/14_page_alloc 1.1.7.1.1.1.1.1.1.1 644)
+++ 5.8/mm/page_alloc.c Sat, 02 Jun 2001 19:54:40 +0200 zcalusic 
+(linux24/j/14_page_alloc 1.1.7.1.1.1.1.1.1.1.2.1 644)
@@ -79,8 +79,6 @@
BUG();
if (PageLocked(page))
BUG();
-   if (PageDecrAfter(page))
-   BUG();
if (PageActive(page))
BUG();
if (PageInactiveDirty(page))
Index: 5.9/mm/swap.c
--- 5.9/mm/swap.c Wed, 31 Jan 2001 23:52:50 +0100 zcalusic (linux24/j/17_swap.c 
1.1.4.1 644)
+++ 5.8/mm/swap.c Sat, 02 Jun 2001 19:54:40 +0200 zcalusic (linux24/j/17_swap.c 
+1.1.4.2 644)
@@ -52,10 +52,6 @@
  */
 int memory_pressure;
 
-/* We track the number of pages currently being asynchronously swapped
-   out, so that we don't try to swap TOO many pages out at once */
-atomic_t nr_async_pages = ATOMIC_INIT(0);
-
 buffer_mem_t buffer_mem = {
2,  /* minimum percent buffer */
10, /* borrow percent buffer */
Index: 5.9/mm/memory.c
--- 5.9/mm/memory.c Sat, 28 Apr 2001 13:16:05 +0200 zcalusic (linux24/j/18_memory.c 
1.1.7.1.1.1.1.1.2.1 644)
+++ 5.8/mm/memory.c Sat, 02 Jun 2001 19:54:40 +0200 zcalusic (linux24/j/18_memory.c 
+1.1.7.1.1.1.1.1.2.1.1.1 644)
@@ -1089,16 +1089,9 @@
 */
num = valid_swaphandles(entry, &offset);
for (i = 0; i < num; offset++, i++) {
-   /* Don't block on I/O for read-ahead */
-   if (atomic_read(&nr_async_pages) >= pager_daemon.swap_cluster
-   * (1 << page_cluster)) {
-   while (i++ < num)
-   swap_free(SWP_ENTRY(SWP_TYPE(entry), offset++));
-   break;
-   }
-   /* Ok, do the async read-ahead now */
-   new_page = read_swap_cache_async(SWP_ENTRY(SWP_TYPE(entry), offset));
-   if (new_page != NULL)
+   new_page = read_swap_cache_async(SWP_ENTRY(SWP_TYPE(entry),
+  offset));
+   if (new_page)
page_cache_release(new_page);
swap_free(SWP_ENTRY(SWP_TYPE(entry), offset

XMM: monitor Linux MM inactive/active lists graphically

2001-06-03 Thread Zlatko Calusic

Ed Tomlinson <[EMAIL PROTECTED]> writes:

> Zlatko,
> 
> Do you have your modified xmem available somewhere?  Think it might be of
> interest to a few of us.
> 
> TIA
> Ed Tomlinson <[EMAIL PROTECTED]>
> 

For some time I've been trying to make a simple yet functional web
page to put some stuff on. But HTML hacking and kernel hacking are
such different beasts... :)

XMM is a heavily modified XMEM utility that graphically shows the size
of the different Linux page lists: active, inactive_dirty,
inactive_clean, code, free, and swap usage. It is better suited for
monitoring the Linux 2.4 MM implementation than the original (XMEM)
utility.

Find it here: <http://linux.inet.hr/>

-- 
Zlatko

P.S. I'm gladly accepting suggestions for a simple tool that would help
in static web site creation/development. I checked genpage, htmlmake
and some other utilities, but in every one of them I found something
that I didn't like. Tough job, that HTML authoring.



[PATCH] balance inactive_dirty list

2001-06-02 Thread Zlatko Calusic

For a long time I've been thinking that the inactive list is too small,
while observing lots of different workloads (all I/O bound). Finally,
I decided to take a look and try to improve things. In mm/vmscan.c I
found this overly complicated heuristic:

if (!target) {
int inactive = nr_free_pages() + nr_inactive_clean_pages() +
nr_inactive_dirty_pages;
int active = MAX(nr_active_pages, num_physpages / 2);
if (active > 10 * inactive)
maxscan = nr_active_pages >> 4;
else if (active > 3 * inactive)
maxscan = nr_active_pages >> 8;
else
return 0;
}

We're trying to be too clever there, and that eventually hurts
performance, because the inactive_dirty list ends up too small for
typical scenarios. Especially that 'return 0' is hurting us, as it
effectively stops the background scan, so too many pages stay active
without any real need.
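
To see how easily that 'return 0' triggers, here is the heuristic
transplanted into a standalone program, with assumed (but quite
ordinary) numbers for a 192MB box - 4K pages, so num_physpages is
about 49152:

#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
        int num_physpages = 49152;   /* 192MB / 4K, assumed */
        int nr_active_pages = 40000; /* sample value */
        int inactive = 15000;        /* free + inactive_clean + inactive_dirty */
        int active = MAX(nr_active_pages, num_physpages / 2);

        if (active > 10 * inactive)
                printf("maxscan = %d\n", nr_active_pages >> 4);
        else if (active > 3 * inactive)
                printf("maxscan = %d\n", nr_active_pages >> 8);
        else
                printf("return 0 - background aging stops\n");
        return 0;
}

With these numbers active (40000) is below 3 * inactive (45000), so the
function bails out and no background aging happens at all.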

With the patch below, performance is much better under all the workloads
I have tested. The patch simplifies the code a lot and removes an
unnecessarily complex calculation. The code is now completely
autotuning. I have a modified xmem utility that shows the state of the
lists graphically, so it's easy to see what's going on. Things look
much smoother now.

I think I've seen Mike Galbraith (on the list) trying to solve almost
the same problem, although in a slightly different way. Mike, could
you give this patch a try.

All comments welcome, of course. :)

Index: 5.2/mm/vmscan.c
--- 5.2/mm/vmscan.c Sat, 26 May 2001 20:44:49 +0200 zcalusic (linux24/j/9_vmscan.c 
1.1.7.1.1.1.2.1.1.1 644)
+++ 5.2(w)/mm/vmscan.c Sat, 02 Jun 2001 23:25:40 +0200 zcalusic (linux24/j/9_vmscan.c 
+1.1.7.1.1.1.2.1.1.1 644)
@@ -655,24 +655,10 @@
 
/*
 * When we are background aging, we try to increase the page aging
-* information in the system. When we have too many inactive pages
-* we don't do background aging since having all pages on the
-* inactive list decreases aging information.
-*
-* Since not all active pages have to be on the active list, we round
-* nr_active_pages up to num_physpages/2, if needed.
+* information in the system.
 */
-   if (!target) {
-   int inactive = nr_free_pages() + nr_inactive_clean_pages() +
-   nr_inactive_dirty_pages;
-   int active = MAX(nr_active_pages, num_physpages / 2);
-   if (active > 10 * inactive)
-   maxscan = nr_active_pages >> 4;
-   else if (active > 3 * inactive)
-   maxscan = nr_active_pages >> 8;
-   else
-   return 0;
-   }
+   if (!target)
+   maxscan = nr_active_pages >> 4;
 
/* Take the lock while messing with the list... */
spin_lock(&pagemap_lru_lock);

-- 
Zlatko



Re: [RFC][PATCH] Re: Linux 2.4.4-ac10

2001-05-20 Thread Zlatko Calusic

Mike Galbraith <[EMAIL PROTECTED]> writes:

> Hi,
> 
> On Fri, 18 May 2001, Stephen C. Tweedie wrote:
> 
> > That's the main problem with static parameters.  The problem you are
> > trying to solve is fundamentally dynamic in most cases (which is also
> > why magic numbers tend to suck in the VM.)
> 
> Magic numbers might be sucking some performance right now ;-)
> 
[snip]

I like your patch; it improves performance somewhat, makes things
smoother, and the code is simpler, too.

Anyway, 2.4.5-pre3 is quite out of balance, and it has even broken some
things that were working properly before. For instance, swapoff now
deadlocks the machine (even with your patch applied).

Unfortunately, I have failed to pinpoint the exact problem, but I'm
confident that the kernel goes into some kind of loop (99% system time,
just before the deadlock). Does anybody have guidelines on how to debug
the kernel while running X?

Also, in all recent kernels, if the machine is swapping, the swap cache
grows without limit and is hard to recycle, but then again that is
a known problem.
-- 
Zlatko



Re: Subtle MM bug

2001-01-17 Thread Zlatko Calusic

Rik van Riel <[EMAIL PROTECTED]> writes:

> > Second test: kernel compile make -j32 (empirically this puts the
> > VM under load, but not excessively!)
> >
> > 2.2.17 -> make -j32  392.49s user 47.87s system 168% cpu 4:21.13 total
> > 2.4.0  -> make -j32  389.59s user 31.29s system 182% cpu 3:50.24 total
> >
> > Now, is this great news or what, 2.4.0 is definitely faster.
> 
> One problem is that these tasks may be waiting on kswapd when
> kswapd might not get scheduled in on time. On the one hand this
> will mean lower load and less thrashing, on the other hand it
> means more IO wait.
> 

Hm, if all tasks are waiting for memory, what is stopping kswapd from
running? :)
-- 
Zlatko



Re: mmap()/VM problems in 2.4.0

2001-01-15 Thread Zlatko Calusic

"Vlad Bolkhovitine" <[EMAIL PROTECTED]> writes:

> Here is updated info for 2.4.1pre3:
> 
> Size is MB, BlkSz is Bytes, Read, Write, and Seeks are MB/sec
> 
> with mmap()
> 
>  File   Block  Num  Seq ReadRand Read   Seq Write  Rand Write
>  DirSize   SizeThr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
> --- -- --- --- --- --- --- ---
>. 1024   40962  1.089 1.24% 0.235 0.45% 1.118 4.11% 0.616 1.41%
> 
> without mmap()
>
>  File   Block  Num  Seq ReadRand Read   Seq Write  Rand Write
>  DirSize   SizeThr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
> --- -- --- --- --- --- --- ---
>. 1024   40962  28.41 41.0% 0.547 1.15% 13.16 16.1% 0.652 1.46%
> 
> 
> Mmap() performance dropped dramatically down to almost unusable level. Plus,
> system was unusable during test: "vmstat 1" updated results every 1-2 _MINUTES_!
> 

You need Marcelo's patch. Please apply and retest.



--- linux.orig/mm/vmscan.c  Mon Jan 15 02:33:15 2001
+++ linux/mm/vmscan.c   Mon Jan 15 02:46:25 2001
@@ -153,7 +153,7 @@

if (VALID_PAGE(page) && !PageReserved(page)) {
try_to_swap_out(mm, vma, address, pte,
page);
-   if (--count)
+   if (!--count)
break;
}
}
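
For what it's worth, that one character decides whether the scan loop
bails after the first page or only after count pages; a standalone
illustration (mine, not kernel code):

#include <stdio.h>

int main(void)
{
        int count = 8, scanned = 0;

        while (1) {
                scanned++;
                /* broken: if (--count) break;  - stops after 1 page  */
                /* fixed:  if (!--count) break; - stops after 8 pages */
                if (!--count)
                        break;
        }
        printf("pages scanned: %d\n", scanned); /* prints 8 */
        return 0;
}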


-- 
Zlatko



Re: Subtle MM bug

2001-01-09 Thread Zlatko Calusic

Simon Kirby <[EMAIL PROTECTED]> writes:

> On Tue, Jan 09, 2001 at 10:47:57AM -0800, Linus Torvalds wrote:
> 
> > And this _is_ a downside, there's no question about it. There's the worry
> > about the potential loss of locality, but there's also the fact that you
> > effectively need a bigger swap partition with 2.4.x - never mind that
> > large portions of the allocations may never be used. You still need the
> > disk space for good VM behaviour.
> > 
> > There are always trade-offs, I think the 2.4.x tradeoff is a good one.
> 
> Hmm, perhaps you could clarify...
> 
> For boxes that rarely ever use swap with 2.2, will they now need more
> swap space on 2.4 to perform well, or just boxes which don't have enough
> RAM to handle everything nicely?
>

Just boxes that were already short on memory (swapping a lot) will need
more swap, empirically up to 4 times as much. If you already had
enough memory, then things will stay almost the same for you.

But anyway, after some testing I've done recently, I would now
recommend that nobody run with less than a 2 x RAM sized swap partition.

> I've always been tending to make swap partitions smaller lately, as it
> helps in the case where we have to wait for a runaway process to eat up
> all of the swap space before it gets killed.  Making the swap size
> smaller speeds up the time it takes for this to happen, albeit something
> which isn't supposed to happen anyway.
> 

Well, if you continue with that practice, you will now be even more
successful at killing such processes, I would say. :)
-- 
Zlatko



Re: Subtle MM bug

2001-01-09 Thread Zlatko Calusic

Linus Torvalds <[EMAIL PROTECTED]> writes:

> On 8 Jan 2001, Eric W. Biederman wrote:
> 
> > Zlatko Calusic <[EMAIL PROTECTED]> writes:> 
> > > 
> > > Yes, but a lot more data on the swap also means degraded performance,
> > > because the disk head has to seek around in the much bigger area. Are
> > > you sure this is all OK?
> > 
> > I don't think we have more data on the swap, just more data has an
> > allocated home on the swap.
> 
> I think Zlatko's point is that because of the extra allocations, we will
> have worse locality (more seeks etc).

Yes, that was my concern.

But in the end I'm not sure. I made two simple tests and haven't found
any problems with the 2.4.0 MM logic (as opposed to 2.2.17). In fact,
the new kernel was faster in the more interesting (make -j32) test.

I have also found that the new kernel allocates 4 times more swap space
under some circumstances. That may or may not be alarming; it remains
to be seen.

-- 
Zlatko



Re: Subtle MM bug

2001-01-08 Thread Zlatko Calusic

Linus Torvalds <[EMAIL PROTECTED]> writes:

> On Mon, 8 Jan 2001, Rik van Riel wrote:
> 
> > On Sun, 7 Jan 2001, Wayne Whitney wrote:
> > 
> > > Well, here is a workload that performs worse on 2.4.0 than on 2.2.19pre,
> > 
> > > The typical machine is a dual Intel box with 512MB RAM and 512MB swap.
> > 
> > How does 2.4 perform when you add an extra GB of swap ?
> > 
> > 2.4 keeps dirty pages in the swap cache, so you will need
> > more swap to run the same programs...
> > 
> > Linus: is this something we want to keep or should we give
> > the user the option to run in a mode where swap space is
> > freed when we swap in something non-shared ?
> 
> I'd prefer just documenting it and keeping it. I'd hate to have two fairly
> different modes of behaviour. It's always been the suggested "twice the
> amount of RAM", although there's historically been the "Linux doesn't
> really need that much" that we just killed with 2.4.x.
> 
> If you have 512MB or RAM, you can probably afford another 40GB or so of
> harddisk. They are disgustingly cheap these days.
> 

Yes, but a lot more data on swap also means degraded performance,
because the disk head has to seek around in a much bigger area. Are
you sure this is all OK?
-- 
Zlatko



[patch] mm-cleanup-2 (2.4.0)

2001-01-08 Thread Zlatko Calusic

OK, take two. This patch:

o removes obsolete /proc entries and other mm structures no longer
  used.
o adds a new /proc/sys/vm/max-async-pages
o updates documentation

As the patch doesn't change any vital kernel functionality, it is
completely safe. I don't know if it satisfies Linus' patch submission
guidelines, so it is sent only to the lists to be on the safe side. :)


Index: 0.2/mm/oom_kill.c
--- 0.2/mm/oom_kill.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/0_oom_kill.c 
1.1 644)
+++ 0.10/mm/oom_kill.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic 
+(linux24/j/0_oom_kill.c 1.2 644)
@@ -18,7 +18,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 /* #define DEBUG */
Index: 0.2/mm/bootmem.c
--- 0.2/mm/bootmem.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/3_bootmem.c 
1.1 644)
+++ 0.10/mm/bootmem.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic (linux24/j/3_bootmem.c 
+1.2 644)
@@ -12,7 +12,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
Index: 0.2/mm/swap_state.c
--- 0.2/mm/swap_state.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic 
(linux24/j/6_swap_state 1.1 644)
+++ 0.10/mm/swap_state.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic 
+(linux24/j/6_swap_state 1.2 644)
@@ -10,7 +10,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
Index: 0.2/mm/swapfile.c
--- 0.2/mm/swapfile.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/8_swapfile.c 
1.1 644)
+++ 0.10/mm/swapfile.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic 
+(linux24/j/8_swapfile.c 1.2 644)
@@ -9,7 +9,6 @@
 #include 
 #include 
 #include 
-#include 
 #include  /* for blk_size */
 #include 
 #include 
Index: 0.2/mm/vmscan.c
--- 0.2/mm/vmscan.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/9_vmscan.c 1.1 
644)
+++ 0.10/mm/vmscan.c Tue, 09 Jan 2001 01:39:38 +0100 zcalusic (linux24/j/9_vmscan.c 
+1.4 644)
@@ -15,7 +15,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
Index: 0.2/mm/page_io.c
--- 0.2/mm/page_io.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/10_page_io.c 
1.1 644)
+++ 0.10/mm/page_io.c Tue, 09 Jan 2001 01:31:18 +0100 zcalusic (linux24/j/10_page_io.c 
+1.4 644)
@@ -14,7 +14,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 
@@ -43,8 +42,7 @@
struct inode *swapf = 0;
 
/* Don't allow too many pending pages in flight.. */
-   if ((rw == WRITE) && atomic_read(&nr_async_pages) >
-   pager_daemon.swap_cluster * (1 << page_cluster))
+   if ((rw == WRITE) && atomic_read(&nr_async_pages) > max_async_pages)
wait = 1;
 
if (rw == READ) {
Index: 0.2/mm/filemap.c
--- 0.2/mm/filemap.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/12_filemap.c 
1.1 644)
+++ 0.10/mm/filemap.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic (linux24/j/12_filemap.c 
+1.2 644)
@@ -18,7 +18,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
Index: 0.2/mm/page_alloc.c
--- 0.2/mm/page_alloc.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic 
(linux24/j/14_page_alloc 1.1 644)
+++ 0.10/mm/page_alloc.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic 
+(linux24/j/14_page_alloc 1.2 644)
@@ -12,7 +12,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
Index: 0.2/mm/mmap.c
--- 0.2/mm/mmap.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/16_mmap.c 1.1 
644)
+++ 0.10/mm/mmap.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic (linux24/j/16_mmap.c 1.2 
+644)
@@ -8,7 +8,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
Index: 0.2/mm/swap.c
--- 0.2/mm/swap.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/17_swap.c 1.1 
644)
+++ 0.10/mm/swap.c Tue, 09 Jan 2001 01:31:18 +0100 zcalusic (linux24/j/17_swap.c 1.5 
+644)
@@ -10,13 +10,11 @@
  * linux/Documentation/sysctl/vm.txt.
  * Started 18.12.91
  * Swap aging added 23.2.95, Stephen Tweedie.
- * Buffermem limits added 12.3.98, Rik van Riel.
  */
 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -42,6 +40,13 @@
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
 
+/* Maximum number of swap pages in flight */
+int max_async_pages;
+
+/* We track the number of pages currently being asynchronously swapped
+   out, so that we don't try to swap TOO many pages out at once */
+atomic_t nr_async_pages = ATOMIC_INIT(0);
+
 /*
  * This variable contains the amount of page steals the system
  * is doing, averaged over a minute. We use this to determine how
@@ -53,28 +58,6 @@
  */
 int memory_pressure;
 
-/* We track the number of pages currently being asynchronously swapped
-   out, so that we don't try to swap TOO many pages out at once */
-atomic_t nr_async_pages = ATOMIC_INIT(0);
-
-buffer_mem_t buffer_mem = {
-   2,  /* minimum percent buffer */
-   10, /* borrow percent buffer */
-   60  /* maximum percent buffer */
-};
-
-buffer_mem_t page_cache = {
-   2,  /* minimum percent page cache */
- 

Re: Subtle MM bug

2001-01-08 Thread Zlatko Calusic

Rik van Riel <[EMAIL PROTECTED]> writes:

> Now if 2.4 has worse _performance_ than 2.2 due to one
> reason or another, that I'd like to hear about ;)
> 

Oh, well, it seems that I was wrong. :)


First test: hogmem 180 5 = allocate 180MB and dirty it 5 times (on a
192MB machine)

kernel | swap usage | speed
-------+------------+-----------
2.2.17 |      48 MB | 11.8 MB/s
2.4.0  |     206 MB | 11.1 MB/s

So 2.2 is only marginally faster. Also it can be seen that 2.4 uses 4
times more swap space. If Linus says it's ok... :)
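
For reference, the hogmem source isn't shown in this thread; a minimal
sketch of such a tester, assuming "dirtying" simply means writing to
every page once per pass, could look like this:

#include <stdlib.h>

int main(int argc, char **argv)
{
        size_t mb = (argc > 1) ? atoi(argv[1]) : 180;
        int passes = (argc > 2) ? atoi(argv[2]) : 5;
        size_t len = mb << 20, i;
        int p;
        char *buf = malloc(len);

        if (!buf)
                return 1;
        for (p = 0; p < passes; p++)
                for (i = 0; i < len; i += 4096) /* dirty every page */
                        buf[i] = p;
        free(buf);
        return 0;
}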


Second test: kernel compile make -j32 (empirically this puts the VM
under load, but not excessively!)

2.2.17 -> make -j32  392.49s user 47.87s system 168% cpu 4:21.13 total
2.4.0  -> make -j32  389.59s user 31.29s system 182% cpu 3:50.24 total

Now, is this great news or what, 2.4.0 is definitely faster.

-- 
Zlatko



Re: Subtle MM bug

2001-01-07 Thread Zlatko Calusic

Rik van Riel <[EMAIL PROTECTED]> writes:

> On 7 Jan 2001, Zlatko Calusic wrote:
> 
> > Things go berzerk if you have one big process whose working set
> > is around your physical memory size.
> 
> "go berzerk" in what way?  Does the system cause lots of extra
> swap IO and does it make the system thrash where 2.2 didn't
> even touch the disk ?
>

Well, I think yes. I'll do some testing on 2.2 before I can tell
you for sure, but the system is definitely behaving badly where I
think it should not.

> > Final effect is that physical memory gets extremely flooded with
> > the swap cache pages and at the same time the system absorbs
> > ridiculous amount of the swap space.
> 
> This is mostly because Linux 2.4 keeps dirty pages in the
> swap cache. Under Linux 2.2 a page would be deleted from the
> swap cache when a program writes to it, but in Linux 2.4 it
> can stay in the swap cache.
>

OK, I can buy that.

> Oh, and don't forget that pages in the swap cache can also
> be resident in the process, so it's not like the swap cache
> is "eating into" the process' RSS ;)
>

So far so good... A little bit weird but not alarming per se.

> > For instance on my 192MB configuration, firing up the hogmem
> > program which allocates let's say 170MB of memory and dirties it
> > leads to 215MB of swap used.
> 
> So that's 170MB of swap space for hogmem and 45MB for
> the other things in the system (daemons, X, ...).
>

Yes, that's it. So it looks like all of my processes are on the
swap. That can't be good. I mean, even Solaris (known to eat swap
space like there's no tomorrow :)) would probably be more polite.

> Sounds pretty ok, except maybe for the fact that now
> Linux allocates (not uses!) a lot more swap space then
> before and some people may need to add some swap space
> to their system ...
>

Yes, I would say really a lot more. A big difference.

Also, I don't see a difference between allocated and used swap space on
Linux. Could you elaborate on that?

> 
> Now if 2.4 has worse _performance_ than 2.2 due to one
> reason or another, that I'd like to hear about ;)
> 

I'll get back to you later with more data. Time to boot 2.2. :)
-- 
Zlatko



Re: [patch] mm-cleanup-1 (2.4.0)

2001-01-07 Thread Zlatko Calusic

Rik van Riel <[EMAIL PROTECTED]> writes:

> On 7 Jan 2001, Zlatko Calusic wrote:
> 
> > OK, maybe I was too quick in concluding with that change. I'm
> > still trying to find out why the MM is behaving badly in some
> > circumstances (see my other email to the list).
> > 
> > Anyway, I would then suggest introducing another /proc entry
> > and calling it appropriately: max_async_pages. Because that is what
> > we care about, anyway. I'll send another patch.
> 
> In fact, that's NOT what we care about.
> 
> What we really care about is the number of disk seeks
> the VM subsystem has queued to disk, since it's seek
> time that causes other requests to suffer bad latency.
> 

Yes, but that's not what we have in the code now. I'm just trying to
make it a little easier for the end user to tune his system. Right now
things are quite complicated and misleading for the uninitiated.

If we are to optimize things better in the future, then so be it, but I
would first like to clean out some historical cruft.

I'm quite a pedantic guy, you know. :)
-- 
Zlatko



Re: [patch] mm-cleanup-1 (2.4.0)

2001-01-07 Thread Zlatko Calusic

Rik van Riel <[EMAIL PROTECTED]> writes:

> On 7 Jan 2001, Zlatko Calusic wrote:
> 
> > The following patch cleans up some obsolete structures from the
> > mm & proc code.
> > 
> > Besides that, it also fixes what I think is a bug:
> > 
> > if ((rw == WRITE) && atomic_read(&nr_async_pages) >
> >pager_daemon.swap_cluster * (1 << page_cluster))
> > 
> > In that (swapout logic) it effectively says swap out 512KB at
> > once (at least on my memory configuration). I think that is a
> > little too much.
> 
> Since we submit a whole cluster of (1 << page_cluster)
> size at once, your change would mean that the VM can
> only do one IO at a time...
> 
> Have you actually measured your changes or is it just
> a gut feeling that the current default is too much?
>

Well, to be honest, I didn't find any change after the modification. :)

But anyway, Marcelo explained to me what's going on, and I have
already agreed there is no need to change that. Instead I'll modify my
patch to introduce a new /proc entry with a meaningful name:
max_async_pages.

> (I can agree with 1/2 MB being a bit much, but doing
> just one IO at a time is probably wrong too...)
>

I can only add that I share your opinion. :)

> 
> The cleanup part of your patch is nice. I think that
> one should be submitted as soon as the 2.4 bugfix
> period is over ...
>

Right.

> (and yes, I'm not submitting any of my own trivial
> patches either unless they're REALLY needed, lets make
> sure Linus has enough time to focus on the real bugfixes)
> 

I'll check your new patch as soon as I have investigated a few more
things and gotten a little more acquainted with the mm code in
2.4.0. It's a pity I found some free time this late, but then again I
see myself much more involved with the mm code in the future. It's
just that I'll need some help at the start, hence so many questions on
the lists. :)
-- 
Zlatko



Re: [patch] mm-cleanup-1 (2.4.0)

2001-01-07 Thread Zlatko Calusic

Marcelo Tosatti <[EMAIL PROTECTED]> writes:

> On 7 Jan 2001, Zlatko Calusic wrote:
> 
> > The following patch cleans up some obsolete structures from the mm &
> > proc code.
> > 
> > Besides that, it also fixes what I think is a bug:
> > 
> > if ((rw == WRITE) && atomic_read(&nr_async_pages) >
> >pager_daemon.swap_cluster * (1 << page_cluster))
> > 
> > In that (swapout logic) it effectively says swap out 512KB at once (at
> > least on my memory configuration). I think that is a little too much.
> > I modified it to be a little bit more conservative and send only
> > (1 << page_cluster) pages to swap at a time. The same applies to the
> > swapin_readahead() function. Comments welcome.
> 
> 512kb is the maximum limit for in-flight swap pages, not the cluster size 
> for IO. 
> 
> swapin_readahead actually sends requests of (1 << page_cluster) to disk
> at each run.
>  

OK, maybe I was too quick in concluding with that change. I'm still
trying to find out why the MM is behaving badly in some circumstances
(see my other email to the list).

Anyway, I would then suggest introducing another /proc entry and calling
it appropriately: max_async_pages. Because that is what we care about,
anyway. I'll send another patch.
-- 
Zlatko



Subtle MM bug

2001-01-07 Thread Zlatko Calusic

I'm trying to get more familiar with the MM code in 2.4.0, as can be
seen from the many questions I have on the subject. I have discovered
nasty MM behaviour under even moderate load (2.2 didn't have this
trouble).

Things go berzerk if you have one big process whose working set is
around your physical memory size. Typical memory hoggers are good
enough to trigger the bad behaviour. The final effect is that physical
memory gets extremely flooded with swap cache pages and at the
same time the system absorbs a ridiculous amount of swap space.
xmem is, as usual, very good at detecting this, and you just need to
press Alt-SysRq-M to see that most of the memory (e.g. 90%) is
populated with swap cache pages.

For instance, on my 192MB configuration, firing up the hogmem program
and having it allocate, let's say, 170MB of memory and dirty it leads to
215MB of swap used. vmstat 1 shows that the page cache size is
constantly growing - that is in fact the swap cache enlarging - during
the second pass of the hogmem program.

...
   procs                  memory      swap          io     system         cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs  us  sy  id
 0  1  1 131488  1592  400  62384 4172 5188  1092  1298  353  1447   2   4  94
 0  1  1 136584  1592  400  67428 5860 4104  1465  1034  322  1327   3   3  93
 0  1  1 141668  1592  388  72536 5504 4420  1376  1106  323  1423   1   3  95
 0  1  1 146724  1592  380  77592 5996 4236  1499  1060  335  1096   2   3  94
 0  1  1 151876  1600  320  82764 6264 3712  1566   936  327  1226   3   4  93
 0  1  1 157016  1600  320  87908 5284 4268  1321  1068  315  1248   1   2  96
 1  0  0 157016  1600  308  87792 1836 5168   459  1293  281  1324   3   3  94
 0  1  0 162204  1600  304  92892 7784 5236  1946  1315  385  1353   3   5  92
 0  1  0 167216  1600  304  97780 3496 5016   874  1256  301  1222   0   2  97
 0  1  1 177904  1608  284 108276 5160 5168  1290  1300  330  1453   1   4  94
 0  1  2 182008  1588  288 112264 4936 3344  1268   838  293   801   2   3  95
 0  2  1 183620  1588  260 114012 3064 1756   830   445  290   846   0  15  85
 0  2  2 185384  1596  180 115864 2320 2620   635   658  285   722   1  29  70
 0  3  2 187528  1592  220 117892 2488 2224   657   557  273   754   3  30  67
 0  4  1 190512  1592  236 120772 2524 3012   725   760  343  1080   1  14  85
 0  4  1 195780  1592  240 125868 2336 5316   613  1331  381  1624   2   2  96
 1  0  1 200992  1592  248 131052 2080 2176   623   552  234  1044   3  23  74
 0  1  0 200996  1592  252 130948 2208 3048   580   762  256  1065  10  10  80
 0  1  1 206240  1592  252 136076 2988 5252   760  1314  309  1406   7   4  8
 0  2  1 211408  1592  256 141080 5424 5180  1389  1303  395  1885   3   5  91
 0  2  0 214744  1592  264 144280 4756 3328  1223   834  327  1211   1   5  95
 1  0  0 214868  1592  244 144468 4344 5148  1087  1295  303  1189  11   2  86
 0  1  1 214900  1592  248 144496 4360 3244  1098   812  318  1467   7   4  89
 0  1  1 214916  1592  248 144520 4280 3452  1070   865  336  1602   3   3  94
 0  1  1 214964  1592  248 144580 4972 4184  1243  1054  368  1620   3   5  92
 0  2  2 214956  1592  272 144548 3700 4544  1081  1142  665  2952   1   1  98
 0  1  0 214992  1592  272 144588 1220 5088   305  1274  282  1363   1   4  95
 0  1  1 215012  1592  272 144600 3640 4420   910  1106  325  1579   3   2  9

Any thoughts on this?
-- 
Zlatko



[patch] mm-cleanup-1 (2.4.0)

2001-01-07 Thread Zlatko Calusic

The following patch cleans up some obsolete structures from the mm &
proc code.

Besides that, it also fixes what I think is a bug:

if ((rw == WRITE) && atomic_read(&nr_async_pages) >
   pager_daemon.swap_cluster * (1 << page_cluster))

In that (swapout logic) it effectively says swap out 512KB at once (at
least on my memory configuration). I think that is a little too much.
I modified it to be a little bit more conservative and send only
(1 << page_cluster) pages to swap at a time. The same applies to the
swapin_readahead() function. Comments welcome.
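
For the curious, the 512KB figure is just swap_cluster * (1 << page_cluster)
pages times the page size; back-of-envelope, with defaults that are my
assumption rather than taken from this thread:

#include <stdio.h>

int main(void)
{
        int swap_cluster = 8;  /* assumed pager_daemon.swap_cluster */
        int page_cluster = 4;  /* assumed */
        long pages = swap_cluster * (1 << page_cluster); /* 128 pages */

        printf("%ld pages x 4KB = %ldKB in flight\n", pages, pages * 4);
        return 0;
}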

Index: 0.2/mm/oom_kill.c
--- 0.2/mm/oom_kill.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/0_oom_kill.c 
1.1 644)
+++ 0.6/mm/oom_kill.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic (linux24/j/0_oom_kill.c 
+1.2 644)
@@ -18,7 +18,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 /* #define DEBUG */
Index: 0.2/mm/bootmem.c
--- 0.2/mm/bootmem.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/3_bootmem.c 
1.1 644)
+++ 0.6/mm/bootmem.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic (linux24/j/3_bootmem.c 
+1.2 644)
@@ -12,7 +12,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
Index: 0.2/mm/swap_state.c
--- 0.2/mm/swap_state.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic 
(linux24/j/6_swap_state 1.1 644)
+++ 0.6/mm/swap_state.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic 
+(linux24/j/6_swap_state 1.2 644)
@@ -10,7 +10,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
Index: 0.2/mm/swapfile.c
--- 0.2/mm/swapfile.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/8_swapfile.c 
1.1 644)
+++ 0.6/mm/swapfile.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic (linux24/j/8_swapfile.c 
+1.2 644)
@@ -9,7 +9,6 @@
 #include 
 #include 
 #include 
-#include 
 #include  /* for blk_size */
 #include 
 #include 
Index: 0.2/mm/vmscan.c
--- 0.2/mm/vmscan.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/9_vmscan.c 1.1 
644)
+++ 0.6/mm/vmscan.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic (linux24/j/9_vmscan.c 1.2 
+644)
@@ -15,7 +15,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
Index: 0.2/mm/page_io.c
--- 0.2/mm/page_io.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/10_page_io.c 
1.1 644)
+++ 0.6/mm/page_io.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic (linux24/j/10_page_io.c 
+1.3 644)
@@ -14,7 +14,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 
@@ -44,7 +43,7 @@
 
/* Don't allow too many pending pages in flight.. */
if ((rw == WRITE) && atomic_read(&nr_async_pages) >
-   pager_daemon.swap_cluster * (1 << page_cluster))
+   (1 << page_cluster))
wait = 1;
 
if (rw == READ) {
Index: 0.2/mm/filemap.c
--- 0.2/mm/filemap.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/12_filemap.c 
1.1 644)
+++ 0.6/mm/filemap.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic (linux24/j/12_filemap.c 
+1.2 644)
@@ -18,7 +18,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
Index: 0.2/mm/page_alloc.c
--- 0.2/mm/page_alloc.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic 
(linux24/j/14_page_alloc 1.1 644)
+++ 0.6/mm/page_alloc.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic 
+(linux24/j/14_page_alloc 1.2 644)
@@ -12,7 +12,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
Index: 0.2/mm/mmap.c
--- 0.2/mm/mmap.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/16_mmap.c 1.1 
644)
+++ 0.6/mm/mmap.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic (linux24/j/16_mmap.c 1.2 
+644)
@@ -8,7 +8,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
Index: 0.2/mm/swap.c
--- 0.2/mm/swap.c Sat, 06 Jan 2001 01:48:21 +0100 zcalusic (linux24/j/17_swap.c 1.1 
644)
+++ 0.6/mm/swap.c Sun, 07 Jan 2001 20:17:13 +0100 zcalusic (linux24/j/17_swap.c 1.4 
+644)
@@ -10,13 +10,11 @@
  * linux/Documentation/sysctl/vm.txt.
  * Started 18.12.91
  * Swap aging added 23.2.95, Stephen Tweedie.
- * Buffermem limits added 12.3.98, Rik van Riel.
  */
 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -42,6 +40,10 @@
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
 
+/* We track the number of pages currently being asynchronously swapped
+   out, so that we don't try to swap TOO many pages out at once */
+atomic_t nr_async_pages = ATOMIC_INIT(0);
+
 /*
  * This variable contains the amount of page steals the system
  * is doing, averaged over a minute. We use this to determine how
@@ -52,28 +54,6 @@
  * In recalculate_vm_stats the value is decayed (once a second)
  */
 int memory_pressure;
-
-/* We track the number of pages currently being asynchronously swapped
-   out, so that we don't try to swap TOO many pages out at once */
-atomic_t nr_async_pages = ATOMIC_INIT(0);
-
-buffer_mem_t buffer_mem = {
-   2,  /* minimum percent buffer */
-   10, /* borrow percent buffer */
-   60   

Re: [PATCH] add PF_MEMALLOC to __alloc_pages()

2001-01-03 Thread Zlatko Calusic

Rik van Riel <[EMAIL PROTECTED]> writes:

> Hi Linus, Alan, Mike,
> 
> the following patch sets PF_MEMALLOC for the current task
> in __alloc_pages() to avoid infinite recursion when we try
> to free memory from __alloc_pages().
> 
> Please apply the patch below, which fixes this (embarrasing)
> bug...
> 
[snip]
>* free ourselves...
>*/
>   } else if (gfp_mask & __GFP_WAIT) {
> + current->flags |= PF_MEMALLOC;
>   try_to_free_pages(gfp_mask);
> + current->flags &= ~PF_MEMALLOC;
>   memory_pressure++;
>   if (!order)
>   goto try_again;
> 

Hm, try_to_free_pages already sets the PF_MEMALLOC flag!
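
For reference, this is roughly how I read that entry point (a
simplified sketch, not a verbatim copy of mm/vmscan.c):

int try_to_free_pages(unsigned int gfp_mask)
{
	int ret = 1;

	if (gfp_mask & __GFP_WAIT) {
		current->flags |= PF_MEMALLOC;	/* already set here */
		ret = do_try_to_free_pages(gfp_mask);
		current->flags &= ~PF_MEMALLOC;
	}
	return ret;
}

so wrapping the call in __alloc_pages() with the same flag twiddling
would be redundant.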
-- 
Zlatko
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



[patch] gemtek radio obvious fix

2000-12-30 Thread Zlatko Calusic

Index: 24012.6/drivers/media/radio/radio-gemtek.c
--- 24012.6/drivers/media/radio/radio-gemtek.c Thu, 14 Dec 2000 00:08:47 +0100 
zcalusic (linux/P/d/1_radio-gemt 1.1.2.2.3.1 644)
+++ 24012.7(w)/drivers/media/radio/radio-gemtek.c Sat, 30 Dec 2000 12:06:56 +0100 
+zcalusic (linux/P/d/1_radio-gemt 1.1.2.2.3.2 644)
@@ -265,7 +265,7 @@
return -EINVAL;
}
 
-   if (request_region(io, 4, "gemtek")) 
+   if (!request_region(io, 4, "gemtek")) 
{
printk(KERN_ERR "gemtek: port 0x%x already in use\n", io);
return -EBUSY;
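
The inversion matters because request_region() here returns a struct
resource pointer, NULL when the region is already claimed, so the
error path must trigger on a NULL return. A minimal sketch of the
idiom (the matching teardown added for illustration):

	if (!request_region(io, 4, "gemtek")) {	/* NULL: port busy */
		printk(KERN_ERR "gemtek: port 0x%x already in use\n", io);
		return -EBUSY;
	}

	/* and on the unload path, the matching release: */
	release_region(io, 4);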

-- 
Zlatko
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: innd mmap bug in 2.4.0-test12

2000-12-27 Thread Zlatko Calusic

Rik van Riel <[EMAIL PROTECTED]> writes:

> On Mon, 25 Dec 2000, Dan Aloni wrote:
> > On 25 Dec 2000, Zlatko Calusic wrote:
> > 
> > > Speaking of page_launder() I just stumbled upon two oopsen today on
> > > the UP build. Maybe it could give a hint to someone, I'm not that good
> > > at Oops decoding.
> > 
> > I've decoded the Oops I got, and found that the problem is in
> > vmscan.c:line-605, where page->mapping is NULL and a_ops gets
> > resolved and dereferenced at 0x000c.
> 
> The code assumes that every page which has the PG_dirty
> bit set also has page->mapping set to a valid value.
> 
> The BUG() people are getting confirms that this assumption
> is not necessarily true and the VM work that's going on will
> most likely make it not be true either in some cases.
> 
> The (trivial) patch below should fix this problem.
> 
> Linus and/or Alan, could you please apply this for the next
> pre-patch ?
> 

Looking at the patch, I'm practically sure it will cure the symptoms.
But I'm still slightly worried about the pages we skip in there. Maybe
we should at least try to discover what those pages are; then it might
become obvious what we need (or don't need) to do with them.

Some strategic printk()s could probably give us some clue.
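
Something like this, for instance (a hypothetical diagnostic, using
the names from the page_launder() code, just to illustrate the idea):

	/* Log what a dirty page without a mapping looks like
	 * before the patched check skips it. */
	if (PageDirty(page) && !page->mapping) {
		printk(KERN_DEBUG "page_launder: dirty page %p, "
		       "flags=%08lx count=%d\n",
		       page, page->flags, page_count(page));
	}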

Too bad I lost track of the recent changes due to catastrophic time
shortage. And that is a shame, as I'm very satisfied with the current
VM code, thanks to your hard work, Rik.

> regards,
> 
> Rik
> --
> Hollywood goes for world dumbination,
>   Trailer at 11.
> 
>   http://www.surriel.com/
> http://www.conectiva.com/ http://distro.conectiva.com.br/
> 
> 
> --- vmscan.c.orig Wed Dec 27 16:48:24 2000
> +++ vmscan.c  Wed Dec 27 17:14:32 2000
> @@ -601,7 +601,7 @@
>* Dirty swap-cache page? Write it out if
>* last copy..
>*/
> - if (PageDirty(page)) {
> + if (PageDirty(page) && page->mapping) {
>   int (*writepage)(struct page *) = 
>page->mapping->a_ops->writepage;
>   int result;
>  
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> Please read the FAQ at http://www.tux.org/lkml/
> 

-- 
Zlatko
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: innd mmap bug in 2.4.0-test12

2000-12-24 Thread Zlatko Calusic

Linus Torvalds <[EMAIL PROTECTED]> writes:

> On Sun, 24 Dec 2000, Linus Torvalds wrote:
> > 
> > Marco, would you mind changing the test in reclaim_page(), somewheer
> > around line mm/vmscan.c:487 that says:
> 

Speaking of page_launder(), I just stumbled upon two oopsen today on
the UP build. Maybe they could give a hint to someone; I'm not that
good at Oops decoding.

Merry Christmas!


Unable to handle kernel NULL pointer dereference at virtual address 000c
 printing eip:
c012872e
*pde = 
Oops: 
CPU:0
EIP:0010:[page_launder+510/2156]
EFLAGS: 00010202
eax:    ebx: c12e2ce8   ecx: c1244474   edx: 
esi: c12e2d04   edi:    ebp:    esp: c15d1fb4
ds: 0018   es: 0018   ss: 0018
Process bdflush (pid: 6, stackpage=c15d1000)
Stack: c15d  c15d023a 0008e000   0001 2933 
    c0131e5d 0003  00010f00 c146ff88 c146ffc4 c01073fc 
   c146ffc4 0078 c146ffc4 
Call Trace: [bdflush+141/236] [kernel_thread+40/56] 
Code: 8b 40 0c 8b 00 85 c0 0f 84 ba 04 00 00 83 7c 24 10 00 75 73 


Unable to handle kernel NULL pointer dereference at virtual address 000c
 printing eip:
c012872e
*pde = 
Oops: 
CPU:0
EIP:0010:[page_launder+510/2156]
EFLAGS: 00010202
eax:    ebx: c1260eec   ecx: c15d5fe0   edx: c02917f0
esi: c1260f08   edi:    ebp:    esp: c15d5f9c
ds: 0018   es: 0018   ss: 0018
Process kswapd (pid: 4, stackpage=c15d5000)
Stack: 00010f00 0004   0004   2938 
    c01290fc 0004  00010f00 c01f77f7 c15d4239 0008e000 
   c01291c6 0004  c146ffb8  c01073fc  0078 
Call Trace: [do_try_to_free_pages+52/128] [tvecs+8683/64084] [kswapd+126/288] 
[kernel_thread+40/56] 
Code: 8b 40 0c 8b 00 85 c0 0f 84 ba 04 00 00 83 7c 24 10 00 75 73 
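
Both traps are on the same instruction sequence, for what it's worth.
A rough decode of the Code: bytes (my annotation; the field names are
an educated guess from the 0x000c fault address):

	8b 40 0c   mov 0xc(%eax),%eax   /* eax = page->mapping, here NULL;
					   loading the field at offset 0xc,
					   presumably ->a_ops, faults */
	8b 00      mov (%eax),%eax      /* would load a_ops->writepage */
	85 c0      test %eax,%eax       /* NULL check on that pointer */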

-- 
Zlatko
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/