date:20071220

Re: [Fwd: Re: [PATCH 1/5]PCI: x86 MMCONFIG: introduce PCI_USING_MMCONF]

2007-12-20 Thread Greg KH

On Thu, Dec 20, 2007 at 09:19:00AM -0500, Tony Camuso wrote:
 Tony Camuso wrote:

 Greg KH wrote:
 On Wed, Dec 19, 2007 at 05:17:51PM -0500, [EMAIL PROTECTED] wrote:
  +extern struct pci_ops pci_legacy_ops; /* direct.c */

 This isn't needed in this patch at all, and might make the compiler
 confused if you were to build with only this patch present :(

 thanks,

 greg k-h
 Yes, of course. I missed that. Thank you.
 I take that back.

 This struct must be declared extern because it is referenced in
 arch/x86/pci/common.c by pcibios_fix_bus_scan_quirk()

Sure, but you do not reference it in this patch, right?  So it's not
needed until you actually use it, so just include it in the patch that
you are needing it in.

thanks,

greg k-h
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Fwd: Re: [PATCH 1/5]PCI: x86 MMCONFIG: introduce PCI_USING_MMCONF]

2007-12-20 Thread Tony Camuso


Greg KH wrote:


Sure, but you do not reference it in this patch, right?  So it's not
needed until you actually use it, so just include it in the patch that
you are needing it in.

thanks,

greg k-h


Will do.

Re: [PATCH 1/5] power: RFC: introduce a new power API

2007-12-20 Thread Anton Vorontsov

On Thu, Dec 20, 2007 at 11:00:30AM -0500, Andres Salomon wrote:
 On Thu, 20 Dec 2007 18:07:16 +0300
 Anton Vorontsov [EMAIL PROTECTED] wrote:
 
  Hi Andres,
  
 [...]
  
  Then, what the power supply subsystem is for? Just place all the
  drivers together in driver/power/, and let them create sysfs
  attributes by their own. You'll get a medley, not the subsystem.
  
  Good luck,
  
 
 
 Ok, I'm really tired of arguing about this.

Me either, you wouldn't believe.

 I've pointed out why
 the current API is inadequate, but you seem to be resisting
 changing it.

No, you didn't point out anything. You didn't describe _why_ you
want to change the subsystem. No single reason except flexibility
that _no one_ is using.

 It's your subsystem, do what you like.

No, this isn't my subsystem. I've written most of it, probably.
But now I'm just volunteering to co-maintain it (that is, the person
whom to blame if something goes wrong in power/). I'm also directly
interested in this subsystem, because I have the hardware on which
I'm using it.

But I'm not the last person you can ask to merge your changes. Whole
LKML is hearing us, and if you want to--ask David to push your
changes over me. Or Andrew. Or ask Linus at the last. They are
much better programmers than I am, so whatever they'll do would be
the right thing. I'll just swallow it.

Though, on the current patches, please stamp my:

Nacked-by: Anton Vorontsov [EMAIL PROTECTED]

so the history will have written evidence that I did not agree on
the changes, so years later I could reply to the complaints: Look,
I nacked this code, it wasn't me who checked this in!


You have another option though: finally show the user of the
proposed changes. Without single flexible word, please. Real,
existent user.

-- 
Anton Vorontsov
email: [EMAIL PROTECTED]
backup email: [EMAIL PROTECTED]
irc://irc.freenode.net/bd2

Testing RAM from userspace / question about memmap= arguments

2007-12-20 Thread Siva Prasad


Hi Matthew,

I worked on something similar. For one of our customer products that
goes to defense and security markets, we had to support the maximum possible
memory test. We implemented a pre-test mechanism to test the memory
with walking 1's and 0's just before the Linux kernel starts allocating
serious memory for its use. That way, coverage was almost 99%. Once
Linux boots, we do a very thorough test using various algorithms; however,
as you said, coverage of memory is a little less when we test the system
after Linux boots up completely.
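
A minimal userspace sketch of a walking 1's and 0's pass, assuming the
region under test is presented as an array of 32-bit words. The helper
name walking_bit_test() is hypothetical and is not the code described
above; a real pre-test would run on physical memory before the allocator
owns it.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Walking-ones / walking-zeros test over a buffer of 32-bit words.
 * Returns the index of the first word that fails, or -1 if all pass.
 * (Hypothetical helper for illustration only.)
 */
long walking_bit_test(volatile uint32_t *buf, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        for (unsigned bit = 0; bit < 32; bit++) {
            uint32_t ones = (uint32_t)1 << bit;  /* a single 1 among 0s */
            uint32_t zeros = ~ones;              /* a single 0 among 1s */

            buf[i] = ones;
            if (buf[i] != ones)
                return (long)i;

            buf[i] = zeros;
            if (buf[i] != zeros)
                return (long)i;
        }
    }
    return -1;
}
```

The volatile qualifier keeps the compiler from optimizing away the
read-back; it does not bypass CPU caches, which a real test would also
have to deal with.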


memtest86+ started as a very good alternative, until our customer's customer
started complaining about memory issues. Then we had no choice but to
take this route and implement it ourselves from scratch.


If you want 100% coverage, it may not be possible unless you do it in
the BIOS early on. If you take the route of implementing some simple memory
test in the Linux kernel before it starts allocating memory, you get a very
good % of coverage. Good luck.


- Siva


Date: Thu, 20 Dec 2007 14:17:10 +
From: Matthew Bloch [EMAIL PROTECTED]
Subject: Re: Testing RAM from userspace / question about memmap= arguments
To: linux-kernel@vger.kernel.org
Message-ID: [EMAIL PROTECTED]
Content-Type: text/plain; charset=ISO-8859-1

Jon Masters wrote:
 On Tue, 2007-12-18 at 17:06 +, Matthew Bloch wrote:

 I can see a few potential problems, but since my understanding of the
 low-level memory mapping is muddy at best, I won't speculate; I'd just
 appreciate any more expert views on whether this does work, or could be
 made to work.

 Yo,

 I don't think your testing approach is thorough enough. Clearly (knowing
 your line of business - as a virtual machine provider), you want to do
 pre-production testing as part of your provisioning. I would suggest
 instead of using mlock() from userspace of simply writing a kernel
 module that does this for every page of available memory.

Yes this is to improve the efficiency of server burn-ins.  I would
consider a kernel module, but I still wouldn't be able to test the
memory in which the kernel is sitting, which is my problem.  I'm not
sure even a kernel module could reliably test the memory in which it is
residing (memtest86+ relocates itself to do this).  Also I don't see how
userspace testing is any less thorough than doing it in the kernel; I
just need a creative way of accessing every single page of memory.

I may do some experiments with the memmap args, some bad RAM and
shuffling it between DIMM sockets when I have the time :)

--
Matthew

Re: [PATCH] Move page_assign_page_cgroup to VM_BUG_ON in free_hot_cold_page

2007-12-20 Thread Andrew Morton

On Thu, 20 Dec 2007 21:54:15 +0530 Balbir Singh [EMAIL PROTECTED] wrote:

  struct page_cgroup *page_get_page_cgroup(struct page *page)
  {
 return (struct page_cgroup *)
 (page->page_cgroup & ~PAGE_CGROUP_LOCK);
  }
 
  I guess the issue is that often a get function has a complementary
  put function, but this isn't one of them.  Would page_page_cgroup
  be a better name, perhaps?  I don't know.
  
  Ah, yes, I mistakenly assumed it was a reference get. In that case I
  stand corrected and do not have any objections.
  
 
 I was going to say the same thing, page_get_page_cgroup() does not hold
 any references. Maybe _get_ in the name is confusing.

It is a bit unconventional.  page_cgroup()?
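
For readers unfamiliar with the pattern under discussion, here is a
userspace sketch of an accessor that strips a lock bit out of a
pointer-sized word and takes no reference. All names (LOCK_BIT, struct
holder, holder_owner()) are hypothetical stand-ins, and the sketch
assumes the pointed-to object is at least 2-byte aligned so that bit 0
is free.

```c
#include <stdint.h>

#define LOCK_BIT ((uintptr_t)1)  /* hypothetical stand-in for PAGE_CGROUP_LOCK */

struct owner { int id; };

struct holder {
    uintptr_t word;  /* pointer to struct owner, with LOCK_BIT kept in bit 0 */
};

/* Store the pointer and, optionally, the lock bit in the same word. */
void holder_set(struct holder *h, struct owner *o, int locked)
{
    h->word = (uintptr_t)o | (locked ? LOCK_BIT : 0);
}

/* Plain accessor: masks off the lock bit and takes no reference. */
struct owner *holder_owner(const struct holder *h)
{
    return (struct owner *)(h->word & ~LOCK_BIT);
}

int holder_is_locked(const struct holder *h)
{
    return (h->word & LOCK_BIT) != 0;
}
```

Because nothing is pinned, there is no matching "put", which is the
reason a _get_ in the name can mislead.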

Re: [rfc][patch] mm: madvise(WILLNEED) for anonymous memory

2007-12-20 Thread Matt Mackall

On Thu, Dec 20, 2007 at 05:53:41PM +0100, Peter Zijlstra wrote:
 
 On Thu, 2007-12-20 at 15:26 +, Hugh Dickins wrote:
 
  The asynch code: perhaps not worth doing for MADV_WILLNEED alone,
  but might prove useful for more general use when swapping in.
  Not really the same as Con's swap prefetch, but worth looking
  at that for reference.  But I guess this becomes a much bigger
  issue than you were intending to get into here.
 
 heh, yeah, got somewhat more complex than I'd hoped for.
 
 last patch for today (not even compile tested), will do a proper patch
 and test it tomorrow.
 
 ---
 A best effort MADV_WILLNEED implementation for anonymous memory.
 
 It adds a batch method to the page table walk routines so we can
 copy a few ptes while holding the kmap, which makes it possible to
 allocate the backing pages using GFP_KERNEL.

Yuck. We actually need to just fix the atomic kmap issue in the
existing pagemap code rather than add a new method, I think.

If performance of map/unmap is too slow at a granularity of 1, we can
add some internal batching in the CONFIG_HIGHPTE case.

-- 
Mathematics is the supreme nostalgia of our time.

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Stephen Hemminger

On Tue, 18 Dec 2007 20:13:28 -0500 (EST)
Parag Warudkar [EMAIL PROTECTED] wrote:

 
 sky2 can use deferrable timer for watchdog - reduces wakeups from idle per 
 second.
 
 Signed-off-by: Parag Warudkar [EMAIL PROTECTED]
 
 --- linux-2.6/drivers/net/sky2.c  2007-12-07 10:04:39.0 -0500
 +++ linux-2.6-work/drivers/net/sky2.c 2007-12-18 20:07:58.0 -0500
 @@ -4230,7 +4230,10 @@
   sky2_show_addr(dev1);
   }
 
 - setup_timer(&hw->watchdog_timer, sky2_watchdog, (unsigned long) hw);
 + hw->watchdog_timer.function = sky2_watchdog;
 + hw->watchdog_timer.data = (unsigned long) hw;
 + init_timer_deferrable(&hw->watchdog_timer);
 +
   INIT_WORK(&hw->restart_work, sky2_restart);
 
   pci_set_drvdata(pdev, hw);

Does it really reduce the wakeups or only change who gets charged by powertop?
The system is going to wakeup once a second anyway. Looks to me that if the
timer is using round_jiffies(), that setting deferrable just changes the
accounting.

My interpretation of the api is:
   * round_jiffies()  - timer wants to wakeup but isn't precise about when, so
                        schedule on next second when system will wake up anyway;
                        e.g. why meetings are usually scheduled on the hour

   * deferrable       - timer doesn't have to really wakeup but wants to happen
                        near a particular time. e.g. I'll meet you at the pub
                        around 8pm

Therefore doing deferrable is unnecessary for timers using round_jiffies unless
the system is so good at doing timers that it is going to skip doing timer once
per second.
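
To make the "schedule on next second" idea concrete, here is a
simplified userspace sketch that rounds a tick count up to the next
whole-second boundary. round_up_to_second() and its hz parameter are
hypothetical; the kernel's round_jiffies() differs in detail, so treat
this only as an illustration of why such timers tend to coincide.

```c
/*
 * Round a tick count up to the next multiple of hz (ticks per second).
 * Timers rounded this way all land on the same tick, so one wakeup can
 * service many of them. (Simplified illustration, not the kernel code.)
 */
unsigned long round_up_to_second(unsigned long ticks, unsigned long hz)
{
    unsigned long rem = ticks % hz;

    return rem ? ticks + (hz - rem) : ticks;
}
```

With hz = 1000, expiry requests at ticks 1001 and 1999 both become
2000, which is the batching effect described above.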

-- 
Stephen Hemminger [EMAIL PROTECTED]

Re: [rfc][patch] mm: madvise(WILLNEED) for anonymous memory

2007-12-20 Thread Peter Zijlstra


On Thu, 2007-12-20 at 11:11 -0600, Matt Mackall wrote:
 On Thu, Dec 20, 2007 at 05:53:41PM +0100, Peter Zijlstra wrote:
  
  On Thu, 2007-12-20 at 15:26 +, Hugh Dickins wrote:
  
   The asynch code: perhaps not worth doing for MADV_WILLNEED alone,
   but might prove useful for more general use when swapping in.
   Not really the same as Con's swap prefetch, but worth looking
   at that for reference.  But I guess this becomes a much bigger
   issue than you were intending to get into here.
  
  heh, yeah, got somewhat more complex than I'd hoped for.
  
  last patch for today (not even compile tested), will do a proper patch
  and test it tomorrow.
  
  ---
  A best effort MADV_WILLNEED implementation for anonymous memory.
  
  It adds a batch method to the page table walk routines so we can
  copy a few ptes while holding the kmap, which makes it possible to
  allocate the backing pages using GFP_KERNEL.
 
 Yuck. We actually need to just fix the atomic kmap issue in the
 existing pagemap code rather than add a new method, I think.
 
 If performance of map/unmap is too slow at a granularity of 1, we can
 add some internal batching in the CONFIG_HIGHPTE case.

OK, sounds like a much better idea indeed. Will implement that.


Re: [ofa-general] iommu dma mapping alignment requirements

2007-12-20 Thread Tom Tucker


On Thu, 2007-12-20 at 11:14 -0600, Steve Wise wrote:
 Hey Roland (and any iommu/ppc/dma experts out there):
 
 I'm debugging a data corruption issue that happens on PPC64 systems 
 running rdma on kernels where the iommu page size is 4KB yet the host 
 page size is 64KB.  This feature was added to the PPC64 code recently, 
 and is in kernel.org from 2.6.23.  So if the kernel is built with a 4KB 
 page size, no problems.  If the kernel is prior to 2.6.23 then 64KB page 
   configs work too. It's just a problem when the iommu page size != host
 page size.
 
 It appears that my problem boils down to a single host page of memory 
 that is mapped for dma, and the dma address returned by dma_map_sg() is 
 _not_ 64KB aligned.  Here is an example:
 
 app registers va 0x2d9a3000 len 12288
 ib_umem_get() creates and maps a umem and chunk that looks like (dumping 
 state from a registered user memory region):
 
  umem len 12288 off 12288 pgsz 65536 shift 16
  chunk 0: nmap 1 nents 1
  sglist[0] page 0xc0930b08 off 0 len 65536 dma_addr 
  5bff4000 dma_len 65536
  
 
 So the kernel maps 1 full page for this MR.  But note that the dma 
 address is 5bff4000 which is 4KB aligned, not 64KB aligned.  I 
 think this is causing grief to the RDMA HW.
 
 My first question is: Is there an assumption or requirement in linux 
 that dma_addresses should have the same alignment as the host address
 they are mapped to?  IE the rdma core is mapping the entire 64KB page, 
 but the mapping doesn't begin on a 64KB page boundary.
 
 If this mapping is considered valid, then perhaps the rdma hw is at 
 fault here.  But I'm wondering if this is a PPC/iommu bug.
 
 BTW:  Here is what the Memory Region looks like to the HW:
 
  TPT entry:  stag idx 0x2e800 key 0xff state VAL type NSMR pdid 0x2
  perms RW rem_inv_dis 0 addr_type VATO
  bind_enable 1 pg_size 65536 qpid 0x0 pbl_addr 0x003c67c0
  len 12288 va 2d9a3000 bind_cnt 0
  PBL: 5bff4000
 
 
 
 Any thoughts?

The Ammasso certainly works this way. If you tell it the page size is
64KB, it will ignore bits in the page address that encode 0-65535.
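
A small sketch of what "ignore bits in the page address" would mean for
the address in Steve's dump, assuming the device simply masks off the
low log2(pg_size) bits. The helper names are hypothetical and the
address is used as printed in the post.

```c
#include <stdint.h>

/* Address a device would use if it ignores the low log2(pg_size) bits. */
uint64_t truncated_base(uint64_t dma_addr, uint64_t pg_size)
{
    return dma_addr & ~(pg_size - 1);
}

/* How far the device's view starts before the real mapping. */
uint64_t misplacement(uint64_t dma_addr, uint64_t pg_size)
{
    return dma_addr - truncated_base(dma_addr, pg_size);
}
```

For dma_addr 0x5bff4000 and pg_size 65536, truncated_base() gives
0x5bff0000, so such a device would start 0x4000 bytes before the
mapped region; with pg_size 4096 the address is unchanged.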

 
 Steve.
 
 
 ___
 general mailing list
 [EMAIL PROTECTED]
 http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
 
 To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [Fwd: Re: [PATCH 4/5]PCI: x86 MMCONFIG: introduce pcibios_fix_bus_scan()]

2007-12-20 Thread Greg KH

On Thu, Dec 20, 2007 at 07:26:17AM -0500, Tony Camuso wrote:
 +
 +#define CHECK_MMCFG_STR_1 \
 +   "PCI: Device at %04x:%02x.%02x.%x is not MMCONFIG compliant.\n"
 +#define CHECK_MMCFG_STR_2 \
 +   "PCI: Bus %04x:%02x and its descendents cannot use MMCONFIG.\n"
 Why define these if they are only used in one place?

 If you object, I will be happy to move them into the routine body
 without the defines. I agree that it does look inconsistent to have
 these strings defined and other strings embedded in the routine body.

Yes, please fix this.

 Also, as you use dev_info(), I think you are duplicating some of the
 information in the resulting printk(), right?
 Actually, no. The strings do not contain redundant info. The pr_info
 routine is just a macro for printk(KERN_INFO ...)

Ah, sorry, I was thinking you were using dev_info(), which is what you
should be using instead anyway :)

thanks,

greg k-h

Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Greg KH

On Thu, Dec 20, 2007 at 07:28:00AM -0500, Tony Camuso wrote:

  Original Message 
 Subject: Re: [PATCH 0/5]PCI: x86 MMCONFIG
 Date: Wed, 19 Dec 2007 19:33:45 -0500
 From: Tony Camuso [EMAIL PROTECTED]
 Reply-To: [EMAIL PROTECTED]
 To: Greg KH [EMAIL PROTECTED]
 References: 
 [EMAIL PROTECTED] 
 [EMAIL PROTECTED]

 Greg KH wrote:
 On Wed, Dec 19, 2007 at 05:17:46PM -0500, [EMAIL PROTECTED] wrote:
 There exist devices that do not respond correctly to PCI
 MMCONFIG accesses in x86 platforms.
 What devices are these?  Do you have reports of them somewhere?
 There are the AMD 8131 and 8132, the Serverworks HT1000 bridge chips
 and the 830M/MG graphics. Not all versions of these chips present
 this pathology, but there are perhaps tens of thousands of systems
 out there that have the broken versions of these chipsets.

Why haven't we gotten reports about this before if this is a common
problem?

And why hasn't the vendor fixed the bios on these to work properly?

 RedHat have been maintaining a blacklist of systems having these
 devices. Systems in the blacklist are confined to legacy PCI
 access.

Do you have a pointer to this blacklist anywhere so that everyone can
benefit from this knowledge?

 That sounds like this patchset can cause bad side effects on hardware
 that currently works just fine.  That is not a good thing to be adding
 to the kernel, right?
 No, the patch set tries to obviate this without requiring endusers to
 write customized scripts with pci=nommconf and without requiring the
 RH folks to add another platform (usually belatedly) to the blacklist.

 If a device is going to machine check when you touch it with an mmconfig
 access, it will happen with or without this patch-set.

 However, the patch-set does cover most of the devices that don't respond
 well to mmconfig access. Such devices almost always return garbage when
 you read from them.

 The one device we know about that throws exceptions is the 830M/MG
 graphics chip. This chip passes the read-compare test, so the code
 merrily advances to bus sizing. When the bus sizing code writes the
 BAR at offset 0x18 in this device, the system hangs.

So it doesn't work at all, with or without this patch?  Does the vendor
know about this?

thanks,

greg k-h

Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Jan Kara

 On Thu, 20 Dec 2007, Björn Steinbrink wrote:
  
  OK, so I looked for PG_dirty anyway.
  
  In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
  bail out if the page is dirty.
  
  Then in 3e67c0987d7567ad41164a153dca9a43b11d, Andrew fixed
  truncate_complete_page, because it called cancel_dirty_page (and thus
  cleared PG_dirty) after try_to_free_buffers was called via
  do_invalidatepage.
  
  Now, if I'm not mistaken, we can end up as follows.
  
  truncate_complete_page()
cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
do_invalidatepage()
  ext3_invalidatepage()
journal_invalidatepage()
  journal_unmap_buffer()
__dispose_buffer()
  __journal_unfile_buffer()
__journal_temp_unlink_buffer()
  mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
 
 Good, this seems to be the exact path that actually triggers it. I got to 
 journal_unmap_buffer(), but was too lazy to actually then bother to follow 
 it all the way down - I decided that I didn't actually really even care 
 what the low-level FS layer did, I had already convinced myself that it 
 obviously must be dirtying the page some way, since that matched the 
 symptoms exactly (ie only the journaling case was impacted, and this was 
 all about the journal).
 
 But perhaps more importantly: regardless of what the low-level filesystem 
 did at that point, the VM accounting shouldn't care, and should be robust 
 in the face of a low-level filesystem doing strange and wonderful things. 
 But thanks for bothering to go through the whole history and figure out 
 what exactly is up.
  As I wrote in my previous email, this solution works but hides the
fact that the page really *has* dirty data in it and *is* pinned in memory
until the commit code gets to writing it. So in theory it could disturb
the writeout logic by having more dirty data in memory than vm thinks it
has. Not that I'd have a better fix now but I wanted to point out this
problem.
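
The ordering problem in the call chain quoted above can be shown with a
toy userspace model. Everything here (struct toy_vm, toy_truncate() and
friends) is hypothetical, and it assumes the page is dropped right after
the invalidate step with no further accounting; it is not the kernel
code.

```c
struct toy_vm   { long nr_dirty; };  /* global count of dirty pages */
struct toy_page { int dirty; };      /* stands in for PG_dirty      */

void toy_set_dirty(struct toy_vm *vm, struct toy_page *p)
{
    if (!p->dirty) {
        p->dirty = 1;
        vm->nr_dirty++;
    }
}

void toy_cancel_dirty(struct toy_vm *vm, struct toy_page *p)
{
    if (p->dirty) {
        p->dirty = 0;
        vm->nr_dirty--;
    }
}

/*
 * Same order as the trace: cancel first, then the filesystem's
 * invalidate hook may dirty the page again, then the page goes away
 * without another cancel. Returns the counter after the truncate.
 */
long toy_truncate(struct toy_vm *vm, struct toy_page *p, int fs_redirties)
{
    toy_cancel_dirty(vm, p);
    if (fs_redirties)
        toy_set_dirty(vm, p);  /* stands in for mark_buffer_dirty() */
    /* the page is dropped here; the counter is not touched again */
    return vm->nr_dirty;
}
```

With fs_redirties set, the counter stays one too high after every
truncate, which is how the dirty-page count can drift upward.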

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SuSE CR Labs

Re: [PATCH] e1000e: Use deferrable timer for watchdog

2007-12-20 Thread Kok, Auke

Parag Warudkar wrote:
 
 Reduce wakeups from idle per second.
 
 Signed-off-by: Parag Warudkar [EMAIL PROTECTED]
 
 --- linux-2.6/drivers/net/e1000e/netdev.c2007-12-07
 10:04:39.0 -0500
 +++ linux-2.6-work/drivers/net/e1000e/netdev.c2007-12-18
 20:45:59.0 -0500
 @@ -3899,7 +3899,7 @@
  goto err_eeprom;
  }
 
 -init_timer(&adapter->watchdog_timer);
 +init_timer_deferrable(&adapter->watchdog_timer);
  adapter->watchdog_timer.function = e1000_watchdog;
  adapter->watchdog_timer.data = (unsigned long) adapter;


I can't even apply this patch and the e1000 one... not only is it whitespace
damaged it is also not properly formatted as patch at all. If you want me to
take these patches seriously, then please fix the formatting issues.

Auke

Re: OOPS: 2.6.24-rc5-mm1 -- EIP is at r_show+0x2a/0x70 -- (triggered by cat /proc/iomem AFTER suspend-to-disk/resume)

2007-12-20 Thread Andrew Morton

On Thu, 20 Dec 2007 08:38:03 -0500 Miles Lane [EMAIL PROTECTED] wrote:

 On further investigation, cat /proc/iomem does not trigger the stack 
 trace until after a suspend-to-disk/resume cycle has occurred.

I still can't reproduce this.

Could you please try this?

- cat /proc/iomem
- suspend/resume
- do

while read i
do
	echo "$i"
	sleep 1
done < /proc/iomem

then, with luck, we'll be able to work out which /proc/iomem record
immediately precedes the corrupted one.

Re: [PATCH] Move page_assign_page_cgroup to VM_BUG_ON in free_hot_cold_page

2007-12-20 Thread Hugh Dickins

On Thu, 20 Dec 2007, Andrew Morton wrote:
 On Thu, 20 Dec 2007 21:54:15 +0530 Balbir Singh [EMAIL PROTECTED] wrote:
  I was going to say the same thing, page_get_page_cgroup() does not hold
  any references. Maybe _get_ in the name is confusing.
 
 It is a bit unconventional.  page_cgroup()?

Seems ideal to me
(though there is some argument for page_page_cgroup(), weird as it is).

Hugh

Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Matthew Wilcox

On Thu, Dec 20, 2007 at 09:22:05AM -0800, Greg KH wrote:
  The one device we know about that throws exceptions is the 830M/MG
  graphics chip. This chip passes the read-compare test, so the code
  merrily advances to bus sizing. When the bus sizing code writes the
  BAR at offset 0x18 in this device, the system hangs.
 
 So it doesn't work at all, with or without this patch?  Does the vendor
 know about this?

I can't imagine there are too many machines with MMCONFIG and an
i830 chipset.  The laptop I'm currently typing on is an i830 ... and
its BIOS is from 2002, well predating the specification of MMCONFIG.
If I didn't know better, I'd accuse Tony of lying about the existence
of such a machine ;-)

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.

Re: [Fwd: Re: [PATCH 4/5]PCI: x86 MMCONFIG: introduce pcibios_fix_bus_scan()]

2007-12-20 Thread Tony Camuso


Greg KH wrote:


Ah, sorry, I was thinking you were using dev_info(), which is what you
should be using instead anyway :)



I found it in include/linux/device.h

#define dev_info(dev, format, arg...)   \
dev_printk(KERN_INFO , dev , format , ## arg)

The info I'm trying to print is before a dev or pci_dev struct
has been initialized for the device.


Re: [PATCH] Move page_assign_page_cgroup to VM_BUG_ON in free_hot_cold_page

2007-12-20 Thread Balbir Singh

Hugh Dickins wrote:
 On Thu, 20 Dec 2007, Andrew Morton wrote:
 On Thu, 20 Dec 2007 21:54:15 +0530 Balbir Singh [EMAIL PROTECTED] wrote:
 I was going to say the same thing, page_get_page_cgroup() does not hold
 any references. Maybe _get_ in the name is confusing.
 It is a bit unconventional.  page_cgroup()?
 
 Seems ideal to me
 (though there is some argument for page_page_cgroup(), weird as it is).
 
 Hugh

Sounds good to me as well, except that it sounds like a constructor :-)
I'll try and make the changes and submit a patch. I am in the middle of
losing control_type now (based on Hugh's comments)

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

iommu dma mapping alignment requirements

2007-12-20 Thread Steve Wise


Hey Roland (and any iommu/ppc/dma experts out there):

I'm debugging a data corruption issue that happens on PPC64 systems 
running rdma on kernels where the iommu page size is 4KB yet the host 
page size is 64KB.  This feature was added to the PPC64 code recently, 
and is in kernel.org from 2.6.23.  So if the kernel is built with a 4KB 
page size, no problems.  If the kernel is prior to 2.6.23 then 64KB page 
 configs work too. It's just a problem when the iommu page size != host
page size.


It appears that my problem boils down to a single host page of memory 
that is mapped for dma, and the dma address returned by dma_map_sg() is 
_not_ 64KB aligned.  Here is an example:


app registers va 0x2d9a3000 len 12288
ib_umem_get() creates and maps a umem and chunk that looks like (dumping 
state from a registered user memory region):



umem len 12288 off 12288 pgsz 65536 shift 16
chunk 0: nmap 1 nents 1
sglist[0] page 0xc0930b08 off 0 len 65536 dma_addr 
5bff4000 dma_len 65536



So the kernel maps 1 full page for this MR.  But note that the dma 
address is 5bff4000 which is 4KB aligned, not 64KB aligned.  I 
think this is causing grief to the RDMA HW.
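
The alignment claim can be checked with a couple of one-line helpers.
This is a userspace sketch; is_aligned() and natural_alignment() are
hypothetical names, and the address is taken as printed above.

```c
#include <stdint.h>

/* Nonzero if addr is a multiple of align (align must be a power of two). */
int is_aligned(uint64_t addr, uint64_t align)
{
    return (addr & (align - 1)) == 0;
}

/* Largest power-of-two boundary that addr sits on (0 for addr == 0). */
uint64_t natural_alignment(uint64_t addr)
{
    return addr & (~addr + 1);  /* isolates the lowest set bit */
}
```

For 0x5bff4000, is_aligned() holds for 4096 but not for 65536, and
natural_alignment() returns 0x4000, i.e. the address happens to sit on
a 16KB boundary.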


My first question is: Is there an assumption or requirement in linux 
that dma_addresses should have the same alignment as the host address
they are mapped to?  IE the rdma core is mapping the entire 64KB page, 
but the mapping doesn't begin on a 64KB page boundary.


If this mapping is considered valid, then perhaps the rdma hw is at 
fault here.  But I'm wondering if this is a PPC/iommu bug.


BTW:  Here is what the Memory Region looks like to the HW:


TPT entry:  stag idx 0x2e800 key 0xff state VAL type NSMR pdid 0x2
perms RW rem_inv_dis 0 addr_type VATO
bind_enable 1 pg_size 65536 qpid 0x0 pbl_addr 0x003c67c0
len 12288 va 2d9a3000 bind_cnt 0
PBL: 5bff4000




Any thoughts?

Steve.



Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

2007-12-20 Thread Takashi Iwai

At Thu, 13 Dec 2007 11:49:51 +0100,
I wrote:
 
 [Sorry for the late response as I've been on vacation]
 
 At Sat, 8 Dec 2007 21:15:44 -0500,
 Theodore Tso wrote:
  
  On Sat, Dec 08, 2007 at 11:30:53PM +0100, Rafael J. Wysocki wrote:
   On Saturday, 8 of December 2007, Theodore Tso wrote:
However, as far as I am concerned, Ingo's patch, first posted to LKML
here: http://lkml.org/lkml/2007/11/16/66 should be listed as fixing
the above regression.  Rafael, could you please make a note of this in
your regression list,
   
   Done, thanks.
  
  Great, thanks.  I should add that technically this wasn't a regression
  since I had been seeing this since before 2.6.23.  Also, it isn't a
  big deal, since aside from noise in the syslog, falling back to
  polling more doesn't make any functional or user-visible difference
  (although I guess it's less efficient).  
  
  Regardless of whether it is a regression, it would be nice to get the
  patch applied and this issue fixed for 2.6.25!
 
 You mean 2.6.24 ? ;-)
 
 Yes, if it solves the problem, not only improves the latency, it's
 definitely nice to have now.  I was just too conservative to mark it
 for 2.6.24 merge although it looks safe.
 
 Jaroslav, could you prepare this for the push?  It corresponds to
 alsa-kernel HG changeset 5557.

Jaroslav, what about this now?


Takashi

Re: 2.6.24-rc5-mm1: problems with cat /proc/kpageflags

2007-12-20 Thread Matt Mackall

On Thu, Dec 20, 2007 at 04:53:59AM -0800, David Miller wrote:
 From: Matt Mackall [EMAIL PROTECTED]
 Date: Mon, 17 Dec 2007 08:55:54 -0600

  On Sun, Dec 16, 2007 at 10:39:17PM -0800, Andrew Morton wrote:
  Actually, you may only need these two:

   maps4-add-proc-kpagecount-interface.patch
   maps4-add-proc-kpageflags-interface.patch

 Yes these two were enough, and exporting fs/proc/base.c's
 mem_lseek().

 As hard as I try, I can't reproduce this at all.  I tried
 both on my workstation and my niagara boxes.

That's good to know, I was having a very hard time imagining how the
kpagecount code could be going south.

 It must be another needle in the 30MB+ -mm haystack. :-(

Have we seen a config for the broken machine? Perhaps that'll help us
make a guess..

-- 
Mathematics is the supreme nostalgia of our time.

[PATCH] Handle i_size > s_maxbytes correctly

2007-12-20 Thread Jan Kara

  Hi Andrew,

  nobody seemed to care about this patch, so could you pull it into -mm so
that it gets broader testing? Thanks.

Honza

-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---

Although we don't allow writes beyond s_maxbytes, a file's size can still end
up larger than s_maxbytes. For example, the file may have been written from a
computer with a different architecture (which has a larger s_maxbytes), we may
boot a kernel with a different set of config options (CONFIG_LBD...), or two
nodes in a cluster [Ocfs2, and likely Gfs2] may have mounted the same file
system with different s_maxbytes.  Thus we have to make sure we don't
crash or corrupt data when we see such a file (the page offset of its last page
needn't fit into pgoff_t). First, we make read() and mmap() return an error
when the user tries to access the file above s_maxbytes. Second, we introduce
a function i_size_read_trunc(), which returns min(i_size, s_maxbytes), and
use it when determining the maximal page offset we are interested in.
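
The clamping described above can be sketched in plain C outside the kernel. This is only a userspace model, not the patch itself: struct model_sb, struct model_inode and the model_* helpers are hypothetical stand-ins, and only the min(i_size, s_maxbytes) rule is taken from the description.

```c
#include <stdint.h>

/* Hypothetical userspace stand-ins for the kernel's superblock and inode. */
struct model_sb {
	int64_t s_maxbytes;	/* largest file size this mount can address */
};

struct model_inode {
	int64_t i_size;		/* on-disk size, possibly written elsewhere */
	const struct model_sb *sb;
};

/* Models i_size_read_trunc(): returns min(i_size, s_maxbytes). */
static int64_t model_i_size_read_trunc(const struct model_inode *inode)
{
	int64_t size = inode->i_size;
	int64_t max = inode->sb->s_maxbytes;

	return size < max ? size : max;
}

/*
 * Index of the last page that still holds file data, computed from the
 * clamped size. page_shift plays the role of PAGE_CACHE_SHIFT.
 */
static uint64_t model_last_page_index(const struct model_inode *inode,
				      unsigned int page_shift)
{
	int64_t size = model_i_size_read_trunc(inode);

	if (size <= 0)
		return 0;
	return (uint64_t)(size - 1) >> page_shift;
}
```

With the clamp in place, an oversized file yields a page index derived from s_maxbytes rather than from the raw on-disk size, which is the property the callers in the diff below rely on.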

Signed-off-by: Jan Kara [EMAIL PROTECTED]
CC: Mark Fasheh [EMAIL PROTECTED]

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..3861118 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1623,7 +1623,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 
 	BUG_ON(!PageLocked(page));
 
-	last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;
+	last_block = (i_size_read_trunc(inode) - 1) >> inode->i_blkbits;
 
 	if (!page_has_buffers(page)) {
 		create_empty_buffers(page, blocksize,
@@ -2084,7 +2084,7 @@ int block_read_full_page(struct page *page, get_block_t *get_block)
 	head = page_buffers(page);
 
 	iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
-	lblock = (i_size_read(inode)+blocksize-1) >> inode->i_blkbits;
+	lblock = (i_size_read_trunc(inode)+blocksize-1) >> inode->i_blkbits;
 	bh = head;
 	nr = 0;
 	i = 0;
@@ -2347,7 +2347,7 @@ block_page_mkwrite(struct vm_area_struct *vma, struct page *page,
 	int ret = -EINVAL;
 
 	lock_page(page);
-	size = i_size_read(inode);
+	size = i_size_read_trunc(inode);
 	if ((page->mapping != inode->i_mapping) ||
 	    (page_offset(page) > size)) {
 		/* page got truncated out from underneath us */
@@ -2603,7 +2603,7 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
 			struct writeback_control *wbc)
 {
 	struct inode * const inode = page->mapping->host;
-	loff_t i_size = i_size_read(inode);
+	loff_t i_size = i_size_read_trunc(inode);
 	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
 	unsigned offset;
 	int ret;
@@ -2803,7 +2803,7 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
 			struct writeback_control *wbc)
 {
 	struct inode * const inode = page->mapping->host;
-	loff_t i_size = i_size_read(inode);
+	loff_t i_size = i_size_read_trunc(inode);
 	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
 	unsigned offset;
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index acf0da1..8223868 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -525,8 +525,8 @@ static int get_more_blocks(struct dio *dio)
 
 		create = dio->rw & WRITE;
 		if (dio->lock_type == DIO_LOCKING) {
-			if (dio->block_in_file < (i_size_read(dio->inode) >>
-							dio->blkbits))
+			if (dio->block_in_file < (i_size_read_trunc(dio->inode)
+							>> dio->blkbits))
 				create = 0;
 		} else if (dio->lock_type == DIO_NO_LOCKING) {
 			create = 0;
@@ -870,7 +870,8 @@ do_holes:
 				 * Be sure to account for a partial block as the
 				 * last block in the file
 				 */
-				i_size_aligned = ALIGN(i_size_read(dio->inode),
+				i_size_aligned =
+					ALIGN(i_size_read_trunc(dio->inode),
 							1 << blkbits);
 				if (dio->block_in_file >=
 					i_size_aligned >> blkbits) {
@@ -961,7 +962,7 @@ direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
 	dio->next_block_for_io = -1;
 
 	dio->iocb = iocb;
-	dio->i_size = i_size_read(inode);
+	dio->i_size = i_size_read_trunc(inode);
 
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;
diff --git a/fs/mpage.c b/fs/mpage.c
index d54f8f8..c666089 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -190,7 +190,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, 
unsigned nr_pages,

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Stephen Hemminger

On Thu, 20 Dec 2007 17:29:23 +

 -Original Message-
 From: Stephen Hemminger [EMAIL PROTECTED]

 Date: Thu, 20 Dec 2007 09:16:03 
 To:[EMAIL PROTECTED]
 Cc:[EMAIL PROTECTED], [EMAIL PROTECTED],   linux-kernel@vger.kernel.org
 Subject: Re: [PATCH] sky2: Use deferrable timer for watchdog

 On Tue, 18 Dec 2007 20:13:28 -0500 (EST)
 Parag Warudkar [EMAIL PROTECTED] wrote:

  sky2 can use deferrable timer for watchdog - reduces wakeups from idle per 
  second.

  Signed-off-by: Parag Warudkar [EMAIL PROTECTED]

  --- linux-2.6/drivers/net/sky2.c2007-12-07 10:04:39.0 -0500
  +++ linux-2.6-work/drivers/net/sky2.c   2007-12-18 20:07:58.0 
  -0500
  @@ -4230,7 +4230,10 @@
  sky2_show_addr(dev1);
  }

  -   setup_timer(&hw->watchdog_timer, sky2_watchdog, (unsigned long) hw);
  +   hw->watchdog_timer.function = sky2_watchdog;
  +   hw->watchdog_timer.data = (unsigned long) hw;
  +   init_timer_deferrable(&hw->watchdog_timer);
  +
  INIT_WORK(&hw->restart_work, sky2_restart);

  pci_set_drvdata(pdev, hw);

 Does it really reduce the wakeups or only change who gets charged by 
 powertop?
 The system is going to wakeup once a second anyway. Looks to me that if the
 timer is using round_jiffies(), that setting deferrable just changes the 
 accounting.

 My interpretation of the api is:
  * round_jiffies() - timer wants to wakeup but isn't precise about when, so
    schedule on next second when system will wake up anyway;
    e.g. why meetings are usually scheduled on the hour

  * deferrable - timer doesn't have to really wakeup but wants to happen near
    a particular time; e.g. I'll meet you at the pub around 8pm

 Therefore doing deferrable is unnecessary for timers using round_jiffies
 unless system is so good at doing timers that it is going to skip doing
 timer once per second.

[EMAIL PROTECTED] wrote:

 NO_HZ kernels don't do timers every second - if you do round_jiffies() the 
 kernel will wakeup and run the timer at that time no matter what. 

 The reason deferrable was introduced is to avoid waking up the kernel just 
 for this one timer that can be called when the CPU is not idle for some 
 reason other than this timer.

 In other words let's say there were two timers - one non-deferrable expiring 
 in 3 seconds and other deferrable, expiring in 1.5 seconds. The kernel will 
 not wake up twice - once for 1.5 second and other for 3 second - it will wake 
 up once at expiry of 3 second timer and execute both the 1.5 second and 3 
 second timers.
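
The two-timer example above can be modelled in a few lines of C. This is a toy userspace sketch of the behaviour as described in the message, not the kernel's timer wheel: struct model_timer and both helpers are hypothetical names, and times are plain milliseconds rather than jiffies.

```c
#include <stddef.h>

#define MODEL_NO_WAKEUP ((unsigned long)-1)

struct model_timer {
	unsigned long expires_ms;	/* requested expiry, in ms from now */
	int deferrable;			/* nonzero: must not wake an idle CPU */
};

/* When must an idle CPU wake up? Only non-deferrable timers count. */
static unsigned long model_next_wakeup(const struct model_timer *t, size_t n)
{
	unsigned long next = MODEL_NO_WAKEUP;
	size_t i;

	for (i = 0; i < n; i++)
		if (!t[i].deferrable && t[i].expires_ms < next)
			next = t[i].expires_ms;
	return next;
}

/* How many timers of either kind have expired, and so run, at now_ms? */
static size_t model_timers_run_at(const struct model_timer *t, size_t n,
				  unsigned long now_ms)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (t[i].expires_ms <= now_ms)
			count++;
	return count;
}
```

For a non-deferrable timer at 3000 ms and a deferrable one at 1500 ms, the model wakes once at 3000 ms and runs both. If the only pending timer is deferrable, the model reports no wakeup at all, which is the corner case raised in the reply that follows.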

 And this is not just powertop accounting thing - like I said the total num of 
 wakeups per second go down with this patch.

 Parag

 Sent via BlackBerry from T-Mobile

Quit top-posting!

If this is the case then the whole usage of round_jiffies() is bogus. All users
of round_jiffies() should just be converted to deferrable??  I am a bit
concerned that if deferrable gets used everywhere then a strange situation
would occur where all timers were waiting for some other timer to finally
happen, kind of a weird timelock situation. Like the old chip/dale cartoon:
 you first, no you first, after you mister chip, no after you mister dale,...

-- 
Stephen Hemminger [EMAIL PROTECTED]

Re: [Fwd: Re: [PATCH 4/5]PCI: x86 MMCONFIG: introduce pcibios_fix_bus_scan()]

2007-12-20 Thread Greg KH

On Thu, Dec 20, 2007 at 12:39:48PM -0500, Tony Camuso wrote:
 Greg KH wrote:
 Ah, sorry, I was thinking you were using dev_info(), which is what you
 should be using instead anyway :)

 I found it in include/linux/device.h

 #define dev_info(dev, format, arg...)   \
 dev_printk(KERN_INFO , dev , format , ## arg)

 The info I'm trying to print is before a dev or pci_dev struct
 has been initialized for the device.

Ok, fair enough.

But you never answered my questions about _who_ would be responding to
those log messages about reporting things...

thanks,

greg k-h

Re: [PATCH] e1000e: Use deferrable timer for watchdog

2007-12-20 Thread Parag Warudkar

On Dec 20, 2007 12:05 PM, Kok, Auke [EMAIL PROTECTED] wrote:

 I can't even apply this patch and the e1000 one... not only is it whitespace
 damaged it is also not properly formatted as patch at all. If you want me to 
 take
 these patches seriously, then please fix the formatting issues.

Sigh - I use Pine, follow Documentation/email-clients.txt for the
recommended settings and obviously the patches are not generated with
whitespace damage at my end as I test those before sending out.

So although I hate to see this happen there is nothing at this moment
that I can do - except for attaching the patch instead of inlining it.
Since they have already been reviewed inline, please see if the
attached patches work for you.

[EMAIL PROTECTED] linux-2.6]$ scripts/checkpatch.pl --no-tree
../../Patches/e1000_main.c.patch
total: 0 errors, 0 warnings, 8 lines checked

Your patch has no obvious style problems and is ready for submission.
[EMAIL PROTECTED] linux-2.6]$
[EMAIL PROTECTED] linux-2.6]$ vim drivers/net/e1000/e1000_main.c
[EMAIL PROTECTED] linux-2.6]$ patch -p1 < ../../Patches/e1000_main.c.patch
patching file drivers/net/e1000/e1000_main.c

[EMAIL PROTECTED] linux-2.6]$ scripts/checkpatch.pl --no-tree
../../Patches/e1000e-netdev.c.patch
total: 0 errors, 0 warnings, 8 lines checked

Your patch has no obvious style problems and is ready for submission.
[EMAIL PROTECTED] linux-2.6]$ patch -p1 < ../../Patches/e1000e-netdev.c.patch
patching file drivers/net/e1000e/netdev.c

Thanks

Parag


e1000_main.c.patch
Description: Binary data


e1000e-netdev.c.patch
Description: Binary data

Re: [PATCH] Handle i_size > s_maxbytes correctly

2007-12-20 Thread Mark Fasheh

On Thu, Dec 20, 2007 at 06:51:04PM +0100, Jan Kara wrote:
   Hi Andrew,
 
   nobody seemed to care about this patch, so could you pull it into -mm so
 that it gets broader testing? Thanks.

Oh, this can get my ack btw:

Acked-by: Mark Fasheh [EMAIL PROTECTED]
--Mark

--
Mark Fasheh
Principal Software Developer, Oracle
[EMAIL PROTECTED]

Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Tony Camuso


Matthew Wilcox wrote:

On Thu, Dec 20, 2007 at 09:22:05AM -0800, Greg KH wrote:

I can't imagine there are too many machines with MMCONFIG and an i830
chipset.  The laptop I'm currently typing on is an i830 ... and its
BIOS is from 2002, well predating the specification of MMCONFIG. If I
didn't know better, I'd accuse Tony of lying about the existence of
such a machine ;-)


Sorry, that was the 82Q963/Q965 graphics controller, PCI ID 2992.

I can't remember why I thought PCI ID 2992 maps to 830M/MG. I think
it was in some intel doc about the ICH8 or ICH9.

Nevertheless, I have an hp dc5700 microtower right here on the floor
next to me, and it hangs when the bus sizing code does an mmconfig
write to the BAR at offset 0x18 in this device (id 2992).

It hangs in exactly the same place every time.

I am surmising that the write to that BAR is causing a MCE.

Here is the lspci output of the dc5700. The misbehaving device is
00:02.0

[EMAIL PROTECTED] ~]# lspci
00:00.0 Host bridge: Intel Corporation 82Q963/Q965 Memory Controller Hub (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82Q963/Q965 Integrated Graphics Controller (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI #5 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI #2 (rev 02)
00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio Controller (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 (rev 02)
00:1c.1 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 2 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI #2 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev f2)
00:1f.0 ISA bridge: Intel Corporation 82801HB/HR (ICH8/R) LPC Interface Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801H (ICH8 Family) 4 port SATA IDE Controller (rev 02)
00:1f.5 IDE interface: Intel Corporation 82801H (ICH8 Family) 2 port SATA IDE Controller (rev 02)
3f:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5755 Gigabit Ethernet PCI Express (rev 02)

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Parag Warudkar

On Dec 20, 2007 12:51 PM, Stephen Hemminger
[EMAIL PROTECTED] wrote:

 Quit top-posting!

 If this is the case then the whole usage of round_jiffies() is bogus. All 
 users of round_jiffies()
 should just be converted to deferrable??  I am a bit concerned that if 
 deferrable gets used everywhere
 then a strange situation would occur where all timers were waiting for some 
 other timer to finally
 happen, kind of a weird timelock situation. Like the old chip/dale cartoon:
  you first, no you first, after you mister chip, no after you mister 
 dale,...


Haha - I thought about this too. I think there should be a mechanism
where the machine does not idle infinitely even if there are no
non-deferrable timers. Something like an affordable QoS for
deferrable timers - the kernel wakes up after that interval and runs
all deferrable timers even if nothing non-deferrable is set to run.
So we still get the advantage of not having to wake individually for each
timer, and the deferrable timers all do get run in a reasonable
amount of time.
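
A rough sketch of that idea in C, purely as an illustration: nothing here is an existing kernel interface. struct qos_timer, qos_next_wakeup() and the max_defer_ms bound are all hypothetical, and the model only answers the question "when is the latest the CPU may stay idle?".

```c
#include <stddef.h>

#define QOS_NO_WAKEUP ((unsigned long)-1)

struct qos_timer {
	unsigned long expires_ms;	/* requested expiry, in ms from now */
	int deferrable;			/* nonzero: may be delayed while idle */
};

/*
 * Latest time the CPU may stay idle. A non-deferrable timer forces a
 * wakeup at its expiry; a deferrable one may slip, but by no more than
 * max_defer_ms (the hypothetical QoS bound).
 */
static unsigned long qos_next_wakeup(const struct qos_timer *t, size_t n,
				     unsigned long max_defer_ms)
{
	unsigned long next = QOS_NO_WAKEUP;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long deadline = t[i].expires_ms;

		if (t[i].deferrable) {
			/* Saturate instead of wrapping around on overflow. */
			if (max_defer_ms > QOS_NO_WAKEUP - deadline)
				deadline = QOS_NO_WAKEUP;
			else
				deadline += max_defer_ms;
		}
		if (deadline < next)
			next = deadline;
	}
	return next;
}
```

With a bound of 1000 ms, a lone deferrable timer at 1500 ms forces a wakeup no later than 2500 ms, so a system with only deferrable timers cannot wait on itself forever.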

Who knows, maybe Thomas/Ingo already built in something of that nature or effect?!

Parag

[patch/backport] CFS scheduler, -v24.1, for v2.6.23.12, v2.6.22.15 and v2.6.21.7

2007-12-20 Thread Ingo Molnar


I'm pleased to announce the v24.1 CFS backport patches.

It is a full backport of the latest & greatest upstream scheduler code 
to v2.6.23.12, v2.6.22.15 and v2.6.21.7. (Note: the upstream 2.6.24-rc5 
and soon-to-be-released v2.6.24-rc6 kernels already include the v24.1 
CFS code.)

The CFS patches can be downloaded from the usual place:

 http://people.redhat.com/mingo/cfs-scheduler/

No major user-visible changes were done since v24, only a number of 
bugfixes. The delta since v24 is:

  8 files changed, 339 insertions(+), 484 deletions(-)

If you'd like to make sure that the v2.6.24 kernel's scheduler handles 
your workloads just fine, or if you just want to try the latest then you 
can try one of these backport kernels. (or try the latest 2.6.24-rc 
kernel)

As usual, any sort of feedback, bugreport, fix and suggestion is more 
than welcome!

Ingo

Re: [patch, rfc] mm.h, security.h, key.h and preventing namespace poisoning

2007-12-20 Thread Eric Paris


On Thu, 2007-12-20 at 11:07 +1100, James Morris wrote:
 On Wed, 19 Dec 2007, David Chinner wrote:
 
  Folks,
  
  I just updated a git tree and started getting errors on a
  copy_keys macro warning.
  
  The code I've been working on uses a ->copy_keys() method for
  copying the keys in a btree block from one place to another. I've
  been working on this code for a while
  (http://oss.sgi.com/archives/xfs/2007-11/msg00046.html) and keep the
  tree I'm working in relatively up to date (lags linus by a couple of
  weeks at most). The update I did this afternoon gave a conflict
  warning with the macro in include/linux/key.h.
  
  Given that I'm not directly including key.h anywhere in the XFS
  code, I'm getting the namespace polluted indirectly from some other
  include that is necessary.
  
  As it turns out, this commit from 13 days ago:
  
  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7cd94146cd504016315608e297219f9fb7b1413b
  
  included security.h in mm.h and that is how I'm seeing the namespace
  poisoning coming from key.h when !CONFIG_KEY.
  
  Including security.h in mm.h means much wider includes for pretty
  much the entire kernel, and it opens up namespace issues like this
  that never previously existed.
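
The kind of clash described here can be reproduced in a few lines of standalone C. The macro below is an assumed stand-in for a stub that a header might define when a feature is compiled out; the thread does not quote the real definition from key.h, and struct model_btree_ops and the model_* functions are hypothetical.

```c
#include <stddef.h>

/* Assumed stand-in for a stub macro leaking out of a widely included header. */
#define copy_keys(from, to) 0

struct model_btree_ops {
	/*
	 * The member name is followed by ')' rather than '(', so the
	 * function-like macro above is not expanded here.
	 */
	int (*copy_keys)(const int *src, int *dst, size_t n);
};

/* Copies n keys from src to dst and returns how many were copied. */
static int model_copy_ints(const int *src, int *dst, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		dst[i] = src[i];
	return (int)n;
}

static int model_run_copy(const struct model_btree_ops *ops,
			  const int *src, int *dst, size_t n)
{
	/*
	 * Writing ops->copy_keys(src, dst, n) would invoke the two-argument
	 * macro with three arguments and fail to compile. Parenthesizing the
	 * member access keeps '(' from directly following the macro name,
	 * which suppresses the expansion.
	 */
	return (ops->copy_keys)(src, dst, n);
}
```

The parenthesized call is only a workaround to make the demonstration compile; it shows why a short, generic macro name in a header that everything includes can break unrelated code.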
  
  The patch below (only tested for !CONFIG_KEYS && !CONFIG_SECURITY)
  moves security.h into the mmap.c and nommu.c files that need it so
  it doesn't end up with kernel wide scope.
  
  Comments?
 
 The idea with this placement was to keep memory management code with other 
 similar code, rather than pushing it into security.h, where it does not 
 functionally belong.
 
 Something to note also is that you can't depend on security.h not being 
 included all over the place, as LSM does touch a lot of the kernel.  
 Unnecessarily including it is bad, of course.
 
 I'm not sure I understand your namespace pollution issue, either.
 
 In any case, I think the right solution is not to include security.h at 
 all in mm.h, as it is only being done to get a declaration for 
 mmap_min_addr.
 
 How about this, instead ?
 
 Signed-off-by: James Morris [EMAIL PROTECTED]
Acked-by: Eric Paris [EMAIL PROTECTED]
 ---
 
  mm.h |5 -
  1 file changed, 4 insertions(+), 1 deletion(-)
 
 diff --git a/include/linux/mm.h b/include/linux/mm.h
 index 1b7b95c..02fbac7 100644
 --- a/include/linux/mm.h
 +++ b/include/linux/mm.h
 @@ -12,7 +12,6 @@
  #include <linux/prio_tree.h>
  #include <linux/debug_locks.h>
  #include <linux/mm_types.h>
 -#include <linux/security.h>
  
  struct mempolicy;
  struct anon_vma;
 @@ -34,6 +33,10 @@ extern int sysctl_legacy_va_layout;
  #define sysctl_legacy_va_layout 0
  #endif
  
 +#ifdef CONFIG_SECURITY
 +extern unsigned long mmap_min_addr;
 +#endif
 +
  #include <asm/page.h>
  #include <asm/pgtable.h>
  #include <asm/processor.h>
 
 


Re: [Fwd: Re: [PATCH 4/5]PCI: x86 MMCONFIG: introduce pcibios_fix_bus_scan()]

2007-12-20 Thread Tony Camuso


Greg KH wrote:


But you never answered my questions about _who_ would be responding to
those log messages about reporting things...

thanks,

greg k-h


I did.

I said I would remove that string, because I don't want all that email,
either. It's a little past the middle of the post appended here.

 Original Message 
Subject: Re: [PATCH 4/5]PCI: x86 MMCONFIG: introduce pcibios_fix_bus_scan()
Date: Wed, 19 Dec 2007 19:17:42 -0500
From: Tony Camuso [EMAIL PROTECTED]
Reply-To: [EMAIL PROTECTED]
To: Greg KH [EMAIL PROTECTED]
References: [EMAIL PROTECTED] 
[EMAIL PROTECTED] [EMAIL PROTECTED]


Greg,

First, let me thank you for your prompt replies!

Greg KH wrote:
 On Wed, Dec 19, 2007 at 05:18:06PM -0500, [EMAIL PROTECTED] wrote:
 commit ab28e1157e970f711c8451b66b3f940ec092db9d
 Author: Tony Camuso [EMAIL PROTECTED]
 Date:   Wed Dec 19 15:51:48 2007 -0500

 Introduces the x86 arch-specific routine that will determine whether
 a device responds correctly to MMCONFIG accesses. This routine is
 given the generic name pcibios_fix_bus_scan_quirk()

 The comment at the top of the routine explains its logic.

 Signed-off-by: Tony Camuso [EMAIL PROTECTED]

 diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
 index 8627463..9b1742d 100644
 --- a/arch/x86/pci/common.c
 +++ b/arch/x86/pci/common.c
 @@ -525,3 +525,64 @@ struct pci_bus *pci_scan_bus_with_sysdata(int busno)

return bus;
  }
 +
 +/**
 + * This routine traps devices not correctly responding to MMCONFIG access.
 + * For each device on the current bus, compare a mmconf read of the
 + * vendor/device dword with a legacy PCI config read. If they're not the same,
 + * the bus serving this device must use legacy PCI config accesses instead of
 + * mmconf, as must all buses descending from this bus.
 + */
 +
 +#define CHECK_MMCFG_STR_1 \
 +  "PCI: Device at %04x:%02x.%02x.%x is not MMCONFIG compliant.\n"
 +#define CHECK_MMCFG_STR_2 \
 +  "PCI: Bus %04x:%02x and its descendents cannot use MMCONFIG.\n"

 Why define these if they are only used in one place?

If you object, I will be happy to move them into the routine body
without the defines. I agree that It does look inconsistent to have
these strings defined and other strings embedded in the routine body.


 Also, as you use dev_info(), I think you are duplicating some of the
 information in the resulting printk(), right?

Actually, no. The strings do not contain redundant info. The pr_info
routine is just a macro for printk(KERN_INFO ...)

 +
 +void __devinit pcibios_fix_bus_scan_quirk(struct pci_bus *bus)
 +{
 +  int devfn;
 +  int fail;
 +  int found_nommconf_dev = 0;
 +  static int advised;
 +  u32 mcfg_vendev;
 +  u32 arch_vendev;
 +  struct pci_ops *save_ops = bus->ops;
 +
 +  if (bus->parent != NULL)
 +          if (bus->parent->ops == &pci_legacy_ops)
 +                  return;
 +
 +  if (!advised) {
 +          pr_info("PCI: If a device isn't working, try \"pci=nommconf\". "
 +                  "If that helps, please post a report.\n");

 Post a report where?  Who is going to handle these reports?

 The last time someone put a line like this in the kernel, I got a ton of
 email and didn't know what to do with it.  If you really are trusting of
 this patch, please put your email address in this printk(), so that you
 can properly handle the resulting reports.  I sure don't want to :)

Hmmm! Good point! I was actually copying that other message. I will
remove the string that advises posting a report. I sure don't want 'em,
and I can see that you don't, either.
:)


 +  advised = 1;
 +  }
 +  pr_debug("PCI: Checking bus %04x:%02x for MMCONFIG compliance.\n",
 +           pci_domain_nr(bus), bus->number);
 +
 +  for (devfn = 0; devfn < 256; devfn++) {
 +          bus->ops = &pci_legacy_ops;
 +          fail = (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID,
 +                                            &arch_vendev));

 What's with the extra () around the function?

The function call used to be contained in an if statement.
I changed the logic, but forgot to remove the extra parens.
It's tough getting old.
:}


 +          if ((arch_vendev == 0xffffffff) || (arch_vendev == 0) || fail)
 +                  continue;
 +
 +          bus->ops = save_ops; /* Restore to original value */
 +          pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID,
 +                                    &mcfg_vendev);
 +          if (mcfg_vendev != arch_vendev) {
 +                  found_nommconf_dev = 1;
 +                  break;
 +          }
 +  }
 +
 +  if (found_nommconf_dev) {
 +          pr_info(CHECK_MMCFG_STR_1, pci_domain_nr(bus), bus->number,
 +                  PCI_SLOT(devfn), PCI_FUNC(devfn));
 +          pr_info(CHECK_MMCFG_STR_2, pci_domain_nr(bus), bus->number);
 +          bus->ops = &pci_legacy_ops;  /* Use Legace

Re: almost daily Kernel oops with 2.6.23.9 - and now 2.6.23.11 as well

2007-12-20 Thread Hemmann, Volker Armin

On Donnerstag, 20. Dezember 2007, you wrote:
 Hemmann, Volker Armin wrote:
  [ 5194.131014] Pid: 22490, comm: sleep Tainted: P2.6.23.11reiser4
  #4

 The subject line is wrong.
 You apparently run Linux, but not Linux 2.6.23.y.

First of all, apart from this oops all other oopses I reported were with a 
not-tainted kernel. You might want to read the other mails I have sent.

Also, besides the reiser4 patch there is no other patch added to the 
kernel. And since people have successfully reported problems with 
heavily distro-patched kernels in the past, it looks a little bit hypocritical 
to put my reports aside because of one single patch - don't you think?

Glück Auf,
Volker

Re: almost daily Kernel oops with 2.6.23.9 - and now 2.6.23.11 as well

2007-12-20 Thread Hemmann, Volker Armin

On Donnerstag, 20. Dezember 2007, you wrote:
 On Thu, 2007-12-20 at 06:53 +0100, Hemmann, Volker Armin wrote:
  On Donnerstag, 20. Dezember 2007, you wrote:
   On Thu, 2007-12-20 at 03:13 +0100, Hemmann, Volker Armin wrote:
On Montag, 17. Dezember 2007, you wrote:
   
and another one, this time tainted with the nvidia module:
5194.130985] Unable to handle kernel paging request at
0300 RIP:
  
   This really sounds like bad hardware. Either memory or the mobo/riser
   card the memory is on. You might try lowering the memory timings of
   your memory in BIOS. Try removing 1/2 of your memory. If it still
   remove the other 1/2 and put the first 1/2 back and try again.
 
  if this is bad hardware why:
 
  - didn't this show up earlier?
 
  - did a several-hour memtest run a couple of weeks ago not show
  anything?
 
  - and does stuff like compiling all of kde 3.5.8 or the latest kde4 rc
  finish without any problems?

 Because bad hardware can be highly sensitive to exact load patterns.
 Don't be so skeptical of the suggestion that your hardware may be
 flakey, in the last 30+ years as a hardware guy in the design lab and in
 the field, I've seen very much hardware which passed extensive
 diagnostics, but turn out to be flakey nonetheless.

 I would suggest that you rearrange your ram modules, and see if the bit
 pattern changes.  Memtest may not show a problem with bitflips... if
 that's happening.  I would also suggest that you check your case
 temperature as someone else suggested - lmsensors may say that the CPU
 temperature is fine, but that isn't the whole picture, by a long shot.

   -Mike

case temp: 25°C measured near a warm harddisk by a digital thermometer.
mainboard temp: 31°C measured by lmsensors (mobo bios agrees)
cpu temp: 29-50°C (load dependent) measured by lmsensors, bios puts on two 
additional degrees.

I have 4 'big' fans installed to have a constant air stream in the case.

This really does not look like overheating. And I did have flaky ram in the 
past. The thing is, apart from the oops the system is completely, perfectly 
stable. That really does not smell like flaky hardware. At least not in my 
experience. Flaky PSU = sudden reboots, boot problems, crashes under load. 
Flaky mobo = see flaky psu. Flaky Ram: crashes, crashes, more crashes, 
segfaults here and there, especially when updating glibc, qt or kde. 

And I don't get this. I only get oopses.

It is just.. It could be the hardware - but I should have seen the 
same 'problem' with earlier kernels - and the 'almost daily oops' only 
started with 2.6.23.

Re: [PATCH 0/2] memcgroup: work better with tmpfs

2007-12-20 Thread Balbir Singh

Hugh Dickins wrote:
 On Wed, 19 Dec 2007, Balbir Singh wrote:
 Hugh Dickins wrote:

 We always call mem_cgroup_isolate_pages() from shrink_(in)active_pages
 under spin_lock_irq of the zone's lru lock. That's the reason that we
 don't explicitly use it in the routine.
 
 Indeed, thanks.
 
 2. There's mem_cgroup_charge and mem_cgroup_cache_charge (wouldn't the
 former be better called mem_cgroup_charge_mapped? why does the latter
 Yes, it would be. After we've refactored the code, the new name makes sense.

 test MEM_CGROUP_TYPE_ALL instead of MEM_CGROUP_TYPE_CACHED? I still don't
 understand your enums there).
 We do that to ensure that we charge page cache pages only when the
 accounting type is set to MEM_CGROUP_TYPE_ALL. If the type is anything
 else, we ignore cached pages, we did not have MEM_CGROUP_TYPE_CACHED
 initially when the patches went in.
 
 I think you've given yourself far too many degrees of freedom here.
 
 Please explain to me how the system is supposed to behave when I
 echo -n 2 > /cg/0/memory.control_type
 
 From the name in the enum (MEM_CGROUP_TYPE_CACHED, doesn't even have
 an = 2, yet this is a part of the API?), I think it's supposed to
 account pages into and out of the page cache, but not as they're
 mapped into and out of userspace.  But from the code, no references
 whatever to MEM_CGROUP_TYPE_CACHED or MEM_CGROUP_TYPE_MAPPED,
 it looks as if it'll behave just like MEM_CGROUP_TYPE_MAPPED
 (which we hope = 1).
 
 As you say, you didn't have MEM_CGROUP_TYPE_CACHED originally:
 it looks like it got added to the straightforward case of MAPPED,
 in an apparently flexible but not fully thought out way.
 

Yes, I think your argument makes sense. I had initially thought of
making the enums a mask

MEM_CGROUP_TYPE_MAPPED = 0x1,
MEM_CGROUP_TYPE_CACHED = 0x2,
MEM_CGROUP_TYPE_ALL = 0x3

and check for control_type & MEM_CGROUP_TYPE_CACHED
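
Spelled out as standalone C, the mask scheme might look like the sketch below. The enum values are the ones listed above; the two model_* helpers are hypothetical and are not taken from the memory controller code.

```c
/* Values as proposed above; ALL is simply MAPPED | CACHED. */
enum {
	MEM_CGROUP_TYPE_MAPPED = 0x1,
	MEM_CGROUP_TYPE_CACHED = 0x2,
	MEM_CGROUP_TYPE_ALL    = 0x3,
};

/* Hypothetical helper: should page cache pages be charged? */
static int model_charges_cached(unsigned int control_type)
{
	return (control_type & MEM_CGROUP_TYPE_CACHED) != 0;
}

/* Hypothetical helper: should mapped pages be charged? */
static int model_charges_mapped(unsigned int control_type)
{
	return (control_type & MEM_CGROUP_TYPE_MAPPED) != 0;
}
```

With a mask, a control_type of 2 has a well-defined meaning (cached only), and MEM_CGROUP_TYPE_ALL needs no special case in either test.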


  But there's only mem_cgroup_uncharge.
 So when, for example, an add_to_page_cache fails, the uncharge may not
 balance the charge?

 We use mem_cgroup_uncharge() everywhere. The reason being, we might
 switch control type, we uncharge pages that have a page_cgroup
 associated with them, hence once we've charged, uncharge does not
 distinguish between charge types.
 
 Ah, so this is the meaning of that 
   /*
* This can handle cases when a page is not charged at all and we
* are switching between handling the control_type.
*/
   if (!pc)
   return;
 
   if (atomic_dec_and_test(&pc->ref_cnt)) {
 at the beginning of mem_cgroup_uncharge.
 
 Sorry, that doesn't fly.  You've no locking between acquiring pc from
 the page->page_cgroup, testing pc here, and decrementing its ref_cnt.
 When that ref_cnt goes down to zero, clear_page_cgroup may NULLify
 page->page_cgroup and kfree the pc.
 

Yes, we need to lock the page group.


 So if you cannot really keep track of the ref_cnt (because of
 changing control_type), that atomic_dec_and_test is in danger
 of corrupting someone else's memory - isn't it?


Yes, my bad :(

 I can just about imagine that some admins will want control_type
 MAPPED and others CACHED and others MAPPED+CACHED.  But is there
 actually a need for one cgroup to be controlling MAPPED while
 another on the same machine is controlling MAPPED+CACHED?  And
 does that make sense - there'll be weirdnesses, won't there?
 And is there actually a need to change a cgroup's control_type
 on the fly while it's alive?
 
 Of course it's nice to be flexible and allow for such possibilities;
 but not if that just opens windows for corruption.  My own view is
 that at present you should just account mapped and cached for all,
 and strip out these extra degrees of freedom: add them back in some
 future when the locking is worked out.
 


I am going to rip out the control_type interface for now. The locking
needs fixing irrespective of control_type. I am sending out a patch to
rip out control_type.

 Hugh

Thanks for your detailed comments,

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Matthew Wilcox

On Thu, Dec 20, 2007 at 01:04:09PM -0500, Tony Camuso wrote:
 Sorry, that was the 82Q963/Q965 graphics controller, PCI ID 2992.
 
 I can't remember why I thought PCI ID 2992 maps to 830M/MG. I think
 it was in some intel doc about the ICH8 or ICH9.
 
 Nevertheless, I have an hp dc5700 microtower right here on the floor
 next to me, and it hangs when the bus sizing code does an mmconfig
 write to the BAR at offset 0x18 in this device (id 2992).

Oh, that's the same bug others (including me) have been complaining
about.

http://marc.info/?l=linux-kernel&m=118809338631160&w=2

 It hangs in exactly the same place every time.
 
 I am surmising that the write to that BAR is causing a MCE.

Bad deduction.  What's happening is that the write to the BAR is causing
it to overlap the decode for mmconfig space.  So the mmconfig write to
set the BAR back never gets through.

I have a different idea to fix this problem.  Instead of writing
0xffffffff, we could look for an unused bit of space in the E820 map and
write, say, 0xdfffffff to the low 32-bits of a BAR.  Then it wouldn't
overlap, and we could find its size using MMCONFIG.
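For readers following the sizing argument, here is a minimal sketch in plain C of the arithmetic involved. It is not the kernel's code. It assumes a 32-bit memory BAR and the conventional all-ones probe value, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Low four bits of a 32-bit memory BAR are type/prefetch flags, not address. */
#define BAR_MEM_FLAG_MASK 0xfu

/*
 * Given the value read back after writing a probe pattern to a 32-bit
 * memory BAR, return the size of the region it decodes (0 if the BAR is
 * not implemented). Hypothetical helper, not a kernel function.
 */
static uint32_t bar_size_from_readback(uint32_t readback)
{
    uint32_t addr_bits = readback & ~BAR_MEM_FLAG_MASK;

    if (addr_bits == 0)
        return 0;              /* no writable address bits */
    return ~addr_bits + 1;     /* lowest writable bit gives the size */
}

/*
 * While the probe value sits in the BAR, the device decodes the
 * size-aligned region containing it. Return that temporary base.
 */
static uint32_t bar_probe_base(uint32_t probe, uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0); /* power of two */
    return probe & ~(size - 1);
}
```

With an assumed 256 MB BAR, an all-ones probe temporarily places the device at 0xf0000000, while a probe of 0xdfffffff would place it at 0xd0000000. A probe value only yields the size if the readback is all ones in the address bits above the size, which holds for both values here.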

Does anyone know how Windows handles these machines?  Obviously, if it's
using MMCONFIG, it'd have the same problems.  Does it just use type 1
for initial sizing?  Or does it use type 1 for all accesses below 256
bytes?

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.

Re: [ofa-general] iommu dma mapping alignment requirements

2007-12-20 Thread Roland Dreier

  It appears that my problem boils down to a single host page of memory
  that is mapped for dma, and the dma address returned by dma_map_sg()
  is _not_ 64KB aligned.  Here is an example:

  My first question is: Is there an assumption or requirement in linux
  that dma_addressess should have the same alignment as the host address
  they are mapped to?  IE the rdma core is mapping the entire 64KB page,
  but the mapping doesn't begin on a 64KB page boundary.

I don't think this is explicitly documented anywhere, but it certainly
seems that we want the bus address to be page-aligned in this case.
For mthca/mlx4 at least, we tell the adapter what the host page size
is (so that it knows how to align doorbell pages etc) and I think this
sort of thing would confuse the HW.
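A small sketch of the property being discussed, in plain C. The helper names are illustrative and the addresses in the usage note are made up; no real dma_map_sg() output is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when addr sits on a page_size boundary (page_size a power of two). */
static bool is_aligned(uint64_t addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (addr & (page_size - 1)) == 0;
}

/*
 * True when the bus address keeps the same offset within a page as the
 * host address it was mapped from.
 */
static bool same_page_offset(uint64_t host_addr, uint64_t bus_addr,
                             uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return ((host_addr ^ bus_addr) & (page_size - 1)) == 0;
}
```

For a 64KB host page, a bus address such as 0x11000 would fail is_aligned(), which is the situation described in the question.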

 - R.

Re: [PATCH] x86: Voluntary leave_mm before entering ACPI C3

2007-12-20 Thread Arjan van de Ven

On Thu, 20 Dec 2007 08:16:54 -0800
H. Peter Anvin [EMAIL PROTECTED] wrote:

 Arjan van de Ven wrote:
  On Wed, 19 Dec 2007 11:48:14 -0800
  H. Peter Anvin [EMAIL PROTECTED] wrote:
  
  I think C3 guarantees that the cache contents stay intact, and thus
  it might make sense in some technology to preserve the TLB as well
  (being a kind of cache.)
  
  that sounds nice. It's fiction though ;-)
  
  The thing to realize is that linux only sees ACPI C3; the BIOS
  maps that C3 to.. well any of the C states the processor in the
  system has. What you're saying is afaik correct for the *hardware*
  C3, not for the C3 that Linux sees..
  
 
 Well, it can only map ACPI C3 to a state which is no more dead than 
 what would normally be permitted by C3.  IIRC, C3 is allowed to
 require that DMA be turned off (unlike C2), but is not allowed to
 lose the CPU state.

state isn't lost if the tlb or the caches are flushed... 
(properly, eg all pending writebacks are written back first etc)


-- 
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Tony Camuso


Greg KH wrote:


Why haven't we gotten reports about this before if this is a common
problem?

And why hasn't the vendor fixed the bios on these to work properly?


I can't really answer either of these questions. All I know is the
problem exists, and we need to deal with it.

That's why the unreachable_devices() routine exists in the first place.

But that routine is limited to only the first 16 buses on segment-0.
You would like to think that hardware designers would confine legacy
hardware, or problematic hardware, to that area, but they don't.

As I said before, expanding that routine to cover more buses would
adversely impact mmconfig pci access, since the mmconfig access
code does a lookup of that bitmap for every request. I don't think
we want that bitmap and the accompanying in-line lookup to increase
enough to encompass the pci configuration of some of these larger
systems.
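To make the cost argument concrete, here is a rough userspace sketch of a per-access bitmap lookup of the kind described. The names and layout are assumptions for illustration and are not copied from mmconfig.c.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define CHECK_BUSES   16   /* only buses 0..15 are tracked */
#define SLOTS_PER_BUS 32   /* PCI device numbers 0..31 */
#define NBITS         (CHECK_BUSES * SLOTS_PER_BUS)
#define WORD_BITS     (sizeof(unsigned long) * CHAR_BIT)

/* One bit per (bus, slot); a set bit means "use legacy config access". */
static unsigned long fallback_slots[(NBITS + WORD_BITS - 1) / WORD_BITS];

static void mark_unreachable(unsigned int bus, unsigned int slot)
{
    unsigned int bit;

    assert(bus < CHECK_BUSES && slot < SLOTS_PER_BUS);
    bit = bus * SLOTS_PER_BUS + slot;
    fallback_slots[bit / WORD_BITS] |= 1UL << (bit % WORD_BITS);
}

/* Consulted on every config access in this sketch. */
static bool must_use_legacy(unsigned int bus, unsigned int slot)
{
    unsigned int bit;

    if (bus >= CHECK_BUSES || slot >= SLOTS_PER_BUS)
        return false;      /* outside the bitmap: never flagged */
    bit = bus * SLOTS_PER_BUS + slot;
    return (fallback_slots[bit / WORD_BITS] >> (bit % WORD_BITS)) & 1UL;
}
```

In this sketch a device on bus 20 is never flagged, however broken it is, and covering 256 buses would grow the bitmap from 512 bits to 8192 bits.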


Do you have a pointer to this blacklist anywhere so that everyone can
benefit from this knowledge?



Appended below is a code snippet embedded in the RHEL version of mmconfig.c,
for both i386 and x86_64. It does not encompass all the systems that have
(or will have) problems with mmconf. Only HP platforms are listed, but I
believe there are others.

The reason the HP platforms got caught is that they put these devices beyond
bus 16. Had the devices been on the first 16 buses, they would have been
trapped, and the problem would have been avoided.

static int __devinit disable_mmconf(struct dmi_system_id *d)
{
	pci_probe &= ~PCI_PROBE_MMCONF;
	printk(KERN_INFO "%s detected: disabling PCI MMCONFIG\n", d->ident);
	return 0;
}


/*
 * Systems which cannot use PCI MMCONFIG at this time...
 */
static struct dmi_system_id __devinitdata nommconf_dmi_table[] = {
	{
		.callback = disable_mmconf,
		.ident = "HP xw9300 Workstation",
		.matches = {
			DMI_MATCH(DMI_PRODUCT_NAME, "HP xw9300 Workstation"),
		},
	},
	{
		.callback = disable_mmconf,
		.ident = "HP xw9400 Workstation",
		.matches = {
			DMI_MATCH(DMI_PRODUCT_NAME, "HP xw9400 Workstation"),
		},
	},
	{
		.callback = disable_mmconf,
		.ident = "ProLiant DL585 G2",
		.matches = {
			DMI_MATCH(DMI_PRODUCT_NAME, "ProLiant DL585 G2"),
		},
	},
	{
		.callback = disable_mmconf,
		.ident = "HP Compaq dc5700 Microtower",
		.matches = {
			DMI_MATCH(DMI_PRODUCT_NAME,
				  "HP Compaq dc5700 Microtower"),
		},
	},

	{}
};


The one device we know about that throws exceptions is the 830M/MG
graphics chip. This chip passes the read-compare test, so the code
merrily advances to bus sizing. When the bus sizing code writes the
BAR at offset 0x18 in this device, the system hangs.


So it doesn't work at all, with or without this patch?  Does the vendor
know about this?

thanks,

greg k-h


I have talked to intel about this, but they haven't gotten back to me.

All I know at this point is that a mmconf write to the BAR at offset 0x18
of this device hangs the system.

Re: Trying to convert old modules to newer kernels

2007-12-20 Thread Sam Ravnborg

 
 It never gets to the printk(). You were right about the
 compilation. Somebody changed the kernel to compile with
 parameter passing in REGISTERS! This means that EVERYTHING
 needs to be compiled the same way, 'C' calling conventions
 were not good enough!

How did you build the module? It reads like you failed to use
kbuild to build your module, which is why you did not pass the
correct options to gcc - correct?

If you did not use kbuild - why not?
Is there anything missing you need?

Sam

Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Tony Camuso


Matthew Wilcox wrote:


Bad deduction.  What's happening is that the write to the BAR is causing
it to overlap the decode for mmconfig space.  So the mmconfig write to
set the BAR back never gets through.

I have a different idea to fix this problem.  Instead of writing
0xffffffff, we could look for an unused bit of space in the E820 map and
write, say, 0xdfffffff to the low 32-bits of a BAR.  Then it wouldn't
overlap, and we could find its size using MMCONFIG.


The BAR claims to be a 64-bit BAR.


Does anyone know how Windows handles these machines?  Obviously, if it's
using MMCONFIG, it'd have the same problems.  Does it just use type 1
for initial sizing?  Or does it use type 1 for all accesses below 256
bytes?


As far as I know, Windows has a blacklist that limits systems with these
devices to legacy PCI config access.


Re: [PATCH 10/28] FS-Cache: Recruit a couple of page flags for cache management [try #2]

2007-12-20 Thread David Howells

Nick Piggin [EMAIL PROTECTED] wrote:

   I'd much prefer if you would handle this in the filesystem, and have it
   set PG_private whenever fscache needs to receive a callback, and DTRT
   depending on whether PG_fscache etc. is set or not.
 
  That's tricky and slower[*].  One of the things I want to do is to modify
  iso9660 to be able to do caching, but PG_private is 'owned' by the
  generic buffer cache code.
 
 Maybe it is harder, but it is the right way to do it.

You're wrong.  It would mean that PG_private is the logical disjunction of
PG_fscache and some condition not otherwise explicitly stored.  I tried that
with NFS and it was nasty.

As you can no doubt see, it means that you can't distinguish all the states
you used to be able to.

 So you should modify the filesystems rather than core code.

I think you missed what I said:

but PG_private is 'owned' by the generic buffer cache code.

That means more of the core code would have to change - or, at least, change
more.

David

Re: [PATCH] x86: Voluntary leave_mm before entering ACPI C3

2007-12-20 Thread H. Peter Anvin


Arjan van de Ven wrote:

On Thu, 20 Dec 2007 08:16:54 -0800
H. Peter Anvin [EMAIL PROTECTED] wrote:


Arjan van de Ven wrote:

On Wed, 19 Dec 2007 11:48:14 -0800
H. Peter Anvin [EMAIL PROTECTED] wrote:


I think C3 guarantees that the cache contents stay intact, and thus
it might make sense in some technology to preserve the TLB as well
(being a kind of cache.)

that sounds nice. It's fiction though ;-)

The thing to realize is that linux only sees ACPI C3; the BIOS
maps that C3 to.. well any of the C states the processor in the
system has. What you're saying is afaik correct for the *hardware*
C3, not for the C3 that Linux sees..

Well, it can only map ACPI C3 to a state which is no more dead than 
what would normally be permitted by C3.  IIRC, C3 is allowed to

require that DMA be turned off (unlike C2), but is not allowed to
lose the CPU state.


state isn't lost if the tlb or the caches are flushed... 
(properly, eg all pending writebacks are written back first etc)




Oh, right.  My bad.

Of course C3 doesn't guarantee cache retention, only cache coherency.

-hpa

Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Matthew Wilcox

On Thu, Dec 20, 2007 at 01:30:16PM -0500, Tony Camuso wrote:
 Matthew Wilcox wrote:
 
 Bad deduction.  What's happening is that the write to the BAR is causing
 it to overlap the decode for mmconfig space.  So the mmconfig write to
 set the BAR back never gets through.
 
 I have a different idea to fix this problem.  Instead of writing
 0xffffffff, we could look for an unused bit of space in the E820 map and
 write, say, 0xdfffffff to the low 32-bits of a BAR.  Then it wouldn't
 overlap, and we could find its size using MMCONFIG.
 
 The BAR claims to be a 64-bit BAR.

And what's in the upper 32 bits by default?  I'm guessing all zeroes.
We size 64-bit BARs by writing to each half individually.  So I bet this
one still overlaps the MMCONFIG space at 0xf0000000.
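The overlap concern for a 64-bit BAR sized one half at a time can be sketched as below. This is an illustration only: the MMCONFIG window base and size used in the usage note are assumptions, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open address range [start, start + len). */
struct range {
    uint64_t start;
    uint64_t len;
};

static bool ranges_overlap(struct range a, struct range b)
{
    assert(a.len > 0 && b.len > 0);
    return a.start < b.start + b.len && b.start < a.start + a.len;
}

/*
 * Range a 64-bit memory BAR decodes while only its low dword holds the
 * probe value and the high dword still holds `high`.
 */
static struct range bar64_range_during_low_probe(uint32_t low_probe,
                                                 uint32_t high,
                                                 uint64_t size)
{
    uint64_t addr = ((uint64_t)high << 32) | low_probe;
    struct range r;

    assert(size != 0 && (size & (size - 1)) == 0); /* power of two */
    r.start = addr & ~(size - 1);
    r.len = size;
    return r;
}
```

Assuming a 256 MB MMCONFIG window at 0xf0000000 and a 256 MB BAR whose upper dword is zero, an all-ones probe of the low dword lands the BAR exactly on that window, so ranges_overlap() reports a collision. With a non-zero upper dword the BAR would sit above 4 GB and miss it.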

 As far as I know, Windows has a blacklist that limits systems with these
 devices to legacy PCI config access.

I don't think that's true.  If the people designing these machines had
noticed Windows failing to boot on them, they'd've fixed the BIOS to put
the MMCONFIG region elsewhere, not hacked Windows until it worked.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.

Re: almost daily Kernel oops with 2.6.23.9 - and now 2.6.23.11 as well

2007-12-20 Thread Hemmann, Volker Armin

On Donnerstag, 20. Dezember 2007, David Newall wrote:
  On Montag, 17. Dezember 2007, you wrote:
 
  and another one, this time tainted with the nvidia module:
  5194.130985] Unable to handle kernel paging request at 0300
  RIP:

 Numbers like that don't suggest hardware faults.  All those zeros: It's
 far too round.  Sounds very like software.  In fact, it sounds like the
 start of significant hardware region.   And lo! there's a closed-source,
 possibly buggy nvidia module.  Try another; older or newer are equally
 good.

and this one was without the nvidia module:
http://marc.info/?l=linux-kernel&m=119790371708690&w=2

and the first one I reported, was without nvidia and not-tainted too:
http://marc.info/?l=linux-kernel&m=119776365425514&w=2

I am not a complete idiot. If I have a problem, I try to reproduce without 
nvidia first (after a clean shutdown and boot, with the module not even on 
harddisk). And I reproduced it without the module. The last oops with the 
module was just an example that it does not matter if the module is loaded or 
not and to (maybe) give some additional information.

Glück Auf,
Volker

Re: [RFC] [PATCH 1/3] Merge mkimage tool for building uImages

2007-12-20 Thread H. Peter Anvin


Josh Boyer wrote:

Several platforms require the mkimage tool to generate a uImage file
that is used with U-Boot.  This brings the mkimage tool in-kernel to
enable building those platforms without having mkimage internally
provided.

This is currently based off of the version found in U-Boot 1.3.1.



Can we rename it either ubootimage or mkuboot or something else that 
tells the user what kind of image it is?  (It is, in particular, not 
bzImage, which is probably the first thing that someone who sees image 
in an arch-generic part of the Linux kernel tree will think.)


-hpa

Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Loic Prylli


On 12/20/2007 1:16 PM, Matthew Wilcox wrote:

 Bad deduction.  What's happening is that the write to the BAR is causing
 it to overlap the decode for mmconfig space.  So the mmconfig write to
 set the BAR back never gets through.

 I have a different idea to fix this problem.  Instead of writing
 0xffffffff, we could look for an unused bit of space in the E820 map and
 write, say, 0xdfffffff to the low 32-bits of a BAR.  Then it wouldn't
 overlap, and we could find its size using MMCONFIG.

 Does anyone know how Windows handles these machines?  Obviously, if it's
 using MMCONFIG, it'd have the same problems.  Does it just use type 1
 for initial sizing?  Or does it use type 1 for all accesses below 256
 bytes?
   



Always using type 1 for accesses below 256 bytes looks like a very very
attractive solution.

I know we had a lot of older kernels over the last two years that we
patched like that (we needed MMCONFIG for our own device development
purposes, but we also needed our machines to boot and discover all
devices reliably). Recent kernels work fine out of the box on all
hardware we have, but all this sometimes tricky and apparently endless
work (in big part because of buggy BIOSes) about MMCONFIG would probably
become relatively easy by limiting the aim to have MMCONFIG work when it
is required (for cfg-space accesses >= 256).
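The policy being proposed can be sketched as a simple dispatch on the register offset. This is a hedged illustration in plain C, not a patch: the enum and function names are hypothetical, and the actual type 1 and MMCONFIG accessors are left out.

```c
#include <assert.h>

enum cfg_method { CFG_TYPE1, CFG_MMCONFIG, CFG_INVALID };

#define LEGACY_CFG_SIZE   256u   /* conventional config space */
#define EXTENDED_CFG_SIZE 4096u  /* PCI Express extended config space */

/*
 * Pick the access mechanism for a config access of `len` bytes at
 * offset `reg`: type 1 below 256, MMCONFIG only where it is required.
 */
static enum cfg_method pick_cfg_method(unsigned int reg, unsigned int len)
{
    assert(len == 1 || len == 2 || len == 4);

    if ((reg & (len - 1)) != 0)
        return CFG_INVALID;              /* misaligned access */
    if (reg >= EXTENDED_CFG_SIZE)
        return CFG_INVALID;              /* beyond extended space */
    if (reg < LEGACY_CFG_SIZE)
        return CFG_TYPE1;
    return CFG_MMCONFIG;
}
```

Under this scheme the BAR write at offset 0x18 that hangs the dc5700 would go through type 1, and only offsets from 0x100 upward would touch MMCONFIG.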


Loic


Re: [RFC] dma: passing attributes to dma_map_* routines

2007-12-20 Thread akepner

On Tue, Dec 18, 2007 at 09:59:24PM +0100, Stefan Richter wrote:

 
 So that would be option 3) of yours, though without your attrs
 parameter.  Do you expect the need for even more flags for other kinds
 of special behavior?

I was hoping to keep the option of adding additional 
flags, but for now there's no obvious need for other
flags that I'm aware of.

-- 
Arthur


Re: [PATCH 24/28] AFS: Add a function to excise a rejected write from the pagecache [try #2]

2007-12-20 Thread David Howells

Nick Piggin [EMAIL PROTECTED] wrote:

  Nick Piggin [EMAIL PROTECTED] wrote:
   This reintroduces the fault vs truncate race window, which must be fixed.
 
  Hmmm...  perhaps.
 
 What do you mean by perhaps?

I mean 'perhaps'.  I'm not sure I remember what the race was, so I can't
evaluate whether or not the same race crops up in AFS too.  So: can you
describe the race please.

 No, you could do writeback caching but disallow read of dirty data.

Someone might already have read-access via mmap at the point someone attempts
to write to an mmapped region.  That means that I'd have to revoke the read
access of the first someone before letting the write take place.

Does NFS do this?

   But otherwise I guess if you really want to discard the dirty data after
   a failed writeback attempt, what's wrong with just
   invalidate_inode_pages2?
 
  Erm...  Because it deadlocks?
 
 Why don't you call it after calling end_page_writeback?

Because then there can be a race over who gets to flush the dead write.
Actually, this may no longer be a problem with your write_begin() changes.
I'll need to have another look at those.

Besides, I don't agree that invalidate_inode_pages2() is necessarily the right
way to do things.

David

[RFC] [PATCH] Memory controller remove control_type feature

2007-12-20 Thread Balbir Singh



Based on the discussion at http://lkml.org/lkml/2007/12/20/383, it was
felt that control_type might not be a good thing to implement right away.
We can add this flexibility at a later point when required.

Signed-off-by: Balbir Singh [EMAIL PROTECTED]
---

 include/linux/memcontrol.h |6 --
 mm/memcontrol.c|   91 -
 2 files changed, 18 insertions(+), 79 deletions(-)

diff -puN mm/memcontrol.c~memory-controller-no-more-control-type mm/memcontrol.c
--- linux-2.6.24-rc5/mm/memcontrol.c~memory-controller-no-more-control-type 
2007-12-20 22:37:17.0 +0530
+++ linux-2.6.24-rc5-balbir/mm/memcontrol.c 2007-12-20 23:50:47.0 
+0530
@@ -131,7 +131,6 @@ struct mem_cgroup {
 */
struct mem_cgroup_lru_info info;
 
-   unsigned long control_type; /* control RSS or RSS+Pagecache */
int prev_priority;  /* for recording reclaim priority */
/*
 * statistics.
@@ -718,24 +717,17 @@ int mem_cgroup_cache_charge(struct page 
gfp_t gfp_mask)
 {
int ret = 0;
-   struct mem_cgroup *mem;
if (!mm)
mm = &init_mm;
 
-   rcu_read_lock();
-   mem = rcu_dereference(mm->mem_cgroup);
-   css_get(&mem->css);
-   rcu_read_unlock();
-   if (mem->control_type == MEM_CGROUP_TYPE_ALL)
-   ret = mem_cgroup_charge_common(page, mm, gfp_mask,
+   ret = mem_cgroup_charge_common(page, mm, gfp_mask,
MEM_CGROUP_CHARGE_TYPE_CACHE);
-   css_put(&mem->css);
return ret;
 }
 
 /*
  * Uncharging is always a welcome operation, we never complain, simply
- * uncharge.
+ * uncharge. This routine should be called with lock_page_cgroup held
  */
 void mem_cgroup_uncharge(struct page_cgroup *pc)
 {
@@ -745,8 +737,7 @@ void mem_cgroup_uncharge(struct page_cgr
unsigned long flags;
 
/*
-* This can handle cases when a page is not charged at all and we
-* are switching between handling the control_type.
+* Check if our page_cgroup is valid
 */
if (!pc)
return;
@@ -758,6 +749,7 @@ void mem_cgroup_uncharge(struct page_cgr
 * get page->cgroup and clear it under lock.
 * force_empty can drop page->cgroup without checking refcnt.
 */
+   unlock_page_cgroup(page);
if (clear_page_cgroup(page, pc) == pc) {
mem = pc->mem_cgroup;
css_put(&mem->css);
@@ -767,9 +759,17 @@ void mem_cgroup_uncharge(struct page_cgr
spin_unlock_irqrestore(&mz->lru_lock, flags);
kfree(pc);
}
+   lock_page_cgroup(page);
}
 }
 
+void mem_cgroup_uncharge_page(struct page *page)
+{
+   lock_page_cgroup(page);
+   mem_cgroup_uncharge(page_get_page_cgroup(page));
+   unlock_page_cgroup(page);
+}
+
 /*
  * Returns non-zero if a page (under migration) has valid page_cgroup member.
  * Refcnt of page_cgroup is incremented.
@@ -789,8 +789,12 @@ int mem_cgroup_prepare_migration(struct 
 
 void mem_cgroup_end_migration(struct page *page)
 {
-   struct page_cgroup *pc = page_get_page_cgroup(page);
+   struct page_cgroup *pc;
+
+   lock_page_cgroup(page);
+   pc = page_get_page_cgroup(page);
mem_cgroup_uncharge(pc);
+   unlock_page_cgroup(page);
 }
 /*
  * We know both *page* and *newpage* are now not-on-LRU and Pg_locked.
@@ -945,61 +949,6 @@ static ssize_t mem_cgroup_write(struct c
mem_cgroup_write_strategy);
 }
 
-static ssize_t mem_control_type_write(struct cgroup *cont,
-   struct cftype *cft, struct file *file,
-   const char __user *userbuf,
-   size_t nbytes, loff_t *pos)
-{
-   int ret;
-   char *buf, *end;
-   unsigned long tmp;
-   struct mem_cgroup *mem;
-
-   mem = mem_cgroup_from_cont(cont);
-   buf = kmalloc(nbytes + 1, GFP_KERNEL);
-   ret = -ENOMEM;
-   if (buf == NULL)
-   goto out;
-
-   buf[nbytes] = 0;
-   ret = -EFAULT;
-   if (copy_from_user(buf, userbuf, nbytes))
-   goto out_free;
-
-   ret = -EINVAL;
-   tmp = simple_strtoul(buf, &end, 10);
-   if (*end != '\0')
-   goto out_free;
-
-   if (tmp <= MEM_CGROUP_TYPE_UNSPEC || tmp >= MEM_CGROUP_TYPE_MAX)
-   goto out_free;
-
-   mem->control_type = tmp;
-   ret = nbytes;
-out_free:
-   kfree(buf);
-out:
-   return ret;
-}
-
-static ssize_t mem_control_type_read(struct cgroup *cont,
-   struct cftype *cft,
-   struct file *file, char __user *userbuf,
-   size_t nbytes, loff_t *ppos)
-{
-   unsigned long val;
-   char buf[64], *s;
-   struct mem_cgroup *mem;
-
-

Re: [PATCH] adt7470: Support per-sensor alarm files

2007-12-20 Thread Darrick J. Wong

On Thu, Dec 20, 2007 at 10:58:08AM +0100, Jean Delvare wrote:

 BTW, did you try your driver with lm-sensors 3.0.0?

Yes I did, and it looked ok to me.  Did you find something wrong?

--D

Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Tony Camuso


Loic Prylli wrote:


Always using type 1 for accesses below 256 bytes looks like a very very
attractive solution.

I know we had a lot of older kernels over the last two years that we
patched like that (we needed MMCONFIG for our own device development
purposes, but we also needed our machines to boot and discover all
devices reliably). Recent kernels work fine out of the box on all
hardware we have, but all this sometimes tricky and apparently endless
work (in big part because of buggy BIOSes) about MMCONFIG would probably
become relatively easy by limiting the aim to have MMCONFIG work when it
is required (for cfg-space accesses >= 256).


Loic


Hmmm... I think I like this solution.

It may be easier to implement than the solution I posted.

This solution would also allow us to remove the unreachable_devices()
routine and bitmap.

Does anybody see a down side to this?

Re: almost daily Kernel oops with 2.6.23.9 - and now 2.6.23.11 as well

2007-12-20 Thread Ingo Molnar


* Hemmann, Volker Armin [EMAIL PROTECTED] wrote:

 On Donnerstag, 20. Dezember 2007, you wrote:
  Hemmann, Volker Armin wrote:
   [ 5194.131014] Pid: 22490, comm: sleep Tainted: P2.6.23.11reiser4
   #4
 
  The subject line is wrong.
  You apparently run Linux, but not Linux 2.6.23.y.
 
 first of all, apart from this oops all other oopses I reported were 
 with a not-tainted kernel. You might want to read the other mails I 
 have sent.
 
 Also, besides the reiser4 patch there is no other patch added to 
 the kernel. And since people have successfully reported problems 
 with heavily distro-patched kernels in the past it looks a little bit 
 hypocritical to put my reports aside because of one single patch - 
 don't you think?

reiser4 isn't just a single random patch, it's a huge patch with lots of 
interactions with file and memory management. Would it be hard for you 
to reproduce the crash without reiser4? (or is all your stuff on 
reiser4?)

Ingo

Re: almost daily Kernel oops with 2.6.23.9 - and now 2.6.23.11 as well

2007-12-20 Thread Stefan Richter

Hemmann, Volker Armin wrote:
 On Donnerstag, 20. Dezember 2007, you wrote:
 The subject line is wrong.
 You apparently run Linux, but not Linux 2.6.23.y.
 
 first of all, apart from this oops all other oopses I reported were with a 
 not-tainted kernel. You might want to read the other mails I have sent.
 
 Also, besides the reiser4 patch there is no other patch added to the 
 kernel. And since people have successfully reported problems with 
 heavily distro-patched kernels in the past it looks a little bit hypocritical 
 to put my reports aside because of one single patch - don't you think?

I didn't say anything about putting your report aside.

For successful reports (as in 'leading to a fix'), it is, among other things,
necessary that the issue can be narrowed down enough.  Sometimes this is
a quick process; e.g. user X finds a very specific driver bug while
using a patched and old kernel, driver developer Y takes the time to
confirm this bug in a recent mainline kernel because he already had a
good idea where to look and how to recreate the respective conditions,
and fixes the bug.  Sometimes it takes much much more work to identify
the circumstances of the bug.  It is then necessary that the reporter
knows exactly what he is running, simplifies his system to eliminate as
many potential causes for problems as possible, and always clearly
states under what circumstances the bug happens.

If you already found the bug in an untainted (but patched?) kernel, then
what information does another report against a tainted kernel add?  The
tainted kernel has more unknowns than the untainted one.  Progress can
only be made if the number of unknowns is successively reduced.

Regarding other people's reports and hypocrisy and whatnot:  I myself am
monitoring a few distro bug trackers more or less frequently for bug
reports concerning the kernel subsystem I'm interested in.  With varying
success though.  In order to make use of a report against a distro kernel,
I need to have a good picture of what stuff is in that kernel.  Looking
at distro bug trackers does only work for me because my field of
interest is a driver subsystem which is somewhat decoupled from other
kernel parts; so if there is trouble concerning hardware covered by this
subsystem, it is usually not too hard to figure out whether the problem
is in this subsystem or somewhere else.  If it weren't that easy most of
the time, I might for example depend on the reporters to test specific
mainline kernels or specific development kernels.  (Though the latter
becomes necessary after all in cases when more targeted debug output is
needed from the reporter, or in order to test proposed fixes without
having to wait for the distributor to build a test package for the
reporter.)
-- 
Stefan Richter
-=-=-=== ==-- =-=--
http://arcgraph.de/sr/

Re: [RFC] dma: passing attributes to dma_map_* routines

2007-12-20 Thread akepner

On Tue, Dec 18, 2007 at 09:59:24PM +0100, Stefan Richter wrote:
 
 From its purpose it sounds like you need this only for few special
 memory regions which would typically be mapped by dma_map_single() 

We need the _sg versions too, as Roland already mentioned.

  and
 furthermore that drivers who need this behavior will be changed to
 explicitly demand it.  If so, a nonintrusive API extension could simply
 be to add an
 
 dma_addr_t dma_map_single_write_last(struct device *dev, void *ptr,
 size_t size, enum dma_data_direction direction);
 ...

This is the easiest thing to do, and therefore it'd be my 
preference. But I'm concerned that the keepers of the dma 
interface will object to this. So far they've been silent 
in this thread - maybe they need to see a patch before 
they'll get engaged.

-- 
Arthur


Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Matthew Wilcox

On Thu, Dec 20, 2007 at 02:04:31PM -0500, Tony Camuso wrote:
 Also, this solution would allow us to remove the unreachable_devices()
 routine and bitmap.

Not really ... we probe reading address 0x100 to see if the device
supports extended config space or not.  So we need to make that fail
gracefully for the amd7111 case.

 Does anybody see a down side to this?

It'll be slower than it would be if we used mmconfig directly.  Now yes,
nobody should be using pci config space in performance critical paths
... but see the tg3 driver.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Kok, Auke

Stephen Hemminger wrote:
 On Thu, 20 Dec 2007 17:29:23 +

 -Original Message-
 From: Stephen Hemminger [EMAIL PROTECTED]

 Date: Thu, 20 Dec 2007 09:16:03 
 To:[EMAIL PROTECTED]
 Cc:[EMAIL PROTECTED], [EMAIL PROTECTED],   linux-kernel@vger.kernel.org
 Subject: Re: [PATCH] sky2: Use deferrable timer for watchdog

 On Tue, 18 Dec 2007 20:13:28 -0500 (EST)
 Parag Warudkar [EMAIL PROTECTED] wrote:

 sky2 can use deferrable timer for watchdog - reduces wakeups from idle per 
 second.

 Signed-off-by: Parag Warudkar [EMAIL PROTECTED]

 --- linux-2.6/drivers/net/sky2.c        2007-12-07 10:04:39.000000000 -0500
 +++ linux-2.6-work/drivers/net/sky2.c   2007-12-18 20:07:58.000000000 -0500
 @@ -4230,7 +4230,10 @@
                 sky2_show_addr(dev1);
         }

 -       setup_timer(&hw->watchdog_timer, sky2_watchdog, (unsigned long) hw);
 +       hw->watchdog_timer.function = sky2_watchdog;
 +       hw->watchdog_timer.data = (unsigned long) hw;
 +       init_timer_deferrable(&hw->watchdog_timer);
 +
         INIT_WORK(&hw->restart_work, sky2_restart);

         pci_set_drvdata(pdev, hw);
 Does it really reduce the wakeups or only change who gets charged by
 powertop?  The system is going to wakeup once a second anyway. Looks to
 me that if the timer is using round_jiffies(), that setting deferrable
 just changes the accounting.

 My interpretation of the api is:
    * round_jiffies() - timer wants to wakeup but isn't precise about
      when, so schedule on next second when system will wake up anyway;
      e.g. why meetings are usually scheduled on the hour

    * deferrable      - timer doesn't have to really wakeup but wants to
      happen near a particular time; e.g. "I'll meet you at the pub
      around 8pm"

 Therefore doing deferrable is unnecessary for timers using
 round_jiffies() unless system is so good at doing timers that it is
 going to skip doing timer once per second.
 [EMAIL PROTECTED] wrote:

 NO_HZ kernels don't do timers every second - if you do round_jiffies()
 the kernel will wakeup and run the timer at that time no matter what.

 The reason deferrable was introduced is to avoid waking up the kernel
 just for this one timer that can be called when the CPU is not idle for
 some reason other than this timer.

 In other words let's say there were two timers - one non-deferrable
 expiring in 3 seconds and other deferrable, expiring in 1.5 seconds. The
 kernel will not wake up twice - once for 1.5 second and other for 3
 second - it will wake up once at expiry of 3 second timer and execute
 both the 1.5 second and 3 second timers.

 And this is not just powertop accounting thing - like I said the total
 num of wakeups per second go down with this patch.

 Parag

 Sent via BlackBerry from T-Mobile

 Quit top-posting!

 If this is the case then the whole usage of round_jiffies() is bogus.
 All users of round_jiffies() should just be converted to deferrable??
 I am a bit concerned that if deferrable gets used everywhere then a
 strange situation would occur where all timers were waiting for some
 other timer to finally happen, kind of a weird timelock situation. Like
 the old chip/dale cartoon: "you first, no you first, after you mister
 chip, no after you mister dale,..."

That's a dangerous situation indeed, and I'd really like to know what the
limits are for deferring deferrable timers. Arjan, do you know? Anyone?

I don't see a danger just yet on normal systems - I get something like 10
wakeups per second from just the kernel (acpi, ahci, usb) on most of my
systems, which guarantees that the watchdog runs often enough. But for
embedded systems and critical timers in other drivers this may quickly
become an issue.

Auke

Re: [PATCH] e1000e: Use deferrable timer for watchdog

2007-12-20 Thread Kok, Auke

Parag Warudkar wrote:
 On Dec 20, 2007 12:05 PM, Kok, Auke [EMAIL PROTECTED] wrote:
 I can't even apply this patch and the e1000 one... not only is it whitespace
 damaged it is also not properly formatted as patch at all. If you want me to 
 take
 these patches seriously, then please fix the formatting issues.
 
 Sigh - I use Pine, follow Documentation/email-clients.txt for the
 recommended settings and obviously the patches are not generated with
 whitespace damage at my end as I test those before sending out.
 
 So although I hate to see this happen there is nothing at this moment
 that I can do - except for attaching the patch instead of inlining it.
 Since they have already been reviewed inline, please see if the
 attached patches work for you.

Here's what the files in my Maildir spool look like in vim (my vim displays
a '»' char for tabs and a '¶' for EOL):

 76 --- linux-2.6/drivers/net/e1000e/netdev.c»  2007-12-07 10:04:39.
 77 +++ linux-2.6-work/drivers/net/e1000e/netdev.c» 2007-12-18 20:45:59.
 78 @@ -3899,7 +3899,7 @@¶
 79   » »   goto err_eeprom;¶
 80   » }¶
 81 ¶
 82 -»  init_timer(&adapter->watchdog_timer);¶
 83 +»  init_timer_deferrable(&adapter->watchdog_timer);¶
 84   » adapter->watchdog_timer.function = e1000_watchdog;¶
 85   » adapter->watchdog_timer.data = (unsigned long) adapter;¶
 86 ¶
 87 --¶

Notice that there are two spaces instead of one. Also, there's no line
heading the diff with 'diff a/foo b/foo', which is what throws off stg. And
the -p option is missing.


As for content, the patch looks OK to me. I ran the numbers, and although
there was a slight average delay in the link-up detection time, it is
negligible (less than 0.2 sec difference over a bunch of measurements), and
I confirmed your powertop numbers are correct. As for the timer interval,
the watchdog may already be delayed up to 3 seconds safely; this doesn't
change that.

I'll forward the patch. Care to make one for e100? Plenty of laptops with
those are still around! The embedded guys would love it, I think.

Thanks,

Auke


Re: [ofa-general] iommu dma mapping alignment requirements

2007-12-20 Thread Steve Wise


Roland Dreier wrote:

  It appears that my problem boils down to a single host page of memory
  that is mapped for dma, and the dma address returned by dma_map_sg()
  is _not_ 64KB aligned.  Here is an example:

  My first question is: Is there an assumption or requirement in linux
  that dma_addressess should have the same alignment as the host address
  they are mapped to?  IE the rdma core is mapping the entire 64KB page,
  but the mapping doesn't begin on a 64KB page boundary.

I don't think this is explicitly documented anywhere, but it certainly
seems that we want the bus address to be page-aligned in this case.
For mthca/mlx4 at least, we tell the adapter what the host page size
is (so that it knows how to align doorbell pages etc) and I think this
sort of thing would confuse the HW.

 - R.



In arch/powerpc/kernel/iommu.c:iommu_map_sg() I see that it calls 
iommu_range_alloc() with a alignment_order of 0:



vaddr = (unsigned long)page_address(s->page) + s->offset;
npages = iommu_num_pages(vaddr, slen);
entry = iommu_range_alloc(tbl, npages, &handle,
                          mask >> IOMMU_PAGE_SHIFT, 0);


But perhaps the alignment order needs to be based on the host page size?
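
To make the question concrete, here is a minimal userspace sketch. It is not
the powerpc IOMMU code: it assumes 4KB IOMMU pages and 64KB host pages, it
takes an alignment order of n to mean "the first allocated IOMMU entry is a
multiple of 2^n", and all names (SKETCH_*, align_order_for_host_page(),
align_entry(), entry_to_bus()) are hypothetical.

```c
#include <assert.h>

/* Assumed values for illustration: 4KB IOMMU pages, 64KB host pages. */
#define SKETCH_IOMMU_PAGE_SHIFT 12
#define SKETCH_HOST_PAGE_SHIFT  16

/* Hypothetical helper: order that aligns an IOMMU range to one host page. */
static unsigned int align_order_for_host_page(void)
{
	return SKETCH_HOST_PAGE_SHIFT - SKETCH_IOMMU_PAGE_SHIFT;
}

/* Round a candidate IOMMU entry index up to a multiple of 2^order. */
static unsigned long align_entry(unsigned long entry, unsigned int order)
{
	unsigned long mask;

	assert(order < sizeof(unsigned long) * 8);
	mask = (1UL << order) - 1;
	return (entry + mask) & ~mask;
}

/* Bus address that corresponds to an IOMMU entry index. */
static unsigned long entry_to_bus(unsigned long entry)
{
	return entry << SKETCH_IOMMU_PAGE_SHIFT;
}
```

Under these assumptions an order of 0 can hand back entry 17, whose bus
address is only 4KB aligned, while an order of 4 rounds the same candidate up
to entry 32, which sits on a 64KB boundary.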


Steve.

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Arjan van de Ven




My interpretation of the api is:
   * round_jiffies() - timer wants to wakeup but isn't precise about
     when, so schedule on next second when system will wake up anyway;
     e.g. why meetings are usually scheduled on the hour

   * deferrable      - timer doesn't have to really wakeup but wants to
     happen near a particular time; e.g. "I'll meet you at the pub
     around 8pm"


this is not correct.

deferrable means "if you're busy, wake me up at this time. But if not,
don't bother waking up for me, get to it later."

The "later" can be a LONG time later, several seconds easily, if not more.
(Timers are on a per-cpu basis, and you may end up with a several-core
system where the common timers are all on another cpu than this one.)




If this is the case then the whole usage of round_jiffies() is bogus.
All users of round_jiffies() should just be converted to deferrable??
I am a bit concerned that if deferrable gets used everywhere then a
strange situation would occur where all timers were waiting for some
other timer to finally happen, kind of a weird timelock situation. Like
the old chip/dale cartoon: "you first, no you first, after you mister
chip, no after you mister dale,..."




that's a dangerous situation indeed and I'd really like to know what the limits
are for deferring deferrable timers Arjan, do you know? Anyone?


There is NO limit to deferring a timer. Do NOT use a deferrable timer if
you can't afford the timer to not happen within.. 10 to 100 seconds! (or
more)  They are really meant for things where you CAN afford for it to not
happen when you're idle.
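
A minimal userspace model of that rule, as a sketch only: it is not the
kernel timer code, the struct and function names are hypothetical, and it
assumes expiry times are positive tick counts. An idle CPU picks its next
wakeup from the non-deferrable timers alone; once it is awake for any reason,
every expired timer runs, deferrable or not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_timer {
	unsigned long expires;	/* expiry time in ticks, assumed > 0 */
	bool deferrable;	/* may be postponed while the CPU is idle */
};

/*
 * Next tick at which an idle CPU must wake up: the earliest expiry among
 * the non-deferrable timers.  Returns 0 when only deferrable timers are
 * pending, i.e. nothing in this list forces a wakeup at all.
 */
static unsigned long next_forced_wakeup(const struct sketch_timer *t, size_t n)
{
	unsigned long next = 0;

	for (size_t i = 0; i < n; i++) {
		assert(t[i].expires > 0);
		if (t[i].deferrable)
			continue;
		if (next == 0 || t[i].expires < next)
			next = t[i].expires;
	}
	return next;
}

/* Number of timers that run when the CPU is awake at tick 'now'. */
static size_t timers_run_at(const struct sketch_timer *t, size_t n,
			    unsigned long now)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (t[i].expires <= now)
			count++;
	return count;
}
```

With a deferrable timer at tick 1500 and a non-deferrable one at tick 3000,
the model wakes once, at 3000, and runs both. With only the deferrable timer
pending it reports no forced wakeup, which is the "no limit" case.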




I don't see a danger just yet on normal systems - I get something like 10
wakeups per second from just the kernel (acpi, ahci, usb) on most my systems
which guarantees that the watchdog runs often enough, but for embedded
systems and critical timers in other drivers this may be an issue quickly


On my work desktop test box the average time between cpu wakeups is 1.4
seconds (and that's single core). It would be higher if it wasn't for some
hpet limit issues.



Re: [Bug 9182] Critical memory leak (dirty pages)

2007-12-20 Thread Linus Torvalds



On Thu, 20 Dec 2007, Jan Kara wrote:

   As I wrote in my previous email, this solution works but hides the
 fact that the page really *has* dirty data in it and *is* pinned in memory
 until the commit code gets to writing it. So in theory it could disturb
 the writeout logic by having more dirty data in memory than vm thinks it
 has. Not that I'd have a better fix now but I wanted to point out this
 problem.

Well, I worry more about the VM being sane - and by the time we actually 
hit this case, as far as VM sanity is concerned, the page no longer really 
exists. It's been removed from the page cache, and it only really exists 
as any other random kernel allocation.

The fact that low-level filesystems (in this case ext3 journaling) do 
their own insane things is not something the VM even _should_ care about. 
It's just an internal FS allocation, and the FS can do whatever the hell 
it wants with it, including doing IO etc.

The kernel doesn't consider any other random IO pages to be dirty either 
(eg if you do direct-IO writes using low-level SCSI commands, the VM 
doesn't consider that to be any special dirty stuff, it's just random page 
allocations again). This is really no different.

In other words: the Linux VM subsystem is really two different parts: the
low-level page allocator (which obviously knows that the page is still in
*use*, since it hasn't been free'd), and the higher-level file mapping and
caching stuff that knows about things like page dirtiness. And once
you've done a remove_from_page_cache(), the higher levels are no longer
involved, and dirty accounting simply doesn't get into the picture.

Linus

Re: [PATCH] x86: Voluntary leave_mm before entering ACPI C3

2007-12-20 Thread Len Brown

On Thursday 20 December 2007 11:16, H. Peter Anvin wrote:
 Arjan van de Ven wrote:
  On Wed, 19 Dec 2007 11:48:14 -0800
  H. Peter Anvin [EMAIL PROTECTED] wrote:
  
  I think C3 guarantees that the cache contents stay intact, and thus
  it might make sense in some technology to preserve the TLB as well
  (being a kind of cache.)
  
  that sounds nice. It's fiction though ;-)
  
  The thing to realize is that linux only sees ACPI C3;
  the BIOS maps that C3 to..
  well any of the C states the processor in the system has.
  What you're saying is afaik correct for the *hardware* C3, not for the C3 
  that Linux sees..   
 
 Well, it can only map ACPI C3 to a state which is no more dead than 
 what would normally be permitted by C3.  IIRC, C3 is allowed to require 
 that DMA be turned off (unlike C2), but is not allowed to lose the CPU 
 state.


Re: mapping HW to ACPI C-states.

Right, it is fair game for the BIOS to map a shallower hardware C-state
to a deeper ACPI C-state.

Re: CPU state

All C-states preserve the CPU SW programming state.
(eg. while it may be saved and restored in HW,
 it appears to SW to be always intact).

Re: C3 guarantees that the cache contents stay intact

This is both true and false, depending on how you use the word intact.

If intact == stays valid in cache, then no, this is not guaranteed.
The HW reserves the right to flush some or all of the L1 and L2
caches whenever it wants to --
this includes both HW and ACPI C2 and C3 states.

If intact == cache consistent, then yes, this guarantee is true.
The way the guarantee is implemented varies by generation.
Older systems would lock the bus in C3 to assure
the processor was woken up for DMA to snoop.
Newer hardware simply wakes the cache to snoop
without waking the cores, or if it flushes the caches
then it doesn't have to snoop at all -- which also
counts as cache consistent:-)

cheers,
-Len


Re: [ofa-general] iommu dma mapping alignment requirements

2007-12-20 Thread Steve Wise


Steve Wise wrote:

Roland Dreier wrote:

  It appears that my problem boils down to a single host page of memory
  that is mapped for dma, and the dma address returned by dma_map_sg()
  is _not_ 64KB aligned.  Here is an example:

  My first question is: Is there an assumption or requirement in linux
  that dma_addressess should have the same alignment as the host address
  they are mapped to?  IE the rdma core is mapping the entire 64KB page,
  but the mapping doesn't begin on a 64KB page boundary.

I don't think this is explicitly documented anywhere, but it certainly
seems that we want the bus address to be page-aligned in this case.
For mthca/mlx4 at least, we tell the adapter what the host page size
is (so that it knows how to align doorbell pages etc) and I think this
sort of thing would confuse the HW.

 - R.



In arch/powerpc/kernel/iommu.c:iommu_map_sg() I see that it calls 
iommu_range_alloc() with a alignment_order of 0:



vaddr = (unsigned long)page_address(s->page) + s->offset;
npages = iommu_num_pages(vaddr, slen);
entry = iommu_range_alloc(tbl, npages, &handle,
                          mask >> IOMMU_PAGE_SHIFT, 0);


But perhaps the alignment order needs to be based on the host page size?



Or based on the alignment of vaddr actually...


Re: [PATCH 1/3] ps3: vuart: fix error path locking

2007-12-20 Thread Daniel Walker

On Tue, 2007-12-18 at 19:04 -0800, Geoff Levand wrote:

 Unfortunately there wasn't enough context in the patch to see
 that there is a down() earlier in the routine, and that the patch
 does indeed remove an incorrectly placed down().  Here is the
 entire routine, marked with what the patch removes.
 

Andrew have you had a chance to review this?

Daniel


Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Tony Camuso


Matthew Wilcox wrote:


Bad deduction.  What's happening is that the write to the BAR is causing
it to overlap the decode for mmconfig space.  So the mmconfig write to
set the BAR back never gets through.

I have a different idea to fix this problem.  Instead of writing
0xffffffff, we could look for an unused bit of space in the E820 map and
write, say, 0xdfffffff to the low 32-bits of a BAR.  Then it wouldn't
overlap, and we could find its size using MMCONFIG.


Let me see if I understand this correctly.

Writing this BAR with 0xffffffff causes it to decode all further mmconfig
references based at addr 0xfxxxxxxx as its own?

If I've got that right, then why don't any of the other BARs do that? Is
it because this one's a 64-bit BAR?

AFAIK, there are no devices out there that require 32-bits of address
space, so using 0xdfffffff in the low register would certainly work.

However, the PCI spec says that we should be able to write 0xffffffff
to the BAR, buggy BIOS and hardware notwithstanding.

Does anybody see that change as being within the purview of the patch-set
I am proposing? Or is that another patch for another time?

Re: Kernel bug: bluetooth meets TTY layer

2007-12-20 Thread David Newall


Hi Arjan,

I've not been able to find this file, drivers/bluetooth/hci_tty.c, but
anyway, this seems to be what happens: hci_uart_close() flushes using
hci_uart_flush().  Subsequently, in hci_dev_do_close() (one step in
hci_unregister_dev()), hci_uart_flush() is called again.  The comment in
uart_flush_buffer(), relating to the WARN_ON(), indicates you can't
flush after the port is closed, which sounds reasonable.  I think
hci_uart_close() should set hdev->flush to NULL before returning;
hci_dev_do_close() does check for this.  The code path is rather
involved and I'm not entirely clear on all steps, but I think that's
what should be done.
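
The idea can be modelled outside the kernel as follows. This is a sketch
under stated assumptions, not the Bluetooth code: struct sketch_dev and the
sketch_* functions are hypothetical stand-ins for the hci_dev, the driver
close routine and the core close path described above, and the call order
(driver close first, core close second) is taken from that description.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_dev {
	int running;				/* stands in for HCI_RUNNING */
	int flush_calls;			/* how often flush actually ran */
	int (*flush)(struct sketch_dev *dev);	/* optional driver callback */
};

static int sketch_flush(struct sketch_dev *dev)
{
	dev->flush_calls++;
	return 0;
}

/* Driver close: flush once, then drop the callback so nobody flushes again. */
static int sketch_driver_close(struct sketch_dev *dev)
{
	assert(dev != NULL);
	if (!dev->running)
		return 0;
	dev->running = 0;

	sketch_flush(dev);
	dev->flush = NULL;
	return 0;
}

/* Core close path: only flush if the driver still provides a callback. */
static void sketch_core_close(struct sketch_dev *dev)
{
	assert(dev != NULL);
	if (dev->flush)
		dev->flush(dev);
}
```

Calling sketch_driver_close() and then sketch_core_close() on a running
device flushes exactly once, because the second path finds the callback
cleared.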


Patch for stupidly obsolete kernel attached.

David
--- hci_ldisc.c	2007-09-11 02:54:02.000000000 +0930
+++ hci_ldisc.c.new	2007-12-21 06:03:11.000000000 +1030
@@ -203,16 +203,17 @@
 static int hci_uart_close(struct hci_dev *hdev)
 {
 	BT_DBG("hdev %p", hdev);
 
 	if (!test_and_clear_bit(HCI_RUNNING, &hdev->flags))
 		return 0;
 
 	hci_uart_flush(hdev);
+	hdev->flush = NULL;
 	return 0;
 }
 
 /* Send frames from HCI layer */
 static int hci_uart_send_frame(struct sk_buff *skb)
 {
 	struct hci_dev* hdev = (struct hci_dev *) skb->dev;
 	struct tty_struct *tty;

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Kok, Auke

Arjan van de Ven wrote:
 
 My interpretation of the api is:
    * round_jiffies() - timer wants to wakeup but isn't precise about
      when, so schedule on next second when system will wake up anyway;
      e.g. why meetings are usually scheduled on the hour

    * deferrable      - timer doesn't have to really wakeup but wants to
      happen near a particular time; e.g. "I'll meet you at the pub
      around 8pm"
 
 this is not correct.
 
 deferrable means "if you're busy, wake me up at this time. But if not,
 don't bother waking up for me, get to it later."
 
 The "later" can be a LONG time later, several seconds easily, if not more.
 (Timers are on a per-cpu basis, and you may end up with a several-core
 system where the common timers are all on another cpu than this one.)
 
 
 
 If this is the case then the whole usage of round_jiffies() is bogus.
 All users of round_jiffies() should just be converted to deferrable??
 I am a bit concerned that if deferrable gets used everywhere then a
 strange situation would occur where all timers were waiting for some
 other timer to finally happen, kind of a weird timelock situation. Like
 the old chip/dale cartoon: "you first, no you first, after you mister
 chip, no after you mister dale,..."



 that's a dangerous situation indeed and I'd really like to know what
 the limits
 are for deferring deferrable timers Arjan, do you know? Anyone?
 
 there is NO limit to deferring a timer. Do NOT use a deferrable timer if
 you can't afford the timer to not happen
 within.. 10 to 100 seconds! (or more)
 They are really meant for things where you CAN afford for it to not
 happen when you're idle

OK, that's just bad, and if there's no user-definable limit to the deferral
I definitely don't like this change.

Can I safely assume that any irq will cause all deferred timers to run?

If this is the case then for e1000 this patch is still OK, since the
watchdog needs to run (1) after a link up/down interrupt or (2) to update
statistics. Those statistics won't increase if there is no traffic, of
course...

Auke

Re: 2.6.24-rc5-mm1: problems with cat /proc/kpageflags

2007-12-20 Thread Mariusz Kozlowski

Hello, 

   Actually, you may only need these two:
   
maps4-add-proc-kpagecount-interface.patch
maps4-add-proc-kpageflags-interface.patch
  
  Yes these two were enough, and exporting fs/proc/base.c's
  mem_lseek().
  
  As hard as I try, I can't reproduce this at all.  I tried
  both on my workstation and my niagara boxes.
 
 That's good to know, I was having a very hard time imagining how the
 kpagecount code could be going south.
  
  It must be other needle in the 30MB+ -mm haystack. :-(

I'm afraid you are wrong. Earlier kernels are affected as well. On reading
your mail I was thinking of applying those two patches to 2.6.24-rc5 and
doing a bisection on the rest of the -mm series. Unfortunately, clean
2.6.24-rc5 with these two patches is affected as well (new processes stuck
in D state etc.). So I tried vanilla 2.6.23 patched with these two patches
(and the mem_lseek export from fs/proc/base.c). Now at least I got a trace
produced by 'cat /proc/kpagecount', which you can find below. Also, in
spite of the oops, the box doesn't get locked (as with -mm) and is still
usable.

[  126.060976] TSTATE: 009980009603 TPC: 00428a84 TNPC: 
00428a88 Y: Not tainted
[  126.063486] TPC: cpu_idle+0x2c/0xe0
[  126.065986] g0: 0009 g1: 04804000 g2: 000f 
g3: 007204c0
[  126.068636] g4: 007244c0 g5: f8007f878000 g6: 007204c0 
g7: 00724958
[  126.071232] o0: 0001 o1: 007204c8 o2: 0001 
o3: 
[  126.073924] o4: 6000 o5: 0078f140 sp: 007239b1 
ret_pc: 00428a78
[  126.076569] RPC: cpu_idle+0x20/0xe0
[  126.079185] l0: 0072 l1: 0002 l2: 0001 
l3: 0075d400
[  126.081934] l4: 0075d400 l5: f80080015b10 l6: f80080005b08 
l7: 0001
[  126.084637] i0: 0001 i1: 00720094 i2:  
i3: 
[  126.087375] i4: 007204c0 i5: 0002 i6: 00723a71 
i7: 00665a24
[  126.090135] I7: rest_init+0x6c/0x80
[  145.121228] Unable to handle kernel NULL pointer dereference
[  145.124515] tsk->{mm,active_mm}->context = 0d41
[  145.127778] tsk->{mm,active_mm}->pgd = f800bd8d2000
[  145.127801]   \|/  \|/
[  145.127808]   @'/ .. \`@
[  145.127815]   /_| \__/ |_\
[  145.127821]  \__U_/
[  145.127831] cat(3111): Oops [#1]
[  145.127849] 
[  145.127853] =
[  145.127861] [ INFO: inconsistent lock state ]
[  145.127873] 2.6.23 #1
[  145.127880] -
[  145.127891] inconsistent {in-hardirq-W} -> {hardirq-on-W} usage.
[  145.127906] cat/3111 [HC0[0]:SC0[0]:HE1:SE1] takes:
[  145.127918]  (regdump_lock){+...}, at: [004281d0] 
__show_regs+0x18/0x320
[  145.127951] {in-hardirq-W} state was registered at:
[  145.127960]   [00669780] _spin_lock+0x28/0x40
[  145.127983]   [004281d0] __show_regs+0x18/0x320
[  145.128000]   [004284e4] show_regs+0xc/0x20
[  145.128016]   [005ac9d8] sysrq_handle_showregs+0x20/0x40
[  145.128041]   [005ac7fc] __handle_sysrq+0x84/0x160
[  145.128060]   [005ac8f8] handle_sysrq+0x20/0x40
[  145.128078]   [005a4f08] kbd_event+0x670/0xb60
[  145.128110]   [005ea0c0] input_event+0x1e8/0x560
[  145.128140]   [005efa2c] sunkbd_interrupt+0x114/0x140
[  145.128167]   [005e6270] serio_interrupt+0x38/0xa0
[  145.128186]   [005b2e58] sunsu_kbd_ms_interrupt+0xa0/0x140
[  145.128212]   [0049f6f8] handle_IRQ_event+0x20/0x80
[  145.128251]   [0049f808] __do_IRQ+0xb0/0x140
[  145.128268]   [0042f48c] handler_irq+0x94/0xc0
[  145.128306]   [00426f30] sunos_sys_table+0x560/0x728
[  145.128324]   [00428a78] cpu_idle+0x20/0xe0
[  145.128341]   [00665a24] rest_init+0x6c/0x80
[  145.128375]   [0076ec24] start_kernel+0x2ec/0x340
[  145.128405]   [0066599c] tlb_fixup_done+0xa0/0xbc
[  145.128425]   [] 0x8
[  145.128443] irq event stamp: 1209
[  145.128451] hardirqs last  enabled at (1209): [00404b74] 
__handle_softirq_continue+0x20/0x24
[  145.128480] hardirqs last disabled at (1207): [00474494] 
__do_softirq+0xbc/0x140
[  145.128506] softirqs last  enabled at (1208): [004744dc] 
__do_softirq+0x104/0x140
[  145.128526] softirqs last disabled at (1203): [004745a0] 
do_softirq+0x88/0xa0
[  145.128546] 
[  145.128551] other info that might help us debug this:
[  145.128562] no locks held by cat/3111.
[  145.128570] 
[  145.128574] stack backtrace:
[  145.128582] Call Trace:
[  145.128590]  [004907a0] print_usage_bug+0x148/0x160
[  145.128624]  [004917f4] mark_lock+0x6dc/0x780
[  145.128641]  [0049286c] __lock_acquire+0x734/0x12a0
[  145.128659]  [00493430] lock_acquire+0x58/0x80
[

Re: not needed patch

2007-12-20 Thread Yinghai Lu

On Thursday 20 December 2007 06:29:06 am Ingo Molnar wrote:
 
 * Yinghai Lu [EMAIL PROTECTED] wrote:
 
  Ingo.
  
  commit fbdcf18df73758b2e187ab94678b30cd5f6ff9f9 is not needed. another
  patch (by you !! commit 699d934d5f958d7944d195c03c334f28cc0b3669 x86:
  fixup cpu_info array conversion) already removed clearing of
  c->cpu_index in identify_cpu.
  also it is not consistent with smpboot_32.c. (it will assign id to
  cpu_index right after
 
  *c = boot_cpu_data;
  )
 
 well, it might in the worst-case be a superfluous change, but not cause 
 any problems in 2.6.24, right?

now it is ok with 2.6.24.

 
  by revert commit fbdcf18df73758b2e187ab94678b30cd5f6ff9f9, we could
  use c->cpu_index in identify_cpu.
 
 but that's 2.6.25 stuff, right? Travis?
or at least before merging smpboot_32.c and smpboot_64.c

YH

Re: [RFC] [PATCH 1/3] Merge mkimage tool for building uImages

2007-12-20 Thread Josh Boyer

On Thu, 20 Dec 2007 10:36:48 -0800
H. Peter Anvin [EMAIL PROTECTED] wrote:

 Josh Boyer wrote:
  Several platforms require the mkimage tool to generate a uImage file
  that is used with U-Boot.  This brings the mkimage tool in-kernel to
  enable building those platforms without having mkimage internally
  provided.
  
  This is currently based off of the version found in U-Boot 1.3.1.
  
 
 Can we rename it either ubootimage or mkuboot or something else that 
 tells the user what kind of image it is?  (It is, in particular, not 
 bzImage, which is probably the first thing that someone who sees image 
 in a arch-generic part of the Linux kernel tree will think.)

We can, yes.  I have no particular objection to that and I've often
found mkimage to be too generic of a name myself.  For the initial
round of patches, I just wanted to keep things as similar to what is
in U-Boot as possible.

josh

Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Ivan Kokshaysky

On Thu, Dec 20, 2007 at 12:08:33PM -0700, Matthew Wilcox wrote:
 On Thu, Dec 20, 2007 at 02:04:31PM -0500, Tony Camuso wrote:
  Does anybody see a down side to this?
 
 It'll be slower than it would be if we used mmconfig directly.  Now yes,
 nobody should be using pci config space in performance critical paths
 ... but see the tg3 driver.

Use type 1 just for the first 64 bytes and tg3 will be happy. All we need
is to avoid touching BARs with mmconfig.
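
A minimal sketch of that split, assuming the usual PCI layout (BARs inside
the first 64 bytes of the header, 256 bytes reachable through type 1
accesses, extended config space from 0x100 up to 0xfff). The enum, the
SKETCH_* constants and pick_cfg_access() are hypothetical names for
illustration, not the kernel's pci_ops code.

```c
#include <assert.h>
#include <stdbool.h>

enum cfg_access { CFG_TYPE1, CFG_MMCONFIG, CFG_UNREACHABLE };

#define SKETCH_CFG_HEADER_SIZE	0x40	/* standard header, includes the BARs */
#define SKETCH_CFG_LEGACY_SIZE	0x100	/* reachable with type 1 accesses */
#define SKETCH_CFG_EXT_SIZE	0x1000	/* extended config space limit */

/* Choose the access method for one config register offset. */
static enum cfg_access pick_cfg_access(unsigned int reg, bool mmconfig_ok)
{
	assert(reg < SKETCH_CFG_EXT_SIZE);

	/* Never touch the header, and so the BARs, through mmconfig. */
	if (reg < SKETCH_CFG_HEADER_SIZE)
		return CFG_TYPE1;

	/* Everything else takes the fast path when it is usable. */
	if (mmconfig_ok)
		return CFG_MMCONFIG;

	/* Without mmconfig only the legacy 256 bytes can be reached. */
	return reg < SKETCH_CFG_LEGACY_SIZE ? CFG_TYPE1 : CFG_UNREACHABLE;
}
```

A driver that reads device-specific registers above 0x40 in a hot path would
still get mmconfig here, while the probe of offset 0x100 on a device without
working mmconfig comes back as unreachable and can fail gracefully.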

Ivan.

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Parag Warudkar

On Dec 20, 2007 2:22 PM, Kok, Auke [EMAIL PROTECTED] wrote:
 ok, that's just bad and if there's no user-defineable limit to the deferral I
 definately don't like this change.

 Can I safely assume that any irq will cause all deferred timers to run?

I think even other causes for wakeup like process related ones will
cause the CPU to go busy and run the timers.
This, coupled with the fact that no one is yet able to reach 0 wakeups
per second makes it pretty unlikely that deferrable timers will be
deferred indefinitely.


 If this is the case then for e1000 this patch is still OK since the
 watchdog needs to run (1) after a link up/down interrupt or (2) to update
 statistics. Those statistics won't increase if there is no traffic of
 course...


I think it is reasonable for Network driver watchdogs to use a
deferrable timer - if the machine is 100% IDLE there is no one needing
the network to be up. If there is something running even on the other
CPU - that is going to cause an IPI, reschedule, TLB invalidation etc.
which will make it very likely in practice that each CPU will be
interrupted in reasonable amount of time.

Of course there are theoretical cases where we could land into a
situation where a CPU in a multiprocessor machine is IDLE infinitely
and that causes the watchdog that happens to be bound to run on the
same CPU to not run. To take care of these unlikely cases I think the
timer mechanism should have a reasonable limit on how long a CPU can
go IDLE if there are deferrable timers.

Parag

Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Matthew Wilcox

On Thu, Dec 20, 2007 at 02:37:49PM -0500, Tony Camuso wrote:
 Matthew Wilcox wrote:
 
 Bad deduction.  What's happening is that the write to the BAR is causing
 it to overlap the decode for mmconfig space.  So the mmconfig write to
 set the BAR back never gets through.
 
 I have a different idea to fix this problem.  Instead of writing
 0xffffffff, we could look for an unused bit of space in the E820 map and
 write, say, 0xdfffffff to the low 32-bits of a BAR.  Then it wouldn't
 overlap, and we could find its size using MMCONFIG.
 
 Let me see if I understand this correctly.
 
 Writing this BAR with 0xffffffff causes it to decode all further mmconfig
 references based at addr 0xfxxxxxxx as its own?
 
 If I've got that right, then why don't any of the other BARs do that? Is
 it because this one's a 64-bit BAR?

Here's how BARs work ... when you write 0xffffffff to the BAR, it
ignores all the set bits that are less than the size of the BAR.  So,
assuming this is a 256MB BAR (like my G33 is), what ends up written to
this BAR is 0xf0000000.  Now, because this is graphics, apparently it's
special and embedded in the chipset, even though it looks like it's a
PCI device.  So it actually gets priority over MMCONFIG which is also
mapped to 0xf0000000.
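
The sizing rule described above can be sketched in a few lines of C. This is only a model under stated assumptions: the probe value is taken to be all ones (0xffffffff), the BAR is a 32-bit memory BAR whose size is a power of two, and bar_latch and bar_size_from_readback are hypothetical names rather than kernel functions.

```c
#include <stdint.h>

#define BAR_MEM_FLAG_MASK 0xfu /* low four bits of a memory BAR are type flags */

/* Address bits a BAR of 'size' bytes keeps after 'value' is written to it. */
uint32_t bar_latch(uint32_t value, uint32_t size)
{
    return value & ~(size - 1); /* bits below the size are ignored */
}

/* Decode the BAR size from the value read back after the probe write. */
uint32_t bar_size_from_readback(uint32_t readback)
{
    uint32_t addr_bits = readback & ~BAR_MEM_FLAG_MASK;

    return addr_bits & (~addr_bits + 1); /* lowest set address bit */
}
```

For a 256MB BAR the all-ones write latches 0xf0000000, which decodes back to 0x10000000 (256MB), and until the original value is restored the BAR claims the same range as an MMCONFIG window based at 0xf0000000.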

For your case of a 64-bit BAR, you could write 0xffffffff to the high
32-bits first, then write to the low 32-bits, then reset the low, then
high bits, and you'd avoid the problem.  But the G33 has a 32-bit BAR
with the same problem, so it won't work for that case.

BARs that are behind bridges don't have this problem (they can't decode
memory accesses that aren't forwarded to them).  BARs on devices which
have memory IO disabled also don't have this problem, but disabling
devices has its problems (as does probing BARs for active devices anyway
...).

 AFAIK, there are no devices out there that require 32-bits of address
 space, so using 0xdfffffff in the low register would certainly work.

The question is how large 32-bit BARs can get.  As we've seen, 256MB
BARs exist, and are causing pain.  I can't imagine any PCI device
manufacturer thinks they can allocate 2GB of the low space, but we could
potentially mis-size a large BAR by not using 0xffffffff.
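
To make the mis-sizing risk concrete, here is a small model, assuming the probe values under discussion are 0xffffffff and 0xdfffffff and that the size is inferred from the lowest address bit that reads back set. probed_bar_size is a hypothetical name used only for this illustration.

```c
#include <stdint.h>

/*
 * Size that would be inferred for a 32-bit memory BAR of 'true_size'
 * bytes (a power of two) when it is probed by writing 'probe'.
 */
uint32_t probed_bar_size(uint32_t probe, uint32_t true_size)
{
    uint32_t latched = probe & ~(true_size - 1); /* device drops the low bits */

    return latched & (~latched + 1);             /* lowest set bit = inferred size */
}
```

In this model a 256MB BAR is still sized correctly with 0xdfffffff, because bit 28 is set in the probe. A 512MB BAR is not: bit 29 is the one bit cleared in the probe, so the readback is 0xc0000000 and the BAR is reported as 1GB.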

 Does anybody see that change as being within the purview of the patch-set
 I am proposing? Or is that another patch for another time?

I'm really not clear on the purpose of your patchset.  Was it all to
address this one problem?

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Arjan van de Ven


Kok, Auke wrote:


ok, that's just bad and if there's no user-definable limit to the deferral I
definitely don't like this change.

Can I safely assume that any irq will cause all deferred timers to run?


*on that cpu*. Timers are per cpu, as are interrupts. Just not per se the same 
one ...


Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Loic Prylli

On 12/20/2007 2:08 PM, Matthew Wilcox wrote:
 On Thu, Dec 20, 2007 at 02:04:31PM -0500, Tony Camuso wrote:
   
 Also, this solution also would allow us to remove the unreachable_devices()
 routine and bitmap.
 

 Not really ... we probe reading address 0x100 to see if the device
 supports extended config space or not.  So we need to make that fail
 gracefully for the amd7111 case.
   



pci_cfg_space_size() is only done for PCI-express or PCI-X mode 2
devices, so you have still eliminated the bulk of the problems, which
typically involve handling the legacy busses on modern machines. I don't
know what the amd7111 is.



   
 Does anybody see a down side to this?
 

 It'll be slower than it would be if we used mmconfig directly.  Now yes,
 nobody should be using pci config space in performance critical paths
 ... but see the tg3 driver.
   


I am not familiar with the tg3 driver, but from a five-minute look it
seems the typical cases where PCI config space is used intensively are
with some revisions in combination with the 82801
(TG3_FLG2_ICH_WORKAROUND), which I don't think supports mmconfig anyway,
as well as some very specific PCI-X combinations
(TG3_FLAG_PCIX_TARGET_HWBUG), which are also very unlikely to support
mmconfig.

Even if I am wrong about the tg3, I don't really think mmconfig vs type1
could make a noticeable performance difference on any common system
(obscure systems or hardware where it could potentially have a performance
impact could use a non-default configuration).


Loic





[PATCH 0/15] adjust pvops to accomodate its x86_64 variant

2007-12-20 Thread Glauber de Oliveira Costa

Hi folks,

With this series, the bulk of the work of pvops64 is done.
Here, I integrate most of the paravirt.c and paravirt.h files, making
them applicable to both architectures.

CONFIG_PARAVIRT is _not_ present yet. Basically, this code is missing page
table integration (patches currently being worked on by Jeremy).

Enjoy



Re: [PATCH 1/3] ps3: vuart: fix error path locking

2007-12-20 Thread Andrew Morton

On Thu, 20 Dec 2007 11:32:25 -0800 Daniel Walker [EMAIL PROTECTED] wrote:

 On Tue, 2007-12-18 at 19:04 -0800, Geoff Levand wrote:
 
  Unfortunately there wasn't enough context in the patch to see
  that there is a down() earlier in the routine, and that the patch
  does indeed remove an incorrectly placed down().  Here is the
  entire routine, marked with what the patch removes.
  
 
 Andrew have you had a chance to review this?
 

Confused.  I did review it: http://lkml.org/lkml/2007/12/18/384

[PATCH 02/15] adjust PVOP_CALL/VCALL macros for x86_64

2007-12-20 Thread Glauber de Oliveira Costa

This patch adjusts the PVOP_VCALL and PVOP_CALL macros to
work with x86_64. It has a different calling convention, and
we use auxiliary macros to account for both calling conventions
as cleanly as possible.

Comments are adjusted accordingly.

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 include/asm-x86/paravirt.h |   87 +---
 1 files changed, 65 insertions(+), 22 deletions(-)

Index: linux-2.6-x86/include/asm-x86/paravirt.h
===
--- linux-2.6-x86.orig/include/asm-x86/paravirt.h   2007-12-20 
19:07:07.0 -0800
+++ linux-2.6-x86/include/asm-x86/paravirt.h2007-12-20 19:07:17.0 
-0800
@@ -320,7 +320,7 @@
  * runtime.
  *
  * Normally, a call to a pv_op function is a simple indirect call:
- * (paravirt_ops.operations)(args...).
+ * (pv_op_struct.operations)(args...).
  *
  * Unfortunately, this is a relatively slow operation for modern CPUs,
  * because it cannot necessarily determine what the destination
@@ -330,11 +330,17 @@
  * calls are essentially free, because the call and return addresses
  * are completely predictable.)
  *
- * These macros rely on the standard gcc regparm(3) calling
+ * For i386, these macros rely on the standard gcc regparm(3) calling
  * convention, in which the first three arguments are placed in %eax,
  * %edx, %ecx (in that order), and the remaining arguments are placed
  * on the stack.  All caller-save registers (eax,edx,ecx) are expected
  * to be modified (either clobbered or used for return values).
+ * X86_64, on the other hand, already specifies a register-based calling
+ * convention, returning in %rax, with parameters going in %rdi, %rsi,
+ * %rdx, and %rcx. Note that for this reason, x86_64 does not need any
+ * special handling for dealing with 4 arguments, unlike i386.
+ * However, x86_64 also has to clobber all caller-saved registers, which
+ * unfortunately, are quite a few (r8 - r11)
  *
  * The call instruction itself is marked by placing its start address
  * and size into the .parainstructions section, so that
@@ -357,10 +363,12 @@
  * the return type.  The macro then uses sizeof() on that type to
  * determine whether its a 32 or 64 bit value, and places the return
  * in the right register(s) (just %eax for 32-bit, and %edx:%eax for
- * 64-bit).
+ * 64-bit). For x86_64 machines, it just returns at %rax regardless of
+ * the return value size.
  *
  * 64-bit arguments are passed as a pair of adjacent 32-bit arguments
- * in low,high order.
+ * i386 also passes 64-bit arguments as a pair of adjacent 32-bit arguments
+ * in low,high order
  *
  * Small structures are passed and returned in registers.  The macro
  * calling convention can't directly deal with this, so the wrapper
@@ -370,46 +378,67 @@
  * means that all uses must be wrapped in inline functions.  This also
  * makes sure the incoming and outgoing types are always correct.
  */
+#ifdef CONFIG_X86_32
+#define PVOP_VCALL_ARGS         unsigned long __eax, __edx, __ecx
+#define PVOP_CALL_ARGS          PVOP_VCALL_ARGS
+#define PVOP_VCALL_CLOBBERS     "=a" (__eax), "=d" (__edx), \
+                                "=c" (__ecx)
+#define PVOP_CALL_CLOBBERS      PVOP_VCALL_CLOBBERS
+#define EXTRA_CLOBBERS
+#define VEXTRA_CLOBBERS
+#else
+#define PVOP_VCALL_ARGS         unsigned long __edi, __esi, __edx, __ecx
+#define PVOP_CALL_ARGS          PVOP_VCALL_ARGS, __eax
+#define PVOP_VCALL_CLOBBERS     "=D" (__edi),   \
+                                "=S" (__esi), "=d" (__edx), \
+                                "=c" (__ecx)
+
+#define PVOP_CALL_CLOBBERS      PVOP_VCALL_CLOBBERS, "=a" (__eax)
+
+#define EXTRA_CLOBBERS  , "r8", "r9", "r10", "r11"
+#define VEXTRA_CLOBBERS , "rax", "r8", "r9", "r10", "r11"
+#endif
+
+
 #define __PVOP_CALL(rettype, op, pre, post, ...)   \
({  \
rettype __ret;  \
-   unsigned long __eax, __edx, __ecx;  \
+   PVOP_CALL_ARGS; \
+   /* This is 32-bit specific, but is okay in 64-bit */\
+   /* since this condition will never hold */  \
        if (sizeof(rettype) > sizeof(unsigned long)) {  \
                asm volatile(pre                        \
                             paravirt_alt(PARAVIRT_CALL)\
                             post                       \
-                            : "=a" (__eax), "=d" (__edx),  \
-                              "=c" (__ecx) \
+                            : PVOP_CALL_CLOBBERS   \
 : paravirt_type(op),   \
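
The return-size test that the patch comments call "32-bit specific" can be illustrated in isolation. The sketch below is not the kernel macro: RET_NEEDS_REG_PAIR and struct two_words are hypothetical names, and the assumption is that the check compares the return type against unsigned long.

```c
#include <stdint.h>

/* Nonzero when a value of 'type' does not fit in one native register. */
#define RET_NEEDS_REG_PAIR(type) (sizeof(type) > sizeof(unsigned long))

/* Stand-in for a return value that is always wider than one register. */
struct two_words {
    unsigned long lo;
    unsigned long hi;
};
```

On i386 Linux, where unsigned long is 4 bytes, RET_NEEDS_REG_PAIR(uint64_t) is 1 and the value comes back in %edx:%eax. On x86_64, where unsigned long is 8 bytes, it is 0, which is why the branch can stay in shared code without ever being taken there.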

[PATCH 01/15] change paravirt_32.c name

2007-12-20 Thread Glauber de Oliveira Costa

This patch changes paravirt_32.c to paravirt.c. The goal
is to have paravirt support in x86_64, so we do it in a common file

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 arch/x86/kernel/Makefile_32   |2 +-
 arch/x86/kernel/paravirt.c|  475 +
 arch/x86/kernel/paravirt_32.c |  472 
 3 files changed, 476 insertions(+), 473 deletions(-)
 create mode 100644 arch/x86/kernel/paravirt.c
 delete mode 100644 arch/x86/kernel/paravirt_32.c

Index: linux-2.6-x86/arch/x86/kernel/Makefile_32
===
--- linux-2.6-x86.orig/arch/x86/kernel/Makefile_32  2007-12-20 
19:07:08.0 -0800
+++ linux-2.6-x86/arch/x86/kernel/Makefile_32   2007-12-20 19:07:15.0 
-0800
@@ -48,7 +48,7 @@
 obj-$(CONFIG_MGEODE_LX)+= geode_32.o mfgpt_32.o
 
 obj-$(CONFIG_VMI)  += vmi_32.o vmiclock_32.o
-obj-$(CONFIG_PARAVIRT) += paravirt_32.o
+obj-$(CONFIG_PARAVIRT) += paravirt.o
 obj-y  += pcspeaker.o
 
 obj-$(CONFIG_SCx200)   += scx200_32.o
Index: linux-2.6-x86/arch/x86/kernel/paravirt.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6-x86/arch/x86/kernel/paravirt.c2007-12-20 19:07:15.0 
-0800
@@ -0,0 +1,475 @@
+/*  Paravirtualization interfaces
+Copyright (C) 2006 Rusty Russell IBM Corporation
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+
+2007 - x86_64 support added by Glauber de Oliveira Costa, Red Hat Inc
+*/
+
+#include <linux/errno.h>
+#include <linux/module.h>
+#include <linux/efi.h>
+#include <linux/bcd.h>
+#include <linux/highmem.h>
+
+#include <asm/bug.h>
+#include <asm/paravirt.h>
+#include <asm/desc.h>
+#include <asm/setup.h>
+#include <asm/arch_hooks.h>
+#include <asm/time.h>
+#include <asm/irq.h>
+#include <asm/delay.h>
+#include <asm/fixmap.h>
+#include <asm/apic.h>
+#include <asm/tlbflush.h>
+#include <asm/timer.h>
+
+/* nop stub */
+void _paravirt_nop(void)
+{
+}
+
+static void __init default_banner(void)
+{
+   printk(KERN_INFO "Booting paravirtualized kernel on %s\n",
+  pv_info.name);
+}
+
+char *memory_setup(void)
+{
+   return pv_init_ops.memory_setup();
+}
+
+/* Simple instruction patching code. */
+#define DEF_NATIVE(ops, name, code)                                    \
+   extern const char start_##ops##_##name[], end_##ops##_##name[];     \
+   asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":")
+
+DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
+DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
+DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
+DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
+DEF_NATIVE(pv_cpu_ops, iret, "iret");
+DEF_NATIVE(pv_cpu_ops, irq_enable_syscall_ret, "sti; sysexit");
+DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
+DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3");
+DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax");
+DEF_NATIVE(pv_cpu_ops, clts, "clts");
+DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc");
+
+/* Undefined instruction for dealing with missing ops pointers. */
+static const unsigned char ud2a[] = { 0x0f, 0x0b };
+
+static unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
+unsigned long addr, unsigned len)
+{
+   const unsigned char *start, *end;
+   unsigned ret;
+
+   switch(type) {
+#define SITE(ops, x)   \
+   case PARAVIRT_PATCH(ops.x): \
+   start = start_##ops##_##x;  \
+   end = end_##ops##_##x;  \
+   goto patch_site
+
+   SITE(pv_irq_ops, irq_disable);
+   SITE(pv_irq_ops, irq_enable);
+   SITE(pv_irq_ops, restore_fl);
+   SITE(pv_irq_ops, save_fl);
+   SITE(pv_cpu_ops, iret);
+   SITE(pv_cpu_ops, irq_enable_syscall_ret);
+   SITE(pv_mmu_ops, read_cr2);
+   SITE(pv_mmu_ops, read_cr3);
+   SITE(pv_mmu_ops, write_cr3);
+   SITE(pv_cpu_ops, clts);
+   SITE(pv_cpu_ops, read_tsc);
+#undef SITE
+
+   patch_site:
+   ret = paravirt_patch_insns(ibuf, len, start, end);
+   break;
+
+   default:
+

[PATCH 03/15] cleanup write_tsc

2007-12-20 Thread Glauber de Oliveira Costa

write_tsc() does not need to be enclosed in any paravirt closure,
as it uses wrmsr(). So we rip off the duplicate in msr.h
and the definition from paravirt.h

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 include/asm-x86/msr.h  |2 --
 include/asm-x86/paravirt.h |2 --
 2 files changed, 0 insertions(+), 4 deletions(-)

Index: linux-2.6-x86/include/asm-x86/msr.h
===
--- linux-2.6-x86.orig/include/asm-x86/msr.h2007-12-20 19:06:59.0 
-0800
+++ linux-2.6-x86/include/asm-x86/msr.h 2007-12-20 19:07:18.0 -0800
@@ -153,8 +153,6 @@
 #define rdtscll(val)   \
((val) = native_read_tsc())
 
-#define write_tsc(val1,val2) wrmsr(0x10, val1, val2)
-
 #define rdpmc(counter,low,high)\
do {\
u64 _l = native_read_pmc(counter);  \
Index: linux-2.6-x86/include/asm-x86/paravirt.h
===
--- linux-2.6-x86.orig/include/asm-x86/paravirt.h   2007-12-20 
19:07:17.0 -0800
+++ linux-2.6-x86/include/asm-x86/paravirt.h2007-12-20 19:07:18.0 
-0800
@@ -657,8 +657,6 @@
 }
 #define calculate_cpu_khz() (pv_time_ops.get_cpu_khz())
 
-#define write_tsc(val1,val2) wrmsr(0x10, val1, val2)
-
 static inline unsigned long long paravirt_read_pmc(int counter)
 {
return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter);

[PATCH 04/15] provide paravirtualized hook for rdtscp

2007-12-20 Thread Glauber de Oliveira Costa

This patch adds a field in pv_cpu_ops for a paravirtualized hook
for rdtscp, needed for x86_64.

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 arch/x86/kernel/paravirt.c |1 +
 include/asm-x86/paravirt.h |   22 ++
 2 files changed, 23 insertions(+), 0 deletions(-)

Index: linux-2.6-x86/arch/x86/kernel/paravirt.c
===
--- linux-2.6-x86.orig/arch/x86/kernel/paravirt.c   2007-12-20 
19:07:15.0 -0800
+++ linux-2.6-x86/arch/x86/kernel/paravirt.c2007-12-20 19:07:22.0 
-0800
@@ -374,6 +374,7 @@
.write_msr = native_write_msr_safe,
.read_tsc = native_read_tsc,
.read_pmc = native_read_pmc,
+   .read_tscp = native_read_tscp,
.load_tr_desc = native_load_tr_desc,
.set_ldt = native_set_ldt,
.load_gdt = native_load_gdt,
Index: linux-2.6-x86/include/asm-x86/paravirt.h
===
--- linux-2.6-x86.orig/include/asm-x86/paravirt.h   2007-12-20 
19:07:18.0 -0800
+++ linux-2.6-x86/include/asm-x86/paravirt.h2007-12-20 19:07:22.0 
-0800
@@ -120,6 +120,7 @@
 
u64 (*read_tsc)(void);
u64 (*read_pmc)(int counter);
+   unsigned long long (*read_tscp)(unsigned int *aux);
 
/* These two are jmp to, not actually called. */
void (*irq_enable_syscall_ret)(void);
@@ -668,6 +669,27 @@
        high = _l >> 32;                \
 } while(0)
 
+static inline unsigned long long paravirt_rdtscp(unsigned int *aux)
+{
+   return PVOP_CALL1(u64, pv_cpu_ops.read_tscp, aux);
+}
+
+#define rdtscp(low, high, aux)                         \
+do {                                                   \
+   int __aux;                                          \
+   unsigned long __val = paravirt_rdtscp(&__aux);      \
+   (low) = (u32)__val;                                 \
+   (high) = (u32)(__val >> 32);                        \
+   (aux) = __aux;                                      \
+} while (0)
+
+#define rdtscpll(val, aux)                             \
+do {                                                   \
+   unsigned long __aux;                                \
+   val = paravirt_rdtscp(&__aux);                      \
+   (aux) = __aux;                                      \
+} while (0)
+
 static inline void load_TR_desc(void)
 {
PVOP_VCALL0(pv_cpu_ops.load_tr_desc);
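
The low/high split that the rdtscp() macro in this patch performs can be shown on its own. This is a standalone sketch, not the kernel macro: split_u64 is a hypothetical helper, and it deliberately uses a fixed-width 64-bit type so that the shift by 32 is well defined regardless of how wide unsigned long is.

```c
#include <stdint.h>

/* Split a 64-bit counter value into its low and high 32-bit halves. */
void split_u64(uint64_t val, uint32_t *low, uint32_t *high)
{
    *low = (uint32_t)val;          /* truncation keeps bits 0..31 */
    *high = (uint32_t)(val >> 32); /* bits 32..63 */
}
```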

[PATCH 05/15] change assembly definition of paravirt_patch_site

2007-12-20 Thread Glauber de Oliveira Costa

To account for differences in x86_64, we change the macros that
create raw instances of the paravirt_patch_site struct.
We need to align 64-bit pointers to 64-bit boundaries, so we add an alignment
directive. Also, we need to make room for a word-sized pointer,
instead of a fixed 32-bit one.

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 include/asm-x86/paravirt.h |   16 +---
 1 files changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6-x86/include/asm-x86/paravirt.h
===
--- linux-2.6-x86.orig/include/asm-x86/paravirt.h   2007-12-20 
19:07:22.0 -0800
+++ linux-2.6-x86/include/asm-x86/paravirt.h2007-12-20 19:07:25.0 
-0800
@@ -5,6 +5,7 @@
 
 #ifdef CONFIG_PARAVIRT
 #include <asm/page.h>
+#include <asm/asm.h>
 
 /* Bitmask of what can be clobbered: usually at least eax. */
 #define CLBR_NONE 0x0
@@ -281,7 +282,8 @@
 #define _paravirt_alt(insn_string, type, clobber)      \
        "771:\n\t" insn_string "\n" "772:\n"            \
        ".pushsection .parainstructions,\"a\"\n"        \
-       "  .long 771b\n"                                \
+       _ASM_ALIGN "\n"                                 \
+       _ASM_PTR " 771b\n"                              \
        "  .byte " type "\n"                            \
        "  .byte 772b-771b\n"                           \
        "  .short " clobber "\n"                        \
@@ -1159,17 +1161,25 @@
 
 #define PARA_PATCH(struct, off)((PARAVIRT_PATCH_##struct + (off)) / 4)
 
-#define PARA_SITE(ptype, clobbers, ops)\
+#define _PVSITE(ptype, clobbers, ops, word, algn)  \
 771:;  \
ops;\
 772:;  \
        .pushsection .parainstructions,"a";             \
-.long 771b;\
+.align algn;   \
+word 771b; \
 .byte ptype;   \
 .byte 772b-771b;   \
 .short clobbers;   \
.popsection
 
+
+#ifdef CONFIG_X86_64
+#define PARA_SITE(ptype, clobbers, ops) _PVSITE(ptype, clobbers, ops, .quad, 8)
+#else
+#define PARA_SITE(ptype, clobbers, ops) _PVSITE(ptype, clobbers, ops, .long, 4)
+#endif
+
 #define INTERRUPT_RETURN   \
PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_iret), CLBR_NONE,   \
  jmp *%cs:pv_cpu_ops+PV_CPU_iret)
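
Why the assembler records need the alignment directive is easier to see from the C side. The struct below is an assumed mirror of the record the macros emit (pointer, type byte, length byte, 16-bit clobber mask); the name patch_site and the use of stdint types are choices made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of one .parainstructions record, as C would see it. */
struct patch_site {
    uint8_t *instr;     /* word-sized pointer: .long on i386, .quad on x86_64 */
    uint8_t instrtype;  /* .byte ptype */
    uint8_t len;        /* .byte 772b-771b */
    uint16_t clobbers;  /* .short clobbers */
};
```

The record carries a pointer plus four bytes of data, and the compiler pads each array element so the next pointer is naturally aligned. On x86_64 that means 12 bytes of data in a 16-byte element, so the hand-written assembly presumably has to align each record to 8 bytes to land on the same boundaries.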

[PATCH 06/15] adjust assembly macros to x86_64 as well.

2007-12-20 Thread Glauber de Oliveira Costa

This patch adjusts the paravirt macros used in assembly code
to accommodate x86_64 as well.

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 include/asm-x86/paravirt.h |   18 --
 1 files changed, 12 insertions(+), 6 deletions(-)

Index: linux-2.6-x86/include/asm-x86/paravirt.h
===
--- linux-2.6-x86.orig/include/asm-x86/paravirt.h   2007-12-20 
19:07:25.0 -0800
+++ linux-2.6-x86/include/asm-x86/paravirt.h2007-12-20 19:07:27.0 
-0800
@@ -1159,8 +1159,6 @@
 
 #else  /* __ASSEMBLY__ */
 
-#define PARA_PATCH(struct, off)((PARAVIRT_PATCH_##struct + (off)) / 4)
-
 #define _PVSITE(ptype, clobbers, ops, word, algn)  \
 771:;  \
ops;\
@@ -1175,8 +1173,14 @@
 
 
 #ifdef CONFIG_X86_64
+#define PV_SAVE_REGS   pushq %rax; pushq %rdi; pushq %rcx; pushq %rdx
+#define PV_RESTORE_REGS popq %rdx; popq %rcx; popq %rdi; popq %rax
+#define PARA_PATCH(struct, off)((PARAVIRT_PATCH_##struct + (off)) / 8)
 #define PARA_SITE(ptype, clobbers, ops) _PVSITE(ptype, clobbers, ops, .quad, 8)
 #else
+#define PV_SAVE_REGS   pushl %eax; pushl %edi; pushl %ecx; pushl %edx
+#define PV_RESTORE_REGS popl %edx; popl %ecx; popl %edi; popl %eax
+#define PARA_PATCH(struct, off)((PARAVIRT_PATCH_##struct + (off)) / 4)
 #define PARA_SITE(ptype, clobbers, ops) _PVSITE(ptype, clobbers, ops, .long, 4)
 #endif
 
@@ -1186,25 +1190,27 @@
 
 #define DISABLE_INTERRUPTS(clobbers)   \
PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_disable), clobbers, \
- pushl %eax; pushl %ecx; pushl %edx;   \
+ PV_SAVE_REGS; \
  call *%cs:pv_irq_ops+PV_IRQ_irq_disable;  \
- popl %edx; popl %ecx; popl %eax)  \
+ PV_RESTORE_REGS;) \
 
 #define ENABLE_INTERRUPTS(clobbers)\
PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_enable), clobbers,  \
- pushl %eax; pushl %ecx; pushl %edx;   \
+ PV_SAVE_REGS; \
  call *%cs:pv_irq_ops+PV_IRQ_irq_enable;   \
- popl %edx; popl %ecx; popl %eax)
+ PV_RESTORE_REGS;)
 
 #define ENABLE_INTERRUPTS_SYSCALL_RET  \
PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_irq_enable_syscall_ret),\
  CLBR_NONE,\
  jmp *%cs:pv_cpu_ops+PV_CPU_irq_enable_syscall_ret)
 
+#ifdef CONFIG_X86_32
 #define GET_CR0_INTO_EAX   \
push %ecx; push %edx;   \
call *pv_cpu_ops+PV_CPU_read_cr0;   \
pop %edx; pop %ecx
+#endif
 
 #endif /* __ASSEMBLY__ */
 #endif /* CONFIG_PARAVIRT */
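
The change of the PARA_PATCH divisor from 4 to 8 follows from turning a byte offset into a slot index in a table of function pointers. A small sketch, where struct demo_ops and SLOT_INDEX are hypothetical stand-ins for a pv_*_ops table and the kernel macro:

```c
#include <stddef.h>

/* Hypothetical stand-in for a pv_*_ops table: nothing but function pointers. */
struct demo_ops {
    void (*first)(void);
    void (*second)(void);
    void (*third)(void);
};

/* Slot index = byte offset of the member divided by the pointer size. */
#define SLOT_INDEX(type, field) \
    (offsetof(type, field) / sizeof(void (*)(void)))
```

Assembly code cannot use sizeof, which is presumably why the patch spells the pointer size out as 4 for i386 and 8 for x86_64.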

[PATCH 08/15] add macro for privileged x86_64 operation

2007-12-20 Thread Glauber de Oliveira Costa

i386 has a macro GET_CR0_INTO_EAX, used in early trap handling code.
x86_64 has similar needs, only it needs to put cr2 into rcx. We provide
a macro for that task in the same way.

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 include/asm-x86/paravirt.h |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

Index: linux-2.6-x86/include/asm-x86/paravirt.h
===
--- linux-2.6-x86.orig/include/asm-x86/paravirt.h   2007-12-20 
19:07:28.0 -0800
+++ linux-2.6-x86/include/asm-x86/paravirt.h2007-12-20 19:07:29.0 
-0800
@@ -1227,6 +1227,12 @@
push %ecx; push %edx;   \
call *pv_cpu_ops+PV_CPU_read_cr0;   \
pop %edx; pop %ecx
+#else
+#define GET_CR2_INTO_RCX   \
+   call *pv_mmu_ops+PV_MMU_read_cr2;   \
+   movq %rax, %rcx;\
+   xorq %rax, %rax;
+
 #endif
 
 #endif /* __ASSEMBLY__ */

[PATCH 07/15] change irq functions to accomodate x86_64

2007-12-20 Thread Glauber de Oliveira Costa

This patch changes the irq handling function definitions
in paravirt.h (like raw_local_irq_disable) to accommodate x86_64.
The differences are in the calling convention.

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 include/asm-x86/paravirt.h |   43 ++-
 1 files changed, 30 insertions(+), 13 deletions(-)

Index: linux-2.6-x86/include/asm-x86/paravirt.h
===
--- linux-2.6-x86.orig/include/asm-x86/paravirt.h   2007-12-20 
19:07:27.0 -0800
+++ linux-2.6-x86/include/asm-x86/paravirt.h2007-12-20 19:07:28.0 
-0800
@@ -1085,52 +1085,68 @@
 extern struct paravirt_patch_site __parainstructions[],
__parainstructions_end[];
 
+#ifdef CONFIG_X86_32
+#define PV_SAVE_REGS "pushl %%ecx; pushl %%edx;"
+#define PV_RESTORE_REGS "popl %%edx; popl %%ecx"
+#define PV_FLAGS_ARG "0"
+#define PV_EXTRA_CLOBBERS
+#define PV_VEXTRA_CLOBBERS
+#else
+/* We save some registers, but all of them, that's too much. We clobber all
+ * caller saved registers but the argument parameter */
+#define PV_SAVE_REGS "pushq %%rdi;"
+#define PV_RESTORE_REGS "popq %%rdi;"
+#define PV_EXTRA_CLOBBERS EXTRA_CLOBBERS, "rcx" , "rdx"
+#define PV_VEXTRA_CLOBBERS EXTRA_CLOBBERS, "rdi", "rcx" , "rdx"
+#define PV_FLAGS_ARG "D"
+#endif
+
 static inline unsigned long __raw_local_save_flags(void)
 {
        unsigned long f;
 
-       asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+       asm volatile(paravirt_alt(PV_SAVE_REGS
                                  PARAVIRT_CALL
-                                 "popl %%edx; popl %%ecx")
+                                 PV_RESTORE_REGS)
                     : "=a"(f)
                     : paravirt_type(pv_irq_ops.save_fl),
                       paravirt_clobber(CLBR_EAX)
-                    : "memory", "cc");
+                    : "memory", "cc" PV_VEXTRA_CLOBBERS);
        return f;
 }
 
 static inline void raw_local_irq_restore(unsigned long f)
 {
-       asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+       asm volatile(paravirt_alt(PV_SAVE_REGS
                                  PARAVIRT_CALL
-                                 "popl %%edx; popl %%ecx")
+                                 PV_RESTORE_REGS)
                     : "=a"(f)
-                    : "0"(f),
+                    : PV_FLAGS_ARG(f),
                       paravirt_type(pv_irq_ops.restore_fl),
                       paravirt_clobber(CLBR_EAX)
-                    : "memory", "cc");
+                    : "memory", "cc" PV_EXTRA_CLOBBERS);
 }
 
 static inline void raw_local_irq_disable(void)
 {
-       asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+       asm volatile(paravirt_alt(PV_SAVE_REGS
                                  PARAVIRT_CALL
-                                 "popl %%edx; popl %%ecx")
+                                 PV_RESTORE_REGS)
                     :
                     : paravirt_type(pv_irq_ops.irq_disable),
                       paravirt_clobber(CLBR_EAX)
-                    : "memory", "eax", "cc");
+                    : "memory", "eax", "cc" PV_EXTRA_CLOBBERS);
 }
 
 static inline void raw_local_irq_enable(void)
 {
-       asm volatile(paravirt_alt("pushl %%ecx; pushl %%edx;"
+       asm volatile(paravirt_alt(PV_SAVE_REGS
                                  PARAVIRT_CALL
-                                 "popl %%edx; popl %%ecx")
+                                 PV_RESTORE_REGS)
                     :
                     : paravirt_type(pv_irq_ops.irq_enable),
                       paravirt_clobber(CLBR_EAX)
-                    : "memory", "eax", "cc");
+                    : "memory", "eax", "cc" PV_EXTRA_CLOBBERS);
 }
 
 static inline unsigned long __raw_local_irq_save(void)
@@ -1205,6 +1221,7 @@
  CLBR_NONE,\
  jmp *%cs:pv_cpu_ops+PV_CPU_irq_enable_syscall_ret)
 
+
 #ifdef CONFIG_X86_32
 #define GET_CR0_INTO_EAX   \
push %ecx; push %edx;   \

[PATCH 10/15] replace privileged instructions with paravirt macros

2007-12-20 Thread Glauber de Oliveira Costa

The assembly code in entry_64.S issues a bunch of privileged instructions,
like cli, sti, swapgs, and others. Paravirt guests are forbidden to do so,
and we then replace them with macros that will do the right thing.

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 arch/x86/kernel/entry_64.S |  101 +--
 1 files changed, 59 insertions(+), 42 deletions(-)

Index: linux-2.6-x86/arch/x86/kernel/entry_64.S
===
--- linux-2.6-x86.orig/arch/x86/kernel/entry_64.S   2007-12-20 
19:06:59.0 -0800
+++ linux-2.6-x86/arch/x86/kernel/entry_64.S2007-12-20 19:08:08.0 
-0800
@@ -50,6 +50,7 @@
 #include <asm/hw_irq.h>
 #include <asm/page.h>
 #include <asm/irqflags.h>
+#include <asm/paravirt.h>
 
.code64
 
@@ -57,6 +58,13 @@
 #define retint_kernel retint_restore_args
 #endif 
 
+#ifdef CONFIG_PARAVIRT
+ENTRY(native_irq_enable_syscall_ret)
+   movq%gs:pda_oldrsp,%rsp
+   swapgs
+   sysretq
+#endif /* CONFIG_PARAVIRT */
+
 
 .macro TRACE_IRQS_IRETQ offset=ARGOFFSET
 #ifdef CONFIG_TRACE_IRQFLAGS
@@ -216,14 +224,21 @@
CFI_DEF_CFA rsp,PDA_STACKOFFSET
CFI_REGISTERrip,rcx
/*CFI_REGISTER  rflags,r11*/
-   swapgs
+   SWAPGS_UNSAFE_STACK
+   /*
+* A hypervisor implementation might want to use a label
+* after the swapgs, so that it can do the swapgs
+* for the guest and jump here on syscall.
+*/
+ENTRY(system_call_after_swapgs)
+
movq%rsp,%gs:pda_oldrsp 
movq%gs:pda_kernelstack,%rsp
/*
 * No need to follow this irqs off/on section - it's straight
 * and short:
 */
-   sti 
+   ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_ARGS 8,1
movq  %rax,ORIG_RAX-ARGOFFSET(%rsp) 
movq  %rcx,RIP-ARGOFFSET(%rsp)
@@ -246,7 +261,7 @@
 sysret_check:  
LOCKDEP_SYS_EXIT
GET_THREAD_INFO(%rcx)
-   cli
+   DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
movl threadinfo_flags(%rcx),%edx
andl %edi,%edx
@@ -260,9 +275,7 @@
CFI_REGISTERrip,rcx
RESTORE_ARGS 0,-ARG_SKIP,1
/*CFI_REGISTER  rflags,r11*/
-   movq%gs:pda_oldrsp,%rsp
-   swapgs
-   sysretq
+   ENABLE_INTERRUPTS_SYSCALL_RET
 
CFI_RESTORE_STATE
/* Handle reschedules */
@@ -271,7 +284,7 @@
bt $TIF_NEED_RESCHED,%edx
jnc sysret_signal
TRACE_IRQS_ON
-   sti
+   ENABLE_INTERRUPTS(CLBR_NONE)
pushq %rdi
CFI_ADJUST_CFA_OFFSET 8
call schedule
@@ -282,7 +295,7 @@
/* Handle a signal */ 
 sysret_signal:
TRACE_IRQS_ON
-   sti
+   ENABLE_INTERRUPTS(CLBR_NONE)
testl $(_TIF_SIGPENDING|_TIF_SINGLESTEP|_TIF_MCE_NOTIFY),%edx
jz1f
 
@@ -295,7 +308,7 @@
 1: movl $_TIF_NEED_RESCHED,%edi
/* Use IRET because user could have changed frame. This
   works because ptregscall_common has called FIXUP_TOP_OF_STACK. */
-   cli
+   DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
jmp int_with_check

@@ -327,7 +340,7 @@
  */
.globl int_ret_from_sys_call
 int_ret_from_sys_call:
-   cli
+   DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
testl $3,CS-ARGOFFSET(%rsp)
je retint_restore_args
@@ -349,20 +362,20 @@
bt $TIF_NEED_RESCHED,%edx
jnc  int_very_careful
TRACE_IRQS_ON
-   sti
+   ENABLE_INTERRUPTS(CLBR_NONE)
pushq %rdi
CFI_ADJUST_CFA_OFFSET 8
call schedule
popq %rdi
CFI_ADJUST_CFA_OFFSET -8
-   cli
+   DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
jmp int_with_check
 
/* handle signals and tracing -- both require a full stack frame */
 int_very_careful:
TRACE_IRQS_ON
-   sti
+   ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_REST
/* Check for syscall exit trace */  
testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP),%edx
@@ -385,7 +398,7 @@
 1: movl $_TIF_NEED_RESCHED,%edi
 int_restore_rest:
RESTORE_REST
-   cli
+   DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
jmp int_with_check
CFI_ENDPROC
@@ -506,7 +519,7 @@
CFI_DEF_CFA_REGISTERrbp
testl $3,CS(%rdi)
je 1f
-   swapgs  
+   SWAPGS
/* irqcount is used to check if a CPU is already on an interrupt
   stack or not. While this is essentially redundant with preempt_count
   it is a little cheaper to use a separate counter in the PDA
@@ -527,7 +540,7 @@
interrupt do_IRQ
/* 0(%rsp): oldrsp-ARGOFFSET */
 ret_from_intr:
-   cli 
+   DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
decl %gs:pda_irqcount
leaveq
@@ -556,13 +569,13 @@
/*
 * The iretq

[PATCH 09/15] adds paravirt hook for swapgs

2007-12-20 Thread Glauber de Oliveira Costa

This patch adds a paravirt hook for the swapgs operation, which is a
privileged instruction on x86_64.

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 arch/x86/kernel/paravirt.c  |1 +
 include/asm-x86/paravirt.h  |9 +
 include/asm-x86/processor.h |8 
 3 files changed, 18 insertions(+), 0 deletions(-)

Index: linux-2.6-x86/arch/x86/kernel/paravirt.c
===
--- linux-2.6-x86.orig/arch/x86/kernel/paravirt.c   2007-12-20 19:07:22.0 -0800
+++ linux-2.6-x86/arch/x86/kernel/paravirt.c   2007-12-20 19:08:06.0 -0800
@@ -390,6 +390,7 @@
 
.irq_enable_syscall_ret = native_irq_enable_syscall_ret,
.iret = native_iret,
+   .swapgs = native_swapgs,
 
.set_iopl_mask = native_set_iopl_mask,
.io_delay = native_io_delay,
Index: linux-2.6-x86/include/asm-x86/paravirt.h
===
--- linux-2.6-x86.orig/include/asm-x86/paravirt.h   2007-12-20 19:07:29.0 -0800
+++ linux-2.6-x86/include/asm-x86/paravirt.h   2007-12-20 19:08:06.0 -0800
@@ -127,6 +127,8 @@
void (*irq_enable_syscall_ret)(void);
void (*iret)(void);
 
+   void (*swapgs)(void);
+
struct pv_lazy_ops lazy_mode;
 };
 
@@ -1228,6 +1230,13 @@
call *pv_cpu_ops+PV_CPU_read_cr0;   \
pop %edx; pop %ecx
 #else
+#define SWAPGS \
+   PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_swapgs), CLBR_NONE, \
+ PV_SAVE_REGS  \
+ call *pv_cpu_ops+PV_CPU_swapgs;   \
+ PV_RESTORE_REGS   \
+)
+
 #define GET_CR2_INTO_RCX   \
call *pv_mmu_ops+PV_MMU_read_cr2;   \
movq %rax, %rcx;\
Index: linux-2.6-x86/include/asm-x86/processor.h
===
--- linux-2.6-x86.orig/include/asm-x86/processor.h   2007-12-20 19:06:59.0 -0800
+++ linux-2.6-x86/include/asm-x86/processor.h   2007-12-20 19:08:06.0 -0800
@@ -435,6 +435,13 @@
 #endif
 }
 
+static inline void native_swapgs(void)
+{
+#ifdef CONFIG_X86_64
+   asm volatile("swapgs" ::: "memory");
+#endif
+}
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
@@ -456,6 +463,7 @@
 }
 
 #define set_iopl_mask native_set_iopl_mask
+#define SWAPGS swapgs
 #endif /* CONFIG_PARAVIRT */
 
 /*
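
For readers unfamiliar with the paravirt ops tables, the shape of the hook added by this patch can be sketched in plain user-space C. This is only an illustration under stated assumptions: `cpu_ops_sketch`, `native_swapgs_sketch`, `native_calls` and `do_swapgs` are hypothetical names, and the "native" function merely counts calls instead of executing swapgs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down ops table: one hook, filled with a native default. */
struct cpu_ops_sketch {
	void (*swapgs)(void);
};

static int native_calls;	/* counts calls so the sketch is observable */

static void native_swapgs_sketch(void)
{
	native_calls++;		/* the real native hook would execute swapgs */
}

static struct cpu_ops_sketch cpu_ops = {
	.swapgs = native_swapgs_sketch,
};

/* Callers go through the table, so a hypervisor backend can swap the hook. */
static void do_swapgs(void)
{
	assert(cpu_ops.swapgs != NULL);
	cpu_ops.swapgs();
}
```

The point of the indirection is that a paravirtualized backend only has to install a different function pointer; the call sites stay unchanged.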

[PATCH 11/15] cleanup CLI_STRING, STI_STRING and friends

2007-12-20 Thread Glauber de Oliveira Costa

Since the advent of ticket locking, CLI_STRING, STI_STRING, and friends
are not used anymore. They can now be safely deleted.

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 include/asm-x86/spinlock.h |9 -
 1 files changed, 0 insertions(+), 9 deletions(-)

Index: linux-2.6-x86/include/asm-x86/spinlock.h
===
--- linux-2.6-x86.orig/include/asm-x86/spinlock.h   2007-12-20 19:06:59.0 -0800
+++ linux-2.6-x86/include/asm-x86/spinlock.h   2007-12-20 19:08:09.0 -0800
@@ -19,15 +19,6 @@
  * (the type definitions are in asm/spinlock_types.h)
  */
 
-#ifdef CONFIG_PARAVIRT
-#include <asm/paravirt.h>
-#else
-#define CLI_STRING "cli"
-#define STI_STRING "sti"
-#define CLI_STI_CLOBBERS
-#define CLI_STI_INPUT_ARGS
-#endif /* CONFIG_PARAVIRT */
-
 #ifdef CONFIG_X86_32
 typedef char _slock_t;
 # define LOCK_INS_DEC "decb"

[PATCH 12/15] add CLBR_ defines for x86_64.

2007-12-20 Thread Glauber de Oliveira Costa

x86_64 needs a potentially larger clobber list than i386, due to its calling
convention. So we add more CLBR_ defines for it.
Note that CLBR_ANY is different for each of the architectures, since it
comprises the notion of "all call clobbers in this architecture".

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 include/asm-x86/paravirt.h |   23 ++-
 1 files changed, 18 insertions(+), 5 deletions(-)

Index: linux-2.6-x86/include/asm-x86/paravirt.h
===
--- linux-2.6-x86.orig/include/asm-x86/paravirt.h   2007-12-20 19:08:06.0 -0800
+++ linux-2.6-x86/include/asm-x86/paravirt.h   2007-12-20 19:08:10.0 -0800
@@ -8,11 +8,24 @@
 #include <asm/asm.h>
 
 /* Bitmask of what can be clobbered: usually at least eax. */
-#define CLBR_NONE 0x0
-#define CLBR_EAX 0x1
-#define CLBR_ECX 0x2
-#define CLBR_EDX 0x4
-#define CLBR_ANY 0x7
+#define CLBR_NONE 0
+#define CLBR_EAX  (1 << 0)
+#define CLBR_ECX  (1 << 1)
+#define CLBR_EDX  (1 << 2)
+
+#ifdef CONFIG_X86_64
+#define CLBR_RSI  (1 << 3)
+#define CLBR_RDI  (1 << 4)
+#define CLBR_R8   (1 << 5)
+#define CLBR_R9   (1 << 6)
+#define CLBR_R10  (1 << 7)
+#define CLBR_R11  (1 << 8)
+#define CLBR_ANY  ((1 << 9) - 1)
+#include <asm/desc_defs.h>
+#else
+/* CLBR_ANY should match all regs platform has. For i386, that's just it */
+#define CLBR_ANY  ((1 << 3) - 1)
+#endif /* X86_64 */
 
 #ifndef __ASSEMBLY__
 #include <linux/types.h>
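
To illustrate the bitmask scheme in the hunk above: each CLBR_ flag occupies one bit, so CLBR_ANY is simply "all defined bits set". The following is a minimal user-space sketch, not kernel code. It hardcodes the x86_64 values from the patch, and `clbr_allows()` is a hypothetical helper introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the x86_64 branch of the patch. */
#define CLBR_NONE 0
#define CLBR_EAX  (1 << 0)
#define CLBR_ECX  (1 << 1)
#define CLBR_EDX  (1 << 2)
#define CLBR_RSI  (1 << 3)
#define CLBR_RDI  (1 << 4)
#define CLBR_R8   (1 << 5)
#define CLBR_R9   (1 << 6)
#define CLBR_R10  (1 << 7)
#define CLBR_R11  (1 << 8)
#define CLBR_ANY  ((1 << 9) - 1)	/* nine flags, so bits 0..8 are all set */

/*
 * Hypothetical helper: true if every register named in 'wanted' may be
 * clobbered according to 'allowed'.
 */
static bool clbr_allows(unsigned allowed, unsigned wanted)
{
	assert((wanted & ~CLBR_ANY) == 0);	/* only known flags */
	return (allowed & wanted) == wanted;
}
```

With nine flags defined, CLBR_ANY evaluates to 0x1ff on x86_64, while the i386 branch, with three flags, gives 0x7, the same value as the old hardcoded define.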

[PATCH 13/15] move patching code to arch-specific file.

2007-12-20 Thread Glauber de Oliveira Costa

The core patching code for paravirt differs enough between i386 and
x86_64 that we move it to arch-specific files.

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 arch/x86/kernel/Makefile_32 |2 +-
 arch/x86/kernel/paravirt.c  |   50 ---
 include/asm-x86/paravirt.h  |8 +++
 3 files changed, 9 insertions(+), 51 deletions(-)

Index: linux-2.6-x86/arch/x86/kernel/Makefile_32
===
--- linux-2.6-x86.orig/arch/x86/kernel/Makefile_32   2007-12-20 19:07:15.0 -0800
+++ linux-2.6-x86/arch/x86/kernel/Makefile_32   2007-12-20 19:08:11.0 -0800
@@ -48,7 +48,7 @@
 obj-$(CONFIG_MGEODE_LX) += geode_32.o mfgpt_32.o
 
 obj-$(CONFIG_VMI)  += vmi_32.o vmiclock_32.o
-obj-$(CONFIG_PARAVIRT) += paravirt.o
+obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_32.o
 obj-y  += pcspeaker.o
 
 obj-$(CONFIG_SCx200)   += scx200_32.o
Index: linux-2.6-x86/arch/x86/kernel/paravirt.c
===
--- linux-2.6-x86.orig/arch/x86/kernel/paravirt.c   2007-12-20 19:08:06.0 -0800
+++ linux-2.6-x86/arch/x86/kernel/paravirt.c   2007-12-20 19:08:11.0 -0800
@@ -58,59 +58,9 @@
extern const char start_##ops##_##name[], end_##ops##_##name[]; \
asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":")
 
-DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
-DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
-DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
-DEF_NATIVE(pv_cpu_ops, iret, "iret");
-DEF_NATIVE(pv_cpu_ops, irq_enable_syscall_ret, "sti; sysexit");
-DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
-DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3");
-DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax");
-DEF_NATIVE(pv_cpu_ops, clts, "clts");
-DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc");
-
 /* Undefined instruction for dealing with missing ops pointers. */
 static const unsigned char ud2a[] = { 0x0f, 0x0b };
 
-static unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
-unsigned long addr, unsigned len)
-{
-   const unsigned char *start, *end;
-   unsigned ret;
-
-   switch(type) {
-#define SITE(ops, x)   \
-   case PARAVIRT_PATCH(ops.x): \
-   start = start_##ops##_##x;  \
-   end = end_##ops##_##x;  \
-   goto patch_site
-
-   SITE(pv_irq_ops, irq_disable);
-   SITE(pv_irq_ops, irq_enable);
-   SITE(pv_irq_ops, restore_fl);
-   SITE(pv_irq_ops, save_fl);
-   SITE(pv_cpu_ops, iret);
-   SITE(pv_cpu_ops, irq_enable_syscall_ret);
-   SITE(pv_mmu_ops, read_cr2);
-   SITE(pv_mmu_ops, read_cr3);
-   SITE(pv_mmu_ops, write_cr3);
-   SITE(pv_cpu_ops, clts);
-   SITE(pv_cpu_ops, read_tsc);
-#undef SITE
-
-   patch_site:
-   ret = paravirt_patch_insns(ibuf, len, start, end);
-   break;
-
-   default:
-   ret = paravirt_patch_default(type, clobbers, ibuf, addr, len);
-   break;
-   }
-
-   return ret;
-}
-
 unsigned paravirt_patch_nop(void)
 {
return 0;
Index: linux-2.6-x86/include/asm-x86/paravirt.h
===
--- linux-2.6-x86.orig/include/asm-x86/paravirt.h   2007-12-20 19:08:10.0 -0800
+++ linux-2.6-x86/include/asm-x86/paravirt.h   2007-12-20 19:08:11.0 -0800
@@ -308,6 +308,11 @@
 #define paravirt_alt(insn_string)  \
_paravirt_alt(insn_string, "%c[paravirt_typenum]", "%c[paravirt_clobber]")
 
+/* Simple instruction patching code. */
+#define DEF_NATIVE(ops, name, code)\
+   extern const char start_##ops##_##name[], end_##ops##_##name[]; \
+   asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":")
+
 unsigned paravirt_patch_nop(void);
 unsigned paravirt_patch_ignore(unsigned len);
 unsigned paravirt_patch_call(void *insnbuf,
@@ -322,6 +327,9 @@
 unsigned paravirt_patch_insns(void *insnbuf, unsigned len,
  const char *start, const char *end);
 
+unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
+ unsigned long addr, unsigned len);
+
 int paravirt_disable_iospace(void);
 
 /*

[PATCH 14/15] x86_64 patching functions

2007-12-20 Thread Glauber de Oliveira Costa

Like i386, x86_64 also needs to include its own patching function.
(Well, if you're not in a hurry, and don't care about speed, you don't
really _need_ it ;-))

So here they are. Not much different in essence from i386.

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 arch/x86/kernel/Makefile_64 |1 +
 arch/x86/kernel/paravirt_patch_64.c |   56 +++
 2 files changed, 57 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/kernel/paravirt_patch_64.c

Index: linux-2.6-x86/arch/x86/kernel/Makefile_64
===
--- linux-2.6-x86.orig/arch/x86/kernel/Makefile_64   2007-12-20 19:05:52.0 -0800
+++ linux-2.6-x86/arch/x86/kernel/Makefile_64   2007-12-20 19:08:14.0 -0800
@@ -41,6 +41,7 @@
 obj-$(CONFIG_K8_NB) += k8.o
 obj-$(CONFIG_AUDIT) += audit_64.o
 obj-$(CONFIG_EFI)  += efi.o efi_64.o efi_stub_64.o
+obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_64.o
 
 obj-$(CONFIG_MODULES)  += module_64.o
 obj-$(CONFIG_PCI)  += early-quirks.o
Index: linux-2.6-x86/arch/x86/kernel/paravirt_patch_64.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6-x86/arch/x86/kernel/paravirt_patch_64.c   2007-12-20 19:08:14.0 -0800
@@ -0,0 +1,56 @@
+#include <asm/paravirt.h>
+#include <asm/asm-offsets.h>
+
+DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
+DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
+DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq");
+DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax");
+DEF_NATIVE(pv_cpu_ops, iret, "iretq");
+DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
+DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");
+DEF_NATIVE(pv_mmu_ops, write_cr3, "movq %rdi, %cr3");
+DEF_NATIVE(pv_mmu_ops, flush_tlb_single, "invlpg (%rdi)");
+DEF_NATIVE(pv_cpu_ops, clts, "clts");
+DEF_NATIVE(pv_cpu_ops, wbinvd, "wbinvd");
+
+/* the three commands give us more control to how to return from a syscall */
+DEF_NATIVE(pv_cpu_ops, irq_enable_syscall_ret, "movq %gs:" __stringify(pda_oldrsp) ", %rsp; swapgs; sysretq;");
+DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs");
+
+unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
+ unsigned long addr, unsigned len)
+{
+   const unsigned char *start, *end;
+   unsigned ret;
+
+#define PATCH_SITE(ops, x) \
+   case PARAVIRT_PATCH(ops.x): \
+   start = start_##ops##_##x;  \
+   end = end_##ops##_##x;  \
+   goto patch_site
+   switch(type) {
+   PATCH_SITE(pv_irq_ops, restore_fl);
+   PATCH_SITE(pv_irq_ops, save_fl);
+   PATCH_SITE(pv_irq_ops, irq_enable);
+   PATCH_SITE(pv_irq_ops, irq_disable);
+   PATCH_SITE(pv_cpu_ops, iret);
+   PATCH_SITE(pv_cpu_ops, irq_enable_syscall_ret);
+   PATCH_SITE(pv_cpu_ops, swapgs);
+   PATCH_SITE(pv_mmu_ops, read_cr2);
+   PATCH_SITE(pv_mmu_ops, read_cr3);
+   PATCH_SITE(pv_mmu_ops, write_cr3);
+   PATCH_SITE(pv_cpu_ops, clts);
+   PATCH_SITE(pv_mmu_ops, flush_tlb_single);
+   PATCH_SITE(pv_cpu_ops, wbinvd);
+
+   patch_site:
+   ret = paravirt_patch_insns(ibuf, len, start, end);
+   break;
+
+   default:
+   ret = paravirt_patch_default(type, clobbers, ibuf, addr, len);
+   break;
+   }
+#undef PATCH_SITE
+   return ret;
+}
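
As a reading aid for the patch above: native_patch() selects the start/end labels of a native instruction sequence and hands them to paravirt_patch_insns(). The sketch below shows one plausible behaviour for such a helper, under the assumption that the bytes are copied only when they fit into the patch site. `patch_insns_sketch()` is a hypothetical user-space stand-in, and the real paravirt_patch_insns() may differ in detail.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical stand-in for paravirt_patch_insns(): copy the native
 * instruction bytes [start, end) into the patch buffer if they fit in
 * 'len' bytes. Returns the number of bytes written, or 'len' unchanged
 * when the sequence is too long and the original call site is kept.
 */
static unsigned patch_insns_sketch(void *insnbuf, unsigned len,
				   const char *start, const char *end)
{
	unsigned insn_len;

	assert(start != NULL && end >= start);
	insn_len = (unsigned)(end - start);

	if (insn_len > len)
		return len;		/* does not fit: leave the site alone */

	memcpy(insnbuf, start, insn_len);
	return insn_len;
}
```

A short sequence such as a single-byte cli always fits into a call-sized site, which is why these ops are worth patching inline rather than calling through the ops table.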

[PATCH 15/15] replace x86_read/write_per_cpu with a common function.

2007-12-20 Thread Glauber de Oliveira Costa

x86_read_percpu() and its writeish sister are not present in x86_64. So in
this patch, we replace them with __get_cpu_var(), which is present in both
architectures.

Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
---
 arch/x86/kernel/paravirt.c |   10 +-
 1 files changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6-x86/arch/x86/kernel/paravirt.c
===
--- linux-2.6-x86.orig/arch/x86/kernel/paravirt.c   2007-12-20 19:08:11.0 -0800
+++ linux-2.6-x86/arch/x86/kernel/paravirt.c   2007-12-20 19:08:18.0 -0800
@@ -238,18 +238,18 @@
 
 static inline void enter_lazy(enum paravirt_lazy_mode mode)
 {
-   BUG_ON(x86_read_percpu(paravirt_lazy_mode) != PARAVIRT_LAZY_NONE);
+   BUG_ON(__get_cpu_var(paravirt_lazy_mode) != PARAVIRT_LAZY_NONE);
BUG_ON(preemptible());
 
-   x86_write_percpu(paravirt_lazy_mode, mode);
+   __get_cpu_var(paravirt_lazy_mode) = mode;
 }
 
 void paravirt_leave_lazy(enum paravirt_lazy_mode mode)
 {
-   BUG_ON(x86_read_percpu(paravirt_lazy_mode) != mode);
+   BUG_ON(__get_cpu_var(paravirt_lazy_mode) != mode);
BUG_ON(preemptible());
 
-   x86_write_percpu(paravirt_lazy_mode, PARAVIRT_LAZY_NONE);
+   __get_cpu_var(paravirt_lazy_mode) = PARAVIRT_LAZY_NONE;
 }
 
 void paravirt_enter_lazy_mmu(void)
@@ -274,7 +274,7 @@
 
 enum paravirt_lazy_mode paravirt_get_lazy_mode(void)
 {
-   return x86_read_percpu(paravirt_lazy_mode);
+   return __get_cpu_var(paravirt_lazy_mode);
 }
 
 struct pv_info pv_info = {

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Arjan van de Ven


Parag Warudkar wrote:

On Dec 20, 2007 2:22 PM, Kok, Auke [EMAIL PROTECTED] wrote:

ok, that's just bad and if there's no user-definable limit to the deferral I
definitely don't like this change.

Can I safely assume that any irq will cause all deferred timers to run?


I think even other causes for wakeup like process related ones will
cause the CPU to go busy and run the timers.
This, coupled with the fact that no one is yet able to reach 0 wakeups
per second makes it pretty unlikely that deferrable timers will be
deferred indefinitely.


0.8 is easy on single core today.
multicore just increases how idle you can be for a given core.




If this is the case then for e1000 this patch is still OK since the watchdog
needs to run (1) after a link up/down interrupt or (2) to update statistics.
Those statistics won't increase if there is no traffic of course...



I think it is reasonable for Network driver watchdogs to use a
deferrable timer - if the machine is 100% IDLE there is no one needing
the network to be up. If there is something running even on the other
CPU - that is going to cause an IPI, reschedule, TLB invalidation etc.
which will make it very likely in practice that each CPU will be
interrupted in reasonable amount of time.


this is not correct; many machines are idle waiting for network data.
Think of webservers...



Of course there are theoretical cases where we could land into a
situation where a CPU in a multiprocessor machine is IDLE infinitely
and that causes the watchdog that happens to be bound to run on the
same CPU to not run. To take care of these unlikely cases I think the
timer mechanism should have a reasonable limit on how long a CPU can
go IDLE if there are deferrable timers.


how about something else instead: a timer mechanism that takes a range..
that at least has defined semantics; the deferrable semantics really are
indefinite.
Let's keep at least the semantics clear and clean.


Re: [PATCH 1/3] ps3: vuart: fix error path locking

2007-12-20 Thread Daniel Walker

On Thu, 2007-12-20 at 12:06 -0800, Andrew Morton wrote:
 On Thu, 20 Dec 2007 11:32:25 -0800 Daniel Walker [EMAIL PROTECTED] wrote:
 
  On Tue, 2007-12-18 at 19:04 -0800, Geoff Levand wrote:
  
   Unfortunately there wasn't enough context in the patch to see
   that there is a down() earlier in the routine, and that the patch
   does indeed remove an incorrectly placed down().  Here is the
   entire routine, marked with what the patch removes.
   
  
  Andrew have you had a chance to review this?
  
 
 Confused.  I did review it: http://lkml.org/lkml/2007/12/18/384

Yeah, but Geoff countered http://lkml.org/lkml/2007/12/18/409

Do you still think the patch is wrong, given the whole function?

Daniel


Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Tony Camuso


Matthew Wilcox wrote:


Here's how BARs work ... when you write 0xffffffff to the BAR, it
ignores all the set bits that are less than the size of the BAR.  So,
assuming this is a 256MB BAR (like my G33 is), what ends up written to
this BAR is 0xf0000000.  Now, because this is graphics, apparently it's
special and embedded in the chipset, even though it looks like it's a
PCI device.  So it actually gets priority over MMCONFIG which is also
mapped to 0xf0000000.

For your case of a 64-bit BAR, you could write 0xffffffff to the high
32-bits first, then write to the low 32-bits, then reset the low, then
high bits, and you'd avoid the problem.  But the G33 has a 32-bit BAR
with the same problem, so it won't work for that case.

BARs that are behind bridges don't have this problem (they can't decode
memory accesses that aren't forwarded to them).  BARs on devices which
have memory IO disabled also don't have this problem, but disabling
devices has its problems (as does probing BARs for active devices anyway
...).
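
The sizing procedure Matthew describes can be sketched as follows. This is a user-space simulation, not kernel code: `struct fake_bar`, `bar_write()` and `bar_size()` are hypothetical names, and the model ignores the low flag bits that a real memory BAR carries.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 32-bit memory BAR of a given power-of-two size. */
struct fake_bar {
	uint32_t size;	/* decoded size in bytes, power of two */
	uint32_t value;	/* currently programmed base address */
};

/* The device ignores address bits below its size. */
static void bar_write(struct fake_bar *bar, uint32_t v)
{
	assert(bar->size != 0 && (bar->size & (bar->size - 1)) == 0);
	bar->value = v & ~(bar->size - 1);
}

/* Size the BAR: save, write all ones, read back, restore. */
static uint32_t bar_size(struct fake_bar *bar)
{
	uint32_t saved = bar->value;
	uint32_t readback;

	bar_write(bar, 0xffffffffu);	/* BAR briefly decodes at the top of memory */
	readback = bar->value;
	bar_write(bar, saved);

	return ~readback + 1;		/* lowest writable bit gives the size */
}
```

In this model a 256MB BAR reads back 0xf0000000 after the all-ones write, and it stays at that address until the saved value is restored. That interval is the window in which the BAR can shadow anything else mapped there.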


Thanks for the detailed explanation.

 The question is how large can 32-bit BARs get.  As we've seen, 256MB
 exist, and are causing pain.  I can't imagine any PCI device
 manufacturer thinks they can allocate 2GB of the low space, but we could
 potentially mis-size a large BAR by not using 0xffffffff.


Point well taken. Graphics devices understandably consume a lot of memory
space, and are likely to consume even more in the not-too-distant future.


I'm really not clear on the purpose of your patchset.  Was it all to
address this one problem?



No. My patch-set does not address this problem at all, but rather the
larger problem of having mmconfig-unfriendly devices on buses that are
out of reach of the unreachable_devices() routine and bitmap.

This problem is one I encountered during my testing and mentioned in
my preamble as not being fixable by my patch-set.



Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Matthew Wilcox

On Thu, Dec 20, 2007 at 03:05:52PM -0500, Loic Prylli wrote:
 I am not familiar with the tg3 driver, just trying to give a 5 minutes
 look, it seems the typical cases where the pci-conf-space is used
 intensively are with some rev in combination with the 82801
 (TG3_FLG2_ICH_WORKAROUND) which I don't think support mmconfig anyway,
 as well as some very specific PCI-X combinations
 (TG3_FLAG_PCIX_TARGET_HWBUG) which are also very unlikely to support
 mmconfig.

It's not a question of whether the card supports mmconfig or not -- the
card can't tell whether a first-256-byte pci config transaction was
initiated through mmconfig, type1, type2, or even a bios call.

I'm just hacking together an implementation based on Ivan's suggestion
to always use type 1 for the first 64 bytes.

-- 
Intel are signing my paycheques ... these opinions are still mine
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.

Re: /dev/urandom uses uninit bytes, leaks user data

2007-12-20 Thread Phillip Susi


Andrew Lutomirski wrote:

I understand that there's no way that /dev/random can provide good
output if there's insufficient entropy.  But it still shouldn't leak
arbitrary bits of user data that were never meant to be put into the
pool at all.


It doesn't leak it though, it consumes it, and it then vanishes into the 
entropy pool, never to be seen again.



Step 1: Boot a system without a usable entropy source.
Step 2: add some (predictable) entropy from userspace which isn't a
multiple of 4, so up to three extra bytes get added.
Step 3: Read a few bytes of /dev/random and send them over the network.


Only root can do 1 and 2, at which point, it's already game over.


Re: [Fwd: Re: [PATCH 0/5]PCI: x86 MMCONFIG]

2007-12-20 Thread Tony Camuso


Ivan Kokshaysky wrote:

On Thu, Dec 20, 2007 at 12:08:33PM -0700, Matthew Wilcox wrote:

On Thu, Dec 20, 2007 at 02:04:31PM -0500, Tony Camuso wrote:

Does anybody see a down side to this?

It'll be slower than it would be if we used mmconfig directly.  Now yes,
nobody should be using pci config space in performance critical paths
... but see the tg3 driver.


Use type 1 just for the first 64 bytes and tg3 will be happy. All we need
is to avoid touching BARs with mmconfig.

Ivan.


Sounds good to me.

However, this does add another in-line test for every pci config access.

The existing test is a lookup in the unreachable_devices bitmap. Even though
the bitmap only covers the first 16 buses, the lookup is performed for every
pci config access.

We would be adding another test,

  if the offset is less than 64 bytes, use legacy pci config
  else use mmconfig

The emitted code would probably only produce one branch, so it shouldn't
present a performance degradation.

Any objections to taking this tack?
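
The extra test could look roughly like the sketch below. All names here (`cfg_method`, `pick_cfg_method`, `LEGACY_CFG_LIMIT`) are hypothetical; the sketch only shows the offset check, not the actual config accessors.

```c
#include <assert.h>

enum cfg_method { CFG_LEGACY_TYPE1, CFG_MMCONFIG };

/* First 64 bytes: the standard header, which contains the BARs. */
#define LEGACY_CFG_LIMIT 64

/* Pick the access method for a config-space register offset. */
static enum cfg_method pick_cfg_method(int reg)
{
	assert(reg >= 0 && reg < 4096);	/* extended config space is 4KB per function */
	return reg < LEGACY_CFG_LIMIT ? CFG_LEGACY_TYPE1 : CFG_MMCONFIG;
}
```

A single comparison on the register offset is all the dispatch needs, which matches the expectation that it compiles down to one branch.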

Re: [PATCH] sky2: Use deferrable timer for watchdog

2007-12-20 Thread Krzysztof Oledzki




On Thu, 20 Dec 2007, Parag Warudkar wrote:


On Dec 20, 2007 2:22 PM, Kok, Auke [EMAIL PROTECTED] wrote:

ok, that's just bad and if there's no user-definable limit to the deferral I
definitely don't like this change.

Can I safely assume that any irq will cause all deferred timers to run?


I think even other causes for wakeup like process related ones will
cause the CPU to go busy and run the timers.
This, coupled with the fact that no one is yet able to reach 0 wakeups
per second makes it pretty unlikely that deferrable timers will be
deferred indefinitely.



If this is the case then for e1000 this patch is still OK since the watchdog
needs to run (1) after a link up/down interrupt or (2) to update statistics.
Those statistics won't increase if there is no traffic of course...



I think it is reasonable for Network driver watchdogs to use a
deferrable timer - if the machine is 100% IDLE there is no one needing
the network to be up.


Please note that being connected to a network means not only sending
but also receiving.


Best regards,

Krzysztof Oledzki

Re: iommu dma mapping alignment requirements

2007-12-20 Thread Benjamin Herrenschmidt

Adding a few more people to the discussion. You may well be right and we
would have to provide the same alignment, though that sucks a bit as one
of the reasons we switched to 4K for the IOMMU is that the iommu space
available on pSeries is very small and we were running out of it with
64K pages and lots of networking activity.

On Thu, 2007-12-20 at 11:14 -0600, Steve Wise wrote:
 Hey Roland (and any iommu/ppc/dma experts out there):
 
 I'm debugging a data corruption issue that happens on PPC64 systems 
 running rdma on kernels where the iommu page size is 4KB yet the host 
 page size is 64KB.  This feature was added to the PPC64 code recently, 
 and is in kernel.org from 2.6.23.  So if the kernel is built with a 4KB 
 page size, no problems.  If the kernel is prior to 2.6.23 then 64KB page 
   configs work too. It's just a problem when the iommu page size != host 
 page size.
 
 It appears that my problem boils down to a single host page of memory 
 that is mapped for dma, and the dma address returned by dma_map_sg() is 
 _not_ 64KB aligned.  Here is an example:
 
 app registers va 0x2d9a3000 len 12288
 ib_umem_get() creates and maps a umem and chunk that looks like (dumping 
 state from a registered user memory region):
 
  umem len 12288 off 12288 pgsz 65536 shift 16
  chunk 0: nmap 1 nents 1
  sglist[0] page 0xc0930b08 off 0 len 65536 dma_addr 5bff4000 dma_len 65536
  
 
 So the kernel maps 1 full page for this MR.  But note that the dma 
 address is 5bff4000 which is 4KB aligned, not 64KB aligned.  I 
 think this is causing grief to the RDMA HW.
 
 My first question is: Is there an assumption or requirement in linux 
 that dma addresses should have the same alignment as the host address 
 they are mapped to?  IE the rdma core is mapping the entire 64KB page, 
 but the mapping doesn't begin on a 64KB page boundary.
 
 If this mapping is considered valid, then perhaps the rdma hw is at 
 fault here.  But I'm wondering if this is a PPC/iommu bug.
 
 BTW:  Here is what the Memory Region looks like to the HW:
 
  TPT entry:  stag idx 0x2e800 key 0xff state VAL type NSMR pdid 0x2
  perms RW rem_inv_dis 0 addr_type VATO
  bind_enable 1 pg_size 65536 qpid 0x0 pbl_addr 0x003c67c0
  len 12288 va 2d9a3000 bind_cnt 0
  PBL: 5bff4000
 
 
 
 Any thoughts?
 
 Steve.
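
The alignment observation in Steve's report can be checked with a few lines of C. This is a minimal sketch using the dma address exactly as printed in the report; `is_aligned()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr is a multiple of align; align must be a power of two. */
static bool is_aligned(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr & (align - 1)) == 0;
}
```

For the address in the report, is_aligned(0x5bff4000, 4096) holds while is_aligned(0x5bff4000, 65536) does not: the mapping starts 0x4000 bytes past a 64KB boundary.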
 


Re: [ofa-general] iommu dma mapping alignment requirements

2007-12-20 Thread Benjamin Herrenschmidt


On Thu, 2007-12-20 at 13:29 -0600, Steve Wise wrote:

 Or based on the alignment of vaddr actually...

The latter wouldn't be realistic. What I think might be necessary, though
it would definitely cause us problems with running out of iommu space
(which is the reason we did the switch down to 4K), is to provide
alignment to the real page size, and alignment to the allocation order
for dma_map_consistent.

It might be possible to -tweak- and only provide alignment to the page
size for allocations that are larger than IOMMU_PAGE_SIZE. That would
solve the problem with small network packets eating up too much iommu
space though.

What do you think ?
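
One way to read the tweak proposed above, as a sketch only: mappings no larger than the IOMMU page keep IOMMU-page alignment, and larger ones are aligned to the host page. The function name `map_alignment()` and the concrete sizes (4K IOMMU page, 64K host page, both taken from this thread) are assumptions for illustration, not the powerpc implementation.

```c
#include <assert.h>
#include <stddef.h>

#define IOMMU_PAGE_SIZE 4096u	/* 4K IOMMU page, as discussed in the thread */
#define HOST_PAGE_SIZE  65536u	/* 64K host page configuration */

/* Hypothetical policy: alignment to request for a mapping of 'len' bytes. */
static size_t map_alignment(size_t len)
{
	assert(len > 0);
	return len > IOMMU_PAGE_SIZE ? HOST_PAGE_SIZE : IOMMU_PAGE_SIZE;
}
```

Under this policy a 1500-byte network packet still consumes only one 4K IOMMU page, while a full 64K host page mapping would start on a 64K boundary.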

Ben.


