from:"willem"

Re: git guidance

2007-11-28 Thread willem


Dave Quigley wrote:

On Wed, 2007-11-28 at 16:57 +0100, Tilman Schmidt wrote:
  

Dave Quigley schrieb:


There is a project listed on the kernel.org git page called guilt. I
find it very useful. It is much more responsive than stgit and it
actually has a git backend which quilt does not.

On Wed, 2007-11-28 at 00:20 +0100, Jan Engelhardt wrote:
  

On Nov 27 2007 23:33, Tilman Schmidt wrote:


Well, it did. So now I'm back to keeping a virgin kernel source tree
alongside my development area in order to produce diffs. That can't
be right?

  

No, it can't. Use stgit/quilt ;p


In which respect would stgit/quilt/guilt help me? At first glance
they just seem to add another level of complexity.

Thanks,
Tilman




These tools allow you to maintain a set of patches with very little
effort. More importantly it removes a lot of the git specifics from your
development process. For example this is how I use guilt for a new patch
set.

I take my fresh tree and do a guilt-init in the base. This will create a
new patch series. I then need to create a patch to modify something LSM
related (guilt-new patch_name). Things like stgit/quilt/git use the
idea of a stack of patches. At this point if you were to type
guilt-series you would only see the one patch we just created. This
patch is going to be one logical set of changes (it should also produce
a compilable and working kernel). You can make whatever modifications
you need to make to your files and at this point you need to do one of
two things. If they were already in the tree you just type guilt-refresh
and under your .git/patches/branch_name directory you will see a file
named patch_name which contains your patch. Otherwise you need to do a
guilt-add file_name and then a guilt-refresh. The idea here is that
you have a moved your workflow from managing a series of commits and
then breaking out patches from a final version to one where you think in
terms of the patches and make modifications to them instead. In my
example I said I was doing something LSM related. Lets say the first
patch added a new hook and its implementation in the various modules. We
can now add a second patch using the guilt-new command and this one will
add uses of that new hook. At this point we have a stack that looks like
this.

patch that adds users
patch that adds hook

I can pop and push patches onto this stack to have a version of my
kernel tree at any state within the patch set. At this point lets say we
have posted the patch set and have feedback. I need to apply this
feedback to the patch that adds the LSM hook. Since my top patch
(guilt-top) is currently at the one that adds the users of the hook I
need to pop off that patch and get to the one that creates the hook
(guilt-pop). After doing this I'm at a kernel tree state which just has
the changes which add the hook. I make my modifications, type
guilt-refresh to create a new patch and then guilt-push my second patch
on and make sure everything is still working.

As you can see there is almost no git knowledge required to use this
system and it allows you to focus on development instead of the
versioning system. One useful feature is that when Linus adds new
patches and I want to rebase my set against the current tree It only
takes 3 commands to rebase the patch set (Assuming all goes well).

guilt-push -a #push all patches onto the stack
git-fetch #pull down the index
guilt-rebase FETCH_HEAD #Rebase our patches should do a merge and
#reapply all patches

These are just some basics about guilt. Jeff has written a better
tutorial with a sample repository for you to work with if your
interested. I don't know if this will help your development process but
I can tell you from experience breaking patches by hand was a pain in
the ass and a huge waste of time and I'm glad to have a tool like this
  

Where can I find that tutorial ?

regards

now.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

  


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Linux and iptables

2008-01-06 Thread willem


Hello,
I have a kubuntu system with a firestarter Firewall.
I did purchase a Sweex router. This router has also a firewall.
If I connect the router then i have two Firewalls.
The result is that no visitor from outside can visit my wiki's, because 
they have to pass 2 Firewalls.


How can I disable the firestarter Firewall with iptables ?
The Sweex router has a Mikkisoft Firewall , so I do not know anything 
about it and I can not disable it.


regards  Wim
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

can we discuss

2018-07-28 Thread willem

Linux and iptables

2008-01-06 Thread willem


Hello,
I have a kubuntu system with a firestarter Firewall.
I did purchase a Sweex router. This router has also a firewall.
If I connect the router then i have two Firewalls.
The result is that no visitor from outside can visit my wiki's, because 
they have to pass 2 Firewalls.


How can I disable the firestarter Firewall with iptables ?
The Sweex router has a Mikkisoft Firewall , so I do not know anything 
about it and I can not disable it.


regards  Wim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: git guidance

2007-11-28 Thread willem


Dave Quigley wrote:

On Wed, 2007-11-28 at 16:57 +0100, Tilman Schmidt wrote:
  

Dave Quigley schrieb:


There is a project listed on the kernel.org git page called guilt. I
find it very useful. It is much more responsive than stgit and it
actually has a git backend which quilt does not.

On Wed, 2007-11-28 at 00:20 +0100, Jan Engelhardt wrote:
  

On Nov 27 2007 23:33, Tilman Schmidt wrote:


Well, it did. So now I'm back to keeping a virgin kernel source tree
alongside my development area in order to produce diffs. That can't
be right?

  

No, it can't. Use stgit/quilt ;p


In which respect would stgit/quilt/guilt help me? At first glance
they just seem to add another level of complexity.

Thanks,
Tilman




These tools allow you to maintain a set of patches with very little
effort. More importantly it removes a lot of the git specifics from your
development process. For example this is how I use guilt for a new patch
set.

I take my fresh tree and do a guilt-init in the base. This will create a
new patch series. I then need to create a patch to modify something LSM
related (guilt-new ). Things like stgit/quilt/git use the
idea of a stack of patches. At this point if you were to type
guilt-series you would only see the one patch we just created. This
patch is going to be one logical set of changes (it should also produce
a compilable and working kernel). You can make whatever modifications
you need to make to your files and at this point you need to do one of
two things. If they were already in the tree you just type guilt-refresh
and under your .git/patches/ directory you will see a file
named  which contains your patch. Otherwise you need to do a
guilt-add  and then a guilt-refresh. The idea here is that
you have a moved your workflow from managing a series of commits and
then breaking out patches from a final version to one where you think in
terms of the patches and make modifications to them instead. In my
example I said I was doing something LSM related. Lets say the first
patch added a new hook and its implementation in the various modules. We
can now add a second patch using the guilt-new command and this one will
add uses of that new hook. At this point we have a stack that looks like
this.




I can pop and push patches onto this stack to have a version of my
kernel tree at any state within the patch set. At this point lets say we
have posted the patch set and have feedback. I need to apply this
feedback to the patch that adds the LSM hook. Since my top patch
(guilt-top) is currently at the one that adds the users of the hook I
need to pop off that patch and get to the one that creates the hook
(guilt-pop). After doing this I'm at a kernel tree state which just has
the changes which add the hook. I make my modifications, type
guilt-refresh to create a new patch and then guilt-push my second patch
on and make sure everything is still working.

As you can see there is almost no git knowledge required to use this
system and it allows you to focus on development instead of the
versioning system. One useful feature is that when Linus adds new
patches and I want to rebase my set against the current tree It only
takes 3 commands to rebase the patch set (Assuming all goes well).

guilt-push -a #push all patches onto the stack
git-fetch #pull down the index
guilt-rebase FETCH_HEAD #Rebase our patches should do a merge and
#reapply all patches

These are just some basics about guilt. Jeff has written a better
tutorial with a sample repository for you to work with if your
interested. I don't know if this will help your development process but
I can tell you from experience breaking patches by hand was a pain in
the ass and a huge waste of time and I'm glad to have a tool like this
  

Where can I find that tutorial ?

regards

now.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

  


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

can we discuss

2018-07-28 Thread willem

Re: LANANA: To Pending Device Number Registrants

2001-05-16 Thread Willem Konynenberg


[EMAIL PROTECTED] wrote:
[lots of sensible comments to get this discussion on track]

 So we have: user space has file names and uses open() or mount().
 Kernel space has device paths.
 
 In principle the kernel could just number the devices it sees 1,2,...
 and export information about them, so that user space can choose
 the right number.
 The part about exporting information is good. User space needs to
 be able to ask if a certain beast is a CD reader, and if so what
 manufacturer and model.
 But the part about numbering 1,2,... may not be good enough, e.g.
 because it does not survive reboots. If we remain Unix-like and use
 device nodes in user space to pair a file name with a number, then
 it would be very nice if the number encoded the device path uniquely.
 Many programs expect this.
 It cannot be done in all cases, but a good approximation is obtained
 if the number is a hash of the device path. In so far the hash is
 collision free we obtain numbers that stay unique over a reboot.

I think here we might learn from the comments that people made
about how AIX and OSF/1/Tru64 have been doing this.

You want the relation between a given /dev/xyz node and the
corresponding device in the kernel to be relatively stable from
the perspective of the user, so you can use it to implement device
naming and access policies.  What stable means technically is a
matter of policy, since it depends very much on the perspective
of the user.

You also want the relation between a given /dev/xyz node and the
major/minor number of the related device to be fixed, so you
don't need to recreate all device nodes on every boot.

The issue is then how to establish the relation between major/minor
numbers and (potentially very dynamic) kernel-internal device
entities, in such a way that the type of stability desired by
the user is provided.

As people noted, the mechanism used by AIX and Tru64 is essentially
to have this relationship established by a user-level tool using
a configuration file with policy information, and a status file
to maintain state information across reboots.
This tool can implement any policy it likes, which means we can
avoid all the policy discusions where the kernel is concerned,
and can instead focus on methods.

For the people who want things to be backward compatible, it
would be sufficient to implement a default tool that will
set all the major/minor numbers exactly like they are now,
so all existing device nodes will keep working as they did.

In order to support hot plugging of devices, I suppose this
tool should either run as a daemon, or set itself up to be
called by the kernel whenever something happens.  (I think
I would prefer the daemon...)   Sort of like /sbin/hotplug.

The next issue to discuss is then how to shape the mechanism by
which the device manager tool and the kernel exchange information
on what devices are out there, and what major/minor numbers each
should be assigned.
The sort of information that a device manager might want to
use, depending on how it is implemented and configured, would
include such things as: device driver id, I/O bus, I/O port,
IRQ nr, SCSI card/bus/id/lun, partition id, serial number,
device model/type, filesystem label, filesystem UUID, phase of
the moon, etc, etc.
Basically, anything that a driver or related subsystem can come
up with that might help identify a device or resource in any
particular situation, depending on user preferences.
I think this may fit somewhere in /proc/sys/dev or thereabout.


The big advantage of this approach is basically that we can
continue to use the good old device nodes in the filesystem
as we have done for the last 30 years, with all the advantages,
but without the major/minor number allocation troubles.


Hmm, this all sounds too simple.
I must be talking nonsense.   Please wake me up.  ;-)



-- 
 Willem Konynenberg [EMAIL PROTECTED]
I am not able rightly to apprehend the kind of confusion of ideas
that could provoke such a question  --  Charles Babbage
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [CHECKER] null bugs in 2.4.4 and 2.4.4-ac8

2001-05-24 Thread Willem Riede


Dawson Engler wrote:
 
 Hi All,
 
 Enclosed are 103 potential errors where code gets a pointer from a
 possibly-failing routine (kmalloc, etc) and dereferences it without

 [BUG] osst_do_scsi will never return NULL if argument SRpnt isn't NULL. But they 
copy SRpnt back by *aSRpnt, implies it could be NULL

No. It implies SRpnt could have changed. The functions flagged
(osst_read_back_buffer_and_rewrite and osst_reposition_and_retry)
cannot be reached with SRpnt == NULL. So these are false alarms.

 
/u2/engler/mc/oses/linux/2.4.4/drivers/scsi/osst.c:1163:osst_read_back_buffer_and_rewrite:
 ERROR:NULL::1163: Using unknown ptr SRpnt illegally! set by 
'osst_do_scsi':1163 [nbytes = 216]
 #if DEBUG
 if (debugging)
 printk(OSST_DEB_MSG osst%d: About to attempt to write to 
frame %d\n, dev, new_block+i);
 #endif
 SRpnt = osst_do_scsi(SRpnt, STp, cmd, OS_FRAME_SIZE, SCSI_DATA_WRITE,
 Start ---
 STp-timeout, MAX_WRITE_RETRIES, TRUE);
 
 ... DELETED 46 lines ...
 
 }
 }
 if (flag) {
 if ((SRpnt-sr_sense_buffer[ 2]  0x0f) == 13 
  SRpnt-sr_sense_buffer[12] ==  0 
 Error ---
  SRpnt-sr_sense_buffer[13] ==  2) {
 printk(KERN_ERR osst%d: Volume overflow in write 
error recovery\n, dev);
 vfree((void *)buffer);
 return (-EIO);  /* hit end of tape = 
fail */
 

Regards. Willem Riede.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [CHECKER] null bugs in 2.4.4 and 2.4.4-ac8

2001-05-24 Thread Willem Riede


Junfeng Yang wrote:
 
 On Thu, 24 May 2001, Willem Riede wrote:
 
  Dawson Engler wrote:
  
   Hi All,
  
   Enclosed are 103 potential errors where code gets a pointer from a
   possibly-failing routine (kmalloc, etc) and dereferences it without
  
   [BUG] osst_do_scsi will never return NULL if argument SRpnt isn't NULL. But they 
copy SRpnt back by *aSRpnt, implies it could be NULL
 
  No. It implies SRpnt could have changed. The functions flagged
  (osst_read_back_buffer_and_rewrite and osst_reposition_and_retry)
  cannot be reached with SRpnt == NULL. So these are false alarms.
 
 these are false positives if osst_read_back_buffer_and rewrite can't be
 reached with SRpnt == NULL. It seems that osst_do_scsi will not change
 SRpnt unless it is NULL though.

That is currently true, and the re-assignment of SRpnt is superfluous but
harmless. It is not a design constraint though that SRpnt cannot change
(except it can't change to NULL), so I prefer to leave the code as is.

 In other words, SRpnt is changed by
 osst_do_scsi = the initial argument SRpnt == NULL. Probabaly the pointer
 aSRpnt is useless.
 
The pointer aSRpnt is not useless, it's used to communicate the current
value of SRpnt throughout the driver.

Regards, Willem Riede.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] wrong stat of NTFS volume in linux-kernel 2.4.0

2001-01-07 Thread Willem Dekker


I tested the 2.4.0 kernel and it reported a wrong volume size for 
ntfs partitions. The size of the partition (as reported by df -k) was
a factor of 8 too high. After some searching I found that 
super.c in /usr/src/linux/fs/ntfs was changed. In my previous kernel 
2.2.16 a division by the number of blocks in a cluster  was made although 
it included a remark about the 64 bit division routine, and thus used only a
32 bit division. 

In the latest kernel (2.4.0) the division is completly removed, and thus the
volume size is incorrect. 
I included a patch for the situation. This routine presupposes that the 
number of blocks in a cluster is always a power of 2, and uses right shifts.
As far as I know all NTFS cluster sizes are a power of two, which is confirmed
by the options of the format command in NT and some knowledgebase articles
of Microsoft. If not an message will appear in the syslog. (printk)

Willem Dekker


Patch to adapt super.c of the ntfs filesystem. On a stock linux
2.4.0 kernel. 


--- fs/ntfs/super.orig  Tue Nov  7 20:22:35 2000
+++ fs/ntfs/super.c Sun Jan  7 22:34:43 2001
@@ -300,6 +300,82 @@
return 0;
 }
 
+static ntfs_u64 div_by_power_of_2(ntfs_u64 numerator, int denominator) 
+{
+   switch(denominator) {
+   case 0 : 
+   printk("Division by zero error at div_by_power_of_2");
+   return 0;
+   case 1 :
+   return numerator ;
+   case 2 :
+   return numerator  1;
+   case 4 :
+   return numerator  2;
+   case 8 :
+   return numerator  3;
+   case 16 :
+   return numerator  4;
+   case 32 :
+   return numerator  5;
+   case 64 :
+   return numerator  6;
+   case 128 :
+   return numerator  7;
+   case 256 :
+   return numerator  8;
+   case 512 :
+   return numerator  9;
+   case 1024 :
+   return numerator  10;
+   case 2048 :
+   return numerator  11;
+   case 4096 :
+   return numerator  12;
+   case 8192 :
+   return numerator  13;
+   case 16384 :
+   return numerator  14;
+   case 32768 :
+   return numerator  15;
+   case 65536 :
+   return numerator  16;
+   case 131072 :
+   return numerator  17;
+   case 262144 :
+   return numerator  18;
+   case 524288 :
+   return numerator  19;
+   case 1048576 :
+   return numerator  20;
+   case 2097152 :
+   return numerator  21;
+   case 4194304 :
+   return numerator  22;
+   case 8388608 :
+   return numerator  23;
+   case 16777216 :
+   return numerator  24;
+   case 33554432 :
+   return numerator  25;
+   case 67108864 :
+   return numerator  26;
+   case 134217728 :
+   return numerator  27;
+   case 268435456 :
+   return numerator  28;
+   case 536870912 :
+   return numerator  29;
+   case 1073741824 :
+   return numerator  30;
+   case 2147483648U :
+   return numerator  31;
+   default : 
+   printk("Not dividing by power of 2 in div_by_power_of_2!!");
+   return 0;
+   }
+   return 0;
+}
 /*
  * Writes the volume size into vol_size. Returns 0 if successful
  * or error.
@@ -323,6 +399,7 @@
io.size=vol-clustersize;
ntfs_getput_clusters(vol,0,0,io);
*vol_size = NTFS_GETU64(cluster0+0x28);
+*vol_size = div_by_power_of_2(*vol_size,(vol-clusterfactor));
ntfs_free(cluster0);
return 0;
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: osst ide-2.2.19 conflict?

2001-06-23 Thread Willem Riede


Anil B. Somayaji wrote:
 
 In the ide.2.2.19.05042001 patch, there is the following bit of code
 in ide-scsi.c, which prevents the ide-scsi driver from allowing access
 to an OnStream DI-30 tape drive.  This is strange, since this same
 drive can be used with the included ide-scsi + osst drivers in the
 stock 2.2.19 kernel.  If you delete this bit, however, ide-scsi
 recognizes the drive, and the osst driver seems to work fine.
 
 Here's the offending code:
 
   #ifndef CONFIG_BLK_DEV_IDETAPE
/*
 * The Onstream DI-30 does not handle clean emulation, yet.
 */
if (strstr(drive-id-model, OnStream DI-30)) {
printk(ide-tape: ide-scsi emulation is not supported for %s.\n, 
drive-id-model);
continue;
}
   #endif /* CONFIG_BLK_DEV_IDETAPE */
 
 Any reason for this to stay in the ide patch, or is it now obsolete?
 
It is obsolete, and can safely be removed.

Success. Willem Riede.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [2.6 patch] drivers/scsi/osst.c: make code static

2005-03-22 Thread Willem Riede

On 03/22/2005 09:28:07 AM, Adrian Bunk wrote:
 This patch makes needlessly global code static.
 
 Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

James, I agree with this, can you put it in BK, please?

Signed-off-by: Willem Riede [EMAIL PROTECTED]
 
 ---
 
 This patch was already sent on:
 - 28 Feb 2005
 
  drivers/scsi/osst.c |4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)
 
 --- linux-2.6.11-rc4-mm1-full/drivers/scsi/osst.c.old 2005-02-28 
 19:36:05.0 +0100
 +++ linux-2.6.11-rc4-mm1-full/drivers/scsi/osst.c 2005-02-28 
 19:36:25.0 +0100
 @@ -24,7 +24,7 @@
  */
  
  static const char * cvsid = $Id: osst.c,v 1.73 2005/01/01 21:13:34 wriede 
 Exp $;
 -const char * osst_version = 0.99.3;
 +static const char * osst_version = 0.99.3;
  
  /* The failure to reconnect firmware bug */
  #define OSST_FW_NEED_POLL_MIN 10601 /*(107A)*/
 @@ -170,7 +170,7 @@
  static int osst_probe(struct device *);
  static int osst_remove(struct device *);
  
 -struct scsi_driver osst_template = {
 +static struct scsi_driver osst_template = {
   .owner  = THIS_MODULE,
   .gendrv = {
   .name   =  osst,
 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [2.6 patch] drivers/scsi/osst.c: remove unused code

2005-03-23 Thread Willem Riede

On 03/22/2005 04:59:48 PM, Adrian Bunk wrote:
 Thanks to both the Coverity checker and GNU gcc, it was found that this 
 variable is completely unused.
 
 Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

And it is so obvious when your attention is drawn to it.
The code that did use it moved to os_scsi_tape_flush() recently.

James, could you make this change in BK too, please?

Signed-off-by: Willem Riede [EMAIL PROTECTED]

 --- linux-2.6.12-rc1-mm1-full/drivers/scsi/osst.c.old 2005-03-22 
 21:04:36.0 +0100
 +++ linux-2.6.12-rc1-mm1-full/drivers/scsi/osst.c 2005-03-22 
 22:09:32.0 +0100
 @@ -4770,9 +4770,6 @@ static int os_scsi_tape_close(struct ino
  {
   int   result = 0;
   struct osst_tape* STp= filp-private_data;
 - struct scsi_request * SRpnt  = NULL;
 -
 - if (SRpnt) scsi_release_request(SRpnt);
  
   if (STp-door_locked == ST_LOCKED_AUTO)
   do_door_lock(STp, 0);
 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: NCQ support NVidia NForce4 (CK804) SATAII

2005-08-14 Thread Willem Riede

On Wed, 10 Aug 2005 20:54:35 +, Allen Martin wrote:

 Likely the only way nForce4 NCQ support could be added under Linux would
 be with a closed source binary driver, and no one really wants that,
 especially for storage / boot volume.  We decided it wasn't worth the
 headache of a binary driver for this one feature.  Future nForce
 chipsets will have a redesigned SATA controller where we can be more
 open about documenting it.

That is disappointing. I was seriously considering a motherboard with your
chipset because of its impressive specifications, but now I have to
reconsider (I'm one of those folks that never bought an nVidia graphics
board due to the lack of open full-function driver). I _hate_ not being
able to use all features.

Any chance your company will reconsider? Are you in a position to make
that decision? Is there a VP I can write to (politely) to support the case?

I don't understand your reluctance in this case (I do for your graphics
processors) - it's not as if adding this function to sata_nv would
expose your crown jewels - you write yourself that next time you'd use a
different (better) interface...

Regards, Willem Riede.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: LANANA: To Pending Device Number Registrants

2001-05-16 Thread Willem Konynenberg

[EMAIL PROTECTED] wrote:
[lots of sensible comments to get this discussion "on track"]

> So we have: user space has file names and uses open() or mount().
> Kernel space has device paths.
> 
> In principle the kernel could just number the devices it sees 1,2,...
> and export information about them, so that user space can choose
> the right number.
> The part about exporting information is good. User space needs to
> be able to ask if a certain beast is a CD reader, and if so what
> manufacturer and model.
> But the part about numbering 1,2,... may not be good enough, e.g.
> because it does not survive reboots. If we remain Unix-like and use
> device nodes in user space to pair a file name with a number, then
> it would be very nice if the number encoded the device path uniquely.
> Many programs expect this.
> It cannot be done in all cases, but a good approximation is obtained
> if the number is a hash of the device path. In so far the hash is
> collision free we obtain numbers that stay unique over a reboot.

I think here we might learn from the comments that people made
about how AIX and OSF/1/Tru64 have been doing this.

You want the relation between a given /dev/xyz node and the
corresponding device in the kernel to be relatively stable from
the perspective of the user, so you can use it to implement device
naming and access policies.  What "stable" means technically is a
matter of policy, since it depends very much on the perspective
of the user.

You also want the relation between a given /dev/xyz node and the
major/minor number of the related device to be "fixed", so you
don't need to recreate all device nodes on every boot.

The issue is then how to establish the relation between major/minor
numbers and (potentially very dynamic) kernel-internal device
entities, in such a way that the type of stability desired by
the user is provided.

As people noted, the mechanism used by AIX and Tru64 is essentially
to have this relationship established by a user-level tool using
a configuration file with policy information, and a status file
to maintain state information across reboots.
This tool can implement any policy it likes, which means we can
avoid all the policy discusions where the kernel is concerned,
and can instead focus on methods.

For the people who want things to be backward compatible, it
would be sufficient to implement a default tool that will
set all the major/minor numbers exactly like they are now,
so all existing device nodes will keep working as they did.

In order to support hot plugging of devices, I suppose this
tool should either run as a daemon, or set itself up to be
called by the kernel whenever something happens.  (I think
I would prefer the daemon...)   Sort of like /sbin/hotplug.

The next issue to discuss is then how to shape the mechanism by
which the device manager tool and the kernel exchange information
on what devices are out there, and what major/minor numbers each
should be assigned.
The sort of information that a device manager might want to
use, depending on how it is implemented and configured, would
include such things as: device driver id, I/O bus, I/O port,
IRQ nr, SCSI card/bus/id/lun, partition id, serial number,
device model/type, filesystem label, filesystem UUID, phase of
the moon, etc, etc.
Basically, anything that a driver or related subsystem can come
up with that might help identify a device or resource in any
particular situation, depending on user preferences.
I think this may fit somewhere in /proc/sys/dev or thereabout.

The big advantage of this approach is basically that we can
continue to use the good old device nodes in the filesystem
as we have done for the last 30 years, with all the advantages,
but without the major/minor number allocation troubles.

Hmm, this all sounds too simple.
I must be talking nonsense.   Please wake me up.  ;-)

-- 
 Willem Konynenberg <[EMAIL PROTECTED]>
I am not able rightly to apprehend the kind of confusion of ideas
that could provoke such a question  --  Charles Babbage
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [CHECKER] null bugs in 2.4.4 and 2.4.4-ac8

2001-05-24 Thread Willem Riede


Dawson Engler wrote:
> 
> Hi All,
> 
> Enclosed are 103 potential errors where code gets a pointer from a
> possibly-failing routine (kmalloc, etc) and dereferences it without
>
> [BUG] osst_do_scsi will never return NULL if argument SRpnt isn't NULL. But they 
>copy SRpnt back by *aSRpnt, implies it could be NULL

No. It implies SRpnt could have changed. The functions flagged
(osst_read_back_buffer_and_rewrite and osst_reposition_and_retry)
cannot be reached with SRpnt == NULL. So these are false alarms.

> 
>/u2/engler/mc/oses/linux/2.4.4/drivers/scsi/osst.c:1163:osst_read_back_buffer_and_rewrite:
> ERROR:NULL::1163: Using unknown ptr "SRpnt" illegally! set by 
>'osst_do_scsi':1163 [nbytes = 216]
> #if DEBUG
> if (debugging)
> printk(OSST_DEB_MSG "osst%d: About to attempt to write to 
>frame %d\n", dev, new_block+i);
> #endif
> SRpnt = osst_do_scsi(SRpnt, STp, cmd, OS_FRAME_SIZE, SCSI_DATA_WRITE,
> Start --->
> STp->timeout, MAX_WRITE_RETRIES, TRUE);
> 
> ... DELETED 46 lines ...
> 
> }
> }
> if (flag) {
> if ((SRpnt->sr_sense_buffer[ 2] & 0x0f) == 13 &&
>  SRpnt->sr_sense_buffer[12] ==  0 &&
> Error --->
>  SRpnt->sr_sense_buffer[13] ==  2) {
> printk(KERN_ERR "osst%d: Volume overflow in write 
>error recovery\n", dev);
> vfree((void *)buffer);
> return (-EIO);  /* hit end of tape = 
>fail */
> 

Regards. Willem Riede.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [CHECKER] null bugs in 2.4.4 and 2.4.4-ac8

2001-05-24 Thread Willem Riede

Junfeng Yang wrote:
> 
> On Thu, 24 May 2001, Willem Riede wrote:
> 
> > Dawson Engler wrote:
> > >
> > > Hi All,
> > >
> > > Enclosed are 103 potential errors where code gets a pointer from a
> > > possibly-failing routine (kmalloc, etc) and dereferences it without
> > >
> > > [BUG] osst_do_scsi will never return NULL if argument SRpnt isn't NULL. But they 
>copy SRpnt back by *aSRpnt, implies it could be NULL
> >
> > No. It implies SRpnt could have changed. The functions flagged
> > (osst_read_back_buffer_and_rewrite and osst_reposition_and_retry)
> > cannot be reached with SRpnt == NULL. So these are false alarms.
> 
> these are false positives if osst_read_back_buffer_and rewrite can't be
> reached with SRpnt == NULL. It seems that osst_do_scsi will not change
> SRpnt unless it is NULL though.

That is currently true, and the re-assignment of SRpnt is superfluous but
harmless. It is not a design constraint though that SRpnt cannot change
(except it can't change to NULL), so I prefer to leave the code as is.

> In other words, SRpnt is changed by
> osst_do_scsi <=> the initial argument SRpnt == NULL. Probabaly the pointer
> aSRpnt is useless.
> 
The pointer aSRpnt is not useless, it's used to communicate the current
value of SRpnt throughout the driver.

Regards, Willem Riede.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: PATCH: ide-scsi.c to allow claiming of Onstream drives

2000-09-13 Thread Willem Riede

Andre Hedrick wrote:
> 
> On Wed, 13 Sep 2000, Matthew Dharm wrote:
> 
> > Attached is a patch to ide-scsi.c against 2.4.0-test8.  Please consider
> > applying it.
> >
> > This patch removes the logic which causes ide-scsi to refuse to attach
> > itself to an IDE OnStream drive.  While these drives are not supported by
> > the standard st.c driver, there is userspace support (using the sg
> > interface) which works with the ATAPI drive with ide-scsi patched with this
> > patch.
> >
> > So there really is no reason for ide-scsi to reject this device.
> 
> Yes,
> 
> There is a reason that it was put there, DI-30's require special IO that
> st.o does not know how to do.
> 
> Look at all the hoops that it took to get ide-tape to work.
> If ide-scsi would have done the trick, it would have saved headaches.

Indeed. But in the mean time, a scsi driver for Onstream drives has
been written (by me). Together with Kai Makisara I have resolved how 
to get those drives detected by this new driver (osst) and not by st.

The second patch that Matt refers to inserts that logic into st so
it does not try to support these drives and fails due to the drive
particulars. Since SCSI, USB and IDE versions of these drives exist,
patching st is more appropriate than ide-scsi.

Regards. Willem Riede.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: osst & ide-2.2.19 conflict?

2001-06-23 Thread Willem Riede


"Anil B. Somayaji" wrote:
> 
> In the ide.2.2.19.05042001 patch, there is the following bit of code
> in ide-scsi.c, which prevents the ide-scsi driver from allowing access
> to an OnStream DI-30 tape drive.  This is strange, since this same
> drive can be used with the included ide-scsi + osst drivers in the
> stock 2.2.19 kernel.  If you delete this bit, however, ide-scsi
> recognizes the drive, and the osst driver seems to work fine.
> 
> Here's the offending code:
> 
>   #ifndef CONFIG_BLK_DEV_IDETAPE
>/*
> * The Onstream DI-30 does not handle clean emulation, yet.
> */
>if (strstr(drive->id->model, "OnStream DI-30")) {
>printk("ide-tape: ide-scsi emulation is not supported for %s.\n", 
>drive->id->model);
>continue;
>}
>   #endif /* CONFIG_BLK_DEV_IDETAPE */
> 
> Any reason for this to stay in the ide patch, or is it now obsolete?
> 
It is obsolete, and can safely be removed.

Success. Willem Riede.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] wrong stat of NTFS volume in linux-kernel 2.4.0

2001-01-07 Thread Willem Dekker


I tested the 2.4.0 kernel and it reported a wrong volume size for 
ntfs partitions. The size of the partition (as reported by df -k) was
a factor of 8 too high. After some searching I found that 
super.c in /usr/src/linux/fs/ntfs was changed. In my previous kernel 
2.2.16 a division by the number of blocks in a cluster  was made although 
it included a remark about the 64 bit division routine, and thus used only a
32 bit division. 

In the latest kernel (2.4.0) the division is completly removed, and thus the
volume size is incorrect. 
I included a patch for the situation. This routine presupposes that the 
number of blocks in a cluster is always a power of 2, and uses right shifts.
As far as I know all NTFS cluster sizes are a power of two, which is confirmed
by the options of the format command in NT and some knowledgebase articles
of Microsoft. If not an message will appear in the syslog. (printk)

Willem Dekker


Patch to adapt super.c of the ntfs filesystem. On a stock linux
2.4.0 kernel. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

--- fs/ntfs/super.orig  Tue Nov  7 20:22:35 2000
+++ fs/ntfs/super.c Sun Jan  7 22:34:43 2001
@@ -300,6 +300,82 @@
return 0;
 }
 
+static ntfs_u64 div_by_power_of_2(ntfs_u64 numerator, int denominator) 
+{
+   switch(denominator) {
+   case 0 : 
+   printk("Division by zero error at div_by_power_of_2");
+   return 0;
+   case 1 :
+   return numerator ;
+   case 2 :
+   return numerator >> 1;
+   case 4 :
+   return numerator >> 2;
+   case 8 :
+   return numerator >> 3;
+   case 16 :
+   return numerator >> 4;
+   case 32 :
+   return numerator >> 5;
+   case 64 :
+   return numerator >> 6;
+   case 128 :
+   return numerator >> 7;
+   case 256 :
+   return numerator >> 8;
+   case 512 :
+   return numerator >> 9;
+   case 1024 :
+   return numerator >> 10;
+   case 2048 :
+   return numerator >> 11;
+   case 4096 :
+   return numerator >> 12;
+   case 8192 :
+   return numerator >> 13;
+   case 16384 :
+   return numerator >> 14;
+   case 32768 :
+   return numerator >> 15;
+   case 65536 :
+   return numerator >> 16;
+   case 131072 :
+   return numerator >> 17;
+   case 262144 :
+   return numerator >> 18;
+   case 524288 :
+   return numerator >> 19;
+   case 1048576 :
+   return numerator >> 20;
+   case 2097152 :
+   return numerator >> 21;
+   case 4194304 :
+   return numerator >> 22;
+   case 8388608 :
+   return numerator >> 23;
+   case 16777216 :
+   return numerator >> 24;
+   case 33554432 :
+   return numerator >> 25;
+   case 67108864 :
+   return numerator >> 26;
+   case 134217728 :
+   return numerator >> 27;
+   case 268435456 :
+   return numerator >> 28;
+   case 536870912 :
+   return numerator >> 29;
+   case 1073741824 :
+   return numerator >> 30;
+   case 2147483648U :
+   return numerator >> 31;
+   default : 
+   printk("Not dividing by power of 2 in div_by_power_of_2!!");
+   return 0;
+   }
+   return 0;
+}
 /*
  * Writes the volume size into vol_size. Returns 0 if successful
  * or error.
@@ -323,6 +399,7 @@
io.size=vol->clustersize;
ntfs_getput_clusters(vol,0,0,);
*vol_size = NTFS_GETU64(cluster0+0x28);
+*vol_size = div_by_power_of_2(*vol_size,(vol->clusterfactor));
ntfs_free(cluster0);
return 0;
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: [2.6 patch] drivers/scsi/osst.c: make code static

2005-03-22 Thread Willem Riede

On 03/22/2005 09:28:07 AM, Adrian Bunk wrote:
> This patch makes needlessly global code static.
> 
> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>

James, I agree with this, can you put it in BK, please?

Signed-off-by: Willem Riede <[EMAIL PROTECTED]>
 
> ---
> 
> This patch was already sent on:
> - 28 Feb 2005
> 
>  drivers/scsi/osst.c |4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> --- linux-2.6.11-rc4-mm1-full/drivers/scsi/osst.c.old 2005-02-28 
> 19:36:05.0 +0100
> +++ linux-2.6.11-rc4-mm1-full/drivers/scsi/osst.c 2005-02-28 
> 19:36:25.0 +0100
> @@ -24,7 +24,7 @@
>  */
>  
>  static const char * cvsid = "$Id: osst.c,v 1.73 2005/01/01 21:13:34 wriede 
> Exp $";
> -const char * osst_version = "0.99.3";
> +static const char * osst_version = "0.99.3";
>  
>  /* The "failure to reconnect" firmware bug */
>  #define OSST_FW_NEED_POLL_MIN 10601 /*(107A)*/
> @@ -170,7 +170,7 @@
>  static int osst_probe(struct device *);
>  static int osst_remove(struct device *);
>  
> -struct scsi_driver osst_template = {
> +static struct scsi_driver osst_template = {
>   .owner  = THIS_MODULE,
>   .gendrv = {
>   .name   =  "osst",
> 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [2.6 patch] drivers/scsi/osst.c: remove unused code

2005-03-23 Thread Willem Riede

On 03/22/2005 04:59:48 PM, Adrian Bunk wrote:
> Thanks to both the Coverity checker and GNU gcc, it was found that this 
> variable is completely unused.
> 
> Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>

And it is so obvious when your attention is drawn to it.
The code that did use it moved to os_scsi_tape_flush() recently.

James, could you make this change in BK too, please?

Signed-off-by: Willem Riede <[EMAIL PROTECTED]>

> --- linux-2.6.12-rc1-mm1-full/drivers/scsi/osst.c.old 2005-03-22 
> 21:04:36.0 +0100
> +++ linux-2.6.12-rc1-mm1-full/drivers/scsi/osst.c 2005-03-22 
> 22:09:32.0 +0100
> @@ -4770,9 +4770,6 @@ static int os_scsi_tape_close(struct ino
>  {
>   int   result = 0;
>   struct osst_tape* STp= filp->private_data;
> - struct scsi_request * SRpnt  = NULL;
> -
> - if (SRpnt) scsi_release_request(SRpnt);
>  
>   if (STp->door_locked == ST_LOCKED_AUTO)
>   do_door_lock(STp, 0);
> 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: NCQ support NVidia NForce4 (CK804) SATAII

2005-08-14 Thread Willem Riede

On Wed, 10 Aug 2005 20:54:35 +, Allen Martin wrote:

> Likely the only way nForce4 NCQ support could be added under Linux would
> be with a closed source binary driver, and no one really wants that,
> especially for storage / boot volume.  We decided it wasn't worth the
> headache of a binary driver for this one feature.  Future nForce
> chipsets will have a redesigned SATA controller where we can be more
> open about documenting it.

That is disappointing. I was seriously considering a motherboard with your
chipset because of its impressive specifications, but now I have to
reconsider (I'm one of those folks that never bought an nVidia graphics
board due to the lack of open full-function driver). I _hate_ not being
able to use all features.

Any chance your company will reconsider? Are you in a position to make
that decision? Is there a VP I can write to (politely) to support the case?

I don't understand your reluctance in this case (I do for your graphics
processors) - it's not as if adding this function to sata_nv would
expose your crown jewels - you write yourself that next time you'd use a
different (better) interface...

Regards, Willem Riede.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: status of kernel memory debugging?

2005-07-30 Thread Willem de Bruijn

 UML is still too strange for valgrind, despite progress on both sides
 (valgrind accepting more strange things and UML becoming less
 strange).

 I tried grinding UML a month or so ago, and its use of clone was a
 sticking point.


I think I read your remark, yes. Thanks for the update; I hoped some strides 
had been made in that direction since then. Personally,  I know to little 
about the topic to be of any use.

I guess the best option is then using slab caches per object type, so that I 
can at least find obvious memory leaks. 

On a sidenote, it'll be interesting to see what valgrind reports once (if?) 
the kernel gets a good grinding. 

Willem
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net] ip6tnl: fix use after free of fb_tnl_dev

2013-11-14 Thread Willem de Bruijn

On Thu, Nov 14, 2013 at 9:47 AM, Nicolas Dichtel
nicolas.dich...@6wind.com wrote:
 Bug has been introduced by commit bb8140947a24 (ip6tnl: allow to use rtnl ops
 on fb tunnel).

 When ip6_tunnel.ko is unloaded, FB device is delete by rtnl_link_unregister()
 and then we try to use the pointer in ip6_tnl_destroy_tunnels().

 Let's add an handler for dellink, which will never remove the FB tunnel. With
 this patch it will no more be possible to remove it via 'ip link del ip6tnl0',
 but it's safer.

 The same fix was already proposed by Willem de Bruijn will...@google.com for
 sit interfaces.

 CC: Willem de Bruijn will...@google.com
 Reported-by: Steven Rostedt rost...@goodmis.org
 Signed-off-by: Nicolas Dichtel nicolas.dich...@6wind.com

Acked-by: Willem de Bruijn will...@google.com

Also ran a test similar to the one for sit: `modprobe ip6_tunnel;
rmmod ip6_tunnel` with CONFIG_DEBUG_SLAB=y. This exposed the bug at
HEAD, completes successfully with the patch applied.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v10 net-next 2/6] net: add low latency socket poll

2013-06-10 Thread Willem de Bruijn

On Mon, Jun 10, 2013 at 10:29 AM, Eric Dumazet eric.duma...@gmail.com wrote:
 On Mon, 2013-06-10 at 11:39 +0300, Eliezer Tamir wrote:
 Adds an ndo_ll_poll method and the code that supports it.
 This method can be used by low latency applications to busy-poll
 Ethernet device queues directly from the socket code.
 sysctl_net_ll_poll controls how many microseconds to poll.
 Default is zero (disabled).
 Individual protocol support will be added by subsequent patches.

 Signed-off-by: Alexander Duyck alexander.h.du...@intel.com
 Signed-off-by: Jesse Brandeburg jesse.brandeb...@intel.com
 Signed-off-by: Eliezer Tamir eliezer.ta...@linux.intel.com
 ---

  Documentation/sysctl/net.txt |7 ++
  include/linux/netdevice.h|3 +
  include/linux/skbuff.h   |8 ++
  include/net/ll_poll.h|  148 
 ++
  include/net/sock.h   |4 +
  include/uapi/linux/snmp.h|1
  net/Kconfig  |   12 +++
  net/core/skbuff.c|4 +
  net/core/sock.c  |6 ++
  net/core/sysctl_net_core.c   |   10 +++
  net/ipv4/proc.c  |1
  net/socket.c |6 ++
  12 files changed, 208 insertions(+), 2 deletions(-)
  create mode 100644 include/net/ll_poll.h

 Acked-by: Eric Dumazet eduma...@google.com


Tested-by: Willem de Bruijn will...@google.com

Per Eliezer's request, I applied v10 to a small set of workloads
(netperf tcp_rr, udp_rr, 100x tcp_rr, 100x udp_rr and a poll()-based
tcp rr) for some functional testing.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v10 net-next 1/6] net: add napi_id and hash

2013-06-10 Thread Willem de Bruijn

On Mon, Jun 10, 2013 at 5:22 AM, Eric Dumazet eric.duma...@gmail.com wrote:
 On Mon, 2013-06-10 at 11:39 +0300, Eliezer Tamir wrote:
 Adds a napi_id and a hashing mechanism to lookup a napi by id.
 This will be used by subsequent patches to implement low latency
 Ethernet device polling.
 Based on a code sample by Eric Dumazet.

 Signed-off-by: Eliezer Tamir eliezer.ta...@linux.intel.com
 ---

  include/linux/netdevice.h |   29 ++
  net/core/dev.c|   59 
 +
  2 files changed, 88 insertions(+), 0 deletions(-)

 Signed-off-by: Eric Dumazet eduma...@google.com


Tested-by: Willem de Bruijn will...@google.com

(per Eliezer's request)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v10 net-next 3/6] udp: add low latency socket poll support

2013-06-10 Thread Willem de Bruijn

On Mon, Jun 10, 2013 at 10:31 AM, Eric Dumazet eric.duma...@gmail.com wrote:
 On Mon, 2013-06-10 at 11:40 +0300, Eliezer Tamir wrote:
 Add upport for busy-polling on UDP sockets.
 In __udp[46]_lib_rcv add a call to sk_mark_ll() to copy the napi_id
 from the skb into the sk.
 This is done at the earliest possible moment, right after we identify
 which socket this skb is for.
 In __skb_recv_datagram When there is no data and the user
 tries to read we busy poll.

 Signed-off-by: Alexander Duyck alexander.h.du...@intel.com
 Signed-off-by: Jesse Brandeburg jesse.brandeb...@intel.com
 Signed-off-by: Eliezer Tamir eliezer.ta...@linux.intel.com
 ---

  net/core/datagram.c |4 
  net/ipv4/udp.c  |6 +-
  net/ipv6/udp.c  |6 +-
  3 files changed, 14 insertions(+), 2 deletions(-)

 Acked-by: Eric Dumazet eduma...@google.com


Tested-by: Willem de Bruijn will...@google.com

(per Eliezer's request)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v10 net-next 4/6] tcp: add low latency socket poll support.

2013-06-10 Thread Willem de Bruijn

On Mon, Jun 10, 2013 at 10:32 AM, Eric Dumazet eric.duma...@gmail.com wrote:
 On Mon, 2013-06-10 at 11:40 +0300, Eliezer Tamir wrote:
 Adds low latency socket poll support for TCP.
 In tcp_v[46]_rcv() add a call to sk_mark_ll() to copy the napi_id
 from the skb to the sk.
 In tcp_recvmsg(), when there is no data in the socket we busy-poll.
 This is a good example of how to add busy-poll support to more protocols.

 Signed-off-by: Alexander Duyck alexander.h.du...@intel.com
 Signed-off-by: Jesse Brandeburg jesse.brandeb...@intel.com
 Signed-off-by: Eliezer Tamir eliezer.ta...@linux.intel.com
 ---

 Acked-by: Eric Dumazet eduma...@google.com


Tested-by: Willem de Bruijn will...@google.com

(per Eliezer's request)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v9 net-next 2/7] net: add low latency socket poll

2013-06-05 Thread Willem de Bruijn

On Wed, Jun 5, 2013 at 9:23 AM, Eric Dumazet eric.duma...@gmail.com wrote:
 On Wed, 2013-06-05 at 13:34 +0300, Eliezer Tamir wrote:
 Adds an ndo_ll_poll method and the code that supports it.
 This method can be used by low latency applications to busy-poll
 Ethernet device queues directly from the socket code.
 sysctl_net_ll_poll controls how many microseconds to poll.
 Default is zero (disabled).
 Individual protocol support will be added by subsequent patches.

 Signed-off-by: Alexander Duyck alexander.h.du...@intel.com
 Signed-off-by: Jesse Brandeburg jesse.brandeb...@intel.com
 Tested-by: Willem de Bruijn will...@google.com
 Signed-off-by: Eliezer Tamir eliezer.ta...@linux.intel.com
 ---

 Are you sure this version was tested by Willem ?

I hadn't tested the latest revisions, indeed. Am testing this one right now.

 Acked-by: Eric Dumazet eduma...@google.com



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 net-next 3/4] ixgbe: Add support for ndo_ll_poll

2013-05-21 Thread Willem de Bruijn

On Tue, May 21, 2013 at 2:54 AM, Eliezer Tamir
eliezer.ta...@linux.intel.com wrote:
 On 20/05/2013 23:20, Or Gerlitz wrote:

 On Mon, May 20, 2013 at 1:16 PM, Eliezer Tamir
 eliezer.ta...@linux.intel.com wrote:

 Add the ixgbe driver code implementing ndo_ll_poll.
 It should be easy for other drivers to do something similar
 in order to enable support for CONFIG_INET_LL_RX_POLL


 I am not sure,


 Willem ported this to some undisclosed HW that they use at Google, his
 feedback was that it was not a major effort.

The core ndo_ll_poll implementation is generally a subset of a device
driver's existing napi callback. It cleans the queues, but it skips
napi_complete and unmasking of the IRQ.

+   ixgbe_for_each_ring(ring, q_vector-rx) {
+   found = ixgbe_clean_rx_irq(q_vector, ring, 4);
+   if (found)
+   break;
+   }

A subtle difference in the above code vs ixgbe_poll is that the
callback returns as soon as some data arrived on a queue, as opposed
to iterating over all queues. The budget is lower, too. Since not all
data arriving is necessarily destined towards polling socket, this may
or may not be an improvement.

Besides that, the driver has to mark the packet with
ll_mark_skb(cq-napi, skb);

On devices where tx completion interrupts share the same IRQ as rx
interrupts, the driver may also have to clean the tx queue once in a
while (at obvious tail latency cost). LLS does not disable the IRQ,
but I think the suggestion was to set its moderation threshold very
high to avoid net_rx_action/LLS lock contention. If so, starvation may
occur.

The most difficult bit is handling mutual exclusion with the
interrupt-driven receive path. The ixgbe port has its own internal
locking mechanism in anticipation of future use cases that can be
lock-free. As first approximation, I just took the napi-poll_lock,
similar to how netpoll handles mutual exclusion with net_rx_action.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 net-next 0/4] net: low latency Ethernet device polling

2013-05-21 Thread Willem de Bruijn

 2. How is the logic aware of RSS and RFS?

 With TCP sockets, the driver knows the specific ring it need to poll so
 this should be mapped and provide the best latency.


 This code is blissfully oblivious of RFS and RSS, it only assumes that the
 packets for a socket are likely to continue to come on the same queue.
 The code is designed to be correct even if you get your data on the wrong
 queue. (your performance will suffer but no more than that.)


For low latency, you don't want to have to wait for the IPI that RPS
sets to redirect packets to another CPU.  However, this feature works
extremely well with hardware flow steering (nfc/ntuple) and
accelerated RFS, where the device will enqueue directly on an rxqueue
owned exclusively by the destination cpu (if configured correctly).
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/2] [BUG FIXES - 3.10.27] sit: More backports

2014-01-29 Thread Willem de Bruijn

 From d101450583c3a472a2a94904cfe13fd4e7d2f519 Mon Sep 17 00:00:00 2001
 From: Nicolas Dichtel nicolas.dich...@6wind.com
 Date: Wed, 29 Jan 2014 16:40:30 +0100
 Subject: [PATCH] sit: fix double free of fb_tunnel_dev on exit

 This problem was fixed upstream by commit 9434266f2c64 (sit: fix use after 
 free
 of fb_tunnel_dev).
 The upstream patch depends on upstream commit 5e6700b3bf98 (sit: add support 
 of
 x-netns), which was not backported into 3.10 branch.

 First, explain the problem: when the sit module is unloaded, sit_cleanup() is
 called.
 rmmod sit
 = sit_cleanup()
   = rtnl_link_unregister()
 = __rtnl_kill_links()
   = for_each_netdev(net, dev) {
 if (dev-rtnl_link_ops == ops)
 ops-dellink(dev, list_kill);
 }
 At this point, the FB device is deleted (and all sit tunnels).
   = unregister_pernet_device()
 = unregister_pernet_operations()
   = ops_exit_list()
 = sit_exit_net()
   = sit_destroy_tunnels()
   In this function, no tunnel is found.
   = unregister_netdevice_queue(sitn-fb_tunnel_dev, list);
 We delete the FB device a second time here!

 Because we cannot simply remove the second deletion (sit_exit_net() must 
 remove
 the FB device when a netns is deleted), we add an rtnl ops which delete all 
 sit
 device excepting the FB device and thus we can keep the explicit deletion in
 sit_exit_net().

 CC: Steven Rostedt rost...@goodmis.org
 Signed-off-by: Nicolas Dichtel nicolas.dich...@6wind.com
 ---
  net/ipv6/sit.c | 10 ++
  1 file changed, 10 insertions(+)

 diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
 index 0491264b8bfc..620d326e8fdd 100644
 --- a/net/ipv6/sit.c
 +++ b/net/ipv6/sit.c
 @@ -1507,6 +1507,15 @@ static const struct nla_policy 
 ipip6_policy[IFLA_IPTUN_MAX + 1] = {
  #endif
  };

 +static void ipip6_dellink(struct net_device *dev, struct list_head *head)
 +{
 +   struct net *net = dev_net(dev);
 +   struct sit_net *sitn = net_generic(net, sit_net_id);
 +
 +   if (dev != sitn-fb_tunnel_dev)
 +   unregister_netdevice_queue(dev, head);
 +}
 +
  static struct rtnl_link_ops sit_link_ops __read_mostly = {
 .kind   = sit,
 .maxtype= IFLA_IPTUN_MAX,
 @@ -1517,6 +1526,7 @@ static struct rtnl_link_ops sit_link_ops __read_mostly 
 = {
 .changelink = ipip6_changelink,
 .get_size   = ipip6_get_size,
 .fill_info  = ipip6_fill_info,
 +   .dellink= ipip6_dellink,
  };

  static struct xfrm_tunnel sit_handler __read_mostly = {
 --
 1.8.4.1

This looks good to me. It is the same as the backport sit: fix use
after free of fb_tunnel_dev (9434266f2c64), minus the small code
cleanup at the end of that patch that is not relevant to 3.10.27 (do
not define sit_net *sitn in sit_exit_net if it is not used there. But in
3.10.27 it is used, in unregister_netdevice_queue).
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net-next] net: filter: rename 'struct sk_filter' to 'struct bpf_prog'

2014-07-25 Thread Willem de Bruijn

 alternative fix for xt_bpf.h could be to replace:
  /* only used in the kernel */
  struct sk_filter *filter __attribute__((aligned(8)));
 with
  /* only used in the kernel */
  void *filter __attribute__((aligned(8)));
 
 but this 'void *' approach may further break broken userspace,
 whereas the fix implemented here is more seamless.

 Yep, that's not good, 'struct sk_filter' should never have been in a uapi
 file actually.

This follows a convention in include/uapi/linux/netfilter/*.h that
likely predates the introduction of uapi. A search for Used
internally by the kernel shows many more examples. I should not have
included filter.h, however. The common behavior when using pointers
to kernel-internal structures is to have a forward declaration. I suggest
making that change, instead of changing to void *. This avoids having
to add casts where xt_bpf_info is used in net/netfilter/xt_bpf.c:

-#include linux/filter.h
 #include linux/types.h

 #define XT_BPF_MAX_NUM_INSTR   64

+struct sk_filter;
+
 struct xt_bpf_info {

I can send this as a separate patch to net-next, if that helps.

 You can just send me a patch to change it to void. It's an internal
 kernel pointer as the comment states. There is **no** way that
 userspace can lurk with that from iptables at all.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net-next] net: filter: rename 'struct sk_filter' to 'struct bpf_prog'

2014-07-25 Thread Willem de Bruijn

 This follows a convention in include/uapi/linux/netfilter/*.h that
 likely predates the introduction of uapi. A search for Used
 internally by the kernel shows many more examples. I should not have
 included filter.h, however. The common behavior when using pointers
 to kernel-internal structures is to have a forward declaration. I suggest
 making that change, instead of changing to void *. This avoids having
 to add casts where xt_bpf_info is used in net/netfilter/xt_bpf.c:

 that will not avoid typecast.
 Either 'void *' approach or extra 'struct sk_filter;' approach, both need
 type casts to 'struct bpf_prog' in xt_bpf.c
 (because of SK_RUN_FILTER macro)
 Therefore I prefer extra 'struct sk_filter;' approach.

I hadn't noticed that your patch makes the same change that I
proposed. Nothing in userspace should touch that pointer, so it is
fine to change its type to struct bpf_prog* at the same time. No need
for typecasts.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net-next] net: filter: rename 'struct sk_filter' to 'struct bpf_prog'

2014-07-25 Thread Willem de Bruijn

On Fri, Jul 25, 2014 at 2:43 PM, Alexei Starovoitov a...@plumgrid.com wrote:
 On Fri, Jul 25, 2014 at 11:32 AM, Willem de Bruijn will...@google.com wrote:
 This follows a convention in include/uapi/linux/netfilter/*.h that
 likely predates the introduction of uapi. A search for Used
 internally by the kernel shows many more examples. I should not have
 included filter.h, however. The common behavior when using pointers
 to kernel-internal structures is to have a forward declaration. I suggest
 making that change, instead of changing to void *. This avoids having
 to add casts where xt_bpf_info is used in net/netfilter/xt_bpf.c:

 that will not avoid typecast.
 Either 'void *' approach or extra 'struct sk_filter;' approach, both need
 type casts to 'struct bpf_prog' in xt_bpf.c
 (because of SK_RUN_FILTER macro)
 Therefore I prefer extra 'struct sk_filter;' approach.

 I hadn't noticed that your patch makes the same change that I
 proposed. Nothing in userspace should touch that pointer, so it is
 fine to change its type to struct bpf_prog* at the same time. No need
 for typecasts.

 really? I don't think it's a good idea to expose kernel struct type
 to user space. How is it even going to compile?

a forward declaration.

 #include linux/filter.h brings different files in kernel and in user space.
 struct bpf_prog is undefined in user space and compiler will complain.
 Adding 'struct bpf_prog;' will be ugly.
 imo the lesser evil is adding 'struct sk_filter;' and doing type casts
 in kernel.

but the exact same argument applies to sk_filter. If that struct is
renamed everywhere else, then the result will only be more confusing.
A forward declaration is the standard workaround to all such cases in
include/uapi/linux/netfilter. See for instance xt_connlimit.h. This is
sufficient to allow userspace build to succeed, without exposing any
kernel structure detail. If you don't even want to leak the name, then
let's make it void *. Keeping a declaration for sk_filter, while
sk_filter is renamed everywhere else is the least good option, in my
opinion.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [LKP] [net] 4ed2d765dfa: ltp.recv01.4.TFAIL

2014-10-01 Thread Willem de Bruijn

On Wed, Oct 1, 2014 at 11:29 AM, Fengguang Wu fengguang...@intel.com wrote:
 Hi Willem,

 FYI, we noticed the below LTP failures on commit

Thanks for the report, Fengguang. The failures are

recv01  4  TFAIL  :  recv01.c:142: invalid flags set ; returned -1
(expected -1), errno 11 (expected 22)
recvfrom014  TFAIL  :  recvfrom01.c:164: invalid socket length ;
returned -1 (expected -1), errno 11 (expected 22)
recvfrom016  TFAIL  :  recvfrom01.c:164: invalid flags set ;
returned -1 (expected -1), errno 11 (expected 22)
recvmsg019  TFAIL  :  recvmsg01.c:228: invalid flags set ;
returned -1 (expected -1), errno 11 (expected 22)

In other words, these functions return EAGAIN now, when they used to
return EINVAL. This will happen from this patch onwards if flags
includes MSG_ERRQUEUE. The patch introduced error queue support for
TCP sockets. Like other recvmsg implementations, the error queue code
will act on some flags and ignore others. Previously, the code
happened to check MSG_OOB first and return EINVAL. The error queue
code ignores this flag and returns EAGAIN if no data is waiting.

This was recently discussed in

  LTP recv/recvmsg tests failing on 3.17
  http://comments.gmane.org/gmane.linux.network/331750

where the consensus was that applications should not expect
particular error codes when passing unsupported combinations
of flags.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [LKP] [net] 4ed2d765dfa:

2014-11-24 Thread Willem de Bruijn

On Mon, Nov 24, 2014 at 7:38 AM, Yuanhan Liu
yuanhan@linux.intel.com wrote:
 FYI, we noticed the below changes on

 commit 4ed2d765dfaccff5ebdac68e2064b59125033a3b (net-timestamp: TCP 
 timestamping)

 e7fd2885385157d4  4ed2d765dfaccff5ebdac68e20
   --
:10 100%   5:5 ltp.recv01.fail
:10 100%   5:5 ltp.recvfrom01.fail
:10 100%   5:5 ltp.recvmsg01.fail

This has been discussed before:

  LTP recv/recvmsg tests failing on 3.17
  http://comments.gmane.org/gmane.linux.network/331750

and

  https://lkml.org/lkml/2014/10/1/379
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] af_packet: don't pass empty blocks for PACKET_V3

2015-02-06 Thread Willem de Bruijn

On Thu, Feb 5, 2015 at 10:54 PM, Alexander Drozdov al.droz...@gmail.com wrote:
 On 05.02.2015 23:01:38 +0300 Willem de Bruijn wrote:

 On Wed, Feb 4, 2015 at 9:58 PM, Alexander Drozdov al.droz...@gmail.com
 wrote:

 Don't close an empty block on timeout. Its meaningless to
 pass it to the user. Moreover, passing empty blocks wastes
 CPU  buffer space increasing probability of packets
 dropping on small timeouts.

 Side effect of this patch is indefinite user-space wait
 in poll on idle links. But, I believe its better to set
 timeout for poll(2) when needed than to get empty blocks
 every millisecond when not needed.

 This change would break existing applications that have come
 to depend on the periodic signal.

 I don't disagree with the argument that the data ready signal
 should be sent only when a block is full or a timer expires and
 at least some data is waiting, but that is moot at this point.

 I missed something. As pointed by Guy Harris g...@alum.mit.edu,
 before the previous patch periodic signal was not delivered. The previous
 patch
 (da413eec729dae5dc by Dan Collins d...@dcollins.co.nz) is for 3.19 kernel
 only. Should we care about existing 3.19-only applications?

It does sound reasonable to expect processes to handle infinite sleep
on no data if that is the historical behavior of the interface.


 Signed-off-by: Alexander Drozdov al.droz...@gmail.com
 ---
   net/packet/af_packet.c | 10 +-
   1 file changed, 9 insertions(+), 1 deletion(-)

 diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
 index 9cfe2e1..9a2f70a 100644
 --- a/net/packet/af_packet.c
 +++ b/net/packet/af_packet.c
 @@ -698,6 +698,10 @@ static void prb_retire_rx_blk_timer_expired(unsigned
 long data)

  if (pkc-last_kactive_blk_num == pkc-kactive_blk_num) {
  if (!frozen) {
 +   if (!BLOCK_NUM_PKTS(pbd)) {
 +   /* An empty block. Just refresh the
 timer. */
 +   goto refresh_timer;
 +   }
  prb_retire_current_block(pkc, po,
 TP_STATUS_BLK_TMO);
  if (!prb_dispatch_next_block(pkc, po))
  goto refresh_timer;
 @@ -798,7 +802,11 @@ static void prb_close_block(struct tpacket_kbdq_core
 *pkc1,
  h1-ts_last_pkt.ts_sec = last_pkt-tp_sec;
  h1-ts_last_pkt.ts_nsec = last_pkt-tp_nsec;
  } else {
 -   /* Ok, we tmo'd - so get the current time */
 +   /* Ok, we tmo'd - so get the current time.
 +*
 +* It shouldn't really happen as we don't close empty
 +* blocks. See prb_retire_rx_blk_timer_expired().
 +*/
  struct timespec ts;
  getnstimeofday(ts);
  h1-ts_last_pkt.ts_sec = ts.tv_sec;
 --
 1.9.1


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH net] packet: bail out of packet_snd() if L2 header creation fails

2015-01-11 Thread Willem de Bruijn

On Sun, Jan 11, 2015 at 1:35 PM, Eric Dumazet eric.duma...@gmail.com wrote:
 On Sun, 2015-01-11 at 13:01 -0500, Christoph Jaeger wrote:
 Due to a misplaced parenthesis, the expression

   (unlikely(offset)  0),

 which expands to

   (__builtin_expect(!!(offset), 0)  0),

 never evaluates to true. Therefore, when sending packets with
 PF_PACKET/SOCK_DGRAM, packet_snd() does not abort as intended
 if the creation of the layer 2 header fails.

 Spotted by Coverity - CID 1259975 (Operands don't affect result).

 Fixes: 9c7077622dd9 (packet: make packet_snd fail on len smaller than l2 
 header)
 Signed-off-by: Christoph Jaeger c...@linux.com
 ---

 Nice catch !

 Acked-by: Eric Dumazet eduma...@google.com


Indeed. I'm responsible for that typo. Thanks a lot for catching it!

Acked-by: Willem de Bruijn will...@google.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] af_packet: don't pass empty blocks for PACKET_V3

2015-02-05 Thread Willem de Bruijn

On Wed, Feb 4, 2015 at 9:58 PM, Alexander Drozdov al.droz...@gmail.com wrote:
 Don't close an empty block on timeout. Its meaningless to
 pass it to the user. Moreover, passing empty blocks wastes
 CPU  buffer space increasing probability of packets
 dropping on small timeouts.

 Side effect of this patch is indefinite user-space wait
 in poll on idle links. But, I believe its better to set
 timeout for poll(2) when needed than to get empty blocks
 every millisecond when not needed.

This change would break existing applications that have come
to depend on the periodic signal.

I don't disagree with the argument that the data ready signal
should be sent only when a block is full or a timer expires and
at least some data is waiting, but that is moot at this point.


 Signed-off-by: Alexander Drozdov al.droz...@gmail.com
 ---
  net/packet/af_packet.c | 10 +-
  1 file changed, 9 insertions(+), 1 deletion(-)

 diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
 index 9cfe2e1..9a2f70a 100644
 --- a/net/packet/af_packet.c
 +++ b/net/packet/af_packet.c
 @@ -698,6 +698,10 @@ static void prb_retire_rx_blk_timer_expired(unsigned 
 long data)

 if (pkc-last_kactive_blk_num == pkc-kactive_blk_num) {
 if (!frozen) {
 +   if (!BLOCK_NUM_PKTS(pbd)) {
 +   /* An empty block. Just refresh the timer. */
 +   goto refresh_timer;
 +   }
 prb_retire_current_block(pkc, po, TP_STATUS_BLK_TMO);
 if (!prb_dispatch_next_block(pkc, po))
 goto refresh_timer;
 @@ -798,7 +802,11 @@ static void prb_close_block(struct tpacket_kbdq_core 
 *pkc1,
 h1-ts_last_pkt.ts_sec = last_pkt-tp_sec;
 h1-ts_last_pkt.ts_nsec = last_pkt-tp_nsec;
 } else {
 -   /* Ok, we tmo'd - so get the current time */
 +   /* Ok, we tmo'd - so get the current time.
 +*
 +* It shouldn't really happen as we don't close empty
 +* blocks. See prb_retire_rx_blk_timer_expired().
 +*/
 struct timespec ts;
 getnstimeofday(ts);
 h1-ts_last_pkt.ts_sec = ts.tv_sec;
 --
 1.9.1

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] af_packet: pass checksum validation status to the user

2015-03-20 Thread Willem de Bruijn

On Fri, Mar 20, 2015 at 2:46 AM, Alexander Drozdov al.droz...@gmail.com wrote:

 On 19.03.2015 21:50:03 +0300 Willem de Bruijn wrote:

 On Thu, Mar 19, 2015 at 2:08 PM, Alexander Drozdov al.droz...@gmail.com
 wrote:

 On Thu, 19 Mar 2015 11:29:32 -0400, Willem de Bruijn will...@google.com
 wrote:


 On Thu, Mar 19, 2015 at 4:01 AM, Alexander Drozdov
 al.droz...@gmail.com
 wrote:


 Introduce TP_STATUS_CSUM_VALID tp_status flag to tell the
 af_packet user that at least the transport header checksum
 has been already validated.


 This changes the interface slightly. Processes should be treating
 this previously unused bit as reserved and other flags have
 been added to the bitmap in this manner as well, so this should
 then be safe here, too.

 For now, the flag may be set for incoming packets only.


 Why?


 I can't figure out how af_packet could get that the outgoing
 packet's checksum has been validated. Checksumming on
 output from skbuff.h tells that skb-ip_summed should
 equal to CHECKSUM_NONE in that case,


 CHECKSUM_UNNECESSARY is a valid flag on the outgoing
 path according to that documentation. And, indeed, I also see
 no checksum scrubbing on forwarding paths in practice (but I
 may be wrong there, only took a quick glance).

 but that is not true for me (in my tests, skb-ip_summed ==
 CHECKSUM_UNNECESSARY for forwarded packets in some
 cases).


 When you disable hardware checksum offload and generate
 packets locally, you do see the expected CHECKSUM_NONE
 value?


 Yes, I do, but see below.

 You cannot change the semantics of the flag afterwards.



 I think the semantics of the flag won't be changed if one set the flag
 for outgoing packets. If the flag is not set (for any directions)
 then that is not mean that the packet checksum is invalid. The user just
 can
 then
 checksum the packet by itself. So, the user may check the flag for any
 packet
 right now.


 I see. So it is a hint. Okay. It would be nice if it behaves as
 expected in as many cases as possible from the outset. This
 would include PACKET_OUTGOING and CHECKSUM_NONE.


 I've just done some testing, and I've found that packets generated by
 'nping --badsum' (socket(PF_INET, SOCK_RAW, IPPROTO_RAW)) have CHECKSUM_NONE
 when they are viewed by af_packet. I've used rather old Linux kernel
 for the tests, but isn't that a reason to not set TP_STATUS_CSUM_VALID for
 outgoing packets right now?

Yes. That's a very good example.

 Please also note the flag and semantics in
 Documentation/networking/packet_mmap.txt


 I'll do it and I'll resend the patches with the note.

Thanks.




 Better to support both directions from the start.

 Signed-off-by: Alexander Drozdov al.droz...@gmail.com
 ---
include/uapi/linux/if_packet.h | 1 +
net/packet/af_packet.c | 9 +
2 files changed, 10 insertions(+)

 diff --git a/include/uapi/linux/if_packet.h
 b/include/uapi/linux/if_packet.h
 index da2d668..053bd10 100644
 --- a/include/uapi/linux/if_packet.h
 +++ b/include/uapi/linux/if_packet.h
 @@ -99,6 +99,7 @@ struct tpacket_auxdata {
#define TP_STATUS_VLAN_VALID   (1  4) /* auxdata has valid
 tp_vlan_tci */
#define TP_STATUS_BLK_TMO  (1  5)
#define TP_STATUS_VLAN_TPID_VALID  (1  6) /* auxdata has valid
 tp_vlan_tpid */
 +#define TP_STATUS_CSUM_VALID   (1  7)

/* Tx ring - header status */
#define TP_STATUS_AVAILABLE  0
 diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
 index 6ecf8dd..3f09dda 100644
 --- a/net/packet/af_packet.c
 +++ b/net/packet/af_packet.c
 @@ -1918,6 +1918,10 @@ static int tpacket_rcv(struct sk_buff *skb,
 struct
 net_device *dev,

   if (skb-ip_summed == CHECKSUM_PARTIAL)
   status |= TP_STATUS_CSUMNOTREADY;
 +   else if (skb-pkt_type != PACKET_OUTGOING 
 +(skb-ip_summed == CHECKSUM_COMPLETE ||
 + skb_csum_unnecessary(skb)))
 +   status |= TP_STATUS_CSUM_VALID;

   if (snaplen  res)
   snaplen = res;
 @@ -3015,6 +3019,11 @@ static int packet_recvmsg(struct kiocb *iocb,
 struct socket *sock,
   aux.tp_status = TP_STATUS_USER;
   if (skb-ip_summed == CHECKSUM_PARTIAL)
   aux.tp_status |= TP_STATUS_CSUMNOTREADY;
 +   else if (skb-pkt_type != PACKET_OUTGOING 
 +(skb-ip_summed == CHECKSUM_COMPLETE ||
 + skb_csum_unnecessary(skb)))
 +   aux.tp_status |= TP_STATUS_CSUM_VALID;
 +


 These two sections are near duplicates. I'd move the entire status
 initialization, including existing TP_STATUS_USER and
 TP_STATUS_CSUMNOTREADY fields to a helper function.

 It's a bit unfortunately that we have to use an extra bit to add a
 signal
 that is a near inverse of the existing TP_STATUS_CSUMNOTREADY
 (bar CHECKSUM_NONE). I do not immediately see a better way,
 either, though. And tp_status has plenty room at 32 bits

Re: [PATCH 2/2] af_packet: pass checksum validation status to the user

2015-03-19 Thread Willem de Bruijn

On Thu, Mar 19, 2015 at 2:08 PM, Alexander Drozdov al.droz...@gmail.com wrote:
 On Thu, 19 Mar 2015 11:29:32 -0400, Willem de Bruijn will...@google.com
 wrote:

 On Thu, Mar 19, 2015 at 4:01 AM, Alexander Drozdov al.droz...@gmail.com
 wrote:

 Introduce TP_STATUS_CSUM_VALID tp_status flag to tell the
 af_packet user that at least the transport header checksum
 has been already validated.

 This changes the interface slightly. Processes should be treating
 this previously unused bit as reserved and other flags have
 been added to the bitmap in this manner as well, so this should
 then be safe here, too.

 For now, the flag may be set for incoming packets only.

 Why?

 I can't figure out how af_packet could get that the outgoing
 packet's checksum has been validated. Checksumming on
 output from skbuff.h tells that skb-ip_summed should
 equal to CHECKSUM_NONE in that case,

CHECKSUM_UNNECESSARY is a valid flag on the outgoing
path according to that documentation. And, indeed, I also see
no checksum scrubbing on forwarding paths in practice (but I
may be wrong there, only took a quick glance).

 but that is not true for me (in my tests, skb-ip_summed ==
 CHECKSUM_UNNECESSARY for forwarded packets in some
 cases).

When you disable hardware checksum offload and generate
packets locally, you do see the expected CHECKSUM_NONE
value?

 You cannot change the semantics of the flag afterwards.


 I think the semantics of the flag won't be changed if one set the flag
 for outgoing packets. If the flag is not set (for any directions)
 then that is not mean that the packet checksum is invalid. The user just can
 then
 checksum the packet by itself. So, the user may check the flag for any
 packet
 right now.

I see. So it is a hint. Okay. It would be nice if it behaves as
expected in as many cases as possible from the outset. This
would include PACKET_OUTGOING and CHECKSUM_NONE.

Please also note the flag and semantics in
Documentation/networking/packet_mmap.txt


 Better to support both directions from the start.

 Signed-off-by: Alexander Drozdov al.droz...@gmail.com
 ---
   include/uapi/linux/if_packet.h | 1 +
   net/packet/af_packet.c | 9 +
   2 files changed, 10 insertions(+)

 diff --git a/include/uapi/linux/if_packet.h
 b/include/uapi/linux/if_packet.h
 index da2d668..053bd10 100644
 --- a/include/uapi/linux/if_packet.h
 +++ b/include/uapi/linux/if_packet.h
 @@ -99,6 +99,7 @@ struct tpacket_auxdata {
   #define TP_STATUS_VLAN_VALID   (1  4) /* auxdata has valid
 tp_vlan_tci */
   #define TP_STATUS_BLK_TMO  (1  5)
   #define TP_STATUS_VLAN_TPID_VALID  (1  6) /* auxdata has valid
 tp_vlan_tpid */
 +#define TP_STATUS_CSUM_VALID   (1  7)

   /* Tx ring - header status */
   #define TP_STATUS_AVAILABLE  0
 diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
 index 6ecf8dd..3f09dda 100644
 --- a/net/packet/af_packet.c
 +++ b/net/packet/af_packet.c
 @@ -1918,6 +1918,10 @@ static int tpacket_rcv(struct sk_buff *skb, struct
 net_device *dev,

  if (skb-ip_summed == CHECKSUM_PARTIAL)
  status |= TP_STATUS_CSUMNOTREADY;
 +   else if (skb-pkt_type != PACKET_OUTGOING 
 +(skb-ip_summed == CHECKSUM_COMPLETE ||
 + skb_csum_unnecessary(skb)))
 +   status |= TP_STATUS_CSUM_VALID;

  if (snaplen  res)
  snaplen = res;
 @@ -3015,6 +3019,11 @@ static int packet_recvmsg(struct kiocb *iocb,
 struct socket *sock,
  aux.tp_status = TP_STATUS_USER;
  if (skb-ip_summed == CHECKSUM_PARTIAL)
  aux.tp_status |= TP_STATUS_CSUMNOTREADY;
 +   else if (skb-pkt_type != PACKET_OUTGOING 
 +(skb-ip_summed == CHECKSUM_COMPLETE ||
 + skb_csum_unnecessary(skb)))
 +   aux.tp_status |= TP_STATUS_CSUM_VALID;
 +

 These two sections are near duplicates. I'd move the entire status
 initialization, including existing TP_STATUS_USER and
 TP_STATUS_CSUMNOTREADY fields to a helper function.

 It's a bit unfortunately that we have to use an extra bit to add a signal
 that is a near inverse of the existing TP_STATUS_CSUMNOTREADY
 (bar CHECKSUM_NONE). I do not immediately see a better way,
 either, though. And tp_status has plenty room at 32 bits.

  aux.tp_len = PACKET_SKB_CB(skb)-origlen;
  aux.tp_snaplen = skb-len;
  aux.tp_mac = 0;
 --
 1.9.1



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] af_packet: pass checksum validation status to the user

2015-03-19 Thread Willem de Bruijn

On Thu, Mar 19, 2015 at 4:01 AM, Alexander Drozdov al.droz...@gmail.com wrote:
 Introduce TP_STATUS_CSUM_VALID tp_status flag to tell the
 af_packet user that at least the transport header checksum
 has been already validated.

This changes the interface slightly. Processes should be treating
this previously unused bit as reserved and other flags have
been added to the bitmap in this manner as well, so this should
then be safe here, too.

 For now, the flag may be set for incoming packets only.

Why? You cannot change the semantics of the flag afterwards.
Better to support both directions from the start.

 Signed-off-by: Alexander Drozdov al.droz...@gmail.com
 ---
  include/uapi/linux/if_packet.h | 1 +
  net/packet/af_packet.c | 9 +
  2 files changed, 10 insertions(+)

 diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
 index da2d668..053bd10 100644
 --- a/include/uapi/linux/if_packet.h
 +++ b/include/uapi/linux/if_packet.h
 @@ -99,6 +99,7 @@ struct tpacket_auxdata {
  #define TP_STATUS_VLAN_VALID   (1  4) /* auxdata has valid 
 tp_vlan_tci */
  #define TP_STATUS_BLK_TMO  (1  5)
  #define TP_STATUS_VLAN_TPID_VALID  (1  6) /* auxdata has valid 
 tp_vlan_tpid */
 +#define TP_STATUS_CSUM_VALID   (1  7)

  /* Tx ring - header status */
  #define TP_STATUS_AVAILABLE  0
 diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
 index 6ecf8dd..3f09dda 100644
 --- a/net/packet/af_packet.c
 +++ b/net/packet/af_packet.c
 @@ -1918,6 +1918,10 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
 net_device *dev,

 if (skb-ip_summed == CHECKSUM_PARTIAL)
 status |= TP_STATUS_CSUMNOTREADY;
 +   else if (skb-pkt_type != PACKET_OUTGOING 
 +(skb-ip_summed == CHECKSUM_COMPLETE ||
 + skb_csum_unnecessary(skb)))
 +   status |= TP_STATUS_CSUM_VALID;

 if (snaplen  res)
 snaplen = res;
 @@ -3015,6 +3019,11 @@ static int packet_recvmsg(struct kiocb *iocb, struct 
 socket *sock,
 aux.tp_status = TP_STATUS_USER;
 if (skb-ip_summed == CHECKSUM_PARTIAL)
 aux.tp_status |= TP_STATUS_CSUMNOTREADY;
 +   else if (skb-pkt_type != PACKET_OUTGOING 
 +(skb-ip_summed == CHECKSUM_COMPLETE ||
 + skb_csum_unnecessary(skb)))
 +   aux.tp_status |= TP_STATUS_CSUM_VALID;
 +

These two sections are near duplicates. I'd move the entire status
initialization, including existing TP_STATUS_USER and
TP_STATUS_CSUMNOTREADY fields to a helper function.

It's a bit unfortunately that we have to use an extra bit to add a signal
that is a near inverse of the existing TP_STATUS_CSUMNOTREADY
(bar CHECKSUM_NONE). I do not immediately see a better way,
either, though. And tp_status has plenty room at 32 bits.

 aux.tp_len = PACKET_SKB_CB(skb)-origlen;
 aux.tp_snaplen = skb-len;
 aux.tp_mac = 0;
 --
 1.9.1

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: linux-next: build warnings after merge of the net-next tree

2015-05-14 Thread Willem de Bruijn

On Thu, May 14, 2015 at 3:12 AM, Stephen Rothwell s...@canb.auug.org.au wrote:
 Hi all,

 After merging the net-next tree, today's linux-next build (sparc64 defconfig)
 produced these warnings:

 In file included from arch/sparc/include/asm/cmpxchg.h:4:0,
  from arch/sparc/include/asm/atomic_64.h:11,
  from arch/sparc/include/asm/atomic.h:4,
  from include/linux/atomic.h:4,
  from include/linux/spinlock.h:416,
  from include/linux/mmzone.h:7,
  from include/linux/gfp.h:5,
  from include/linux/mm.h:9,
  from net/packet/af_packet.c:56:
 net/packet/af_packet.c: In function 'packet_rcv_has_room':
 arch/sparc/include/asm/cmpxchg_64.h:43:22: warning: value computed is not 
 used [-Wunused-value]
  #define xchg(ptr,x) ((__typeof__(*(ptr)))__xchg((unsigned 
 long)(x),(ptr),sizeof(*(ptr
   ^
 net/packet/af_packet.c:1314:3: note: in expansion of macro 'xchg'
xchg(po-pressure, !has_room);
^
 net/packet/af_packet.c: In function 'packet_poll':
 arch/sparc/include/asm/cmpxchg_64.h:43:22: warning: value computed is not 
 used [-Wunused-value]
  #define xchg(ptr,x) ((__typeof__(*(ptr)))__xchg((unsigned 
 long)(x),(ptr),sizeof(*(ptr
   ^
 net/packet/af_packet.c:3817:3: note: in expansion of macro 'xchg'
xchg(po-pressure, 0);
^

 Introduced by commit 2ccdbaa6d55b (packet: rollover lock contention
 avoidance).

I just sent http://patchwork.ozlabs.org/patch/472362/ to resolve
these. Thanks for the report.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: net: heap-out-of-bounds in sock_setsockopt

2015-12-16 Thread Willem de Bruijn

>
> Hmm, we should exclude the raw socket case, something like the
> following, but I am not sure if the check is too strict or not, also
> not sure if we should return an error for this raw socket case.

No, SOF_TIMESTAMPING_OPT_ID with SOCK_RAW/IPPROTO_TCP
is legitimate. It should fall through to initializing sk->sk_tskey to zero.
Only stream sockets should use the special case where numbering
is bytestream and computed by subtracting the seqno from the seqno
at the time that the option is enabled.

>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 765be83..c26e80a 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -872,7 +872,7 @@ int sock_setsockopt(struct socket *sock, int
> level, int optname,
>
> if (val & SOF_TIMESTAMPING_OPT_ID &&
> !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> -   if (sk->sk_protocol == IPPROTO_TCP) {
> +   if (sk->sk_protocol == IPPROTO_TCP &&
> sk->sk_type == SOCK_STREAM) {
> if (sk->sk_state != TCP_ESTABLISHED) {
> ret = -EINVAL;
> break;

I made the same error when later returning the tskey in
__skb_tstamp_tx:

if (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) {
serr->ee.ee_data = skb_shinfo(skb)->tskey;
if (sk->sk_protocol == IPPROTO_TCP)
serr->ee.ee_data -= sk->sk_tskey;
}

This has no effect if sk->sk_tskey is initialized to 0 with your patch. Still,
it should not treat SOCK_RAW as a TCP sock here, either. Please
add this if you're about to send the patch. Or I can send it separately,
if you prefer. Thanks for the quick fix.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Boot failure when using NFS on OMAP based evms

2016-04-07 Thread Willem de Bruijn

 Currently linux-next is failing to boot via NFS on my AM335x GP evm,
 AM437x GP evm and Beagle X15. I bisected the problem down to the commit
 "udp: remove headers from UDP packets before queueing".
>>>
>>> Thanks for the report, and apologies for breaking your configuration.
>>> I had missed that sunrpc can dequeue skbs from a udp receive
>>> queue and makes assumptions about the layout of those packets. rxrpc
>>> does the same. From what I can tell so far, those are the only two
>>> protocols that do this. I have verified that the following fixes rxrpc for 
>>> me
>>>
>>
>> Thank you for your quick response. I verified with all of the above
>> suggested changes that NFS works again on my 3 evms.
>
> Thanks a lot for testing, Franklin. I will send out the two patches.

Patches sent to netdev. I'll do another scan to verify that there are
no additional protocols that dequeue skbs from udp receive queues
and expect udp headers present.

Re: Boot failure when using NFS on OMAP based evms

2016-04-07 Thread Willem de Bruijn

On Thu, Apr 7, 2016 at 10:36 AM, Franklin S Cooper Jr. <fcoo...@ti.com> wrote:
>
>
> On 04/06/2016 09:22 PM, Willem de Bruijn wrote:
>> On Wed, Apr 6, 2016 at 7:12 PM, Franklin S Cooper Jr. <fcoo...@ti.com> wrote:
>>> Hi All,
>>>
>>> Currently linux-next is failing to boot via NFS on my AM335x GP evm,
>>> AM437x GP evm and Beagle X15. I bisected the problem down to the commit
>>> "udp: remove headers from UDP packets before queueing".
>>>
>>> I had to revert the following three commits to get things working again:
>>>
>>> e6afc8ace6dd5cef5e812f26c72579da8806f5ac
>>> udp: remove headers from UDP packets before queueing
>>>
>>> 627d2d6b550094d88f9e518e15967e7bf906ebbf
>>> udp: enable MSG_PEEK at non-zero offset
>>>
>>> b9bb53f3836f4eb2bdeb3447be11042bd29c2408
>>> sock: convert sk_peek_offset functions to WRITE_ONCE
>>>
>>
>> Thanks for the report, and apologies for breaking your configuration.
>> I had missed that sunrpc can dequeue skbs from a udp receive
>> queue and makes assumptions about the layout of those packets. rxrpc
>> does the same. From what I can tell so far, those are the only two
>> protocols that do this. I have verified that the following fixes rxrpc for me
>>
>> --- a/net/rxrpc/ar-input.c
>> +++ b/net/rxrpc/ar-input.c
>> @@ -612,9 +612,9 @@ int rxrpc_extract_header(struct rxrpc_skb_priv
>> *sp, struct sk_buff *skb)
>> struct rxrpc_wire_header whdr;
>>
>> /* dig out the RxRPC connection details */
>> -   if (skb_copy_bits(skb, sizeof(struct udphdr), , sizeof(whdr)) < 
>> 0)
>> +   if (skb_copy_bits(skb, 0, , sizeof(whdr)) < 0)
>> return -EBADMSG;
>> -   if (!pskb_pull(skb, sizeof(struct udphdr) + sizeof(whdr)))
>> +   if (!pskb_pull(skb, sizeof(whdr)))
>> BUG();
>>
>> I have not yet been able to reproduce the sunrpc/nfs issue, but I
>> suspect that the following might fix it. I will try to create an NFS
>> setup.
>>
>> diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
>> index 2df87f7..8ab40ba 100644
>> --- a/net/sunrpc/socklib.c
>> +++ b/net/sunrpc/socklib.c
>> @@ -155,7 +155,7 @@ int csum_partial_copy_to_xdr(struct xdr_buf *xdr,
>> struct sk_buff *skb)
>> struct xdr_skb_reader   desc;
>>
>> desc.skb = skb;
>> -   desc.offset = sizeof(struct udphdr);
>> +   desc.offset = 0;
>> desc.count = skb->len - desc.offset;
>>
>> if (skb_csum_unnecessary(skb))
>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>> index 1413cdc..71d6072 100644
>> --- a/net/sunrpc/svcsock.c
>> +++ b/net/sunrpc/svcsock.c
>> @@ -617,7 +617,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
>> svsk->sk_sk->sk_stamp = skb->tstamp;
>> set_bit(XPT_DATA, >sk_xprt.xpt_flags); /* there may be
>> more data... */
>>
>> -   len  = skb->len - sizeof(struct udphdr);
>> +   len  = skb->len;
>> rqstp->rq_arg.len = len;
>>
>> rqstp->rq_prot = IPPROTO_UDP;
>> @@ -641,8 +641,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
>> skb_free_datagram_locked(svsk->sk_sk, skb);
>> } else {
>> /* we can use it in-place */
>> -   rqstp->rq_arg.head[0].iov_base = skb->data +
>> -   sizeof(struct udphdr);
>> +   rqstp->rq_arg.head[0].iov_base = skb->data;
>> rqstp->rq_arg.head[0].iov_len = len;
>> if (skb_checksum_complete(skb))
>> goto out_free;
>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>> index 65e7595..c1fc7b2 100644
>> --- a/net/sunrpc/xprtsock.c
>> +++ b/net/sunrpc/xprtsock.c
>> @@ -995,15 +995,14 @@ static void xs_udp_data_read_skb(struct rpc_xprt *xprt,
>> u32 _xid;
>> __be32 *xp;
>>
>> -   repsize = skb->len - sizeof(struct udphdr);
>> +   repsize = skb->len;
>> if (repsize < 4) {
>> dprintk("RPC:   impossible RPC reply size %d!\n", 
>> repsize);
>> return;
>> }
>>
>>
>>
>> /* Copy the XID from the skb... */
>> -   xp = skb_header_pointer(skb, sizeof(struct udphdr),
>> -   sizeof(_xid), &_xid);
>> +   xp = skb_header_pointer(skb, 0, sizeof(_xid), &_xid);
>> if (xp == NULL)
>> return;
>>
>
>
> Thank you for your quick response. I verified with all of the above
> suggested changes that NFS works again on my 3 evms.

Thanks a lot for testing, Franklin. I will send out the two patches.

Re: Boot failure when using NFS on OMAP based evms

2016-04-06 Thread Willem de Bruijn

On Wed, Apr 6, 2016 at 7:12 PM, Franklin S Cooper Jr.  wrote:
> Hi All,
>
> Currently linux-next is failing to boot via NFS on my AM335x GP evm,
> AM437x GP evm and Beagle X15. I bisected the problem down to the commit
> "udp: remove headers from UDP packets before queueing".
>
> I had to revert the following three commits to get things working again:
>
> e6afc8ace6dd5cef5e812f26c72579da8806f5ac
> udp: remove headers from UDP packets before queueing
>
> 627d2d6b550094d88f9e518e15967e7bf906ebbf
> udp: enable MSG_PEEK at non-zero offset
>
> b9bb53f3836f4eb2bdeb3447be11042bd29c2408
> sock: convert sk_peek_offset functions to WRITE_ONCE
>

Thanks for the report, and apologies for breaking your configuration.
I had missed that sunrpc can dequeue skbs from a udp receive
queue and makes assumptions about the layout of those packets. rxrpc
does the same. From what I can tell so far, those are the only two
protocols that do this. I have verified that the following fixes rxrpc for me

--- a/net/rxrpc/ar-input.c
+++ b/net/rxrpc/ar-input.c
@@ -612,9 +612,9 @@ int rxrpc_extract_header(struct rxrpc_skb_priv
*sp, struct sk_buff *skb)
struct rxrpc_wire_header whdr;

/* dig out the RxRPC connection details */
-   if (skb_copy_bits(skb, sizeof(struct udphdr), , sizeof(whdr)) < 0)
+   if (skb_copy_bits(skb, 0, , sizeof(whdr)) < 0)
return -EBADMSG;
-   if (!pskb_pull(skb, sizeof(struct udphdr) + sizeof(whdr)))
+   if (!pskb_pull(skb, sizeof(whdr)))
BUG();

I have not yet been able to reproduce the sunrpc/nfs issue, but I
suspect that the following might fix it. I will try to create an NFS
setup.

diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
index 2df87f7..8ab40ba 100644
--- a/net/sunrpc/socklib.c
+++ b/net/sunrpc/socklib.c
@@ -155,7 +155,7 @@ int csum_partial_copy_to_xdr(struct xdr_buf *xdr,
struct sk_buff *skb)
struct xdr_skb_reader   desc;

desc.skb = skb;
-   desc.offset = sizeof(struct udphdr);
+   desc.offset = 0;
desc.count = skb->len - desc.offset;

if (skb_csum_unnecessary(skb))
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 1413cdc..71d6072 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -617,7 +617,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
svsk->sk_sk->sk_stamp = skb->tstamp;
set_bit(XPT_DATA, >sk_xprt.xpt_flags); /* there may be
more data... */

-   len  = skb->len - sizeof(struct udphdr);
+   len  = skb->len;
rqstp->rq_arg.len = len;

rqstp->rq_prot = IPPROTO_UDP;
@@ -641,8 +641,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
skb_free_datagram_locked(svsk->sk_sk, skb);
} else {
/* we can use it in-place */
-   rqstp->rq_arg.head[0].iov_base = skb->data +
-   sizeof(struct udphdr);
+   rqstp->rq_arg.head[0].iov_base = skb->data;
rqstp->rq_arg.head[0].iov_len = len;
if (skb_checksum_complete(skb))
goto out_free;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 65e7595..c1fc7b2 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -995,15 +995,14 @@ static void xs_udp_data_read_skb(struct rpc_xprt *xprt,
u32 _xid;
__be32 *xp;

-   repsize = skb->len - sizeof(struct udphdr);
+   repsize = skb->len;
if (repsize < 4) {
dprintk("RPC:   impossible RPC reply size %d!\n", repsize);
return;
}



/* Copy the XID from the skb... */
-   xp = skb_header_pointer(skb, sizeof(struct udphdr),
-   sizeof(_xid), &_xid);
+   xp = skb_header_pointer(skb, 0, sizeof(_xid), &_xid);
if (xp == NULL)
return;

Re: Tool for sampling /proc/net/softnet_stat statistics

2016-03-07 Thread Willem de Bruijn

On Mon, Mar 7, 2016 at 10:36 AM, Jesper Dangaard Brouer
 wrote:
> Hi Google,
>
> While playing with RPS, I needed to read stats from
> /proc/net/softnet_stat and the tools I could find [1] and [2] was not
> very good.
>
> I lack of better, I coded up my own tool softnet_stat.pl here:
>  
> https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl
>
> The output format/columns in /proc/net/softnet_stat is undocumented,
> plus values are printed in hex.  E.g. to decode the columns you need to
> read kernel function kernel softnet_seq_show() in
> kernel/net/core/net-procfs.c.
>
> To make things easier I wrote this small perl script for get
> so human readable statistics from /proc/net/softnet_stat.

Very nice. Thanks for sharing, Jesper. I maintained something similar,
but never got around to clean up and upstream it. Will start using
yours, instead.

A few points, from using my earlier tool:

A minimum cut-off value is helpful, especially on beefy servers, to
suppress the many 0 rows. Preferably configurable, to also be able to
suppress low-rate background traffic when analyzing a few large
streams. My default was 500.

The number of columns has grown with kernel versions. The latest
column is flow_limit, added in 3.11 at99bbc7074190. It is helpful for
the script to be robust against both older and future kernels. On
which note, to be able to support these kinds of tools, any new
columns to such procfs files should be appended, not inserted as for
instance in https://patchwork.ozlabs.org/patch/574171/

Time squeeze is an exception, in that it is number of squeeze events
per second, not number of packets squeezed. This is often
misunderstood if not explained clearly.

Re: [patch -next] udp: fix if statement in SIOCINQ ioctl

2016-04-18 Thread Willem de Bruijn

On Mon, Apr 18, 2016 at 8:19 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> On Mon, 2016-04-18 at 11:44 +0300, Dan Carpenter wrote:
>> We deleted a line of code and accidentally made the "return put_user()"
>> part of the if statement when it's supposed to be unconditional.
>>
>> Fixes: 9f9a45beaa96 ('udp: do not expect udp headers on ioctl SIOCINQ')
>> Signed-off-by: Dan Carpenter <dan.carpen...@oracle.com>
>
> Acked-by: Eric Dumazet <eduma...@google.com>

Acked-by: Willem de Bruijn <will...@google.com>

Thanks for catching this.

[PATCH tip/timers/core] uapi glibc compat: make linux/time.h compile after libc time.h files

2016-08-10 Thread Willem de Bruijn

From: Willem de Bruijn <will...@google.com>

Add libc-compat workaround for definitions in linux/time.h that
duplicate those in libc time.h, sys/time.h and bits/time.h.

With this change, userspace builds succeeds when linux/time.h is
included after those libc files. The inverse requires changes to
those userspace headers.

Without this patch, when compiling the following program after
make headers_install:

  echo -e "#include \n#include " | \
gcc -Wall -Werror -Iusr/include -c -xc -

gcc gives these errors:

  #include 
  #include 

In file included from ../test_time.c:3:0:
/usr/include/time.h:120:8: error: redefinition of ‘struct timespec’
 struct timespec
^
In file included from ../test_time.c:2:0:
./usr/include/linux/time.h:9:8: note: originally defined here
 struct timespec {
^
In file included from ../test_time.c:3:0:
/usr/include/time.h:161:8: error: redefinition of ‘struct itimerspec’
 struct itimerspec
^
In file included from ../test_time.c:2:0:
./usr/include/linux/time.h:34:8: note: originally defined here
 struct itimerspec {

and this warning by indirect inclusion of bits/time.h:

In file included from ../test_time.c:4:0:
./usr/include/linux/time.h:67:0: error: "TIMER_ABSTIME" redefined [-Werror]
 #define TIMER_ABSTIME   0x01
 ^
In file included from /usr/include/time.h:41:0,
 from ../test_time.c:3:
/usr/include/x86_64-linux-gnu/bits/time.h:82:0: note: this is the location 
of the previous definition
 #   define TIMER_ABSTIME  1
 ^

Ran the same program for sys/time.h and bits/time.h. The _SYS_TIME_H
test resolves similar errors for timeval, timezone, itimerval and
warnings for ITIMER_REAL, ITIMER_VIRTUAL, ITIMER_PROF.

Signed-off-by: Willem de Bruijn <will...@google.com>
---
 include/uapi/linux/libc-compat.h | 50 
 include/uapi/linux/time.h| 15 
 2 files changed, 65 insertions(+)

diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h
index e4f048e..9c3cbf5 100644
--- a/include/uapi/linux/libc-compat.h
+++ b/include/uapi/linux/libc-compat.h
@@ -146,6 +146,43 @@
 #define __UAPI_DEF_XATTR   1
 #endif
 
+/* Definitions for time.h */
+#if defined(__timespec_defined)
+#define __UAPI_DEF_TIMESPEC0
+#else
+#define __UAPI_DEF_TIMESPEC1
+#endif
+
+#if defined(_TIME_H) && defined(__USE_POSIX199309)
+#define __UAPI_DEF_ITIMERSPEC  0
+#else
+#define __UAPI_DEF_ITIMERSPEC  1
+#endif
+
+/* Definitions for sys/time.h */
+#if defined(_SYS_TIME_H)
+#define __UAPI_DEF_TIMEVAL 0
+#define __UAPI_DEF_ITIMERVAL   0
+#define __UAPI_DEF_ITIMER_WHICH0
+#else
+#define __UAPI_DEF_TIMEVAL 1
+#define __UAPI_DEF_ITIMERVAL   1
+#define __UAPI_DEF_ITIMER_WHICH1
+#endif
+
+/* Definitions for bits/time.h */
+#if defined(_BITS_TIME_H)
+#define __UAPI_DEF_ABSTIME 0
+#else
+#define __UAPI_DEF_ABSTIME 1
+#endif
+
+#if defined(_SYS_TIME_H) && defined(__USE_BSD)
+#define __UAPI_DEF_TIMEZONE0
+#else
+#define __UAPI_DEF_TIMEZONE1
+#endif
+
 /* If we did not see any headers from any supported C libraries,
  * or we are being included in the kernel, then define everything
  * that we need. */
@@ -182,6 +219,19 @@
 /* Definitions for xattr.h */
 #define __UAPI_DEF_XATTR   1
 
+/* Definitions for time.h */
+#define __UAPI_DEF_TIMESPEC1
+#define __UAPI_DEF_ITIMERSPEC  1
+
+/* Definitions for sys/time.h */
+#define __UAPI_DEF_TIMEVAL 1
+#define __UAPI_DEF_ITIMERVAL   1
+#define __UAPI_DEF_ITIMER_WHICH1
+#define __UAPI_DEF_TIMEZONE1
+
+/* Definitions for bits/time.h */
+#define __UAPI_DEF_ABSTIME 1
+
 #endif /* __GLIBC__ */
 
 #endif /* _UAPI_LIBC_COMPAT_H */
diff --git a/include/uapi/linux/time.h b/include/uapi/linux/time.h
index e75e1b6..4e7333c 100644
--- a/include/uapi/linux/time.h
+++ b/include/uapi/linux/time.h
@@ -1,9 +1,11 @@
 #ifndef _UAPI_LINUX_TIME_H
 #define _UAPI_LINUX_TIME_H
 
+#include 
 #include 
 
 
+#if __UAPI_DEF_TIMESPEC
 #ifndef _STRUCT_TIMESPEC
 #define _STRUCT_TIMESPEC
 struct timespec {
@@ -11,35 +13,46 @@ struct timespec {
longtv_nsec;/* nanoseconds */
 };
 #endif
+#endif
 
+#if __UAPI_DEF_TIMEVAL
 struct timeval {
__kernel_time_t tv_sec; /* seconds */
__kernel_suseconds_ttv_usec;/* microseconds */
 };
+#endif
 
+#if __UAPI_DEF_TIMEZONE
 struct timezone {
int tz_minuteswest; /* minutes west of Greenwich */
int tz_dsttime; /* type of dst correction */
 };
+#endif
 
 
 /*
  * Names of the interval timers, and structure
  * defining a timer setting:
  */
+#if __UAPI_DEF_ITIMER_WHICH
 #

Re: [PATCH net] udp: prevent bugcheck if filter truncates packet too much

2016-07-09 Thread Willem de Bruijn

On Sat, Jul 9, 2016 at 6:43 AM, Michal Kubecek <mkube...@suse.cz> wrote:
> On Sat, Jul 09, 2016 at 11:48:49AM +0200, Daniel Borkmann wrote:
>> On 07/09/2016 02:20 AM, Alexei Starovoitov wrote:
>> >On Sat, Jul 09, 2016 at 01:31:40AM +0200, Eric Dumazet wrote:
>> >>On Fri, 2016-07-08 at 17:52 +0200, Michal Kubecek wrote:
>> >>>If socket filter truncates an udp packet below the length of UDP header
>> >>>in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a
>> >>>BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if
>> >>>kernel is configured that way) can be easily enforced by an unprivileged
>> >>>user which was reported as CVE-2016-6162. For a reproducer, see
>> >>>http://seclists.org/oss-sec/2016/q3/8
>> >>>
>> >>>Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before 
>> >>>queueing")
>> >>>Reported-by: Marco Grassi <marco@gmail.com>
>> >>>Signed-off-by: Michal Kubecek <mkube...@suse.cz>
>> >>>---
>
>> >>Acked-by: Eric Dumazet <eduma...@google.com>
>> >
>> >this is incomplete fix. Please do not apply. See discussion at 
>> >security@kernel
>>
>> Ohh well, didn't see it earlier before starting the discussion at 
>> security@...
>>
>> I'm okay if we take this for now as a quick band aid and find a better
>> way how to deal with the underlying issue long-term so that it's
>> /guaranteed/ that it doesn't bite us any further in such fragile ways.
>
> Agreed. As rc7 is due in a day or two, rushing a complex and intrusive
> solution in might be too risky.

Acked-by: Willem de Bruijn <will...@google.com>

Thanks, Michal.

Re: Missing include file in include/uapi/linux/errqueue.h?

2016-07-10 Thread Willem de Bruijn

On Sun, Jul 10, 2016 at 1:43 AM, Brooks Moses  wrote:
> On Sat, Jul 9, 2016 at 10:36 AM, Brooks Moses  wrote:
>> I've been attempting to qualify the Linux 4.5.2 user-space headers for
>> a toolchain release, and ran into what looks like a missing include
>> file in include/uapi/linux/errqueue.h.  In particular,
>> https://github.com/torvalds/linux/commit/f24b9be5957b38bb420b838115040dc2031b7d0c
>> adds the following to this file:
>>
>> +struct scm_timestamping {
>> + struct timespec ts[3];
>> +};
>>
>> However, struct timespec is defined in time.h, which isn't included
>> either in 4.5.2 or in current head.  Is this simply a missing #include
>> line,

It is. I missed that in my original patch.

> or am I misunderstanding something?
>
> As a followup: Unfortunately the obvious fix -- adding "#include
> " -- causes other problems, since linux/time.h is
> incompatible with the glibc time.h such that including both of them
> into the same compilation unit causes errors about redefined types.

If these conflicts between libc and uapi time.h can be resolved through
include/uapi/linux/libc-compat.h, then we can apply the obvious fix of
including linux/time.h in linux/errqueue.h.

> And we, at least, have some programs that want to include
> linux/errqueue.h and (glibc's) time.h.  The fix of adding "#include
> " to linux/errqueue.h seems to work for us, but I'm not sure
> that won't cause problems in the other direction for other people.

That breaks kernel compilation.

Re: Missing include file in include/uapi/linux/errqueue.h?

2016-07-10 Thread Willem de Bruijn

On Sun, Jul 10, 2016 at 9:56 AM, Willem de Bruijn <will...@google.com> wrote:
> On Sun, Jul 10, 2016 at 1:43 AM, Brooks Moses <bmo...@google.com> wrote:
>> On Sat, Jul 9, 2016 at 10:36 AM, Brooks Moses <bmo...@google.com> wrote:
>>> I've been attempting to qualify the Linux 4.5.2 user-space headers for
>>> a toolchain release, and ran into what looks like a missing include
>>> file in include/uapi/linux/errqueue.h.  In particular,
>>> https://github.com/torvalds/linux/commit/f24b9be5957b38bb420b838115040dc2031b7d0c
>>> adds the following to this file:
>>>
>>> +struct scm_timestamping {
>>> + struct timespec ts[3];
>>> +};
>>>
>>> However, struct timespec is defined in time.h, which isn't included
>>> either in 4.5.2 or in current head.  Is this simply a missing #include
>>> line,
>
> It is. I missed that in my original patch.
>
>> or am I misunderstanding something?
>>
>> As a followup: Unfortunately the obvious fix -- adding "#include
>> " -- causes other problems, since linux/time.h is
>> incompatible with the glibc time.h such that including both of them
>> into the same compilation unit causes errors about redefined types.
>
> If these conflicts between libc and uapi time.h can be resolved through
> include/uapi/linux/libc-compat.h, then we can apply the obvious fix of
> including linux/time.h in linux/errqueue.h.

The below patch (bar whitespace mangling by gmail) avoids redefinition
conflicts for me if linux/time.h is included after time.h. The other direction
would also require changes to the userspace headers, as stated in
libc-compat.h.

>> And we, at least, have some programs that want to include
>> linux/errqueue.h and (glibc's) time.h.  The fix of adding "#include
>> " to linux/errqueue.h seems to work for us, but I'm not sure
>> that won't cause problems in the other direction for other people.
>
> That breaks kernel compilation.

A simpler other option is to explicitly include either of the time.h
files depending on the environment, similar to uapi/linux/lightnvm.h.

timespec itself is also explicitly defined in uapi/linux/coda.h, but that
won't avoid redefinition if time.h is included afterwards.

The rough libc-compat.h patch:

diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 07bdce1..6b1cdc6 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -2,6 +2,7 @@
 #define _UAPI_LINUX_ERRQUEUE_H

 #include 
+#include 

 struct sock_extended_err {
__u32   ee_errno;
diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h
index e4f048e..ef33ea9 100644
--- a/include/uapi/linux/libc-compat.h
+++ b/include/uapi/linux/libc-compat.h
@@ -146,6 +146,16 @@
 #define __UAPI_DEF_XATTR   1
 #endif

+#if defined(_TIME_H)
+#define __UAPI_DEF_TIMESPEC0
+#define __UAPI_DEF_ITIMERSPEC  0
+#define __UAPI_DEF_ABSTIME 0
+#else
+#define __UAPI_DEF_TIMESPEC1
+#define __UAPI_DEF_ITIMERSPEC  1
+#define __UAPI_DEF_ABSTIME 1
+#endif /* _TIME_H */
+
 /* If we did not see any headers from any supported C libraries,
  * or we are being included in the kernel, then define everything
  * that we need. */
@@ -182,6 +192,11 @@
 /* Definitions for xattr.h */
 #define __UAPI_DEF_XATTR   1

+/* Definitions for time.h */
+#define __UAPI_DEF_TIMESPEC1
+#define __UAPI_DEF_ITIMERSPEC  1
+#define __UAPI_DEF_ABSTIME 1
+
 #endif /* __GLIBC__ */

 #endif /* _UAPI_LIBC_COMPAT_H */
diff --git a/include/uapi/linux/time.h b/include/uapi/linux/time.h
index e75e1b6..7129c54 100644
--- a/include/uapi/linux/time.h
+++ b/include/uapi/linux/time.h
@@ -1,9 +1,10 @@
 #ifndef _UAPI_LINUX_TIME_H
 #define _UAPI_LINUX_TIME_H

+#include 
 #include 

-
+#if __UAPI_DEF_TIMESPEC
 #ifndef _STRUCT_TIMESPEC
 #define _STRUCT_TIMESPEC
 struct timespec {
@@ -11,6 +12,7 @@ struct timespec {
longtv_nsec;/* nanoseconds */
 };
 #endif
+#endif

 struct timeval {
__kernel_time_t tv_sec; /* seconds */
@@ -31,10 +33,12 @@ struct timezone {
 #defineITIMER_VIRTUAL  1
 #defineITIMER_PROF 2

+#if __UAPI_DEF_TIMESPEC
 struct itimerspec {
struct timespec it_interval;/* timer period */
struct timespec it_value;   /* timer expiration */
 };
+#endif

 struct itimerval {
struct timeval it_interval; /* timer interval */
@@ -64,6 +68,8 @@ struct itimerval {
 /*
  * The various flags for setting POSIX.1b interval timers:
  */
+#if __UAPI_DEF_ABSTIME
 #define TIMER_ABSTIME  0x01
+#endif

 #endif /* _UAPI_LINUX_TIME_H */

Re: [PATCH tip/timers/core] uapi glibc compat: make linux/time.h compile after libc time.h files

2016-08-19 Thread Willem de Bruijn

On Wed, Aug 17, 2016 at 3:33 PM, John Stultz <john.stu...@linaro.org> wrote:
> On Wed, Aug 10, 2016 at 8:45 AM, Willem de Bruijn
> <willemdebruijn.ker...@gmail.com> wrote:
>> From: Willem de Bruijn <will...@google.com>
>>
>> Add libc-compat workaround for definitions in linux/time.h that
>> duplicate those in libc time.h, sys/time.h and bits/time.h.
>>
>> With this change, userspace builds succeeds when linux/time.h is
>> included after those libc files. The inverse requires changes to
>> those userspace headers.
>>
>> Without this patch, when compiling the following program after
>> make headers_install:
>>
>>   echo -e "#include \n#include " | \
>> gcc -Wall -Werror -Iusr/include -c -xc -
>>
>> gcc gives these errors:
>>
>>
>> Signed-off-by: Willem de Bruijn <will...@google.com>
>> ---
>>  include/uapi/linux/libc-compat.h | 50 
>> 
>>  include/uapi/linux/time.h| 15 
>>  2 files changed, 65 insertions(+)
>
>
> So I don't have any objection to this. But I'm not sure if I'm the
> right path for such a change to go through.
>
> From the commit history, it seems David Miller or Andrew Morton would
> be the right folks to take this.
>
> thanks
> -john

It is a prerequisite for including linux/time.h from
uapi/linux/errqueue.h, a network file

https://lkml.org/lkml/2016/7/10/10

If making changes to these two files through net-next does not cause
conflicts with other branches, I can resend this patch to netdev.

Re: [PATCH net-next 2/2] errqueue: include linux/time.h

2016-09-12 Thread Willem de Bruijn

On Mon, Sep 12, 2016 at 3:26 PM, kbuild test robot <l...@intel.com> wrote:
> Hi Willem,
>
> [auto build test ERROR on net-next/master]
>
> url:
> https://github.com/0day-ci/linux/commits/Willem-de-Bruijn/uapi-include-time-h-from-errqueue-h/20160913-020431
> config: i386-defconfig (attached as .config)
> compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386
>
> All errors (new ones prefixed by >>):

This error report shows that breakage can occur with applications that
include linux/errqueue.h before the libc time headers.

The libc-compat definitions in this patch set only fix compilation
when uapi headers are included after the userspace headers.

These errors indeed go away when errqueue.h is included after the
userspace time includes, as in the diff below.

For the inverse, the libc headers need additional #if __UAPI_DEF_FOO
changes, as described in include/uapli/linux/libc-compat.h. Those
changes are a noop without kernel definitions, so arguably that libc
patch should be merged before this kernel patch.

I will remove this patch set from the patchwork queue for now.

diff --git a/Documentation/networking/timestamping/txtimestamp.c
b/Documentation/networking/timestamping/txtimestamp.c
index 5df0704..f073801 100644
--- a/Documentation/networking/timestamping/txtimestamp.c
+++ b/Documentation/networking/timestamping/txtimestamp.c
@@ -37,7 +37,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -59,6 +58,7 @@
 #include 
 #include 
 #include 
+#include 
 #include

[PATCH net-next 0/2] uapi: include time.h from errqueue.h

2016-09-12 Thread Willem de Bruijn

From: Willem de Bruijn <will...@google.com>

It was reported that linux/errqueue.h requires linux/time.h, but that
adding the include directly may cause userspace conflicts between
linux/time.h and glibc time.h:

  https://lkml.org/lkml/2016/7/10/10

Address the conflicts using the standard libc-compat approach, then
add the #include to errqueue.h

The first patch is a resubmit. It was previously submitted to
tip/timers/core, but given the commit history, the maintainer
suggested this tree, instead.

  https://lkml.org/lkml/2016/8/10/748

This also allows sending the follow-up as part of the patchset.

Willem de Bruijn (2):
  uapi glibc compat: make linux/time.h compile with user time.h files
  errqueue: include linux/time.h

 include/uapi/linux/errqueue.h|  1 +
 include/uapi/linux/libc-compat.h | 50 
 include/uapi/linux/time.h| 15 
 3 files changed, 66 insertions(+)

-- 
2.8.0.rc3.226.g39d4020

[PATCH net-next 2/2] errqueue: include linux/time.h

2016-09-12 Thread Willem de Bruijn

From: Willem de Bruijn <will...@google.com>

struct scm_timestamping has fields of type struct timespec. Now that
it is safe to include linux/time.h and time.h at the same time,
include linux/time.h directly in linux/errqueue.h

Without this patch, when compiling the following program after
make headers_install:

gcc -Wall -Werror -Iusr/include -c -xc - <
  static struct scm_timestamping tss;
  int main(void) { tss.ts[0].tv_sec = 1; return 0; }
EOF

gcc gives this error:

  In file included from :1:0:
  usr/include/linux/errqueue.h:33:18: error: array type has incomplete element 
type
struct timespec ts[3];

Reported-by: Brooks Moses <bmo...@google.com>
Signed-off-by: Willem de Bruijn <will...@google.com>
---
 include/uapi/linux/errqueue.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 07bdce1..abafec8 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -1,6 +1,7 @@
 #ifndef _UAPI_LINUX_ERRQUEUE_H
 #define _UAPI_LINUX_ERRQUEUE_H
 
+#include 
 #include 
 
 struct sock_extended_err {
-- 
2.8.0.rc3.226.g39d4020

[PATCH net-next 1/2] uapi glibc compat: make linux/time.h compile with user time.h files

2016-09-12 Thread Willem de Bruijn

From: Willem de Bruijn <will...@google.com>

Add libc-compat workaround for definitions in linux/time.h that
duplicate those in libc time.h, sys/time.h and bits/time.h.

With this change, userspace builds succeeds when linux/time.h is
included after libc time.h and when it is included after sys/time.h.

The inverse requires additional changes to those userspace headers.

Without this patch, when compiling the following program after
make headers_install:

  echo -e "#include \n#include " | \
gcc -Wall -Werror -Iusr/include -c -xc -

gcc gives these errors:

  #include 
  #include 

In file included from ../test_time.c:3:0:
/usr/include/time.h:120:8: error: redefinition of ‘struct timespec’
 struct timespec
^
In file included from ../test_time.c:2:0:
./usr/include/linux/time.h:9:8: note: originally defined here
 struct timespec {
^
In file included from ../test_time.c:3:0:
/usr/include/time.h:161:8: error: redefinition of ‘struct itimerspec’
 struct itimerspec
^
In file included from ../test_time.c:2:0:
./usr/include/linux/time.h:34:8: note: originally defined here
 struct itimerspec {

and this warning by indirect inclusion of bits/time.h:

In file included from ../test_time.c:4:0:
./usr/include/linux/time.h:67:0: error: "TIMER_ABSTIME" redefined [-Werror]
 #define TIMER_ABSTIME   0x01
 ^
In file included from /usr/include/time.h:41:0,
 from ../test_time.c:3:
/usr/include/x86_64-linux-gnu/bits/time.h:82:0: note: this is the location 
of the previous definition
 #   define TIMER_ABSTIME  1
 ^

The _SYS_TIME_H variant resolves similar errors for timeval, timezone,
itimerval and warnings for ITIMER_REAL, ITIMER_VIRTUAL, ITIMER_PROF.

Ran the same program for sys/time.h and bits/time.h.

Signed-off-by: Willem de Bruijn <will...@google.com>
---
 include/uapi/linux/libc-compat.h | 50 
 include/uapi/linux/time.h| 15 
 2 files changed, 65 insertions(+)

diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h
index 44b8a6b..b08b0f5 100644
--- a/include/uapi/linux/libc-compat.h
+++ b/include/uapi/linux/libc-compat.h
@@ -165,6 +165,43 @@
 #define __UAPI_DEF_XATTR   1
 #endif
 
+/* Definitions for time.h */
+#if defined(__timespec_defined)
+#define __UAPI_DEF_TIMESPEC0
+#else
+#define __UAPI_DEF_TIMESPEC1
+#endif
+
+#if defined(_TIME_H) && defined(__USE_POSIX199309)
+#define __UAPI_DEF_ITIMERSPEC  0
+#else
+#define __UAPI_DEF_ITIMERSPEC  1
+#endif
+
+/* Definitions for sys/time.h */
+#if defined(_SYS_TIME_H)
+#define __UAPI_DEF_TIMEVAL 0
+#define __UAPI_DEF_ITIMERVAL   0
+#define __UAPI_DEF_ITIMER_WHICH0
+#else
+#define __UAPI_DEF_TIMEVAL 1
+#define __UAPI_DEF_ITIMERVAL   1
+#define __UAPI_DEF_ITIMER_WHICH1
+#endif
+
+/* Definitions for bits/time.h */
+#if defined(_BITS_TIME_H)
+#define __UAPI_DEF_ABSTIME 0
+#else
+#define __UAPI_DEF_ABSTIME 1
+#endif
+
+#if defined(_SYS_TIME_H) && defined(__USE_BSD)
+#define __UAPI_DEF_TIMEZONE0
+#else
+#define __UAPI_DEF_TIMEZONE1
+#endif
+
 /* If we did not see any headers from any supported C libraries,
  * or we are being included in the kernel, then define everything
  * that we need. */
@@ -208,6 +245,19 @@
 /* Definitions for xattr.h */
 #define __UAPI_DEF_XATTR   1
 
+/* Definitions for time.h */
+#define __UAPI_DEF_TIMESPEC1
+#define __UAPI_DEF_ITIMERSPEC  1
+
+/* Definitions for sys/time.h */
+#define __UAPI_DEF_TIMEVAL 1
+#define __UAPI_DEF_ITIMERVAL   1
+#define __UAPI_DEF_ITIMER_WHICH1
+#define __UAPI_DEF_TIMEZONE1
+
+/* Definitions for bits/time.h */
+#define __UAPI_DEF_ABSTIME 1
+
 #endif /* __GLIBC__ */
 
 #endif /* _UAPI_LIBC_COMPAT_H */
diff --git a/include/uapi/linux/time.h b/include/uapi/linux/time.h
index e75e1b6..4e7333c 100644
--- a/include/uapi/linux/time.h
+++ b/include/uapi/linux/time.h
@@ -1,9 +1,11 @@
 #ifndef _UAPI_LINUX_TIME_H
 #define _UAPI_LINUX_TIME_H
 
+#include 
 #include 
 
 
+#if __UAPI_DEF_TIMESPEC
 #ifndef _STRUCT_TIMESPEC
 #define _STRUCT_TIMESPEC
 struct timespec {
@@ -11,35 +13,46 @@ struct timespec {
longtv_nsec;/* nanoseconds */
 };
 #endif
+#endif
 
+#if __UAPI_DEF_TIMEVAL
 struct timeval {
__kernel_time_t tv_sec; /* seconds */
__kernel_suseconds_ttv_usec;/* microseconds */
 };
+#endif
 
+#if __UAPI_DEF_TIMEZONE
 struct timezone {
int tz_minuteswest; /* minutes west of Greenwich */
int tz_dsttime; /* type of dst correction */
 };
+#endif
 
 
 /*
  * Names of the interval timers, and structure
  * defining a timer set

Re: af_packet: use after free in prb_retire_rx_blk_timer_expired

2017-04-11 Thread Willem de Bruijn

On Mon, Apr 10, 2017 at 3:23 PM, Dave Jones  wrote:
> On Mon, Apr 10, 2017 at 07:03:30PM +, alexander.le...@verizon.com wrote:
>  > Hi all,
>  >
>  > I seem to be hitting this use-after-free on a -next kernel using trinity:
>  >
>  > [  531.036054] BUG: KASAN: use-after-free in 
> prb_retire_rx_blk_timer_expired (net/packet/af_packet.c:688)

The retire_blk_timer is called after the pg_vec struct for this ring
was freed. This should not happen. packet_set_ring stops the timer
with del_timer_sync when tearing down the ring before freeing that
struct:

if (closing && (po->tp_version > TPACKET_V2)) {
/* Because we don't support block-based V3 on tx-ring */
if (!tx_ring)
prb_shutdown_retire_blk_timer(po, rb_queue);
}

if (pg_vec)
free_pg_vec(pg_vec, order, req->tp_block_nr);

This is a similar race to the use-after-free fixed by 84ac7260236a
("packet: fix race condition in packet_set_ring"). The previous race
was triggered by a call to setsockopt PACKET_VERSION changing
tp_version while the ring is active. It is not immediately obvious
what is the cause now. I suppose trinity does not give a trace of such
system calls on this file descriptor? That would be helpful.

The bug report shows both a timer firing after the packet_set_ring
call that freed the pg_vec, and later a CONFIG_DEBUG_OBJECTS_FREE
warning that the timer is still active when the socket is closed on
release of the last file descriptor.

Re: net/ipv4: use-after-free in ipv4_datagram_support_cmsg

2017-04-12 Thread Willem de Bruijn

===
> BUG: KASAN: use-after-free in ipv4_datagram_support_cmsg
> net/ipv4/ip_sockglue.c:500 [inline] at addr 880059be0128

Thanks for the report. This is accessing skb->dev from within recvmsg() at line

info->ipi_ifindex = skb->dev->ifindex;

Introduced in 829ae9d61165 ("net-timestamp: allow reading recv cmsg on
errqueue with origin tstamp"). At this time the device may indeed have
gone away. I'm having a look at a way to read this in the receive BH
and store the ifindex.

Re: net/ipv4: use-after-free in ipv4_datagram_support_cmsg

2017-04-12 Thread Willem de Bruijn

On Wed, Apr 12, 2017 at 4:47 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> On Wed, 2017-04-12 at 13:07 -0700, Cong Wang wrote:
>> On Wed, Apr 12, 2017 at 8:39 AM, Willem de Bruijn
>> <willemdebruijn.ker...@gmail.com> wrote:
>> > ===
>> >> BUG: KASAN: use-after-free in ipv4_datagram_support_cmsg
>> >> net/ipv4/ip_sockglue.c:500 [inline] at addr 880059be0128
>> >
>> > Thanks for the report. This is accessing skb->dev from within recvmsg() at 
>> > line
>> >
>> > info->ipi_ifindex = skb->dev->ifindex;
>> >
>> > Introduced in 829ae9d61165 ("net-timestamp: allow reading recv cmsg on
>> > errqueue with origin tstamp"). At this time the device may indeed have
>> > gone away. I'm having a look at a way to read this in the receive BH
>> > and store the ifindex.
>>
>> Why not use skb_iif?

This code is called from the error path for transmit timestamps.

We can make use of the fact that SKB_EXT_ERR used on enqueue has iif as
the first field in its control block. This also holds for the PKTINFO_SKB_CB
struct to which skb->cb is cast on dequeue when it copies pktinfo to userspace.
So if set on enqueue in __skb_complete_tx_timestamp, no conversion operation
is even needed on dequeue, let alone the currently buggy line that touches
skb->dev.

This iif cast was added for this purpose in the receive path in 0b922b7a829c
("net: original ingress device index in PKTINFO").

The device pointer is valid on enqueue for all paths called from device drivers,
as well as from dev_queue_xmit for SCM_TSTAMP_SCHED generation in
__dev_queue_xmit. The exception is SCM_TSTAMP_ACK generation, but
there skb->dev is NULL.

The v6 path does need a conversion, but already does this in
ip6_datagram_recv_common_ctl. There, too, we can remove the buggy
logic to set it from skb->dev->ifindex in ip6_datagram_support_cmsg.

I will send a patch.

Re: net/ipv4: use-after-free in ipv4_datagram_support_cmsg

2017-04-12 Thread Willem de Bruijn

On Wed, Apr 12, 2017 at 6:25 PM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Wed, Apr 12, 2017 at 4:47 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>> On Wed, 2017-04-12 at 13:07 -0700, Cong Wang wrote:
>>> On Wed, Apr 12, 2017 at 8:39 AM, Willem de Bruijn
>>> <willemdebruijn.ker...@gmail.com> wrote:
>>> > ===
>>> >> BUG: KASAN: use-after-free in ipv4_datagram_support_cmsg
>>> >> net/ipv4/ip_sockglue.c:500 [inline] at addr 880059be0128
>>> >
>>> > Thanks for the report. This is accessing skb->dev from within recvmsg() 
>>> > at line
>>> >
>>> > info->ipi_ifindex = skb->dev->ifindex;
>>> >
>>> > Introduced in 829ae9d61165 ("net-timestamp: allow reading recv cmsg on
>>> > errqueue with origin tstamp"). At this time the device may indeed have
>>> > gone away. I'm having a look at a way to read this in the receive BH
>>> > and store the ifindex.
>>>
>>> Why not use skb_iif?
>
> This code is called from the error path for transmit timestamps.
>
> We can make use of the fact that SKB_EXT_ERR used on enqueue has iif as
> the first field in its control block. This also holds for the PKTINFO_SKB_CB
> struct to which skb->cb is cast on dequeue when it copies pktinfo to 
> userspace.
> So if set on enqueue in __skb_complete_tx_timestamp, no conversion operation
> is even needed on dequeue, let alone the currently buggy line that touches
> skb->dev.
>
> This iif cast was added for this purpose in the receive path in 0b922b7a829c
> ("net: original ingress device index in PKTINFO").
>
> The device pointer is valid on enqueue for all paths called from device 
> drivers,
> as well as from dev_queue_xmit for SCM_TSTAMP_SCHED generation in
> __dev_queue_xmit. The exception is SCM_TSTAMP_ACK generation, but
> there skb->dev is NULL.
>
> The v6 path does need a conversion, but already does this in
> ip6_datagram_recv_common_ctl. There, too, we can remove the buggy
> logic to set it from skb->dev->ifindex in ip6_datagram_support_cmsg.
>
> I will send a patch.

Sent http://patchwork.ozlabs.org/patch/750197

Re: [PATCH v06 18/36] uapi linux/errqueue.h: include linux/time.h in user space

2017-08-06 Thread Willem de Bruijn

>> > +#ifdef __KERNEL__
>> > +#include 
>> > +#else
>> > +#include 
>> > +#endif /* __KERNEL__ */
>>
>> This will break applications that include  manually.
>> I previously sent a patch to use libc-compat to make compilation succeed
>> when both are included in the case where  is included after
>> .
>>
>>   https://lkml.org/lkml/2016/9/12/872
>>
>> The inverse will require changes to the libc header to avoid redefining
>> symbols already defined by 
>>
>> The second patch in that 2-patch set included 
>> unconditionally after the fix. This broke builds that also included
>>  in the wrong order. I did not resubmit the first patch as a
>> stand-alone, as it is not sufficient to avoid breakage.
>
> I wasn't aware of your change, but I was about to send this to fix the
> case when glibc  is included before :
>
> https://github.com/mcfrisk/linux/commit/f3952a27b8a21c6478d26e6246055383483f6a66

There are a few differences between the two. Including  does not
unconditionally define all the symbols. Some are conditional on additional
state, such as __timespec_defined.

> but you also ran into problems where  is included before
>  which need fixes in libc header side.
>
> So how to proceed with these?

The libc-compat change is a good fix that can be submitted on its own.

> I don't like leaving a few dozen non-compiling header files into uapi.

I agree, but I do not see a simple solution.

Unless libc has the analogous change, including either  or
 in userspace can unfortunately cause breakage.

The added include if __KERNEL__ is defined should be safe, though.

Re: [PATCH v06 18/36] uapi linux/errqueue.h: include linux/time.h in user space

2017-08-06 Thread Willem de Bruijn

On Sun, Aug 6, 2017 at 5:52 PM, Mikko Rapeli <mikko.rap...@iki.fi> wrote:
> On Sun, Aug 06, 2017 at 05:42:13PM -0400, Willem de Bruijn wrote:
>> On Sun, Aug 6, 2017 at 5:33 PM, Mikko Rapeli <mikko.rap...@iki.fi> wrote:
>> > On Sun, Aug 06, 2017 at 05:24:20PM -0400, Willem de Bruijn wrote:
>> >> >> > +#ifdef __KERNEL__
>> >> >> > +#include 
>> >> >> > +#else
>> >> >> > +#include 
>> >> >> > +#endif /* __KERNEL__ */
>> >> >>
>> >> >> This will break applications that include  manually.
>> >> >> I previously sent a patch to use libc-compat to make compilation 
>> >> >> succeed
>> >> >> when both are included in the case where  is included 
>> >> >> after
>> >> >> .
>> >> >>
>> >> >>   https://lkml.org/lkml/2016/9/12/872
>> >> >>
>> >> >> The inverse will require changes to the libc header to avoid redefining
>> >> >> symbols already defined by 
>> >> >>
>> >> >> The second patch in that 2-patch set included 
>> >> >> unconditionally after the fix. This broke builds that also included
>> >> >>  in the wrong order. I did not resubmit the first patch as a
>> >> >> stand-alone, as it is not sufficient to avoid breakage.
>> >> >
>> >> > I wasn't aware of your change, but I was about to send this to fix the
>> >> > case when glibc  is included before :
>> >> >
>> >> > https://github.com/mcfrisk/linux/commit/f3952a27b8a21c6478d26e6246055383483f6a66
>> >>
>> >> There are a few differences between the two. Including  does not
>> >> unconditionally define all the symbols. Some are conditional on additional
>> >> state, such as __timespec_defined.
>> >
>> > Yep, your patch seems better for libc-compat.h. Could you send it again?
>>
>> Okay. Or feel free to include it in the patchset if that helps resolve
>> dependencies.
>
> If you don't have the time, I will send tomorrow a new version of this
> patch which fixes the commit topic and before that your libc-compat.h change
> so both could be applied together.

Please do. Thanks!

Re: [PATCH v06 18/36] uapi linux/errqueue.h: include linux/time.h in user space

2017-08-06 Thread Willem de Bruijn

On Sun, Aug 6, 2017 at 5:33 PM, Mikko Rapeli <mikko.rap...@iki.fi> wrote:
> On Sun, Aug 06, 2017 at 05:24:20PM -0400, Willem de Bruijn wrote:
>> >> > +#ifdef __KERNEL__
>> >> > +#include 
>> >> > +#else
>> >> > +#include 
>> >> > +#endif /* __KERNEL__ */
>> >>
>> >> This will break applications that include  manually.
>> >> I previously sent a patch to use libc-compat to make compilation succeed
>> >> when both are included in the case where  is included after
>> >> .
>> >>
>> >>   https://lkml.org/lkml/2016/9/12/872
>> >>
>> >> The inverse will require changes to the libc header to avoid redefining
>> >> symbols already defined by 
>> >>
>> >> The second patch in that 2-patch set included 
>> >> unconditionally after the fix. This broke builds that also included
>> >>  in the wrong order. I did not resubmit the first patch as a
>> >> stand-alone, as it is not sufficient to avoid breakage.
>> >
>> > I wasn't aware of your change, but I was about to send this to fix the
>> > case when glibc  is included before :
>> >
>> > https://github.com/mcfrisk/linux/commit/f3952a27b8a21c6478d26e6246055383483f6a66
>>
>> There are a few differences between the two. Including  does not
>> unconditionally define all the symbols. Some are conditional on additional
>> state, such as __timespec_defined.
>
> Yep, your patch seems better for libc-compat.h. Could you send it again?

Okay. Or feel free to include it in the patchset if that helps resolve
dependencies.

>> > I don't like leaving a few dozen non-compiling header files into uapi.
>>
>> I agree, but I do not see a simple solution.
>>
>> Unless libc has the analogous change, including either  or
>>  in userspace can unfortunately cause breakage.
>>
>> The added include if __KERNEL__ is defined should be safe, though.
>
> Yes, for the kernel side, but your libc-compat change would nice for
> userspace, where something will break for sure, but providing source
> API compatibility is sometimes impossible.
>
> To summarize, this change from me, and your libc-compat.c for time.h, or?

I'm still afraid that this patch as is will break builds that include
 first.

Re: [PATCH v06 18/36] uapi linux/errqueue.h: include linux/time.h in user space

2017-08-06 Thread Willem de Bruijn

On Sun, Aug 6, 2017 at 4:23 PM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Sun, Aug 6, 2017 at 12:44 PM, Mikko Rapeli <mikko.rap...@iki.fi> wrote:
>> linux/time.h conflicts with user space header time.h. Try to be compatible
>> with both.
>>
>> Fixes userspace compilation error:
>>
>> error: array type has incomplete element type
>>  struct timespec ts[3];
>>
>> Signed-off-by: Mikko Rapeli <mikko.rap...@iki.fi>
>> Cc: Willem de Bruijn <will...@google.com>
>> Cc: Soheil Hassas Yeganeh <soh...@google.com>
>> Cc: net...@vger.kernel.org
>> ---
>>  include/uapi/linux/errqueue.h | 6 ++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
>> index 07bdce1f444a..b310b2c6d94f 100644
>> --- a/include/uapi/linux/errqueue.h
>> +++ b/include/uapi/linux/errqueue.h
>> @@ -3,6 +3,12 @@
>>
>>  #include 
>>
>> +#ifdef __KERNEL__
>> +#include 
>> +#else
>> +#include 
>> +#endif /* __KERNEL__ */
>
> This will break applications that include  manually.

Also, the patch title reads "include  in user space",
but it includes  in that environment.

Re: [PATCH v06 18/36] uapi linux/errqueue.h: include linux/time.h in user space

2017-08-06 Thread Willem de Bruijn

On Sun, Aug 6, 2017 at 12:44 PM, Mikko Rapeli <mikko.rap...@iki.fi> wrote:
> linux/time.h conflicts with user space header time.h. Try to be compatible
> with both.
>
> Fixes userspace compilation error:
>
> error: array type has incomplete element type
>  struct timespec ts[3];
>
> Signed-off-by: Mikko Rapeli <mikko.rap...@iki.fi>
> Cc: Willem de Bruijn <will...@google.com>
> Cc: Soheil Hassas Yeganeh <soh...@google.com>
> Cc: net...@vger.kernel.org
> ---
>  include/uapi/linux/errqueue.h | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
> index 07bdce1f444a..b310b2c6d94f 100644
> --- a/include/uapi/linux/errqueue.h
> +++ b/include/uapi/linux/errqueue.h
> @@ -3,6 +3,12 @@
>
>  #include 
>
> +#ifdef __KERNEL__
> +#include 
> +#else
> +#include 
> +#endif /* __KERNEL__ */

This will break applications that include  manually.

I previously sent a patch to use libc-compat to make compilation succeed
when both are included in the case where  is included after
.

  https://lkml.org/lkml/2016/9/12/872

The inverse will require changes to the libc header to avoid redefining
symbols already defined by 

The second patch in that 2-patch set included 
unconditionally after the fix. This broke builds that also included
 in the wrong order. I did not resubmit the first patch as a
stand-alone, as it is not sufficient to avoid breakage.

Re: linux-next: build failure after merge of the netfilter tree

2017-05-17 Thread Willem de Bruijn

On Wed, May 17, 2017 at 1:02 AM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Tue, May 16, 2017 at 11:45 PM, Stephen Rothwell <s...@canb.auug.org.au> 
> wrote:
>> Hi all,
>>
>> After merging the netfilter tree, today's linux-next build (i386
>> defconfig) failed like this:
>>
>> net/netfilter/x_tables.c: In function 'xt_match_to_user':
>> net/netfilter/x_tables.c:303:13: error: implicit declaration of function 
>> 'COMPAT_XT_ALIGN' [-Werror=implicit-function-declaration]
>> C_SIZE ? COMPAT_XT_ALIGN(C_SIZE) :  \
>>  ^
>> net/netfilter/x_tables.c:310:9: note: in expansion of macro 'XT_DATA_TO_USER'
>>  XT_DATA_TO_USER(u, m, match, 0);
>>  ^
>>
>> Caused by commit
>>
>>   324318f0248c ("netfilter: xtables: zero padding in data_to_user")
>>
>> In the !CONFIG_COMPAT case C_SIZE will always be zero, but the compiler
>> is still looking for the macro :-(
>
> Apologies for the breakage.
>
>> I added this cludge patch (I am sure it can be done better):
>>
>> From: Stephen Rothwell <s...@canb.auug.org.au>
>> Date: Wed, 17 May 2017 13:36:26 +1000
>> Subject: [PATCH] netfilter: xtables: fix for zero padding in data_to_user
>>
>> Signed-off-by: Stephen Rothwell <s...@canb.auug.org.au>
>> ---
>>  net/netfilter/x_tables.c | 8 
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
>> index d17769599c10..2b1785993a92 100644
>> --- a/net/netfilter/x_tables.c
>> +++ b/net/netfilter/x_tables.c
>> @@ -296,12 +296,20 @@ int xt_data_to_user(void __user *dst, const void *src,
>>  }
>>  EXPORT_SYMBOL_GPL(xt_data_to_user);
>>
>> +#ifdef CONFIG_COMPAT
>>  #define XT_DATA_TO_USER(U, K, TYPE, C_SIZE)\
>> xt_data_to_user(U->data, K->data,   \
>> K->u.kernel.TYPE->usersize, \
>> C_SIZE ? : K->u.kernel.TYPE->TYPE##size,\
>> C_SIZE ? COMPAT_XT_ALIGN(C_SIZE) :  \
>>  XT_ALIGN(K->u.kernel.TYPE->TYPE##size))
>> +#else
>> +#define XT_DATA_TO_USER(U, K, TYPE, C_SIZE)\
>> +   xt_data_to_user(U->data, K->data,   \
>> +   K->u.kernel.TYPE->usersize, \
>> +   C_SIZE ? : K->u.kernel.TYPE->TYPE##size,\
>> +   C_SIZE ? : XT_ALIGN(K->u.kernel.TYPE->TYPE##size))
>> +#endif
>>
>
> I will send a patch to the netfilter to define a separate
> COMPAT_XT_DATA_TO_USER inside the CONFIG_COMPAT region further down
> the file. This also allows simplifying XT_DATA_TO_USER by removing
> those ternary statements.

Full patch out for review at http://patchwork.ozlabs.org/patch/763655/

Re: linux-next: build failure after merge of the netfilter tree

2017-05-16 Thread Willem de Bruijn

On Tue, May 16, 2017 at 11:45 PM, Stephen Rothwell  
wrote:
> Hi all,
>
> After merging the netfilter tree, today's linux-next build (i386
> defconfig) failed like this:
>
> net/netfilter/x_tables.c: In function 'xt_match_to_user':
> net/netfilter/x_tables.c:303:13: error: implicit declaration of function 
> 'COMPAT_XT_ALIGN' [-Werror=implicit-function-declaration]
> C_SIZE ? COMPAT_XT_ALIGN(C_SIZE) :  \
>  ^
> net/netfilter/x_tables.c:310:9: note: in expansion of macro 'XT_DATA_TO_USER'
>  XT_DATA_TO_USER(u, m, match, 0);
>  ^
>
> Caused by commit
>
>   324318f0248c ("netfilter: xtables: zero padding in data_to_user")
>
> In the !CONFIG_COMPAT case C_SIZE will always be zero, but the compiler
> is still looking for the macro :-(

Apologies for the breakage.

> I added this cludge patch (I am sure it can be done better):
>
> From: Stephen Rothwell 
> Date: Wed, 17 May 2017 13:36:26 +1000
> Subject: [PATCH] netfilter: xtables: fix for zero padding in data_to_user
>
> Signed-off-by: Stephen Rothwell 
> ---
>  net/netfilter/x_tables.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index d17769599c10..2b1785993a92 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -296,12 +296,20 @@ int xt_data_to_user(void __user *dst, const void *src,
>  }
>  EXPORT_SYMBOL_GPL(xt_data_to_user);
>
> +#ifdef CONFIG_COMPAT
>  #define XT_DATA_TO_USER(U, K, TYPE, C_SIZE)\
> xt_data_to_user(U->data, K->data,   \
> K->u.kernel.TYPE->usersize, \
> C_SIZE ? : K->u.kernel.TYPE->TYPE##size,\
> C_SIZE ? COMPAT_XT_ALIGN(C_SIZE) :  \
>  XT_ALIGN(K->u.kernel.TYPE->TYPE##size))
> +#else
> +#define XT_DATA_TO_USER(U, K, TYPE, C_SIZE)\
> +   xt_data_to_user(U->data, K->data,   \
> +   K->u.kernel.TYPE->usersize, \
> +   C_SIZE ? : K->u.kernel.TYPE->TYPE##size,\
> +   C_SIZE ? : XT_ALIGN(K->u.kernel.TYPE->TYPE##size))
> +#endif
>

I will send a patch to the netfilter to define a separate
COMPAT_XT_DATA_TO_USER inside the CONFIG_COMPAT region further down
the file. This also allows simplifying XT_DATA_TO_USER by removing
those ternary statements.

-#define XT_DATA_TO_USER(U, K, TYPE, C_SIZE)\
+#define XT_DATA_TO_USER(U, K, TYPE)\
xt_data_to_user(U->data, K->data,   \
K->u.kernel.TYPE->usersize, \
-   C_SIZE ? : K->u.kernel.TYPE->TYPE##size,\
-   C_SIZE ? COMPAT_XT_ALIGN(C_SIZE) :  \
-XT_ALIGN(K->u.kernel.TYPE->TYPE##size))
+   K->u.kernel.TYPE->TYPE##size,   \
+   XT_ALIGN(K->u.kernel.TYPE->TYPE##size))

and

+#define COMPAT_XT_DATA_TO_USER(U, K, TYPE, C_SIZE) \
+   xt_data_to_user(U->data, K->data,   \
+   K->u.kernel.TYPE->usersize, \
+   C_SIZE, \
+   COMPAT_XT_ALIGN(C_SIZE))
+

Re: linux-next: build failure after merge of the net-next tree

2017-05-21 Thread Willem de Bruijn

On Sun, May 21, 2017 at 9:16 PM, Stephen Rothwell  wrote:
> Hi all,
>
> After merging the net-next tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
>
> net/socket.c: In function 'put_ts_pktinfo':
> net/socket.c:695:28: error: 'SCM_TIMESTAMPING_PKTINFO' undeclared (first use 
> in this function)
>   put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPING_PKTINFO,
> ^
>
> Caused by commit
>
>   aad9c8c470f2 ("net: add new control message for incoming HW-timestamped 
> packets")
>
> This probably broke every architecture that has its own
> arch//include/uapi/asm/socket.h that does not include
> include/uapi/asm-generic/socket.h :-(

Indeed. I added the architecture specific versions in patch

  http://patchwork.ozlabs.org/patch/765238/

That fixes the powerpc build for me. The new option is now
defined in every file that also defines the last added such
option SCM_TIMESTAMPING_OPT_STATS. Apologies for
missing this earlier.

Re: [PATCH] net/packet: fix race condition between fanout_add and __unregister_prot_hook

2017-09-14 Thread Willem de Bruijn

On Wed, Sep 13, 2017 at 10:40 PM, nixiaoming  wrote:
> If fanout_add is preempted after running po-> fanout = match
> and before running __fanout_link,
> it will cause BUG_ON when __unregister_prot_hook call __fanout_unlink
>
> so, we need add mutex_lock(_mutex) to __unregister_prot_hook
> or add spin_lock(>bind_lock) before po-> fanout = match
>
> test on linux 4.1.42:
> ./trinity -c setsockopt -C 2 -X &
>
> BUG: failure at net/packet/af_packet.c:1414/__fanout_unlink()!
> Kernel panic - not syncing: BUG!
> CPU: 2 PID: 2271 Comm: trinity-c0 Tainted: GW  O4.1.12 #1
> Hardware name: Hisilicon PhosphorHi1382 FPGA (DT)
> Call trace:
> [] dump_backtrace+0x0/0xf8
> [] show_stack+0x20/0x28
> [] dump_stack+0xac/0xe4
> [] panic+0xf8/0x268
> [] __unregister_prot_hook+0xa0/0x144
> [] packet_set_ring+0x280/0x5b4
> [] packet_setsockopt+0x320/0x950
> [] SyS_setsockopt+0xa4/0xd4
>
> Signed-off-by: nixiaoming 
> Tested-by: wudesheng 
> ---
>  net/packet/af_packet.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 008a45c..0300146 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -365,10 +365,12 @@ static void __unregister_prot_hook(struct sock *sk, 
> bool sync)
>
> po->running = 0;
>
> +   mutex_lock(_mutex);
> if (po->fanout)
> __fanout_unlink(sk, po);
> else
> __dev_remove_pack(>prot_hook);
> +   mutex_unlock(_mutex);
>
> __sock_put(sk);

I happened to be looking at the same or a very similar race, courtesy
of syzkaller. packet_set_ring and fanout_add can race.

I believe that one bug is in fanout_add removing the socket
protocol hook and adding the fanout protocol hook without holding
po->bind_lock.

That lock ensures atomic updates to po->running and the actual
protocol hook. fanout_add tests po->running without holding the lock

   if (!po->running)
goto out;

and later unconditionally unbinds the socket protocol hook and binds
the fanout group protocol hook:

   if (refcount_read(>sk_ref) < PACKET_FANOUT_MAX) {
__dev_remove_pack(>prot_hook);
po->fanout = match;
refcount_set(>sk_ref,
refcount_read(>sk_ref) + 1);
__fanout_link(sk, po);
err = 0;
}

This can happen after packet_set_ring has already removed the
protocol hook, causing the socket to be added to the fanout list
twice.

Testing po->running again, this time while holding the bind_lock,
ensures that packet_set_ring cannot have dropped it in between:

+   spin_lock(>bind_lock);
+   if (!po->running) {
+   net_err_ratelimited("fanout add, but
unbound sock");
+   err = -EFAULT;
+   spin_unlock(>bind_lock);
+   goto out;
+   }
+   __dev_remove_pack(>prot_hook));
po->fanout = match;
refcount_set(>sk_ref,
refcount_read(>sk_ref) + 1);
__fanout_link(sk, po);
+   spin_unlock(>bind_lock);

I verified that the reproducer logs plenty of "fanout add, but unbound
sock" messages.

I intend to send this fix after cleaning it up a bit. Will take a
closer look at your patch to see whether these are indeed the
same bug report.

Re: [PATCH] net/packet: fix race condition between fanout_add and __unregister_prot_hook

2017-09-14 Thread Willem de Bruijn

On Thu, Sep 14, 2017 at 10:07 AM, nixiaoming  wrote:
> From: l00219569 
>
> If fanout_add is preempted after running po-> fanout = match
> and before running __fanout_link,
> it will cause BUG_ON when __unregister_prot_hook call __fanout_unlink
>
> so, we need add mutex_lock(_mutex) to __unregister_prot_hook

The packet socket code has no shortage of locks, so there are many
ways to avoid the race condition between fanout_add and packet_set_ring.

Another option would be to lock the socket when calling fanout_add:

-   return fanout_add(sk, val & 0x, val >> 16);
+   lock_sock(sk);
+   ret = fanout_add(sk, val & 0x, val >> 16);
+   release_sock(sk);
+   return ret;

But, for consistency, and to be able to continue to make sense of the
locking policy, we should use the most appropriate lock. This
is po->bind_lock, as it ensures atomicity between testing whether
a protocol hook is active through po->running and the actual existence
of that hook on the protocol hook list.

fanout_mutex protects the fanout object's list. Taking that on
__unregister_prot_hook even in the case where fanout is not
used (and __dev_remove_pack is called) complicates locking
in this already complicated code.

> or add spin_lock(>bind_lock) before po-> fanout = match
>
> this is a patch for add po->bind_lock in fanout_add
>
> test on linux 4.1.12:
> ./trinity -c setsockopt -C 2 -X &

Thanks for testing!

>
> BUG: failure at net/packet/af_packet.c:1414/__fanout_unlink()!
> Kernel panic - not syncing: BUG!
> CPU: 2 PID: 2271 Comm: trinity-c0 Tainted: GW  O4.1.12 #1
> Hardware name: Hisilicon PhosphorHi1382 FPGA (DT)
> Call trace:
> [] dump_backtrace+0x0/0xf8
> [] show_stack+0x20/0x28
> [] dump_stack+0xac/0xe4
> [] panic+0xf8/0x268
> [] __unregister_prot_hook+0xa0/0x144
> [] packet_set_ring+0x280/0x5b4
> [] packet_setsockopt+0x320/0x950
> [] SyS_setsockopt+0xa4/0xd4
>
> Signed-off-by: nixiaoming 
> Tested-by: wudesheng 
> ---
>  net/packet/af_packet.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 54a18a8..7a52a3b 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -1446,12 +1446,16 @@ static int fanout_add(struct sock *sk, u16 id, u16 
> type_flags)
> default:
> return -EINVAL;
> }
> -
> -   if (!po->running)
> +   spin_lock(>bind_lock);
> +   if (!po->running) {
> +   spin_unlock(>bind_lock);
> return -EINVAL;
> +   }
>
> -   if (po->fanout)
> +   if (po->fanout) {
> +   spin_unlock(>bind_lock);
> return -EALREADY;
> +   }
>
> mutex_lock(_mutex);
> match = NULL;
> @@ -1501,6 +1505,7 @@ static int fanout_add(struct sock *sk, u16 id, u16 
> type_flags)
> }
>  out:
> mutex_unlock(_mutex);
> +   spin_unlock(>bind_lock);

This function can call kzalloc with GFP_KERNEL, which may sleep. It is
not correct to sleep while holding a spinlock. Which is why I take the lock
later and test po->running again.

I will clean up that patch and send it for review.

Re: [PATCH] net/packet: fix race condition between fanout_add and __unregister_prot_hook

2017-09-15 Thread Willem de Bruijn

On Fri, Sep 15, 2017 at 1:41 PM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
> On Thu, Sep 14, 2017 at 7:35 AM, Willem de Bruijn
> <willemdebruijn.ker...@gmail.com> wrote:
>> On Thu, Sep 14, 2017 at 10:07 AM, nixiaoming <nixiaom...@huawei.com> wrote:
>>> From: l00219569 <lisi...@huawei.com>
>>>
>>> If fanout_add is preempted after running po-> fanout = match
>>> and before running __fanout_link,
>>> it will cause BUG_ON when __unregister_prot_hook call __fanout_unlink
>>>
>>> so, we need add mutex_lock(_mutex) to __unregister_prot_hook
>>
>> The packet socket code has no shortage of locks, so there are many
>> ways to avoid the race condition between fanout_add and packet_set_ring.
>>
>> Another option would be to lock the socket when calling fanout_add:
>>
>> -   return fanout_add(sk, val & 0x, val >> 16);
>> +   lock_sock(sk);
>> +   ret = fanout_add(sk, val & 0x, val >> 16);
>> +   release_sock(sk);
>> +   return ret;
>>
>
> I don't think this is an option, because __unregister_prot_hook()
> can be called without lock_sock(), for example in packet_notifier().
>
>
>> But, for consistency, and to be able to continue to make sense of the
>> locking policy, we should use the most appropriate lock. This
>> is po->bind_lock, as it ensures atomicity between testing whether
>> a protocol hook is active through po->running and the actual existence
>> of that hook on the protocol hook list.
>
> Yeah, register_prot_hook() and unregister_prot_hook() already assume
> bind_lock.
>
> [...]
>
>>>  out:
>>> mutex_unlock(_mutex);
>>> +   spin_unlock(>bind_lock);
>>
>> This function can call kzalloc with GFP_KERNEL, which may sleep. It is
>> not correct to sleep while holding a spinlock. Which is why I take the lock
>> later and test po->running again.
>
>
> Right, no need to mention the mutex_unlock() before the spin_unlock()
> is clearly wrong.
>
>
>>
>> I will clean up that patch and send it for review.
>
> How about the following patch?
>
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index c26172995511..f5c696a548ed 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -1754,10 +1754,14 @@ static int fanout_add(struct sock *sk, u16 id,
> u16 type_flags)
> match->prot_hook.dev == po->prot_hook.dev) {
> err = -ENOSPC;
> if (refcount_read(>sk_ref) < PACKET_FANOUT_MAX) {
> +   spin_lock(>bind_lock);
> __dev_remove_pack(>prot_hook);
> -   po->fanout = match;
> -   refcount_set(>sk_ref,
> refcount_read(>sk_ref) + 1);
> -   __fanout_link(sk, po);
> +   if (po->running) {
> +   refcount_set(>sk_ref,
> refcount_read(>sk_ref) + 1);
> +   po->fanout = match;
> +   __fanout_link(sk, po);
> +   }
> +   spin_unlock(>bind_lock);
> err = 0;
> }
> }

In case of failure we also need to unlink and free match. I
sent the following:

http://patchwork.ozlabs.org/patch/813945/

Re: Re: [PATCH] net/packet: fix race condition between fanout_add and __unregister_prot_hook

2017-09-19 Thread Willem de Bruijn

On Tue, Sep 19, 2017 at 3:21 AM, Nixiaoming <nixiaom...@huawei.com> wrote:
> On Fri, Sep 15, 2017 at 10:46 AM, Willem de Bruijn
>
> <willemdebruijn.ker...@gmail.com> wrote:
>
>>
>
>> In case of failure we also need to unlink and free match. I
>
>> sent the following:
>
>>
>
>> http://patchwork.ozlabs.org/patch/813945/
>
>
>
> +   spin_lock(>bind_lock);
>
> +   if (po->running &&
>
> +   match->type == type &&
>
>match->prot_hook.type == po->prot_hook.type &&
>
>match->prot_hook.dev == po->prot_hook.dev) {
>
> err = -ENOSPC;
>
> @@ -1761,6 +1760,13 @@  static int fanout_add(struct sock *sk, u16 id, u16
> type_flags)
>
>   err = 0;
>
> }
>
>}
>
> +   spin_unlock(>bind_lock);
>
> +
>
> +   if (err && !refcount_read(>sk_ref)) {
>
> +list_del(>list);
>
> +kfree(match);
>
> +   }
>
>
>
>
>
> In the function fanout_add add spin_lock to protect po-> running and po->
> fanout,
>
> then whether it should be in the function fanout_release also add spin_lock
> protection ?

po->bind_lock is held when registering and unregistering the
protocol hook. fanout_release does access po->running or
prot_hook.

It is called from packet_release, which does hold the bind_lock
when unregistering the protocol hook.

Re: Re: [PATCH] net/packet: fix race condition between fanout_add and __unregister_prot_hook

2017-09-19 Thread Willem de Bruijn

On Tue, Sep 19, 2017 at 12:09 PM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Tue, Sep 19, 2017 at 3:21 AM, Nixiaoming <nixiaom...@huawei.com> wrote:
>> On Fri, Sep 15, 2017 at 10:46 AM, Willem de Bruijn
>>
>> <willemdebruijn.ker...@gmail.com> wrote:
>>
>>>
>>
>>> In case of failure we also need to unlink and free match. I
>>
>>> sent the following:
>>
>>>
>>
>>> http://patchwork.ozlabs.org/patch/813945/
>>
>>
>>
>> +   spin_lock(>bind_lock);
>>
>> +   if (po->running &&
>>
>> +   match->type == type &&
>>
>>match->prot_hook.type == po->prot_hook.type &&
>>
>>match->prot_hook.dev == po->prot_hook.dev) {
>>
>> err = -ENOSPC;
>>
>> @@ -1761,6 +1760,13 @@  static int fanout_add(struct sock *sk, u16 id, u16
>> type_flags)
>>
>>   err = 0;
>>
>> }
>>
>>}
>>
>> +   spin_unlock(>bind_lock);
>>
>> +
>>
>> +   if (err && !refcount_read(>sk_ref)) {
>>
>> +list_del(>list);
>>
>> +kfree(match);
>>
>> +   }
>>
>>
>>
>>
>>
>> In the function fanout_add add spin_lock to protect po-> running and po->
>> fanout,
>>
>> then whether it should be in the function fanout_release also add spin_lock
>> protection ?
>
> po->bind_lock is held when registering and unregistering the
> protocol hook. fanout_release does access po->running or
> prot_hook.

whoops. does *not* access.

Re: [PATCH 2/3] selftests: actually run the various net selftests

2017-09-19 Thread Willem de Bruijn

On Tue, Sep 19, 2017 at 9:34 AM, Josef Bacik <jo...@toxicpanda.com> wrote:
> On Mon, Sep 18, 2017 at 04:14:41PM -0600, Shuah Khan wrote:
>> On 09/18/2017 11:32 AM, jo...@toxicpanda.com wrote:
>> > From: Josef Bacik <jba...@fb.com>
>> >
>> > These self tests are just self contained binaries, they are not run by
>> > any of the scripts in the directory.  This means they need to be marked
>> > with TEST_GEN_PROGS to actually be run, not TEST_GEN_FILES.
>> >
>> > Signed-off-by: Josef Bacik <jba...@fb.com>
>> > ---
>> >  tools/testing/selftests/net/Makefile | 4 ++--
>> >  1 file changed, 2 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/tools/testing/selftests/net/Makefile 
>> > b/tools/testing/selftests/net/Makefile
>> > index 3df542c84610..45a4e77a47c4 100644
>> > --- a/tools/testing/selftests/net/Makefile
>> > +++ b/tools/testing/selftests/net/Makefile
>> > @@ -6,8 +6,8 @@ CFLAGS += -I../../../../usr/include/
>> >  TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
>> > rtnetlink.sh
>> >  TEST_GEN_FILES =  socket
>> >  TEST_GEN_FILES += psock_fanout psock_tpacket
>> > -TEST_GEN_FILES += reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
>> > -TEST_GEN_FILES += reuseport_dualstack msg_zerocopy reuseaddr_conflict
>> > +TEST_GEN_PROGS += reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
>> > +TEST_GEN_PROGS += reuseport_dualstack msg_zerocopy reuseaddr_conflict
>>
>> Hmm. I see msg_zerocopy.sh for running msg_zerocopy. msg_zerocopy should
>> still stay in TEST_GEN_FILES and msg_zerocopy.sh needs to be added to
>> TEST_PROGS so it runs.
>>
>
> Actually the shell script requires arguments, it doesn't just run the test.
> I'll fix this to just omit the test for now as it's not setup to run properly.
>
> Willem, could you follow up with a patch so that the zero copy test is run
> properly the way you envision it running?  You need to make sure that
>
> make -C tools/testing/selftests TARGETS=net run_tests
>
> actually runs your zero copy test the way you expect it to, otherwise it's 
> just
> sitting there collecting dust.  Thanks,

Will do.

In its current state, this test is really only meant to be run manually.
It demonstrates the API and outputs some information on stderr.

Zerocopy itself requires a two-host test. The feature is expressly
disabled over loopback.

But I can make this a pass/fail tests that exercises the interface
and notification channel and verifies that data was copied. It will
be a bit more work than just changing the default invocation of
msg_zerocopy.sh

Re: [PATCH net-next 0/3] support changing steering policies in tuntap

2017-09-28 Thread Willem de Bruijn

On Thu, Sep 28, 2017 at 3:23 AM, Jason Wang <jasow...@redhat.com> wrote:
>
>
> On 2017年09月28日 07:25, Willem de Bruijn wrote:
>>>>
>>>> In the future, both simple and sophisticated policy like RSS or other
>>>> guest
>>>> driven steering policies could be done on top.
>>>
>>> IMHO there should be a more practical example before adding all this
>>> indirection. And it would be nice to understand why this queue selection
>>> needs to be tun specific.
>>
>> I was thinking the same and this reminds me of the various strategies
>> implemented in packet fanout. tun_cpu_select_queue is analogous to
>> fanout_demux_cpu though it is tun-specific in that it requires
>> tun->numqueues.
>
>
> Right, the main idea is to introduce a way to change flow steering policy
> for tun. I think fanout policy could be implemented through the API
> introduced in this series. (Current flow caches based automatic steering
> method is tun specific).
>
>>
>> Fanout accrued various strategies until it gained an eBPF variant. Just
>> supporting BPF is probably sufficient here, too.
>
>
> Technically yes, but for tun, it also serve for virt. We probably still need
> some hard coded policy which could be changed by guest until we can accept
> an BPF program from guest I think?

When would a guest choose the policy? As long as this is under control
of a host user, possibly unprivileged, allowing BPF here is moot, as any
user can run socket filter BPF already. Programming from the guest is
indeed different. I don't fully understand that use case.

Re: [PATCH net-next RFC 5/5] vhost_net: basic tx virtqueue batched processing

2017-09-27 Thread Willem de Bruijn

> @@ -461,6 +460,7 @@ static void handle_tx(struct vhost_net *net)
> struct socket *sock;
> struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
> bool zcopy, zcopy_used;
> +   int i, batched = VHOST_NET_BATCH;
>
> mutex_lock(>mutex);
> sock = vq->private_data;
> @@ -475,6 +475,12 @@ static void handle_tx(struct vhost_net *net)
> hdr_size = nvq->vhost_hlen;
> zcopy = nvq->ubufs;
>
> +   /* Disable zerocopy batched fetching for simplicity */

This special case can perhaps be avoided if we no longer block
on vhost_exceeds_maxpend, but revert to copying.

> +   if (zcopy) {
> +   heads = 

Can this special case of batchsize 1 not use vq->heads?

> +   batched = 1;
> +   }
> +
> for (;;) {
> /* Release DMAs done buffers first */
> if (zcopy)
> @@ -486,95 +492,114 @@ static void handle_tx(struct vhost_net *net)
> if (unlikely(vhost_exceeds_maxpend(net)))
> break;

> +   /* TODO: Check specific error and bomb out
> +* unless ENOBUFS?
> +*/
> +   err = sock->ops->sendmsg(sock, , len);
> +   if (unlikely(err < 0)) {
> +   if (zcopy_used) {
> +   vhost_net_ubuf_put(ubufs);
> +   nvq->upend_idx =
> +  ((unsigned)nvq->upend_idx - 1) % 
> UIO_MAXIOV;
> +   }
> +   vhost_discard_vq_desc(vq, 1);
> +   goto out;
> +   }
> +   if (err != len)
> +   pr_debug("Truncated TX packet: "
> +   " len %d != %zd\n", err, len);
> +   if (!zcopy) {
> +   vhost_add_used_idx(vq, 1);
> +   vhost_signal(>dev, vq);
> +   } else if (!zcopy_used) {
> +   vhost_add_used_and_signal(>dev,
> + vq, head, 0);

While batching, perhaps can also move this producer index update
out of the loop and using vhost_add_used_and_signal_n.

> +   } else
> +   vhost_zerocopy_signal_used(net, vq);
> +   vhost_net_tx_packet(net);
> +   if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> +   vhost_poll_queue(>poll);
> +   goto out;
> }
> -   vhost_discard_vq_desc(vq, 1);
> -   break;
> -   }
> -   if (err != len)
> -   pr_debug("Truncated TX packet: "
> -" len %d != %zd\n", err, len);
> -   if (!zcopy_used)
> -   vhost_add_used_and_signal(>dev, vq, head, 0);
> -   else
> -   vhost_zerocopy_signal_used(net, vq);
> -   vhost_net_tx_packet(net);
> -   if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> -   vhost_poll_queue(>poll);
> -   break;

This patch touches many lines just for indentation. If having to touch
these lines anyway (dirtying git blame), it may be a good time to move
the processing of a single descriptor code into a separate helper function.
And while breaking up, perhaps another helper for setting up ubuf_info.
If you agree, preferably in a separate noop refactor patch that precedes
the functional changes.

Re: [PATCH net-next RFC 3/5] vhost: introduce vhost_add_used_idx()

2017-09-27 Thread Willem de Bruijn

>> > > @@ -199,6 +199,7 @@ int __vhost_get_vq_desc(struct vhost_virtqueue *vq,
>> > >   void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
>> > >   int vhost_vq_init_access(struct vhost_virtqueue *);
>> > > +int vhost_add_used_idx(struct vhost_virtqueue *vq, int n);
>> > >   int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int 
>> > > len);
>> > >   int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem 
>> > > *heads,
>> > >unsigned count);
>> > Please change the API to hide the fact that there's an index that needs
>> > to be updated.
>>
>> In fact, an interesting optimization on top is just call
>> vhost_add_used_idx(vq, n) instead of n vhost_add_used_idx(vq, 1). That's the
>> reason I leave n in the API.
>>
>> Thanks
>
> Right but you could increment some internal counter in the vq
> structure then update the used index using some api
> with a generic name, e.g.  add_used_complete or something like this.

That adds a layer of information hiding. If the same variable can be
kept close to the computation in a local variable and passed directly
to vhost_add_used_idx_n that is easier to follow.

Re: [PATCH net-next RFC 2/5] vhost: introduce helper to prefetch desc index

2017-09-27 Thread Willem de Bruijn

On Fri, Sep 22, 2017 at 4:02 AM, Jason Wang  wrote:
> This patch introduces vhost_prefetch_desc_indices() which could batch
> descriptor indices fetching and used ring updating. This intends to
> reduce the cache misses of indices fetching and updating and reduce
> cache line bounce when virtqueue is almost full. copy_to_user() was
> used in order to benefit from modern cpus that support fast string
> copy. Batched virtqueue processing will be the first user.
>
> Signed-off-by: Jason Wang 
> ---
>  drivers/vhost/vhost.c | 55 
> +++
>  drivers/vhost/vhost.h |  3 +++
>  2 files changed, 58 insertions(+)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index f87ec75..8424166d 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2437,6 +2437,61 @@ struct vhost_msg_node *vhost_dequeue_msg(struct 
> vhost_dev *dev,
>  }
>  EXPORT_SYMBOL_GPL(vhost_dequeue_msg);
>
> +int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,
> +   struct vring_used_elem *heads,
> +   u16 num, bool used_update)
> +{
> +   int ret, ret2;
> +   u16 last_avail_idx, last_used_idx, total, copied;
> +   __virtio16 avail_idx;
> +   struct vring_used_elem __user *used;
> +   int i;
> +
> +   if (unlikely(vhost_get_avail(vq, avail_idx, >avail->idx))) {
> +   vq_err(vq, "Failed to access avail idx at %p\n",
> +  >avail->idx);
> +   return -EFAULT;
> +   }
> +   last_avail_idx = vq->last_avail_idx & (vq->num - 1);
> +   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> +   total = vq->avail_idx - vq->last_avail_idx;
> +   ret = total = min(total, num);
> +
> +   for (i = 0; i < ret; i++) {
> +   ret2 = vhost_get_avail(vq, heads[i].id,
> + >avail->ring[last_avail_idx]);
> +   if (unlikely(ret2)) {
> +   vq_err(vq, "Failed to get descriptors\n");
> +   return -EFAULT;
> +   }
> +   last_avail_idx = (last_avail_idx + 1) & (vq->num - 1);
> +   }

This is understandably very similar to the existing logic in vhost_get_vq_desc.
Can that be extracted to a helper to avoid code duplication?

Perhaps one helper to update vq->avail_idx and return num, and
another to call vhost_get_avail one or more times.

> +
> +   if (!used_update)
> +   return ret;
> +
> +   last_used_idx = vq->last_used_idx & (vq->num - 1);
> +   while (total) {
> +   copied = min((u16)(vq->num - last_used_idx), total);
> +   ret2 = vhost_copy_to_user(vq,
> + >used->ring[last_used_idx],
> + [ret - total],
> + copied * sizeof(*used));
> +
> +   if (unlikely(ret2)) {
> +   vq_err(vq, "Failed to update used ring!\n");
> +   return -EFAULT;
> +   }
> +
> +   last_used_idx = 0;
> +   total -= copied;
> +   }

This second part seems unrelated and could be a separate function?

Also, no need for ret2 and double assignment "ret = total =" if not
modifying total
in the the second loop:

  for (i = 0; i < total; ) {
...
i += copied;
  }

Re: [PATCH net-next 0/3] support changing steering policies in tuntap

2017-09-27 Thread Willem de Bruijn

>> In the future, both simple and sophisticated policy like RSS or other guest
>> driven steering policies could be done on top.
>
> IMHO there should be a more practical example before adding all this
> indirection. And it would be nice to understand why this queue selection
> needs to be tun specific.

I was thinking the same and this reminds me of the various strategies
implemented in packet fanout. tun_cpu_select_queue is analogous to
fanout_demux_cpu though it is tun-specific in that it requires tun->numqueues.

Fanout accrued various strategies until it gained an eBPF variant. Just
supporting BPF is probably sufficient here, too.

Re: [PATCH] packet: Don't write vnet header beyond end of buffer

2017-08-28 Thread Willem de Bruijn

On Mon, Aug 28, 2017 at 2:29 PM, Benjamin Poirier <bpoir...@suse.com> wrote:
> ... which may happen with certain values of tp_reserve and maclen.
>
> Fixes: 58d19b19cd99 ("packet: vnet_hdr support for tpacket_rcv")
> Signed-off-by: Benjamin Poirier <bpoir...@suse.com>
> Cc: Willem de Bruijn <will...@google.com>

Acked-by: Willem de Bruijn <will...@google.com>

Thanks for fixing this, Benjamin.

Re: [PATCH 1/2] [net-next] packet: clarify timestamp overflow

2017-11-27 Thread Willem de Bruijn

On Mon, Nov 27, 2017 at 11:17 AM, Arnd Bergmann <a...@arndb.de> wrote:
> The memory mapped packet socket data structure in version 1 through 3
> all contain 32-bit second values for the packet time stamps, which makes
> them suffer from the overflow of time_t in y2038 or y2106 (depending
> on whether user space interprets the value as signed or unsigned).
>
> The implementation uses the deprecated getnstimeofday() function.
>
> In order to get rid of that, this changes the code to use
> ktime_get_real_ts64() as a replacement, documenting the nature of the
> overflow. As long as the user applications treat the timestamps as
> unsigned, or only use the difference between timestamps, they are
> fine, and changing the timestamps to 64-bit wouldn't require a more
> invasive user space API change.
>
> Note: a lot of other APIs suffer from incompatible structures when
> time_t gets redefined to 64-bit in 32-bit user space, but this one
> does not.
>
> Signed-off-by: Arnd Bergmann <a...@arndb.de>

This change to avoid the deprecated internal interface looks great to me.

Acked-by: Willem de Bruijn <will...@google.com>

Re: [PATCH 2/2] [RFC] packet: experimental support for 64-bit timestamps

2017-11-27 Thread Willem de Bruijn

On Mon, Nov 27, 2017 at 11:59 AM, Jiri Pirko  wrote:
> Mon, Nov 27, 2017 at 05:19:25PM CET, a...@arndb.de wrote:
>>I tried to figure out what it would take to do a version 4 mmap packet
>>socket interface to completely avoid the y2106 overflow problem. This is
>>what I came up with, reusing most of the v3 code, except for the parts
>>where we access the timestamps.
>>
>>For kselftest, I'm adding support for testing v4 in addition to v1-v3,
>>but the test currently does not look at the timestamps, so it won't
>>check that the timestamp format actually works as intended, only that
>>I didn't break the parts that worked in the v3 selftest.
>>
>>Overall, this is more of a mess than I expected, so it's probably not
>>worth doing a v4 format just for the timestamp, but the patch can serve
>>as a reference for anyone that needs a new format for other reasons and
>>fixes this along with the other changes.
>>
>>Signed-off-by: Arnd Bergmann 
>>---
>
> [...]
>
>
>>@@ -250,7 +269,8 @@ struct tpacket_block_desc {
>> enum tpacket_versions {
>>   TPACKET_V1,
>>   TPACKET_V2,
>>-  TPACKET_V3
>>+  TPACKET_V3,
>>+  TPACKET_V4,
>
> I wonder with how many versions are we going to eventually end up with :O

There already is an effort to come up with a new AF_PACKET V4 [1].
We should make sure that any new interface does not have the
Y2038/Y2106 issue. But, if a new version is being developed and
that subsumes all existing use cases, then there probably is no need
for another version that is a very small diff to V3.

If adding support for existing applications is useful, another approach
would be to add a new socket option that changes the semantics for
the two u32 fields in each of V1, V2 and V3 to hold nsec. Add a single
check after filling in those structs whether the option is set and, if so,
overwrite the two fields.

[1] https://lwn.net/Articles/737947/

Re: [PATCH] [RFC v2] packet: experimental support for 64-bit timestamps

2017-11-28 Thread Willem de Bruijn

On Tue, Nov 28, 2017 at 8:14 AM, Arnd Bergmann  wrote:
> This is a second attempt to allow 64-bit timestamps in packet sockets,

Thanks for coding up this variant.

> The implementation is fairly straightforward, but I'm less sure about the
> interface. Using SOF_TIMESTAMPING_* flags in PACKET_TIMESTAMP is a bit
> odd already since most of the other flags make no sense here.  Adding two
> more flags that only make sense for packet sockets but not the normal
> SO_TIMESTAMPING option on other sockets makes this even more confusing.

Agreed.

Unfortunately, we're already stuck with SOL_PACKET/PACKET_TIMESTAMP
accepting SOF_TIMESTAMPING_RAW_HARDWARE.

Perhaps we can define a new PF_PACKET specific enum where the
equivalent option has the same value, so is backwards compatible:

enum {
  PACKET_TIMESTAMP_ORIG = 0,
  PACKET_TIMESTAMP_ZERO = 1 << 0,
  PACKET_TIMESTAMP_NS64 = 1 << 1,
  PACKET_TIMESTAMP_HW = 1 << 6
};

and  BUILD_BUG_ON(PACKET_TIMESTAMP_RAW != SOF_TIMESTAMPING_RAW_HARDWARE)
to document the dependency.

At high level, the code looks great to me, itself.

Re: [PATCH 2/2] [RFC] packet: experimental support for 64-bit timestamps

2017-11-28 Thread Willem de Bruijn

On Tue, Nov 28, 2017 at 3:46 AM, Arnd Bergmann  wrote:
> On Tue, Nov 28, 2017 at 8:04 AM, Björn Töpel  wrote:
>> 2017-11-27 21:51 GMT+01:00 Arnd Bergmann :
>> [...]
 There already is an effort to come up with a new AF_PACKET V4 [1].
 We should make sure that any new interface does not have the
 Y2038/Y2106 issue. But, if a new version is being developed and
 that subsumes all existing use cases, then there probably is no need
 for another version that is a very small diff to V3.
>>>
>>> Ah, perfect, that's good timing. Adding Björn to Cc here.
>>>
>>
>> Unfortunately, for the Y2038/Y2106 cases, we'll be (as a result of
>> netdevconf discussions) moving the AF_PACKET V4 implementation to a
>> separate, new, address/packet family.
>
> Ok, I see.

Does it matter whether the replacement is a new version or a
new packet family?

 If adding support for existing applications is useful, another approach
 would be to add a new socket option that changes the semantics for
 the two u32 fields in each of V1, V2 and V3 to hold nsec. Add a single
 check after filling in those structs whether the option is set and, if so,
 overwrite the two fields.

 [1] https://lwn.net/Articles/737947/
>>>
>>> I don't think that's necessary. As long as the V4 capabilities are a
>>> superset of V1-V3, we should be able to just require all users to
>>> move to V4 (or later) in the next 89 years, and make sure that they
>>> use unsigned seconds if they care about 2038.
>>>
>>
>> Given that V4 wont be around for AF_PACKET -- at least not in the
>> shape of our patches -- Willem's suggestion is probably a good way
>> forward.
>
> That leaves one question: should we do that now, or wait until some
> other reason for a V4 comes up? I don't mind creating another
> patch for this, just want to get a feeling of whether the API clutter
> is worth it when we have a way out that works until y2106 (at
> which point we run into other problems as well).

I don't expect that we'll have another packet version independent
from the work that Björn is doing. The choice to implement using
a new packet family is given by the complexity of the existing code,
especially the various locking mechanisms.

>From that point of view, and if we want to offer a Y2106 proof
AF_PACKET independent from the above, no reason to wait.

Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps

2017-11-28 Thread Willem de Bruijn

On Tue, Nov 28, 2017 at 3:32 PM, Arnd Bergmann <a...@arndb.de> wrote:
> As I noticed in my previous patch to remove the 'timespec' usage in
> the packet socket, the timestamps in the packet socket are slightly
> inefficient as they convert a nanosecond value into seconds/nanoseconds
> or seconds/microseconds.
>
> This adds two new socket options for the timestamp to resolve that:
>
> PACKET_SKIPTIMESTAMP sets a flag to indicate whether to generate
> timestamps at all. When this is set, all timestamps are hardcoded to
> zero, which saves a few cycles for the conversion and the access of
> the hardware clocksource. The idea was taken from pktgen, which has an
> F_NO_TIMESTAMP option for the same purpose.
>
> PACKET_TIMESTAMP_NS64 changes the interpretation of the time stamp fields:
> instead of having 32 bits for seconds plus 32 bits for nanoseconds or
> microseconds, we now always send down 64 bits worth of nanoseconds when
> this flag is set.
>
> Link: https://patchwork.kernel.org/patch/10077199/
> Suggested-by: Willem de Bruijn <willemdebruijn.ker...@gmail.com>
> Signed-off-by: Arnd Bergmann <a...@arndb.de>

This works. Another option would be to add a PACKET_TIMESTAMP_EX
with the semantics we discussed previously + fail hard when any undefined
bits are set. I don't feel strong either way, we don't intend to extend further.

If taking this approach, it might be good to split into separate patches, one
for each flag?

> -static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 
> *ts,
> +static __u32 tpacket_get_timestamp(struct sk_buff *skb, __u32 *hi, __u32 *lo,
>unsigned int flags)

Argument flags is no longer used.

>  {
> +   struct packet_sock *po = pkt_sk(skb->sk);
> struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
> +   ktime_t stamp;
> +   u32 type;
> +
> +   if (po->tp_skiptstamp)
> +   return 0;
>
> if (shhwtstamps &&
> -   (flags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
> -   ktime_to_timespec64_cond(shhwtstamps->hwtstamp, ts))
> -   return TP_STATUS_TS_RAW_HARDWARE;
> +   (po->tp_tstamp & SOF_TIMESTAMPING_RAW_HARDWARE) &&
> +   shhwtstamps->hwtstamp) {
> +   stamp = shhwtstamps->hwtstamp;
> +   type = TP_STATUS_TS_RAW_HARDWARE;
> +   } else if (skb->tstamp) {
> +   stamp = skb->tstamp;
> +   type = TP_STATUS_TS_SOFTWARE;
> +   } else {
> +   return 0;
> +   }
>
> -   if (ktime_to_timespec64_cond(skb->tstamp, ts))
> -   return TP_STATUS_TS_SOFTWARE;
> +   if (po->tp_tstamp_ns64) {
> +   __u64 ns = ktime_to_ns(stamp);
>
> -   return 0;
> +   *hi = upper_32_bits(ns);
> +   *lo = lower_32_bits(ns);
> +   } else {
> +   struct timespec64 ts = ktime_to_timespec64(stamp);
> +
> +   *hi = ts.tv_sec;
> +   if (po->tp_version == TPACKET_V1)

Very minor: may want to invert test to make newer the protocols the
likely branch.

>  static __u32 __packet_set_timestamp(struct packet_sock *po, void *frame,
> struct sk_buff *skb)
>  {
> union tpacket_uhdr h;
> -   struct timespec64 ts;
> -   __u32 ts_status;
> +   __u32 ts_status, hi, lo;
>
> -   if (!(ts_status = tpacket_get_timestamp(skb, , po->tp_tstamp)))
> +   if (!(ts_status = tpacket_get_timestamp(skb, , , 
> po->tp_tstamp)))
> return 0;
>
> h.raw = frame;
> -   /*
> -* versions 1 through 3 overflow the timestamps in y2106, since they
> -* all store the seconds in a 32-bit unsigned integer.
> -* If we create a version 4, that should have a 64-bit timestamp,
> -* either 64-bit seconds + 32-bit nanoseconds, or just 64-bit
> -* nanoseconds.
> -*/

Probably no need to introduce this in patch 1/2 when removing it in 2/2.

> @@ -2191,8 +2226,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
> net_device *dev,
> unsigned long status = TP_STATUS_USER;
> unsigned short macoff, netoff, hdrlen;
> struct sk_buff *copy_skb = NULL;
> -   struct timespec64 ts;
> __u32 ts_status;
> +   __u32 hi, lo;

since this function is not time-specific, the context of hi and lo is not
immediately obvious here. tstamp_hi, tstamp_lo? Or even __u32
tstamp[2] and have tpacket_get_timestamp and packet_get_time take
one fewer argument.

Re: [PATCH] net: openvswitch: datapath: fix data type in queue_gso_packets

2017-11-25 Thread Willem de Bruijn

On Sat, Nov 25, 2017 at 2:14 PM, Gustavo A. R. Silva
<garsi...@embeddedor.com> wrote:
> gso_type is being used in binary AND operations together with SKB_GSO_UDP.
> The issue is that variable gso_type is of type unsigned short and
> SKB_GSO_UDP expands to more than 16 bits:
>
> SKB_GSO_UDP = 1 << 16
>
> this makes any binary AND operation between gso_type and SKB_GSO_UDP to
> be always zero, hence making some code unreachable and likely causing
> undesired behavior.
>
> Fix this by changing the data type of variable gso_type to unsigned int.
>
> Addresses-Coverity-ID: 1462223
> Fixes: 0c19f846d582 ("net: accept UFO datagrams from tuntap and packet")
> Signed-off-by: Gustavo A. R. Silva <garsi...@embeddedor.com>

Acked-by: Willem de Bruijn <will...@google.com>

Good catch, thanks!

The issue here is that I brought back SKB_GSO_UDP at the end of the
list at 1 << 16 to avoid renaming of all that used to follow, while it used
to be defined as 1 << 1.

The skb_shinfo(skb)->gso_type field itself has been expanded as of v4.12
in commit 7f564528a480 ("skbuff: Extend gso_type to unsigned int.").

A quick scan shows a few nic drivers that also still cast to unsigned
short: bnxt, atl1c and qede. Since UFO hardware offload no longer
exists, this is safe wrt this patch. And it is fine for older kernels as
no driver supported the previous entry at 1 << 16, SKB_GSO_ESP.
But it is fragile wrt follow-on offloads. Probably a net-next patch.

The only other likely issue I spotted with the longer gso_type is
in trace events. net_dev_start_xmit and net_dev_rx_verbose_template
both export as u16.

Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps

2017-11-29 Thread Willem de Bruijn

> Thanks for the review! Any suggestions for how to do the testing? If you have
> existing test cases, could you give my next version a test run to see if there
> are any regressions and if the timestamps work as expected?
>
> I see that there are test cases in tools/testing/selftests/net/, but none
> of them seem to use the time stamps so far, and I'm not overly familiar
> with how it works in the details to extend it in a meaningful way.

I could not find any good tests for this interface, either. The only
user of the interface I found was a little tool I wrote a few years
ago that compares timestamps at multiple points in the transmit
path for latency measurement [1]. But it may be easier to just write
a new test under tools/testing/selftests/net for this purpose. I can
help with that, too, if you want.

[1] https://github.com/wdebruij/kerneltools/blob/master/tools/tcplate/tcplate.c

Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps

2017-11-29 Thread Willem de Bruijn

On Wed, Nov 29, 2017 at 8:39 PM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> On Wed, Nov 29, 2017 at 3:06 PM, Arnd Bergmann <a...@arndb.de> wrote:
>> On Wed, Nov 29, 2017 at 5:51 PM, Willem de Bruijn
>> <willemdebruijn.ker...@gmail.com> wrote:
>>>> Thanks for the review! Any suggestions for how to do the testing? If you 
>>>> have
>>>> existing test cases, could you give my next version a test run to see if 
>>>> there
>>>> are any regressions and if the timestamps work as expected?
>>>>
>>>> I see that there are test cases in tools/testing/selftests/net/, but none
>>>> of them seem to use the time stamps so far, and I'm not overly familiar
>>>> with how it works in the details to extend it in a meaningful way.
>>>
>>> I could not find any good tests for this interface, either. The only
>>> user of the interface I found was a little tool I wrote a few years
>>> ago that compares timestamps at multiple points in the transmit
>>> path for latency measurement [1]. But it may be easier to just write
>>> a new test under tools/testing/selftests/net for this purpose. I can
>>> help with that, too, if you want.
>>
>> Thanks, that would be great!
>
> I'll reply to this thread with git send-email with an extension to
> tools/testing/selftests/net/psock_tpacket.c.

It appears that it did not end up in this thread. At least not when
using gmail threading. Patch at http://patchwork.ozlabs.org/patch/842854/

Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps

2017-11-29 Thread Willem de Bruijn

On Wed, Nov 29, 2017 at 3:06 PM, Arnd Bergmann <a...@arndb.de> wrote:
> On Wed, Nov 29, 2017 at 5:51 PM, Willem de Bruijn
> <willemdebruijn.ker...@gmail.com> wrote:
>>> Thanks for the review! Any suggestions for how to do the testing? If you 
>>> have
>>> existing test cases, could you give my next version a test run to see if 
>>> there
>>> are any regressions and if the timestamps work as expected?
>>>
>>> I see that there are test cases in tools/testing/selftests/net/, but none
>>> of them seem to use the time stamps so far, and I'm not overly familiar
>>> with how it works in the details to extend it in a meaningful way.
>>
>> I could not find any good tests for this interface, either. The only
>> user of the interface I found was a little tool I wrote a few years
>> ago that compares timestamps at multiple points in the transmit
>> path for latency measurement [1]. But it may be easier to just write
>> a new test under tools/testing/selftests/net for this purpose. I can
>> help with that, too, if you want.
>
> Thanks, that would be great!

I'll reply to this thread with git send-email with an extension to
tools/testing/selftests/net/psock_tpacket.c. I can resend that for
submission after your feature is merged (as it depends on it) or
feel free to include it in your patchset. The test currently fails for
the ns64 case. I probably did not convert correctly, but have to leave
the office and want to send what I have.

Two other comments: the test crashed the kernel due to a NULL ptr
in tpacket_get_timestamp. We cannot rely on skb->sk being set to
the packet socket here. And assignment to bitfield requires a cast to
boolean.

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index f55f330ab547..e9decc7fc5c3 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -439,9 +439,9 @@ static int __packet_get_status(struct packet_sock
*po, void *frame)
}
 }

-static __u32 tpacket_get_timestamp(struct sk_buff *skb, __u32 *hi, __u32 *lo)
+static __u32 tpacket_get_timestamp(struct packet_sock *po, struct sk_buff *skb,
+  __u32 *hi, __u32 *lo)
 {
-   struct packet_sock *po = pkt_sk(skb->sk);
struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
ktime_t stamp;
u32 type;
@@ -508,7 +508,7 @@ static __u32 __packet_set_timestamp(struct
packet_sock *po, void *frame,
union tpacket_uhdr h;
__u32 ts_status, hi, lo;

-   if (!(ts_status = tpacket_get_timestamp(skb, , )))
+   if (!(ts_status = tpacket_get_timestamp(po, skb, , )))
return 0;

h.raw = frame;
@@ -2352,7 +2352,7 @@ static int tpacket_rcv(struct sk_buff *skb,
struct net_device *dev,

skb_copy_bits(skb, 0, h.raw + macoff, snaplen);

-   if (!(ts_status = tpacket_get_timestamp(skb, _hi, _lo)))
+   if (!(ts_status = tpacket_get_timestamp(po, skb, _hi,
_lo)))
packet_get_time(po, _hi, _lo);

status |= ts_status;
@@ -3835,7 +3835,7 @@ packet_setsockopt(struct socket *sock, int
level, int optname, char __user *optv
if (copy_from_user(, optval, sizeof(val)))
return -EFAULT;

-   po->tp_skiptstamp = val;
+   po->tp_skiptstamp = !!val;
return 0;
}
case PACKET_TIMESTAMP_NS64:
@@ -3847,7 +3847,7 @@ packet_setsockopt(struct socket *sock, int
level, int optname, char __user *optv
if (copy_from_user(, optval, sizeof(val)))
return -EFAULT;

-   po->tp_tstamp_ns64 = val;
+   po->tp_tstamp_ns64 = !!val;
return 0;
}
case PACKET_FANOUT:

Re: [PATCH net-next V3] tun: add eBPF based queue selection method

2017-12-04 Thread Willem de Bruijn

On Mon, Dec 4, 2017 at 4:31 AM, Jason Wang  wrote:
> This patch introduces an eBPF based queue selection method. With this,
> the policy could be offloaded to userspace completely through a new
> ioctl TUNSETSTEERINGEBPF.
>
> Signed-off-by: Jason Wang 
> ---

> +static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
> +{
> +   struct tun_steering_prog *prog;
> +   u16 ret = 0;
> +
> +   prog = rcu_dereference(tun->steering_prog);
> +   if (prog)
> +   ret = bpf_prog_run_clear_cb(prog->prog, skb);

This dereferences tun->steering_prog for a second time. It is safe
in this load balancing case to assign a few extra packets to queue 0.
But the issue can also be avoided by replacing the function with a
direct call in tun_net_xmit:

   struct tun_steering_prog *s = rcu_dereference(tun->steering_prog);
   if (s)
   ret = bpf_prog_run_clear_cb(s->prog, skb) % tun->numqueues;

>  /* Net device start xmit */
> -static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> +static void tun_automq_xmit(struct tun_struct *tun, struct sk_buff *skb)
>  {
> -   struct tun_struct *tun = netdev_priv(dev);
> -   int txq = skb->queue_mapping;
> -   struct tun_file *tfile;
> -   u32 numqueues = 0;
> -
> -   rcu_read_lock();
> -   tfile = rcu_dereference(tun->tfiles[txq]);
> -   numqueues = READ_ONCE(tun->numqueues);
> -
> -   /* Drop packet if interface is not attached */
> -   if (txq >= numqueues)
> -   goto drop;
> -
>  #ifdef CONFIG_RPS
> -   if (numqueues == 1 && static_key_false(_needed)) {
> +   if (tun->numqueues == 1 && static_key_false(_needed)) {
> /* Select queue was not called for the skbuff, so we extract 
> the
>  * RPS hash and save it into the flow_table here.
>  */
> @@ -969,6 +986,26 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, 
> struct net_device *dev)
> }
> }
>  #endif
> +}
> +
> +/* Net device start xmit */
> +static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +   struct tun_struct *tun = netdev_priv(dev);
> +   int txq = skb->queue_mapping;
> +   struct tun_file *tfile;
> +   u32 numqueues = 0;
> +
> +   rcu_read_lock();
> +   tfile = rcu_dereference(tun->tfiles[txq]);
> +   numqueues = READ_ONCE(tun->numqueues);

Now tun->numqueues is read twice, reversing commit fa35864e0bb7
("tuntap: Fix for a race in accessing numqueues"). I don't see anything
left that would cause a divide by zero after the relevant code was
converted from divide to multiple and subsequently even removed.

But if it's safe to read multiple times, might as well remove the READ_ONCE.

> @@ -1551,7 +1588,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
> struct tun_file *tfile,
> int copylen;
> bool zerocopy = false;
> int err;
> -   u32 rxhash;
> +   u32 rxhash = 0;
> int skb_xdp = 1;
> bool frags = tun_napi_frags_enabled(tun);
>
> @@ -1739,7 +1776,10 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
> struct tun_file *tfile,
> rcu_read_unlock();
> }
>
> -   rxhash = __skb_get_hash_symmetric(skb);
> +   rcu_read_lock();
> +   if (!rcu_dereference(tun->steering_prog))
> +   rxhash = __skb_get_hash_symmetric(skb);
> +   rcu_read_unlock();
>
> if (frags) {
> /* Exercise flow dissector code path. */
> @@ -1783,7 +1823,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
> struct tun_file *tfile,
> u64_stats_update_end(>syncp);
> put_cpu_ptr(stats);
>
> -   tun_flow_update(tun, rxhash, tfile);
> +   if (rxhash)
> +   tun_flow_update(tun, rxhash, tfile);
> +

Nit: zero is a valid hash? In which case, an int64_t initialized to -1 is the
safer check.

Re: [B.A.T.M.A.N.] [RFC v2 2/6] batman-adv: Rename batman-adv.h to batadv_genl.h

2017-12-15 Thread Willem de Bruijn

On Fri, Dec 15, 2017 at 12:18 PM, Sven Eckelmann
<sven.eckelm...@openmesh.com> wrote:
> On Freitag, 15. Dezember 2017 11:57:55 CET Willem de Bruijn wrote:
>> > No, this is also bad because batman_adv.h is MIT license and packet.h is
>> > GPL-2. So what other name would you suggest for packet.h? 
>> > batman_adv_packet.h?
>>
>> Sure, that sounds great. Thanks.
>
> Really? Isn't include/uapi/linux/batman_adv_packet.h looking like an accident
> which never should have had happened?

My only point was that renaming and modifying existing uapi files
can break userspace compilation.

As long as the existing files are not changed, I don't have a strong
opinion on naming for new files.

Re: [B.A.T.M.A.N.] [RFC v2 2/6] batman-adv: Rename batman-adv.h to batadv_genl.h

2017-12-15 Thread Willem de Bruijn

On Fri, Dec 15, 2017 at 6:48 AM, Sven Eckelmann
<sven.eckelm...@openmesh.com> wrote:
> On Freitag, 15. Dezember 2017 11:32:05 CET Sven Eckelmann wrote:
>> On Mittwoch, 6. Dezember 2017 11:58:14 CET Willem de Bruijn wrote:
>> [...]
>> > >> > ---
>> > >> >  MAINTAINERS| 2 +-
>> > >> >  include/uapi/linux/{batman_adv.h => batadv_genl.h} | 6 +++---
>> > >>
>> > >> This and the previous patch changes uapi. That might break userspace
>> > >> applications that rely on it.
>> > >
>> > > I am not aware of any application because all (alfred, batctl and some 
>> > > gluon
>> > > integration) of them currently ship their own copy because distribution 
>> > > didn't
>> > > catch up. And this is also the reason why I want to do it now - not 
>> > > later.
>> >
>> > That assumes that you know all applications, including those not
>> > publicly available. It may be true in this instance, but it is not
>> > possible to be certain.
>>
>> I've just talked with Simon. Because you have a problem with these two
>> changes, he suggested that I should drop these two patches and merge packet.h
>> with the uapi batadv genl header batman_adv.h
>
> No, this is also bad because batman_adv.h is MIT license and packet.h is
> GPL-2. So what other name would you suggest for packet.h? batman_adv_packet.h?

Sure, that sounds great. Thanks.

Re: [RFC v3 0/4] flow_dissector: Provide basic batman-adv unicast handling

2017-12-15 Thread Willem de Bruijn

On Fri, Dec 15, 2017 at 1:23 PM, Sven Eckelmann
<sven.eckelm...@openmesh.com> wrote:
> Hi,
>
> we are currently starting to use batman-adv as mesh protocol on multicore
> embedded devices. These usually don't have a lot of CPU power per core but
> are reasonable fast when using multiple cores.
>
> It was noticed that sending was working very well but receiving was
> basically only using on CPU core per neighbor. The reason for that is
> format of the (normal) incoming packet:
>
>   ++
>   | ip(v6)hdr  |
>   ++
>   | inner ethhdr   |
>   ++
>   | batadv unicast hdr |
>   ++
>   | outer ethhdr   |
>   ++
>
> The flow dissector will therefore stop after parsing the outer ethernet
> header and will not parse the actual ipv(4|6)/... header of the packet. Our
> assumption was now that it would help us to add minimal support to the flow
> dissector to jump over the batman-adv unicast and inner ethernet header
> (like in gre ETH_P_TEB). The patch was implemented in a slightly hacky
> way [1] and the results looked quite promising.
>
> I didn't get any feedback how the files should actually be named and I am
> not really happy with the current names - so please feel free to propose
> better names.
>
> The discussion of the RFC v2 can be found in the related patches of
> https://patchwork.ozlabs.org/cover/844783/
>
>
> Changes in v3:
> ==
>
> * removed change of uapi/linux/batman_adv.h to uapi/linux/batadv_genl.h
>   - requested by Willem de Bruijn <willemdebruijn.ker...@gmail.com>
> * removed naming fixes for enums/defines in uapi/linux/batadv_genl.h
>   - requested by Willem de Bruijn <willemdebruijn.ker...@gmail.com>
> * renamed uapi/linux/batadv.h to uapi/linux/batadv_packet.h
> * moved batadv dissector functionality in own function
>   - requested by Tom Herbert <t...@herbertland.com>
> * added support for flags FLOW_DISSECTOR_F_STOP_AT_ENCAP and
>   FLOW_DIS_ENCAPSULATION
>   - requested by Willem de Bruijn <willemdebruijn.ker...@gmail.com>

Thanks for making these changes. The flow dissection looks good to me.

One possible issue is that this exposes kernel fixed width types u8/u16/..
to userspace. For posix compatibility reasons, uapi headers use the
variant with underscores.

Re: [PATCH net-next V2 3/3] tun: add eBPF based queue selection method

2017-11-03 Thread Willem de Bruijn

On Tue, Oct 31, 2017 at 7:32 PM, Jason Wang  wrote:
> This patch introduces an eBPF based queue selection method based on
> the flow steering policy ops. Userspace could load an eBPF program
> through TUNSETSTEERINGEBPF. This gives much more flexibility compare
> to simple but hard coded policy in kernel.
>
> Signed-off-by: Jason Wang 
> ---
> +static int tun_set_steering_ebpf(struct tun_struct *tun, void __user *data)
> +{
> +   struct bpf_prog *prog;
> +   u32 fd;
> +
> +   if (copy_from_user(, data, sizeof(fd)))
> +   return -EFAULT;
> +
> +   prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER);

If the idea is to allow guests to pass BPF programs down to the host,
you may want to define a new program type that is more restrictive than
socket filter.

The external functions allowed for socket filters (sk_filter_func_proto)
are relatively few (compared to, say, clsact), but may still leak host
information to a guest. More importantly, guest security considerations
limits how we can extend socket filters later.

Re: [PATCH net-next V2 1/3] tun: abstract flow steering logic

2017-11-01 Thread Willem de Bruijn

On Tue, Oct 31, 2017 at 7:32 PM, Jason Wang  wrote:
> tun now use flow caches based automatic queue steering method. This
> may not suffice all user cases. To extend it to be able to use more
> flow steering policy, this patch abstracts flow steering logic into
> tun_steering_ops, then we can declare and use different methods in
> the future.
> Signed-off-by: Jason Wang 
> ---
>  drivers/net/tun.c | 85 
> +--
>  1 file changed, 63 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index ea29da9..bff6259 100644

The previous RFC enabled support for multiple pluggable steering
policies. But as all can be implemented in BPF and we only plan to
support an eBPF policy besides the legacy one, this patch is no longer
needed. We can save a few indirect function calls.

Re: [PATCH net-next V2 2/3] tun: introduce ioctls to set and get steering policies

2017-11-01 Thread Willem de Bruijn

On Tue, Oct 31, 2017 at 7:32 PM, Jason Wang  wrote:
> This patch introduces new ioctl for change packet steering policy for
> tun. Only automatic flow steering is supported, more policies will
> come.
>
> Signed-off-by: Jason Wang 
> ---
>  drivers/net/tun.c   | 35 ++-
>  include/uapi/linux/if_tun.h |  7 +++
>  2 files changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index bff6259..ab109ff 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -122,7 +122,8 @@ do {  
>   \
>  #define TUN_VNET_BE 0x4000
>
>  #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
> - IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS)
> + IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS | \
> + IFF_MULTI_STEERING)
>
>  #define GOODCOPY_LEN 128
>
> @@ -2516,6 +2517,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned 
> int cmd,
> unsigned int ifindex;
> int le;
> int ret;
> +   unsigned int steering;
>
> if (cmd == TUNSETIFF || cmd == TUNSETQUEUE || _IOC_TYPE(cmd) == 
> SOCK_IOC_TYPE) {
> if (copy_from_user(, argp, ifreq_len))
> @@ -2774,6 +2776,37 @@ static long __tun_chr_ioctl(struct file *file, 
> unsigned int cmd,
> ret = 0;
> break;
>
> +   case TUNSETSTEERING:
> +   ret = -EFAULT;
> +   if (copy_from_user(, argp, sizeof(steering)))
> +   break;
> +   ret = 0;
> +   switch (steering) {
> +   case TUN_STEERING_AUTOMQ:
> +   tun->steering_ops = _automq_ops;
> +   break;
> +   default:
> +   ret = -EFAULT;
> +   }
> +   break;
> +
> +   case TUNGETSTEERING:
> +   ret = 0;
> +   if (tun->steering_ops == _automq_ops)
> +   steering = TUN_STEERING_AUTOMQ;
> +   else
> +   BUG();
> +   if (copy_to_user(argp, , sizeof(steering)))
> +   ret = -EFAULT;
> +   break;
> +
> +   case TUNGETSTEERINGFEATURES:
> +   ret = 0;
> +   steering = TUN_STEERING_AUTOMQ;
> +   if (copy_to_user(argp, , sizeof(steering)))
> +   ret = -EFAULT;
> +   break;
> +


Similar to my comment in patch 1/3: if only eBPF is used, these
calls can be avoided in favor of only TUNSETSTEERINGEBPF
from patch 3/3.

1 2 3 4 5 6 7 >

1 - 100 of 633 matches

Mail list logo