Re: git guidance
Dave Quigley wrote: On Wed, 2007-11-28 at 16:57 +0100, Tilman Schmidt wrote: Dave Quigley schrieb: There is a project listed on the kernel.org git page called guilt. I find it very useful. It is much more responsive than stgit and it actually has a git backend which quilt does not. On Wed, 2007-11-28 at 00:20 +0100, Jan Engelhardt wrote: On Nov 27 2007 23:33, Tilman Schmidt wrote: Well, it did. So now I'm back to keeping a virgin kernel source tree alongside my development area in order to produce diffs. That can't be right? No, it can't. Use stgit/quilt ;p In which respect would stgit/quilt/guilt help me? At first glance they just seem to add another level of complexity. Thanks, Tilman These tools allow you to maintain a set of patches with very little effort. More importantly it removes a lot of the git specifics from your development process. For example this is how I use guilt for a new patch set. I take my fresh tree and do a guilt-init in the base. This will create a new patch series. I then need to create a patch to modify something LSM related (guilt-new patch_name). Things like stgit/quilt/git use the idea of a stack of patches. At this point if you were to type guilt-series you would only see the one patch we just created. This patch is going to be one logical set of changes (it should also produce a compilable and working kernel). You can make whatever modifications you need to make to your files and at this point you need to do one of two things. If they were already in the tree you just type guilt-refresh and under your .git/patches/branch_name directory you will see a file named patch_name which contains your patch. Otherwise you need to do a guilt-add file_name and then a guilt-refresh. The idea here is that you have a moved your workflow from managing a series of commits and then breaking out patches from a final version to one where you think in terms of the patches and make modifications to them instead. In my example I said I was doing something LSM related. Lets say the first patch added a new hook and its implementation in the various modules. We can now add a second patch using the guilt-new command and this one will add uses of that new hook. At this point we have a stack that looks like this. patch that adds users patch that adds hook I can pop and push patches onto this stack to have a version of my kernel tree at any state within the patch set. At this point lets say we have posted the patch set and have feedback. I need to apply this feedback to the patch that adds the LSM hook. Since my top patch (guilt-top) is currently at the one that adds the users of the hook I need to pop off that patch and get to the one that creates the hook (guilt-pop). After doing this I'm at a kernel tree state which just has the changes which add the hook. I make my modifications, type guilt-refresh to create a new patch and then guilt-push my second patch on and make sure everything is still working. As you can see there is almost no git knowledge required to use this system and it allows you to focus on development instead of the versioning system. One useful feature is that when Linus adds new patches and I want to rebase my set against the current tree It only takes 3 commands to rebase the patch set (Assuming all goes well). guilt-push -a #push all patches onto the stack git-fetch #pull down the index guilt-rebase FETCH_HEAD #Rebase our patches should do a merge and #reapply all patches These are just some basics about guilt. Jeff has written a better tutorial with a sample repository for you to work with if your interested. I don't know if this will help your development process but I can tell you from experience breaking patches by hand was a pain in the ass and a huge waste of time and I'm glad to have a tool like this Where can I find that tutorial ? regards now. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Linux and iptables
Hello, I have a kubuntu system with a firestarter Firewall. I did purchase a Sweex router. This router has also a firewall. If I connect the router then i have two Firewalls. The result is that no visitor from outside can visit my wiki's, because they have to pass 2 Firewalls. How can I disable the firestarter Firewall with iptables ? The Sweex router has a Mikkisoft Firewall , so I do not know anything about it and I can not disable it. regards Wim -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
can we discuss
Linux and iptables
Hello, I have a kubuntu system with a firestarter Firewall. I did purchase a Sweex router. This router has also a firewall. If I connect the router then i have two Firewalls. The result is that no visitor from outside can visit my wiki's, because they have to pass 2 Firewalls. How can I disable the firestarter Firewall with iptables ? The Sweex router has a Mikkisoft Firewall , so I do not know anything about it and I can not disable it. regards Wim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: git guidance
Dave Quigley wrote: On Wed, 2007-11-28 at 16:57 +0100, Tilman Schmidt wrote: Dave Quigley schrieb: There is a project listed on the kernel.org git page called guilt. I find it very useful. It is much more responsive than stgit and it actually has a git backend which quilt does not. On Wed, 2007-11-28 at 00:20 +0100, Jan Engelhardt wrote: On Nov 27 2007 23:33, Tilman Schmidt wrote: Well, it did. So now I'm back to keeping a virgin kernel source tree alongside my development area in order to produce diffs. That can't be right? No, it can't. Use stgit/quilt ;p In which respect would stgit/quilt/guilt help me? At first glance they just seem to add another level of complexity. Thanks, Tilman These tools allow you to maintain a set of patches with very little effort. More importantly it removes a lot of the git specifics from your development process. For example this is how I use guilt for a new patch set. I take my fresh tree and do a guilt-init in the base. This will create a new patch series. I then need to create a patch to modify something LSM related (guilt-new ). Things like stgit/quilt/git use the idea of a stack of patches. At this point if you were to type guilt-series you would only see the one patch we just created. This patch is going to be one logical set of changes (it should also produce a compilable and working kernel). You can make whatever modifications you need to make to your files and at this point you need to do one of two things. If they were already in the tree you just type guilt-refresh and under your .git/patches/ directory you will see a file named which contains your patch. Otherwise you need to do a guilt-add and then a guilt-refresh. The idea here is that you have a moved your workflow from managing a series of commits and then breaking out patches from a final version to one where you think in terms of the patches and make modifications to them instead. In my example I said I was doing something LSM related. Lets say the first patch added a new hook and its implementation in the various modules. We can now add a second patch using the guilt-new command and this one will add uses of that new hook. At this point we have a stack that looks like this. I can pop and push patches onto this stack to have a version of my kernel tree at any state within the patch set. At this point lets say we have posted the patch set and have feedback. I need to apply this feedback to the patch that adds the LSM hook. Since my top patch (guilt-top) is currently at the one that adds the users of the hook I need to pop off that patch and get to the one that creates the hook (guilt-pop). After doing this I'm at a kernel tree state which just has the changes which add the hook. I make my modifications, type guilt-refresh to create a new patch and then guilt-push my second patch on and make sure everything is still working. As you can see there is almost no git knowledge required to use this system and it allows you to focus on development instead of the versioning system. One useful feature is that when Linus adds new patches and I want to rebase my set against the current tree It only takes 3 commands to rebase the patch set (Assuming all goes well). guilt-push -a #push all patches onto the stack git-fetch #pull down the index guilt-rebase FETCH_HEAD #Rebase our patches should do a merge and #reapply all patches These are just some basics about guilt. Jeff has written a better tutorial with a sample repository for you to work with if your interested. I don't know if this will help your development process but I can tell you from experience breaking patches by hand was a pain in the ass and a huge waste of time and I'm glad to have a tool like this Where can I find that tutorial ? regards now. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
can we discuss
Re: LANANA: To Pending Device Number Registrants
[EMAIL PROTECTED] wrote: [lots of sensible comments to get this discussion on track] So we have: user space has file names and uses open() or mount(). Kernel space has device paths. In principle the kernel could just number the devices it sees 1,2,... and export information about them, so that user space can choose the right number. The part about exporting information is good. User space needs to be able to ask if a certain beast is a CD reader, and if so what manufacturer and model. But the part about numbering 1,2,... may not be good enough, e.g. because it does not survive reboots. If we remain Unix-like and use device nodes in user space to pair a file name with a number, then it would be very nice if the number encoded the device path uniquely. Many programs expect this. It cannot be done in all cases, but a good approximation is obtained if the number is a hash of the device path. In so far the hash is collision free we obtain numbers that stay unique over a reboot. I think here we might learn from the comments that people made about how AIX and OSF/1/Tru64 have been doing this. You want the relation between a given /dev/xyz node and the corresponding device in the kernel to be relatively stable from the perspective of the user, so you can use it to implement device naming and access policies. What stable means technically is a matter of policy, since it depends very much on the perspective of the user. You also want the relation between a given /dev/xyz node and the major/minor number of the related device to be fixed, so you don't need to recreate all device nodes on every boot. The issue is then how to establish the relation between major/minor numbers and (potentially very dynamic) kernel-internal device entities, in such a way that the type of stability desired by the user is provided. As people noted, the mechanism used by AIX and Tru64 is essentially to have this relationship established by a user-level tool using a configuration file with policy information, and a status file to maintain state information across reboots. This tool can implement any policy it likes, which means we can avoid all the policy discusions where the kernel is concerned, and can instead focus on methods. For the people who want things to be backward compatible, it would be sufficient to implement a default tool that will set all the major/minor numbers exactly like they are now, so all existing device nodes will keep working as they did. In order to support hot plugging of devices, I suppose this tool should either run as a daemon, or set itself up to be called by the kernel whenever something happens. (I think I would prefer the daemon...) Sort of like /sbin/hotplug. The next issue to discuss is then how to shape the mechanism by which the device manager tool and the kernel exchange information on what devices are out there, and what major/minor numbers each should be assigned. The sort of information that a device manager might want to use, depending on how it is implemented and configured, would include such things as: device driver id, I/O bus, I/O port, IRQ nr, SCSI card/bus/id/lun, partition id, serial number, device model/type, filesystem label, filesystem UUID, phase of the moon, etc, etc. Basically, anything that a driver or related subsystem can come up with that might help identify a device or resource in any particular situation, depending on user preferences. I think this may fit somewhere in /proc/sys/dev or thereabout. The big advantage of this approach is basically that we can continue to use the good old device nodes in the filesystem as we have done for the last 30 years, with all the advantages, but without the major/minor number allocation troubles. Hmm, this all sounds too simple. I must be talking nonsense. Please wake me up. ;-) -- Willem Konynenberg [EMAIL PROTECTED] I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question -- Charles Babbage - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [CHECKER] null bugs in 2.4.4 and 2.4.4-ac8
Dawson Engler wrote: Hi All, Enclosed are 103 potential errors where code gets a pointer from a possibly-failing routine (kmalloc, etc) and dereferences it without [BUG] osst_do_scsi will never return NULL if argument SRpnt isn't NULL. But they copy SRpnt back by *aSRpnt, implies it could be NULL No. It implies SRpnt could have changed. The functions flagged (osst_read_back_buffer_and_rewrite and osst_reposition_and_retry) cannot be reached with SRpnt == NULL. So these are false alarms. /u2/engler/mc/oses/linux/2.4.4/drivers/scsi/osst.c:1163:osst_read_back_buffer_and_rewrite: ERROR:NULL::1163: Using unknown ptr SRpnt illegally! set by 'osst_do_scsi':1163 [nbytes = 216] #if DEBUG if (debugging) printk(OSST_DEB_MSG osst%d: About to attempt to write to frame %d\n, dev, new_block+i); #endif SRpnt = osst_do_scsi(SRpnt, STp, cmd, OS_FRAME_SIZE, SCSI_DATA_WRITE, Start --- STp-timeout, MAX_WRITE_RETRIES, TRUE); ... DELETED 46 lines ... } } if (flag) { if ((SRpnt-sr_sense_buffer[ 2] 0x0f) == 13 SRpnt-sr_sense_buffer[12] == 0 Error --- SRpnt-sr_sense_buffer[13] == 2) { printk(KERN_ERR osst%d: Volume overflow in write error recovery\n, dev); vfree((void *)buffer); return (-EIO); /* hit end of tape = fail */ Regards. Willem Riede. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [CHECKER] null bugs in 2.4.4 and 2.4.4-ac8
Junfeng Yang wrote: On Thu, 24 May 2001, Willem Riede wrote: Dawson Engler wrote: Hi All, Enclosed are 103 potential errors where code gets a pointer from a possibly-failing routine (kmalloc, etc) and dereferences it without [BUG] osst_do_scsi will never return NULL if argument SRpnt isn't NULL. But they copy SRpnt back by *aSRpnt, implies it could be NULL No. It implies SRpnt could have changed. The functions flagged (osst_read_back_buffer_and_rewrite and osst_reposition_and_retry) cannot be reached with SRpnt == NULL. So these are false alarms. these are false positives if osst_read_back_buffer_and rewrite can't be reached with SRpnt == NULL. It seems that osst_do_scsi will not change SRpnt unless it is NULL though. That is currently true, and the re-assignment of SRpnt is superfluous but harmless. It is not a design constraint though that SRpnt cannot change (except it can't change to NULL), so I prefer to leave the code as is. In other words, SRpnt is changed by osst_do_scsi = the initial argument SRpnt == NULL. Probabaly the pointer aSRpnt is useless. The pointer aSRpnt is not useless, it's used to communicate the current value of SRpnt throughout the driver. Regards, Willem Riede. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] wrong stat of NTFS volume in linux-kernel 2.4.0
I tested the 2.4.0 kernel and it reported a wrong volume size for ntfs partitions. The size of the partition (as reported by df -k) was a factor of 8 too high. After some searching I found that super.c in /usr/src/linux/fs/ntfs was changed. In my previous kernel 2.2.16 a division by the number of blocks in a cluster was made although it included a remark about the 64 bit division routine, and thus used only a 32 bit division. In the latest kernel (2.4.0) the division is completly removed, and thus the volume size is incorrect. I included a patch for the situation. This routine presupposes that the number of blocks in a cluster is always a power of 2, and uses right shifts. As far as I know all NTFS cluster sizes are a power of two, which is confirmed by the options of the format command in NT and some knowledgebase articles of Microsoft. If not an message will appear in the syslog. (printk) Willem Dekker Patch to adapt super.c of the ntfs filesystem. On a stock linux 2.4.0 kernel. --- fs/ntfs/super.orig Tue Nov 7 20:22:35 2000 +++ fs/ntfs/super.c Sun Jan 7 22:34:43 2001 @@ -300,6 +300,82 @@ return 0; } +static ntfs_u64 div_by_power_of_2(ntfs_u64 numerator, int denominator) +{ + switch(denominator) { + case 0 : + printk("Division by zero error at div_by_power_of_2"); + return 0; + case 1 : + return numerator ; + case 2 : + return numerator 1; + case 4 : + return numerator 2; + case 8 : + return numerator 3; + case 16 : + return numerator 4; + case 32 : + return numerator 5; + case 64 : + return numerator 6; + case 128 : + return numerator 7; + case 256 : + return numerator 8; + case 512 : + return numerator 9; + case 1024 : + return numerator 10; + case 2048 : + return numerator 11; + case 4096 : + return numerator 12; + case 8192 : + return numerator 13; + case 16384 : + return numerator 14; + case 32768 : + return numerator 15; + case 65536 : + return numerator 16; + case 131072 : + return numerator 17; + case 262144 : + return numerator 18; + case 524288 : + return numerator 19; + case 1048576 : + return numerator 20; + case 2097152 : + return numerator 21; + case 4194304 : + return numerator 22; + case 8388608 : + return numerator 23; + case 16777216 : + return numerator 24; + case 33554432 : + return numerator 25; + case 67108864 : + return numerator 26; + case 134217728 : + return numerator 27; + case 268435456 : + return numerator 28; + case 536870912 : + return numerator 29; + case 1073741824 : + return numerator 30; + case 2147483648U : + return numerator 31; + default : + printk("Not dividing by power of 2 in div_by_power_of_2!!"); + return 0; + } + return 0; +} /* * Writes the volume size into vol_size. Returns 0 if successful * or error. @@ -323,6 +399,7 @@ io.size=vol-clustersize; ntfs_getput_clusters(vol,0,0,io); *vol_size = NTFS_GETU64(cluster0+0x28); +*vol_size = div_by_power_of_2(*vol_size,(vol-clusterfactor)); ntfs_free(cluster0); return 0; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: osst ide-2.2.19 conflict?
Anil B. Somayaji wrote: In the ide.2.2.19.05042001 patch, there is the following bit of code in ide-scsi.c, which prevents the ide-scsi driver from allowing access to an OnStream DI-30 tape drive. This is strange, since this same drive can be used with the included ide-scsi + osst drivers in the stock 2.2.19 kernel. If you delete this bit, however, ide-scsi recognizes the drive, and the osst driver seems to work fine. Here's the offending code: #ifndef CONFIG_BLK_DEV_IDETAPE /* * The Onstream DI-30 does not handle clean emulation, yet. */ if (strstr(drive-id-model, OnStream DI-30)) { printk(ide-tape: ide-scsi emulation is not supported for %s.\n, drive-id-model); continue; } #endif /* CONFIG_BLK_DEV_IDETAPE */ Any reason for this to stay in the ide patch, or is it now obsolete? It is obsolete, and can safely be removed. Success. Willem Riede. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6 patch] drivers/scsi/osst.c: make code static
On 03/22/2005 09:28:07 AM, Adrian Bunk wrote: This patch makes needlessly global code static. Signed-off-by: Adrian Bunk [EMAIL PROTECTED] James, I agree with this, can you put it in BK, please? Signed-off-by: Willem Riede [EMAIL PROTECTED] --- This patch was already sent on: - 28 Feb 2005 drivers/scsi/osst.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) --- linux-2.6.11-rc4-mm1-full/drivers/scsi/osst.c.old 2005-02-28 19:36:05.0 +0100 +++ linux-2.6.11-rc4-mm1-full/drivers/scsi/osst.c 2005-02-28 19:36:25.0 +0100 @@ -24,7 +24,7 @@ */ static const char * cvsid = $Id: osst.c,v 1.73 2005/01/01 21:13:34 wriede Exp $; -const char * osst_version = 0.99.3; +static const char * osst_version = 0.99.3; /* The failure to reconnect firmware bug */ #define OSST_FW_NEED_POLL_MIN 10601 /*(107A)*/ @@ -170,7 +170,7 @@ static int osst_probe(struct device *); static int osst_remove(struct device *); -struct scsi_driver osst_template = { +static struct scsi_driver osst_template = { .owner = THIS_MODULE, .gendrv = { .name = osst, - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6 patch] drivers/scsi/osst.c: remove unused code
On 03/22/2005 04:59:48 PM, Adrian Bunk wrote: Thanks to both the Coverity checker and GNU gcc, it was found that this variable is completely unused. Signed-off-by: Adrian Bunk [EMAIL PROTECTED] And it is so obvious when your attention is drawn to it. The code that did use it moved to os_scsi_tape_flush() recently. James, could you make this change in BK too, please? Signed-off-by: Willem Riede [EMAIL PROTECTED] --- linux-2.6.12-rc1-mm1-full/drivers/scsi/osst.c.old 2005-03-22 21:04:36.0 +0100 +++ linux-2.6.12-rc1-mm1-full/drivers/scsi/osst.c 2005-03-22 22:09:32.0 +0100 @@ -4770,9 +4770,6 @@ static int os_scsi_tape_close(struct ino { int result = 0; struct osst_tape* STp= filp-private_data; - struct scsi_request * SRpnt = NULL; - - if (SRpnt) scsi_release_request(SRpnt); if (STp-door_locked == ST_LOCKED_AUTO) do_door_lock(STp, 0); - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: NCQ support NVidia NForce4 (CK804) SATAII
On Wed, 10 Aug 2005 20:54:35 +, Allen Martin wrote: Likely the only way nForce4 NCQ support could be added under Linux would be with a closed source binary driver, and no one really wants that, especially for storage / boot volume. We decided it wasn't worth the headache of a binary driver for this one feature. Future nForce chipsets will have a redesigned SATA controller where we can be more open about documenting it. That is disappointing. I was seriously considering a motherboard with your chipset because of its impressive specifications, but now I have to reconsider (I'm one of those folks that never bought an nVidia graphics board due to the lack of open full-function driver). I _hate_ not being able to use all features. Any chance your company will reconsider? Are you in a position to make that decision? Is there a VP I can write to (politely) to support the case? I don't understand your reluctance in this case (I do for your graphics processors) - it's not as if adding this function to sata_nv would expose your crown jewels - you write yourself that next time you'd use a different (better) interface... Regards, Willem Riede. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: LANANA: To Pending Device Number Registrants
[EMAIL PROTECTED] wrote: [lots of sensible comments to get this discussion "on track"] > So we have: user space has file names and uses open() or mount(). > Kernel space has device paths. > > In principle the kernel could just number the devices it sees 1,2,... > and export information about them, so that user space can choose > the right number. > The part about exporting information is good. User space needs to > be able to ask if a certain beast is a CD reader, and if so what > manufacturer and model. > But the part about numbering 1,2,... may not be good enough, e.g. > because it does not survive reboots. If we remain Unix-like and use > device nodes in user space to pair a file name with a number, then > it would be very nice if the number encoded the device path uniquely. > Many programs expect this. > It cannot be done in all cases, but a good approximation is obtained > if the number is a hash of the device path. In so far the hash is > collision free we obtain numbers that stay unique over a reboot. I think here we might learn from the comments that people made about how AIX and OSF/1/Tru64 have been doing this. You want the relation between a given /dev/xyz node and the corresponding device in the kernel to be relatively stable from the perspective of the user, so you can use it to implement device naming and access policies. What "stable" means technically is a matter of policy, since it depends very much on the perspective of the user. You also want the relation between a given /dev/xyz node and the major/minor number of the related device to be "fixed", so you don't need to recreate all device nodes on every boot. The issue is then how to establish the relation between major/minor numbers and (potentially very dynamic) kernel-internal device entities, in such a way that the type of stability desired by the user is provided. As people noted, the mechanism used by AIX and Tru64 is essentially to have this relationship established by a user-level tool using a configuration file with policy information, and a status file to maintain state information across reboots. This tool can implement any policy it likes, which means we can avoid all the policy discusions where the kernel is concerned, and can instead focus on methods. For the people who want things to be backward compatible, it would be sufficient to implement a default tool that will set all the major/minor numbers exactly like they are now, so all existing device nodes will keep working as they did. In order to support hot plugging of devices, I suppose this tool should either run as a daemon, or set itself up to be called by the kernel whenever something happens. (I think I would prefer the daemon...) Sort of like /sbin/hotplug. The next issue to discuss is then how to shape the mechanism by which the device manager tool and the kernel exchange information on what devices are out there, and what major/minor numbers each should be assigned. The sort of information that a device manager might want to use, depending on how it is implemented and configured, would include such things as: device driver id, I/O bus, I/O port, IRQ nr, SCSI card/bus/id/lun, partition id, serial number, device model/type, filesystem label, filesystem UUID, phase of the moon, etc, etc. Basically, anything that a driver or related subsystem can come up with that might help identify a device or resource in any particular situation, depending on user preferences. I think this may fit somewhere in /proc/sys/dev or thereabout. The big advantage of this approach is basically that we can continue to use the good old device nodes in the filesystem as we have done for the last 30 years, with all the advantages, but without the major/minor number allocation troubles. Hmm, this all sounds too simple. I must be talking nonsense. Please wake me up. ;-) -- Willem Konynenberg <[EMAIL PROTECTED]> I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question -- Charles Babbage - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [CHECKER] null bugs in 2.4.4 and 2.4.4-ac8
Dawson Engler wrote: > > Hi All, > > Enclosed are 103 potential errors where code gets a pointer from a > possibly-failing routine (kmalloc, etc) and dereferences it without > > [BUG] osst_do_scsi will never return NULL if argument SRpnt isn't NULL. But they >copy SRpnt back by *aSRpnt, implies it could be NULL No. It implies SRpnt could have changed. The functions flagged (osst_read_back_buffer_and_rewrite and osst_reposition_and_retry) cannot be reached with SRpnt == NULL. So these are false alarms. > >/u2/engler/mc/oses/linux/2.4.4/drivers/scsi/osst.c:1163:osst_read_back_buffer_and_rewrite: > ERROR:NULL::1163: Using unknown ptr "SRpnt" illegally! set by >'osst_do_scsi':1163 [nbytes = 216] > #if DEBUG > if (debugging) > printk(OSST_DEB_MSG "osst%d: About to attempt to write to >frame %d\n", dev, new_block+i); > #endif > SRpnt = osst_do_scsi(SRpnt, STp, cmd, OS_FRAME_SIZE, SCSI_DATA_WRITE, > Start ---> > STp->timeout, MAX_WRITE_RETRIES, TRUE); > > ... DELETED 46 lines ... > > } > } > if (flag) { > if ((SRpnt->sr_sense_buffer[ 2] & 0x0f) == 13 && > SRpnt->sr_sense_buffer[12] == 0 && > Error ---> > SRpnt->sr_sense_buffer[13] == 2) { > printk(KERN_ERR "osst%d: Volume overflow in write >error recovery\n", dev); > vfree((void *)buffer); > return (-EIO); /* hit end of tape = >fail */ > Regards. Willem Riede. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [CHECKER] null bugs in 2.4.4 and 2.4.4-ac8
Junfeng Yang wrote: > > On Thu, 24 May 2001, Willem Riede wrote: > > > Dawson Engler wrote: > > > > > > Hi All, > > > > > > Enclosed are 103 potential errors where code gets a pointer from a > > > possibly-failing routine (kmalloc, etc) and dereferences it without > > > > > > [BUG] osst_do_scsi will never return NULL if argument SRpnt isn't NULL. But they >copy SRpnt back by *aSRpnt, implies it could be NULL > > > > No. It implies SRpnt could have changed. The functions flagged > > (osst_read_back_buffer_and_rewrite and osst_reposition_and_retry) > > cannot be reached with SRpnt == NULL. So these are false alarms. > > these are false positives if osst_read_back_buffer_and rewrite can't be > reached with SRpnt == NULL. It seems that osst_do_scsi will not change > SRpnt unless it is NULL though. That is currently true, and the re-assignment of SRpnt is superfluous but harmless. It is not a design constraint though that SRpnt cannot change (except it can't change to NULL), so I prefer to leave the code as is. > In other words, SRpnt is changed by > osst_do_scsi <=> the initial argument SRpnt == NULL. Probabaly the pointer > aSRpnt is useless. > The pointer aSRpnt is not useless, it's used to communicate the current value of SRpnt throughout the driver. Regards, Willem Riede. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH: ide-scsi.c to allow claiming of Onstream drives
Andre Hedrick wrote: > > On Wed, 13 Sep 2000, Matthew Dharm wrote: > > > Attached is a patch to ide-scsi.c against 2.4.0-test8. Please consider > > applying it. > > > > This patch removes the logic which causes ide-scsi to refuse to attach > > itself to an IDE OnStream drive. While these drives are not supported by > > the standard st.c driver, there is userspace support (using the sg > > interface) which works with the ATAPI drive with ide-scsi patched with this > > patch. > > > > So there really is no reason for ide-scsi to reject this device. > > Yes, > > There is a reason that it was put there, DI-30's require special IO that > st.o does not know how to do. > > Look at all the hoops that it took to get ide-tape to work. > If ide-scsi would have done the trick, it would have saved headaches. Indeed. But in the mean time, a scsi driver for Onstream drives has been written (by me). Together with Kai Makisara I have resolved how to get those drives detected by this new driver (osst) and not by st. The second patch that Matt refers to inserts that logic into st so it does not try to support these drives and fails due to the drive particulars. Since SCSI, USB and IDE versions of these drives exist, patching st is more appropriate than ide-scsi. Regards. Willem Riede. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: osst & ide-2.2.19 conflict?
"Anil B. Somayaji" wrote: > > In the ide.2.2.19.05042001 patch, there is the following bit of code > in ide-scsi.c, which prevents the ide-scsi driver from allowing access > to an OnStream DI-30 tape drive. This is strange, since this same > drive can be used with the included ide-scsi + osst drivers in the > stock 2.2.19 kernel. If you delete this bit, however, ide-scsi > recognizes the drive, and the osst driver seems to work fine. > > Here's the offending code: > > #ifndef CONFIG_BLK_DEV_IDETAPE >/* > * The Onstream DI-30 does not handle clean emulation, yet. > */ >if (strstr(drive->id->model, "OnStream DI-30")) { >printk("ide-tape: ide-scsi emulation is not supported for %s.\n", >drive->id->model); >continue; >} > #endif /* CONFIG_BLK_DEV_IDETAPE */ > > Any reason for this to stay in the ide patch, or is it now obsolete? > It is obsolete, and can safely be removed. Success. Willem Riede. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] wrong stat of NTFS volume in linux-kernel 2.4.0
I tested the 2.4.0 kernel and it reported a wrong volume size for ntfs partitions. The size of the partition (as reported by df -k) was a factor of 8 too high. After some searching I found that super.c in /usr/src/linux/fs/ntfs was changed. In my previous kernel 2.2.16 a division by the number of blocks in a cluster was made although it included a remark about the 64 bit division routine, and thus used only a 32 bit division. In the latest kernel (2.4.0) the division is completly removed, and thus the volume size is incorrect. I included a patch for the situation. This routine presupposes that the number of blocks in a cluster is always a power of 2, and uses right shifts. As far as I know all NTFS cluster sizes are a power of two, which is confirmed by the options of the format command in NT and some knowledgebase articles of Microsoft. If not an message will appear in the syslog. (printk) Willem Dekker Patch to adapt super.c of the ntfs filesystem. On a stock linux 2.4.0 kernel. >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> --- fs/ntfs/super.orig Tue Nov 7 20:22:35 2000 +++ fs/ntfs/super.c Sun Jan 7 22:34:43 2001 @@ -300,6 +300,82 @@ return 0; } +static ntfs_u64 div_by_power_of_2(ntfs_u64 numerator, int denominator) +{ + switch(denominator) { + case 0 : + printk("Division by zero error at div_by_power_of_2"); + return 0; + case 1 : + return numerator ; + case 2 : + return numerator >> 1; + case 4 : + return numerator >> 2; + case 8 : + return numerator >> 3; + case 16 : + return numerator >> 4; + case 32 : + return numerator >> 5; + case 64 : + return numerator >> 6; + case 128 : + return numerator >> 7; + case 256 : + return numerator >> 8; + case 512 : + return numerator >> 9; + case 1024 : + return numerator >> 10; + case 2048 : + return numerator >> 11; + case 4096 : + return numerator >> 12; + case 8192 : + return numerator >> 13; + case 16384 : + return numerator >> 14; + case 32768 : + return numerator >> 15; + case 65536 : + return numerator >> 16; + case 131072 : + return numerator >> 17; + case 262144 : + return numerator >> 18; + case 524288 : + return numerator >> 19; + case 1048576 : + return numerator >> 20; + case 2097152 : + return numerator >> 21; + case 4194304 : + return numerator >> 22; + case 8388608 : + return numerator >> 23; + case 16777216 : + return numerator >> 24; + case 33554432 : + return numerator >> 25; + case 67108864 : + return numerator >> 26; + case 134217728 : + return numerator >> 27; + case 268435456 : + return numerator >> 28; + case 536870912 : + return numerator >> 29; + case 1073741824 : + return numerator >> 30; + case 2147483648U : + return numerator >> 31; + default : + printk("Not dividing by power of 2 in div_by_power_of_2!!"); + return 0; + } + return 0; +} /* * Writes the volume size into vol_size. Returns 0 if successful * or error. @@ -323,6 +399,7 @@ io.size=vol->clustersize; ntfs_getput_clusters(vol,0,0,); *vol_size = NTFS_GETU64(cluster0+0x28); +*vol_size = div_by_power_of_2(*vol_size,(vol->clusterfactor)); ntfs_free(cluster0); return 0; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6 patch] drivers/scsi/osst.c: make code static
On 03/22/2005 09:28:07 AM, Adrian Bunk wrote: > This patch makes needlessly global code static. > > Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]> James, I agree with this, can you put it in BK, please? Signed-off-by: Willem Riede <[EMAIL PROTECTED]> > --- > > This patch was already sent on: > - 28 Feb 2005 > > drivers/scsi/osst.c |4 ++-- > 1 files changed, 2 insertions(+), 2 deletions(-) > > --- linux-2.6.11-rc4-mm1-full/drivers/scsi/osst.c.old 2005-02-28 > 19:36:05.0 +0100 > +++ linux-2.6.11-rc4-mm1-full/drivers/scsi/osst.c 2005-02-28 > 19:36:25.0 +0100 > @@ -24,7 +24,7 @@ > */ > > static const char * cvsid = "$Id: osst.c,v 1.73 2005/01/01 21:13:34 wriede > Exp $"; > -const char * osst_version = "0.99.3"; > +static const char * osst_version = "0.99.3"; > > /* The "failure to reconnect" firmware bug */ > #define OSST_FW_NEED_POLL_MIN 10601 /*(107A)*/ > @@ -170,7 +170,7 @@ > static int osst_probe(struct device *); > static int osst_remove(struct device *); > > -struct scsi_driver osst_template = { > +static struct scsi_driver osst_template = { > .owner = THIS_MODULE, > .gendrv = { > .name = "osst", > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [2.6 patch] drivers/scsi/osst.c: remove unused code
On 03/22/2005 04:59:48 PM, Adrian Bunk wrote: > Thanks to both the Coverity checker and GNU gcc, it was found that this > variable is completely unused. > > Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]> And it is so obvious when your attention is drawn to it. The code that did use it moved to os_scsi_tape_flush() recently. James, could you make this change in BK too, please? Signed-off-by: Willem Riede <[EMAIL PROTECTED]> > --- linux-2.6.12-rc1-mm1-full/drivers/scsi/osst.c.old 2005-03-22 > 21:04:36.0 +0100 > +++ linux-2.6.12-rc1-mm1-full/drivers/scsi/osst.c 2005-03-22 > 22:09:32.0 +0100 > @@ -4770,9 +4770,6 @@ static int os_scsi_tape_close(struct ino > { > int result = 0; > struct osst_tape* STp= filp->private_data; > - struct scsi_request * SRpnt = NULL; > - > - if (SRpnt) scsi_release_request(SRpnt); > > if (STp->door_locked == ST_LOCKED_AUTO) > do_door_lock(STp, 0); > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: NCQ support NVidia NForce4 (CK804) SATAII
On Wed, 10 Aug 2005 20:54:35 +, Allen Martin wrote: > Likely the only way nForce4 NCQ support could be added under Linux would > be with a closed source binary driver, and no one really wants that, > especially for storage / boot volume. We decided it wasn't worth the > headache of a binary driver for this one feature. Future nForce > chipsets will have a redesigned SATA controller where we can be more > open about documenting it. That is disappointing. I was seriously considering a motherboard with your chipset because of its impressive specifications, but now I have to reconsider (I'm one of those folks that never bought an nVidia graphics board due to the lack of open full-function driver). I _hate_ not being able to use all features. Any chance your company will reconsider? Are you in a position to make that decision? Is there a VP I can write to (politely) to support the case? I don't understand your reluctance in this case (I do for your graphics processors) - it's not as if adding this function to sata_nv would expose your crown jewels - you write yourself that next time you'd use a different (better) interface... Regards, Willem Riede. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: status of kernel memory debugging?
UML is still too strange for valgrind, despite progress on both sides (valgrind accepting more strange things and UML becoming less strange). I tried grinding UML a month or so ago, and its use of clone was a sticking point. I think I read your remark, yes. Thanks for the update; I hoped some strides had been made in that direction since then. Personally, I know to little about the topic to be of any use. I guess the best option is then using slab caches per object type, so that I can at least find obvious memory leaks. On a sidenote, it'll be interesting to see what valgrind reports once (if?) the kernel gets a good grinding. Willem - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net] ip6tnl: fix use after free of fb_tnl_dev
On Thu, Nov 14, 2013 at 9:47 AM, Nicolas Dichtel nicolas.dich...@6wind.com wrote: Bug has been introduced by commit bb8140947a24 (ip6tnl: allow to use rtnl ops on fb tunnel). When ip6_tunnel.ko is unloaded, FB device is delete by rtnl_link_unregister() and then we try to use the pointer in ip6_tnl_destroy_tunnels(). Let's add an handler for dellink, which will never remove the FB tunnel. With this patch it will no more be possible to remove it via 'ip link del ip6tnl0', but it's safer. The same fix was already proposed by Willem de Bruijn will...@google.com for sit interfaces. CC: Willem de Bruijn will...@google.com Reported-by: Steven Rostedt rost...@goodmis.org Signed-off-by: Nicolas Dichtel nicolas.dich...@6wind.com Acked-by: Willem de Bruijn will...@google.com Also ran a test similar to the one for sit: `modprobe ip6_tunnel; rmmod ip6_tunnel` with CONFIG_DEBUG_SLAB=y. This exposed the bug at HEAD, completes successfully with the patch applied. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v10 net-next 2/6] net: add low latency socket poll
On Mon, Jun 10, 2013 at 10:29 AM, Eric Dumazet eric.duma...@gmail.com wrote: On Mon, 2013-06-10 at 11:39 +0300, Eliezer Tamir wrote: Adds an ndo_ll_poll method and the code that supports it. This method can be used by low latency applications to busy-poll Ethernet device queues directly from the socket code. sysctl_net_ll_poll controls how many microseconds to poll. Default is zero (disabled). Individual protocol support will be added by subsequent patches. Signed-off-by: Alexander Duyck alexander.h.du...@intel.com Signed-off-by: Jesse Brandeburg jesse.brandeb...@intel.com Signed-off-by: Eliezer Tamir eliezer.ta...@linux.intel.com --- Documentation/sysctl/net.txt |7 ++ include/linux/netdevice.h|3 + include/linux/skbuff.h |8 ++ include/net/ll_poll.h| 148 ++ include/net/sock.h |4 + include/uapi/linux/snmp.h|1 net/Kconfig | 12 +++ net/core/skbuff.c|4 + net/core/sock.c |6 ++ net/core/sysctl_net_core.c | 10 +++ net/ipv4/proc.c |1 net/socket.c |6 ++ 12 files changed, 208 insertions(+), 2 deletions(-) create mode 100644 include/net/ll_poll.h Acked-by: Eric Dumazet eduma...@google.com Tested-by: Willem de Bruijn will...@google.com Per Eliezer's request, I applied v10 to a small set of workloads (netperf tcp_rr, udp_rr, 100x tcp_rr, 100x udp_rr and a poll()-based tcp rr) for some functional testing. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v10 net-next 1/6] net: add napi_id and hash
On Mon, Jun 10, 2013 at 5:22 AM, Eric Dumazet eric.duma...@gmail.com wrote: On Mon, 2013-06-10 at 11:39 +0300, Eliezer Tamir wrote: Adds a napi_id and a hashing mechanism to lookup a napi by id. This will be used by subsequent patches to implement low latency Ethernet device polling. Based on a code sample by Eric Dumazet. Signed-off-by: Eliezer Tamir eliezer.ta...@linux.intel.com --- include/linux/netdevice.h | 29 ++ net/core/dev.c| 59 + 2 files changed, 88 insertions(+), 0 deletions(-) Signed-off-by: Eric Dumazet eduma...@google.com Tested-by: Willem de Bruijn will...@google.com (per Eliezer's request) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v10 net-next 3/6] udp: add low latency socket poll support
On Mon, Jun 10, 2013 at 10:31 AM, Eric Dumazet eric.duma...@gmail.com wrote: On Mon, 2013-06-10 at 11:40 +0300, Eliezer Tamir wrote: Add upport for busy-polling on UDP sockets. In __udp[46]_lib_rcv add a call to sk_mark_ll() to copy the napi_id from the skb into the sk. This is done at the earliest possible moment, right after we identify which socket this skb is for. In __skb_recv_datagram When there is no data and the user tries to read we busy poll. Signed-off-by: Alexander Duyck alexander.h.du...@intel.com Signed-off-by: Jesse Brandeburg jesse.brandeb...@intel.com Signed-off-by: Eliezer Tamir eliezer.ta...@linux.intel.com --- net/core/datagram.c |4 net/ipv4/udp.c |6 +- net/ipv6/udp.c |6 +- 3 files changed, 14 insertions(+), 2 deletions(-) Acked-by: Eric Dumazet eduma...@google.com Tested-by: Willem de Bruijn will...@google.com (per Eliezer's request) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v10 net-next 4/6] tcp: add low latency socket poll support.
On Mon, Jun 10, 2013 at 10:32 AM, Eric Dumazet eric.duma...@gmail.com wrote: On Mon, 2013-06-10 at 11:40 +0300, Eliezer Tamir wrote: Adds low latency socket poll support for TCP. In tcp_v[46]_rcv() add a call to sk_mark_ll() to copy the napi_id from the skb to the sk. In tcp_recvmsg(), when there is no data in the socket we busy-poll. This is a good example of how to add busy-poll support to more protocols. Signed-off-by: Alexander Duyck alexander.h.du...@intel.com Signed-off-by: Jesse Brandeburg jesse.brandeb...@intel.com Signed-off-by: Eliezer Tamir eliezer.ta...@linux.intel.com --- Acked-by: Eric Dumazet eduma...@google.com Tested-by: Willem de Bruijn will...@google.com (per Eliezer's request) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v9 net-next 2/7] net: add low latency socket poll
On Wed, Jun 5, 2013 at 9:23 AM, Eric Dumazet eric.duma...@gmail.com wrote: On Wed, 2013-06-05 at 13:34 +0300, Eliezer Tamir wrote: Adds an ndo_ll_poll method and the code that supports it. This method can be used by low latency applications to busy-poll Ethernet device queues directly from the socket code. sysctl_net_ll_poll controls how many microseconds to poll. Default is zero (disabled). Individual protocol support will be added by subsequent patches. Signed-off-by: Alexander Duyck alexander.h.du...@intel.com Signed-off-by: Jesse Brandeburg jesse.brandeb...@intel.com Tested-by: Willem de Bruijn will...@google.com Signed-off-by: Eliezer Tamir eliezer.ta...@linux.intel.com --- Are you sure this version was tested by Willem ? I hadn't tested the latest revisions, indeed. Am testing this one right now. Acked-by: Eric Dumazet eduma...@google.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 net-next 3/4] ixgbe: Add support for ndo_ll_poll
On Tue, May 21, 2013 at 2:54 AM, Eliezer Tamir eliezer.ta...@linux.intel.com wrote: On 20/05/2013 23:20, Or Gerlitz wrote: On Mon, May 20, 2013 at 1:16 PM, Eliezer Tamir eliezer.ta...@linux.intel.com wrote: Add the ixgbe driver code implementing ndo_ll_poll. It should be easy for other drivers to do something similar in order to enable support for CONFIG_INET_LL_RX_POLL I am not sure, Willem ported this to some undisclosed HW that they use at Google, his feedback was that it was not a major effort. The core ndo_ll_poll implementation is generally a subset of a device driver's existing napi callback. It cleans the queues, but it skips napi_complete and unmasking of the IRQ. + ixgbe_for_each_ring(ring, q_vector-rx) { + found = ixgbe_clean_rx_irq(q_vector, ring, 4); + if (found) + break; + } A subtle difference in the above code vs ixgbe_poll is that the callback returns as soon as some data arrived on a queue, as opposed to iterating over all queues. The budget is lower, too. Since not all data arriving is necessarily destined towards polling socket, this may or may not be an improvement. Besides that, the driver has to mark the packet with ll_mark_skb(cq-napi, skb); On devices where tx completion interrupts share the same IRQ as rx interrupts, the driver may also have to clean the tx queue once in a while (at obvious tail latency cost). LLS does not disable the IRQ, but I think the suggestion was to set its moderation threshold very high to avoid net_rx_action/LLS lock contention. If so, starvation may occur. The most difficult bit is handling mutual exclusion with the interrupt-driven receive path. The ixgbe port has its own internal locking mechanism in anticipation of future use cases that can be lock-free. As first approximation, I just took the napi-poll_lock, similar to how netpoll handles mutual exclusion with net_rx_action. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 net-next 0/4] net: low latency Ethernet device polling
2. How is the logic aware of RSS and RFS? With TCP sockets, the driver knows the specific ring it need to poll so this should be mapped and provide the best latency. This code is blissfully oblivious of RFS and RSS, it only assumes that the packets for a socket are likely to continue to come on the same queue. The code is designed to be correct even if you get your data on the wrong queue. (your performance will suffer but no more than that.) For low latency, you don't want to have to wait for the IPI that RPS sets to redirect packets to another CPU. However, this feature works extremely well with hardware flow steering (nfc/ntuple) and accelerated RFS, where the device will enqueue directly on an rxqueue owned exclusively by the destination cpu (if configured correctly). -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] [BUG FIXES - 3.10.27] sit: More backports
From d101450583c3a472a2a94904cfe13fd4e7d2f519 Mon Sep 17 00:00:00 2001 From: Nicolas Dichtel nicolas.dich...@6wind.com Date: Wed, 29 Jan 2014 16:40:30 +0100 Subject: [PATCH] sit: fix double free of fb_tunnel_dev on exit This problem was fixed upstream by commit 9434266f2c64 (sit: fix use after free of fb_tunnel_dev). The upstream patch depends on upstream commit 5e6700b3bf98 (sit: add support of x-netns), which was not backported into 3.10 branch. First, explain the problem: when the sit module is unloaded, sit_cleanup() is called. rmmod sit = sit_cleanup() = rtnl_link_unregister() = __rtnl_kill_links() = for_each_netdev(net, dev) { if (dev-rtnl_link_ops == ops) ops-dellink(dev, list_kill); } At this point, the FB device is deleted (and all sit tunnels). = unregister_pernet_device() = unregister_pernet_operations() = ops_exit_list() = sit_exit_net() = sit_destroy_tunnels() In this function, no tunnel is found. = unregister_netdevice_queue(sitn-fb_tunnel_dev, list); We delete the FB device a second time here! Because we cannot simply remove the second deletion (sit_exit_net() must remove the FB device when a netns is deleted), we add an rtnl ops which delete all sit device excepting the FB device and thus we can keep the explicit deletion in sit_exit_net(). CC: Steven Rostedt rost...@goodmis.org Signed-off-by: Nicolas Dichtel nicolas.dich...@6wind.com --- net/ipv6/sit.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c index 0491264b8bfc..620d326e8fdd 100644 --- a/net/ipv6/sit.c +++ b/net/ipv6/sit.c @@ -1507,6 +1507,15 @@ static const struct nla_policy ipip6_policy[IFLA_IPTUN_MAX + 1] = { #endif }; +static void ipip6_dellink(struct net_device *dev, struct list_head *head) +{ + struct net *net = dev_net(dev); + struct sit_net *sitn = net_generic(net, sit_net_id); + + if (dev != sitn-fb_tunnel_dev) + unregister_netdevice_queue(dev, head); +} + static struct rtnl_link_ops sit_link_ops __read_mostly = { .kind = sit, .maxtype= IFLA_IPTUN_MAX, @@ -1517,6 +1526,7 @@ static struct rtnl_link_ops sit_link_ops __read_mostly = { .changelink = ipip6_changelink, .get_size = ipip6_get_size, .fill_info = ipip6_fill_info, + .dellink= ipip6_dellink, }; static struct xfrm_tunnel sit_handler __read_mostly = { -- 1.8.4.1 This looks good to me. It is the same as the backport sit: fix use after free of fb_tunnel_dev (9434266f2c64), minus the small code cleanup at the end of that patch that is not relevant to 3.10.27 (do not define sit_net *sitn in sit_exit_net if it is not used there. But in 3.10.27 it is used, in unregister_netdevice_queue). -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net-next] net: filter: rename 'struct sk_filter' to 'struct bpf_prog'
alternative fix for xt_bpf.h could be to replace: /* only used in the kernel */ struct sk_filter *filter __attribute__((aligned(8))); with /* only used in the kernel */ void *filter __attribute__((aligned(8))); but this 'void *' approach may further break broken userspace, whereas the fix implemented here is more seamless. Yep, that's not good, 'struct sk_filter' should never have been in a uapi file actually. This follows a convention in include/uapi/linux/netfilter/*.h that likely predates the introduction of uapi. A search for Used internally by the kernel shows many more examples. I should not have included filter.h, however. The common behavior when using pointers to kernel-internal structures is to have a forward declaration. I suggest making that change, instead of changing to void *. This avoids having to add casts where xt_bpf_info is used in net/netfilter/xt_bpf.c: -#include linux/filter.h #include linux/types.h #define XT_BPF_MAX_NUM_INSTR 64 +struct sk_filter; + struct xt_bpf_info { I can send this as a separate patch to net-next, if that helps. You can just send me a patch to change it to void. It's an internal kernel pointer as the comment states. There is **no** way that userspace can lurk with that from iptables at all. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net-next] net: filter: rename 'struct sk_filter' to 'struct bpf_prog'
This follows a convention in include/uapi/linux/netfilter/*.h that likely predates the introduction of uapi. A search for Used internally by the kernel shows many more examples. I should not have included filter.h, however. The common behavior when using pointers to kernel-internal structures is to have a forward declaration. I suggest making that change, instead of changing to void *. This avoids having to add casts where xt_bpf_info is used in net/netfilter/xt_bpf.c: that will not avoid typecast. Either 'void *' approach or extra 'struct sk_filter;' approach, both need type casts to 'struct bpf_prog' in xt_bpf.c (because of SK_RUN_FILTER macro) Therefore I prefer extra 'struct sk_filter;' approach. I hadn't noticed that your patch makes the same change that I proposed. Nothing in userspace should touch that pointer, so it is fine to change its type to struct bpf_prog* at the same time. No need for typecasts. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net-next] net: filter: rename 'struct sk_filter' to 'struct bpf_prog'
On Fri, Jul 25, 2014 at 2:43 PM, Alexei Starovoitov a...@plumgrid.com wrote: On Fri, Jul 25, 2014 at 11:32 AM, Willem de Bruijn will...@google.com wrote: This follows a convention in include/uapi/linux/netfilter/*.h that likely predates the introduction of uapi. A search for Used internally by the kernel shows many more examples. I should not have included filter.h, however. The common behavior when using pointers to kernel-internal structures is to have a forward declaration. I suggest making that change, instead of changing to void *. This avoids having to add casts where xt_bpf_info is used in net/netfilter/xt_bpf.c: that will not avoid typecast. Either 'void *' approach or extra 'struct sk_filter;' approach, both need type casts to 'struct bpf_prog' in xt_bpf.c (because of SK_RUN_FILTER macro) Therefore I prefer extra 'struct sk_filter;' approach. I hadn't noticed that your patch makes the same change that I proposed. Nothing in userspace should touch that pointer, so it is fine to change its type to struct bpf_prog* at the same time. No need for typecasts. really? I don't think it's a good idea to expose kernel struct type to user space. How is it even going to compile? a forward declaration. #include linux/filter.h brings different files in kernel and in user space. struct bpf_prog is undefined in user space and compiler will complain. Adding 'struct bpf_prog;' will be ugly. imo the lesser evil is adding 'struct sk_filter;' and doing type casts in kernel. but the exact same argument applies to sk_filter. If that struct is renamed everywhere else, then the result will only be more confusing. A forward declaration is the standard workaround to all such cases in include/uapi/linux/netfilter. See for instance xt_connlimit.h. This is sufficient to allow userspace build to succeed, without exposing any kernel structure detail. If you don't even want to leak the name, then let's make it void *. Keeping a declaration for sk_filter, while sk_filter is renamed everywhere else is the least good option, in my opinion. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [LKP] [net] 4ed2d765dfa: ltp.recv01.4.TFAIL
On Wed, Oct 1, 2014 at 11:29 AM, Fengguang Wu fengguang...@intel.com wrote: Hi Willem, FYI, we noticed the below LTP failures on commit Thanks for the report, Fengguang. The failures are recv01 4 TFAIL : recv01.c:142: invalid flags set ; returned -1 (expected -1), errno 11 (expected 22) recvfrom014 TFAIL : recvfrom01.c:164: invalid socket length ; returned -1 (expected -1), errno 11 (expected 22) recvfrom016 TFAIL : recvfrom01.c:164: invalid flags set ; returned -1 (expected -1), errno 11 (expected 22) recvmsg019 TFAIL : recvmsg01.c:228: invalid flags set ; returned -1 (expected -1), errno 11 (expected 22) In other words, these functions return EAGAIN now, when they used to return EINVAL. This will happen from this patch onwards if flags includes MSG_ERRQUEUE. The patch introduced error queue support for TCP sockets. Like other recvmsg implementations, the error queue code will act on some flags and ignore others. Previously, the code happened to check MSG_OOB first and return EINVAL. The error queue code ignores this flag and returns EAGAIN if no data is waiting. This was recently discussed in LTP recv/recvmsg tests failing on 3.17 http://comments.gmane.org/gmane.linux.network/331750 where the consensus was that applications should not expect particular error codes when passing unsupported combinations of flags. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [LKP] [net] 4ed2d765dfa:
On Mon, Nov 24, 2014 at 7:38 AM, Yuanhan Liu yuanhan@linux.intel.com wrote: FYI, we noticed the below changes on commit 4ed2d765dfaccff5ebdac68e2064b59125033a3b (net-timestamp: TCP timestamping) e7fd2885385157d4 4ed2d765dfaccff5ebdac68e20 -- :10 100% 5:5 ltp.recv01.fail :10 100% 5:5 ltp.recvfrom01.fail :10 100% 5:5 ltp.recvmsg01.fail This has been discussed before: LTP recv/recvmsg tests failing on 3.17 http://comments.gmane.org/gmane.linux.network/331750 and https://lkml.org/lkml/2014/10/1/379 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] af_packet: don't pass empty blocks for PACKET_V3
On Thu, Feb 5, 2015 at 10:54 PM, Alexander Drozdov al.droz...@gmail.com wrote: On 05.02.2015 23:01:38 +0300 Willem de Bruijn wrote: On Wed, Feb 4, 2015 at 9:58 PM, Alexander Drozdov al.droz...@gmail.com wrote: Don't close an empty block on timeout. Its meaningless to pass it to the user. Moreover, passing empty blocks wastes CPU buffer space increasing probability of packets dropping on small timeouts. Side effect of this patch is indefinite user-space wait in poll on idle links. But, I believe its better to set timeout for poll(2) when needed than to get empty blocks every millisecond when not needed. This change would break existing applications that have come to depend on the periodic signal. I don't disagree with the argument that the data ready signal should be sent only when a block is full or a timer expires and at least some data is waiting, but that is moot at this point. I missed something. As pointed by Guy Harris g...@alum.mit.edu, before the previous patch periodic signal was not delivered. The previous patch (da413eec729dae5dc by Dan Collins d...@dcollins.co.nz) is for 3.19 kernel only. Should we care about existing 3.19-only applications? It does sound reasonable to expect processes to handle infinite sleep on no data if that is the historical behavior of the interface. Signed-off-by: Alexander Drozdov al.droz...@gmail.com --- net/packet/af_packet.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 9cfe2e1..9a2f70a 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -698,6 +698,10 @@ static void prb_retire_rx_blk_timer_expired(unsigned long data) if (pkc-last_kactive_blk_num == pkc-kactive_blk_num) { if (!frozen) { + if (!BLOCK_NUM_PKTS(pbd)) { + /* An empty block. Just refresh the timer. */ + goto refresh_timer; + } prb_retire_current_block(pkc, po, TP_STATUS_BLK_TMO); if (!prb_dispatch_next_block(pkc, po)) goto refresh_timer; @@ -798,7 +802,11 @@ static void prb_close_block(struct tpacket_kbdq_core *pkc1, h1-ts_last_pkt.ts_sec = last_pkt-tp_sec; h1-ts_last_pkt.ts_nsec = last_pkt-tp_nsec; } else { - /* Ok, we tmo'd - so get the current time */ + /* Ok, we tmo'd - so get the current time. +* +* It shouldn't really happen as we don't close empty +* blocks. See prb_retire_rx_blk_timer_expired(). +*/ struct timespec ts; getnstimeofday(ts); h1-ts_last_pkt.ts_sec = ts.tv_sec; -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net] packet: bail out of packet_snd() if L2 header creation fails
On Sun, Jan 11, 2015 at 1:35 PM, Eric Dumazet eric.duma...@gmail.com wrote: On Sun, 2015-01-11 at 13:01 -0500, Christoph Jaeger wrote: Due to a misplaced parenthesis, the expression (unlikely(offset) 0), which expands to (__builtin_expect(!!(offset), 0) 0), never evaluates to true. Therefore, when sending packets with PF_PACKET/SOCK_DGRAM, packet_snd() does not abort as intended if the creation of the layer 2 header fails. Spotted by Coverity - CID 1259975 (Operands don't affect result). Fixes: 9c7077622dd9 (packet: make packet_snd fail on len smaller than l2 header) Signed-off-by: Christoph Jaeger c...@linux.com --- Nice catch ! Acked-by: Eric Dumazet eduma...@google.com Indeed. I'm responsible for that typo. Thanks a lot for catching it! Acked-by: Willem de Bruijn will...@google.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] af_packet: don't pass empty blocks for PACKET_V3
On Wed, Feb 4, 2015 at 9:58 PM, Alexander Drozdov al.droz...@gmail.com wrote: Don't close an empty block on timeout. Its meaningless to pass it to the user. Moreover, passing empty blocks wastes CPU buffer space increasing probability of packets dropping on small timeouts. Side effect of this patch is indefinite user-space wait in poll on idle links. But, I believe its better to set timeout for poll(2) when needed than to get empty blocks every millisecond when not needed. This change would break existing applications that have come to depend on the periodic signal. I don't disagree with the argument that the data ready signal should be sent only when a block is full or a timer expires and at least some data is waiting, but that is moot at this point. Signed-off-by: Alexander Drozdov al.droz...@gmail.com --- net/packet/af_packet.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 9cfe2e1..9a2f70a 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -698,6 +698,10 @@ static void prb_retire_rx_blk_timer_expired(unsigned long data) if (pkc-last_kactive_blk_num == pkc-kactive_blk_num) { if (!frozen) { + if (!BLOCK_NUM_PKTS(pbd)) { + /* An empty block. Just refresh the timer. */ + goto refresh_timer; + } prb_retire_current_block(pkc, po, TP_STATUS_BLK_TMO); if (!prb_dispatch_next_block(pkc, po)) goto refresh_timer; @@ -798,7 +802,11 @@ static void prb_close_block(struct tpacket_kbdq_core *pkc1, h1-ts_last_pkt.ts_sec = last_pkt-tp_sec; h1-ts_last_pkt.ts_nsec = last_pkt-tp_nsec; } else { - /* Ok, we tmo'd - so get the current time */ + /* Ok, we tmo'd - so get the current time. +* +* It shouldn't really happen as we don't close empty +* blocks. See prb_retire_rx_blk_timer_expired(). +*/ struct timespec ts; getnstimeofday(ts); h1-ts_last_pkt.ts_sec = ts.tv_sec; -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] af_packet: pass checksum validation status to the user
On Fri, Mar 20, 2015 at 2:46 AM, Alexander Drozdov al.droz...@gmail.com wrote: On 19.03.2015 21:50:03 +0300 Willem de Bruijn wrote: On Thu, Mar 19, 2015 at 2:08 PM, Alexander Drozdov al.droz...@gmail.com wrote: On Thu, 19 Mar 2015 11:29:32 -0400, Willem de Bruijn will...@google.com wrote: On Thu, Mar 19, 2015 at 4:01 AM, Alexander Drozdov al.droz...@gmail.com wrote: Introduce TP_STATUS_CSUM_VALID tp_status flag to tell the af_packet user that at least the transport header checksum has been already validated. This changes the interface slightly. Processes should be treating this previously unused bit as reserved and other flags have been added to the bitmap in this manner as well, so this should then be safe here, too. For now, the flag may be set for incoming packets only. Why? I can't figure out how af_packet could get that the outgoing packet's checksum has been validated. Checksumming on output from skbuff.h tells that skb-ip_summed should equal to CHECKSUM_NONE in that case, CHECKSUM_UNNECESSARY is a valid flag on the outgoing path according to that documentation. And, indeed, I also see no checksum scrubbing on forwarding paths in practice (but I may be wrong there, only took a quick glance). but that is not true for me (in my tests, skb-ip_summed == CHECKSUM_UNNECESSARY for forwarded packets in some cases). When you disable hardware checksum offload and generate packets locally, you do see the expected CHECKSUM_NONE value? Yes, I do, but see below. You cannot change the semantics of the flag afterwards. I think the semantics of the flag won't be changed if one set the flag for outgoing packets. If the flag is not set (for any directions) then that is not mean that the packet checksum is invalid. The user just can then checksum the packet by itself. So, the user may check the flag for any packet right now. I see. So it is a hint. Okay. It would be nice if it behaves as expected in as many cases as possible from the outset. This would include PACKET_OUTGOING and CHECKSUM_NONE. I've just done some testing, and I've found that packets generated by 'nping --badsum' (socket(PF_INET, SOCK_RAW, IPPROTO_RAW)) have CHECKSUM_NONE when they are viewed by af_packet. I've used rather old Linux kernel for the tests, but isn't that a reason to not set TP_STATUS_CSUM_VALID for outgoing packets right now? Yes. That's a very good example. Please also note the flag and semantics in Documentation/networking/packet_mmap.txt I'll do it and I'll resend the patches with the note. Thanks. Better to support both directions from the start. Signed-off-by: Alexander Drozdov al.droz...@gmail.com --- include/uapi/linux/if_packet.h | 1 + net/packet/af_packet.c | 9 + 2 files changed, 10 insertions(+) diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h index da2d668..053bd10 100644 --- a/include/uapi/linux/if_packet.h +++ b/include/uapi/linux/if_packet.h @@ -99,6 +99,7 @@ struct tpacket_auxdata { #define TP_STATUS_VLAN_VALID (1 4) /* auxdata has valid tp_vlan_tci */ #define TP_STATUS_BLK_TMO (1 5) #define TP_STATUS_VLAN_TPID_VALID (1 6) /* auxdata has valid tp_vlan_tpid */ +#define TP_STATUS_CSUM_VALID (1 7) /* Tx ring - header status */ #define TP_STATUS_AVAILABLE 0 diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 6ecf8dd..3f09dda 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1918,6 +1918,10 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, if (skb-ip_summed == CHECKSUM_PARTIAL) status |= TP_STATUS_CSUMNOTREADY; + else if (skb-pkt_type != PACKET_OUTGOING +(skb-ip_summed == CHECKSUM_COMPLETE || + skb_csum_unnecessary(skb))) + status |= TP_STATUS_CSUM_VALID; if (snaplen res) snaplen = res; @@ -3015,6 +3019,11 @@ static int packet_recvmsg(struct kiocb *iocb, struct socket *sock, aux.tp_status = TP_STATUS_USER; if (skb-ip_summed == CHECKSUM_PARTIAL) aux.tp_status |= TP_STATUS_CSUMNOTREADY; + else if (skb-pkt_type != PACKET_OUTGOING +(skb-ip_summed == CHECKSUM_COMPLETE || + skb_csum_unnecessary(skb))) + aux.tp_status |= TP_STATUS_CSUM_VALID; + These two sections are near duplicates. I'd move the entire status initialization, including existing TP_STATUS_USER and TP_STATUS_CSUMNOTREADY fields to a helper function. It's a bit unfortunately that we have to use an extra bit to add a signal that is a near inverse of the existing TP_STATUS_CSUMNOTREADY (bar CHECKSUM_NONE). I do not immediately see a better way, either, though. And tp_status has plenty room at 32 bits
Re: [PATCH 2/2] af_packet: pass checksum validation status to the user
On Thu, Mar 19, 2015 at 2:08 PM, Alexander Drozdov al.droz...@gmail.com wrote: On Thu, 19 Mar 2015 11:29:32 -0400, Willem de Bruijn will...@google.com wrote: On Thu, Mar 19, 2015 at 4:01 AM, Alexander Drozdov al.droz...@gmail.com wrote: Introduce TP_STATUS_CSUM_VALID tp_status flag to tell the af_packet user that at least the transport header checksum has been already validated. This changes the interface slightly. Processes should be treating this previously unused bit as reserved and other flags have been added to the bitmap in this manner as well, so this should then be safe here, too. For now, the flag may be set for incoming packets only. Why? I can't figure out how af_packet could get that the outgoing packet's checksum has been validated. Checksumming on output from skbuff.h tells that skb-ip_summed should equal to CHECKSUM_NONE in that case, CHECKSUM_UNNECESSARY is a valid flag on the outgoing path according to that documentation. And, indeed, I also see no checksum scrubbing on forwarding paths in practice (but I may be wrong there, only took a quick glance). but that is not true for me (in my tests, skb-ip_summed == CHECKSUM_UNNECESSARY for forwarded packets in some cases). When you disable hardware checksum offload and generate packets locally, you do see the expected CHECKSUM_NONE value? You cannot change the semantics of the flag afterwards. I think the semantics of the flag won't be changed if one set the flag for outgoing packets. If the flag is not set (for any directions) then that is not mean that the packet checksum is invalid. The user just can then checksum the packet by itself. So, the user may check the flag for any packet right now. I see. So it is a hint. Okay. It would be nice if it behaves as expected in as many cases as possible from the outset. This would include PACKET_OUTGOING and CHECKSUM_NONE. Please also note the flag and semantics in Documentation/networking/packet_mmap.txt Better to support both directions from the start. Signed-off-by: Alexander Drozdov al.droz...@gmail.com --- include/uapi/linux/if_packet.h | 1 + net/packet/af_packet.c | 9 + 2 files changed, 10 insertions(+) diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h index da2d668..053bd10 100644 --- a/include/uapi/linux/if_packet.h +++ b/include/uapi/linux/if_packet.h @@ -99,6 +99,7 @@ struct tpacket_auxdata { #define TP_STATUS_VLAN_VALID (1 4) /* auxdata has valid tp_vlan_tci */ #define TP_STATUS_BLK_TMO (1 5) #define TP_STATUS_VLAN_TPID_VALID (1 6) /* auxdata has valid tp_vlan_tpid */ +#define TP_STATUS_CSUM_VALID (1 7) /* Tx ring - header status */ #define TP_STATUS_AVAILABLE 0 diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 6ecf8dd..3f09dda 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1918,6 +1918,10 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, if (skb-ip_summed == CHECKSUM_PARTIAL) status |= TP_STATUS_CSUMNOTREADY; + else if (skb-pkt_type != PACKET_OUTGOING +(skb-ip_summed == CHECKSUM_COMPLETE || + skb_csum_unnecessary(skb))) + status |= TP_STATUS_CSUM_VALID; if (snaplen res) snaplen = res; @@ -3015,6 +3019,11 @@ static int packet_recvmsg(struct kiocb *iocb, struct socket *sock, aux.tp_status = TP_STATUS_USER; if (skb-ip_summed == CHECKSUM_PARTIAL) aux.tp_status |= TP_STATUS_CSUMNOTREADY; + else if (skb-pkt_type != PACKET_OUTGOING +(skb-ip_summed == CHECKSUM_COMPLETE || + skb_csum_unnecessary(skb))) + aux.tp_status |= TP_STATUS_CSUM_VALID; + These two sections are near duplicates. I'd move the entire status initialization, including existing TP_STATUS_USER and TP_STATUS_CSUMNOTREADY fields to a helper function. It's a bit unfortunately that we have to use an extra bit to add a signal that is a near inverse of the existing TP_STATUS_CSUMNOTREADY (bar CHECKSUM_NONE). I do not immediately see a better way, either, though. And tp_status has plenty room at 32 bits. aux.tp_len = PACKET_SKB_CB(skb)-origlen; aux.tp_snaplen = skb-len; aux.tp_mac = 0; -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] af_packet: pass checksum validation status to the user
On Thu, Mar 19, 2015 at 4:01 AM, Alexander Drozdov al.droz...@gmail.com wrote: Introduce TP_STATUS_CSUM_VALID tp_status flag to tell the af_packet user that at least the transport header checksum has been already validated. This changes the interface slightly. Processes should be treating this previously unused bit as reserved and other flags have been added to the bitmap in this manner as well, so this should then be safe here, too. For now, the flag may be set for incoming packets only. Why? You cannot change the semantics of the flag afterwards. Better to support both directions from the start. Signed-off-by: Alexander Drozdov al.droz...@gmail.com --- include/uapi/linux/if_packet.h | 1 + net/packet/af_packet.c | 9 + 2 files changed, 10 insertions(+) diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h index da2d668..053bd10 100644 --- a/include/uapi/linux/if_packet.h +++ b/include/uapi/linux/if_packet.h @@ -99,6 +99,7 @@ struct tpacket_auxdata { #define TP_STATUS_VLAN_VALID (1 4) /* auxdata has valid tp_vlan_tci */ #define TP_STATUS_BLK_TMO (1 5) #define TP_STATUS_VLAN_TPID_VALID (1 6) /* auxdata has valid tp_vlan_tpid */ +#define TP_STATUS_CSUM_VALID (1 7) /* Tx ring - header status */ #define TP_STATUS_AVAILABLE 0 diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 6ecf8dd..3f09dda 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -1918,6 +1918,10 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, if (skb-ip_summed == CHECKSUM_PARTIAL) status |= TP_STATUS_CSUMNOTREADY; + else if (skb-pkt_type != PACKET_OUTGOING +(skb-ip_summed == CHECKSUM_COMPLETE || + skb_csum_unnecessary(skb))) + status |= TP_STATUS_CSUM_VALID; if (snaplen res) snaplen = res; @@ -3015,6 +3019,11 @@ static int packet_recvmsg(struct kiocb *iocb, struct socket *sock, aux.tp_status = TP_STATUS_USER; if (skb-ip_summed == CHECKSUM_PARTIAL) aux.tp_status |= TP_STATUS_CSUMNOTREADY; + else if (skb-pkt_type != PACKET_OUTGOING +(skb-ip_summed == CHECKSUM_COMPLETE || + skb_csum_unnecessary(skb))) + aux.tp_status |= TP_STATUS_CSUM_VALID; + These two sections are near duplicates. I'd move the entire status initialization, including existing TP_STATUS_USER and TP_STATUS_CSUMNOTREADY fields to a helper function. It's a bit unfortunately that we have to use an extra bit to add a signal that is a near inverse of the existing TP_STATUS_CSUMNOTREADY (bar CHECKSUM_NONE). I do not immediately see a better way, either, though. And tp_status has plenty room at 32 bits. aux.tp_len = PACKET_SKB_CB(skb)-origlen; aux.tp_snaplen = skb-len; aux.tp_mac = 0; -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build warnings after merge of the net-next tree
On Thu, May 14, 2015 at 3:12 AM, Stephen Rothwell s...@canb.auug.org.au wrote: Hi all, After merging the net-next tree, today's linux-next build (sparc64 defconfig) produced these warnings: In file included from arch/sparc/include/asm/cmpxchg.h:4:0, from arch/sparc/include/asm/atomic_64.h:11, from arch/sparc/include/asm/atomic.h:4, from include/linux/atomic.h:4, from include/linux/spinlock.h:416, from include/linux/mmzone.h:7, from include/linux/gfp.h:5, from include/linux/mm.h:9, from net/packet/af_packet.c:56: net/packet/af_packet.c: In function 'packet_rcv_has_room': arch/sparc/include/asm/cmpxchg_64.h:43:22: warning: value computed is not used [-Wunused-value] #define xchg(ptr,x) ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr ^ net/packet/af_packet.c:1314:3: note: in expansion of macro 'xchg' xchg(po-pressure, !has_room); ^ net/packet/af_packet.c: In function 'packet_poll': arch/sparc/include/asm/cmpxchg_64.h:43:22: warning: value computed is not used [-Wunused-value] #define xchg(ptr,x) ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr ^ net/packet/af_packet.c:3817:3: note: in expansion of macro 'xchg' xchg(po-pressure, 0); ^ Introduced by commit 2ccdbaa6d55b (packet: rollover lock contention avoidance). I just sent http://patchwork.ozlabs.org/patch/472362/ to resolve these. Thanks for the report. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: net: heap-out-of-bounds in sock_setsockopt
> > Hmm, we should exclude the raw socket case, something like the > following, but I am not sure if the check is too strict or not, also > not sure if we should return an error for this raw socket case. No, SOF_TIMESTAMPING_OPT_ID with SOCK_RAW/IPPROTO_TCP is legitimate. It should fall through to initializing sk->sk_tskey to zero. Only stream sockets should use the special case where numbering is bytestream and computed by subtracting the seqno from the seqno at the time that the option is enabled. > > diff --git a/net/core/sock.c b/net/core/sock.c > index 765be83..c26e80a 100644 > --- a/net/core/sock.c > +++ b/net/core/sock.c > @@ -872,7 +872,7 @@ int sock_setsockopt(struct socket *sock, int > level, int optname, > > if (val & SOF_TIMESTAMPING_OPT_ID && > !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) { > - if (sk->sk_protocol == IPPROTO_TCP) { > + if (sk->sk_protocol == IPPROTO_TCP && > sk->sk_type == SOCK_STREAM) { > if (sk->sk_state != TCP_ESTABLISHED) { > ret = -EINVAL; > break; I made the same error when later returning the tskey in __skb_tstamp_tx: if (sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID) { serr->ee.ee_data = skb_shinfo(skb)->tskey; if (sk->sk_protocol == IPPROTO_TCP) serr->ee.ee_data -= sk->sk_tskey; } This has no effect if sk->sk_tskey is initialized to 0 with your patch. Still, it should not treat SOCK_RAW as a TCP sock here, either. Please add this if you're about to send the patch. Or I can send it separately, if you prefer. Thanks for the quick fix. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Boot failure when using NFS on OMAP based evms
Currently linux-next is failing to boot via NFS on my AM335x GP evm, AM437x GP evm and Beagle X15. I bisected the problem down to the commit "udp: remove headers from UDP packets before queueing". >>> >>> Thanks for the report, and apologies for breaking your configuration. >>> I had missed that sunrpc can dequeue skbs from a udp receive >>> queue and makes assumptions about the layout of those packets. rxrpc >>> does the same. From what I can tell so far, those are the only two >>> protocols that do this. I have verified that the following fixes rxrpc for >>> me >>> >> >> Thank you for your quick response. I verified with all of the above >> suggested changes that NFS works again on my 3 evms. > > Thanks a lot for testing, Franklin. I will send out the two patches. Patches sent to netdev. I'll do another scan to verify that there are no additional protocols that dequeue skbs from udp receive queues and expect udp headers present.
Re: Boot failure when using NFS on OMAP based evms
On Thu, Apr 7, 2016 at 10:36 AM, Franklin S Cooper Jr. <fcoo...@ti.com> wrote: > > > On 04/06/2016 09:22 PM, Willem de Bruijn wrote: >> On Wed, Apr 6, 2016 at 7:12 PM, Franklin S Cooper Jr. <fcoo...@ti.com> wrote: >>> Hi All, >>> >>> Currently linux-next is failing to boot via NFS on my AM335x GP evm, >>> AM437x GP evm and Beagle X15. I bisected the problem down to the commit >>> "udp: remove headers from UDP packets before queueing". >>> >>> I had to revert the following three commits to get things working again: >>> >>> e6afc8ace6dd5cef5e812f26c72579da8806f5ac >>> udp: remove headers from UDP packets before queueing >>> >>> 627d2d6b550094d88f9e518e15967e7bf906ebbf >>> udp: enable MSG_PEEK at non-zero offset >>> >>> b9bb53f3836f4eb2bdeb3447be11042bd29c2408 >>> sock: convert sk_peek_offset functions to WRITE_ONCE >>> >> >> Thanks for the report, and apologies for breaking your configuration. >> I had missed that sunrpc can dequeue skbs from a udp receive >> queue and makes assumptions about the layout of those packets. rxrpc >> does the same. From what I can tell so far, those are the only two >> protocols that do this. I have verified that the following fixes rxrpc for me >> >> --- a/net/rxrpc/ar-input.c >> +++ b/net/rxrpc/ar-input.c >> @@ -612,9 +612,9 @@ int rxrpc_extract_header(struct rxrpc_skb_priv >> *sp, struct sk_buff *skb) >> struct rxrpc_wire_header whdr; >> >> /* dig out the RxRPC connection details */ >> - if (skb_copy_bits(skb, sizeof(struct udphdr), , sizeof(whdr)) < >> 0) >> + if (skb_copy_bits(skb, 0, , sizeof(whdr)) < 0) >> return -EBADMSG; >> - if (!pskb_pull(skb, sizeof(struct udphdr) + sizeof(whdr))) >> + if (!pskb_pull(skb, sizeof(whdr))) >> BUG(); >> >> I have not yet been able to reproduce the sunrpc/nfs issue, but I >> suspect that the following might fix it. I will try to create an NFS >> setup. >> >> diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c >> index 2df87f7..8ab40ba 100644 >> --- a/net/sunrpc/socklib.c >> +++ b/net/sunrpc/socklib.c >> @@ -155,7 +155,7 @@ int csum_partial_copy_to_xdr(struct xdr_buf *xdr, >> struct sk_buff *skb) >> struct xdr_skb_reader desc; >> >> desc.skb = skb; >> - desc.offset = sizeof(struct udphdr); >> + desc.offset = 0; >> desc.count = skb->len - desc.offset; >> >> if (skb_csum_unnecessary(skb)) >> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c >> index 1413cdc..71d6072 100644 >> --- a/net/sunrpc/svcsock.c >> +++ b/net/sunrpc/svcsock.c >> @@ -617,7 +617,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp) >> svsk->sk_sk->sk_stamp = skb->tstamp; >> set_bit(XPT_DATA, >sk_xprt.xpt_flags); /* there may be >> more data... */ >> >> - len = skb->len - sizeof(struct udphdr); >> + len = skb->len; >> rqstp->rq_arg.len = len; >> >> rqstp->rq_prot = IPPROTO_UDP; >> @@ -641,8 +641,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp) >> skb_free_datagram_locked(svsk->sk_sk, skb); >> } else { >> /* we can use it in-place */ >> - rqstp->rq_arg.head[0].iov_base = skb->data + >> - sizeof(struct udphdr); >> + rqstp->rq_arg.head[0].iov_base = skb->data; >> rqstp->rq_arg.head[0].iov_len = len; >> if (skb_checksum_complete(skb)) >> goto out_free; >> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c >> index 65e7595..c1fc7b2 100644 >> --- a/net/sunrpc/xprtsock.c >> +++ b/net/sunrpc/xprtsock.c >> @@ -995,15 +995,14 @@ static void xs_udp_data_read_skb(struct rpc_xprt *xprt, >> u32 _xid; >> __be32 *xp; >> >> - repsize = skb->len - sizeof(struct udphdr); >> + repsize = skb->len; >> if (repsize < 4) { >> dprintk("RPC: impossible RPC reply size %d!\n", >> repsize); >> return; >> } >> >> >> >> /* Copy the XID from the skb... */ >> - xp = skb_header_pointer(skb, sizeof(struct udphdr), >> - sizeof(_xid), &_xid); >> + xp = skb_header_pointer(skb, 0, sizeof(_xid), &_xid); >> if (xp == NULL) >> return; >> > > > Thank you for your quick response. I verified with all of the above > suggested changes that NFS works again on my 3 evms. Thanks a lot for testing, Franklin. I will send out the two patches.
Re: Boot failure when using NFS on OMAP based evms
On Wed, Apr 6, 2016 at 7:12 PM, Franklin S Cooper Jr.wrote: > Hi All, > > Currently linux-next is failing to boot via NFS on my AM335x GP evm, > AM437x GP evm and Beagle X15. I bisected the problem down to the commit > "udp: remove headers from UDP packets before queueing". > > I had to revert the following three commits to get things working again: > > e6afc8ace6dd5cef5e812f26c72579da8806f5ac > udp: remove headers from UDP packets before queueing > > 627d2d6b550094d88f9e518e15967e7bf906ebbf > udp: enable MSG_PEEK at non-zero offset > > b9bb53f3836f4eb2bdeb3447be11042bd29c2408 > sock: convert sk_peek_offset functions to WRITE_ONCE > Thanks for the report, and apologies for breaking your configuration. I had missed that sunrpc can dequeue skbs from a udp receive queue and makes assumptions about the layout of those packets. rxrpc does the same. From what I can tell so far, those are the only two protocols that do this. I have verified that the following fixes rxrpc for me --- a/net/rxrpc/ar-input.c +++ b/net/rxrpc/ar-input.c @@ -612,9 +612,9 @@ int rxrpc_extract_header(struct rxrpc_skb_priv *sp, struct sk_buff *skb) struct rxrpc_wire_header whdr; /* dig out the RxRPC connection details */ - if (skb_copy_bits(skb, sizeof(struct udphdr), , sizeof(whdr)) < 0) + if (skb_copy_bits(skb, 0, , sizeof(whdr)) < 0) return -EBADMSG; - if (!pskb_pull(skb, sizeof(struct udphdr) + sizeof(whdr))) + if (!pskb_pull(skb, sizeof(whdr))) BUG(); I have not yet been able to reproduce the sunrpc/nfs issue, but I suspect that the following might fix it. I will try to create an NFS setup. diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c index 2df87f7..8ab40ba 100644 --- a/net/sunrpc/socklib.c +++ b/net/sunrpc/socklib.c @@ -155,7 +155,7 @@ int csum_partial_copy_to_xdr(struct xdr_buf *xdr, struct sk_buff *skb) struct xdr_skb_reader desc; desc.skb = skb; - desc.offset = sizeof(struct udphdr); + desc.offset = 0; desc.count = skb->len - desc.offset; if (skb_csum_unnecessary(skb)) diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c index 1413cdc..71d6072 100644 --- a/net/sunrpc/svcsock.c +++ b/net/sunrpc/svcsock.c @@ -617,7 +617,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp) svsk->sk_sk->sk_stamp = skb->tstamp; set_bit(XPT_DATA, >sk_xprt.xpt_flags); /* there may be more data... */ - len = skb->len - sizeof(struct udphdr); + len = skb->len; rqstp->rq_arg.len = len; rqstp->rq_prot = IPPROTO_UDP; @@ -641,8 +641,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp) skb_free_datagram_locked(svsk->sk_sk, skb); } else { /* we can use it in-place */ - rqstp->rq_arg.head[0].iov_base = skb->data + - sizeof(struct udphdr); + rqstp->rq_arg.head[0].iov_base = skb->data; rqstp->rq_arg.head[0].iov_len = len; if (skb_checksum_complete(skb)) goto out_free; diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c index 65e7595..c1fc7b2 100644 --- a/net/sunrpc/xprtsock.c +++ b/net/sunrpc/xprtsock.c @@ -995,15 +995,14 @@ static void xs_udp_data_read_skb(struct rpc_xprt *xprt, u32 _xid; __be32 *xp; - repsize = skb->len - sizeof(struct udphdr); + repsize = skb->len; if (repsize < 4) { dprintk("RPC: impossible RPC reply size %d!\n", repsize); return; } /* Copy the XID from the skb... */ - xp = skb_header_pointer(skb, sizeof(struct udphdr), - sizeof(_xid), &_xid); + xp = skb_header_pointer(skb, 0, sizeof(_xid), &_xid); if (xp == NULL) return;
Re: Tool for sampling /proc/net/softnet_stat statistics
On Mon, Mar 7, 2016 at 10:36 AM, Jesper Dangaard Brouerwrote: > Hi Google, > > While playing with RPS, I needed to read stats from > /proc/net/softnet_stat and the tools I could find [1] and [2] was not > very good. > > I lack of better, I coded up my own tool softnet_stat.pl here: > > https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl > > The output format/columns in /proc/net/softnet_stat is undocumented, > plus values are printed in hex. E.g. to decode the columns you need to > read kernel function kernel softnet_seq_show() in > kernel/net/core/net-procfs.c. > > To make things easier I wrote this small perl script for get > so human readable statistics from /proc/net/softnet_stat. Very nice. Thanks for sharing, Jesper. I maintained something similar, but never got around to clean up and upstream it. Will start using yours, instead. A few points, from using my earlier tool: A minimum cut-off value is helpful, especially on beefy servers, to suppress the many 0 rows. Preferably configurable, to also be able to suppress low-rate background traffic when analyzing a few large streams. My default was 500. The number of columns has grown with kernel versions. The latest column is flow_limit, added in 3.11 at99bbc7074190. It is helpful for the script to be robust against both older and future kernels. On which note, to be able to support these kinds of tools, any new columns to such procfs files should be appended, not inserted as for instance in https://patchwork.ozlabs.org/patch/574171/ Time squeeze is an exception, in that it is number of squeeze events per second, not number of packets squeezed. This is often misunderstood if not explained clearly.
Re: [patch -next] udp: fix if statement in SIOCINQ ioctl
On Mon, Apr 18, 2016 at 8:19 AM, Eric Dumazet <eric.duma...@gmail.com> wrote: > On Mon, 2016-04-18 at 11:44 +0300, Dan Carpenter wrote: >> We deleted a line of code and accidentally made the "return put_user()" >> part of the if statement when it's supposed to be unconditional. >> >> Fixes: 9f9a45beaa96 ('udp: do not expect udp headers on ioctl SIOCINQ') >> Signed-off-by: Dan Carpenter <dan.carpen...@oracle.com> > > Acked-by: Eric Dumazet <eduma...@google.com> Acked-by: Willem de Bruijn <will...@google.com> Thanks for catching this.
[PATCH tip/timers/core] uapi glibc compat: make linux/time.h compile after libc time.h files
From: Willem de Bruijn <will...@google.com> Add libc-compat workaround for definitions in linux/time.h that duplicate those in libc time.h, sys/time.h and bits/time.h. With this change, userspace builds succeeds when linux/time.h is included after those libc files. The inverse requires changes to those userspace headers. Without this patch, when compiling the following program after make headers_install: echo -e "#include \n#include " | \ gcc -Wall -Werror -Iusr/include -c -xc - gcc gives these errors: #include #include In file included from ../test_time.c:3:0: /usr/include/time.h:120:8: error: redefinition of ‘struct timespec’ struct timespec ^ In file included from ../test_time.c:2:0: ./usr/include/linux/time.h:9:8: note: originally defined here struct timespec { ^ In file included from ../test_time.c:3:0: /usr/include/time.h:161:8: error: redefinition of ‘struct itimerspec’ struct itimerspec ^ In file included from ../test_time.c:2:0: ./usr/include/linux/time.h:34:8: note: originally defined here struct itimerspec { and this warning by indirect inclusion of bits/time.h: In file included from ../test_time.c:4:0: ./usr/include/linux/time.h:67:0: error: "TIMER_ABSTIME" redefined [-Werror] #define TIMER_ABSTIME 0x01 ^ In file included from /usr/include/time.h:41:0, from ../test_time.c:3: /usr/include/x86_64-linux-gnu/bits/time.h:82:0: note: this is the location of the previous definition # define TIMER_ABSTIME 1 ^ Ran the same program for sys/time.h and bits/time.h. The _SYS_TIME_H test resolves similar errors for timeval, timezone, itimerval and warnings for ITIMER_REAL, ITIMER_VIRTUAL, ITIMER_PROF. Signed-off-by: Willem de Bruijn <will...@google.com> --- include/uapi/linux/libc-compat.h | 50 include/uapi/linux/time.h| 15 2 files changed, 65 insertions(+) diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h index e4f048e..9c3cbf5 100644 --- a/include/uapi/linux/libc-compat.h +++ b/include/uapi/linux/libc-compat.h @@ -146,6 +146,43 @@ #define __UAPI_DEF_XATTR 1 #endif +/* Definitions for time.h */ +#if defined(__timespec_defined) +#define __UAPI_DEF_TIMESPEC0 +#else +#define __UAPI_DEF_TIMESPEC1 +#endif + +#if defined(_TIME_H) && defined(__USE_POSIX199309) +#define __UAPI_DEF_ITIMERSPEC 0 +#else +#define __UAPI_DEF_ITIMERSPEC 1 +#endif + +/* Definitions for sys/time.h */ +#if defined(_SYS_TIME_H) +#define __UAPI_DEF_TIMEVAL 0 +#define __UAPI_DEF_ITIMERVAL 0 +#define __UAPI_DEF_ITIMER_WHICH0 +#else +#define __UAPI_DEF_TIMEVAL 1 +#define __UAPI_DEF_ITIMERVAL 1 +#define __UAPI_DEF_ITIMER_WHICH1 +#endif + +/* Definitions for bits/time.h */ +#if defined(_BITS_TIME_H) +#define __UAPI_DEF_ABSTIME 0 +#else +#define __UAPI_DEF_ABSTIME 1 +#endif + +#if defined(_SYS_TIME_H) && defined(__USE_BSD) +#define __UAPI_DEF_TIMEZONE0 +#else +#define __UAPI_DEF_TIMEZONE1 +#endif + /* If we did not see any headers from any supported C libraries, * or we are being included in the kernel, then define everything * that we need. */ @@ -182,6 +219,19 @@ /* Definitions for xattr.h */ #define __UAPI_DEF_XATTR 1 +/* Definitions for time.h */ +#define __UAPI_DEF_TIMESPEC1 +#define __UAPI_DEF_ITIMERSPEC 1 + +/* Definitions for sys/time.h */ +#define __UAPI_DEF_TIMEVAL 1 +#define __UAPI_DEF_ITIMERVAL 1 +#define __UAPI_DEF_ITIMER_WHICH1 +#define __UAPI_DEF_TIMEZONE1 + +/* Definitions for bits/time.h */ +#define __UAPI_DEF_ABSTIME 1 + #endif /* __GLIBC__ */ #endif /* _UAPI_LIBC_COMPAT_H */ diff --git a/include/uapi/linux/time.h b/include/uapi/linux/time.h index e75e1b6..4e7333c 100644 --- a/include/uapi/linux/time.h +++ b/include/uapi/linux/time.h @@ -1,9 +1,11 @@ #ifndef _UAPI_LINUX_TIME_H #define _UAPI_LINUX_TIME_H +#include #include +#if __UAPI_DEF_TIMESPEC #ifndef _STRUCT_TIMESPEC #define _STRUCT_TIMESPEC struct timespec { @@ -11,35 +13,46 @@ struct timespec { longtv_nsec;/* nanoseconds */ }; #endif +#endif +#if __UAPI_DEF_TIMEVAL struct timeval { __kernel_time_t tv_sec; /* seconds */ __kernel_suseconds_ttv_usec;/* microseconds */ }; +#endif +#if __UAPI_DEF_TIMEZONE struct timezone { int tz_minuteswest; /* minutes west of Greenwich */ int tz_dsttime; /* type of dst correction */ }; +#endif /* * Names of the interval timers, and structure * defining a timer setting: */ +#if __UAPI_DEF_ITIMER_WHICH #
Re: [PATCH net] udp: prevent bugcheck if filter truncates packet too much
On Sat, Jul 9, 2016 at 6:43 AM, Michal Kubecek <mkube...@suse.cz> wrote: > On Sat, Jul 09, 2016 at 11:48:49AM +0200, Daniel Borkmann wrote: >> On 07/09/2016 02:20 AM, Alexei Starovoitov wrote: >> >On Sat, Jul 09, 2016 at 01:31:40AM +0200, Eric Dumazet wrote: >> >>On Fri, 2016-07-08 at 17:52 +0200, Michal Kubecek wrote: >> >>>If socket filter truncates an udp packet below the length of UDP header >> >>>in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a >> >>>BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if >> >>>kernel is configured that way) can be easily enforced by an unprivileged >> >>>user which was reported as CVE-2016-6162. For a reproducer, see >> >>>http://seclists.org/oss-sec/2016/q3/8 >> >>> >> >>>Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before >> >>>queueing") >> >>>Reported-by: Marco Grassi <marco@gmail.com> >> >>>Signed-off-by: Michal Kubecek <mkube...@suse.cz> >> >>>--- > >> >>Acked-by: Eric Dumazet <eduma...@google.com> >> > >> >this is incomplete fix. Please do not apply. See discussion at >> >security@kernel >> >> Ohh well, didn't see it earlier before starting the discussion at >> security@... >> >> I'm okay if we take this for now as a quick band aid and find a better >> way how to deal with the underlying issue long-term so that it's >> /guaranteed/ that it doesn't bite us any further in such fragile ways. > > Agreed. As rc7 is due in a day or two, rushing a complex and intrusive > solution in might be too risky. Acked-by: Willem de Bruijn <will...@google.com> Thanks, Michal.
Re: Missing include file in include/uapi/linux/errqueue.h?
On Sun, Jul 10, 2016 at 1:43 AM, Brooks Moseswrote: > On Sat, Jul 9, 2016 at 10:36 AM, Brooks Moses wrote: >> I've been attempting to qualify the Linux 4.5.2 user-space headers for >> a toolchain release, and ran into what looks like a missing include >> file in include/uapi/linux/errqueue.h. In particular, >> https://github.com/torvalds/linux/commit/f24b9be5957b38bb420b838115040dc2031b7d0c >> adds the following to this file: >> >> +struct scm_timestamping { >> + struct timespec ts[3]; >> +}; >> >> However, struct timespec is defined in time.h, which isn't included >> either in 4.5.2 or in current head. Is this simply a missing #include >> line, It is. I missed that in my original patch. > or am I misunderstanding something? > > As a followup: Unfortunately the obvious fix -- adding "#include > " -- causes other problems, since linux/time.h is > incompatible with the glibc time.h such that including both of them > into the same compilation unit causes errors about redefined types. If these conflicts between libc and uapi time.h can be resolved through include/uapi/linux/libc-compat.h, then we can apply the obvious fix of including linux/time.h in linux/errqueue.h. > And we, at least, have some programs that want to include > linux/errqueue.h and (glibc's) time.h. The fix of adding "#include > " to linux/errqueue.h seems to work for us, but I'm not sure > that won't cause problems in the other direction for other people. That breaks kernel compilation.
Re: Missing include file in include/uapi/linux/errqueue.h?
On Sun, Jul 10, 2016 at 9:56 AM, Willem de Bruijn <will...@google.com> wrote: > On Sun, Jul 10, 2016 at 1:43 AM, Brooks Moses <bmo...@google.com> wrote: >> On Sat, Jul 9, 2016 at 10:36 AM, Brooks Moses <bmo...@google.com> wrote: >>> I've been attempting to qualify the Linux 4.5.2 user-space headers for >>> a toolchain release, and ran into what looks like a missing include >>> file in include/uapi/linux/errqueue.h. In particular, >>> https://github.com/torvalds/linux/commit/f24b9be5957b38bb420b838115040dc2031b7d0c >>> adds the following to this file: >>> >>> +struct scm_timestamping { >>> + struct timespec ts[3]; >>> +}; >>> >>> However, struct timespec is defined in time.h, which isn't included >>> either in 4.5.2 or in current head. Is this simply a missing #include >>> line, > > It is. I missed that in my original patch. > >> or am I misunderstanding something? >> >> As a followup: Unfortunately the obvious fix -- adding "#include >> " -- causes other problems, since linux/time.h is >> incompatible with the glibc time.h such that including both of them >> into the same compilation unit causes errors about redefined types. > > If these conflicts between libc and uapi time.h can be resolved through > include/uapi/linux/libc-compat.h, then we can apply the obvious fix of > including linux/time.h in linux/errqueue.h. The below patch (bar whitespace mangling by gmail) avoids redefinition conflicts for me if linux/time.h is included after time.h. The other direction would also require changes to the userspace headers, as stated in libc-compat.h. >> And we, at least, have some programs that want to include >> linux/errqueue.h and (glibc's) time.h. The fix of adding "#include >> " to linux/errqueue.h seems to work for us, but I'm not sure >> that won't cause problems in the other direction for other people. > > That breaks kernel compilation. A simpler other option is to explicitly include either of the time.h files depending on the environment, similar to uapi/linux/lightnvm.h. timespec itself is also explicitly defined in uapi/linux/coda.h, but that won't avoid redefinition if time.h is included afterwards. The rough libc-compat.h patch: diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h index 07bdce1..6b1cdc6 100644 --- a/include/uapi/linux/errqueue.h +++ b/include/uapi/linux/errqueue.h @@ -2,6 +2,7 @@ #define _UAPI_LINUX_ERRQUEUE_H #include +#include struct sock_extended_err { __u32 ee_errno; diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h index e4f048e..ef33ea9 100644 --- a/include/uapi/linux/libc-compat.h +++ b/include/uapi/linux/libc-compat.h @@ -146,6 +146,16 @@ #define __UAPI_DEF_XATTR 1 #endif +#if defined(_TIME_H) +#define __UAPI_DEF_TIMESPEC0 +#define __UAPI_DEF_ITIMERSPEC 0 +#define __UAPI_DEF_ABSTIME 0 +#else +#define __UAPI_DEF_TIMESPEC1 +#define __UAPI_DEF_ITIMERSPEC 1 +#define __UAPI_DEF_ABSTIME 1 +#endif /* _TIME_H */ + /* If we did not see any headers from any supported C libraries, * or we are being included in the kernel, then define everything * that we need. */ @@ -182,6 +192,11 @@ /* Definitions for xattr.h */ #define __UAPI_DEF_XATTR 1 +/* Definitions for time.h */ +#define __UAPI_DEF_TIMESPEC1 +#define __UAPI_DEF_ITIMERSPEC 1 +#define __UAPI_DEF_ABSTIME 1 + #endif /* __GLIBC__ */ #endif /* _UAPI_LIBC_COMPAT_H */ diff --git a/include/uapi/linux/time.h b/include/uapi/linux/time.h index e75e1b6..7129c54 100644 --- a/include/uapi/linux/time.h +++ b/include/uapi/linux/time.h @@ -1,9 +1,10 @@ #ifndef _UAPI_LINUX_TIME_H #define _UAPI_LINUX_TIME_H +#include #include - +#if __UAPI_DEF_TIMESPEC #ifndef _STRUCT_TIMESPEC #define _STRUCT_TIMESPEC struct timespec { @@ -11,6 +12,7 @@ struct timespec { longtv_nsec;/* nanoseconds */ }; #endif +#endif struct timeval { __kernel_time_t tv_sec; /* seconds */ @@ -31,10 +33,12 @@ struct timezone { #defineITIMER_VIRTUAL 1 #defineITIMER_PROF 2 +#if __UAPI_DEF_TIMESPEC struct itimerspec { struct timespec it_interval;/* timer period */ struct timespec it_value; /* timer expiration */ }; +#endif struct itimerval { struct timeval it_interval; /* timer interval */ @@ -64,6 +68,8 @@ struct itimerval { /* * The various flags for setting POSIX.1b interval timers: */ +#if __UAPI_DEF_ABSTIME #define TIMER_ABSTIME 0x01 +#endif #endif /* _UAPI_LINUX_TIME_H */
Re: [PATCH tip/timers/core] uapi glibc compat: make linux/time.h compile after libc time.h files
On Wed, Aug 17, 2016 at 3:33 PM, John Stultz <john.stu...@linaro.org> wrote: > On Wed, Aug 10, 2016 at 8:45 AM, Willem de Bruijn > <willemdebruijn.ker...@gmail.com> wrote: >> From: Willem de Bruijn <will...@google.com> >> >> Add libc-compat workaround for definitions in linux/time.h that >> duplicate those in libc time.h, sys/time.h and bits/time.h. >> >> With this change, userspace builds succeeds when linux/time.h is >> included after those libc files. The inverse requires changes to >> those userspace headers. >> >> Without this patch, when compiling the following program after >> make headers_install: >> >> echo -e "#include \n#include " | \ >> gcc -Wall -Werror -Iusr/include -c -xc - >> >> gcc gives these errors: >> >> >> Signed-off-by: Willem de Bruijn <will...@google.com> >> --- >> include/uapi/linux/libc-compat.h | 50 >> >> include/uapi/linux/time.h| 15 >> 2 files changed, 65 insertions(+) > > > So I don't have any objection to this. But I'm not sure if I'm the > right path for such a change to go through. > > From the commit history, it seems David Miller or Andrew Morton would > be the right folks to take this. > > thanks > -john It is a prerequisite for including linux/time.h from uapi/linux/errqueue.h, a network file https://lkml.org/lkml/2016/7/10/10 If making changes to these two files through net-next does not cause conflicts with other branches, I can resend this patch to netdev.
Re: [PATCH net-next 2/2] errqueue: include linux/time.h
On Mon, Sep 12, 2016 at 3:26 PM, kbuild test robot <l...@intel.com> wrote: > Hi Willem, > > [auto build test ERROR on net-next/master] > > url: > https://github.com/0day-ci/linux/commits/Willem-de-Bruijn/uapi-include-time-h-from-errqueue-h/20160913-020431 > config: i386-defconfig (attached as .config) > compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705 > reproduce: > # save the attached .config to linux build tree > make ARCH=i386 > > All errors (new ones prefixed by >>): This error report shows that breakage can occur with applications that include linux/errqueue.h before the libc time headers. The libc-compat definitions in this patch set only fix compilation when uapi headers are included after the userspace headers. These errors indeed go away when errqueue.h is included after the userspace time includes, as in the diff below. For the inverse, the libc headers need additional #if __UAPI_DEF_FOO changes, as described in include/uapli/linux/libc-compat.h. Those changes are a noop without kernel definitions, so arguably that libc patch should be merged before this kernel patch. I will remove this patch set from the patchwork queue for now. diff --git a/Documentation/networking/timestamping/txtimestamp.c b/Documentation/networking/timestamping/txtimestamp.c index 5df0704..f073801 100644 --- a/Documentation/networking/timestamping/txtimestamp.c +++ b/Documentation/networking/timestamping/txtimestamp.c @@ -37,7 +37,6 @@ #include #include #include -#include #include #include #include @@ -59,6 +58,7 @@ #include #include #include +#include #include
[PATCH net-next 0/2] uapi: include time.h from errqueue.h
From: Willem de Bruijn <will...@google.com> It was reported that linux/errqueue.h requires linux/time.h, but that adding the include directly may cause userspace conflicts between linux/time.h and glibc time.h: https://lkml.org/lkml/2016/7/10/10 Address the conflicts using the standard libc-compat approach, then add the #include to errqueue.h The first patch is a resubmit. It was previously submitted to tip/timers/core, but given the commit history, the maintainer suggested this tree, instead. https://lkml.org/lkml/2016/8/10/748 This also allows sending the follow-up as part of the patchset. Willem de Bruijn (2): uapi glibc compat: make linux/time.h compile with user time.h files errqueue: include linux/time.h include/uapi/linux/errqueue.h| 1 + include/uapi/linux/libc-compat.h | 50 include/uapi/linux/time.h| 15 3 files changed, 66 insertions(+) -- 2.8.0.rc3.226.g39d4020
[PATCH net-next 2/2] errqueue: include linux/time.h
From: Willem de Bruijn <will...@google.com> struct scm_timestamping has fields of type struct timespec. Now that it is safe to include linux/time.h and time.h at the same time, include linux/time.h directly in linux/errqueue.h Without this patch, when compiling the following program after make headers_install: gcc -Wall -Werror -Iusr/include -c -xc - < static struct scm_timestamping tss; int main(void) { tss.ts[0].tv_sec = 1; return 0; } EOF gcc gives this error: In file included from :1:0: usr/include/linux/errqueue.h:33:18: error: array type has incomplete element type struct timespec ts[3]; Reported-by: Brooks Moses <bmo...@google.com> Signed-off-by: Willem de Bruijn <will...@google.com> --- include/uapi/linux/errqueue.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h index 07bdce1..abafec8 100644 --- a/include/uapi/linux/errqueue.h +++ b/include/uapi/linux/errqueue.h @@ -1,6 +1,7 @@ #ifndef _UAPI_LINUX_ERRQUEUE_H #define _UAPI_LINUX_ERRQUEUE_H +#include #include struct sock_extended_err { -- 2.8.0.rc3.226.g39d4020
[PATCH net-next 1/2] uapi glibc compat: make linux/time.h compile with user time.h files
From: Willem de Bruijn <will...@google.com> Add libc-compat workaround for definitions in linux/time.h that duplicate those in libc time.h, sys/time.h and bits/time.h. With this change, userspace builds succeeds when linux/time.h is included after libc time.h and when it is included after sys/time.h. The inverse requires additional changes to those userspace headers. Without this patch, when compiling the following program after make headers_install: echo -e "#include \n#include " | \ gcc -Wall -Werror -Iusr/include -c -xc - gcc gives these errors: #include #include In file included from ../test_time.c:3:0: /usr/include/time.h:120:8: error: redefinition of ‘struct timespec’ struct timespec ^ In file included from ../test_time.c:2:0: ./usr/include/linux/time.h:9:8: note: originally defined here struct timespec { ^ In file included from ../test_time.c:3:0: /usr/include/time.h:161:8: error: redefinition of ‘struct itimerspec’ struct itimerspec ^ In file included from ../test_time.c:2:0: ./usr/include/linux/time.h:34:8: note: originally defined here struct itimerspec { and this warning by indirect inclusion of bits/time.h: In file included from ../test_time.c:4:0: ./usr/include/linux/time.h:67:0: error: "TIMER_ABSTIME" redefined [-Werror] #define TIMER_ABSTIME 0x01 ^ In file included from /usr/include/time.h:41:0, from ../test_time.c:3: /usr/include/x86_64-linux-gnu/bits/time.h:82:0: note: this is the location of the previous definition # define TIMER_ABSTIME 1 ^ The _SYS_TIME_H variant resolves similar errors for timeval, timezone, itimerval and warnings for ITIMER_REAL, ITIMER_VIRTUAL, ITIMER_PROF. Ran the same program for sys/time.h and bits/time.h. Signed-off-by: Willem de Bruijn <will...@google.com> --- include/uapi/linux/libc-compat.h | 50 include/uapi/linux/time.h| 15 2 files changed, 65 insertions(+) diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h index 44b8a6b..b08b0f5 100644 --- a/include/uapi/linux/libc-compat.h +++ b/include/uapi/linux/libc-compat.h @@ -165,6 +165,43 @@ #define __UAPI_DEF_XATTR 1 #endif +/* Definitions for time.h */ +#if defined(__timespec_defined) +#define __UAPI_DEF_TIMESPEC0 +#else +#define __UAPI_DEF_TIMESPEC1 +#endif + +#if defined(_TIME_H) && defined(__USE_POSIX199309) +#define __UAPI_DEF_ITIMERSPEC 0 +#else +#define __UAPI_DEF_ITIMERSPEC 1 +#endif + +/* Definitions for sys/time.h */ +#if defined(_SYS_TIME_H) +#define __UAPI_DEF_TIMEVAL 0 +#define __UAPI_DEF_ITIMERVAL 0 +#define __UAPI_DEF_ITIMER_WHICH0 +#else +#define __UAPI_DEF_TIMEVAL 1 +#define __UAPI_DEF_ITIMERVAL 1 +#define __UAPI_DEF_ITIMER_WHICH1 +#endif + +/* Definitions for bits/time.h */ +#if defined(_BITS_TIME_H) +#define __UAPI_DEF_ABSTIME 0 +#else +#define __UAPI_DEF_ABSTIME 1 +#endif + +#if defined(_SYS_TIME_H) && defined(__USE_BSD) +#define __UAPI_DEF_TIMEZONE0 +#else +#define __UAPI_DEF_TIMEZONE1 +#endif + /* If we did not see any headers from any supported C libraries, * or we are being included in the kernel, then define everything * that we need. */ @@ -208,6 +245,19 @@ /* Definitions for xattr.h */ #define __UAPI_DEF_XATTR 1 +/* Definitions for time.h */ +#define __UAPI_DEF_TIMESPEC1 +#define __UAPI_DEF_ITIMERSPEC 1 + +/* Definitions for sys/time.h */ +#define __UAPI_DEF_TIMEVAL 1 +#define __UAPI_DEF_ITIMERVAL 1 +#define __UAPI_DEF_ITIMER_WHICH1 +#define __UAPI_DEF_TIMEZONE1 + +/* Definitions for bits/time.h */ +#define __UAPI_DEF_ABSTIME 1 + #endif /* __GLIBC__ */ #endif /* _UAPI_LIBC_COMPAT_H */ diff --git a/include/uapi/linux/time.h b/include/uapi/linux/time.h index e75e1b6..4e7333c 100644 --- a/include/uapi/linux/time.h +++ b/include/uapi/linux/time.h @@ -1,9 +1,11 @@ #ifndef _UAPI_LINUX_TIME_H #define _UAPI_LINUX_TIME_H +#include #include +#if __UAPI_DEF_TIMESPEC #ifndef _STRUCT_TIMESPEC #define _STRUCT_TIMESPEC struct timespec { @@ -11,35 +13,46 @@ struct timespec { longtv_nsec;/* nanoseconds */ }; #endif +#endif +#if __UAPI_DEF_TIMEVAL struct timeval { __kernel_time_t tv_sec; /* seconds */ __kernel_suseconds_ttv_usec;/* microseconds */ }; +#endif +#if __UAPI_DEF_TIMEZONE struct timezone { int tz_minuteswest; /* minutes west of Greenwich */ int tz_dsttime; /* type of dst correction */ }; +#endif /* * Names of the interval timers, and structure * defining a timer set
Re: af_packet: use after free in prb_retire_rx_blk_timer_expired
On Mon, Apr 10, 2017 at 3:23 PM, Dave Joneswrote: > On Mon, Apr 10, 2017 at 07:03:30PM +, alexander.le...@verizon.com wrote: > > Hi all, > > > > I seem to be hitting this use-after-free on a -next kernel using trinity: > > > > [ 531.036054] BUG: KASAN: use-after-free in > prb_retire_rx_blk_timer_expired (net/packet/af_packet.c:688) The retire_blk_timer is called after the pg_vec struct for this ring was freed. This should not happen. packet_set_ring stops the timer with del_timer_sync when tearing down the ring before freeing that struct: if (closing && (po->tp_version > TPACKET_V2)) { /* Because we don't support block-based V3 on tx-ring */ if (!tx_ring) prb_shutdown_retire_blk_timer(po, rb_queue); } if (pg_vec) free_pg_vec(pg_vec, order, req->tp_block_nr); This is a similar race to the use-after-free fixed by 84ac7260236a ("packet: fix race condition in packet_set_ring"). The previous race was triggered by a call to setsockopt PACKET_VERSION changing tp_version while the ring is active. It is not immediately obvious what is the cause now. I suppose trinity does not give a trace of such system calls on this file descriptor? That would be helpful. The bug report shows both a timer firing after the packet_set_ring call that freed the pg_vec, and later a CONFIG_DEBUG_OBJECTS_FREE warning that the timer is still active when the socket is closed on release of the last file descriptor.
Re: net/ipv4: use-after-free in ipv4_datagram_support_cmsg
=== > BUG: KASAN: use-after-free in ipv4_datagram_support_cmsg > net/ipv4/ip_sockglue.c:500 [inline] at addr 880059be0128 Thanks for the report. This is accessing skb->dev from within recvmsg() at line info->ipi_ifindex = skb->dev->ifindex; Introduced in 829ae9d61165 ("net-timestamp: allow reading recv cmsg on errqueue with origin tstamp"). At this time the device may indeed have gone away. I'm having a look at a way to read this in the receive BH and store the ifindex.
Re: net/ipv4: use-after-free in ipv4_datagram_support_cmsg
On Wed, Apr 12, 2017 at 4:47 PM, Eric Dumazet <eric.duma...@gmail.com> wrote: > On Wed, 2017-04-12 at 13:07 -0700, Cong Wang wrote: >> On Wed, Apr 12, 2017 at 8:39 AM, Willem de Bruijn >> <willemdebruijn.ker...@gmail.com> wrote: >> > === >> >> BUG: KASAN: use-after-free in ipv4_datagram_support_cmsg >> >> net/ipv4/ip_sockglue.c:500 [inline] at addr 880059be0128 >> > >> > Thanks for the report. This is accessing skb->dev from within recvmsg() at >> > line >> > >> > info->ipi_ifindex = skb->dev->ifindex; >> > >> > Introduced in 829ae9d61165 ("net-timestamp: allow reading recv cmsg on >> > errqueue with origin tstamp"). At this time the device may indeed have >> > gone away. I'm having a look at a way to read this in the receive BH >> > and store the ifindex. >> >> Why not use skb_iif? This code is called from the error path for transmit timestamps. We can make use of the fact that SKB_EXT_ERR used on enqueue has iif as the first field in its control block. This also holds for the PKTINFO_SKB_CB struct to which skb->cb is cast on dequeue when it copies pktinfo to userspace. So if set on enqueue in __skb_complete_tx_timestamp, no conversion operation is even needed on dequeue, let alone the currently buggy line that touches skb->dev. This iif cast was added for this purpose in the receive path in 0b922b7a829c ("net: original ingress device index in PKTINFO"). The device pointer is valid on enqueue for all paths called from device drivers, as well as from dev_queue_xmit for SCM_TSTAMP_SCHED generation in __dev_queue_xmit. The exception is SCM_TSTAMP_ACK generation, but there skb->dev is NULL. The v6 path does need a conversion, but already does this in ip6_datagram_recv_common_ctl. There, too, we can remove the buggy logic to set it from skb->dev->ifindex in ip6_datagram_support_cmsg. I will send a patch.
Re: net/ipv4: use-after-free in ipv4_datagram_support_cmsg
On Wed, Apr 12, 2017 at 6:25 PM, Willem de Bruijn <willemdebruijn.ker...@gmail.com> wrote: > On Wed, Apr 12, 2017 at 4:47 PM, Eric Dumazet <eric.duma...@gmail.com> wrote: >> On Wed, 2017-04-12 at 13:07 -0700, Cong Wang wrote: >>> On Wed, Apr 12, 2017 at 8:39 AM, Willem de Bruijn >>> <willemdebruijn.ker...@gmail.com> wrote: >>> > === >>> >> BUG: KASAN: use-after-free in ipv4_datagram_support_cmsg >>> >> net/ipv4/ip_sockglue.c:500 [inline] at addr 880059be0128 >>> > >>> > Thanks for the report. This is accessing skb->dev from within recvmsg() >>> > at line >>> > >>> > info->ipi_ifindex = skb->dev->ifindex; >>> > >>> > Introduced in 829ae9d61165 ("net-timestamp: allow reading recv cmsg on >>> > errqueue with origin tstamp"). At this time the device may indeed have >>> > gone away. I'm having a look at a way to read this in the receive BH >>> > and store the ifindex. >>> >>> Why not use skb_iif? > > This code is called from the error path for transmit timestamps. > > We can make use of the fact that SKB_EXT_ERR used on enqueue has iif as > the first field in its control block. This also holds for the PKTINFO_SKB_CB > struct to which skb->cb is cast on dequeue when it copies pktinfo to > userspace. > So if set on enqueue in __skb_complete_tx_timestamp, no conversion operation > is even needed on dequeue, let alone the currently buggy line that touches > skb->dev. > > This iif cast was added for this purpose in the receive path in 0b922b7a829c > ("net: original ingress device index in PKTINFO"). > > The device pointer is valid on enqueue for all paths called from device > drivers, > as well as from dev_queue_xmit for SCM_TSTAMP_SCHED generation in > __dev_queue_xmit. The exception is SCM_TSTAMP_ACK generation, but > there skb->dev is NULL. > > The v6 path does need a conversion, but already does this in > ip6_datagram_recv_common_ctl. There, too, we can remove the buggy > logic to set it from skb->dev->ifindex in ip6_datagram_support_cmsg. > > I will send a patch. Sent http://patchwork.ozlabs.org/patch/750197
Re: [PATCH v06 18/36] uapi linux/errqueue.h: include linux/time.h in user space
>> > +#ifdef __KERNEL__ >> > +#include >> > +#else >> > +#include >> > +#endif /* __KERNEL__ */ >> >> This will break applications that include manually. >> I previously sent a patch to use libc-compat to make compilation succeed >> when both are included in the case where is included after >> . >> >> https://lkml.org/lkml/2016/9/12/872 >> >> The inverse will require changes to the libc header to avoid redefining >> symbols already defined by >> >> The second patch in that 2-patch set included >> unconditionally after the fix. This broke builds that also included >> in the wrong order. I did not resubmit the first patch as a >> stand-alone, as it is not sufficient to avoid breakage. > > I wasn't aware of your change, but I was about to send this to fix the > case when glibc is included before : > > https://github.com/mcfrisk/linux/commit/f3952a27b8a21c6478d26e6246055383483f6a66 There are a few differences between the two. Including does not unconditionally define all the symbols. Some are conditional on additional state, such as __timespec_defined. > but you also ran into problems where is included before > which need fixes in libc header side. > > So how to proceed with these? The libc-compat change is a good fix that can be submitted on its own. > I don't like leaving a few dozen non-compiling header files into uapi. I agree, but I do not see a simple solution. Unless libc has the analogous change, including either or in userspace can unfortunately cause breakage. The added include if __KERNEL__ is defined should be safe, though.
Re: [PATCH v06 18/36] uapi linux/errqueue.h: include linux/time.h in user space
On Sun, Aug 6, 2017 at 5:52 PM, Mikko Rapeli <mikko.rap...@iki.fi> wrote: > On Sun, Aug 06, 2017 at 05:42:13PM -0400, Willem de Bruijn wrote: >> On Sun, Aug 6, 2017 at 5:33 PM, Mikko Rapeli <mikko.rap...@iki.fi> wrote: >> > On Sun, Aug 06, 2017 at 05:24:20PM -0400, Willem de Bruijn wrote: >> >> >> > +#ifdef __KERNEL__ >> >> >> > +#include >> >> >> > +#else >> >> >> > +#include >> >> >> > +#endif /* __KERNEL__ */ >> >> >> >> >> >> This will break applications that include manually. >> >> >> I previously sent a patch to use libc-compat to make compilation >> >> >> succeed >> >> >> when both are included in the case where is included >> >> >> after >> >> >> . >> >> >> >> >> >> https://lkml.org/lkml/2016/9/12/872 >> >> >> >> >> >> The inverse will require changes to the libc header to avoid redefining >> >> >> symbols already defined by >> >> >> >> >> >> The second patch in that 2-patch set included >> >> >> unconditionally after the fix. This broke builds that also included >> >> >> in the wrong order. I did not resubmit the first patch as a >> >> >> stand-alone, as it is not sufficient to avoid breakage. >> >> > >> >> > I wasn't aware of your change, but I was about to send this to fix the >> >> > case when glibc is included before : >> >> > >> >> > https://github.com/mcfrisk/linux/commit/f3952a27b8a21c6478d26e6246055383483f6a66 >> >> >> >> There are a few differences between the two. Including does not >> >> unconditionally define all the symbols. Some are conditional on additional >> >> state, such as __timespec_defined. >> > >> > Yep, your patch seems better for libc-compat.h. Could you send it again? >> >> Okay. Or feel free to include it in the patchset if that helps resolve >> dependencies. > > If you don't have the time, I will send tomorrow a new version of this > patch which fixes the commit topic and before that your libc-compat.h change > so both could be applied together. Please do. Thanks!
Re: [PATCH v06 18/36] uapi linux/errqueue.h: include linux/time.h in user space
On Sun, Aug 6, 2017 at 5:33 PM, Mikko Rapeli <mikko.rap...@iki.fi> wrote: > On Sun, Aug 06, 2017 at 05:24:20PM -0400, Willem de Bruijn wrote: >> >> > +#ifdef __KERNEL__ >> >> > +#include >> >> > +#else >> >> > +#include >> >> > +#endif /* __KERNEL__ */ >> >> >> >> This will break applications that include manually. >> >> I previously sent a patch to use libc-compat to make compilation succeed >> >> when both are included in the case where is included after >> >> . >> >> >> >> https://lkml.org/lkml/2016/9/12/872 >> >> >> >> The inverse will require changes to the libc header to avoid redefining >> >> symbols already defined by >> >> >> >> The second patch in that 2-patch set included >> >> unconditionally after the fix. This broke builds that also included >> >> in the wrong order. I did not resubmit the first patch as a >> >> stand-alone, as it is not sufficient to avoid breakage. >> > >> > I wasn't aware of your change, but I was about to send this to fix the >> > case when glibc is included before : >> > >> > https://github.com/mcfrisk/linux/commit/f3952a27b8a21c6478d26e6246055383483f6a66 >> >> There are a few differences between the two. Including does not >> unconditionally define all the symbols. Some are conditional on additional >> state, such as __timespec_defined. > > Yep, your patch seems better for libc-compat.h. Could you send it again? Okay. Or feel free to include it in the patchset if that helps resolve dependencies. >> > I don't like leaving a few dozen non-compiling header files into uapi. >> >> I agree, but I do not see a simple solution. >> >> Unless libc has the analogous change, including either or >> in userspace can unfortunately cause breakage. >> >> The added include if __KERNEL__ is defined should be safe, though. > > Yes, for the kernel side, but your libc-compat change would nice for > userspace, where something will break for sure, but providing source > API compatibility is sometimes impossible. > > To summarize, this change from me, and your libc-compat.c for time.h, or? I'm still afraid that this patch as is will break builds that include first.
Re: [PATCH v06 18/36] uapi linux/errqueue.h: include linux/time.h in user space
On Sun, Aug 6, 2017 at 4:23 PM, Willem de Bruijn <willemdebruijn.ker...@gmail.com> wrote: > On Sun, Aug 6, 2017 at 12:44 PM, Mikko Rapeli <mikko.rap...@iki.fi> wrote: >> linux/time.h conflicts with user space header time.h. Try to be compatible >> with both. >> >> Fixes userspace compilation error: >> >> error: array type has incomplete element type >> struct timespec ts[3]; >> >> Signed-off-by: Mikko Rapeli <mikko.rap...@iki.fi> >> Cc: Willem de Bruijn <will...@google.com> >> Cc: Soheil Hassas Yeganeh <soh...@google.com> >> Cc: net...@vger.kernel.org >> --- >> include/uapi/linux/errqueue.h | 6 ++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h >> index 07bdce1f444a..b310b2c6d94f 100644 >> --- a/include/uapi/linux/errqueue.h >> +++ b/include/uapi/linux/errqueue.h >> @@ -3,6 +3,12 @@ >> >> #include >> >> +#ifdef __KERNEL__ >> +#include >> +#else >> +#include >> +#endif /* __KERNEL__ */ > > This will break applications that include manually. Also, the patch title reads "include in user space", but it includes in that environment.
Re: [PATCH v06 18/36] uapi linux/errqueue.h: include linux/time.h in user space
On Sun, Aug 6, 2017 at 12:44 PM, Mikko Rapeli <mikko.rap...@iki.fi> wrote: > linux/time.h conflicts with user space header time.h. Try to be compatible > with both. > > Fixes userspace compilation error: > > error: array type has incomplete element type > struct timespec ts[3]; > > Signed-off-by: Mikko Rapeli <mikko.rap...@iki.fi> > Cc: Willem de Bruijn <will...@google.com> > Cc: Soheil Hassas Yeganeh <soh...@google.com> > Cc: net...@vger.kernel.org > --- > include/uapi/linux/errqueue.h | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h > index 07bdce1f444a..b310b2c6d94f 100644 > --- a/include/uapi/linux/errqueue.h > +++ b/include/uapi/linux/errqueue.h > @@ -3,6 +3,12 @@ > > #include > > +#ifdef __KERNEL__ > +#include > +#else > +#include > +#endif /* __KERNEL__ */ This will break applications that include manually. I previously sent a patch to use libc-compat to make compilation succeed when both are included in the case where is included after . https://lkml.org/lkml/2016/9/12/872 The inverse will require changes to the libc header to avoid redefining symbols already defined by The second patch in that 2-patch set included unconditionally after the fix. This broke builds that also included in the wrong order. I did not resubmit the first patch as a stand-alone, as it is not sufficient to avoid breakage.
Re: linux-next: build failure after merge of the netfilter tree
On Wed, May 17, 2017 at 1:02 AM, Willem de Bruijn <willemdebruijn.ker...@gmail.com> wrote: > On Tue, May 16, 2017 at 11:45 PM, Stephen Rothwell <s...@canb.auug.org.au> > wrote: >> Hi all, >> >> After merging the netfilter tree, today's linux-next build (i386 >> defconfig) failed like this: >> >> net/netfilter/x_tables.c: In function 'xt_match_to_user': >> net/netfilter/x_tables.c:303:13: error: implicit declaration of function >> 'COMPAT_XT_ALIGN' [-Werror=implicit-function-declaration] >> C_SIZE ? COMPAT_XT_ALIGN(C_SIZE) : \ >> ^ >> net/netfilter/x_tables.c:310:9: note: in expansion of macro 'XT_DATA_TO_USER' >> XT_DATA_TO_USER(u, m, match, 0); >> ^ >> >> Caused by commit >> >> 324318f0248c ("netfilter: xtables: zero padding in data_to_user") >> >> In the !CONFIG_COMPAT case C_SIZE will always be zero, but the compiler >> is still looking for the macro :-( > > Apologies for the breakage. > >> I added this cludge patch (I am sure it can be done better): >> >> From: Stephen Rothwell <s...@canb.auug.org.au> >> Date: Wed, 17 May 2017 13:36:26 +1000 >> Subject: [PATCH] netfilter: xtables: fix for zero padding in data_to_user >> >> Signed-off-by: Stephen Rothwell <s...@canb.auug.org.au> >> --- >> net/netfilter/x_tables.c | 8 >> 1 file changed, 8 insertions(+) >> >> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c >> index d17769599c10..2b1785993a92 100644 >> --- a/net/netfilter/x_tables.c >> +++ b/net/netfilter/x_tables.c >> @@ -296,12 +296,20 @@ int xt_data_to_user(void __user *dst, const void *src, >> } >> EXPORT_SYMBOL_GPL(xt_data_to_user); >> >> +#ifdef CONFIG_COMPAT >> #define XT_DATA_TO_USER(U, K, TYPE, C_SIZE)\ >> xt_data_to_user(U->data, K->data, \ >> K->u.kernel.TYPE->usersize, \ >> C_SIZE ? : K->u.kernel.TYPE->TYPE##size,\ >> C_SIZE ? COMPAT_XT_ALIGN(C_SIZE) : \ >> XT_ALIGN(K->u.kernel.TYPE->TYPE##size)) >> +#else >> +#define XT_DATA_TO_USER(U, K, TYPE, C_SIZE)\ >> + xt_data_to_user(U->data, K->data, \ >> + K->u.kernel.TYPE->usersize, \ >> + C_SIZE ? : K->u.kernel.TYPE->TYPE##size,\ >> + C_SIZE ? : XT_ALIGN(K->u.kernel.TYPE->TYPE##size)) >> +#endif >> > > I will send a patch to the netfilter to define a separate > COMPAT_XT_DATA_TO_USER inside the CONFIG_COMPAT region further down > the file. This also allows simplifying XT_DATA_TO_USER by removing > those ternary statements. Full patch out for review at http://patchwork.ozlabs.org/patch/763655/
Re: linux-next: build failure after merge of the netfilter tree
On Tue, May 16, 2017 at 11:45 PM, Stephen Rothwellwrote: > Hi all, > > After merging the netfilter tree, today's linux-next build (i386 > defconfig) failed like this: > > net/netfilter/x_tables.c: In function 'xt_match_to_user': > net/netfilter/x_tables.c:303:13: error: implicit declaration of function > 'COMPAT_XT_ALIGN' [-Werror=implicit-function-declaration] > C_SIZE ? COMPAT_XT_ALIGN(C_SIZE) : \ > ^ > net/netfilter/x_tables.c:310:9: note: in expansion of macro 'XT_DATA_TO_USER' > XT_DATA_TO_USER(u, m, match, 0); > ^ > > Caused by commit > > 324318f0248c ("netfilter: xtables: zero padding in data_to_user") > > In the !CONFIG_COMPAT case C_SIZE will always be zero, but the compiler > is still looking for the macro :-( Apologies for the breakage. > I added this cludge patch (I am sure it can be done better): > > From: Stephen Rothwell > Date: Wed, 17 May 2017 13:36:26 +1000 > Subject: [PATCH] netfilter: xtables: fix for zero padding in data_to_user > > Signed-off-by: Stephen Rothwell > --- > net/netfilter/x_tables.c | 8 > 1 file changed, 8 insertions(+) > > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c > index d17769599c10..2b1785993a92 100644 > --- a/net/netfilter/x_tables.c > +++ b/net/netfilter/x_tables.c > @@ -296,12 +296,20 @@ int xt_data_to_user(void __user *dst, const void *src, > } > EXPORT_SYMBOL_GPL(xt_data_to_user); > > +#ifdef CONFIG_COMPAT > #define XT_DATA_TO_USER(U, K, TYPE, C_SIZE)\ > xt_data_to_user(U->data, K->data, \ > K->u.kernel.TYPE->usersize, \ > C_SIZE ? : K->u.kernel.TYPE->TYPE##size,\ > C_SIZE ? COMPAT_XT_ALIGN(C_SIZE) : \ > XT_ALIGN(K->u.kernel.TYPE->TYPE##size)) > +#else > +#define XT_DATA_TO_USER(U, K, TYPE, C_SIZE)\ > + xt_data_to_user(U->data, K->data, \ > + K->u.kernel.TYPE->usersize, \ > + C_SIZE ? : K->u.kernel.TYPE->TYPE##size,\ > + C_SIZE ? : XT_ALIGN(K->u.kernel.TYPE->TYPE##size)) > +#endif > I will send a patch to the netfilter to define a separate COMPAT_XT_DATA_TO_USER inside the CONFIG_COMPAT region further down the file. This also allows simplifying XT_DATA_TO_USER by removing those ternary statements. -#define XT_DATA_TO_USER(U, K, TYPE, C_SIZE)\ +#define XT_DATA_TO_USER(U, K, TYPE)\ xt_data_to_user(U->data, K->data, \ K->u.kernel.TYPE->usersize, \ - C_SIZE ? : K->u.kernel.TYPE->TYPE##size,\ - C_SIZE ? COMPAT_XT_ALIGN(C_SIZE) : \ -XT_ALIGN(K->u.kernel.TYPE->TYPE##size)) + K->u.kernel.TYPE->TYPE##size, \ + XT_ALIGN(K->u.kernel.TYPE->TYPE##size)) and +#define COMPAT_XT_DATA_TO_USER(U, K, TYPE, C_SIZE) \ + xt_data_to_user(U->data, K->data, \ + K->u.kernel.TYPE->usersize, \ + C_SIZE, \ + COMPAT_XT_ALIGN(C_SIZE)) +
Re: linux-next: build failure after merge of the net-next tree
On Sun, May 21, 2017 at 9:16 PM, Stephen Rothwellwrote: > Hi all, > > After merging the net-next tree, today's linux-next build (powerpc > ppc64_defconfig) failed like this: > > net/socket.c: In function 'put_ts_pktinfo': > net/socket.c:695:28: error: 'SCM_TIMESTAMPING_PKTINFO' undeclared (first use > in this function) > put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMPING_PKTINFO, > ^ > > Caused by commit > > aad9c8c470f2 ("net: add new control message for incoming HW-timestamped > packets") > > This probably broke every architecture that has its own > arch//include/uapi/asm/socket.h that does not include > include/uapi/asm-generic/socket.h :-( Indeed. I added the architecture specific versions in patch http://patchwork.ozlabs.org/patch/765238/ That fixes the powerpc build for me. The new option is now defined in every file that also defines the last added such option SCM_TIMESTAMPING_OPT_STATS. Apologies for missing this earlier.
Re: [PATCH] net/packet: fix race condition between fanout_add and __unregister_prot_hook
On Wed, Sep 13, 2017 at 10:40 PM, nixiaomingwrote: > If fanout_add is preempted after running po-> fanout = match > and before running __fanout_link, > it will cause BUG_ON when __unregister_prot_hook call __fanout_unlink > > so, we need add mutex_lock(_mutex) to __unregister_prot_hook > or add spin_lock(>bind_lock) before po-> fanout = match > > test on linux 4.1.42: > ./trinity -c setsockopt -C 2 -X & > > BUG: failure at net/packet/af_packet.c:1414/__fanout_unlink()! > Kernel panic - not syncing: BUG! > CPU: 2 PID: 2271 Comm: trinity-c0 Tainted: GW O4.1.12 #1 > Hardware name: Hisilicon PhosphorHi1382 FPGA (DT) > Call trace: > [] dump_backtrace+0x0/0xf8 > [] show_stack+0x20/0x28 > [] dump_stack+0xac/0xe4 > [] panic+0xf8/0x268 > [] __unregister_prot_hook+0xa0/0x144 > [] packet_set_ring+0x280/0x5b4 > [] packet_setsockopt+0x320/0x950 > [] SyS_setsockopt+0xa4/0xd4 > > Signed-off-by: nixiaoming > Tested-by: wudesheng > --- > net/packet/af_packet.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index 008a45c..0300146 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -365,10 +365,12 @@ static void __unregister_prot_hook(struct sock *sk, > bool sync) > > po->running = 0; > > + mutex_lock(_mutex); > if (po->fanout) > __fanout_unlink(sk, po); > else > __dev_remove_pack(>prot_hook); > + mutex_unlock(_mutex); > > __sock_put(sk); I happened to be looking at the same or a very similar race, courtesy of syzkaller. packet_set_ring and fanout_add can race. I believe that one bug is in fanout_add removing the socket protocol hook and adding the fanout protocol hook without holding po->bind_lock. That lock ensures atomic updates to po->running and the actual protocol hook. fanout_add tests po->running without holding the lock if (!po->running) goto out; and later unconditionally unbinds the socket protocol hook and binds the fanout group protocol hook: if (refcount_read(>sk_ref) < PACKET_FANOUT_MAX) { __dev_remove_pack(>prot_hook); po->fanout = match; refcount_set(>sk_ref, refcount_read(>sk_ref) + 1); __fanout_link(sk, po); err = 0; } This can happen after packet_set_ring has already removed the protocol hook, causing the socket to be added to the fanout list twice. Testing po->running again, this time while holding the bind_lock, ensures that packet_set_ring cannot have dropped it in between: + spin_lock(>bind_lock); + if (!po->running) { + net_err_ratelimited("fanout add, but unbound sock"); + err = -EFAULT; + spin_unlock(>bind_lock); + goto out; + } + __dev_remove_pack(>prot_hook)); po->fanout = match; refcount_set(>sk_ref, refcount_read(>sk_ref) + 1); __fanout_link(sk, po); + spin_unlock(>bind_lock); I verified that the reproducer logs plenty of "fanout add, but unbound sock" messages. I intend to send this fix after cleaning it up a bit. Will take a closer look at your patch to see whether these are indeed the same bug report.
Re: [PATCH] net/packet: fix race condition between fanout_add and __unregister_prot_hook
On Thu, Sep 14, 2017 at 10:07 AM, nixiaomingwrote: > From: l00219569 > > If fanout_add is preempted after running po-> fanout = match > and before running __fanout_link, > it will cause BUG_ON when __unregister_prot_hook call __fanout_unlink > > so, we need add mutex_lock(_mutex) to __unregister_prot_hook The packet socket code has no shortage of locks, so there are many ways to avoid the race condition between fanout_add and packet_set_ring. Another option would be to lock the socket when calling fanout_add: - return fanout_add(sk, val & 0x, val >> 16); + lock_sock(sk); + ret = fanout_add(sk, val & 0x, val >> 16); + release_sock(sk); + return ret; But, for consistency, and to be able to continue to make sense of the locking policy, we should use the most appropriate lock. This is po->bind_lock, as it ensures atomicity between testing whether a protocol hook is active through po->running and the actual existence of that hook on the protocol hook list. fanout_mutex protects the fanout object's list. Taking that on __unregister_prot_hook even in the case where fanout is not used (and __dev_remove_pack is called) complicates locking in this already complicated code. > or add spin_lock(>bind_lock) before po-> fanout = match > > this is a patch for add po->bind_lock in fanout_add > > test on linux 4.1.12: > ./trinity -c setsockopt -C 2 -X & Thanks for testing! > > BUG: failure at net/packet/af_packet.c:1414/__fanout_unlink()! > Kernel panic - not syncing: BUG! > CPU: 2 PID: 2271 Comm: trinity-c0 Tainted: GW O4.1.12 #1 > Hardware name: Hisilicon PhosphorHi1382 FPGA (DT) > Call trace: > [] dump_backtrace+0x0/0xf8 > [] show_stack+0x20/0x28 > [] dump_stack+0xac/0xe4 > [] panic+0xf8/0x268 > [] __unregister_prot_hook+0xa0/0x144 > [] packet_set_ring+0x280/0x5b4 > [] packet_setsockopt+0x320/0x950 > [] SyS_setsockopt+0xa4/0xd4 > > Signed-off-by: nixiaoming > Tested-by: wudesheng > --- > net/packet/af_packet.c | 11 --- > 1 file changed, 8 insertions(+), 3 deletions(-) > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index 54a18a8..7a52a3b 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -1446,12 +1446,16 @@ static int fanout_add(struct sock *sk, u16 id, u16 > type_flags) > default: > return -EINVAL; > } > - > - if (!po->running) > + spin_lock(>bind_lock); > + if (!po->running) { > + spin_unlock(>bind_lock); > return -EINVAL; > + } > > - if (po->fanout) > + if (po->fanout) { > + spin_unlock(>bind_lock); > return -EALREADY; > + } > > mutex_lock(_mutex); > match = NULL; > @@ -1501,6 +1505,7 @@ static int fanout_add(struct sock *sk, u16 id, u16 > type_flags) > } > out: > mutex_unlock(_mutex); > + spin_unlock(>bind_lock); This function can call kzalloc with GFP_KERNEL, which may sleep. It is not correct to sleep while holding a spinlock. Which is why I take the lock later and test po->running again. I will clean up that patch and send it for review.
Re: [PATCH] net/packet: fix race condition between fanout_add and __unregister_prot_hook
On Fri, Sep 15, 2017 at 1:41 PM, Cong Wang <xiyou.wangc...@gmail.com> wrote: > On Thu, Sep 14, 2017 at 7:35 AM, Willem de Bruijn > <willemdebruijn.ker...@gmail.com> wrote: >> On Thu, Sep 14, 2017 at 10:07 AM, nixiaoming <nixiaom...@huawei.com> wrote: >>> From: l00219569 <lisi...@huawei.com> >>> >>> If fanout_add is preempted after running po-> fanout = match >>> and before running __fanout_link, >>> it will cause BUG_ON when __unregister_prot_hook call __fanout_unlink >>> >>> so, we need add mutex_lock(_mutex) to __unregister_prot_hook >> >> The packet socket code has no shortage of locks, so there are many >> ways to avoid the race condition between fanout_add and packet_set_ring. >> >> Another option would be to lock the socket when calling fanout_add: >> >> - return fanout_add(sk, val & 0x, val >> 16); >> + lock_sock(sk); >> + ret = fanout_add(sk, val & 0x, val >> 16); >> + release_sock(sk); >> + return ret; >> > > I don't think this is an option, because __unregister_prot_hook() > can be called without lock_sock(), for example in packet_notifier(). > > >> But, for consistency, and to be able to continue to make sense of the >> locking policy, we should use the most appropriate lock. This >> is po->bind_lock, as it ensures atomicity between testing whether >> a protocol hook is active through po->running and the actual existence >> of that hook on the protocol hook list. > > Yeah, register_prot_hook() and unregister_prot_hook() already assume > bind_lock. > > [...] > >>> out: >>> mutex_unlock(_mutex); >>> + spin_unlock(>bind_lock); >> >> This function can call kzalloc with GFP_KERNEL, which may sleep. It is >> not correct to sleep while holding a spinlock. Which is why I take the lock >> later and test po->running again. > > > Right, no need to mention the mutex_unlock() before the spin_unlock() > is clearly wrong. > > >> >> I will clean up that patch and send it for review. > > How about the following patch? > > > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index c26172995511..f5c696a548ed 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -1754,10 +1754,14 @@ static int fanout_add(struct sock *sk, u16 id, > u16 type_flags) > match->prot_hook.dev == po->prot_hook.dev) { > err = -ENOSPC; > if (refcount_read(>sk_ref) < PACKET_FANOUT_MAX) { > + spin_lock(>bind_lock); > __dev_remove_pack(>prot_hook); > - po->fanout = match; > - refcount_set(>sk_ref, > refcount_read(>sk_ref) + 1); > - __fanout_link(sk, po); > + if (po->running) { > + refcount_set(>sk_ref, > refcount_read(>sk_ref) + 1); > + po->fanout = match; > + __fanout_link(sk, po); > + } > + spin_unlock(>bind_lock); > err = 0; > } > } In case of failure we also need to unlink and free match. I sent the following: http://patchwork.ozlabs.org/patch/813945/
Re: Re: [PATCH] net/packet: fix race condition between fanout_add and __unregister_prot_hook
On Tue, Sep 19, 2017 at 3:21 AM, Nixiaoming <nixiaom...@huawei.com> wrote: > On Fri, Sep 15, 2017 at 10:46 AM, Willem de Bruijn > > <willemdebruijn.ker...@gmail.com> wrote: > >> > >> In case of failure we also need to unlink and free match. I > >> sent the following: > >> > >> http://patchwork.ozlabs.org/patch/813945/ > > > > + spin_lock(>bind_lock); > > + if (po->running && > > + match->type == type && > >match->prot_hook.type == po->prot_hook.type && > >match->prot_hook.dev == po->prot_hook.dev) { > > err = -ENOSPC; > > @@ -1761,6 +1760,13 @@ static int fanout_add(struct sock *sk, u16 id, u16 > type_flags) > > err = 0; > > } > >} > > + spin_unlock(>bind_lock); > > + > > + if (err && !refcount_read(>sk_ref)) { > > +list_del(>list); > > +kfree(match); > > + } > > > > > > In the function fanout_add add spin_lock to protect po-> running and po-> > fanout, > > then whether it should be in the function fanout_release also add spin_lock > protection ? po->bind_lock is held when registering and unregistering the protocol hook. fanout_release does access po->running or prot_hook. It is called from packet_release, which does hold the bind_lock when unregistering the protocol hook.
Re: Re: [PATCH] net/packet: fix race condition between fanout_add and __unregister_prot_hook
On Tue, Sep 19, 2017 at 12:09 PM, Willem de Bruijn <willemdebruijn.ker...@gmail.com> wrote: > On Tue, Sep 19, 2017 at 3:21 AM, Nixiaoming <nixiaom...@huawei.com> wrote: >> On Fri, Sep 15, 2017 at 10:46 AM, Willem de Bruijn >> >> <willemdebruijn.ker...@gmail.com> wrote: >> >>> >> >>> In case of failure we also need to unlink and free match. I >> >>> sent the following: >> >>> >> >>> http://patchwork.ozlabs.org/patch/813945/ >> >> >> >> + spin_lock(>bind_lock); >> >> + if (po->running && >> >> + match->type == type && >> >>match->prot_hook.type == po->prot_hook.type && >> >>match->prot_hook.dev == po->prot_hook.dev) { >> >> err = -ENOSPC; >> >> @@ -1761,6 +1760,13 @@ static int fanout_add(struct sock *sk, u16 id, u16 >> type_flags) >> >> err = 0; >> >> } >> >>} >> >> + spin_unlock(>bind_lock); >> >> + >> >> + if (err && !refcount_read(>sk_ref)) { >> >> +list_del(>list); >> >> +kfree(match); >> >> + } >> >> >> >> >> >> In the function fanout_add add spin_lock to protect po-> running and po-> >> fanout, >> >> then whether it should be in the function fanout_release also add spin_lock >> protection ? > > po->bind_lock is held when registering and unregistering the > protocol hook. fanout_release does access po->running or > prot_hook. whoops. does *not* access.
Re: [PATCH 2/3] selftests: actually run the various net selftests
On Tue, Sep 19, 2017 at 9:34 AM, Josef Bacik <jo...@toxicpanda.com> wrote: > On Mon, Sep 18, 2017 at 04:14:41PM -0600, Shuah Khan wrote: >> On 09/18/2017 11:32 AM, jo...@toxicpanda.com wrote: >> > From: Josef Bacik <jba...@fb.com> >> > >> > These self tests are just self contained binaries, they are not run by >> > any of the scripts in the directory. This means they need to be marked >> > with TEST_GEN_PROGS to actually be run, not TEST_GEN_FILES. >> > >> > Signed-off-by: Josef Bacik <jba...@fb.com> >> > --- >> > tools/testing/selftests/net/Makefile | 4 ++-- >> > 1 file changed, 2 insertions(+), 2 deletions(-) >> > >> > diff --git a/tools/testing/selftests/net/Makefile >> > b/tools/testing/selftests/net/Makefile >> > index 3df542c84610..45a4e77a47c4 100644 >> > --- a/tools/testing/selftests/net/Makefile >> > +++ b/tools/testing/selftests/net/Makefile >> > @@ -6,8 +6,8 @@ CFLAGS += -I../../../../usr/include/ >> > TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh >> > rtnetlink.sh >> > TEST_GEN_FILES = socket >> > TEST_GEN_FILES += psock_fanout psock_tpacket >> > -TEST_GEN_FILES += reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa >> > -TEST_GEN_FILES += reuseport_dualstack msg_zerocopy reuseaddr_conflict >> > +TEST_GEN_PROGS += reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa >> > +TEST_GEN_PROGS += reuseport_dualstack msg_zerocopy reuseaddr_conflict >> >> Hmm. I see msg_zerocopy.sh for running msg_zerocopy. msg_zerocopy should >> still stay in TEST_GEN_FILES and msg_zerocopy.sh needs to be added to >> TEST_PROGS so it runs. >> > > Actually the shell script requires arguments, it doesn't just run the test. > I'll fix this to just omit the test for now as it's not setup to run properly. > > Willem, could you follow up with a patch so that the zero copy test is run > properly the way you envision it running? You need to make sure that > > make -C tools/testing/selftests TARGETS=net run_tests > > actually runs your zero copy test the way you expect it to, otherwise it's > just > sitting there collecting dust. Thanks, Will do. In its current state, this test is really only meant to be run manually. It demonstrates the API and outputs some information on stderr. Zerocopy itself requires a two-host test. The feature is expressly disabled over loopback. But I can make this a pass/fail tests that exercises the interface and notification channel and verifies that data was copied. It will be a bit more work than just changing the default invocation of msg_zerocopy.sh
Re: [PATCH net-next 0/3] support changing steering policies in tuntap
On Thu, Sep 28, 2017 at 3:23 AM, Jason Wang <jasow...@redhat.com> wrote: > > > On 2017年09月28日 07:25, Willem de Bruijn wrote: >>>> >>>> In the future, both simple and sophisticated policy like RSS or other >>>> guest >>>> driven steering policies could be done on top. >>> >>> IMHO there should be a more practical example before adding all this >>> indirection. And it would be nice to understand why this queue selection >>> needs to be tun specific. >> >> I was thinking the same and this reminds me of the various strategies >> implemented in packet fanout. tun_cpu_select_queue is analogous to >> fanout_demux_cpu though it is tun-specific in that it requires >> tun->numqueues. > > > Right, the main idea is to introduce a way to change flow steering policy > for tun. I think fanout policy could be implemented through the API > introduced in this series. (Current flow caches based automatic steering > method is tun specific). > >> >> Fanout accrued various strategies until it gained an eBPF variant. Just >> supporting BPF is probably sufficient here, too. > > > Technically yes, but for tun, it also serve for virt. We probably still need > some hard coded policy which could be changed by guest until we can accept > an BPF program from guest I think? When would a guest choose the policy? As long as this is under control of a host user, possibly unprivileged, allowing BPF here is moot, as any user can run socket filter BPF already. Programming from the guest is indeed different. I don't fully understand that use case.
Re: [PATCH net-next RFC 5/5] vhost_net: basic tx virtqueue batched processing
> @@ -461,6 +460,7 @@ static void handle_tx(struct vhost_net *net) > struct socket *sock; > struct vhost_net_ubuf_ref *uninitialized_var(ubufs); > bool zcopy, zcopy_used; > + int i, batched = VHOST_NET_BATCH; > > mutex_lock(>mutex); > sock = vq->private_data; > @@ -475,6 +475,12 @@ static void handle_tx(struct vhost_net *net) > hdr_size = nvq->vhost_hlen; > zcopy = nvq->ubufs; > > + /* Disable zerocopy batched fetching for simplicity */ This special case can perhaps be avoided if we no longer block on vhost_exceeds_maxpend, but revert to copying. > + if (zcopy) { > + heads = Can this special case of batchsize 1 not use vq->heads? > + batched = 1; > + } > + > for (;;) { > /* Release DMAs done buffers first */ > if (zcopy) > @@ -486,95 +492,114 @@ static void handle_tx(struct vhost_net *net) > if (unlikely(vhost_exceeds_maxpend(net))) > break; > + /* TODO: Check specific error and bomb out > +* unless ENOBUFS? > +*/ > + err = sock->ops->sendmsg(sock, , len); > + if (unlikely(err < 0)) { > + if (zcopy_used) { > + vhost_net_ubuf_put(ubufs); > + nvq->upend_idx = > + ((unsigned)nvq->upend_idx - 1) % > UIO_MAXIOV; > + } > + vhost_discard_vq_desc(vq, 1); > + goto out; > + } > + if (err != len) > + pr_debug("Truncated TX packet: " > + " len %d != %zd\n", err, len); > + if (!zcopy) { > + vhost_add_used_idx(vq, 1); > + vhost_signal(>dev, vq); > + } else if (!zcopy_used) { > + vhost_add_used_and_signal(>dev, > + vq, head, 0); While batching, perhaps can also move this producer index update out of the loop and using vhost_add_used_and_signal_n. > + } else > + vhost_zerocopy_signal_used(net, vq); > + vhost_net_tx_packet(net); > + if (unlikely(total_len >= VHOST_NET_WEIGHT)) { > + vhost_poll_queue(>poll); > + goto out; > } > - vhost_discard_vq_desc(vq, 1); > - break; > - } > - if (err != len) > - pr_debug("Truncated TX packet: " > -" len %d != %zd\n", err, len); > - if (!zcopy_used) > - vhost_add_used_and_signal(>dev, vq, head, 0); > - else > - vhost_zerocopy_signal_used(net, vq); > - vhost_net_tx_packet(net); > - if (unlikely(total_len >= VHOST_NET_WEIGHT)) { > - vhost_poll_queue(>poll); > - break; This patch touches many lines just for indentation. If having to touch these lines anyway (dirtying git blame), it may be a good time to move the processing of a single descriptor code into a separate helper function. And while breaking up, perhaps another helper for setting up ubuf_info. If you agree, preferably in a separate noop refactor patch that precedes the functional changes.
Re: [PATCH net-next RFC 3/5] vhost: introduce vhost_add_used_idx()
>> > > @@ -199,6 +199,7 @@ int __vhost_get_vq_desc(struct vhost_virtqueue *vq, >> > > void vhost_discard_vq_desc(struct vhost_virtqueue *, int n); >> > > int vhost_vq_init_access(struct vhost_virtqueue *); >> > > +int vhost_add_used_idx(struct vhost_virtqueue *vq, int n); >> > > int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int >> > > len); >> > > int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem >> > > *heads, >> > >unsigned count); >> > Please change the API to hide the fact that there's an index that needs >> > to be updated. >> >> In fact, an interesting optimization on top is just call >> vhost_add_used_idx(vq, n) instead of n vhost_add_used_idx(vq, 1). That's the >> reason I leave n in the API. >> >> Thanks > > Right but you could increment some internal counter in the vq > structure then update the used index using some api > with a generic name, e.g. add_used_complete or something like this. That adds a layer of information hiding. If the same variable can be kept close to the computation in a local variable and passed directly to vhost_add_used_idx_n that is easier to follow.
Re: [PATCH net-next RFC 2/5] vhost: introduce helper to prefetch desc index
On Fri, Sep 22, 2017 at 4:02 AM, Jason Wangwrote: > This patch introduces vhost_prefetch_desc_indices() which could batch > descriptor indices fetching and used ring updating. This intends to > reduce the cache misses of indices fetching and updating and reduce > cache line bounce when virtqueue is almost full. copy_to_user() was > used in order to benefit from modern cpus that support fast string > copy. Batched virtqueue processing will be the first user. > > Signed-off-by: Jason Wang > --- > drivers/vhost/vhost.c | 55 > +++ > drivers/vhost/vhost.h | 3 +++ > 2 files changed, 58 insertions(+) > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index f87ec75..8424166d 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -2437,6 +2437,61 @@ struct vhost_msg_node *vhost_dequeue_msg(struct > vhost_dev *dev, > } > EXPORT_SYMBOL_GPL(vhost_dequeue_msg); > > +int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq, > + struct vring_used_elem *heads, > + u16 num, bool used_update) > +{ > + int ret, ret2; > + u16 last_avail_idx, last_used_idx, total, copied; > + __virtio16 avail_idx; > + struct vring_used_elem __user *used; > + int i; > + > + if (unlikely(vhost_get_avail(vq, avail_idx, >avail->idx))) { > + vq_err(vq, "Failed to access avail idx at %p\n", > + >avail->idx); > + return -EFAULT; > + } > + last_avail_idx = vq->last_avail_idx & (vq->num - 1); > + vq->avail_idx = vhost16_to_cpu(vq, avail_idx); > + total = vq->avail_idx - vq->last_avail_idx; > + ret = total = min(total, num); > + > + for (i = 0; i < ret; i++) { > + ret2 = vhost_get_avail(vq, heads[i].id, > + >avail->ring[last_avail_idx]); > + if (unlikely(ret2)) { > + vq_err(vq, "Failed to get descriptors\n"); > + return -EFAULT; > + } > + last_avail_idx = (last_avail_idx + 1) & (vq->num - 1); > + } This is understandably very similar to the existing logic in vhost_get_vq_desc. Can that be extracted to a helper to avoid code duplication? Perhaps one helper to update vq->avail_idx and return num, and another to call vhost_get_avail one or more times. > + > + if (!used_update) > + return ret; > + > + last_used_idx = vq->last_used_idx & (vq->num - 1); > + while (total) { > + copied = min((u16)(vq->num - last_used_idx), total); > + ret2 = vhost_copy_to_user(vq, > + >used->ring[last_used_idx], > + [ret - total], > + copied * sizeof(*used)); > + > + if (unlikely(ret2)) { > + vq_err(vq, "Failed to update used ring!\n"); > + return -EFAULT; > + } > + > + last_used_idx = 0; > + total -= copied; > + } This second part seems unrelated and could be a separate function? Also, no need for ret2 and double assignment "ret = total =" if not modifying total in the the second loop: for (i = 0; i < total; ) { ... i += copied; }
Re: [PATCH net-next 0/3] support changing steering policies in tuntap
>> In the future, both simple and sophisticated policy like RSS or other guest >> driven steering policies could be done on top. > > IMHO there should be a more practical example before adding all this > indirection. And it would be nice to understand why this queue selection > needs to be tun specific. I was thinking the same and this reminds me of the various strategies implemented in packet fanout. tun_cpu_select_queue is analogous to fanout_demux_cpu though it is tun-specific in that it requires tun->numqueues. Fanout accrued various strategies until it gained an eBPF variant. Just supporting BPF is probably sufficient here, too.
Re: [PATCH] packet: Don't write vnet header beyond end of buffer
On Mon, Aug 28, 2017 at 2:29 PM, Benjamin Poirier <bpoir...@suse.com> wrote: > ... which may happen with certain values of tp_reserve and maclen. > > Fixes: 58d19b19cd99 ("packet: vnet_hdr support for tpacket_rcv") > Signed-off-by: Benjamin Poirier <bpoir...@suse.com> > Cc: Willem de Bruijn <will...@google.com> Acked-by: Willem de Bruijn <will...@google.com> Thanks for fixing this, Benjamin.
Re: [PATCH 1/2] [net-next] packet: clarify timestamp overflow
On Mon, Nov 27, 2017 at 11:17 AM, Arnd Bergmann <a...@arndb.de> wrote: > The memory mapped packet socket data structure in version 1 through 3 > all contain 32-bit second values for the packet time stamps, which makes > them suffer from the overflow of time_t in y2038 or y2106 (depending > on whether user space interprets the value as signed or unsigned). > > The implementation uses the deprecated getnstimeofday() function. > > In order to get rid of that, this changes the code to use > ktime_get_real_ts64() as a replacement, documenting the nature of the > overflow. As long as the user applications treat the timestamps as > unsigned, or only use the difference between timestamps, they are > fine, and changing the timestamps to 64-bit wouldn't require a more > invasive user space API change. > > Note: a lot of other APIs suffer from incompatible structures when > time_t gets redefined to 64-bit in 32-bit user space, but this one > does not. > > Signed-off-by: Arnd Bergmann <a...@arndb.de> This change to avoid the deprecated internal interface looks great to me. Acked-by: Willem de Bruijn <will...@google.com>
Re: [PATCH 2/2] [RFC] packet: experimental support for 64-bit timestamps
On Mon, Nov 27, 2017 at 11:59 AM, Jiri Pirkowrote: > Mon, Nov 27, 2017 at 05:19:25PM CET, a...@arndb.de wrote: >>I tried to figure out what it would take to do a version 4 mmap packet >>socket interface to completely avoid the y2106 overflow problem. This is >>what I came up with, reusing most of the v3 code, except for the parts >>where we access the timestamps. >> >>For kselftest, I'm adding support for testing v4 in addition to v1-v3, >>but the test currently does not look at the timestamps, so it won't >>check that the timestamp format actually works as intended, only that >>I didn't break the parts that worked in the v3 selftest. >> >>Overall, this is more of a mess than I expected, so it's probably not >>worth doing a v4 format just for the timestamp, but the patch can serve >>as a reference for anyone that needs a new format for other reasons and >>fixes this along with the other changes. >> >>Signed-off-by: Arnd Bergmann >>--- > > [...] > > >>@@ -250,7 +269,8 @@ struct tpacket_block_desc { >> enum tpacket_versions { >> TPACKET_V1, >> TPACKET_V2, >>- TPACKET_V3 >>+ TPACKET_V3, >>+ TPACKET_V4, > > I wonder with how many versions are we going to eventually end up with :O There already is an effort to come up with a new AF_PACKET V4 [1]. We should make sure that any new interface does not have the Y2038/Y2106 issue. But, if a new version is being developed and that subsumes all existing use cases, then there probably is no need for another version that is a very small diff to V3. If adding support for existing applications is useful, another approach would be to add a new socket option that changes the semantics for the two u32 fields in each of V1, V2 and V3 to hold nsec. Add a single check after filling in those structs whether the option is set and, if so, overwrite the two fields. [1] https://lwn.net/Articles/737947/
Re: [PATCH] [RFC v2] packet: experimental support for 64-bit timestamps
On Tue, Nov 28, 2017 at 8:14 AM, Arnd Bergmannwrote: > This is a second attempt to allow 64-bit timestamps in packet sockets, Thanks for coding up this variant. > The implementation is fairly straightforward, but I'm less sure about the > interface. Using SOF_TIMESTAMPING_* flags in PACKET_TIMESTAMP is a bit > odd already since most of the other flags make no sense here. Adding two > more flags that only make sense for packet sockets but not the normal > SO_TIMESTAMPING option on other sockets makes this even more confusing. Agreed. Unfortunately, we're already stuck with SOL_PACKET/PACKET_TIMESTAMP accepting SOF_TIMESTAMPING_RAW_HARDWARE. Perhaps we can define a new PF_PACKET specific enum where the equivalent option has the same value, so is backwards compatible: enum { PACKET_TIMESTAMP_ORIG = 0, PACKET_TIMESTAMP_ZERO = 1 << 0, PACKET_TIMESTAMP_NS64 = 1 << 1, PACKET_TIMESTAMP_HW = 1 << 6 }; and BUILD_BUG_ON(PACKET_TIMESTAMP_RAW != SOF_TIMESTAMPING_RAW_HARDWARE) to document the dependency. At high level, the code looks great to me, itself.
Re: [PATCH 2/2] [RFC] packet: experimental support for 64-bit timestamps
On Tue, Nov 28, 2017 at 3:46 AM, Arnd Bergmannwrote: > On Tue, Nov 28, 2017 at 8:04 AM, Björn Töpel wrote: >> 2017-11-27 21:51 GMT+01:00 Arnd Bergmann : >> [...] There already is an effort to come up with a new AF_PACKET V4 [1]. We should make sure that any new interface does not have the Y2038/Y2106 issue. But, if a new version is being developed and that subsumes all existing use cases, then there probably is no need for another version that is a very small diff to V3. >>> >>> Ah, perfect, that's good timing. Adding Björn to Cc here. >>> >> >> Unfortunately, for the Y2038/Y2106 cases, we'll be (as a result of >> netdevconf discussions) moving the AF_PACKET V4 implementation to a >> separate, new, address/packet family. > > Ok, I see. Does it matter whether the replacement is a new version or a new packet family? If adding support for existing applications is useful, another approach would be to add a new socket option that changes the semantics for the two u32 fields in each of V1, V2 and V3 to hold nsec. Add a single check after filling in those structs whether the option is set and, if so, overwrite the two fields. [1] https://lwn.net/Articles/737947/ >>> >>> I don't think that's necessary. As long as the V4 capabilities are a >>> superset of V1-V3, we should be able to just require all users to >>> move to V4 (or later) in the next 89 years, and make sure that they >>> use unsigned seconds if they care about 2038. >>> >> >> Given that V4 wont be around for AF_PACKET -- at least not in the >> shape of our patches -- Willem's suggestion is probably a good way >> forward. > > That leaves one question: should we do that now, or wait until some > other reason for a V4 comes up? I don't mind creating another > patch for this, just want to get a feeling of whether the API clutter > is worth it when we have a way out that works until y2106 (at > which point we run into other problems as well). I don't expect that we'll have another packet version independent from the work that Björn is doing. The choice to implement using a new packet family is given by the complexity of the existing code, especially the various locking mechanisms. >From that point of view, and if we want to offer a Y2106 proof AF_PACKET independent from the above, no reason to wait.
Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps
On Tue, Nov 28, 2017 at 3:32 PM, Arnd Bergmann <a...@arndb.de> wrote: > As I noticed in my previous patch to remove the 'timespec' usage in > the packet socket, the timestamps in the packet socket are slightly > inefficient as they convert a nanosecond value into seconds/nanoseconds > or seconds/microseconds. > > This adds two new socket options for the timestamp to resolve that: > > PACKET_SKIPTIMESTAMP sets a flag to indicate whether to generate > timestamps at all. When this is set, all timestamps are hardcoded to > zero, which saves a few cycles for the conversion and the access of > the hardware clocksource. The idea was taken from pktgen, which has an > F_NO_TIMESTAMP option for the same purpose. > > PACKET_TIMESTAMP_NS64 changes the interpretation of the time stamp fields: > instead of having 32 bits for seconds plus 32 bits for nanoseconds or > microseconds, we now always send down 64 bits worth of nanoseconds when > this flag is set. > > Link: https://patchwork.kernel.org/patch/10077199/ > Suggested-by: Willem de Bruijn <willemdebruijn.ker...@gmail.com> > Signed-off-by: Arnd Bergmann <a...@arndb.de> This works. Another option would be to add a PACKET_TIMESTAMP_EX with the semantics we discussed previously + fail hard when any undefined bits are set. I don't feel strong either way, we don't intend to extend further. If taking this approach, it might be good to split into separate patches, one for each flag? > -static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 > *ts, > +static __u32 tpacket_get_timestamp(struct sk_buff *skb, __u32 *hi, __u32 *lo, >unsigned int flags) Argument flags is no longer used. > { > + struct packet_sock *po = pkt_sk(skb->sk); > struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb); > + ktime_t stamp; > + u32 type; > + > + if (po->tp_skiptstamp) > + return 0; > > if (shhwtstamps && > - (flags & SOF_TIMESTAMPING_RAW_HARDWARE) && > - ktime_to_timespec64_cond(shhwtstamps->hwtstamp, ts)) > - return TP_STATUS_TS_RAW_HARDWARE; > + (po->tp_tstamp & SOF_TIMESTAMPING_RAW_HARDWARE) && > + shhwtstamps->hwtstamp) { > + stamp = shhwtstamps->hwtstamp; > + type = TP_STATUS_TS_RAW_HARDWARE; > + } else if (skb->tstamp) { > + stamp = skb->tstamp; > + type = TP_STATUS_TS_SOFTWARE; > + } else { > + return 0; > + } > > - if (ktime_to_timespec64_cond(skb->tstamp, ts)) > - return TP_STATUS_TS_SOFTWARE; > + if (po->tp_tstamp_ns64) { > + __u64 ns = ktime_to_ns(stamp); > > - return 0; > + *hi = upper_32_bits(ns); > + *lo = lower_32_bits(ns); > + } else { > + struct timespec64 ts = ktime_to_timespec64(stamp); > + > + *hi = ts.tv_sec; > + if (po->tp_version == TPACKET_V1) Very minor: may want to invert test to make newer the protocols the likely branch. > static __u32 __packet_set_timestamp(struct packet_sock *po, void *frame, > struct sk_buff *skb) > { > union tpacket_uhdr h; > - struct timespec64 ts; > - __u32 ts_status; > + __u32 ts_status, hi, lo; > > - if (!(ts_status = tpacket_get_timestamp(skb, , po->tp_tstamp))) > + if (!(ts_status = tpacket_get_timestamp(skb, , , > po->tp_tstamp))) > return 0; > > h.raw = frame; > - /* > -* versions 1 through 3 overflow the timestamps in y2106, since they > -* all store the seconds in a 32-bit unsigned integer. > -* If we create a version 4, that should have a 64-bit timestamp, > -* either 64-bit seconds + 32-bit nanoseconds, or just 64-bit > -* nanoseconds. > -*/ Probably no need to introduce this in patch 1/2 when removing it in 2/2. > @@ -2191,8 +2226,8 @@ static int tpacket_rcv(struct sk_buff *skb, struct > net_device *dev, > unsigned long status = TP_STATUS_USER; > unsigned short macoff, netoff, hdrlen; > struct sk_buff *copy_skb = NULL; > - struct timespec64 ts; > __u32 ts_status; > + __u32 hi, lo; since this function is not time-specific, the context of hi and lo is not immediately obvious here. tstamp_hi, tstamp_lo? Or even __u32 tstamp[2] and have tpacket_get_timestamp and packet_get_time take one fewer argument.
Re: [PATCH] net: openvswitch: datapath: fix data type in queue_gso_packets
On Sat, Nov 25, 2017 at 2:14 PM, Gustavo A. R. Silva <garsi...@embeddedor.com> wrote: > gso_type is being used in binary AND operations together with SKB_GSO_UDP. > The issue is that variable gso_type is of type unsigned short and > SKB_GSO_UDP expands to more than 16 bits: > > SKB_GSO_UDP = 1 << 16 > > this makes any binary AND operation between gso_type and SKB_GSO_UDP to > be always zero, hence making some code unreachable and likely causing > undesired behavior. > > Fix this by changing the data type of variable gso_type to unsigned int. > > Addresses-Coverity-ID: 1462223 > Fixes: 0c19f846d582 ("net: accept UFO datagrams from tuntap and packet") > Signed-off-by: Gustavo A. R. Silva <garsi...@embeddedor.com> Acked-by: Willem de Bruijn <will...@google.com> Good catch, thanks! The issue here is that I brought back SKB_GSO_UDP at the end of the list at 1 << 16 to avoid renaming of all that used to follow, while it used to be defined as 1 << 1. The skb_shinfo(skb)->gso_type field itself has been expanded as of v4.12 in commit 7f564528a480 ("skbuff: Extend gso_type to unsigned int."). A quick scan shows a few nic drivers that also still cast to unsigned short: bnxt, atl1c and qede. Since UFO hardware offload no longer exists, this is safe wrt this patch. And it is fine for older kernels as no driver supported the previous entry at 1 << 16, SKB_GSO_ESP. But it is fragile wrt follow-on offloads. Probably a net-next patch. The only other likely issue I spotted with the longer gso_type is in trace events. net_dev_start_xmit and net_dev_rx_verbose_template both export as u16.
Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps
> Thanks for the review! Any suggestions for how to do the testing? If you have > existing test cases, could you give my next version a test run to see if there > are any regressions and if the timestamps work as expected? > > I see that there are test cases in tools/testing/selftests/net/, but none > of them seem to use the time stamps so far, and I'm not overly familiar > with how it works in the details to extend it in a meaningful way. I could not find any good tests for this interface, either. The only user of the interface I found was a little tool I wrote a few years ago that compares timestamps at multiple points in the transmit path for latency measurement [1]. But it may be easier to just write a new test under tools/testing/selftests/net for this purpose. I can help with that, too, if you want. [1] https://github.com/wdebruij/kerneltools/blob/master/tools/tcplate/tcplate.c
Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps
On Wed, Nov 29, 2017 at 8:39 PM, Willem de Bruijn <willemdebruijn.ker...@gmail.com> wrote: > On Wed, Nov 29, 2017 at 3:06 PM, Arnd Bergmann <a...@arndb.de> wrote: >> On Wed, Nov 29, 2017 at 5:51 PM, Willem de Bruijn >> <willemdebruijn.ker...@gmail.com> wrote: >>>> Thanks for the review! Any suggestions for how to do the testing? If you >>>> have >>>> existing test cases, could you give my next version a test run to see if >>>> there >>>> are any regressions and if the timestamps work as expected? >>>> >>>> I see that there are test cases in tools/testing/selftests/net/, but none >>>> of them seem to use the time stamps so far, and I'm not overly familiar >>>> with how it works in the details to extend it in a meaningful way. >>> >>> I could not find any good tests for this interface, either. The only >>> user of the interface I found was a little tool I wrote a few years >>> ago that compares timestamps at multiple points in the transmit >>> path for latency measurement [1]. But it may be easier to just write >>> a new test under tools/testing/selftests/net for this purpose. I can >>> help with that, too, if you want. >> >> Thanks, that would be great! > > I'll reply to this thread with git send-email with an extension to > tools/testing/selftests/net/psock_tpacket.c. It appears that it did not end up in this thread. At least not when using gmail threading. Patch at http://patchwork.ozlabs.org/patch/842854/
Re: [PATCH] [RFC v3] packet: experimental support for 64-bit timestamps
On Wed, Nov 29, 2017 at 3:06 PM, Arnd Bergmann <a...@arndb.de> wrote: > On Wed, Nov 29, 2017 at 5:51 PM, Willem de Bruijn > <willemdebruijn.ker...@gmail.com> wrote: >>> Thanks for the review! Any suggestions for how to do the testing? If you >>> have >>> existing test cases, could you give my next version a test run to see if >>> there >>> are any regressions and if the timestamps work as expected? >>> >>> I see that there are test cases in tools/testing/selftests/net/, but none >>> of them seem to use the time stamps so far, and I'm not overly familiar >>> with how it works in the details to extend it in a meaningful way. >> >> I could not find any good tests for this interface, either. The only >> user of the interface I found was a little tool I wrote a few years >> ago that compares timestamps at multiple points in the transmit >> path for latency measurement [1]. But it may be easier to just write >> a new test under tools/testing/selftests/net for this purpose. I can >> help with that, too, if you want. > > Thanks, that would be great! I'll reply to this thread with git send-email with an extension to tools/testing/selftests/net/psock_tpacket.c. I can resend that for submission after your feature is merged (as it depends on it) or feel free to include it in your patchset. The test currently fails for the ns64 case. I probably did not convert correctly, but have to leave the office and want to send what I have. Two other comments: the test crashed the kernel due to a NULL ptr in tpacket_get_timestamp. We cannot rely on skb->sk being set to the packet socket here. And assignment to bitfield requires a cast to boolean. diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index f55f330ab547..e9decc7fc5c3 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -439,9 +439,9 @@ static int __packet_get_status(struct packet_sock *po, void *frame) } } -static __u32 tpacket_get_timestamp(struct sk_buff *skb, __u32 *hi, __u32 *lo) +static __u32 tpacket_get_timestamp(struct packet_sock *po, struct sk_buff *skb, + __u32 *hi, __u32 *lo) { - struct packet_sock *po = pkt_sk(skb->sk); struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb); ktime_t stamp; u32 type; @@ -508,7 +508,7 @@ static __u32 __packet_set_timestamp(struct packet_sock *po, void *frame, union tpacket_uhdr h; __u32 ts_status, hi, lo; - if (!(ts_status = tpacket_get_timestamp(skb, , ))) + if (!(ts_status = tpacket_get_timestamp(po, skb, , ))) return 0; h.raw = frame; @@ -2352,7 +2352,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, skb_copy_bits(skb, 0, h.raw + macoff, snaplen); - if (!(ts_status = tpacket_get_timestamp(skb, _hi, _lo))) + if (!(ts_status = tpacket_get_timestamp(po, skb, _hi, _lo))) packet_get_time(po, _hi, _lo); status |= ts_status; @@ -3835,7 +3835,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv if (copy_from_user(, optval, sizeof(val))) return -EFAULT; - po->tp_skiptstamp = val; + po->tp_skiptstamp = !!val; return 0; } case PACKET_TIMESTAMP_NS64: @@ -3847,7 +3847,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv if (copy_from_user(, optval, sizeof(val))) return -EFAULT; - po->tp_tstamp_ns64 = val; + po->tp_tstamp_ns64 = !!val; return 0; } case PACKET_FANOUT:
Re: [PATCH net-next V3] tun: add eBPF based queue selection method
On Mon, Dec 4, 2017 at 4:31 AM, Jason Wangwrote: > This patch introduces an eBPF based queue selection method. With this, > the policy could be offloaded to userspace completely through a new > ioctl TUNSETSTEERINGEBPF. > > Signed-off-by: Jason Wang > --- > +static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb) > +{ > + struct tun_steering_prog *prog; > + u16 ret = 0; > + > + prog = rcu_dereference(tun->steering_prog); > + if (prog) > + ret = bpf_prog_run_clear_cb(prog->prog, skb); This dereferences tun->steering_prog for a second time. It is safe in this load balancing case to assign a few extra packets to queue 0. But the issue can also be avoided by replacing the function with a direct call in tun_net_xmit: struct tun_steering_prog *s = rcu_dereference(tun->steering_prog); if (s) ret = bpf_prog_run_clear_cb(s->prog, skb) % tun->numqueues; > /* Net device start xmit */ > -static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev) > +static void tun_automq_xmit(struct tun_struct *tun, struct sk_buff *skb) > { > - struct tun_struct *tun = netdev_priv(dev); > - int txq = skb->queue_mapping; > - struct tun_file *tfile; > - u32 numqueues = 0; > - > - rcu_read_lock(); > - tfile = rcu_dereference(tun->tfiles[txq]); > - numqueues = READ_ONCE(tun->numqueues); > - > - /* Drop packet if interface is not attached */ > - if (txq >= numqueues) > - goto drop; > - > #ifdef CONFIG_RPS > - if (numqueues == 1 && static_key_false(_needed)) { > + if (tun->numqueues == 1 && static_key_false(_needed)) { > /* Select queue was not called for the skbuff, so we extract > the > * RPS hash and save it into the flow_table here. > */ > @@ -969,6 +986,26 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, > struct net_device *dev) > } > } > #endif > +} > + > +/* Net device start xmit */ > +static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev) > +{ > + struct tun_struct *tun = netdev_priv(dev); > + int txq = skb->queue_mapping; > + struct tun_file *tfile; > + u32 numqueues = 0; > + > + rcu_read_lock(); > + tfile = rcu_dereference(tun->tfiles[txq]); > + numqueues = READ_ONCE(tun->numqueues); Now tun->numqueues is read twice, reversing commit fa35864e0bb7 ("tuntap: Fix for a race in accessing numqueues"). I don't see anything left that would cause a divide by zero after the relevant code was converted from divide to multiple and subsequently even removed. But if it's safe to read multiple times, might as well remove the READ_ONCE. > @@ -1551,7 +1588,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, > struct tun_file *tfile, > int copylen; > bool zerocopy = false; > int err; > - u32 rxhash; > + u32 rxhash = 0; > int skb_xdp = 1; > bool frags = tun_napi_frags_enabled(tun); > > @@ -1739,7 +1776,10 @@ static ssize_t tun_get_user(struct tun_struct *tun, > struct tun_file *tfile, > rcu_read_unlock(); > } > > - rxhash = __skb_get_hash_symmetric(skb); > + rcu_read_lock(); > + if (!rcu_dereference(tun->steering_prog)) > + rxhash = __skb_get_hash_symmetric(skb); > + rcu_read_unlock(); > > if (frags) { > /* Exercise flow dissector code path. */ > @@ -1783,7 +1823,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, > struct tun_file *tfile, > u64_stats_update_end(>syncp); > put_cpu_ptr(stats); > > - tun_flow_update(tun, rxhash, tfile); > + if (rxhash) > + tun_flow_update(tun, rxhash, tfile); > + Nit: zero is a valid hash? In which case, an int64_t initialized to -1 is the safer check.
Re: [B.A.T.M.A.N.] [RFC v2 2/6] batman-adv: Rename batman-adv.h to batadv_genl.h
On Fri, Dec 15, 2017 at 12:18 PM, Sven Eckelmann <sven.eckelm...@openmesh.com> wrote: > On Freitag, 15. Dezember 2017 11:57:55 CET Willem de Bruijn wrote: >> > No, this is also bad because batman_adv.h is MIT license and packet.h is >> > GPL-2. So what other name would you suggest for packet.h? >> > batman_adv_packet.h? >> >> Sure, that sounds great. Thanks. > > Really? Isn't include/uapi/linux/batman_adv_packet.h looking like an accident > which never should have had happened? My only point was that renaming and modifying existing uapi files can break userspace compilation. As long as the existing files are not changed, I don't have a strong opinion on naming for new files.
Re: [B.A.T.M.A.N.] [RFC v2 2/6] batman-adv: Rename batman-adv.h to batadv_genl.h
On Fri, Dec 15, 2017 at 6:48 AM, Sven Eckelmann <sven.eckelm...@openmesh.com> wrote: > On Freitag, 15. Dezember 2017 11:32:05 CET Sven Eckelmann wrote: >> On Mittwoch, 6. Dezember 2017 11:58:14 CET Willem de Bruijn wrote: >> [...] >> > >> > --- >> > >> > MAINTAINERS| 2 +- >> > >> > include/uapi/linux/{batman_adv.h => batadv_genl.h} | 6 +++--- >> > >> >> > >> This and the previous patch changes uapi. That might break userspace >> > >> applications that rely on it. >> > > >> > > I am not aware of any application because all (alfred, batctl and some >> > > gluon >> > > integration) of them currently ship their own copy because distribution >> > > didn't >> > > catch up. And this is also the reason why I want to do it now - not >> > > later. >> > >> > That assumes that you know all applications, including those not >> > publicly available. It may be true in this instance, but it is not >> > possible to be certain. >> >> I've just talked with Simon. Because you have a problem with these two >> changes, he suggested that I should drop these two patches and merge packet.h >> with the uapi batadv genl header batman_adv.h > > No, this is also bad because batman_adv.h is MIT license and packet.h is > GPL-2. So what other name would you suggest for packet.h? batman_adv_packet.h? Sure, that sounds great. Thanks.
Re: [RFC v3 0/4] flow_dissector: Provide basic batman-adv unicast handling
On Fri, Dec 15, 2017 at 1:23 PM, Sven Eckelmann <sven.eckelm...@openmesh.com> wrote: > Hi, > > we are currently starting to use batman-adv as mesh protocol on multicore > embedded devices. These usually don't have a lot of CPU power per core but > are reasonable fast when using multiple cores. > > It was noticed that sending was working very well but receiving was > basically only using on CPU core per neighbor. The reason for that is > format of the (normal) incoming packet: > > ++ > | ip(v6)hdr | > ++ > | inner ethhdr | > ++ > | batadv unicast hdr | > ++ > | outer ethhdr | > ++ > > The flow dissector will therefore stop after parsing the outer ethernet > header and will not parse the actual ipv(4|6)/... header of the packet. Our > assumption was now that it would help us to add minimal support to the flow > dissector to jump over the batman-adv unicast and inner ethernet header > (like in gre ETH_P_TEB). The patch was implemented in a slightly hacky > way [1] and the results looked quite promising. > > I didn't get any feedback how the files should actually be named and I am > not really happy with the current names - so please feel free to propose > better names. > > The discussion of the RFC v2 can be found in the related patches of > https://patchwork.ozlabs.org/cover/844783/ > > > Changes in v3: > == > > * removed change of uapi/linux/batman_adv.h to uapi/linux/batadv_genl.h > - requested by Willem de Bruijn <willemdebruijn.ker...@gmail.com> > * removed naming fixes for enums/defines in uapi/linux/batadv_genl.h > - requested by Willem de Bruijn <willemdebruijn.ker...@gmail.com> > * renamed uapi/linux/batadv.h to uapi/linux/batadv_packet.h > * moved batadv dissector functionality in own function > - requested by Tom Herbert <t...@herbertland.com> > * added support for flags FLOW_DISSECTOR_F_STOP_AT_ENCAP and > FLOW_DIS_ENCAPSULATION > - requested by Willem de Bruijn <willemdebruijn.ker...@gmail.com> Thanks for making these changes. The flow dissection looks good to me. One possible issue is that this exposes kernel fixed width types u8/u16/.. to userspace. For posix compatibility reasons, uapi headers use the variant with underscores.
Re: [PATCH net-next V2 3/3] tun: add eBPF based queue selection method
On Tue, Oct 31, 2017 at 7:32 PM, Jason Wangwrote: > This patch introduces an eBPF based queue selection method based on > the flow steering policy ops. Userspace could load an eBPF program > through TUNSETSTEERINGEBPF. This gives much more flexibility compare > to simple but hard coded policy in kernel. > > Signed-off-by: Jason Wang > --- > +static int tun_set_steering_ebpf(struct tun_struct *tun, void __user *data) > +{ > + struct bpf_prog *prog; > + u32 fd; > + > + if (copy_from_user(, data, sizeof(fd))) > + return -EFAULT; > + > + prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER); If the idea is to allow guests to pass BPF programs down to the host, you may want to define a new program type that is more restrictive than socket filter. The external functions allowed for socket filters (sk_filter_func_proto) are relatively few (compared to, say, clsact), but may still leak host information to a guest. More importantly, guest security considerations limits how we can extend socket filters later.
Re: [PATCH net-next V2 1/3] tun: abstract flow steering logic
On Tue, Oct 31, 2017 at 7:32 PM, Jason Wangwrote: > tun now use flow caches based automatic queue steering method. This > may not suffice all user cases. To extend it to be able to use more > flow steering policy, this patch abstracts flow steering logic into > tun_steering_ops, then we can declare and use different methods in > the future. > Signed-off-by: Jason Wang > --- > drivers/net/tun.c | 85 > +-- > 1 file changed, 63 insertions(+), 22 deletions(-) > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index ea29da9..bff6259 100644 The previous RFC enabled support for multiple pluggable steering policies. But as all can be implemented in BPF and we only plan to support an eBPF policy besides the legacy one, this patch is no longer needed. We can save a few indirect function calls.
Re: [PATCH net-next V2 2/3] tun: introduce ioctls to set and get steering policies
On Tue, Oct 31, 2017 at 7:32 PM, Jason Wangwrote: > This patch introduces new ioctl for change packet steering policy for > tun. Only automatic flow steering is supported, more policies will > come. > > Signed-off-by: Jason Wang > --- > drivers/net/tun.c | 35 ++- > include/uapi/linux/if_tun.h | 7 +++ > 2 files changed, 41 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index bff6259..ab109ff 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -122,7 +122,8 @@ do { > \ > #define TUN_VNET_BE 0x4000 > > #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \ > - IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS) > + IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS | \ > + IFF_MULTI_STEERING) > > #define GOODCOPY_LEN 128 > > @@ -2516,6 +2517,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned > int cmd, > unsigned int ifindex; > int le; > int ret; > + unsigned int steering; > > if (cmd == TUNSETIFF || cmd == TUNSETQUEUE || _IOC_TYPE(cmd) == > SOCK_IOC_TYPE) { > if (copy_from_user(, argp, ifreq_len)) > @@ -2774,6 +2776,37 @@ static long __tun_chr_ioctl(struct file *file, > unsigned int cmd, > ret = 0; > break; > > + case TUNSETSTEERING: > + ret = -EFAULT; > + if (copy_from_user(, argp, sizeof(steering))) > + break; > + ret = 0; > + switch (steering) { > + case TUN_STEERING_AUTOMQ: > + tun->steering_ops = _automq_ops; > + break; > + default: > + ret = -EFAULT; > + } > + break; > + > + case TUNGETSTEERING: > + ret = 0; > + if (tun->steering_ops == _automq_ops) > + steering = TUN_STEERING_AUTOMQ; > + else > + BUG(); > + if (copy_to_user(argp, , sizeof(steering))) > + ret = -EFAULT; > + break; > + > + case TUNGETSTEERINGFEATURES: > + ret = 0; > + steering = TUN_STEERING_AUTOMQ; > + if (copy_to_user(argp, , sizeof(steering))) > + ret = -EFAULT; > + break; > + Similar to my comment in patch 1/3: if only eBPF is used, these calls can be avoided in favor of only TUNSETSTEERINGEBPF from patch 3/3.