Re: [PATCH] vhost: introduce O(1) vq metadata cache

2016-12-14 Thread kbuild test robot
Hi Jason,

[auto build test WARNING on vhost/linux-next]
[also build test WARNING on v4.9 next-20161214]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Jason-Wang/vhost-introduce-O-1-vq-metadata-cache/20161214-160153
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
config: i386-randconfig-x005-201650 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

   drivers/vhost/vhost.c: In function 'vhost_vq_meta_fetch':
>> drivers/vhost/vhost.c:719:9: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
 return (void *)(node->userspace_addr + (u64)addr - node->start);
^

vim +719 drivers/vhost/vhost.c

   703 node->start,
   704 node->size))
   705  return 0;
   706  }
   707  return 1;
   708  }
   709  
   710  static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
   711                                                 u64 addr, unsigned int size,
   712                                                 int type)
   713  {
   714  const struct vhost_umem_node *node = vq->meta_iotlb[type];
   715  
   716  if (!node)
   717  return NULL;
   718  
 > 719  return (void *)(node->userspace_addr + (u64)addr - node->start);
   720  }
   721  
   722  /* Can we switch to this memory table? */
   723  /* Caller should have device mutex but not vq mutex */
   724  static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
   725                              int log_all)
   726  {
   727  int i;
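
For reference, the V2 of this patch later in this digest silences the warning by casting through uintptr_t, so the integer-to-pointer conversion happens at pointer width on 32-bit builds:

    return (void *)(uintptr_t)(node->userspace_addr + addr - node->start);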

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: stmmac driver...

2016-12-14 Thread Jie Deng
Hi David,

 Giuseppe and Alexandre,

 There are a lot of patches and discussions happening around the stmmac
 driver lately and both of you are listed as the maintainers.

 I really need prompt and conclusive reviews of these patch submissions
 from you, and participation in all discussions about the driver.
>>>
>>> yes we are trying to do the best.
>>>
 Otherwise I have only three things I can do: 1) let the patches rot in
 patchwork for days 2) trust that the patches are sane and fit your
 desires and goals and just apply them or 3) reject them since they
 aren't being reviewed properly.
>>>
>>> at this stage, I think the best is: (3).
>> I think the patches David mentioned also included XLGMAC. He sent this email
>> before I explained that QoS and XLGMAC are different IPs. Do you mind if we
>> do XLGMAC development under drivers/net/ethernet/synopsys/ ? I think we have
>> no conflict since we will keep QoS development in stmmac.
>
> Great. Many thanks for the clarification :-)
>
> Regards
> Peppe
>
Do you agree that we do XLGMAC development under drivers/net/ethernet/synopsys/
in the future?
There is no conflict of interest since this is a new IP without a driver yet. As
you can see, there are several drivers for QoS (GMAC) and several drivers for
XGMAC; we want to avoid that situation for the new IP, XLGMAC.

Regards,
Jie



Re: [PATCH v2 net-next 1/2] phy: add phy fixup unregister functions

2016-12-14 Thread Dongpo Li
Hi all,

On 2016/12/8 4:26, woojung@microchip.com wrote:
> From: Woojung Huh 
> 
> Add functions to unregister phy fixup for modules.
> 
> int phy_unregister_fixup(const char *bus_id, u32 phy_uid, u32 phy_uid_mask)
>   Unregister phy fixup from phy_fixup_list per bus_id, phy_uid &
>   phy_uid_mask
> 
> int phy_unregister_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask)
>   Unregister phy fixup from phy_fixup_list.
>   Use it for fixup registered by phy_register_fixup_for_uid()
> 
> int phy_unregister_fixup_for_id(const char *bus_id)
>   Unregister phy fixup from phy_fixup_list.
>   Use it for fixup registered by phy_register_fixup_for_id()
> 
> Signed-off-by: Woojung Huh 
> ---
>  Documentation/networking/phy.txt |  9 +++++++++
>  drivers/net/phy/phy_device.c     | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/phy.h              |  4 ++++
>  3 files changed, 60 insertions(+)
> 
> diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
> index e017d93..16f90d8 100644
> --- a/Documentation/networking/phy.txt
> +++ b/Documentation/networking/phy.txt
> @@ -407,6 +407,15 @@ Board Fixups
>   The stubs set one of the two matching criteria, and set the other one to
>   match anything.
>  
> + When phy_register_fixup() or *_for_uid()/*_for_id() is called by a module,
> + unregistering the fixup and freeing the allocated memory are required.
> +
> + Call one of the following functions before unloading the module:
> +
> + int phy_unregister_fixup(const char *phy_id, u32 phy_uid, u32 phy_uid_mask);
> + int phy_unregister_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask);
> + int phy_unregister_fixup_for_id(const char *phy_id);
> +
>  Standards
>  
>   IEEE Standard 802.3: CSMA/CD Access Method and Physical Layer 
> Specifications, Section Two:
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index aeaf1bc..32fa7c7 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -235,6 +235,53 @@ int phy_register_fixup_for_id(const char *bus_id,
>  }
>  EXPORT_SYMBOL(phy_register_fixup_for_id);
>  
> +/**
> + * phy_unregister_fixup - remove a phy_fixup from the list
> + * @bus_id: A string matches fixup->bus_id (or PHY_ANY_ID) in phy_fixup_list
> + * @phy_uid: A phy id matches fixup->phy_uid (or PHY_ANY_UID) in phy_fixup_list
> + * @phy_uid_mask: Applied to phy_uid and fixup->phy_uid before comparison
> + */
> +int phy_unregister_fixup(const char *bus_id, u32 phy_uid, u32 phy_uid_mask)
> +{
> + struct list_head *pos, *n;
> + struct phy_fixup *fixup;
> + int ret;
> +
> + ret = -ENODEV;
> +
> + mutex_lock(&phy_fixup_lock);
> + list_for_each_safe(pos, n, &phy_fixup_list) {
> + fixup = list_entry(pos, struct phy_fixup, list);
> +
> + if ((!strcmp(fixup->bus_id, bus_id)) &&
> + ((fixup->phy_uid & phy_uid_mask) ==
> +  (phy_uid & phy_uid_mask))) {
> + list_del(&fixup->list);
> + kfree(fixup);
> + ret = 0;
> + break;
> + }
> + }
> + mutex_unlock(&phy_fixup_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(phy_unregister_fixup);
> +
I was just about to submit an unregister patch of my own when I found this
patch. Good job!
But I think this patch may be missing something.
If one SoC has 2 MAC ports and each port uses a different network driver,
the 2 drivers may register fixups for the same PHY chip with different
"run" functions, because the PHY chip works in different modes.
In such a case, this patch doesn't take the "run" function into account and
may cause a problem: when the driver that registered its fixup last is
removed, it will remove the other driver's fixup.
Should this condition be considered and fixed?
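
For illustration, a variant along the lines suggested above might also match
the fixup's run() callback. This is a hypothetical sketch, not part of the
posted patch:

    /* Hypothetical: also match the run() callback so one driver cannot
     * remove a fixup registered by another driver for the same PHY. */
    int phy_unregister_fixup_run(const char *bus_id, u32 phy_uid,
                                 u32 phy_uid_mask,
                                 int (*run)(struct phy_device *))
    {
        struct phy_fixup *fixup, *n;
        int ret = -ENODEV;

        mutex_lock(&phy_fixup_lock);
        list_for_each_entry_safe(fixup, n, &phy_fixup_list, list) {
            if (!strcmp(fixup->bus_id, bus_id) &&
                (fixup->phy_uid & phy_uid_mask) ==
                (phy_uid & phy_uid_mask) &&
                fixup->run == run) {            /* the extra check */
                list_del(&fixup->list);
                kfree(fixup);
                ret = 0;
                break;
            }
        }
        mutex_unlock(&phy_fixup_lock);

        return ret;
    }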

> +/* Unregisters a fixup of any PHY with the UID in phy_uid */
> +int phy_unregister_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask)
> +{
> + return phy_unregister_fixup(PHY_ANY_ID, phy_uid, phy_uid_mask);
> +}
> +EXPORT_SYMBOL(phy_unregister_fixup_for_uid);
> +
> +/* Unregisters a fixup of the PHY with id string bus_id */
> +int phy_unregister_fixup_for_id(const char *bus_id)
> +{
> + return phy_unregister_fixup(bus_id, PHY_ANY_UID, 0xffffffff);
> +}
> +EXPORT_SYMBOL(phy_unregister_fixup_for_id);
> +
>  /* Returns 1 if fixup matches phydev in bus_id and phy_uid.
>   * Fixups can be set to match any in one or more fields.
>   */
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index feb8a98..f7d95f6 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -860,6 +860,10 @@ int phy_register_fixup_for_id(const char *bus_id,
>  int phy_register_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask,
>  int (*run)(struct phy_device *));
>  
> +int phy_unregister_fixup(const char *bus_id, u32 phy_uid, u32 phy_uid_mask);
> +int phy_unregister_fixup_for_id(const char *bus_id);
> +int phy_unregister_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask);
> +
>  int

Re: stmmac driver...

2016-12-14 Thread Giuseppe CAVALLARO

Hello Jie

On 12/14/2016 5:05 AM, Jie Deng wrote:

Hi Peppe,

On 2016/12/12 22:17, Giuseppe CAVALLARO wrote:

Hi David

On 12/7/2016 7:06 PM, David Miller wrote:


Giuseppe and Alexandre,

There are a lot of patches and discussions happening around the stmmac
driver lately and both of you are listed as the maintainers.

I really need prompt and conclusive reviews of these patch submissions
from you, and participation in all discussions about the driver.


yes we are trying to do the best.


Otherwise I have only three things I can do: 1) let the patches rot in
patchwork for days 2) trust that the patches are sane and fit your
desires and goals and just apply them or 3) reject them since they
aren't being reviewed properly.


at this stage, I think the best is: (3).

I think the patches David mentioned also included XLGMAC. He sent this email
before I explained that QoS and XLGMAC are different IPs. Do you mind if we do
XLGMAC development under drivers/net/ethernet/synopsys/ ? I think we have no
conflict since we will keep QoS development in stmmac.


Great. Many thanks for the clarification :-)

Regards
Peppe





Thanks in advance.


you are welcome


Peppe







[PATCH] net: davicom: dm9000: use new api ethtool_{get|set}_link_ksettings

2016-12-14 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to the new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/davicom/dm9000.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/davicom/dm9000.c b/drivers/net/ethernet/davicom/dm9000.c
index f1a81c5..008dc81 100644
--- a/drivers/net/ethernet/davicom/dm9000.c
+++ b/drivers/net/ethernet/davicom/dm9000.c
@@ -570,19 +570,21 @@ static void dm9000_set_msglevel(struct net_device *dev, u32 value)
dm->msg_enable = value;
 }
 
-static int dm9000_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int dm9000_get_link_ksettings(struct net_device *dev,
+struct ethtool_link_ksettings *cmd)
 {
struct board_info *dm = to_dm9000_board(dev);
 
-   mii_ethtool_gset(&dm->mii, cmd);
+   mii_ethtool_get_link_ksettings(&dm->mii, cmd);
return 0;
 }
 
-static int dm9000_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int dm9000_set_link_ksettings(struct net_device *dev,
+const struct ethtool_link_ksettings *cmd)
 {
struct board_info *dm = to_dm9000_board(dev);
 
-   return mii_ethtool_sset(&dm->mii, cmd);
+   return mii_ethtool_set_link_ksettings(&dm->mii, cmd);
 }
 
 static int dm9000_nway_reset(struct net_device *dev)
@@ -741,8 +743,6 @@ static int dm9000_set_wol(struct net_device *dev, struct ethtool_wolinfo *w)
 
 static const struct ethtool_ops dm9000_ethtool_ops = {
.get_drvinfo= dm9000_get_drvinfo,
-   .get_settings   = dm9000_get_settings,
-   .set_settings   = dm9000_set_settings,
.get_msglevel   = dm9000_get_msglevel,
.set_msglevel   = dm9000_set_msglevel,
.nway_reset = dm9000_nway_reset,
@@ -752,6 +752,8 @@ static int dm9000_set_wol(struct net_device *dev, struct ethtool_wolinfo *w)
.get_eeprom_len = dm9000_get_eeprom_len,
.get_eeprom = dm9000_get_eeprom,
.set_eeprom = dm9000_set_eeprom,
+   .get_link_ksettings = dm9000_get_link_ksettings,
+   .set_link_ksettings = dm9000_set_link_ksettings,
 };
 
 static void dm9000_show_carrier(struct board_info *db,
-- 
1.7.4.4



Re: [Query] Delayed vxlan socket creation?

2016-12-14 Thread Jiri Benc
On Wed, 14 Dec 2016 07:49:24 +, Du, Fan wrote:
> I'm interested in a Docker issue[1] which looks related to kernel vxlan
> socket creation, as described in the thread. From my limited knowledge,
> socket creation is synchronous, and after the *socket* syscall the sock
> handle will be valid and ready to link up.
> 
> I'm not sure about the detailed scenario here, though, or which commit
> might fix it.

baf606d9c9b1^..56ef9c909b40

 Jiri


Re: [PATCH] vhost: introduce O(1) vq metadata cache

2016-12-14 Thread Jason Wang



On 2016-12-14 16:14, kbuild test robot wrote:

Hi Jason,

[auto build test WARNING on vhost/linux-next]
[also build test WARNING on v4.9 next-20161214]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Jason-Wang/vhost-introduce-O-1-vq-metadata-cache/20161214-160153
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
config: i386-randconfig-x005-201650 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
 # save the attached .config to linux build tree


Thanks, V2 will be posted soon.



Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread Jesper Dangaard Brouer
On Tue, 13 Dec 2016 12:08:21 -0800
John Fastabend  wrote:

> On 16-12-13 11:53 AM, David Miller wrote:
> > From: John Fastabend 
> > Date: Tue, 13 Dec 2016 09:43:59 -0800
> >   
> >> What does "zero-copy send packet-pages to the application/socket that
> >> requested this" mean? At the moment on x86 page-flipping appears to be
> >> more expensive than memcpy (I can post some data shortly) and shared
> >> memory was proposed and rejected for security reasons when we were
> >> working on bifurcated driver.  
> > 
> > The whole idea is that we map all the active RX ring pages into
> > userspace from the start.
> > 
> > And just how Jesper's page pool work will avoid DMA map/unmap,
> > it will also avoid changing the userspace mapping of the pages
> > as well.
> > 
> > Thus avoiding the TLB/VM overhead altogether.
> >   

Exactly.  It is worth mentioning that pages entering the page pool need
to be cleared (measured cost 143 cycles), in order to not leak any
kernel info.  The primary focus of this design is to make sure not to
leak kernel info to userspace, but with an "exclusive" mode also
support isolation between applications.


> I get this but it requires applications to be isolated. The pages from
> a queue can not be shared between multiple applications in different
> trust domains. And the application has to be cooperative meaning it
> can't "look" at data that has not been marked by the stack as OK. In
> these schemes we tend to end up with something like virtio/vhost or
> af_packet.

I expect 3 modes when enabling RX-zero-copy on a page_pool. The first
two would require CAP_NET_ADMIN privileges.  All modes have a trust
domain id that needs to match, e.g. when a page reaches the socket.

Mode-1 "Shared": Application choose lowest isolation level, allowing
 multiple application to mmap VMA area.

Mode-2 "Single-user": Application request it want to be the only user
 of the RX queue.  This blocks other application to mmap VMA area.

Mode-3 "Exclusive": Application request to own RX queue.  Packets are
 no longer allowed for normal netstack delivery.

Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
still allowed to travel the netstack and thus can contain packet data from
other normal applications.  This is part of the design, to share the
NIC between the netstack and an accelerated userspace application using RX
zero-copy delivery.
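
Purely as an illustrative sketch (kernel-style C; none of these names exist,
they just restate the proposed modes in code form):

    /* Hypothetical restatement of the three proposed modes. */
    enum page_pool_zc_mode {
        PP_ZC_MODE_SHARED,      /* multiple apps may mmap the VMA area */
        PP_ZC_MODE_SINGLE_USER, /* only one app may mmap the VMA area */
        PP_ZC_MODE_EXCLUSIVE,   /* app owns the RX queue; no netstack
                                 * delivery */
    };

    struct page_pool_zc_config {
        enum page_pool_zc_mode mode;
        u32 trust_domain_id;    /* must match when a page reaches a socket */
    };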


> Any ACLs/filtering/switching/headers need to be done in hardware or
> the application trust boundaries are broken.

The software solution outlined allows the application to make the choice
of what trust boundary it wants.

The "exclusive" mode-3 makes the most sense together with HW filters.
Already today, we support creating a new RX queue based on an ethtool
ntuple HW filter; you then simply attach your application to that queue
in mode-3 and have full isolation.

 
> If the above can not be met then a copy is needed. What I am trying
> to tease out is the above comment along with other statements like
> this "can be done with out HW filter features".

Does this address your concerns?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


RE: [PATCH 3/3] secure_seq: use fast&secure siphash instead of slow&insecure md5

2016-12-14 Thread David Laight
From: Jason A. Donenfeld
> Sent: 14 December 2016 00:17
> This gives a clear speed and security improvement. Rather than manually
> filling MD5 buffers, we simply create a layout by a simple anonymous
> struct, for which gcc generates rather efficient code.
...
> + const struct {
> + struct in6_addr saddr;
> + struct in6_addr daddr;
> + __be16 sport;
> + __be16 dport;
> + } __packed combined = {
> + .saddr = *(struct in6_addr *)saddr,
> + .daddr = *(struct in6_addr *)daddr,
> + .sport = sport,
> + .dport = dport
> + };

You need to look at the effect of marking this (and the other)
structures 'packed' on architectures like sparc64.
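
For context, the worry is that __packed removes the compiler's alignment
assumptions, so strict-alignment targets may fall back to byte-wise field
access. A minimal stand-alone illustration (stand-in types, not the kernel
definitions):

    #include <stdint.h>

    /* Stand-ins for the kernel types, for illustration only. */
    struct in6_addr_s { uint8_t s6_addr[16]; };
    typedef uint16_t be16;

    /* With the packed attribute the compiler may not assume natural
     * alignment, so sparc64 and friends can end up accessing each
     * field byte by byte instead of with word loads. */
    struct combined {
        struct in6_addr_s saddr;
        struct in6_addr_s daddr;
        be16 sport;
        be16 dport;
    } __attribute__((packed));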

David



RE: [PATCH 1/3] siphash: add cryptographically secure hashtable function

2016-12-14 Thread David Laight
From: Jason A. Donenfeld
> Sent: 14 December 2016 00:17
> SipHash is a 64-bit keyed hash function that is actually a
> cryptographically secure PRF, like HMAC. Except SipHash is super fast,
> and is meant to be used as a hashtable keyed lookup function.
> 
> SipHash isn't just some new trendy hash function. It's been around for a
> while, and there really isn't anything that comes remotely close to
> being useful in the way SipHash is. With that said, why do we need this?
> 
> There are a variety of attacks known as "hashtable poisoning" in which an
> attacker forms some data such that the hash of that data will be the
> same, and then proceeds to fill up all entries of a hashbucket. This is
> a realistic and well-known denial-of-service vector.
> 
> Linux developers already seem to be aware that this is an issue, and
> various places that use hash tables in, say, a network context, use a
> non-cryptographically secure function (usually jhash) and then try to
> twiddle with the key on a time basis (or in many cases just do nothing
> and hope that nobody notices). While this is an admirable attempt at
> solving the problem, it doesn't actually fix it. SipHash fixes it.
> 
> (It fixes it in such a sound way that you could even build a stream
> cipher out of SipHash that would resist the modern cryptanalysis.)
> 
> There are a modicum of places in the kernel that are vulnerable to
> hashtable poisoning attacks, either via userspace vectors or network
> vectors, and there's not a reliable mechanism inside the kernel at the
> moment to fix it. The first step toward fixing these issues is actually
> getting a secure primitive into the kernel for developers to use. Then
> we can, bit by bit, port things over to it as deemed appropriate.
> 
> Dozens of languages are already using this internally for their hash
> tables. Some of the BSDs already use this in their kernels. SipHash is
> a widely known high-speed solution to a widely known problem, and it's
> time we catch-up.
...
> +u64 siphash24(const u8 *data, size_t len, const u8 key[SIPHASH24_KEY_LEN])
...
> + u64 k0 = get_unaligned_le64(key);
> + u64 k1 = get_unaligned_le64(key + sizeof(u64));
...
> + m = get_unaligned_le64(data);

All these unaligned accesses are going to get expensive on architectures
like sparc64.

David



dsa: handling more than 1 cpu port

2016-12-14 Thread John Crispin
Hi Andrew,

switches supported by qca8k have 2 gmacs that we can wire an external
mii interface to. Usually this is used to wire 2 of the SoC's MACs to the
switch. The switch itself is aware of one of the MACs being the cpu port
and expects this to be port/mac0. Using the other will break the
hardware offloading features. The original series from Matthieu added a
feature to configure the switch in a pass-through mode [1]. This
resulted in us having to define the pass-through inside the dts, which
means that we lose userland configurability. Assume the setup as
follows ...

port0 - cpu port wired to SoCs MAC0
port1-4 - the lan ports
port5 - the wan port
port6 - wired to the SoCs MAC1

What I have done now is bring up one bridge for port1-4 and a second one
for port5-6. Once set up, I can pass traffic on the SoC's MAC1 and it will
flow via port6 and egress on port5. So far so good; however, due to the
way the port-based vlans are set up and how the bridge_join/leave() logic
works, port5/6 will also forward traffic to the cpu port. The driver now
has to be told that we are trunking traffic on eth1 via port6. Also, the MII
mode is not known to the driver. Adding some hackish register writes
will make this work nicely. My proposed way of fixing this cleanly in an
upstream-friendly way would be as follows:

1) add an extra dsa,ethernet property to the 2nd MII port

dsa@0 {
        compatible = "qca,ar8xxx";

        dsa,ethernet = <&gmac1>;

        [...]

        switch@0 {
                [...]

                port@5 {
                        reg = <5>;
                        label = "wan";
                        phy-handle = <&phy_port5>;
                };

                port@6 {
                        reg = <6>;
                        label = "gmac2";

                        dsa,ethernet = <&gmac2>;
                        fixed-link {
                                speed = <1000>;
                                full-duplex;
                        };
                };
        };
};

2) fix up the port_bridge_join/leave() logic such that if a port is
present in the bridge that has a reference to an ethernet interface, it
will remove all ports in the bridge from the port-based vlan of the
actual cpu port.

3) in the case of this switch we also need to fiddle with the bcast/mcast
flooding registers

Would this be OK, and would it be better to probe the extra dsa,ethernet
property inside the subsystem or the driver? I was considering adding a
dsa_is_trunk_port() helper or similar to achieve this (sketched below).
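
A rough sketch of what such a helper might look like, assuming a
hypothetical per-port device-tree reference (nothing here is existing
DSA API, and the field name is invented):

    /* Hypothetical: true if the port carries its own dsa,ethernet
     * reference, i.e. it trunks traffic to a second SoC MAC instead
     * of the default cpu port. */
    static inline bool dsa_is_trunk_port(struct dsa_switch *ds, int port)
    {
        return ds->ports[port].ethernet_node != NULL;
    }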

John




[1] https://patchwork.ozlabs.org/patch/477525/


Hi

2016-12-14 Thread mrsnicole...@ono.com
I am Mrs Nicole Marois, i have a pending project of fulfillment to put
in your hand, i will need your support to make this dream come through,
could you let me know your interest to enable me give you further
information, and I hereby advice that you send the below mentioned
information

I decided to will/donate the sum of $4.5 Million US Dollars to you for 
the
good work of God, and also to help the motherless and less privilege
and also forth assistance of the widows. At the moment I cannot take
any telephone calls right now due to the fact that my relatives (that
have squandered the funds agave them for this purpose before) are
around me and my health status also. I have adjusted my will and my
lawyer is aware.

I have willed those properties to you by quoting my personal file
routing and account information. And I have also notified the bank
that I am willing that properties to you for a good, effective and
prudent work. I know I don't know you but I have been directed to do
this by God.ok Please contact this woman for more details you might
not get me on line in time contact this email if you need more 
information 
ok

Email: mrsnicole2mar...@gmail.com


Yours fairly friend,
Mrs Nicole Benoite Marois. 


Re: sctp: suspicious rcu_dereference_check() usage in sctp_epaddr_lookup_transport

2016-12-14 Thread Xin Long
On Wed, Dec 14, 2016 at 5:37 AM, Marcelo Ricardo Leitner
 wrote:
> On Tue, Dec 13, 2016 at 07:07:01PM +0100, Dmitry Vyukov wrote:
>> Hello,
>>
>> I am getting the following reports while running syzkaller fuzzer:
>>
>> [ INFO: suspicious RCU usage. ]
>> 4.9.0+ #85 Not tainted
>> ---
>> ./include/linux/rhashtable.h:572 suspicious rcu_dereference_check() usage!
>>
>> other info that might help us debug this:
>>
>> rcu_scheduler_active = 1, debug_locks = 0
>> 1 lock held by syz-executor1/18023:
>>  #0:  (sk_lock-AF_INET){+.+.+.}, at: [< inline >] lock_sock
>> include/net/sock.h:1454
>>  #0:  (sk_lock-AF_INET){+.+.+.}, at: []
>> sctp_getsockopt+0x45f/0x6800 net/sctp/socket.c:6432
>>
>> stack backtrace:
>> CPU: 2 PID: 18023 Comm: syz-executor1 Not tainted 4.9.0+ #85
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> Call Trace:
>> [< inline >] __dump_stack lib/dump_stack.c:15
>> [] dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
>> [] lockdep_rcu_suspicious+0x139/0x180
>> kernel/locking/lockdep.c:4448
>> [< inline >] __rhashtable_lookup ./include/linux/rhashtable.h:572
>> [< inline >] rhltable_lookup ./include/linux/rhashtable.h:660
>> [] sctp_epaddr_lookup_transport+0x641/0x930
>> net/sctp/input.c:946
>
> I think this was introduced in the rhlist conversion. We had removed
> some rcu_read_lock() calls from the sctp stack because rhashtable was
> already taking it, but then we didn't add them back when moving to rhlist.
>
> This code:
> +/* return a transport without holding it, as it's only used under sock lock */
>  struct sctp_transport *sctp_epaddr_lookup_transport(
> const struct sctp_endpoint *ep,
> const union sctp_addr *paddr)
>  {
> struct net *net = sock_net(ep->base.sk);
> +   struct rhlist_head *tmp, *list;
> +   struct sctp_transport *t;
> struct sctp_hash_cmp_arg arg = {
> -   .ep= ep,
> .paddr = paddr,
> .net   = net,
> +   .lport = htons(ep->base.bind_addr.port),
> };
>
> -   return rhashtable_lookup_fast(&sctp_transport_hashtable, &arg,
> - sctp_hash_params);
> +   list = rhltable_lookup(&sctp_transport_hashtable, &arg,
> +  sctp_hash_params);
>
> rhashtable_lookup_fast() took rcu_read_lock() implicitly, but
> rhltable_lookup() doesn't, and rhltable_lookup() uses _rcu calls which
> assume we have rcu read protection.

You're right, we need to call rcu_read_lock before using rhltable_lookup.
will post a fix for it, thanks.
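
The fix will presumably look something like the following sketch, derived
from the snippet quoted above (illustrative; not the patch that was
eventually posted):

    /* Sketch: take the RCU read lock explicitly around rhltable_lookup(),
     * since it no longer does so itself. */
    struct sctp_transport *sctp_epaddr_lookup_transport(
                    const struct sctp_endpoint *ep,
                    const union sctp_addr *paddr)
    {
        struct net *net = sock_net(ep->base.sk);
        struct rhlist_head *tmp, *list;
        struct sctp_transport *t;
        struct sctp_hash_cmp_arg arg = {
            .paddr = paddr,
            .net   = net,
            .lport = htons(ep->base.bind_addr.port),
        };

        rcu_read_lock();
        list = rhltable_lookup(&sctp_transport_hashtable, &arg,
                               sctp_hash_params);
        rhl_for_each_entry_rcu(t, tmp, list, node)
            if (ep == t->asoc->ep) {
                rcu_read_unlock();
                return t;
            }
        rcu_read_unlock();

        return NULL;
    }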


[RFC PATCH net-next v4 2/2] macb: Enable 1588 support in SAMA5Dx platforms.

2016-12-14 Thread Andrei Pistirica
This patch does the following:
- Enable HW time stamping for the following platforms: SAMA5D2, SAMA5D3 and
  SAMA5D4.
- HW time stamp capabilities are advertised via ethtool, and the macb ioctl is
  updated accordingly.
- HW time stamps on PTP Ethernet packets are delivered using the
  SO_TIMESTAMPING API, with the timestamps obtained from the PTP event/peer
  registers.

Note: Patch on net-next, on December 7th.

Signed-off-by: Andrei Pistirica 
---
Patch history:

Version 1:
Integration with SAMA5D2 only. This feature wasn't tested on any
other platform that might use cadence/gem.

Patch is not completely ported to the very latest version of net-next,
and it will be after review.

Version 2 modifications:
- add PTP caps for SAMA5D2/3/4 platforms
- and cosmetic changes

Version 3 modifications:
- add support for sama5D2/3/4 platforms using GEM-PTP interface.

Version 4 modifications:
- time stamp only PTP_V2 events
- maximum adjustment value is set based on Richard's input

Note: Patch on net-next, on December 14th. 

 drivers/net/ethernet/cadence/macb.c | 168 ++--
 1 file changed, 163 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
index 538544a..8d5c976 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -714,6 +714,8 @@ static void macb_tx_interrupt(struct macb_queue *queue)
 
/* First, update TX stats if needed */
if (skb) {
+   gem_ptp_do_txstamp(bp, skb);
+
netdev_vdbg(bp->dev, "skb %u (data %p) TX 
complete\n",
macb_tx_ring_wrap(bp, tail),
skb->data);
@@ -878,6 +880,8 @@ static int gem_rx(struct macb *bp, int budget)
GEM_BFEXT(RX_CSUM, ctrl) & GEM_RX_CSUM_CHECKED_MASK)
skb->ip_summed = CHECKSUM_UNNECESSARY;
 
+   gem_ptp_do_rxstamp(bp, skb);
+
bp->stats.rx_packets++;
bp->stats.rx_bytes += skb->len;
 
@@ -2080,6 +2084,9 @@ static int macb_open(struct net_device *dev)
 
netif_tx_start_all_queues(dev);
 
+   if (bp->ptp_info)
+   bp->ptp_info->ptp_init(dev);
+
return 0;
 }
 
@@ -2101,6 +2108,9 @@ static int macb_close(struct net_device *dev)
 
macb_free_consistent(bp);
 
+   if (bp->ptp_info)
+   bp->ptp_info->ptp_remove(dev);
+
return 0;
 }
 
@@ -2374,6 +2384,133 @@ static int macb_set_ringparam(struct net_device *netdev,
return 0;
 }
 
+#ifdef CONFIG_MACB_USE_HWSTAMP
+static unsigned int gem_get_tsu_rate(struct macb *bp)
+{
+   /* Note: TSU rate is hardwired to PCLK. */
+   return clk_get_rate(bp->pclk);
+}
+
+static s32 gem_get_ptp_max_adj(void)
+{
+   return 3921508;
+}
+
+static int gem_get_ts_info(struct net_device *dev,
+  struct ethtool_ts_info *info)
+{
+   struct macb *bp = netdev_priv(dev);
+
+   ethtool_op_get_ts_info(dev, info);
+   info->so_timestamping =
+   SOF_TIMESTAMPING_TX_SOFTWARE |
+   SOF_TIMESTAMPING_RX_SOFTWARE |
+   SOF_TIMESTAMPING_SOFTWARE |
+   SOF_TIMESTAMPING_TX_HARDWARE |
+   SOF_TIMESTAMPING_RX_HARDWARE |
+   SOF_TIMESTAMPING_RAW_HARDWARE;
+   info->phc_index = -1;
+
+   if (bp->ptp_clock)
+   info->phc_index = ptp_clock_index(bp->ptp_clock);
+
+   return 0;
+}
+
+static int gem_set_hwtst(struct net_device *netdev,
+struct ifreq *ifr, int cmd)
+{
+   struct hwtstamp_config config;
+   struct macb *priv = netdev_priv(netdev);
+   u32 regval;
+
+   netdev_vdbg(netdev, "macb_hwtstamp_ioctl\n");
+
+   if (copy_from_user(&config, ifr->ifr_data, sizeof(config)))
+   return -EFAULT;
+
+   /* reserved for future extensions */
+   if (config.flags)
+   return -EINVAL;
+
+   switch (config.tx_type) {
+   case HWTSTAMP_TX_OFF:
+   priv->hwts_tx_en = false;
+   break;
+   case HWTSTAMP_TX_ON:
+   priv->hwts_tx_en = true;
+   break;
+   default:
+   return -ERANGE;
+   }
+
+   switch (config.rx_filter) {
+   case HWTSTAMP_FILTER_NONE:
+   if (priv->hwts_rx_en)
+   priv->hwts_rx_en = false;
+   break;
+   case HWTSTAMP_FILTER_PTP_V2_L4_EVENT:
+   case HWTSTAMP_FILTER_PTP_V2_L2_EVENT:
+   case HWTSTAMP_FILTER_PTP_V2_L2_SYNC:
+   case HWTSTAMP_FILTER_PTP_V2_L4_SYNC:
+   case HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ:
+   case HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ:
+   case HWTSTAMP_FILTER_PTP_V2_EVENT:
+   case HWTSTAMP_FILTER_PTP_V2_SYNC:
+   case HWTSTAMP_FILTER_PTP_V2_DELAY_REQ:
+   config.rx_filter = HWTSTAMP_FILT

[RFC PATCH net-next v4 1/2] macb: Add 1588 support in Cadence GEM.

2016-12-14 Thread Andrei Pistirica
Cadence GEM provides a 102 bit time counter with 48 bits for seconds,
30 bits for nsecs and 24 bits for sub-nsecs to control 1588 timestamping.

This patch does the following:
- Registers to ptp clock framework
- Timer initialization is done by writing time of day to the timer counter.
- The ns increment register is programmed as NSEC_PER_SEC/tsu-clock-rate.
  For 16-bit subns precision, the subns increment equals the remainder of
  (NSEC_PER_SEC/TSU_CLK) * (2^16); see the worked example after this list.
- Timestamps are obtained from the TX/RX PTP event/PEER registers.
  The timestamp obtained thus is updated in skb for upper layers to access.
- The drivers register functions with ptp to perform time and frequency
  adjustment.
- Time adjustment is done by writing to the 1588_ADJUST register.
  The controller will read the delta in this register and update the timer
  counter register. Alternatively, for large time offset adjustments,
  the driver reads the secs and nsecs counter values, adds/subtracts the
  delta and updates the timer counter.
- Frequency is adjusted by adjusting addend (8bit nanosecond increment) and
  addendsub (16bit increment nanosecond fractions).
  The 102bit counter is incremented at nominal frequency with addend and
  addendsub values. Each period addend and addendsub values are adjusted
  based on ppm drift.
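
A worked example of that increment arithmetic, assuming a hypothetical
240 MHz TSU clock (the rate is an assumption, not taken from the patch):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t tsu_clk = 240000000ULL;        /* assumed TSU rate, Hz */
        uint64_t ns_per_sec = 1000000000ULL;

        /* whole nanoseconds per TSU tick: 10^9 / 240e6 = 4 */
        uint64_t incr = ns_per_sec / tsu_clk;

        /* 16-bit fractional part: (10^9 % 240e6) * 2^16 / 240e6
         * = 40000000 * 65536 / 240000000 = 10922 */
        uint64_t subns = ((ns_per_sec % tsu_clk) << 16) / tsu_clk;

        printf("incr=%llu subns=%llu\n",
               (unsigned long long)incr, (unsigned long long)subns);
        return 0;
    }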

Signed-off-by: Andrei Pistirica 
Signed-off-by: Harini Katakam 
---
Patch history:

Version 1:
This patch is based on original Harini's patch, implemented in a
separate file to ease the review/maintanance and integration with
other platforms (e.g. Zynq Ultrascale+ MPSoC).
Feature was tested on SAMA5D2 platform using ptp4l v1.6 from linuxptp
project and also with ptpd2 version 2.3.1. PTP was tested over
IPv4,IPv6 and 802.3 protocols.

In case that macb is compiled as a module, it has been renamed to
cadence-macb.ko to avoid naming confusion in Makefile.

Version 2 modifications:
- bitfields for TSU are named according to SAMA5D2 data sheet
- identify GEM-PTP support based on platform capability
- add spinlock for TSU access
- change macb_ptp_adjfreq and use fewer 64bit divisions

Version 3 modifications:
- new adjfine api with a single 64-bit division for frequency adjustment 
  (based on Richard's input)
- add maximum adjustment frequency (ppb) based on nominal frequency
- per platform PTP configuration
- cosmetic changes
Note 1: Kbuild uses "select" instead of "imply", and the macb maintainer agreed
to make the change when it becomes available in net-next.

Version 4 modifications:
- update adjfine for a better approximation
- add maximum adjustment frequency callback to PTP platform configuration

Note 1: This driver does not support GEM-GXL!
Note 2: Patch on net-next, on December 14th. 

 drivers/net/ethernet/cadence/Kconfig    |  10 +-
 drivers/net/ethernet/cadence/Makefile   |   8 +-
 drivers/net/ethernet/cadence/macb.h     | 118 ++++++++++++
 drivers/net/ethernet/cadence/macb_ptp.c | 366 ++++++++++++++++++++++++++++++
 4 files changed, 500 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/cadence/macb_ptp.c

diff --git a/drivers/net/ethernet/cadence/Kconfig b/drivers/net/ethernet/cadence/Kconfig
index f0bcb15..ebbc65f 100644
--- a/drivers/net/ethernet/cadence/Kconfig
+++ b/drivers/net/ethernet/cadence/Kconfig
@@ -29,6 +29,14 @@ config MACB
  support for the MACB/GEM chip.
 
  To compile this driver as a module, choose M here: the module
- will be called macb.
+ will be called cadence-macb.
+
+config MACB_USE_HWSTAMP
+   bool "Use IEEE 1588 hwstamp"
+   depends on MACB
+   default y
+   select PTP_1588_CLOCK
+   ---help---
+ Enable IEEE 1588 Precision Time Protocol (PTP) support for MACB.
 
 endif # NET_CADENCE
diff --git a/drivers/net/ethernet/cadence/Makefile b/drivers/net/ethernet/cadence/Makefile
index 91f79b1..4402d42 100644
--- a/drivers/net/ethernet/cadence/Makefile
+++ b/drivers/net/ethernet/cadence/Makefile
@@ -2,4 +2,10 @@
 # Makefile for the Atmel network device drivers.
 #
 
-obj-$(CONFIG_MACB) += macb.o
+cadence-macb-y := macb.o
+
+ifeq ($(CONFIG_MACB_USE_HWSTAMP),y)
+cadence-macb-y += macb_ptp.o
+endif
+
+obj-$(CONFIG_MACB) += cadence-macb.o
diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index d67adad..e65e985 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -10,6 +10,9 @@
 #ifndef _MACB_H
 #define _MACB_H
 
+#include 
+#include 
+
 #define MACB_GREGS_NBR 16
 #define MACB_GREGS_VERSION 2
 #define MACB_MAX_QUEUES 8
@@ -131,6 +134,20 @@
 #define GEM_RXIPCCNT   0x01a8 /* IP header Checksum Error Counter */
 #define GEM_RXTCPCCNT  0x01ac /* TCP Checksum Error Counter */
 #define GEM_RXUDPCCNT  0x01b0 /* UDP Checksum Error Counter */
+#define GEM_TISUBN 0x01bc /* 1588 Timer Increment Sub-ns */
+#define GEM_TSH0x01c0 /* 1588 Timer Seconds High */
+#define GEM_TSL   

[PATCH V2] vhost: introduce O(1) vq metadata cache

2016-12-14 Thread Jason Wang
When device IOTLB is enabled, all address translations are stored in an
interval tree. The O(lgN) search time could be slow for virtqueue
metadata (avail, used and descriptors) since they are accessed much
more often than other addresses. So this patch introduces an O(1) array
which points to the interval tree nodes that store the translations of
the vq metadata. The array is updated during vq IOTLB prefetching and
is reset on each invalidation and TLB update. Each time we want to
access vq metadata, this small array is queried before the interval
tree. This is sufficient for static mappings but not for dynamic
mappings; we could do optimizations on top.

Tests were done with l2fwd in the guest (2M hugepages):

   | noiommu  | before        | after
tx | 1.32Mpps | 1.06Mpps(82%) | 1.30Mpps(98%)
rx | 2.33Mpps | 1.46Mpps(63%) | 2.29Mpps(98%)

We can almost reach the same performance as noiommu mode.

Signed-off-by: Jason Wang 
---
Changes from V1:
- silence 32bit build warning
---
 drivers/vhost/vhost.c | 136 +++++++++++++++++++++++++++++++++++++---------
 drivers/vhost/vhost.h |   8 +++
 2 files changed, 118 insertions(+), 26 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c6f2d89..50ed625 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -282,6 +282,22 @@ void vhost_poll_queue(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
+static void __vhost_vq_meta_reset(struct vhost_virtqueue *vq)
+{
+   int j;
+
+   for (j = 0; j < VHOST_NUM_ADDRS; j++)
+   vq->meta_iotlb[j] = NULL;
+}
+
+static void vhost_vq_meta_reset(struct vhost_dev *d)
+{
+   int i;
+
+   for (i = 0; i < d->nvqs; ++i)
+   __vhost_vq_meta_reset(d->vqs[i]);
+}
+
 static void vhost_vq_reset(struct vhost_dev *dev,
   struct vhost_virtqueue *vq)
 {
@@ -311,6 +327,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->busyloop_timeout = 0;
vq->umem = NULL;
vq->iotlb = NULL;
+   __vhost_vq_meta_reset(vq);
 }
 
 static int vhost_worker(void *data)
@@ -690,6 +707,18 @@ static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
return 1;
 }
 
+static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
+  u64 addr, unsigned int size,
+  int type)
+{
+   const struct vhost_umem_node *node = vq->meta_iotlb[type];
+
+   if (!node)
+   return NULL;
+
+   return (void *)(uintptr_t)(node->userspace_addr + addr - node->start);
+}
+
 /* Can we switch to this memory table? */
 /* Caller should have device mutex but not vq mutex */
 static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
@@ -732,8 +761,14 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void *to,
 * could be access through iotlb. So -EAGAIN should
 * not happen in this case.
 */
-   /* TODO: more fast path */
struct iov_iter t;
+   void __user *uaddr = vhost_vq_meta_fetch(vq,
+(u64)(uintptr_t)to, size,
+VHOST_ADDR_DESC);
+
+   if (uaddr)
+   return __copy_to_user(uaddr, from, size);
+
	ret = translate_desc(vq, (u64)(uintptr_t)to, size, vq->iotlb_iov,
 ARRAY_SIZE(vq->iotlb_iov),
 VHOST_ACCESS_WO);
@@ -761,8 +796,14 @@ static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
 * could be access through iotlb. So -EAGAIN should
 * not happen in this case.
 */
-   /* TODO: more fast path */
+   void __user *uaddr = vhost_vq_meta_fetch(vq,
+(u64)(uintptr_t)from, size,
+VHOST_ADDR_DESC);
struct iov_iter f;
+
+   if (uaddr)
+   return __copy_from_user(to, uaddr, size);
+
	ret = translate_desc(vq, (u64)(uintptr_t)from, size, vq->iotlb_iov,
 ARRAY_SIZE(vq->iotlb_iov),
 VHOST_ACCESS_RO);
@@ -782,17 +823,12 @@ static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
return ret;
 }
 
-static void __user *__vhost_get_user(struct vhost_virtqueue *vq,
-void *addr, unsigned size)
+static void __user *__vhost_get_user_slow(struct vhost_virtqueue *vq,
+ void *addr, unsigned int size,
+ int type)
 {
int ret;
 
-   /* This function should be called after iotlb
-* prefetch, which means we're sure that vq
-* could be access through iotlb. So -EAGAIN should
-  

Re: dsa: handling more than 1 cpu port

2016-12-14 Thread Andrew Lunn
On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote:
> Hi Andrew,
> 
> switches supported by qca8k have 2 gmacs that we can wire an external
> mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
> switch. The switch itself is aware of one of the MACs being the cpu port
> and expects this to be port/mac0. Using the other will break the
> hardware offloading features.

Just to be sure here. There is no way to use the second port connected
to the CPU as a CPU port?

The Marvell chips do allow this. So I developed a proof of concept
which had a mapping between cpu ports and slave ports: slave port X
should use cpu port y for its traffic. This never got past proof of
concept. 

If this can be made to work for qca8k, i would prefer having this
general concept, than specific hacks for pass through.

Andrew


Re: dsa: handling more than 1 cpu port

2016-12-14 Thread John Crispin


On 14/12/2016 11:31, Andrew Lunn wrote:
> On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote:
>> Hi Andrew,
>>
>> switches supported by qca8k have 2 gmacs that we can wire an external
>> mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
>> switch. The switch itself is aware of one of the MACs being the cpu port
>> and expects this to be port/mac0. Using the other will break the
>> hardware offloading features.
> 
> Just to be sure here. There is no way to use the second port connected
> to the CPU as a CPU port?

both macs are considered cpu ports and both allow for the tag to be
injected. For HW NAT/routing/... offloading to work, the lan ports need
to trunk via port0 and not port6, however.

> 
> The Marvell chips do allow this. So i developed a proof of concept
> which had a mapping between cpu ports and slave ports. slave port X
> should use cpu port y for its traffic. This never got past proof of
> concept. 
> 
> If this can be made to work for qca8k, i would prefer having this
> general concept, than specific hacks for pass through.

oh cool, can you send those patches my way please ? how do you configure
this from userland ? does the cpu port get its own swdev which i just add
to my lan bridge and then add the 2nd cpu port to the wan bridge ?

John


Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries

2016-12-14 Thread Andrew Lunn
On Wed, Dec 14, 2016 at 11:03:14AM +0100, Volodymyr Bendiuga wrote:
> Hi,
> 
> I understand what you wrote, but we do not refresh anything
> at intervals. FDB dump works only on user request, i.e. user
> run bridge fdb dump command. What we did is just a smarter
> way of feeding the dump request with entries obtained from
> switchcore. Interesting thing is that fdb dump in switchcore
> reads all entries, and from those entries only one is returned at
> a time, others are discarded, because request comes as per port.

Ah, O.K.

> What we do is we dump entries from switchcore once, when the
> first dump request comes, save all of them in the cache, and then
> all following fdb dump requests for each port will be fed from the cache,
> so no more hardware dump operations. We set the cache to be valid for
> 2 seconds, but this could be adjusted and tweaked. So in our case
> we decrease the number of MDIO reads quite a lot.

> What we are thinking now is that this logic could be moved
> to the net/dsa layer, and maybe unified with the fdb load logic, if possible.

We should check what the other switches do. Can they perform a dump
with the hardware performing the port filter? Those switches will not
benefit from such a cache. But they should also not suffer.

Combining it with load is a bigger question. Currently the drivers are
stateless. That makes the error handling very simple. We don't have to
worry about running out of memory, since we don't allocate any memory.

If we run out of memory for this dump cache, it is not serious, since
a dump does not result in state change. But if we are out of memory on
a load, we do have state changes. We have to deal with the complexity
of allocating resources in the _prepare call, and then use the
resource in the do call. I would much prefer to avoid this at first,
by optimising the atu get. If we don't see a significant speed up,
then we can consider this complex solution of a cache for load.
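
For readers following along, the two-phase pattern Andrew refers to looks
roughly like this (names and the exact callback split are illustrative):

    /* Sketch of the switchdev prepare/commit split: allocate and validate
     * in the prepare phase (which may fail), mutate state in the commit
     * phase (which must not fail). */
    static int port_obj_op(struct dsa_switch *ds, int port,
                           struct switchdev_trans *trans)
    {
        if (switchdev_trans_ph_prepare(trans)) {
            /* allocate resources here; returning an error is still OK */
            return 0;
        }

        /* commit phase: resources are guaranteed, no failure allowed */
        return 0;
    }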

 Andrew


[no subject]

2016-12-14 Thread Mr Friedrich Mayrhofer


Good Day,

This is the second time i am sending you this mail.

I, Friedrich Mayrhofer Donate $ 1,000,000.00 to You, Email Me
personally for more details.

Regards.
Friedrich Mayrhofer








Re: net/arp: ARP cache aging failed.

2016-12-14 Thread YueHaibing
On 2016/11/26 4:40, Julian Anastasov wrote:
> 
>   Hello,
> 
> On Fri, 25 Nov 2016, Hannes Frederic Sowa wrote:
> 
>> On 25.11.2016 09:18, Julian Anastasov wrote:
>>>
>>> Another option would be to add similar bit to
>>> struct sock (sk_dst_pending_confirm), may be ip_finish_output2
>>> can propagate it to dst_neigh_output via new arg and then to
>>> reset it.
>>
>> I don't understand how this can help? Maybe I understood it wrong?
> 
>   The idea is, the indication from highest level (TCP) to
> be propagated to lowest level (neigh).
> 
>>> Or to change n->confirmed via some new inline sock
>>> function instead of changing dst_neigh_output. At this place 
>>> skb->sk is optional, may be we need (skb->sk && dst ==
>>> skb->sk->sk_dst_cache) check. Also, when sk_dst_cache changes
>>> we should clear this flag.
>>
>> In "(skb->sk && dst == skb->sk->sk_dst_cache)" where does dst come from?
> 
>   This is in case we do not want to propagate indication
> from TCP to tunnels, see below...
> 
>> I don't see a possibility besides using mac_header or inner_mac_header
>> to look up the incoming MAC address and confirm that one in the neighbor
>> cache so far (we could try to optimize this case for rt_gateway though).
> 
>   My idea is as follows:
> 
> - TCP instead of calling dst_confirm should call new func
> dst_confirm_sk(sk) where sk->sk_dst_pending_confirm is set,
> not dst->pending_confirm.
> 
> - ip_finish_output2: use skb->sk (TCP) to check for
> sk_dst_pending_confirm and update n->confirmed in some
> new inline func
> 
>   Correct me if I'm wrong but here is how I see the
> situation:
> 
> - TCP caches output dst in socket, for example, dst1,
> sets it as skb->dst
> 
> - if xfrm tunnels are used then dst1 can point to some tunnel,
> i.e. it is not in all cases the same skb->dst, as seen by 
> ip_finish_output2
> 
> - netfilter can DNAT and assign different skb->dst
> 
> - as result, before now, dst1->pending_confirm is set but
> it is not used properly because ip_finish_output2 uses
> the final skb->dst which can be different or with
> rt_gateway = 0
> 
> - considering that tunnels do not use dst_confirm,
> n->confirmed is not changed every time. There are some
> interesting cases:
> 
> 1. both dst1 and the final skb->dst point to same
> dst with rt_gateway = 0: n->confirmed is updated but
> as we see it can be for the wrong neigh.
> 
> 2. dst1 is different from skb->dst, so n->confirmed is
> not changed. This can happen on DNAT or when using
> tunnel to secure gateway.
> 
> - in ip_finish_output2 we have skb->sk (original TCP sk) and
> sk arg (equal to skb->sk or second option is sock of some UDP
> tunnel, etc). The idea is to use skb->sk->sk_dst_pending_confirm,
> i.e. highest level sk because the sk arg, if different, does not
> have such flag set (tunnels do not call dst_confirm).
> 
> - ip_finish_output2 should call new func dst_neigh_confirm_sk
> (which has nothing related to dst, hm... better name is needed):
> 
>   if (!IS_ERR(neigh)) {
> - int res = dst_neigh_output(dst, neigh, skb);
> + int res;
> +
> + dst_neigh_confirm_sk(skb->sk, neigh);
> + res = dst_neigh_output(dst, neigh, skb);
> 
>   which should do something like this:
> 
>   if (sk && sk->sk_dst_pending_confirm) {
>   unsigned long now = jiffies;
> 
>   sk->sk_dst_pending_confirm = 0;
>   /* avoid dirtying neighbour */
>   if (n->confirmed != now)
>   n->confirmed = now;
>   }
> 
>   Additional dst == skb->sk->sk_dst_cache check will
> not propagate the indication on DNAT/tunnel. For me,
> it is better without such check.
> 
>   So, the idea is to move TCP and other similar
> users to the new dst_confirm_sk() method. If other
> dst_confirm() users are left, they should be checked
> if dsts with rt_gateway = 0 can be wrongly used.
> 
> Regards
> 
> --
> Julian Anastasov 
> 
> .
> 

Sorry for replying so late.

Based on your ideas, I made a patch and have been testing it for a while.

It works for me.

From dc033fe4bac8931590e15774a298f04ccea8ed4c Mon Sep 17 00:00:00 2001
From: YueHaibing 
Date: Wed, 14 Dec 2016 02:43:51 -0500
Subject: [PATCH] arp: do neigh confirm based on sk arg

Signed-off-by: YueHaibing 
---
 include/net/sock.h     | 18 ++
 net/ipv4/ip_output.c   |  5 -
 net/ipv4/ping.c        |  2 +-
 net/ipv4/raw.c         |  2 +-
 net/ipv4/tcp_input.c   |  8 ++--
 net/ipv4/tcp_metrics.c |  5 +++--
 net/ipv4/udp.c         |  2 +-
 net/ipv6/raw.c         |  2 +-
 net/ipv6/route.c       |  6 +++---
 net/ipv6/udp.c         |  2 +-
 10 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 92b2697..ece8dfa 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -434,6 +434,7 @@ struct sock {
struct sk_buff  *sk_send_head;
__s32   sk_peek_off;
int sk_write

Re: [PATCH 2/3] selftests: do not require bash to run bpf tests

2016-12-14 Thread Daniel Borkmann

On 12/14/2016 11:58 AM, Rolf Eike Beer wrote:

 From b9d6c1b7427d708ef2d4d57aac17b700b3694d71 Mon Sep 17 00:00:00 2001
From: Rolf Eike Beer 
Date: Wed, 14 Dec 2016 09:58:12 +0100
Subject: [PATCH 2/3] selftests: do not require bash to run bpf tests

Nothing in this minimal script seems to require bash. We often run these tests
on embedded devices where the only shell available is the busybox ash.

Signed-off-by: Rolf Eike Beer 


Acked-by: Daniel Borkmann 


Re: dsa: handling more than 1 cpu port

2016-12-14 Thread John Crispin


On 14/12/2016 12:00, Andrew Lunn wrote:
> On Wed, Dec 14, 2016 at 11:35:30AM +0100, John Crispin wrote:
>>
>>
>> On 14/12/2016 11:31, Andrew Lunn wrote:
>>> On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote:
 Hi Andrew,

 switches supported by qca8k have 2 gmacs that we can wire an external
 mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
 switch. The switch itself is aware of one of the MACs being the cpu port
 and expects this to be port/mac0. Using the other will break the
 hardware offloading features.
>>>
>>> Just to be sure here. There is no way to use the second port connected
>>> to the CPU as a CPU port?
>>
>> both macs are considered cpu ports and both allow for the tag to be
>> injected. For HW NAT/routing/... offloading to work, the lan ports need
>> to trunk via port0 and not port6 however.
> 
> Maybe you can do a restricted version of the generic solution. LAN
> ports are mapped to cpu port0. WAN port to cpu port 6?

Hardcoding it is exactly what I want to avoid, same as using magic name
matching. The dts should describe the HW and not the usage dictated by
what is printed on the casing of the switch. As you mention below, this
is marketing chatter. IMHO ports should be able to have any name, and we
should be able to bridge them as we please and attach the bridges to any
cpu port that we want.

> 
>>> The Marvell chips do allow this. So i developed a proof of concept
>>> which had a mapping between cpu ports and slave ports. slave port X
>>> should you cpu port y for its traffic. This never got past proof of
>>> concept. 
>>>
>>> If this can be made to work for qca8k, i would prefer having this
>>> general concept, than specific hacks for pass through.
>>
>> oh cool, can you send those patches my way please ? how do you configure
>> this from userland ? does the cpu port get its on swdev which i just add
>> to my lan bridge and then add the 2nd cpu port to the wan bridge ?
> 
> https://github.com/lunn/linux/tree/v4.1-rc4-net-next-multiple-cpu
> 
> You don't configure anything from userland. Which was actually a
> criticism. It is in device tree. But my solution is generic. Having
> one WAN port and four bridges LAN ports is a pure marketing
> requirement. The hardware will happily do two WAN ports and 3 LAN
> ports, for example. And the switch is happy to take traffic for the
> WAN port and a LAN port over the same CPU port, and keep the traffic
> separate. So we can have some form of load balancing. We are not
> limited to 1->1, 1->4, we can do 1->2, 1->3 to increase the overall
> performance. And to the user it is all transparent.
> 
> This PoC is for the old DSA binding. The new binding makes it easier
> to express this. Which is one of the reasons for the new binding.
> 
>   Andrew
> 

i'll have a look at the patches. thanks !

John



Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function

2016-12-14 Thread Hannes Frederic Sowa
Hello,

On 14.12.2016 04:59, Jason A. Donenfeld wrote:
> SipHash is a 64-bit keyed hash function that is actually a
> cryptographically secure PRF, like HMAC. Except SipHash is super fast,
> and is meant to be used as a hashtable keyed lookup function.

Can you show or cite benchmarks in comparison with jhash? Last time I
looked, especially for short inputs, siphash didn't beat jhash (also on
all the 32 bit devices etc.).

> SipHash isn't just some new trendy hash function. It's been around for a
> while, and there really isn't anything that comes remotely close to
> being useful in the way SipHash is. With that said, why do we need this?
> 
> There are a variety of attacks known as "hashtable poisoning" in which an
> attacker forms some data such that the hash of that data will be the
> same, and then proceeds to fill up all entries of a hashbucket. This is
> a realistic and well-known denial-of-service vector.

This pretty much depends on the linearity of the hash function? I don't
think a crypto secure hash function is needed for a hash table. Albeit I
agree that siphash certainly looks good to be used here.

> Linux developers already seem to be aware that this is an issue, and
> various places that use hash tables in, say, a network context, use a
> non-cryptographically secure function (usually jhash) and then try to
> twiddle with the key on a time basis (or in many cases just do nothing
> and hope that nobody notices). While this is an admirable attempt at
> solving the problem, it doesn't actually fix it. SipHash fixes it.

I am pretty sure that SipHash still needs a random key per hash table
also. So far it was only the choice of hash function you are questioning.

> (It fixes it in such a sound way that you could even build a stream
> cipher out of SipHash that would resist the modern cryptanalysis.)
> 
> There are a modicum of places in the kernel that are vulnerable to
> hashtable poisoning attacks, either via userspace vectors or network
> vectors, and there's not a reliable mechanism inside the kernel at the
> moment to fix it. The first step toward fixing these issues is actually
> getting a secure primitive into the kernel for developers to use. Then
> we can, bit by bit, port things over to it as deemed appropriate.

Hmm, I tried to follow up with all the HashDoS work and so far didn't
see any HashDoS attacks against the Jenkins/SpookyHash family.

If this is an issue we might need to also put those changes into stable.

> Dozens of languages are already using this internally for their hash
> tables. Some of the BSDs already use this in their kernels. SipHash is
> a widely known high-speed solution to a widely known problem, and it's
> time we catch-up.

Bye,
Hannes



Re: dsa: handling more than 1 cpu port

2016-12-14 Thread Andrew Lunn
On Wed, Dec 14, 2016 at 11:35:30AM +0100, John Crispin wrote:
> 
> 
> On 14/12/2016 11:31, Andrew Lunn wrote:
> > On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote:
> >> Hi Andrew,
> >>
> >> switches supported by qca8k have 2 gmacs that we can wire an external
> >> mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
> >> switch. The switch itself is aware of one of the MACs being the cpu port
> >> and expects this to be port/mac0. Using the other will break the
> >> hardware offloading features.
> > 
> > Just to be sure here. There is no way to use the second port connected
> > to the CPU as a CPU port?
> 
> both macs are considered cpu ports and both allow for the tag to be
> injected. For HW NAT/routing/... offloading to work, the lan ports need
> to trunk via port0 and not port6 however.

Maybe you can do a restricted version of the generic solution. LAN
ports are mapped to cpu port0. WAN port to cpu port 6?

> > The Marvell chips do allow this. So i developed a proof of concept
> > which had a mapping between cpu ports and slave ports. slave port X
> > should use cpu port y for its traffic. This never got past proof of
> > concept. 
> > 
> > If this can be made to work for qca8k, i would prefer having this
> > general concept, than specific hacks for pass through.
> 
> oh cool, can you send those patches my way please ? how do you configure
> this from userland ? does the cpu port get its own swdev which i just add
> to my lan bridge and then add the 2nd cpu port to the wan bridge ?

https://github.com/lunn/linux/tree/v4.1-rc4-net-next-multiple-cpu

You don't configure anything from userland. Which was actually a
criticism. It is in device tree. But my solution is generic. Having
one WAN port and four bridged LAN ports is a pure marketing
requirement. The hardware will happily do two WAN ports and 3 LAN
ports, for example. And the switch is happy to take traffic for the
WAN port and a LAN port over the same CPU port, and keep the traffic
separate. So we can have some form of load balancing. We are not
limited to 1->1, 1->4, we can do 1->2, 1->3 to increase the overall
performance. And to the user it is all transparent.

This PoC is for the old DSA binding. The new binding makes it easier
to express this. Which is one of the reasons for the new binding.

Andrew


[PATCH 2/3] selftests: do not require bash to run bpf tests

2016-12-14 Thread Rolf Eike Beer
>From b9d6c1b7427d708ef2d4d57aac17b700b3694d71 Mon Sep 17 00:00:00 2001
From: Rolf Eike Beer 
Date: Wed, 14 Dec 2016 09:58:12 +0100
Subject: [PATCH 2/3] selftests: do not require bash to run bpf tests

Nothing in this minimal script seems to require bash. We often run these tests
on embedded devices where the only shell available is the busybox ash.

Signed-off-by: Rolf Eike Beer 
---
 tools/testing/selftests/bpf/test_kmod.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_kmod.sh 
b/tools/testing/selftests/bpf/test_kmod.sh
index 92e627a..6d58cca 100755
--- a/tools/testing/selftests/bpf/test_kmod.sh
+++ b/tools/testing/selftests/bpf/test_kmod.sh
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/bin/sh
 
 SRC_TREE=../../../../
 
-- 
2.8.3


-- 
Rolf Eike Beer, emlix GmbH, http://www.emlix.com
Fon +49 551 30664-0, Fax +49 551 30664-11
Bertha-von-Suttner-Str. 9, 37085 Göttingen, Germany
Sitz der Gesellschaft: Göttingen, Amtsgericht Göttingen HR B 3160
Geschäftsführung: Heike Jordan, Dr. Uwe Kracke – Ust-IdNr.: DE 205 198 055

emlix – smart embedded open source



Re: Re: Synopsys Ethernet QoS

2016-12-14 Thread Pavel Machek
Hi!

> There are some performance problems with the stmmac driver though:
> 
> When running iperf3 with 3 streams:
> iperf3 -c 192.168.0.90 -P 3 -t 30
> iperf3 -c 192.168.0.90 -P 3 -t 30 -R
> 
> I get really bad fairness between the streams.
> 
> This appears to be an issue with how TX IRQ coalescing is implemented in 
> stmmac.
> Disabling TX IRQ coalescing in the stmmac driver makes the problem go away.

Yes, I'm hitting the same problem

https://lkml.org/lkml/2016/12/11/90

> Also netperf TCP_RR and UDP_RR gives really bad results compared to the
> dwceqos driver (without IRQ coalescing).
> 2000 transactions/sec vs 9000 transactions/sec.
> Turning TX IRQ coalescing off and RX interrupt watchdog off in stmmac
> gives the same performance. I guess it's a trade off, low CPU usage
> vs low latency, so I don't know how important TCP_RR/UDP_RR really is.

Well... a 75% performance hit is a bug, plain and simple, not a CPU tradeoff.

> The best thing would be to get a good working TX IRQ coalesce
> implementation with HR timers in stmmac.

Actually it seems that using HR timers is not the only problem, AFAICT
the logic is the wrong way around. (But yes, it needs to be an HR timer, too,
and probably packet count needs to be reduced, too.)

Best regards,

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function

2016-12-14 Thread Jason A. Donenfeld
Hi David,

On Wed, Dec 14, 2016 at 10:56 AM, David Laight  wrote:
> ...
>> +u64 siphash24(const u8 *data, size_t len, const u8 key[SIPHASH24_KEY_LEN])
> ...
>> + u64 k0 = get_unaligned_le64(key);
>> + u64 k1 = get_unaligned_le64(key + sizeof(u64));
> ...
>> + m = get_unaligned_le64(data);
>
> All these unaligned accesses are going to get expensive on architectures
> like sparc64.

Yes, the unaligned accesses aren't pretty. Since in pretty much all
use cases thus far, the data can easily be made aligned, perhaps it
makes sense to create siphash24() and siphash24_unaligned(). Any
thoughts on doing something like that?
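
Something like this, perhaps -- siphash24_any() is just a placeholder name
for the dispatching helper, and siphash24_unaligned() would keep the
current get_unaligned_le64() loads:

static inline u64 siphash24_any(const u8 *data, size_t len,
                                const u8 key[SIPHASH24_KEY_LEN])
{
        /* Fast path for callers that can guarantee 8-byte alignment. */
        if (IS_ALIGNED((unsigned long)data, sizeof(u64)))
                return siphash24(data, len, key);
        return siphash24_unaligned(data, len, key);
}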

Jason


Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform

2016-12-14 Thread Jason A. Donenfeld
Hi David,

On Wed, Dec 14, 2016 at 10:51 AM, David Laight  wrote:
> From: Jason A. Donenfeld
>> Sent: 14 December 2016 00:17
>> This gives a clear speed and security improvement. Rather than manually
>> filling MD5 buffers, we simply create a layout by a simple anonymous
>> struct, for which gcc generates rather efficient code.
> ...
>> + const struct {
>> + struct in6_addr saddr;
>> + struct in6_addr daddr;
>> + __be16 sport;
>> + __be16 dport;
>> + } __packed combined = {
>> + .saddr = *(struct in6_addr *)saddr,
>> + .daddr = *(struct in6_addr *)daddr,
>> + .sport = sport,
>> + .dport = dport
>> + };
>
> You need to look at the effect of marking this (and the other)
> structures 'packed' on architectures like sparc64.

In all current uses of __packed in the code, I think the impact is
precisely zero, because all structures have members in descending
order of size, with each member being a perfect multiple of the one
below it. The __packed is therefore just there for safety, in case
somebody comes in and screws everything up by sticking a u8 in
between. In that case, it wouldn't be desirable to hash the structure
padding bits. In the worst case, I don't believe the impact would be
worse than a byte-by-byte memcpy, which is what the old code did. But
anyway, these structures are already naturally packed, so the
present impact is nil.

Jason


Re: [PATCH] arp: do neigh confirm based on sk arg

2016-12-14 Thread kbuild test robot
Hi YueHaibing,

[auto build test WARNING on v4.9-rc8]
[cannot apply to net/master net-next/master sparc-next/master next-20161214]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/YueHaibing/arp-do-neigh-confirm-based-on-sk-arg/20161214-191755
config: openrisc-or1ksim_defconfig (attached as .config)
compiler: or32-linux-gcc (GCC) 4.5.1-or32-1.0rc1
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=openrisc 

All warnings (new ones prefixed by >>):

   net/ipv4/tcp_input.c: In function 'tcp_rcv_state_process':
>> net/ipv4/tcp_input.c:6003:21: warning: unused variable 'dst'

vim +/dst +6003 net/ipv4/tcp_input.c

^1da177e4 Linus Torvalds 2005-04-16  5987*/
168a8f580 Jerry Chu  2012-08-31  5988   
tcp_rearm_rto(sk);
168a8f580 Jerry Chu  2012-08-31  5989   } else
^1da177e4 Linus Torvalds 2005-04-16  5990   
tcp_init_metrics(sk);
^1da177e4 Linus Torvalds 2005-04-16  5991  
c0402760f Yuchung Cheng  2016-09-19  5992   if 
(!inet_csk(sk)->icsk_ca_ops->cong_control)
02cf4ebd8 Neal Cardwell  2013-10-21  5993   
tcp_update_pacing_rate(sk);
02cf4ebd8 Neal Cardwell  2013-10-21  5994  
61eb90035 Joe Perches2013-05-24  5995   /* Prevent spurious 
tcp_cwnd_restart() on first data packet */
^1da177e4 Linus Torvalds 2005-04-16  5996   tp->lsndtime = 
tcp_time_stamp;
^1da177e4 Linus Torvalds 2005-04-16  5997  
^1da177e4 Linus Torvalds 2005-04-16  5998   
tcp_initialize_rcv_mss(sk);
^1da177e4 Linus Torvalds 2005-04-16  5999   tcp_fast_path_on(tp);
^1da177e4 Linus Torvalds 2005-04-16  6000   break;
^1da177e4 Linus Torvalds 2005-04-16  6001  
c48b22daa Joe Perches2013-05-24  6002   case TCP_FIN_WAIT1: {
c48b22daa Joe Perches2013-05-24 @6003   struct dst_entry *dst;
c48b22daa Joe Perches2013-05-24  6004   int tmo;
c48b22daa Joe Perches2013-05-24  6005  
168a8f580 Jerry Chu  2012-08-31  6006   /* If we enter the 
TCP_FIN_WAIT1 state and we are a
168a8f580 Jerry Chu  2012-08-31  6007* Fast Open socket and 
this is the first acceptable
168a8f580 Jerry Chu  2012-08-31  6008* ACK we have 
received, this would have acknowledged
168a8f580 Jerry Chu  2012-08-31  6009* our SYNACK so stop 
the SYNACK timer.
168a8f580 Jerry Chu  2012-08-31  6010*/
00db41243 Ian Morris 2015-04-03  6011   if (req) {

:: The code at line 6003 was first introduced by commit
:: c48b22daa6062fff9eded311b4d6974c29b40487 tcp: Remove 2 indentation 
levels in tcp_rcv_state_process

:: TO: Joe Perches 
:: CC: David S. Miller 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH 1/1] Fixed to BUG_ON to WARN_ON def

2016-12-14 Thread Tariq Toukan

Thanks Ozgur for your report.


On 12/12/2016 8:18 PM, Leon Romanovsky wrote:

On Mon, Dec 12, 2016 at 03:04:28PM +0200, Ozgur Karatas wrote:

Dear Romanovsky;

Please avoid top-posting in your replies.
Thanks


I'm trying to learn english and I apologize for my mistake words and phrases. So, I think the code 
when call to "sg_set_buf" and next time set memory and buffer. For example, isn't to call 
"WARN_ON" function, get a error to implicit declaration, right?

Because, you will use to "BUG_ON" get a error implicit declaration of functions.

I'm not sure that I followed you. mem->offset is set by sg_set_buf from
the buf variable returned by dma_alloc_coherent(). HW needs to get the very
precise size of this buf, in multiples of pages and aligned to page
boundaries.


 sg_set_buf(mem, buf, PAGE_SIZE << order);
 WARN_ON(mem->offset);

See the patch inline which removes this BUG_ON in proper and safe way.

 From 7babe807affa2b27d51d3610afb75b693929ea1a Mon Sep 17 00:00:00 2001
From: Leon Romanovsky 
Date: Mon, 12 Dec 2016 20:02:45 +0200
Subject: [PATCH] net/mlx4: Remove BUG_ON from ICM allocation routine

This patch removes BUG_ON() macro from mlx4_alloc_icm_coherent()
by checking DMA address aligment in advance and performing proper
folding in case of error.

Fixes: 5b0bf5e25efe ("mlx4_core: Support ICM tables in coherent memory")
Reported-by: Ozgur Karatas 
Signed-off-by: Leon Romanovsky 
---
  drivers/net/ethernet/mellanox/mlx4/icm.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
b/drivers/net/ethernet/mellanox/mlx4/icm.c
index 2a9dd46..e1f9e7c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -118,8 +118,13 @@ static int mlx4_alloc_icm_coherent(struct device *dev, 
struct scatterlist *mem,
if (!buf)
return -ENOMEM;

+   if (offset_in_page(buf)) {
+   dma_free_coherent(dev, PAGE_SIZE << order,
+ buf, sg_dma_address(mem));
+   return -ENOMEM;
+   }
+
sg_set_buf(mem, buf, PAGE_SIZE << order);
-   BUG_ON(mem->offset);
sg_dma_len(mem) = PAGE_SIZE << order;
return 0;
  }
--

Thanks Leon for the patch. It is the right way to do so.
Reviewed-by: Tariq Toukan 

We will submit Leon's patch in a new email.

Regards,
Tariq

2.10.2





Re: Synopsys Ethernet QoS

2016-12-14 Thread Pavel Machek
Hi!

> So if there is a long time before handling interrupts,
> I guess that it makes sense that one stream could
> get an advantage in the net scheduler.
> 
> If I find the time, and if no one beats me to it, I will try to replace
> the normal timers with HR timers + a smaller default timeout.
> 

Can you try something like this? Highres timers will be needed, too,
but this fixes the logic problem.

You'll need to apply it twice as code is copy&pasted.

Best regards,
Pavel

+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c

 */
priv->tx_count_frames += nfrags + 1;
if (likely(priv->tx_coal_frames > priv->tx_count_frames)) {
-   mod_timer(&priv->txtimer,
- STMMAC_COAL_TIMER(priv->tx_coal_timer));
+   if (priv->tx_count_frames == nfrags + 1)
+   mod_timer(&priv->txtimer,
+ STMMAC_COAL_TIMER(priv->tx_coal_timer));
} else {
priv->tx_count_frames = 0;
priv->hw->desc->set_tx_ic(desc);


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


Re: [PATCH] vsock: lookup and setup guest_cid inside vhost_vsock_lock

2016-12-14 Thread Stefan Hajnoczi
On Wed, Dec 14, 2016 at 07:24:36PM +0800, Gao feng wrote:
> Multiple vsocks may set up the same cid at the same time.
> 
> Signed-off-by: Gao feng 
> ---
>  drivers/vhost/vsock.c | 25 +
>  1 file changed, 17 insertions(+), 8 deletions(-)

Good catch, a classic time-of-check-to-time-of-use race condition.

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function

2016-12-14 Thread Jason A. Donenfeld
Hi Hannes,

On Wed, Dec 14, 2016 at 12:21 PM, Hannes Frederic Sowa
 wrote:
> Can you show or cite benchmarks in comparison with jhash? Last time I
> looked, especially for short inputs, siphash didn't beat jhash (also on
> all the 32 bit devices etc.).

I assume that jhash is likely faster than siphash, but I wouldn't be
surprised if with optimization we can make siphash at least pretty
close on 64-bit platforms. (I'll do some tests though; maybe I'm wrong
and jhash is already slower.)

With that said, siphash is here to replace uses of jhash where
hashtable poisoning vulnerabilities make it necessary. Where there's
no significant security improvement, if there's no speed improvement
either, then of course nothing's going to change.

I should have mentioned md5_transform in this first message too, as
two other patches in this series actually replace md5_transform usage
with siphash. I think in this case, siphash is a clear performance
winner (and security winner) over md5_transform. So if the push back
against replacing jhash usages is just too high, at the very least it
remains useful already for the md5_transform usage.


> This pretty much depends on the linearity of the hash function? I don't
> think a crypto secure hash function is needed for a hash table. Albeit I
> agree that siphash certainly looks good to be used here.

In order to prevent the aforementioned poisoning attacks, a PRF with
perfect linearity is required, which is what's achieved when it's a
cryptographically secure one. Check out section 7 of
https://131002.net/siphash/siphash.pdf .

> I am pretty sure that SipHash still needs a random key per hash table
> also. So far it was only the choice of hash function you are questioning.

Siphash needs a random secret key, yes. The point is that the hash
function remains secure so long as the secret key is kept secret.
Other functions can't make the same guarantee, and so nervous periodic
key rotation is necessary, but in most cases nothing is done, and so
things just leak over time.
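
Concretely, the usage pattern I have in mind is roughly the following
sketch (hashtable_secret/hashtable_hash are placeholder names;
net_get_random_once() is the existing once-per-boot helper):

static u8 hashtable_secret[SIPHASH24_KEY_LEN] __read_mostly;

static u64 hashtable_hash(const u8 *data, size_t len)
{
        /* Generate the secret once per boot, then just keep it secret. */
        net_get_random_once(hashtable_secret, sizeof(hashtable_secret));
        return siphash24(data, len, hashtable_secret);
}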


> Hmm, I tried to follow up with all the HashDoS work and so far didn't
> see any HashDoS attacks against the Jenkins/SpookyHash family.
>
> If this is an issue we might need to also put those changes into stable.

jhash just isn't secure; it's not a cryptographically secure PRF. If
there hasn't already been an academic paper put out there about it
this year, let's make this thread 1000 messages long to garner
attention, and next year perhaps we'll see one. No doubt that
motivated government organizations, defense contractors, criminals,
and other netizens have already done research in private. Replacing
insecure functions with secure functions is usually a good thing.

Jason


Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform

2016-12-14 Thread Hannes Frederic Sowa
On 14.12.2016 13:53, Jason A. Donenfeld wrote:
> Hi David,
> 
> On Wed, Dec 14, 2016 at 10:51 AM, David Laight  
> wrote:
>> From: Jason A. Donenfeld
>>> Sent: 14 December 2016 00:17
>>> This gives a clear speed and security improvement. Rather than manually
>>> filling MD5 buffers, we simply create a layout by a simple anonymous
>>> struct, for which gcc generates rather efficient code.
>> ...
>>> + const struct {
>>> + struct in6_addr saddr;
>>> + struct in6_addr daddr;
>>> + __be16 sport;
>>> + __be16 dport;
>>> + } __packed combined = {
>>> + .saddr = *(struct in6_addr *)saddr,
>>> + .daddr = *(struct in6_addr *)daddr,
>>> + .sport = sport,
>>> + .dport = dport
>>> + };
>>
>> You need to look at the effect of marking this (and the other)
>> structures 'packed' on architectures like sparc64.
> 
> In all current uses of __packed in the code, I think the impact is
> precisely zero, because all structures have members in descending
> order of size, with each member being a perfect multiple of the one
> below it. The __packed is therefore just there for safety, in case
> somebody comes in and screws everything up by sticking a u8 in
> between. In that case, it wouldn't be desirable to hash the structure
> padding bits. In the worst case, I don't believe the impact would be
> worse than a byte-by-byte memcpy, which is what the old code did. But
> anyway, these structures are already naturally packed, so the
> present impact is nil.

__packed not only removes all padding of the struct but also changes the
alignment assumptions for the whole struct itself. The rule that the struct
is aligned to the maximum alignment of its members is no longer true. That
said, the code accessing this struct will change (not on archs that can
deal efficiently with unaligned access, but on others).

A proper test for not introducing padding is to use something with
BUILD_BUG_ON. Also, gcc clears the padding of the struct, so padding
shouldn't be that bad in there (it is better than byte access on mips).
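
I.e. something like this sketch, called from wherever the tuple struct is
built (struct and function names made up):

static void check_tuple_layout(void)
{
        struct tuple_example {
                struct in6_addr saddr;
                struct in6_addr daddr;
                __be16 sport;
                __be16 dport;
        };

        /* No __packed: instead, fail the build if padding ever appears. */
        BUILD_BUG_ON(sizeof(struct tuple_example) !=
                     2 * sizeof(struct in6_addr) + 2 * sizeof(__be16));
}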

Btw. I think gcc7 gets support for store merging optimization.

Bye,
Hannes



Re: [PATCH 3/3] netns: fix net_generic() "id - 1" bloat

2016-12-14 Thread Alexey Dobriyan
On Tue, Dec 13, 2016 at 5:42 PM, David Laight  wrote:
> From: Alexey Dobriyan
>> Sent: 13 December 2016 14:23
> ...
>> Well, the point of the patch is to save .text, so might as well save
>> as much as possible. Any form other than "ptr[id]" is going
>> to be either bigger or bigger and slower and "ptr" should be the first field.
>
> You've not read and understood the next bit:
>
>> > However if you offset the 'id' values so that only
>> > values 2 up are valid the code becomes:
>> > return net->gen2->ptr[id - 2];
>> > which will be exactly the same code as:
>> > return net->gen1->ptr[id];
>> > but it is much more obvious that 'id' values must be >= 2.
>> >
>> > The '2' should be generated from the structure offset, but with my method
>> > is doesn't actually matter if it is wrong.
>
> If you have foo->bar[id - const] then the compiler has to add the
> offset of 'bar' and subtract for 'const'.
> If the numbers match no add or subtract is needed.
>
> It is much cleaner to do this by explicitly removing the offset on the
> accesses than using a union.

Surprisingly, the trick only works if the array index is cast to "unsigned long"
before subtracting.

Code becomes

...
ptr = ng->ptr[(unsigned long)id - 3];
...

I'll post a patch when net-next reopens.
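
In context the accessor would look roughly like this (sketch only; details
such as the constant may still change in the actual patch):

static inline void *net_generic(const struct net *net, unsigned int id)
{
        struct net_generic *ng;
        void *ptr;

        rcu_read_lock();
        ng = rcu_dereference(net->gen);
        /* The cast lets gcc fold the "- 3" into the constant offset of
         * ptr[] inside struct net_generic, so no runtime subtract. */
        ptr = ng->ptr[(unsigned long)id - 3];
        rcu_read_unlock();

        return ptr;
}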


Re: Synopsys Ethernet QoS

2016-12-14 Thread Joao Pinto

Hi,

At 12:57 PM on 12/14/2016, Pavel Machek wrote:
> Hi!
> 
>> So if there is a long time before handling interrupts,
>> I guess that it makes sense that one stream could
>> get an advantage in the net scheduler.
>>
>> If I find the time, and if no one beats me to it, I will try to replace
>> the normal timers with HR timers + a smaller default timeout.
>>
> 
> Can you try something like this? Highres timers will be needed, too,
> but this fixes the logic problem.
> 
> You'll need to apply it twice as code is copy&pasted.
> 
> Best regards,
>   Pavel
> 
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> 
>*/
>   priv->tx_count_frames += nfrags + 1;
>   if (likely(priv->tx_coal_frames > priv->tx_count_frames)) {
> - mod_timer(&priv->txtimer,
> -   STMMAC_COAL_TIMER(priv->tx_coal_timer));
> + if (priv->tx_count_frames == nfrags + 1)
> + mod_timer(&priv->txtimer,
> +   STMMAC_COAL_TIMER(priv->tx_coal_timer));
>   } else {
>   priv->tx_count_frames = 0;
>   priv->hw->desc->set_tx_ic(desc);
> 
> 

I know that this is completely off topic, but I am facing a difficulty with
stmmac. I have interrupts, the MAC well configured, and RX packets being
received successfully, but TX is not working, resulting in Tx errors = Total TX
packets. I have done a lot of debugging and my conclusion is that for some
reason, when using stmmac, after starting TX DMA the hw state machine enters a
dead-end state, resulting in those errors. Has anyone faced this problem?

Thanks.


Re: [PATCHv3 perf/core 0/7] Reuse libbpf from samples/bpf

2016-12-14 Thread Arnaldo Carvalho de Melo
On Fri, Dec 09, 2016 at 04:30:54PM +0100, Daniel Borkmann wrote:
> Hi Arnaldo,
> 
> On 12/09/2016 04:09 PM, Arnaldo Carvalho de Melo wrote:
> > Em Thu, Dec 08, 2016 at 06:46:13PM -0800, Joe Stringer escreveu:
> > > (Was "libbpf: Synchronize implementations")
> > > 
> > > Update tools/lib/bpf to provide the remaining bpf wrapper pieces needed 
> > > by the
> > > samples/bpf/ code, then get rid of all of the duplicate BPF libraries in
> > > samples/bpf/libbpf.[ch].
> > > 
> > > ---
> > > v3: Add ack for first patch.
> > >  Split out second patch from v2 into separate changes for remaining 
> > > diff.
> > >  Add patches to switch samples/bpf over to using tools/lib/.
> > > v2: https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html
> > >  Don't shift non-bpf code into libbpf.
> > >  Drop the patch to synchronize ELF definitions with tc.
> > > v1: https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html
> > >  First post.
> > 
> > Thanks, applied after addressing the -I$(objtree) issue raised by Wang,
> 
> [ Sorry for late reply. ]
> 
> First of all, glad to see us getting rid of the duplicate lib eventually! :)
> 
> Please note that this might result in hopefully just a minor merge issue
> with net-next. Looks like patch 4/7 touches test_maps.c and test_verifier.c,
> which moved to a new bpf selftest suite [1] this net-next cycle. Seems it's
> just log buffer and some renames there, which can be discarded for both
> files sitting in selftests.

Yeah, I've got to this point, and the merge has a little bit more than
that, including BPF_PROG_ATTACH/BPF_PROG_DETACH, etc, working on it...

- Arnaldo


Re: [net-next PATCH v5 1/6] net: virtio dynamically disable/enable LRO

2016-12-14 Thread Michael S. Tsirkin
On Thu, Dec 08, 2016 at 04:04:58PM -0800, John Fastabend wrote:
> On 16-12-08 01:36 PM, Michael S. Tsirkin wrote:
> > On Wed, Dec 07, 2016 at 12:11:11PM -0800, John Fastabend wrote:
> >> This adds support for dynamically setting the LRO feature flag. The
> >> message to control guest features in the backend uses the
> >> CTRL_GUEST_OFFLOADS msg type.
> >>
> >> Signed-off-by: John Fastabend 
> >> ---
> >>  drivers/net/virtio_net.c |   40 +++-
> >>  1 file changed, 39 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index a21d93a..a5c47b1 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -1419,6 +1419,36 @@ static void virtnet_init_settings(struct net_device 
> >> *dev)
> >>.set_settings = virtnet_set_settings,
> >>  };
> >>  
> >> +static int virtnet_set_features(struct net_device *netdev,
> >> +  netdev_features_t features)
> >> +{
> >> +  struct virtnet_info *vi = netdev_priv(netdev);
> >> +  struct virtio_device *vdev = vi->vdev;
> >> +  struct scatterlist sg;
> >> +  u64 offloads = 0;
> >> +
> >> +  if (features & NETIF_F_LRO)
> >> +  offloads |= (1 << VIRTIO_NET_F_GUEST_TSO4) |
> >> +  (1 << VIRTIO_NET_F_GUEST_TSO6);
> >> +
> >> +  if (features & NETIF_F_RXCSUM)
> >> +  offloads |= (1 << VIRTIO_NET_F_GUEST_CSUM);
> >> +
> >> +  if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
> >> +  sg_init_one(&sg, &offloads, sizeof(uint64_t));
> >> +  if (!virtnet_send_command(vi,
> >> +VIRTIO_NET_CTRL_GUEST_OFFLOADS,
> >> +VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET,
> >> +&sg)) {
> > 
> > Hmm I just realised that this will slow down setups that bridge
> > virtio net interfaces since bridge calls this if provided.
> > See below.
> 
> 
> Really? What code is trying to turn off GRO via the GUEST_OFFLOADS LRO
> command? My qemu/Linux setup has a set of tap/vhost devices attached to
> a bridge and all of them have LRO enabled even with this patch series.
> 
> I must be missing a setup handler somewhere?
> 
> > 
> >> +  dev_warn(&netdev->dev,
> >> +   "Failed to set guest offloads by virtnet 
> >> command.\n");
> >> +  return -EINVAL;
> >> +  }
> >> +  }
> > 
> > Hmm if VIRTIO_NET_F_CTRL_GUEST_OFFLOADS is off, this fails
> > silently. It might actually be a good idea to avoid
> > breaking setups.
> > 
> >> +
> >> +  return 0;
> >> +}
> >> +
> >>  static const struct net_device_ops virtnet_netdev = {
> >>.ndo_open= virtnet_open,
> >>.ndo_stop= virtnet_close,
> >> @@ -1435,6 +1465,7 @@ static void virtnet_init_settings(struct net_device 
> >> *dev)
> >>  #ifdef CONFIG_NET_RX_BUSY_POLL
> >>.ndo_busy_poll  = virtnet_busy_poll,
> >>  #endif
> >> +  .ndo_set_features   = virtnet_set_features,
> >>  };
> >>  
> >>  static void virtnet_config_changed_work(struct work_struct *work)
> >> @@ -1815,6 +1846,12 @@ static int virtnet_probe(struct virtio_device *vdev)
> >>if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
> >>dev->features |= NETIF_F_RXCSUM;
> >>  
> >> +  if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) &&
> >> +  virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6)) {
> >> +  dev->features |= NETIF_F_LRO;
> >> +  dev->hw_features |= NETIF_F_LRO;
> > 
> > So the issue is I think that the virtio "LRO" isn't really
> > LRO, it's typically just GRO forwarded to guests.
> > So these are easily re-split along MTU boundaries,
> > which makes it ok to forward these across bridges.
> > 
> > It's not nice that we don't document this in the spec,
> > but it's the reality and people rely on this.
> > 
> > For now, how about doing a custom thing and just disable/enable
> > it as XDP is attached/detached?
> 
> The annoying part about doing this is ethtool will say that it is fixed
> yet it will be changed by a seemingly unrelated operation. I'm not sure I
> like the idea of automatically configuring the link via xdp_set.

I really don't like the idea of dropping performance
by a factor of 3 for people bridging two virtio net
interfaces.

So how about a simple approach for now, just disable
XDP if GUEST_TSO is enabled?

We can discuss better approaches in next version.
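
I mean something along these lines in the XDP setup path (sketch; where
exactly it lands depends on your series):

static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog)
{
        struct virtnet_info *vi = netdev_priv(dev);

        /* Refuse XDP while guest TSO offloads are negotiated. */
        if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
            virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6)) {
                netdev_warn(dev, "can't set XDP while guest TSO is on\n");
                return -EOPNOTSUPP;
        }

        /* ... rest of the XDP setup from your series ... */
        return 0;
}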


> > 
> >> +  }
> >> +
> >>dev->vlan_features = dev->features;
> >>  
> >>/* MTU range: 68 - 65535 */
> >> @@ -2057,7 +2094,8 @@ static int virtnet_restore(struct virtio_device 
> >> *vdev)
> >>VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
> >>VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
> >>VIRTIO_NET_F_CTRL_MAC_ADDR, \
> >> -  VIRTIO_NET_F_MTU
> >> +  VIRTIO_NET_F_MTU, \
> >> +  VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
> >>  
> >>  static unsigned int features[] = {
> >>  

Re: [PATCH] arp: do neigh confirm based on sk arg

2016-12-14 Thread kbuild test robot
Hi YueHaibing,

[auto build test WARNING on v4.9-rc8]
[cannot apply to net/master net-next/master sparc-next/master next-20161214]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/YueHaibing/arp-do-neigh-confirm-based-on-sk-arg/20161214-191755
reproduce: make htmldocs

All warnings (new ones prefixed by >>):

>> include/net/sock.h:452: warning: No description found for parameter 
>> 'sk_dst_pending_confirm'

vim +/sk_dst_pending_confirm +452 include/net/sock.h

^1da177e4 Linus Torvalds  2005-04-16  436   int 
sk_write_pending;
4ea59a6cc YueHaibing  2016-12-14  437   unsigned short  
sk_dst_pending_confirm;
d5f642384 Alexey Dobriyan 2008-11-04  438  #ifdef CONFIG_SECURITY
^1da177e4 Linus Torvalds  2005-04-16  439   void
*sk_security;
d5f642384 Alexey Dobriyan 2008-11-04  440  #endif
2a56a1fec Tejun Heo   2015-12-07  441   struct sock_cgroup_data 
sk_cgrp_data;
baac50bbc Johannes Weiner 2016-01-14  442   struct mem_cgroup   
*sk_memcg;
^1da177e4 Linus Torvalds  2005-04-16  443   void
(*sk_state_change)(struct sock *sk);
676d23690 David S. Miller 2014-04-11  444   void
(*sk_data_ready)(struct sock *sk);
^1da177e4 Linus Torvalds  2005-04-16  445   void
(*sk_write_space)(struct sock *sk);
^1da177e4 Linus Torvalds  2005-04-16  446   void
(*sk_error_report)(struct sock *sk);
^1da177e4 Linus Torvalds  2005-04-16  447   int 
(*sk_backlog_rcv)(struct sock *sk,
^1da177e4 Linus Torvalds  2005-04-16  448   
  struct sk_buff *skb);
^1da177e4 Linus Torvalds  2005-04-16  449   void
(*sk_destruct)(struct sock *sk);
ef456144d Craig Gallek2016-01-04  450   struct sock_reuseport __rcu 
*sk_reuseport_cb;
a4298e452 Eric Dumazet2016-04-01  451   struct rcu_head sk_rcu;
^1da177e4 Linus Torvalds  2005-04-16 @452  };
^1da177e4 Linus Torvalds  2005-04-16  453  
559835ea7 Pravin B Shelar 2013-09-24  454  #define __sk_user_data(sk) ((*((void 
__rcu **)&(sk)->sk_user_data)))
559835ea7 Pravin B Shelar 2013-09-24  455  
559835ea7 Pravin B Shelar 2013-09-24  456  #define 
rcu_dereference_sk_user_data(sk) rcu_dereference(__sk_user_data((sk)))
559835ea7 Pravin B Shelar 2013-09-24  457  #define rcu_assign_sk_user_data(sk, 
ptr) rcu_assign_pointer(__sk_user_data((sk)), ptr)
559835ea7 Pravin B Shelar 2013-09-24  458  
4a17fd522 Pavel Emelyanov 2012-04-19  459  /*
4a17fd522 Pavel Emelyanov 2012-04-19  460   * SK_CAN_REUSE and SK_NO_REUSE on a 
socket mean that the socket is OK

:: The code at line 452 was first introduced by commit
:: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Linux-2.6.12-rc2

:: TO: Linus Torvalds 
:: CC: Linus Torvalds 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCHv2 2/5] sh_eth: enable wake-on-lan for Gen2 devices

2016-12-14 Thread Sergei Shtylyov

Hello!

   You forgot "R-Car" before "Gen2" in the subject.

On 12/12/2016 07:09 PM, Niklas Söderlund wrote:


Tested on Gen2 r8a7791/Koelsch.

Signed-off-by: Niklas Söderlund 
---
 drivers/net/ethernet/renesas/sh_eth.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/renesas/sh_eth.c 
b/drivers/net/ethernet/renesas/sh_eth.c
index 87640b9..348ed22 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c
@@ -624,8 +624,9 @@ static struct sh_eth_cpu_data r8a779x_data = {

.register_type  = SH_ETH_REG_FAST_RCAR,

-   .ecsr_value = ECSR_PSRTO | ECSR_LCHNG | ECSR_ICD,
-   .ecsipr_value   = ECSIPR_PSRTOIP | ECSIPR_LCHNGIP | ECSIPR_ICDIP,
+   .ecsr_value = ECSR_PSRTO | ECSR_LCHNG | ECSR_ICD | ECSR_MPD,
+   .ecsipr_value   = ECSIPR_PSRTOIP | ECSIPR_LCHNGIP | ECSIPR_ICDIP |
+ ECSIPR_MPDIP,


  These expressions seem to have been sorted by the bit # before your patch, 
now they aren't... care to fix? :-)


[...]

MBR, Sergei



Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform

2016-12-14 Thread Jason A. Donenfeld
Hi Hannes,

Thanks for the feedback.

> __packed not only removes all padding of the struct but also changes the
> alignment assumptions for the whole struct itself. The rule that the struct
> is aligned to the maximum alignment of its members is no longer true. That
> said, the code accessing this struct will change (not on archs that can
> deal efficiently with unaligned access, but on others).

That's interesting. There currently aren't any alignment requirements
in siphash because we use the unaligned helper functions, but as David
pointed out in another thread, maybe that too should change. In that
case, we'd have an aligned-only version of the function that requires
8-byte aligned input. Perhaps the best way to go about that would be
to just mark the struct as __packed __aligned(8). Or, I guess, since
64-bit accesses get split into two on 32-bit, that'd be best described
as __packed __aligned(sizeof(long)). Would that be an acceptable
solution?
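
Concretely, that would turn the struct from this patch into (illustration
only):

        const struct {
                struct in6_addr saddr;
                struct in6_addr daddr;
                __be16 sport;
                __be16 dport;
        } __packed __aligned(sizeof(long)) combined = {
                .saddr = *(struct in6_addr *)saddr,
                .daddr = *(struct in6_addr *)daddr,
                .sport = sport,
                .dport = dport
        };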

Jason


[PATCH] vsock: lookup and setup guest_cid inside vhost_vsock_lock

2016-12-14 Thread Gao feng
Multiple vsocks may set up the same cid at the same time.

Signed-off-by: Gao feng 
---
 drivers/vhost/vsock.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index e3b30ea..a08332b 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -50,11 +50,10 @@ static u32 vhost_transport_get_local_cid(void)
return VHOST_VSOCK_DEFAULT_HOST_CID;
 }
 
-static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
+static struct vhost_vsock *__vhost_vsock_get(u32 guest_cid)
 {
struct vhost_vsock *vsock;
 
-   spin_lock_bh(&vhost_vsock_lock);
list_for_each_entry(vsock, &vhost_vsock_list, list) {
u32 other_cid = vsock->guest_cid;
 
@@ -63,15 +62,24 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
continue;
 
if (other_cid == guest_cid) {
-   spin_unlock_bh(&vhost_vsock_lock);
return vsock;
}
}
-   spin_unlock_bh(&vhost_vsock_lock);
 
return NULL;
 }
 
+static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
+{
+   struct vhost_vsock *vsock;
+
+   spin_lock_bh(&vhost_vsock_lock);
+   vsock = __vhost_vsock_get(guest_cid);
+   spin_unlock_bh(&vhost_vsock_lock);
+
+   return vsock;
+}
+
 static void
 vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
struct vhost_virtqueue *vq)
@@ -562,11 +570,12 @@ static int vhost_vsock_set_cid(struct vhost_vsock *vsock, 
u64 guest_cid)
return -EINVAL;
 
/* Refuse if CID is already in use */
-   other = vhost_vsock_get(guest_cid);
-   if (other && other != vsock)
-   return -EADDRINUSE;
-
spin_lock_bh(&vhost_vsock_lock);
+   other = __vhost_vsock_get(guest_cid);
+   if (other && other != vsock) {
+   spin_unlock_bh(&vhost_vsock_lock);
+   return -EADDRINUSE;
+   }
vsock->guest_cid = guest_cid;
spin_unlock_bh(&vhost_vsock_lock);
 
-- 
2.5.5



Re: [PATCH net-next] net: remove abuse of VLAN DEI/CFI bit

2016-12-14 Thread Michał Mirosław
On Tue, Dec 13, 2016 at 05:21:18PM -0800, Stephen Hemminger wrote:
> On Sat,  3 Dec 2016 10:22:28 +0100 (CET)
> Michał Mirosław  wrote:
> 
> > This All-in-one patch removes abuse of VLAN CFI bit, so it can be passed
> > intact through linux networking stack.
> > 
> > Signed-off-by: Michał Mirosław 
> > ---
> > 
> > Dear NetDevs
> > 
> > I guess this needs to be split to the prep..convert[]..finish sequence,
> > but if you like it as is, then it's ready.
> > 
> > The biggest question is if the modified interface and vlan_present
> > is the way to go. This can be changed to use vlan_proto != 0 instead
> > of an extra flag bit.
> > 
> > As I can't test most of the driver changes, please look at them carefully.
> > OVS and bridge eyes are especially welcome.
> > 
> > Best Regards,
> > Michał Mirosław
> 
> Is the motivation to support 802.1ad Drop Eligability Indicator (DEI)?
> 
> If so then you need to be more verbose in the commit log, and lots more
> work is needed. You need to rename fields and validate every place a
> driver is using DEI bit to make sure it really does the right thing
> on that hardware. It is not just a mechanical change.

There are not many mentions of the CFI bit in the Linux tree. Places that
used it as VLAN_TAG_PRESENT are fixed with this patchset. Other uses are:

 - VLAN code: ignored
 - ebt_vlan: ignored
 - OVS: cleared because of netlink API assumptions
 - DSA: transferred to/from (E)DSA tag
 - drivers: gianfar: uses properly in filtering rules
 - drivers: cnic: false-positive (uses only VLAN ID, CFI bit marks the field 
'valid')
 - drivers: qedr: false-positive (like cnic)

So unless there is something hidden in the hardware, no driver does anything
special with the CFI bit.

After this patchset only OVS will need further modifications to be able to
support handling of the DEI bit.
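
For reference, the vlan_proto != 0 encoding mentioned above would make the
presence test as simple as this sketch (the _alt suffix only avoids
clashing with the existing skb_vlan_tag_present() helper):

static inline bool skb_vlan_tag_present_alt(const struct sk_buff *skb)
{
        return skb->vlan_proto != 0;
}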

Best Regards,
Michał Mirosław


Re: [PATCH 2/3] selftests: do not require bash to run bpf tests

2016-12-14 Thread Shuah Khan
On 12/14/2016 04:03 AM, Daniel Borkmann wrote:
> On 12/14/2016 11:58 AM, Rolf Eike Beer wrote:
>>  From b9d6c1b7427d708ef2d4d57aac17b700b3694d71 Mon Sep 17 00:00:00 2001
>> From: Rolf Eike Beer 
>> Date: Wed, 14 Dec 2016 09:58:12 +0100
>> Subject: [PATCH 2/3] selftests: do not require bash to run bpf tests
>>
>> Nothing in this minimal script seems to require bash. We often run these 
>> tests
>> on embedded devices where the only shell available is the busybox ash.
>>
>> Signed-off-by: Rolf Eike Beer 
> 
> Acked-by: Daniel Borkmann 

Thanks. I will get these into 4.10-rc1 or rc2

-- Shuah


Re: [v1] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap

2016-12-14 Thread kbuild test robot
Hi Arvind,

[auto build test ERROR on net-next/master]
[also build test ERROR on v4.9 next-20161214]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Arvind-Yadav/net-ethernet-cavium-octeon-octeon_mgmt-Handle-return-NULL-error-from-devm_ioremap/20161213-224624
config: mips-cavium_octeon_defconfig (attached as .config)
compiler: mips64-linux-gnuabi64-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=mips 

All errors (new ones prefixed by >>):

   drivers/net/ethernet/cavium/octeon/octeon_mgmt.c: In function 
'octeon_mgmt_probe':
>> drivers/net/ethernet/cavium/octeon/octeon_mgmt.c:1473:11: error: 'dev' 
>> undeclared (first use in this function)
  dev_err(dev, "failed to map I/O memory\n");
  ^~~
   drivers/net/ethernet/cavium/octeon/octeon_mgmt.c:1473:11: note: each 
undeclared identifier is reported only once for each function it appears in

vim +/dev +1473 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c

  1467  
  1468  p->mix = (u64)devm_ioremap(&pdev->dev, p->mix_phys, 
p->mix_size);
  1469  p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, 
p->agl_size);
  1470  p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, 
p->agl_prt_ctl_phys,
  1471 p->agl_prt_ctl_size);
  1472  if (!p->mix || !p->agl || !p->agl_prt_ctl) {
> 1473  dev_err(dev, "failed to map I/O memory\n");
  1474  result = -ENOMEM;
  1475  goto err;
  1476  }

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


RE: [PATCH 3/3] netns: fix net_generic() "id - 1" bloat

2016-12-14 Thread David Laight
From: Alexey Dobriyan 
> Sent: 14 December 2016 13:20
...
> > If you have foo->bar[id - const] then the compiler has to add the
> > offset of 'bar' and subtract for 'const'.
> > If the numbers match no add or subtract is needed.
> >
> > It is much cleaner to do this by explicitly removing the offset on the
> > accesses than using a union.
> 
> Surprisingly, the trick only works if the array index is cast to "unsigned long"
> before subtracting.
> 
> Code becomes
> 
> ...
> ptr = ng->ptr[(unsigned long)id - 3];
> ...

The compiler may also be able to optimise it away if 'id' is 'int'
rather than 'unsigned int'.

Oh, if you need casts like that use an accessor function.

David




RE: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform

2016-12-14 Thread David Laight
From: Jason A. Donenfeld
> Sent: 14 December 2016 13:44
> To: Hannes Frederic Sowa
> > __packed not only removes all padding of the struct but also changes the
> > alignment assumptions for the whole struct itself. The rule that the struct
> > is aligned to the maximum alignment of its members is no longer true. That
> > said, the code accessing this struct will change (not on archs that can
> > deal efficiently with unaligned access, but on others).
> 
> That's interesting. There currently aren't any alignment requirements
> in siphash because we use the unaligned helper functions, but as David
> pointed out in another thread, maybe that too should change. In that
> case, we'd have an aligned-only version of the function that requires
> 8-byte aligned input. Perhaps the best way to go about that would be
> to just mark the struct as __packed __aligned(8). Or, I guess, since
> 64-bit accesses get split into two on 32-bit, that'd be best described
> as __packed __aligned(sizeof(long)). Would that be an acceptable
> solution?

Just remove the __packed and ensure that the structure is 'nice'.
This includes ensuring there is no 'tail padding'.
In some cases you'll need to put the port number into a 32bit field.

I'd also require that the key be aligned.
It probably ought to be a named structure type with two 64bit members
(or with an array member that has two elements).
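
Something like (the type name is only a suggestion):

typedef struct {
        u64 key[2];
} siphash_key_t;

u64 siphash24(const u8 *data, size_t len, const siphash_key_t *key);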

David



Re: [PATCHv3 perf/core 0/7] Reuse libbpf from samples/bpf

2016-12-14 Thread Arnaldo Carvalho de Melo
On Wed, Dec 14, 2016 at 10:25:01AM -0300, Arnaldo Carvalho de Melo wrote:
> On Fri, Dec 09, 2016 at 04:30:54PM +0100, Daniel Borkmann wrote:
> > On 12/09/2016 04:09 PM, Arnaldo Carvalho de Melo wrote:
> > > > v3: Add ack for first patch.
> > > >  Split out second patch from v2 into separate changes for remaining 
> > > > diff.
> > > >  Add patches to switch samples/bpf over to using tools/lib/.
> > > > v2: https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html
> > > >  Don't shift non-bpf code into libbpf.
> > > >  Drop the patch to synchronize ELF definitions with tc.
> > > > v1: https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html
> > > >  First post.

> > > Thanks, applied after addressing the -I$(objtree) issue raised by Wang,

> > [ Sorry for late reply. ]

> > First of all, glad to see us getting rid of the duplicate lib eventually! :)
> > 
> > Please note that this might result in hopefully just a minor merge issue
> > with net-next. Looks like patch 4/7 touches test_maps.c and test_verifier.c,
> > which moved to a new bpf selftest suite [1] this net-next cycle. Seems it's
> > just log buffer and some renames there, which can be discarded for both
> > files sitting in selftests.
> 
> Yeah, I've got to this point, and the merge has a little bit more than
> that, including BPF_PROG_ATTACH/BPF_PROG_DETACH, etc, working on it...

So, Joe, can you try refreshing this work, starting from what I have in
perf/core? It has the changes coming from net-next that Daniel warned us about
and some more.

[acme@jouet linux]$ git log --oneline -5
1f125a4aa4d8 tools lib bpf: Add flags to bpf_create_map()
5adf5614f72d tools lib bpf: use __u32 from linux/types.h
ff687c38d803 tools lib bpf: Sync {tools,}/include/uapi/linux/bpf.h
53452c69b4c3 perf annotate: Fix jump target outside of function address range
2f41ae602b57 perf annotate: Support jump instruction with target as second 
operand
[acme@jouet linux]$

I tried refreshing it, but it seems samples/bpf/ needs some love and
care first, as I can't get it to build before these patches, to make
sure nothing gets broken.

Trying to bisect it I get to what seem to be multiple bisect breakages; the last
tag I got it to build, with lots of warnings, was v4.8. After that I get
things like the ones below.

I could try fixing it, but may be missing something, and want to push the other
stuff in this branch...

[acme@jouet linux]$ egrep SAMPLES\|BPF .config
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NETFILTER_XT_MATCH_BPF=m
CONFIG_NET_CLS_BPF=m
CONFIG_NET_ACT_BPF=m
CONFIG_BPF_JIT=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_BPF_EVENTS=y
# CONFIG_TEST_BPF is not set
CONFIG_SAMPLES=y
[acme@jouet linux]$ 

[acme@jouet linux]$ make -C samples/bpf
make: Entering directory '/home/acme/git/linux/samples/bpf'
make -C ../../ $PWD/
make[1]: Entering directory '/home/acme/git/linux'
  CHK include/config/kernel.release
  CHK include/generated/uapi/linux/version.h
  CHK include/generated/utsrelease.h
  CHK include/generated/timeconst.h
  CHK include/generated/bounds.h
  CHK include/generated/asm-offsets.h
  CALLscripts/checksyscalls.sh
  HOSTCC  /home/acme/git/linux/samples/bpf/bpf_load.o
In file included from /home/acme/git/linux/samples/bpf/bpf_load.c:21:0:
/home/acme/git/linux/samples/bpf/bpf_helpers.h:76:11: error: 
‘BPF_FUNC_skb_in_cgroup’ undeclared here (not in a function)
  (void *) BPF_FUNC_skb_in_cgroup;
   ^~
scripts/Makefile.host:124: recipe for target 
'/home/acme/git/linux/samples/bpf/bpf_load.o' failed
make[2]: *** [/home/acme/git/linux/samples/bpf/bpf_load.o] Error 1
Makefile:1646: recipe for target '/home/acme/git/linux/samples/bpf/' failed

[acme@jouet linux]$ make -C samples/bpf
make: Entering directory '/home/acme/git/linux/samples/bpf'
make -C ../../ $PWD/
make[1]: Entering directory '/home/acme/git/linux'
scripts/kconfig/conf  --silentoldconfig Kconfig
#
# configuration written to .config
#
  SYSTBL  arch/x86/entry/syscalls/../../include/generated/asm/syscalls_32.h
  SYSHDR  arch/x86/entry/syscalls/../../include/generated/asm/unistd_32_ia32.h
  SYSHDR  arch/x86/entry/syscalls/../../include/generated/uapi/asm/unistd_32.h
  CHK include/config/kernel.release
  UPD include/config/kernel.release
  CHK include/generated/uapi/linux/version.h
  UPD include/generated/uapi/linux/version.h
  CHK include/generated/utsrelease.h
  UPD include/generated/utsrelease.h
  CHK include/generated/timeconst.h
  CC  kernel/bounds.s
  CHK include/generated/bounds.h
  GEN scripts/gdb/linux/constants.py
  CC  arch/x86/kernel/asm-offsets.s
  CHK include/generated/asm-offsets.h
  CALLscripts/checksyscalls.sh
  HOSTCC  /home/acme/git/linux/samples/bpf/bpf_load.o
In file included from /home/acme/git/linux/samples/bpf/bpf_load.c:21:0:
/home/acme/git/linux/samples/bpf/bpf_helpers.h:49:11: error: 
‘BPF_FUNC_current_task_under_cgroup’ undeclared here (not in a function)

[PATCH net-next 1/1] driver: ipvlan: Define common functions to decrease duplicated codes used to add or del IP address

2016-12-14 Thread fgao
From: Gao Feng 

There is some duplicated code in ipvlan_add_addr6/4 and
ipvlan_del_addr6/4. Now define two common functions, ipvlan_add_addr
and ipvlan_del_addr, to reduce the duplicated code.
It should be helpful for maintaining the code.

Signed-off-by: Gao Feng 
---
 drivers/net/ipvlan/ipvlan_main.c | 68 +---
 1 file changed, 29 insertions(+), 39 deletions(-)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 693ec5b..5874d30 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -669,23 +669,22 @@ static int ipvlan_device_event(struct notifier_block 
*unused,
return NOTIFY_DONE;
 }
 
-static int ipvlan_add_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr)
+static int ipvlan_add_addr(struct ipvl_dev *ipvlan, void *iaddr, bool is_v6)
 {
struct ipvl_addr *addr;
 
-   if (ipvlan_addr_busy(ipvlan->port, ip6_addr, true)) {
-   netif_err(ipvlan, ifup, ipvlan->dev,
- "Failed to add IPv6=%pI6c addr for %s intf\n",
- ip6_addr, ipvlan->dev->name);
-   return -EINVAL;
-   }
addr = kzalloc(sizeof(struct ipvl_addr), GFP_ATOMIC);
if (!addr)
return -ENOMEM;
 
addr->master = ipvlan;
-   memcpy(&addr->ip6addr, ip6_addr, sizeof(struct in6_addr));
-   addr->atype = IPVL_IPV6;
+   if (is_v6) {
+   memcpy(&addr->ip6addr, iaddr, sizeof(struct in6_addr));
+   addr->atype = IPVL_IPV6;
+   } else {
+   memcpy(&addr->ip4addr, iaddr, sizeof(struct in_addr));
+   addr->atype = IPVL_IPV4;
+   }
list_add_tail(&addr->anode, &ipvlan->addrs);
 
/* If the interface is not up, the address will be added to the hash
@@ -697,11 +696,11 @@ static int ipvlan_add_addr6(struct ipvl_dev *ipvlan, 
struct in6_addr *ip6_addr)
return 0;
 }
 
-static void ipvlan_del_addr6(struct ipvl_dev *ipvlan, struct in6_addr 
*ip6_addr)
+static void ipvlan_del_addr(struct ipvl_dev *ipvlan, void *iaddr, bool is_v6)
 {
struct ipvl_addr *addr;
 
-   addr = ipvlan_find_addr(ipvlan, ip6_addr, true);
+   addr = ipvlan_find_addr(ipvlan, iaddr, is_v6);
if (!addr)
return;
 
@@ -712,6 +711,23 @@ static void ipvlan_del_addr6(struct ipvl_dev *ipvlan, 
struct in6_addr *ip6_addr)
return;
 }
 
+static int ipvlan_add_addr6(struct ipvl_dev *ipvlan, struct in6_addr *ip6_addr)
+{
+   if (ipvlan_addr_busy(ipvlan->port, ip6_addr, true)) {
+   netif_err(ipvlan, ifup, ipvlan->dev,
+ "Failed to add IPv6=%pI6c addr for %s intf\n",
+ ip6_addr, ipvlan->dev->name);
+   return -EINVAL;
+   }
+
+   return ipvlan_add_addr(ipvlan, ip6_addr, true);
+}
+
+static void ipvlan_del_addr6(struct ipvl_dev *ipvlan, struct in6_addr 
*ip6_addr)
+{
+   return ipvlan_del_addr(ipvlan, ip6_addr, true);
+}
+
 static int ipvlan_addr6_event(struct notifier_block *unused,
  unsigned long event, void *ptr)
 {
@@ -745,45 +761,19 @@ static int ipvlan_addr6_event(struct notifier_block 
*unused,
 
 static int ipvlan_add_addr4(struct ipvl_dev *ipvlan, struct in_addr *ip4_addr)
 {
-   struct ipvl_addr *addr;
-
if (ipvlan_addr_busy(ipvlan->port, ip4_addr, false)) {
netif_err(ipvlan, ifup, ipvlan->dev,
  "Failed to add IPv4=%pI4 on %s intf.\n",
  ip4_addr, ipvlan->dev->name);
return -EINVAL;
}
-   addr = kzalloc(sizeof(struct ipvl_addr), GFP_KERNEL);
-   if (!addr)
-   return -ENOMEM;
-
-   addr->master = ipvlan;
-   memcpy(&addr->ip4addr, ip4_addr, sizeof(struct in_addr));
-   addr->atype = IPVL_IPV4;
-   list_add_tail(&addr->anode, &ipvlan->addrs);
-
-   /* If the interface is not up, the address will be added to the hash
-* list by ipvlan_open.
-*/
-   if (netif_running(ipvlan->dev))
-   ipvlan_ht_addr_add(ipvlan, addr);
 
-   return 0;
+   return ipvlan_add_addr(ipvlan, ip4_addr, false);
 }
 
 static void ipvlan_del_addr4(struct ipvl_dev *ipvlan, struct in_addr *ip4_addr)
 {
-   struct ipvl_addr *addr;
-
-   addr = ipvlan_find_addr(ipvlan, ip4_addr, false);
-   if (!addr)
-   return;
-
-   ipvlan_ht_addr_del(addr);
-   list_del(&addr->anode);
-   kfree_rcu(addr, rcu);
-
-   return;
+   return ipvlan_del_addr(ipvlan, ip4_addr, false);
 }
 
 static int ipvlan_addr4_event(struct notifier_block *unused,
-- 
1.9.1




Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function

2016-12-14 Thread Hannes Frederic Sowa
Hello,

On 14.12.2016 14:10, Jason A. Donenfeld wrote:
> On Wed, Dec 14, 2016 at 12:21 PM, Hannes Frederic Sowa
>  wrote:
>> Can you show or cite benchmarks in comparison with jhash? Last time I
>> looked, especially for short inputs, siphash didn't beat jhash (also on
>> all the 32 bit devices etc.).
> 
> I assume that jhash is likely faster than siphash, but I wouldn't be
> surprised if with optimization we can make siphash at least pretty
> close on 64-bit platforms. (I'll do some tests though; maybe I'm wrong
> and jhash is already slower.)

Yes, numbers would be very useful here. I am mostly concerned about
the small plastic router cases. E.g. assume you double packet processing
time with a change of the hashing function: at what point is the actual
packet processing more of an attack vector than the hashtable?

> With that said, siphash is here to replace uses of jhash where
> hashtable poisoning vulnerabilities make it necessary. Where there's
> no significant security improvement, if there's no speed improvement
> either, then of course nothing's going to change.

It still changes currently well working source. ;-)

> I should have mentioned md5_transform in this first message too, as
> two other patches in this series actually replace md5_transform usage
> with siphash. I think in this case, siphash is a clear performance
> winner (and security winner) over md5_transform. So if the push back
> against replacing jhash usages is just too high, at the very least it
> remains useful already for the md5_transform usage.

MD5 is considered broken because its collision resistance is broken?
SipHash doesn't even claim to have collision resistance (which we don't
need here)?

But I agree, certainly it could be a nice speed-up!

>> This pretty much depends on the linearity of the hash function? I don't
>> think a crypto secure hash function is needed for a hash table. Albeit I
>> agree that siphash certainly looks good to be used here.
> 
> In order to prevent the aforementioned poisoning attacks, a PRF with
> perfect linearity is required, which is what's achieved when it's a
> cryptographically secure one. Check out section 7 of
> https://131002.net/siphash/siphash.pdf .

I think you mean non-linearity. Otherwise I agree that siphash is
certainly a better suited hashing algorithm as far as I know. But it
would be really interesting to compare some performance numbers. Hard to
say anything without them.

>> I am pretty sure that SipHash still needs a random key per hash table
>> also. So far it was only the choice of hash function you are questioning.
> 
> Siphash needs a random secret key, yes. The point is that the hash
> function remains secure so long as the secret key is kept secret.
> Other functions can't make the same guarantee, and so nervous periodic
> key rotation is necessary, but in most cases nothing is done, and so
> things just leak over time.
> 
> 
>> Hmm, I tried to follow up with all the HashDoS work and so far didn't
>> see any HashDoS attacks against the Jenkins/SpookyHash family.
>>
>> If this is an issue we might need to also put those changes into stable.
> 
> jhash just isn't secure; it's not a cryptographically secure PRF. If
> there hasn't already been an academic paper put out there about it
> this year, let's make this thread 1000 messages long to garner
> attention, and next year perhaps we'll see one. No doubt that
> motivated government organizations, defense contractors, criminals,
> and other netizens have already done research in private. Replacing
> insecure functions with secure functions is usually a good thing.

I think this is a weak argument.

In general I am in favor of switching to siphash, but it would be nice to
see some benchmarks with the specific kernel implementation also on some
smaller 32 bit CPUs and especially without using any SIMD instructions
(which might have been used in paper comparison).
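
Even a crude in-kernel loop would give a useful first number, e.g. this
sketch (not statistically rigorous; bench_hashes() and the buffer size are
arbitrary):

static void bench_hashes(void)
{
        static u8 buf[64], key[SIPHASH24_KEY_LEN];
        volatile u64 sink = 0;
        ktime_t t0, t1;
        int i;

        t0 = ktime_get();
        for (i = 0; i < 1000000; i++)
                sink += jhash(buf, sizeof(buf), 0);
        t1 = ktime_get();
        pr_info("jhash:     %lld ns\n", ktime_to_ns(ktime_sub(t1, t0)));

        t0 = ktime_get();
        for (i = 0; i < 1000000; i++)
                sink += siphash24(buf, sizeof(buf), key);
        t1 = ktime_get();
        pr_info("siphash24: %lld ns\n", ktime_to_ns(ktime_sub(t1, t0)));
}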

Bye,
Hannes



RE: [PATCH v2 net-next 1/2] phy: add phy fixup unregister functions

2016-12-14 Thread Woojung.Huh
> I was just about to submit an unregister patch and found this patch. Good job!
> But I think this patch may be missing something.
> If one SoC has 2 MAC ports and each port uses a different network driver,
> the 2 drivers may register fixups for the same PHY chip with different
> "run" functions because the PHY chip works in different modes.
> In such a case, this patch doesn't consider the "run" function and may cause
> problems.
> When removing the driver which registered its fixup last, it will remove the
> other driver's fixup.
> Should this condition be considered and fixed?
Good point.
The current phy fixup is an independent list from the phydev structure,
and fixups run in two places: phy_device_register() and phy_init_hw().
It's not clear that two separate fixup runs are needed, but it may be a
good idea to pass the phy fixup when calling phy_attach() or
phy_attach_direct() and put it under the phydev structure.
That way, the fixup can be called at phy_init_hw() per phy device and
removed when the phy is detached.
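
Roughly like this sketch (the fixup member of struct phy_device is
hypothetical: set at attach time, cleared in phy_detach()):

static int phy_init_hw_run_fixup(struct phy_device *phydev)
{
        int ret = phy_init_hw(phydev);

        /* Per-device fixup instead of the global fixup list. */
        if (ret == 0 && phydev->fixup)
                ret = phydev->fixup(phydev);

        return ret;
}
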
Welcome any comments.

- Woojung



netfilter -stable backport request

2016-12-14 Thread Eric Desrochers
Hi,

I would like to request a -stable backport for the following patchset, which
as we speak can be found in Pablo's nf-next:

# git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

[PATCH 1/3]
commit 2394ae21e8b652aff0db1c02e946243c1e2f5edb
netfilter: x_tables: pass xt_counters struct instead of packet counter
   
[PATCH 2/3]
commit 18b61e8161cc308cbfd06d2e2c6c0758dfd925ef
netfilter: x_tables: pass xt_counters struct to counter allocator

[PATCH 3/3]
commit 722d6785e3b29a3b9f95c4d77542a1416094786a
netfilter: x_tables: pack percpu counter allocations

Please add this to stable branches : v4.4.x, v4.8.x

The above patchset fixes a netfilter regression which introduced a
performance slowdown in binary arp/ip/ip6tables, starting at commit:

#v4.2-rc1
commit 71ae0dff02d756e4d2ca710b79f2ff5390029a5f
netfilter: xtables: use percpu rule counters

Regards,

Eric


[PATCH RFC 1/2] bpf: add a longest prefix match trie map implementation

2016-12-14 Thread Daniel Mack
This trie implements a longest prefix match algorithm that can be used
to match IP addresses to a stored set of ranges.

Internally, data is stored in an unbalanced trie of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.

Tries may be created with prefix lengths that are multiples of 8, in
the range from 8 to 2048. The key used for lookup and update operations
is a struct bpf_lpm_trie_key, and the value is a uint64_t.

The code carries more information about the internal implementation.

Signed-off-by: Daniel Mack 
Reviewed-by: David Herrmann 
---
 include/uapi/linux/bpf.h |   7 +
 kernel/bpf/Makefile  |   2 +-
 kernel/bpf/lpm_trie.c| 491 +++
 3 files changed, 499 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/lpm_trie.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0eb0e87..d564277 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -63,6 +63,12 @@ struct bpf_insn {
__s32   imm;/* signed immediate constant */
 };
 
+/* Key of a BPF_MAP_TYPE_LPM_TRIE entry */
+struct bpf_lpm_trie_key {
+   __u32   prefixlen;  /* up to 32 for AF_INET, 128 for AF_INET6 */
+   __u8data[0];/* Arbitrary size */
+};
+
 /* BPF syscall commands, see bpf(2) man-page for details. */
 enum bpf_cmd {
BPF_MAP_CREATE,
@@ -89,6 +95,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_CGROUP_ARRAY,
BPF_MAP_TYPE_LRU_HASH,
BPF_MAP_TYPE_LRU_PERCPU_HASH,
+   BPF_MAP_TYPE_LPM_TRIE,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 1276474..e1ce4f4 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,7 +1,7 @@
 obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
new file mode 100644
index 000..cae759d
--- /dev/null
+++ b/kernel/bpf/lpm_trie.c
@@ -0,0 +1,491 @@
+/*
+ * Longest prefix match list implementation
+ *
+ * Copyright (c) 2016 Daniel Mack
+ * Copyright (c) 2016 David Herrmann
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Intermediate node */
+#define LPM_TREE_NODE_FLAG_IM BIT(0)
+
+struct lpm_trie_node;
+
+struct lpm_trie_node {
+   struct rcu_head rcu;
+   struct lpm_trie_node*child[2];
+   u32 prefixlen;
+   u32 flags;
+   u64 value;
+   u8  data[0];
+};
+
+struct lpm_trie {
+   struct bpf_map  map;
+   struct lpm_trie_node*root;
+   size_t  n_entries;
+   size_t  max_prefixlen;
+   size_t  data_size;
+   spinlock_t  lock;
+};
+
+/*
+ * This trie implements a longest prefix match algorithm that can be used to
+ * match IP addresses to a stored set of ranges.
+ *
+ * Data stored in @data of struct bpf_lpm_key and struct lpm_trie_node is
+ * interpreted as big endian, so data[0] stores the most significant byte.
+ *
+ * Match ranges are internally stored in instances of struct lpm_trie_node
+ * which each contain their prefix length as well as two pointers that may
+ * lead to more nodes containing more specific matches. Each node also stores
+ * a value that is defined by and returned to userspace via the update_elem
+ * and lookup functions.
+ *
+ * For instance, let's start with a trie that was created with a prefix length
+ * of 32, so it can be used for IPv4 addresses, and one single element that
+ * matches 192.168.0.0/16. The data array would hence contain
+ * [0xc0, 0xa8, 0x00, 0x00] in big-endian notation. This documentation will
+ * stick to IP-address notation for readability though.
+ *
+ * As the trie is empty initially, the new node (1) will be placed as the root
+ * node, denoted as (R) in the example below. As there are no other nodes, both
+ * child pointers are %NULL.
+ *
+ *  ++
+ *  |   (1)  (R) |
+ *  | 192.168.0.0/16 |
+ *  |value: 1|
+ *  |   [0][1]   |
+ *  ++
+ *
+ * Next, let's add a new node (2) matching 192.168.0.0/24. As there is already
+ * a node with the same data and a smaller prefix (ie, a less specific one),
+ * node (2) will become a child of (1). The child index depends on the next bit
+ * after the prefix that (1) matches; that bit is 0 here, so (2) will become
+ * the left child of (1).
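
To make the key layout concrete, here is a rough userspace sketch of
inserting and looking up an IPv4 prefix in such a map. The
bpf_map_create/bpf_map_update/bpf_map_lookup wrappers are assumed to be
selftest-style helpers around the bpf(2) syscall, not part of this
patch:

	struct bpf_lpm_trie_key *key;
	size_t key_size = sizeof(*key) + 4;	/* 4 bytes of IPv4 data */
	uint64_t value = 1;
	int map;

	map = bpf_map_create(BPF_MAP_TYPE_LPM_TRIE, key_size,
			     sizeof(value), 128, 0);

	key = alloca(key_size);
	key->prefixlen = 16;			/* store 192.168.0.0/16 */
	memcpy(key->data, (uint8_t[]){ 192, 168, 0, 0 }, 4);
	bpf_map_update(map, key, &value, 0);

	/* a /32 lookup key returns the longest stored prefix covering it */
	key->prefixlen = 32;			/* look up 192.168.1.2 */
	memcpy(key->data, (uint8_t[]){ 192, 168, 1, 2 }, 4);
	bpf_map_lookup(map, key, &value);	/* succeeds, value == 1 */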

[PATCH RFC 0/2] bpf: add longest prefix match map

2016-12-14 Thread Daniel Mack
This patch set adds longest prefix match algorithm that can be used to
match IP addresses to a stored set of ranges. It is exposed as a bpf
map type.
   
Internally, data is stored in an unbalanced tree of nodes that has a
maximum height of n, where n is the prefixlen the trie was created
with.
 
Note that this has nothing to do with fib or fib6 and is in no way meant
to replace or share code with it. It's rather a much simpler
implementation that is specifically written with bpf maps in mind.
 
Patch 1/2 adds the implementation, and 2/2 an extensive test suite.
 
Feedback is much appreciated.
 
 
Thanks,
Daniel

Daniel Mack (1):
  bpf: add a longest prefix match trie map implementation

David Herrmann (1):
  bpf: Add tests for the lpm trie map

 include/uapi/linux/bpf.h   |   7 +
 kernel/bpf/Makefile|   2 +-
 kernel/bpf/lpm_trie.c  | 491 +
 tools/testing/selftests/bpf/.gitignore |   1 +
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 348 
 6 files changed, 850 insertions(+), 3 deletions(-)
 create mode 100644 kernel/bpf/lpm_trie.c
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

-- 
2.9.3



[PATCH RFC 2/2] bpf: Add tests for the lpm trie map

2016-12-14 Thread Daniel Mack
From: David Herrmann 

The first part of this program runs randomized tests against the
lpm-bpf-map. It implements a "Trivial Longest Prefix Match" (tlpm)
based on simple, linear, single linked lists. The implementation
should be pretty straightforward.

Based on tlpm, this inserts randomized data into bpf-lpm-maps and
verifies the trie-based bpf-map implementation behaves the same way
as tlpm.

The second part uses 'real world' IPv4 and IPv6 addresses and tests
the trie with those.

Signed-off-by: David Herrmann 
Signed-off-by: Daniel Mack 
---
 tools/testing/selftests/bpf/.gitignore |   1 +
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/test_lpm_map.c | 348 +
 3 files changed, 351 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_lpm_map.c

diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore
index 071431b..d3b1c9b 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -1,3 +1,4 @@
 test_verifier
 test_maps
 test_lru_map
+test_lpm_map
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 7a5f245..064a3e5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,8 +1,8 @@
 CFLAGS += -Wall -O2 -I../../../../usr/include
 
-test_objs = test_verifier test_maps test_lru_map
+test_objs = test_verifier test_maps test_lru_map test_lpm_map
 
-TEST_PROGS := test_verifier test_maps test_lru_map test_kmod.sh
+TEST_PROGS := test_verifier test_maps test_lru_map test_lpm_map test_kmod.sh
 TEST_FILES := $(test_objs)
 
 all: $(test_objs)
diff --git a/tools/testing/selftests/bpf/test_lpm_map.c 
b/tools/testing/selftests/bpf/test_lpm_map.c
new file mode 100644
index 000..08db750
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_lpm_map.c
@@ -0,0 +1,348 @@
+/*
+ * Randomized tests for eBPF longest-prefix-match maps
+ *
+ * This program runs randomized tests against the lpm-bpf-map. It implements a
+ * "Trivial Longest Prefix Match" (tlpm) based on simple, linear, singly linked
+ * lists. The implementation should be pretty straightforward.
+ *
+ * Based on tlpm, this inserts randomized data into bpf-lpm-maps and verifies
+ * the trie-based bpf-map implementation behaves the same way as tlpm.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "bpf_sys.h"
+#include "bpf_util.h"
+
+struct tlpm_node {
+   struct tlpm_node *next;
+   size_t n_bits;
+   uint8_t key[];
+};
+
+static struct tlpm_node *tlpm_add(struct tlpm_node *list,
+ const uint8_t *key,
+ size_t n_bits)
+{
+   struct tlpm_node *node;
+   size_t n;
+
+   /* add new entry with @key/@n_bits to @list and return new head */
+
+   n = (n_bits + 7) / 8;
+   node = malloc(sizeof(*node) + n);
+   assert(node);
+
+   node->next = list;
+   node->n_bits = n_bits;
+   memcpy(node->key, key, n);
+
+   return node;
+}
+
+static void tlpm_clear(struct tlpm_node *list)
+{
+   struct tlpm_node *node;
+
+   /* free all entries in @list */
+
+   while ((node = list)) {
+   list = list->next;
+   free(node);
+   }
+}
+
+static struct tlpm_node *tlpm_match(struct tlpm_node *list,
+   const uint8_t *key,
+   size_t n_bits)
+{
+   struct tlpm_node *best = NULL;
+   size_t i;
+
+   /*
+* Perform longest prefix-match on @key/@n_bits. That is, iterate all
+* entries and match each prefix against @key. Remember the "best"
+* entry we find (i.e., the longest prefix that matches) and return it
+* to the caller when done.
+*/
+
+   for ( ; list; list = list->next) {
+   for (i = 0; i < n_bits && i < list->n_bits; ++i) {
+   if ((key[i / 8] & (1 << (7 - i % 8))) !=
+   (list->key[i / 8] & (1 << (7 - i % 8
+   break;
+   }
+
+   if (i >= list->n_bits) {
+   if (!best || i > best->n_bits)
+   best = list;
+   }
+   }
+
+   return best;
+}
+
+static void test_lpm_basic(void)
+{
+   struct tlpm_node *list = NULL, *t1, *t2;
+
+   /* very basic, static tests to verify tlpm works as expected */
+
+   assert(!tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+
+   t1 = list = tlpm_add(list, (uint8_t[]){ 0xff }, 8);
+   assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff }, 8));
+   assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0xff }, 16));
+   assert(t1 == tlpm_match(list, (uint8_t[]){ 0xff, 0x00 }, 16));
+   assert(!tlpm_match(list, (uint8_t[]){ 0x7f }, 8));
+   assert(!tlpm_mat

Re: dsa: handling more than 1 cpu port

2016-12-14 Thread John Crispin


On 14/12/2016 12:00, Andrew Lunn wrote:
> On Wed, Dec 14, 2016 at 11:35:30AM +0100, John Crispin wrote:
>>
>>
>> On 14/12/2016 11:31, Andrew Lunn wrote:
>>> On Wed, Dec 14, 2016 at 11:01:54AM +0100, John Crispin wrote:
 Hi Andrew,

 switches supported by qca8k have 2 gmacs that we can wire an external
 mii interface to. Usually this is used to wire 2 of the SoCs MACs to the
 switch. Thw switch itself is aware of one of the MACs being the cpu port
 and expects this to be port/mac0. Using the other will break the
 hardware offloading features.
>>>
>>> Just to be sure here. There is no way to use the second port connected
>>> to the CPU as a CPU port?
>>
>> both macs are considered cpu ports and both allow for the tag to be
>> injected. for HW NAT/routing/... offloading to work, the lan ports neet
>> to trunk via port0 and not port6 however.
> 
> Maybe you can do a restricted version of the generic solution. LAN
> ports are mapped to cpu port0. WAN port to cpu port 6?
> 
>>> The Marvell chips do allow this. So i developed a proof of concept
>>> which had a mapping between cpu ports and slave ports. slave port X
>>> should you cpu port y for its traffic. This never got past proof of
>>> concept. 
>>>
>>> If this can be made to work for qca8k, i would prefer having this
>>> general concept, than specific hacks for pass through.
>>
>> oh cool, can you send those patches my way please ? how do you configure
>> this from userland ? does the cpu port get its on swdev which i just add
>> to my lan bridge and then add the 2nd cpu port to the wan bridge ?
> 
> https://github.com/lunn/linux/tree/v4.1-rc4-net-next-multiple-cpu
> 
> You don't configure anything from userland. Which was actually a
> criticism. It is in device tree. But my solution is generic. Having
> one WAN port and four bridges LAN ports is a pure marketing
> requirement. The hardware will happily do two WAN ports and 3 LAN
> ports, for example. And the switch is happy to take traffic for the
> WAN port and a LAN port over the same CPU port, and keep the traffic
> separate. So we can have some form of load balancing. We are not
> limited to 1->1, 1->4, we can do 1->2, 1->3 to increase the overall
> performance. And to the user it is all transparent.
> 
> This PoC is for the old DSA binding. The new binding makes it easier
> to express this. Which is one of the reasons for the new binding.
> 
>   Andrew
> 

Hi Andrew,

spent some time looking at this and thinking about possible solutions.
my initial idea was to detect which cpu port to use based on the cpu port
being included inside the bridge. however, that won't allow us to control
ports using the tag outside of a bridge. i think that your approach is
the only sane one. we could add a sysfs interface later, allowing us to
change the default cpu port <-> port mappings, but the device tree needs
to include some sane defaults. i'll use your patches as a base for a series.

John



ipv6: handle -EFAULT from skb_copy_bits

2016-12-14 Thread Dave Jones
It seems to be possible to craft a packet for sendmsg that triggers
the -EFAULT path in skb_copy_bits resulting in a BUG_ON that looks like:

RIP: 0010:[] [] rawv6_sendmsg+0xc30/0xc40
RSP: 0018:881f6c4a7c18  EFLAGS: 00010282
RAX: fff2 RBX: 881f6c681680 RCX: 0002
RDX: 881f6c4a7cf8 RSI: 0030 RDI: 881fed0f6a00
RBP: 881f6c4a7da8 R08:  R09: 0009
R10: 881fed0f6a00 R11: 0009 R12: 0030
R13: 881fed0f6a00 R14: 881fee39ba00 R15: 881fefa93a80

Call Trace:
 [] ? unmap_page_range+0x693/0x830
 [] inet_sendmsg+0x67/0xa0
 [] sock_sendmsg+0x38/0x50
 [] SYSC_sendto+0xef/0x170
 [] SyS_sendto+0xe/0x10
 [] do_syscall_64+0x50/0xa0
 [] entry_SYSCALL64_slow_path+0x25/0x25

Handle this in rawv6_push_pending_frames and jump to the failure path.

Signed-off-by: Dave Jones 

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 291ebc260e70..35aa82faa052 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -591,7 +591,9 @@ static int rawv6_push_pending_frames(struct sock *sk, 
struct flowi6 *fl6,
}
 
offset += skb_transport_offset(skb);
-   BUG_ON(skb_copy_bits(skb, offset, &csum, 2));
+   err = skb_copy_bits(skb, offset, &csum, 2);
+   if (err < 0)
+   goto out;
 
/* in case cksum was not initialized */
if (unlikely(csum))



ravb/sh_eth/b44: BUG: sleeping function called from invalid context

2016-12-14 Thread Geert Uytterhoeven
Hi,

When CONFIG_DEBUG_ATOMIC_SLEEP=y, running "ethtool -s eth0 speed 100"
on Salvator-X gives:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:97
in_atomic(): 1, irqs_disabled(): 128, pid: 1683, name: ethtool
CPU: 0 PID: 1683 Comm: ethtool Tainted: GW
4.9.0-salvator-x-00426-g326519df42c65007-dirty #976
Hardware name: Renesas Salvator-X board based on r8a7796 (DT)
Call trace:
[] dump_backtrace+0x0/0x208
[] show_stack+0x14/0x1c
[] dump_stack+0x94/0xb4
[] ___might_sleep+0x108/0x11c
[] __might_sleep+0x84/0x94
[] mutex_lock+0x24/0x40
[] phy_start_aneg+0x20/0x130
[] phy_ethtool_ksettings_set+0xd0/0xe8
[] ravb_set_link_ksettings+0x4c/0xa4
[] ethtool_set_settings+0xec/0xfc
[] dev_ethtool+0x188/0x17c4
[] dev_ioctl+0x53c/0x6b8
[] sock_do_ioctl.constprop.45+0x3c/0x4c
[] sock_ioctl+0x33c/0x370
[] vfs_ioctl+0x20/0x38
[] do_vfs_ioctl+0x844/0x954
[] SyS_ioctl+0x44/0x68
[] el0_svc_naked+0x24/0x28

ravb_set_link_ksettings() calls phy_ethtool_ksettings_set() with a spinlock
held and interrupts disabled, while phy_start_aneg() tries to obtain a mutex.

The same issue is present in sh_eth_set_link_ksettings() (verified) and
b44_set_link_ksettings() (code inspection).
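
One possible fix direction (a sketch only, untested) is to stop taking
the driver spinlock around the phylib call, since
phy_ethtool_ksettings_set() may sleep in phy_start_aneg():

	static int ravb_set_link_ksettings(struct net_device *ndev,
					   const struct ethtool_link_ksettings *cmd)
	{
		if (!ndev->phydev)
			return -ENODEV;

		/* no spin_lock_irqsave() here: phy_start_aneg() takes a mutex */
		return phy_ethtool_ksettings_set(ndev->phydev, cmd);
	}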

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


[v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap

2016-12-14 Thread Arvind Yadav
If devm_ioremap() fails, it returns NULL, and the kernel can then run
into a NULL-pointer dereference. This error check avoids the
NULL-pointer dereference.

Signed-off-by: Arvind Yadav 
---
 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c 
b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
index 4ab404f..33c2fec 100644
--- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
+++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
@@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct platform_device 
*pdev)
p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys,
   p->agl_prt_ctl_size);
+   if (!p->mix || !p->agl || !p->agl_prt_ctl) {
+   dev_err(&pdev->dev, "failed to map I/O memory\n");
+   result = -ENOMEM;
+   goto err;
+   }
+
spin_lock_init(&p->lock);
 
skb_queue_head_init(&p->tx_list);
-- 
2.7.4



Re: [v1] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap

2016-12-14 Thread arvind Yadav

Sorry for the build failure. I have sent new changes, which do not
have this failure.

Thanks
-Arvind

On Wednesday 14 December 2016 08:10 PM, kbuild test robot wrote:

Hi Arvind,

[auto build test ERROR on net-next/master]
[also build test ERROR on v4.9 next-20161214]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Arvind-Yadav/net-ethernet-cavium-octeon-octeon_mgmt-Handle-return-NULL-error-from-devm_ioremap/20161213-224624
config: mips-cavium_octeon_defconfig (attached as .config)
compiler: mips64-linux-gnuabi64-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
 wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
 chmod +x ~/bin/make.cross
 # save the attached .config to linux build tree
 make.cross ARCH=mips

All errors (new ones prefixed by >>):

drivers/net/ethernet/cavium/octeon/octeon_mgmt.c: In function 
'octeon_mgmt_probe':

drivers/net/ethernet/cavium/octeon/octeon_mgmt.c:1473:11: error: 'dev' 
undeclared (first use in this function)

   dev_err(dev, "failed to map I/O memory\n");
   ^~~
drivers/net/ethernet/cavium/octeon/octeon_mgmt.c:1473:11: note: each 
undeclared identifier is reported only once for each function it appears in

vim +/dev +1473 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c

   1467 
   1468 p->mix = (u64)devm_ioremap(&pdev->dev, p->mix_phys, 
p->mix_size);
   1469 p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, 
p->agl_size);
   1470 p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, 
p->agl_prt_ctl_phys,
   1471p->agl_prt_ctl_size);
   1472 if (!p->mix || !p->agl || !p->agl_prt_ctl) {

1473dev_err(dev, "failed to map I/O memory\n");

   1474 result = -ENOMEM;
   1475 goto err;
   1476 }

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread John Fastabend
On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
> On Tue, 13 Dec 2016 12:08:21 -0800
> John Fastabend  wrote:
> 
>> On 16-12-13 11:53 AM, David Miller wrote:
>>> From: John Fastabend 
>>> Date: Tue, 13 Dec 2016 09:43:59 -0800
>>>   
 What does "zero-copy send packet-pages to the application/socket that
 requested this" mean? At the moment on x86 page-flipping appears to be
 more expensive than memcpy (I can post some data shortly) and shared
 memory was proposed and rejected for security reasons when we were
 working on bifurcated driver.  
>>>
>>> The whole idea is that we map all the active RX ring pages into
>>> userspace from the start.
>>>
>>> And just how Jesper's page pool work will avoid DMA map/unmap,
>>> it will also avoid changing the userspace mapping of the pages
>>> as well.
>>>
>>> Thus avoiding the TLB/VM overhead altogether.
>>>   
> 
> Exactly.  It is worth mentioning that pages entering the page pool need
> to be cleared (measured cost 143 cycles), in order to not leak any
> kernel info.  The primary focus of this design is to make sure not to
> leak kernel info to userspace, but with an "exclusive" mode also
> support isolation between applications.
> 
> 
>> I get this but it requires applications to be isolated. The pages from
>> a queue can not be shared between multiple applications in different
>> trust domains. And the application has to be cooperative meaning it
>> can't "look" at data that has not been marked by the stack as OK. In
>> these schemes we tend to end up with something like virtio/vhost or
>> af_packet.
> 
> I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first
> two would require CAP_NET_ADMIN privileges.  All modes have a trust
> domain id, that need to match e.g. when page reach the socket.

Even mode 3 should require cap_net_admin; we don't want userspace to
grab queues off the nic without it, IMO.

> 
> Mode-1 "Shared": Application choose lowest isolation level, allowing
>  multiple application to mmap VMA area.

My only point here is that applications can read each other's data, and
all applications need to cooperate; for example, one app could try to
write continuously to read-only pages, causing faults and whatnot. This
is all non-standard and doesn't play well with cgroups and "normal"
applications. It requires a new orchestration model.

I'm a bit skeptical of the use case but I know of a handful of reasons
to use this model. Maybe take a look at the ivshmem implementation in
DPDK.

Also this still requires a hardware filter to push "application" traffic
onto reserved queues/pages as far as I can tell.

> 
> Mode-2 "Single-user": Application request it want to be the only user
>  of the RX queue.  This blocks other application to mmap VMA area.
> 

Assuming data is read-only, sharing with the stack is possibly OK :/. I
guess you would need two pools of memory, one for data and one for skbs,
so you don't leak skbs into user space.

The devil's in the details here. There are lots of hooks in the kernel
that can, for example, push the packet elsewhere with a 'redirect' tc
action. And letting an app "read" data or impact performance of an
unrelated application is wrong IMO. Stacked devices also provide another
set of details that are a bit difficult to track down; see all the
hardware offload efforts.

I assume all these concerns are shared between mode-1 and mode-2

> Mode-3 "Exclusive": Application request to own RX queue.  Packets are
>  no longer allowed for normal netstack delivery.
> 

I have patches for this mode already but haven't pushed them due to
an alternative solution using VFIO.

> Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
> still allowed to travel netstack and thus can contain packet data from
> other normal applications.  This is part of the design, to share the
> NIC between netstack and an accelerated userspace application using RX
> zero-copy delivery.
> 

I don't think this is acceptable to be honest. Letting an application
potentially read/impact other arbitrary applications on the system
seems like a non-starter even with CAP_NET_ADMIN. At least this was
the conclusion from bifurcated driver work some time ago.

> 
>> Any ACLs/filtering/switching/headers need to be done in hardware or
>> the application trust boundaries are broken.
> 
> The software solution outlined allow the application to make the choice
> of what trust boundary it wants.
> 
> The "exclusive" mode-3 make most sense together with HW filters.
> Already today, we support creating a new RX queue based on ethtool
> ntuple HW filter and then you simply attach your application that queue
> in mode-3, and have full isolation.
> 

Still pretty fuzzy on why mode-1 and mode-2 do not need hw filters?
Without hardware filters we have no way of knowing who/what data is
put in the page.

>  
>> If the above can not be met then a copy is needed. What I am trying
>> to tease out is the above comment along with other statements like
>> this "can be 

Re: [PATCH 4/3] random: use siphash24 instead of md5 for get_random_int/long

2016-12-14 Thread Theodore Ts'o
On Wed, Dec 14, 2016 at 04:10:37AM +0100, Jason A. Donenfeld wrote:
> This duplicates the current algorithm for get_random_int/long, but uses
> siphash24 instead. This comes with several benefits. It's certainly
> faster and more cryptographically secure than MD5. This patch also
> hashes the pid, entropy, and timestamp as fixed width fields, in order
> to increase diffusion.
> 
> The previous md5 algorithm used a per-cpu md5 state, which caused
> successive calls to the function to chain upon each other. While it's
> not entirely clear that this kind of chaining is absolutely necessary
> when using a secure PRF like siphash24, it can't hurt, and the timing of
> the call chain does add a degree of natural entropy. So, in keeping with
> this design, instead of the massive per-cpu 64-byte md5 state, there is
> instead a per-cpu previously returned value for chaining.
> 
> Signed-off-by: Jason A. Donenfeld 
> Cc: Jean-Philippe Aumasson 

The original reason for get_random_int was because the original
urandom algorithms were too slow.  When we moved to chacha20, which is
much faster, I didn't think to revisit get_random_int() and
get_random_long().

One somewhat undesirable aspect of the current algorithm is that we
never change random_int_secret.  So I've been toying with the
following, which is 4 times faster than md5.  (I haven't tried
benchmarking against siphash yet.)

[3.606139] random benchmark!!
[3.606276] get_random_int # cycles: 326578
[3.606317] get_random_int_new # cycles: 95438
[3.607423] get_random_bytes # cycles: 2653388

  - Ted

P.S.  It's interesting to note that siphash24 and chacha20 are both
add-rotate-xor based algorithms.

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d6876d506220..be172ea75799 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1681,6 +1681,38 @@ static int rand_initialize(void)
 }
 early_initcall(rand_initialize);
 
+static unsigned int get_random_int_new(void);
+
+static int rand_benchmark(void)
+{
+   cycles_t start,finish;
+   int i, out;
+
+   pr_crit("random benchmark!!\n");
+   start = get_cycles();
+   for (i = 0; i < 1000; i++) {
+   get_random_int();
+   }
+   finish = get_cycles();
+   pr_err("get_random_int # cycles: %llu\n", finish - start);
+
+   start = get_cycles();
+   for (i = 0; i < 1000; i++) {
+   get_random_int_new();
+   }
+   finish = get_cycles();
+   pr_err("get_random_int_new # cycles: %llu\n", finish - start);
+
+   start = get_cycles();
+   for (i = 0; i < 1000; i++) {
+   get_random_bytes(&out, sizeof(out));
+   }
+   finish = get_cycles();
+   pr_err("get_random_bytes # cycles: %llu\n", finish - start);
+   return 0;
+}
+device_initcall(rand_benchmark);
+
 #ifdef CONFIG_BLOCK
 void rand_initialize_disk(struct gendisk *disk)
 {
@@ -2064,8 +2096,10 @@ unsigned int get_random_int(void)
__u32 *hash;
unsigned int ret;
 
+#if 0  // force slow path
if (arch_get_random_int(&ret))
return ret;
+#endif
 
hash = get_cpu_var(get_random_int_hash);
 
@@ -2100,6 +2134,38 @@ unsigned long get_random_long(void)
 }
 EXPORT_SYMBOL(get_random_long);
 
+struct random_buf {
+   __u8 buf[CHACHA20_BLOCK_SIZE];
+   int ptr;
+};
+
+static DEFINE_PER_CPU(struct random_buf, batched_entropy);
+
+static void get_batched_entropy(void *buf, int n)
+{
+   struct random_buf *p;
+
+   p = &get_cpu_var(batched_entropy);
+
+   if ((p->ptr == 0) ||
+   (p->ptr + n >= CHACHA20_BLOCK_SIZE)) {
+   extract_crng(p->buf);
+   p->ptr = 0;
+   }
+   BUG_ON(n > CHACHA20_BLOCK_SIZE);
+   memcpy(buf, p->buf + p->ptr, n);
+   p->ptr += n;
+   put_cpu_var(batched_entropy);
+}
+
+static unsigned int get_random_int_new(void)
+{
+   int ret;
+
+   get_batched_entropy(&ret, sizeof(ret));
+   return ret;
+}
+
 /**
  * randomize_page - Generate a random, page aligned address
  * @start: The smallest acceptable address the caller will take.


Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE

2016-12-14 Thread Theodore Ts'o
On Wed, Dec 14, 2016 at 12:24:43PM +0800, Wei Xu wrote:
> 
> BTW, although this is a guest issue, is there any way to view the GCE
> host kernel or qemu (if it is qemu) version?

No, there isn't, as far as I know.

 - Ted


Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread Alexander Duyck
On Wed, Dec 14, 2016 at 8:32 AM, John Fastabend
 wrote:
> On 16-12-14 01:39 AM, Jesper Dangaard Brouer wrote:
>> On Tue, 13 Dec 2016 12:08:21 -0800
>> John Fastabend  wrote:
>>
>>> On 16-12-13 11:53 AM, David Miller wrote:
 From: John Fastabend 
 Date: Tue, 13 Dec 2016 09:43:59 -0800

> What does "zero-copy send packet-pages to the application/socket that
> requested this" mean? At the moment on x86 page-flipping appears to be
> more expensive than memcpy (I can post some data shortly) and shared
> memory was proposed and rejected for security reasons when we were
> working on bifurcated driver.

 The whole idea is that we map all the active RX ring pages into
 userspace from the start.

 And just how Jesper's page pool work will avoid DMA map/unmap,
 it will also avoid changing the userspace mapping of the pages
 as well.

 Thus avoiding the TLB/VM overhead altogether.

>>
>> Exactly.  It is worth mentioning that pages entering the page pool need
>> to be cleared (measured cost 143 cycles), in order to not leak any
>> kernel info.  The primary focus of this design is to make sure not to
>> leak kernel info to userspace, but with an "exclusive" mode also
>> support isolation between applications.
>>
>>
>>> I get this but it requires applications to be isolated. The pages from
>>> a queue can not be shared between multiple applications in different
>>> trust domains. And the application has to be cooperative meaning it
>>> can't "look" at data that has not been marked by the stack as OK. In
>>> these schemes we tend to end up with something like virtio/vhost or
>>> af_packet.
>>
>> I expect 3 modes, when enabling RX-zero-copy on a page_pool. The first
>> two would require CAP_NET_ADMIN privileges.  All modes have a trust
>> domain id, that need to match e.g. when page reach the socket.
>
> Even mode 3 should require cap_net_admin; we don't want userspace to
> grab queues off the nic without it, IMO.
>
>>
>> Mode-1 "Shared": Application choose lowest isolation level, allowing
>>  multiple application to mmap VMA area.
>
> My only point here is that applications can read each other's data, and
> all applications need to cooperate; for example, one app could try to
> write continuously to read-only pages, causing faults and whatnot. This
> is all non-standard and doesn't play well with cgroups and "normal"
> applications. It requires a new orchestration model.
>
> I'm a bit skeptical of the use case but I know of a handful of reasons
> to use this model. Maybe take a look at the ivshmem implementation in
> DPDK.
>
> Also this still requires a hardware filter to push "application" traffic
> onto reserved queues/pages as far as I can tell.
>
>>
>> Mode-2 "Single-user": Application request it want to be the only user
>>  of the RX queue.  This blocks other application to mmap VMA area.
>>
>
> Assuming data is read-only, sharing with the stack is possibly OK :/. I
> guess you would need two pools of memory, one for data and one for skbs,
> so you don't leak skbs into user space.
>
> The devil's in the details here. There are lots of hooks in the kernel
> that can, for example, push the packet elsewhere with a 'redirect' tc
> action. And letting an app "read" data or impact performance of an
> unrelated application is wrong IMO. Stacked devices also provide another
> set of details that are a bit difficult to track down; see all the
> hardware offload efforts.
>
> I assume all these concerns are shared between mode-1 and mode-2
>
>> Mode-3 "Exclusive": Application request to own RX queue.  Packets are
>>  no longer allowed for normal netstack delivery.
>>
>
> I have patches for this mode already but haven't pushed them due to
> an alternative solution using VFIO.
>
>> Notice mode-2 still requires CAP_NET_ADMIN, because packets/pages are
>> still allowed to travel netstack and thus can contain packet data from
>> other normal applications.  This is part of the design, to share the
>> NIC between netstack and an accelerated userspace application using RX
>> zero-copy delivery.
>>
>
> I don't think this is acceptable to be honest. Letting an application
> potentially read/impact other arbitrary applications on the system
> seems like a non-starter even with CAP_NET_ADMIN. At least this was
> the conclusion from bifurcated driver work some time ago.

I agree.  This is a no-go from the performance perspective as well.
At a minimum you would have to be zeroing out the page between uses to
avoid leaking data, and that assumes that the program we are sending
the pages to is slightly well behaved.  If we think zeroing out an
sk_buff is expensive wait until we are trying to do an entire 4K page.

I think we are stuck with having to use a HW filter to split off
application traffic to a specific ring, and then having to share the
memory between the application and the kernel on that ring only.  Any
other approach just opens us up to all sorts of security concerns
since it would be poss

Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread Christoph Lameter
On Tue, 13 Dec 2016, Hannes Frederic Sowa wrote:

> > Interesting.  So you even imagine sockets registering memory regions
> > with the NIC.  If we had a proper NIC HW filter API across the drivers,
> > to register the steering rule (like ibv_create_flow), this would be
> > doable, but we don't (DPDK actually have an interesting proposal[1])
>
> On a side note, this is what windows does with RIO ("registered I/O").
> Maybe you want to look at the API to get some ideas: allocating and
> pinning down memory in user space and registering that with sockets to
> get zero-copy IO.

Yup that is also what I think. Regarding the memory registration and flow
steering for user space RX/TX ring please look at the qpair model
implemented by the RDMA subsystem in the kernel. The memory semantics are
clearly established there and have been in use for more than a decade.



[PATCH net 0/2] net/sched: cls_flower: Fix mask handling

2016-12-14 Thread Paul Blakey
Hi,
The series fix how the mask is being handled.
Thanks.

Paul Blakey (2):
  net/sched: cls_flower: Use mask for addr_type
  net/sched: cls_flower: Use masked key when calling HW offloads

 net/sched/cls_flower.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

-- 
1.8.3.1



[PATCH net 1/2] net/sched: cls_flower: Use mask for addr_type

2016-12-14 Thread Paul Blakey
When addr_type is set, mask should also be set.

Fixes: 66530bdf85eb ('sched,cls_flower: set key address type when present')
Fixes: bc3103f1ed40 ('net/sched: cls_flower: Classify packet in ip tunnels')
Signed-off-by: Paul Blakey 
Reviewed-by: Roi Dayan 
Acked-by: Jiri Pirko 
---
 net/sched/cls_flower.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index e040c51..9758f5a 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -509,6 +509,7 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
 
if (tb[TCA_FLOWER_KEY_IPV4_SRC] || tb[TCA_FLOWER_KEY_IPV4_DST]) {
key->control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+   mask->control.addr_type = ~0;
fl_set_key_val(tb, &key->ipv4.src, TCA_FLOWER_KEY_IPV4_SRC,
   &mask->ipv4.src, TCA_FLOWER_KEY_IPV4_SRC_MASK,
   sizeof(key->ipv4.src));
@@ -517,6 +518,7 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
   sizeof(key->ipv4.dst));
} else if (tb[TCA_FLOWER_KEY_IPV6_SRC] || tb[TCA_FLOWER_KEY_IPV6_DST]) {
key->control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
+   mask->control.addr_type = ~0;
fl_set_key_val(tb, &key->ipv6.src, TCA_FLOWER_KEY_IPV6_SRC,
   &mask->ipv6.src, TCA_FLOWER_KEY_IPV6_SRC_MASK,
   sizeof(key->ipv6.src));
@@ -571,6 +573,7 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
if (tb[TCA_FLOWER_KEY_ENC_IPV4_SRC] ||
tb[TCA_FLOWER_KEY_ENC_IPV4_DST]) {
key->enc_control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+   mask->enc_control.addr_type = ~0;
fl_set_key_val(tb, &key->enc_ipv4.src,
   TCA_FLOWER_KEY_ENC_IPV4_SRC,
   &mask->enc_ipv4.src,
@@ -586,6 +589,7 @@ static int fl_set_key(struct net *net, struct nlattr **tb,
if (tb[TCA_FLOWER_KEY_ENC_IPV6_SRC] ||
tb[TCA_FLOWER_KEY_ENC_IPV6_DST]) {
key->enc_control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS;
+   mask->enc_control.addr_type = ~0;
fl_set_key_val(tb, &key->enc_ipv6.src,
   TCA_FLOWER_KEY_ENC_IPV6_SRC,
   &mask->enc_ipv6.src,
-- 
1.8.3.1



[PATCH net 2/2] net/sched: cls_flower: Use masked key when calling HW offloads

2016-12-14 Thread Paul Blakey
Zero bits on the mask signify a "don't care" on the corresponding bits
in key. Some HWs require those bits on the key to be zero. Since these
bits are masked anyway, it's okay to provide the masked key to all
drivers.
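
Conceptually, the masked key is just the key ANDed with the mask, word
by word; a sketch of the idea (the function name here is illustrative;
cls_flower computes f->mkey along these lines when a filter is
installed):

	static void compute_masked_key(void *mkey, const void *key,
				       const void *mask, size_t len)
	{
		const long *lkey = key, *lmask = mask;
		long *lmkey = mkey;
		size_t i;

		for (i = 0; i < len; i += sizeof(long))
			*lmkey++ = *lkey++ & *lmask++;
	}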

Fixes: 5b33f48842fa ('net/flower: Introduce hardware offload support')
Signed-off-by: Paul Blakey 
Reviewed-by: Roi Dayan 
Acked-by: Jiri Pirko 
---
 net/sched/cls_flower.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 9758f5a..35ac28d 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -252,7 +252,7 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
offload.cookie = (unsigned long)f;
offload.dissector = dissector;
offload.mask = mask;
-   offload.key = &f->key;
+   offload.key = &f->mkey;
offload.exts = &f->exts;
 
tc->type = TC_SETUP_CLSFLOWER;
-- 
1.8.3.1



Re: [PATCH] IB/mlx4: avoid a -Wmaybe-uninitialize warning

2016-12-14 Thread Doug Ledford
On 10/25/2016 12:16 PM, Arnd Bergmann wrote:
> There is an old warning about mlx4_SW2HW_EQ_wrapper on x86:
> 
> ethernet/mellanox/mlx4/resource_tracker.c: In function 
> ‘mlx4_SW2HW_EQ_wrapper’:
> ethernet/mellanox/mlx4/resource_tracker.c:3071:10: error: ‘eq’ may be used 
> uninitialized in this function [-Werror=maybe-uninitialized]
> 
> The problem here is that gcc won't track the state of the variable
> across a spin_unlock. Moving the assignment out of the lock is
> safe here and avoids the warning.
> 
> Signed-off-by: Arnd Bergmann 

Thanks, applied.

-- 
Doug Ledford 
GPG Key ID: 0E572FDD





Re: [net-next PATCH v5 1/6] net: virtio dynamically disable/enable LRO

2016-12-14 Thread John Fastabend
On 16-12-14 05:31 AM, Michael S. Tsirkin wrote:
> On Thu, Dec 08, 2016 at 04:04:58PM -0800, John Fastabend wrote:
>> On 16-12-08 01:36 PM, Michael S. Tsirkin wrote:
>>> On Wed, Dec 07, 2016 at 12:11:11PM -0800, John Fastabend wrote:
 This adds support for dynamically setting the LRO feature flag. The
 message to control guest features in the backend uses the
 CTRL_GUEST_OFFLOADS msg type.

 Signed-off-by: John Fastabend 
 ---

[...]

  
  static void virtnet_config_changed_work(struct work_struct *work)
 @@ -1815,6 +1846,12 @@ static int virtnet_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
dev->features |= NETIF_F_RXCSUM;
  
 +  if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) &&
 +  virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6)) {
 +  dev->features |= NETIF_F_LRO;
 +  dev->hw_features |= NETIF_F_LRO;
>>>
>>> So the issue is I think that the virtio "LRO" isn't really
>>> LRO, it's typically just GRO forwarded to guests.
>>> So these are easily re-split along MTU boundaries,
>>> which makes it ok to forward these across bridges.
>>>
>>> It's not nice that we don't document this in the spec,
>>> but it's the reality and people rely on this.
>>>
>>> For now, how about doing a custom thing and just disable/enable
>>> it as XDP is attached/detached?
>>
>> The annoying part about doing this is ethtool will say that it is fixed
>> yet it will be changed by seemingly unrelated operation. I'm not sure I
>> like the idea to start automatically configuring the link via xdp_set.
> 
> I really don't like the idea of dropping performance
> by a factor of 3 for people bridging two virtio net
> interfaces.
> 
> So how about a simple approach for now, just disable
> XDP if GUEST_TSO is enabled?
> 
> We can discuss better approaches in next version.
> 

So the proposal is to add a check in XDP setup so that

  if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO{4|6}))
	return -EOPNOTSUPP;

Or whatever is the most appropriate return code? Then we can
disable TSO via qemu-system with guest_tso4=off,guest_tso6=off for
XDP use cases.

Sounds like a reasonable start to me. I'll make the change should this
go through DaveMs net-next tree or do you want it on virtio tree? Either
is fine with me.

Thanks,
John
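
For reference, a rough sketch of what such a CTRL_GUEST_OFFLOADS toggle
could look like, assuming virtio-net's virtnet_send_command() helper;
the vi->ctrl_offloads field is assumed here because the command buffer
must be DMA-able and so cannot live on the stack:

	static int virtnet_set_guest_offloads(struct virtnet_info *vi,
					      u64 offloads)
	{
		struct scatterlist sg;

		vi->ctrl_offloads = cpu_to_virtio64(vi->vdev, offloads);
		sg_init_one(&sg, &vi->ctrl_offloads, sizeof(vi->ctrl_offloads));

		if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_GUEST_OFFLOADS,
					  VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET,
					  &sg))
			return -EINVAL;
		return 0;
	}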



Re: [Query] Delayed vxlan socket creation?

2016-12-14 Thread Cong Wang
On Tue, Dec 13, 2016 at 11:49 PM, Du, Fan  wrote:
> Hi
>
> I'm interested to one Docker issue[1] which looks like related to kernel 
> vxlan socket creation
> as described in the thread. From my limited knowledge here, socket creation 
> is synchronous ,
> and after the *socket* syscall, the sock handle will be valid and ready to 
> linkup.

You need to read the code. A vxlan tunnel is a UDP tunnel; it needs a
kernel socket (and a port) to set up UDP communication, unlike a GRE
tunnel etc.
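
A condensed sketch of what happens under the hood, modelled on vxlan's
socket-creation path (simplified, error handling omitted):

	struct udp_port_cfg udp_conf = {
		.family		= AF_INET,
		.local_udp_port	= htons(4789),	/* IANA VXLAN port */
	};
	struct socket *sock;
	int err;

	/* creates and binds a kernel-owned UDP socket for the tunnel */
	err = udp_sock_create(net, &udp_conf, &sock);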


>
> Somehow I'm not sure the detailed scenario here, and which/how possible 
> commit fix?
> Thanks!
>
> Quoted analysis:
> --
> (Found in kernel 3.13)
> The issue happens because in older kernels when a vxlan interface is created,
> the socket creation is queued up in a worker thread which actually creates
> the socket. But this needs to happen before we bring up the link on the vxlan 
> interface.
> If for some chance, the worker thread hasn't completed the creation of the 
> socket
> before we did link up then when we do link up the kernel checks if the socket 
> was
> created and if not it will return ENOTCONN. This was a bug in the kernel 
> which got fixed
> in later kernels. That is why retrying with a timer fixes the issue.


This was introduced by commit 1c51a9159ddefa5119724a4c7da3fd3ef44b68d5
and later fixed by commit 56ef9c909b40483d2c8cb63fcbf83865f162d5ec.


RE: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread David Laight
From: Christoph Lameter
> Sent: 14 December 2016 17:00
> On Tue, 13 Dec 2016, Hannes Frederic Sowa wrote:
> 
> > > Interesting.  So you even imagine sockets registering memory regions
> > > with the NIC.  If we had a proper NIC HW filter API across the drivers,
> > > to register the steering rule (like ibv_create_flow), this would be
> > > doable, but we don't (DPDK actually have an interesting proposal[1])
> >
> > On a side note, this is what windows does with RIO ("registered I/O").
> > Maybe you want to look at the API to get some ideas: allocating and
> > pinning down memory in user space and registering that with sockets to
> > get zero-copy IO.
> 
> Yup that is also what I think. Regarding the memory registration and flow
> steering for user space RX/TX ring please look at the qpair model
> implemented by the RDMA subsystem in the kernel. The memory semantics are
> clearly established there and have been in use for more than a decade.

Isn't there a bigger problem for transmit?
If the kernel is doing ANY validation on the frames it must copy the
data to memory the application cannot modify before doing the validation.
Otherwise the application could change the data afterwards.

David




Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform

2016-12-14 Thread Jason A. Donenfeld
On Wed, Dec 14, 2016 at 3:47 PM, David Laight  wrote:
> Just remove the __packed and ensure that the structure is 'nice'.
> This includes ensuring there is no 'tail padding'.
> In some cases you'll need to put the port number into a 32bit field.

I'd rather not. There's no point in wasting extra cycles on hashing
useless data, just for some particular syntactic improvement. __packed
__aligned(8) will do what we want perfectly, I think.

> I'd also require that the key be aligned.

Yep, I'll certainly do this for the siphash24_aligned version in the v3.


Re: [PATCH] infiniband: nes: nes_nic: use new api ethtool_{get|set}_link_ksettings

2016-12-14 Thread Doug Ledford
On 10/25/2016 11:29 AM, Philippe Reynes wrote:
> The ethtool api {get|set}_settings is deprecated.
> We move this driver to new api {get|set}_link_ksettings.
> 
> Signed-off-by: Philippe Reynes 

Thanks, applied.

-- 
Doug Ledford 
GPG Key ID: 0E572FDD





Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform

2016-12-14 Thread David Miller
From: "Jason A. Donenfeld" 
Date: Wed, 14 Dec 2016 13:53:10 +0100

> In all current uses of __packed in the code, I think the impact is
> precisely zero, because all structures have members in descending
> order of size, with each member being a perfect multiple of the one
> below it. The __packed is therefore just there for safety, in case
> somebody comes in and screws everything up by sticking a u8 in
> between.

Just marking the structure __packed, whether necessary or not, makes
the compiler assume that the members are not aligned and causes
byte-by-byte accesses to be performed for words.

Never, _ever_, use __packed unless absolutely necessary, it pessimizes
the code on cpus that require proper alignment of types.
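
To illustrate the point, compare (a sketch, assuming a 64-bit target
with strict alignment requirements):

	struct a { u64 x; u32 y; } __packed;	/* compiler assumes x may be
						 * misaligned: reads can turn
						 * into eight single-byte loads
						 * plus shifts and ors */
	struct b { u64 x; u32 y; u32 pad; };	/* naturally aligned, no tail
						 * padding: x is one 8-byte
						 * load */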


Re: [kernel-hardening] Re: [PATCH 4/3] random: use siphash24 instead of md5 for get_random_int/long

2016-12-14 Thread Jason A. Donenfeld
Hey Ted,

On Wed, Dec 14, 2016 at 5:37 PM, Theodore Ts'o  wrote:
> One somewhat undesirable aspect of the current algorithm is that we
> never change random_int_secret.

Why exactly would this be a problem? So long as the secret is kept
secret, the PRF is secure. If an attacker can read arbitrary kernel
memory, there are much much bigger issues to be concerned about. As
well, the "chaining" variable I introduce ensures that the random
numbers are, per-cpu, related to the uniqueness of timing of
subsequent calls.

> So I've been toying with the
> following, which is 4 times faster than md5.  (I haven't tried
> benchmarking against siphash yet.)
>
> [3.606139] random benchmark!!
> [3.606276] get_random_int # cycles: 326578
> [3.606317] get_random_int_new # cycles: 95438
> [3.607423] get_random_bytes # cycles: 2653388

Cool, I'll benchmark it against the siphash implementation. I like
what you did with batching up lots of chacha output, and doling it out
bit by bit. I suspect this will be quite fast, because with chacha20
you get an entire block.

> P.S.  It's interesting to note that siphash24 and chacha20 are both
> add-rotate-xor based algorithms.

Quite! Lots of nice shiny things are turning out to be ARX -- ChaCha,
BLAKE2, Siphash, NORX. The simplicity is really appealing.

Jason


Re: [v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap

2016-12-14 Thread arvind Yadav

Yes, I have seen this error. We have a device with very little memory,
basically an OMAP2 board. We have to port Android L onto it.
It has kernel version 3.10. On this device, we were getting page
allocation failures.

The vmalloc size was not enough to run all applications, so we decided
to increase the vmalloc reserve space. Once we increased the vmalloc
space, we started getting ioremap failures, and the kernel hit a
NULL-pointer dereference.

Here, it's just a check to avoid a kernel crash in case of ioremap
failure. We can keep this check to avoid this kind of scenario.

Thanks
-Arvind


On Wednesday 14 December 2016 11:02 PM, David Daney wrote:

On 12/14/2016 08:25 AM, Arvind Yadav wrote:

If devm_ioremap() fails, it returns NULL, and the kernel can then run
into a NULL-pointer dereference. This error check avoids the
NULL-pointer dereference.


Have you ever seen this failure in the wild?

How was the patch tested?

Thanks,
David Daney



Signed-off-by: Arvind Yadav 
---
 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c 
b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c

index 4ab404f..33c2fec 100644
--- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
+++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
@@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct 
platform_device *pdev)

 p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
 p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys,
p->agl_prt_ctl_size);
+if (!p->mix || !p->agl || !p->agl_prt_ctl) {
+dev_err(&pdev->dev, "failed to map I/O memory\n");
+result = -ENOMEM;
+goto err;
+}
+
 spin_lock_init(&p->lock);

 skb_queue_head_init(&p->tx_list);







Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform

2016-12-14 Thread Jason A. Donenfeld
Hi David,

On Wed, Dec 14, 2016 at 6:56 PM, David Miller  wrote:
> Just marking the structure __packed, whether necessary or not, makes
> the compiler assume that the members are not aligned and causes
> byte-by-byte accesses to be performed for words.
> Never, _ever_, use __packed unless absolutely necessary, it pessimizes
> the code on cpus that require proper alignment of types.

Oh, jimminy cricket, I did not realize that it made assignments
byte-by-byte *always*. So what options am I left with? What
immediately comes to mind are:

1)

struct {
u64 a;
u32 b;
u32 c;
u16 d;
u8 end[];
} a = {
.a = a,
.b = b,
.c = c,
.d = d
};
siphash24(&a, offsetof(typeof(a), end), key);

2)

u8 bytes[sizeof(u64) + sizeof(u32) * 2 + sizeof(u16)];
*(u64 *)&bytes[0] = a;
*(u32 *)&bytes[sizeof(u64)] = b;
*(u32 *)&bytes[sizeof(u64) + sizeof(u32)] = c;
*(u16 *)&bytes[sizeof(u64) + sizeof(u32) * 2] = d;
siphash24(bytes, sizeof(bytes), key);


Personally I find (1) a bit neater than (2). What's your opinion?

Jason
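
A third variant worth comparing (a sketch, not something proposed
above): pack with fixed-size memcpy() calls, which compilers typically
lower to plain stores, avoiding both __packed and the aliasing casts
of (2):

	u8 bytes[sizeof(u64) + 2 * sizeof(u32) + sizeof(u16)];

	memcpy(bytes, &a, sizeof(u64));
	memcpy(bytes + 8, &b, sizeof(u32));
	memcpy(bytes + 12, &c, sizeof(u32));
	memcpy(bytes + 16, &d, sizeof(u16));
	siphash24(bytes, sizeof(bytes), key);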


Re: [v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap

2016-12-14 Thread David Daney

On 12/14/2016 10:06 AM, arvind Yadav wrote:

Yes, I have seen this error. We have a device with very little memory,
basically an OMAP2 board. We have to port Android L onto it.
It has kernel version 3.10. On this device, we were getting page
allocation failures.


This makes absolutely no sense to me.  OCTEON is a mips64 SoC with a ton 
of memory where ioremap can never fail, and it doesn't run Android, and 
you are talking about OMAP2.


Q1: Have you observed a failure on the device for which you are 
modifying the driver?


Q2: Have you tested the patch on hardware that uses the driver you are 
modifying by running network traffic through the Ethernet interface this 
driver controls?


If you cannot answer yes to both of those questions, then you should 
probably note in the changelog that the patch is untested.


David.



The vmalloc size was not enough to run all applications, so we decided
to increase the vmalloc reserve space. Once we increased the vmalloc
space, we started getting ioremap failures, and the kernel hit a
NULL-pointer dereference.

Here, it's just a check to avoid a kernel crash in case of ioremap
failure. We can keep this check to avoid this kind of scenario.

Thanks
-Arvind


On Wednesday 14 December 2016 11:02 PM, David Daney wrote:

On 12/14/2016 08:25 AM, Arvind Yadav wrote:

If devm_ioremap() fails, it returns NULL, and the kernel can then run
into a NULL-pointer dereference. This error check avoids the
NULL-pointer dereference.


Have you ever seen this failure in the wild?

How was the patch tested?

Thanks,
David Daney



Signed-off-by: Arvind Yadav 
---
 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
index 4ab404f..33c2fec 100644
--- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
+++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
@@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct
platform_device *pdev)
 p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
 p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys,
p->agl_prt_ctl_size);
+if (!p->mix || !p->agl || !p->agl_prt_ctl) {
+dev_err(&pdev->dev, "failed to map I/O memory\n");
+result = -ENOMEM;
+goto err;
+}
+
 spin_lock_init(&p->lock);

 skb_queue_head_init(&p->tx_list);







Re: [v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap

2016-12-14 Thread arvind Yadav

Hi David,

I have gave my comment.

Thanks
Arvind

On Wednesday 14 December 2016 11:44 PM, David Daney wrote:

On 12/14/2016 10:06 AM, arvind Yadav wrote:

Yes, I have seen this error. We have a device with very little memory,
basically an OMAP2 board. We have to port Android L onto it.
It has kernel version 3.10. On this device, we were getting page
allocation failures.


This makes absolutely no sense to me.  OCTEON is a mips64 SoC with a 
ton of memory where ioremap can never fail, and it doesn't run 
Android, and you are talking about OMAP2.
  - I just gave that as an example of where I have seen an ioremap
issue; please don't read too much into it. I know it will not fail now.
ioremap returns NULL on failure, and we should catch this error. Other
MIPS SoC drivers have the same check. It's just a check that will not
impact any functionality or performance of this driver; it will avoid a
NULL-pointer error. If a function can return an error, we should catch it.


Q1: Have you observed a failure on the device for which you are 
modifying the driver?

 -No, I did not observe this error.


Q2: Have you tested the patch on hardware that uses the driver you are 
modifying by running network traffic through the Ethernet interface 
this driver controls?

- Right now, we cannot test this kind of failure.


If you cannot answer yes to both of those questions, then you should 
probably note in the changelog that the patch is untested.





David.



The vmalloc size was not enough to run all applications, so we decided
to increase the vmalloc reserve space. Once we increased the vmalloc
space, we started getting ioremap failures, and the kernel hit a
NULL-pointer dereference.

Here, it's just a check to avoid a kernel crash in case of ioremap
failure. We can keep this check to avoid this kind of scenario.

Thanks
-Arvind


On Wednesday 14 December 2016 11:02 PM, David Daney wrote:

On 12/14/2016 08:25 AM, Arvind Yadav wrote:

If devm_ioremap() fails, it returns NULL, and the kernel can then run
into a NULL-pointer dereference. This error check avoids the
NULL-pointer dereference.


Have you ever seen this failure in the wild?

How was the patch tested?

Thanks,
David Daney



Signed-off-by: Arvind Yadav 
---
 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
index 4ab404f..33c2fec 100644
--- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
+++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
@@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct
platform_device *pdev)
 p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
 p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, 
p->agl_prt_ctl_phys,

p->agl_prt_ctl_size);
+if (!p->mix || !p->agl || !p->agl_prt_ctl) {
+dev_err(&pdev->dev, "failed to map I/O memory\n");
+result = -ENOMEM;
+goto err;
+}
+
 spin_lock_init(&p->lock);

 skb_queue_head_init(&p->tx_list);









[PATCH v3 3/3] random: use siphash24 instead of md5 for get_random_int/long

2016-12-14 Thread Jason A. Donenfeld
This duplicates the current algorithm for get_random_int/long, but uses
siphash24 instead. This comes with several benefits. It's certainly
faster and more cryptographically secure than MD5. This patch also
hashes the pid, entropy, and timestamp as fixed width fields, in order
to increase diffusion.

The previous md5 algorithm used a per-cpu md5 state, which caused
successive calls to the function to chain upon each other. While it's
not entirely clear that this kind of chaining is absolutely necessary
when using a secure PRF like siphash24, it can't hurt, and the timing of
the call chain does add a degree of natural entropy. So, in keeping with
this design, instead of the massive per-cpu 64-byte md5 state, there is
instead a per-cpu previously returned value for chaining.

Signed-off-by: Jason A. Donenfeld 
Cc: Jean-Philippe Aumasson 
Cc: Ted Tso 
---
Changes from v2->v3:

  - Structs are no longer packed, to mitigate slow byte-by-byte assignment.

 drivers/char/random.c | 52 ---
 1 file changed, 33 insertions(+), 19 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d6876d506220..b1c2e3b26430 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -262,6 +262,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -2042,7 +2043,7 @@ struct ctl_table random_table[] = {
 };
 #endif /* CONFIG_SYSCTL */
 
-static u32 random_int_secret[MD5_MESSAGE_BYTES / 4] cacheline_aligned;
+static u8 random_int_secret[SIPHASH24_KEY_LEN] __aligned(SIPHASH24_ALIGNMENT);
 
 int random_int_secret_init(void)
 {
@@ -2050,8 +2051,7 @@ int random_int_secret_init(void)
return 0;
 }
 
-static DEFINE_PER_CPU(__u32 [MD5_DIGEST_WORDS], get_random_int_hash)
-   __aligned(sizeof(unsigned long));
+static DEFINE_PER_CPU(u64, get_random_int_chaining);
 
 /*
  * Get a random word for internal kernel use only. Similar to urandom but
@@ -2061,19 +2061,26 @@ static DEFINE_PER_CPU(__u32 [MD5_DIGEST_WORDS], 
get_random_int_hash)
  */
 unsigned int get_random_int(void)
 {
-   __u32 *hash;
unsigned int ret;
+   struct {
+   u64 chaining;
+   unsigned long ts;
+   unsigned long entropy;
+   pid_t pid;
+   char end[];
+   } __aligned(SIPHASH24_ALIGNMENT) combined;
+   u64 *chaining;
 
if (arch_get_random_int(&ret))
return ret;
 
-   hash = get_cpu_var(get_random_int_hash);
-
-   hash[0] += current->pid + jiffies + random_get_entropy();
-   md5_transform(hash, random_int_secret);
-   ret = hash[0];
-   put_cpu_var(get_random_int_hash);
-
+   chaining = get_cpu_ptr(&get_random_int_chaining);
+   combined.chaining = *chaining;
+   combined.ts = jiffies;
+   combined.entropy = random_get_entropy();
+   combined.pid = current->pid;
+   ret = *chaining = siphash24((u8 *)&combined, offsetof(typeof(combined), 
end), random_int_secret);
+   put_cpu_ptr(chaining);
return ret;
 }
 EXPORT_SYMBOL(get_random_int);
@@ -2083,19 +2090,26 @@ EXPORT_SYMBOL(get_random_int);
  */
 unsigned long get_random_long(void)
 {
-   __u32 *hash;
unsigned long ret;
+   struct {
+   u64 chaining;
+   unsigned long ts;
+   unsigned long entropy;
+   pid_t pid;
+   char end[];
+   } __aligned(SIPHASH24_ALIGNMENT) combined;
+   u64 *chaining;
 
if (arch_get_random_long(&ret))
return ret;
 
-   hash = get_cpu_var(get_random_int_hash);
-
-   hash[0] += current->pid + jiffies + random_get_entropy();
-   md5_transform(hash, random_int_secret);
-   ret = *(unsigned long *)hash;
-   put_cpu_var(get_random_int_hash);
-
+   chaining = get_cpu_ptr(&get_random_int_chaining);
+   combined.chaining = *chaining;
+   combined.ts = jiffies;
+   combined.entropy = random_get_entropy();
+   combined.pid = current->pid;
+   ret = *chaining = siphash24((u8 *)&combined, offsetof(typeof(combined), 
end), random_int_secret);
+   put_cpu_ptr(chaining);
return ret;
 }
 EXPORT_SYMBOL(get_random_long);
-- 
2.11.0



[PATCH v3 2/3] secure_seq: use siphash24 instead of md5_transform

2016-12-14 Thread Jason A. Donenfeld
This gives a clear speed and security improvement. SipHash is both
faster and more solid crypto than the aging MD5.

Rather than manually filling MD5 buffers, we simply create
a layout with a simple anonymous struct, for which gcc generates
rather efficient code.

Signed-off-by: Jason A. Donenfeld 
Cc: Andi Kleen 
Cc: David Miller 
Cc: David Laight 
---
Changes from v2->v3:

  - Structs are no longer packed, to mitigate slow byte-by-byte assignment.
  - A typo has been fixed in the port number assignment.

 net/core/secure_seq.c | 166 ++
 1 file changed, 85 insertions(+), 81 deletions(-)

diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
index 88a8e429fc3e..00eb141c981b 100644
--- a/net/core/secure_seq.c
+++ b/net/core/secure_seq.c
@@ -1,3 +1,5 @@
+/* Copyright (C) 2016 Jason A. Donenfeld . All Rights 
Reserved. */
+
 #include 
 #include 
 #include 
@@ -8,14 +10,14 @@
 #include 
 #include 
 #include 
-
+#include 
 #include 
 
 #if IS_ENABLED(CONFIG_IPV6) || IS_ENABLED(CONFIG_INET)
+#include 
 #include 
-#define NET_SECRET_SIZE (MD5_MESSAGE_BYTES / 4)
 
-static u32 net_secret[NET_SECRET_SIZE] cacheline_aligned;
+static u8 net_secret[SIPHASH24_KEY_LEN] __aligned(SIPHASH24_ALIGNMENT);
 
 static __always_inline void net_secret_init(void)
 {
@@ -44,44 +46,41 @@ static u32 seq_scale(u32 seq)
 u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr,
 __be16 sport, __be16 dport, u32 *tsoff)
 {
-   u32 secret[MD5_MESSAGE_BYTES / 4];
-   u32 hash[MD5_DIGEST_WORDS];
-   u32 i;
-
+   const struct {
+   struct in6_addr saddr;
+   struct in6_addr daddr;
+   __be16 sport;
+   __be16 dport;
+   char end[];
+   } __aligned(SIPHASH24_ALIGNMENT) combined = {
+   .saddr = *(struct in6_addr *)saddr,
+   .daddr = *(struct in6_addr *)daddr,
+   .sport = sport,
+   .dport = dport
+   };
+   u64 hash;
net_secret_init();
-   memcpy(hash, saddr, 16);
-   for (i = 0; i < 4; i++)
-   secret[i] = net_secret[i] + (__force u32)daddr[i];
-   secret[4] = net_secret[4] +
-   (((__force u16)sport << 16) + (__force u16)dport);
-   for (i = 5; i < MD5_MESSAGE_BYTES / 4; i++)
-   secret[i] = net_secret[i];
-
-   md5_transform(hash, secret);
-
-   *tsoff = sysctl_tcp_timestamps == 1 ? hash[1] : 0;
-   return seq_scale(hash[0]);
+   hash = siphash24((const u8 *)&combined, offsetof(typeof(combined), 
end), net_secret);
+   *tsoff = sysctl_tcp_timestamps == 1 ? (hash >> 32) : 0;
+   return seq_scale(hash);
 }
 EXPORT_SYMBOL(secure_tcpv6_sequence_number);
 
 u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
   __be16 dport)
 {
-   u32 secret[MD5_MESSAGE_BYTES / 4];
-   u32 hash[MD5_DIGEST_WORDS];
-   u32 i;
-
+   const struct {
+   struct in6_addr saddr;
+   struct in6_addr daddr;
+   __be16 dport;
+   char end[];
+   } __aligned(SIPHASH24_ALIGNMENT) combined = {
+   .saddr = *(struct in6_addr *)saddr,
+   .daddr = *(struct in6_addr *)daddr,
+   .dport = dport
+   };
net_secret_init();
-   memcpy(hash, saddr, 16);
-   for (i = 0; i < 4; i++)
-   secret[i] = net_secret[i] + (__force u32) daddr[i];
-   secret[4] = net_secret[4] + (__force u32)dport;
-   for (i = 5; i < MD5_MESSAGE_BYTES / 4; i++)
-   secret[i] = net_secret[i];
-
-   md5_transform(hash, secret);
-
-   return hash[0];
+   return siphash24((const u8 *)&combined, offsetof(typeof(combined), 
end), net_secret);
 }
 EXPORT_SYMBOL(secure_ipv6_port_ephemeral);
 #endif
@@ -91,33 +90,39 @@ EXPORT_SYMBOL(secure_ipv6_port_ephemeral);
 u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
   __be16 sport, __be16 dport, u32 *tsoff)
 {
-   u32 hash[MD5_DIGEST_WORDS];
-
+   const struct {
+   __be32 saddr;
+   __be32 daddr;
+   __be16 sport;
+   __be16 dport;
+   char end[];
+   } __aligned(SIPHASH24_ALIGNMENT) combined = {
+   .saddr = saddr,
+   .daddr = daddr,
+   .sport = sport,
+   .dport = dport
+   };
+   u64 hash;
net_secret_init();
-   hash[0] = (__force u32)saddr;
-   hash[1] = (__force u32)daddr;
-   hash[2] = ((__force u16)sport << 16) + (__force u16)dport;
-   hash[3] = net_secret[15];
-
-   md5_transform(hash, net_secret);
-
-   *tsoff = sysctl_tcp_timestamps == 1 ? hash[1] : 0;
-   return seq_scale(hash[0]);
+   hash = siphash24((const u8 *)&combined, offsetof(typeof(combined), 
end), net_secret);
+   *tsoff = sysctl_tcp_timestamps == 1 ? (hash 

[PATCH v3 1/3] siphash: add cryptographically secure hashtable function

2016-12-14 Thread Jason A. Donenfeld
SipHash is a 64-bit keyed hash function that is actually a
cryptographically secure PRF, like HMAC. Except SipHash is super fast,
and is meant to be used as a hashtable keyed lookup function.

SipHash isn't just some new trendy hash function. It's been around for a
while, and there really isn't anything that comes remotely close to
being useful in the way SipHash is. With that said, why do we need this?

There are a variety of attacks known as "hashtable poisoning" in which an
attacker forms some data such that the hash of that data will be the
same, and then proceeds to fill up all entries of a hashbucket. This is
a realistic and well-known denial-of-service vector.

Linux developers already seem to be aware that this is an issue, and
various places that use hash tables in, say, a network context, use a
non-cryptographically secure function (usually jhash) and then try to
twiddle with the key on a time basis (or in many cases just do nothing
and hope that nobody notices). While this is an admirable attempt at
solving the problem, it doesn't actually fix it. SipHash fixes it.

(It fixes it in such a sound way that you could even build a stream
cipher out of SipHash that would resist the modern cryptanalysis.)

There are a modicum of places in the kernel that are vulnerable to
hashtable poisoning attacks, either via userspace vectors or network
vectors, and there's not a reliable mechanism inside the kernel at the
moment to fix it. The first step toward fixing these issues is actually
getting a secure primitive into the kernel for developers to use. Then
we can, bit by bit, port things over to it as deemed appropriate.

Secondly, a few places are using MD5 for creating secure sequence
numbers, port numbers, or fast random numbers. Siphash is a faster, more
fitting, and more secure replacement for MD5 in those situations.

Dozens of languages are already using this internally for their hash
tables. Some of the BSDs already use this in their kernels. SipHash is
a widely known high-speed solution to a widely known problem, and it's
time we catch up.
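
As a minimal sketch of intended use (flow_bucket, flow_hash_key, and a
power-of-two FLOW_BUCKETS are hypothetical here, not part of this
series; the layout trick mirrors the secure_seq patch), a hashtable
bucket computation keyed on an IPv4 flow might look like:

	static u8 flow_hash_key[SIPHASH24_KEY_LEN] __read_mostly;

	static u32 flow_bucket(__be32 saddr, __be32 daddr,
			       __be16 sport, __be16 dport)
	{
		const struct {
			__be32 saddr;
			__be32 daddr;
			__be16 sport;
			__be16 dport;
			char end[];
		} __aligned(SIPHASH24_ALIGNMENT) combined = {
			.saddr = saddr, .daddr = daddr,
			.sport = sport, .dport = dport
		};

		get_random_once(flow_hash_key, sizeof(flow_hash_key));
		/* hashing up to 'end' excludes any trailing padding */
		return siphash24((const u8 *)&combined,
				 offsetof(typeof(combined), end),
				 flow_hash_key) & (FLOW_BUCKETS - 1);
	}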

Signed-off-by: Jason A. Donenfeld 
Cc: Jean-Philippe Aumasson 
Cc: Daniel J. Bernstein 
Cc: Linus Torvalds 
Cc: Eric Biggers 
Cc: David Laight 
---
Changes from v2->v3:

  - There is now a fast aligned version of the function and a not-as-fast
unaligned version. The requirements for each have been documented in
a docbook-style comment. As well, the header now contains a constant
for the expected alignment.

  - The test suite has been updated to check both the unaligned and aligned
version of the function.

 include/linux/siphash.h |  30 ++
 lib/Kconfig.debug   |   6 +-
 lib/Makefile|   5 +-
 lib/siphash.c   | 153 
 lib/test_siphash.c  |  85 +++
 5 files changed, 274 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/siphash.h
 create mode 100644 lib/siphash.c
 create mode 100644 lib/test_siphash.c

diff --git a/include/linux/siphash.h b/include/linux/siphash.h
new file mode 100644
index ..82dc1a911a2e
--- /dev/null
+++ b/include/linux/siphash.h
@@ -0,0 +1,30 @@
+/* Copyright (C) 2016 Jason A. Donenfeld 
+ *
+ * This file is provided under a dual BSD/GPLv2 license.
+ *
+ * SipHash: a fast short-input PRF
+ * https://131002.net/siphash/
+ */
+
+#ifndef _LINUX_SIPHASH_H
+#define _LINUX_SIPHASH_H
+
+#include 
+
+enum siphash_lengths {
+   SIPHASH24_KEY_LEN = 16,
+   SIPHASH24_ALIGNMENT = 8
+};
+
+u64 siphash24(const u8 *data, size_t len, const u8 key[SIPHASH24_KEY_LEN]);
+
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+static inline u64 siphash24_unaligned(const u8 *data, size_t len, const u8 
key[SIPHASH24_KEY_LEN])
+{
+   return siphash24(data, len, key);
+}
+#else
+u64 siphash24_unaligned(const u8 *data, size_t len, const u8 
key[SIPHASH24_KEY_LEN]);
+#endif
+
+#endif /* _LINUX_SIPHASH_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index e6327d102184..32bbf689fc46 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1843,9 +1843,9 @@ config TEST_HASH
tristate "Perform selftest on hash functions"
default n
help
- Enable this option to test the kernel's integer ()
- and string () hash functions on boot
- (or module load).
+ Enable this option to test the kernel's integer (),
+ string (), and siphash ()
+ hash functions on boot (or module load).
 
  This is intended to help people writing architecture-specific
  optimized versions.  If unsure, say N.
diff --git a/lib/Makefile b/lib/Makefile
index 50144a3aeebd..71d398b04a74 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -22,7 +22,8 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 sha1.o chacha20.o md5.o irq_regs.o argv_split.o \
 flex_proportions.o ratelimit.o show_mem.o \
 is_single_threaded.o plist.o decompress.o kobject_ueven

Re: [v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap

2016-12-14 Thread Florian Fainelli
On 12/14/2016 10:39 AM, arvind Yadav wrote:
> Hi David,
> 
> I have given my comments.
> 
> Thanks
> Arvind
> 
> On Wednesday 14 December 2016 11:44 PM, David Daney wrote:
>> On 12/14/2016 10:06 AM, arvind Yadav wrote:
>>> Yes, I have seen this error. We have a device with very little memory,
>>> basically an OMAP2 board. We have to port Android L to it.
>>> It has a 3.10 kernel. On this device, we were getting page
>>> allocation failures.
>>
>> This makes absolutely no sense to me.  OCTEON is a mips64 SoC with a
>> ton of memory where ioremap can never fail, and it doesn't run
>> Android, and you are talking about OMAP2.
>   - I just gave an example of where I have seen an ioremap issue.
> Please don't conflate the two. I know it will not fail now. ioremap returns
> NULL on failure, and we should catch this error. Other MIPS SoC
> drivers have the same check. It's just a check that will not impact any
> functionality or performance of this driver, and it will avoid a NULL
> pointer error. We know that if a function can return an error, we should catch it.

Your patch subject should also be changed to insert spaces after the
colons, so this would be:

net: ethernet: cavium: octeon: octeon_mgmt:
-- 
Florian


stmmac: lockups (was Re: Synopsys Ethernet QoS)

2016-12-14 Thread Pavel Machek
Hi!

> I know that this is completely off topic, but I am facing a difficulty with
> stmmac. I have interrupts and the MAC well configured, with RX packets being
> received successfully, but TX is not working, resulting in TX errors = total
> TX packets.
> I have done a lot of debugging and my conclusion is that, for some reason,
> when using stmmac, after starting TX DMA the hardware state machine enters
> a dead-end state, resulting in those errors. Has anyone faced this trouble?

Well what I'm debugging are lockups after many packets transmitted
(followed by netdev watchdog; stmmac_tx_err() does not work for me, it
kills the device even when run from working state; ifconfig down/up
helps). 4.4 locks up in minutes to hours; 4.9 seems to work better
(though I believe I've seen a lockup there, too, once).

So... probably different problem?

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


[v3] net: ethernet: cavium: octeon: octeon_mgmt: Handle return NULL error from devm_ioremap

2016-12-14 Thread Arvind Yadav
If devm_ioremap() fails, it returns NULL.
The kernel can then run into a NULL-pointer dereference.
This error check avoids that.

Signed-off-by: Arvind Yadav 
---
 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c 
b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
index 4ab404f..33c2fec 100644
--- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
+++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
@@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct platform_device 
*pdev)
p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys,
   p->agl_prt_ctl_size);
+   if (!p->mix || !p->agl || !p->agl_prt_ctl) {
+   dev_err(&pdev->dev, "failed to map I/O memory\n");
+   result = -ENOMEM;
+   goto err;
+   }
+
spin_lock_init(&p->lock);
 
skb_queue_head_init(&p->tx_list);
-- 
2.7.4



Re: [v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap

2016-12-14 Thread arvind Yadav

Hi,

As per your suggestion, I have changed the subject.

Thanks

On Thursday 15 December 2016 12:24 AM, Florian Fainelli wrote:

On 12/14/2016 10:39 AM, arvind Yadav wrote:

Hi David,

I have given my comments.

Thanks
Arvind

On Wednesday 14 December 2016 11:44 PM, David Daney wrote:

On 12/14/2016 10:06 AM, arvind Yadav wrote:

Yes, I have seen this error. We have a device with very little memory,
basically an OMAP2 board. We have to port Android L to it.
It has a 3.10 kernel. On this device, we were getting page
allocation failures.

This makes absolutely no sense to me.  OCTEON is a mips64 SoC with a
ton of memory where ioremap can never fail, and it doesn't run
Android, and you are talking about OMAP2.

   - I just gave an example of where I have seen an ioremap issue.
Please don't conflate the two. I know it will not fail now. ioremap returns
NULL on failure, and we should catch this error. Other MIPS SoC
drivers have the same check. It's just a check that will not impact any
functionality or performance of this driver, and it will avoid a NULL
pointer error. We know that if a function can return an error, we should catch it.

Your patch subject should also be changed to insert spaces after the
colons, so this would be:

net: ethernet: cavium: octeon: octeon_mgmt:




[PATCH net] net: vrf: Fix NAT within a VRF

2016-12-14 Thread David Ahern
Connection tracking with VRF is broken because the pass through the VRF
device drops the connection tracking info. Removing the call to nf_reset
allows DNAT and MASQUERADE to work across interfaces within a VRF.

Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf device")
Signed-off-by: David Ahern 
---
 drivers/net/vrf.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 3bca24651dc0..015a1321c7dd 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -849,8 +849,6 @@ static struct sk_buff *vrf_rcv_nfhook(u8 pf, unsigned int 
hook,
 {
struct net *net = dev_net(dev);
 
-   nf_reset(skb);
-
if (NF_HOOK(pf, hook, net, NULL, skb, dev, NULL, vrf_rcv_finish) < 0)
skb = NULL;/* kfree_skb(skb) handled by nf code */
 
-- 
2.1.4



Re: [v2] net:ethernet:cavium:octeon:octeon_mgmt: Handle return NULL error from devm_ioremap

2016-12-14 Thread David Daney

On 12/14/2016 08:25 AM, Arvind Yadav wrote:

If devm_ioremap() fails, it returns NULL.
The kernel can then run into a NULL-pointer dereference.
This error check avoids that.



Have you ever seen this failure in the wild?

How was the patch tested?

Thanks,
David Daney



Signed-off-by: Arvind Yadav 
---
 drivers/net/ethernet/cavium/octeon/octeon_mgmt.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c 
b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
index 4ab404f..33c2fec 100644
--- a/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
+++ b/drivers/net/ethernet/cavium/octeon/octeon_mgmt.c
@@ -1479,6 +1479,12 @@ static int octeon_mgmt_probe(struct platform_device 
*pdev)
p->agl = (u64)devm_ioremap(&pdev->dev, p->agl_phys, p->agl_size);
p->agl_prt_ctl = (u64)devm_ioremap(&pdev->dev, p->agl_prt_ctl_phys,
   p->agl_prt_ctl_size);
+   if (!p->mix || !p->agl || !p->agl_prt_ctl) {
+   dev_err(&pdev->dev, "failed to map I/O memory\n");
+   result = -ENOMEM;
+   goto err;
+   }
+
spin_lock_init(&p->lock);

skb_queue_head_init(&p->tx_list);





[PATCH 1/3] Bluetooth: btusb: Use an error label for error paths

2016-12-14 Thread Rajat Jain
Use a label to remove the repetitive cleanup for error cases.
(This label will also be used in subsequent patches).

Signed-off-by: Rajat Jain 
---
 drivers/bluetooth/btusb.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
index 2f633df..ce22cef 100644
--- a/drivers/bluetooth/btusb.c
+++ b/drivers/bluetooth/btusb.c
@@ -2991,18 +2991,15 @@ static int btusb_probe(struct usb_interface *intf,
err = usb_set_interface(data->udev, 0, 0);
if (err < 0) {
BT_ERR("failed to set interface 0, alt 0 %d", err);
-   hci_free_dev(hdev);
-   return err;
+   goto out_free_dev;
}
}
 
if (data->isoc) {
err = usb_driver_claim_interface(&btusb_driver,
 data->isoc, data);
-   if (err < 0) {
-   hci_free_dev(hdev);
-   return err;
-   }
+   if (err < 0)
+   goto out_free_dev;
}
 
 #ifdef CONFIG_BT_HCIBTUSB_BCM
@@ -3016,14 +3013,16 @@ static int btusb_probe(struct usb_interface *intf,
 #endif
 
err = hci_register_dev(hdev);
-   if (err < 0) {
-   hci_free_dev(hdev);
-   return err;
-   }
+   if (err < 0)
+   goto out_free_dev;
 
usb_set_intfdata(intf, data);
 
return 0;
+
+out_free_dev:
+   hci_free_dev(hdev);
+   return err;
 }
 
 static void btusb_disconnect(struct usb_interface *intf)
-- 
2.8.0.rc3.226.g39d4020



Re: [kernel-hardening] Re: [PATCH 4/3] random: use siphash24 instead of md5 for get_random_int/long

2016-12-14 Thread Jason A. Donenfeld
Hi again,

On Wed, Dec 14, 2016 at 5:37 PM, Theodore Ts'o  wrote:
> [3.606139] random benchmark!!
> [3.606276] get_random_int # cycles: 326578
> [3.606317] get_random_int_new # cycles: 95438
> [3.607423] get_random_bytes # cycles: 2653388

Looks to me like my siphash implementation is much faster for
get_random_long, and more or less tied for get_random_int:

[1.729370] random benchmark!!
[1.729710] get_random_long # cycles: 349771
[1.730128] get_random_long_chacha # cycles: 359660
[1.730457] get_random_long_siphash # cycles: 94255
[1.731307] get_random_bytes # cycles: 1354894
[1.731707] get_random_int # cycles: 305640
[1.732095] get_random_int_chacha # cycles: 80726
[1.732425] get_random_int_siphash # cycles: 94265
[1.733278] get_random_bytes # cycles: 1315873

Given the increasing usage of get_random_long for ASLR and related, I
think this makes the siphash approach worth pursuing. The chacha
approach is also not significantly different from the md5 approach in
terms of speed for get_random_long. Additionally, since siphash is a
PRF, I think this opens up a big window for optimizing it even
further.
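
For context, the cycle counts come from a harness along these lines (a
simplified sketch — the iteration count and the initcall hook here are
assumptions; the actual rng-bench code is linked below):

	static int __init rng_bench(void)
	{
		cycles_t start;
		int i;

		start = get_cycles();
		for (i = 0; i < 1000; ++i)
			get_random_long_siphash();
		pr_info("get_random_long_siphash # cycles: %llu\n",
			(unsigned long long)(get_cycles() - start));
		return 0;
	}
	late_initcall(rng_bench);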

Benchmark here:
https://git.zx2c4.com/linux-dev/commit/?h=rng-bench

Jason


[PATCH 3/3] Bluetooth: btusb: Configure Marvel to use one of the pins for oob wakeup

2016-12-14 Thread Rajat Jain
The Marvell devices may have many GPIO pins, and hence, for wakeup
on these out-of-band pins, the chip needs to be told which pin is
to be used for wakeup, using an HCI command.

Thus, we read the pin number etc. from the device tree node and send
a command to the chip.

Signed-off-by: Rajat Jain 
---
Note that while I would have liked to name the compatible string something
more like "marvell,usb8997-bt", Documentation/devicetree/bindings/usb/usb-device.txt
requires the compatible property to be of the form "usbVID,PID".

 .../{marvell-bt-sd8xxx.txt => marvell-bt-8xxx.txt} | 25 -
 drivers/bluetooth/btusb.c  | 59 ++
 2 files changed, 82 insertions(+), 2 deletions(-)
 rename Documentation/devicetree/bindings/net/{marvell-bt-sd8xxx.txt => 
marvell-bt-8xxx.txt} (76%)

diff --git a/Documentation/devicetree/bindings/net/marvell-bt-sd8xxx.txt 
b/Documentation/devicetree/bindings/net/marvell-bt-8xxx.txt
similarity index 76%
rename from Documentation/devicetree/bindings/net/marvell-bt-sd8xxx.txt
rename to Documentation/devicetree/bindings/net/marvell-bt-8xxx.txt
index 6a9a63c..471bef8 100644
--- a/Documentation/devicetree/bindings/net/marvell-bt-sd8xxx.txt
+++ b/Documentation/devicetree/bindings/net/marvell-bt-8xxx.txt
@@ -1,4 +1,4 @@
-Marvell 8897/8997 (sd8897/sd8997) bluetooth SDIO devices
+Marvell 8897/8997 (sd8897/sd8997) bluetooth devices (SDIO or USB based)
 --
 
 Required properties:
@@ -6,11 +6,13 @@ Required properties:
   - compatible : should be one of the following:
* "marvell,sd8897-bt"
* "marvell,sd8997-bt"
+   * "usb1286,204e"
 
 Optional properties:
 
   - marvell,cal-data: Calibration data downloaded to the device during
  initialization. This is an array of 28 values(u8).
+ This is only applicable to SDIO devices.
 
   - marvell,wakeup-pin: It represents wakeup pin number of the bluetooth chip.
firmware will use the pin to wakeup host system (u16).
@@ -29,7 +31,9 @@ Example:
 IRQ pin 119 is used as system wakeup source interrupt.
 wakeup pin 13 and gap 100ms are configured so that firmware can wakeup host
 using this device side pin and wakeup latency.
-calibration data is also available in below example.
+
+Example for SDIO device follows (calibration data is also available in
+below example).
 
 &mmc3 {
status = "okay";
@@ -54,3 +58,20 @@ calibration data is also available in below example.
marvell,wakeup-gap-ms = /bits/ 16 <0x64>;
};
 };
+
+Example for USB device:
+
+&usb_host1_ohci {
+status = "okay";
+#address-cells = <1>;
+#size-cells = <0>;
+
+mvl_bt1: bt@1 {
+   compatible = "usb1286,204e";
+   reg = <1>;
+   interrupt-parent = <&gpio0>;
+   interrupts = <119 IRQ_TYPE_LEVEL_LOW>;
+   marvell,wakeup-pin = /bits/ 16 <0x0d>;
+   marvell,wakeup-gap-ms = /bits/ 16 <0x64>;
+};
+};
diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
index 32a6f22..99d7f6d 100644
--- a/drivers/bluetooth/btusb.c
+++ b/drivers/bluetooth/btusb.c
@@ -2343,6 +2343,58 @@ static int btusb_shutdown_intel(struct hci_dev *hdev)
return 0;
 }
 
+#ifdef CONFIG_PM
+static const struct of_device_id mvl_oob_wake_match_table[] = {
+   { .compatible = "usb1286,204e" },
+   { }
+};
+MODULE_DEVICE_TABLE(of, mvl_oob_wake_match_table);
+
+/* Configure an out-of-band gpio as wake-up pin, if specified in device tree */
+static int marvell_config_oob_wake(struct hci_dev *hdev)
+{
+   struct sk_buff *skb;
+   struct btusb_data *data = hci_get_drvdata(hdev);
+   struct device *dev = &data->udev->dev;
+   u16 pin, gap, opcode;
+   int ret;
+   u8 cmd[5];
+
+   if (!of_match_device(mvl_oob_wake_match_table, dev))
+   return 0;
+
+   if (of_property_read_u16(dev->of_node, "marvell,wakeup-pin", &pin) ||
+   of_property_read_u16(dev->of_node, "marvell,wakeup-gap-ms", &gap))
+   return -EINVAL;
+
+   /* Vendor specific command to configure a GPIO as wake-up pin */
+   opcode = hci_opcode_pack(0x3F, 0x59);
+   cmd[0] = opcode & 0xFF;
+   cmd[1] = opcode >> 8;
+   cmd[2] = 2; /* length of parameters that follow */
+   cmd[3] = pin;
+   cmd[4] = gap; /* time in ms, for which wakeup pin should be asserted */
+
+   skb = bt_skb_alloc(sizeof(cmd), GFP_KERNEL);
+   if (!skb) {
+   bt_dev_err(hdev, "%s: No memory\n", __func__);
+   return -ENOMEM;
+   }
+
+   memcpy(skb_put(skb, sizeof(cmd)), cmd, sizeof(cmd));
+   hci_skb_pkt_type(skb) = HCI_COMMAND_PKT;
+
+   ret = btusb_send_frame(hdev, skb);
+   if (ret) {
+   bt_dev_err(hdev, "%s: configuration failed\n", __func__);
+   kfree_skb(skb);
+   return ret;
+   }
+
+   return 0;
+}
+#endif
+
 static int btusb_set_bdaddr_marvell(struct hci_dev *hdev,
const bda

[PATCH 2/3] Bluetooth: btusb: Add out-of-band wakeup support

2016-12-14 Thread Rajat Jain
Some BT chips (e.g. Marvell 8997) contain a wakeup pin that can be
connected to a GPIO on the CPU side and can be used to wake up
the host out-of-band. This can be useful in situations where
in-band wakeup is not possible or not preferable (e.g. in-band
wakeup may require the USB host controller to remain active, and
hence consume more system power during system sleep).

The OOB GPIO interrupt to be used for wakeup on the CPU side is
read from the device tree node (using standard interrupt descriptors).
A device tree binding document is also added for the driver.

Signed-off-by: Rajat Jain 
---
 Documentation/devicetree/bindings/net/btusb.txt | 38 
 drivers/bluetooth/btusb.c   | 82 +
 2 files changed, 120 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/btusb.txt

diff --git a/Documentation/devicetree/bindings/net/btusb.txt 
b/Documentation/devicetree/bindings/net/btusb.txt
new file mode 100644
index 000..bb27f92
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/btusb.txt
@@ -0,0 +1,38 @@
+Generic Bluetooth controller over USB (btusb driver)
+---
+
+Required properties:
+
+  - compatible : should comply with the format "usbVID,PID" specified in
+Documentation/devicetree/bindings/usb/usb-device.txt
+At the time of writing, the only OF supported devices
+(more may be added later) are:
+
+ "usb1286,204e" (Marvell 8997)
+
+Optional properties:
+
+  - interrupt-parent: phandle of the parent interrupt controller
+  - interrupts : The first interrupt specified is the interrupt that shall be
+used for out-of-band wake-on-bt. Driver will request an irq
+based on this interrupt number. During system suspend, the irq
+will be enabled so that the bluetooth chip can wakeup host
+platform out of band. During system resume, the irq will be
+disabled to make sure unnecessary interrupt is not received.
+
+Example:
+
+Following example uses irq pin number 3 of gpio0 for out of band wake-on-bt:
+
+&usb_host1_ehci {
+status = "okay";
+#address-cells = <1>;
+#size-cells = <0>;
+
+mvl_bt1: bt@1 {
+   compatible = "usb1286,204e";
+   reg = <1>;
+   interrupt-parent = <&gpio0>;
+   interrupts = <3 IRQ_TYPE_LEVEL_LOW>;
+};
+};
diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
index ce22cef..32a6f22 100644
--- a/drivers/bluetooth/btusb.c
+++ b/drivers/bluetooth/btusb.c
@@ -24,6 +24,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 #include 
@@ -369,6 +371,7 @@ static const struct usb_device_id blacklist_table[] = {
 #define BTUSB_BOOTING  9
 #define BTUSB_RESET_RESUME 10
 #define BTUSB_DIAG_RUNNING 11
+#define BTUSB_OOB_WAKE_DISABLED12
 
 struct btusb_data {
struct hci_dev   *hdev;
@@ -416,6 +419,8 @@ struct btusb_data {
int (*recv_bulk)(struct btusb_data *data, void *buffer, int count);
 
int (*setup_on_usb)(struct hci_dev *hdev);
+
+   int oob_wake_irq;   /* irq for out-of-band wake-on-bt */
 };
 
 static inline void btusb_free_frags(struct btusb_data *data)
@@ -2728,6 +2733,65 @@ static int btusb_bcm_set_diag(struct hci_dev *hdev, bool 
enable)
 }
 #endif
 
+#ifdef CONFIG_PM
+static irqreturn_t btusb_oob_wake_handler(int irq, void *priv)
+{
+   struct btusb_data *data = priv;
+
+   /* Disable only if not already disabled (keep it balanced) */
+   if (!test_and_set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags)) {
+   disable_irq_wake(irq);
+   disable_irq_nosync(irq);
+   }
+   pm_wakeup_event(&data->udev->dev, 0);
+   return IRQ_HANDLED;
+}
+
+static const struct of_device_id btusb_match_table[] = {
+   { .compatible = "usb1286,204e" },
+   { }
+};
+MODULE_DEVICE_TABLE(of, btusb_match_table);
+
+/* Use an oob wakeup pin? */
+static int btusb_config_oob_wake(struct hci_dev *hdev)
+{
+   struct btusb_data *data = hci_get_drvdata(hdev);
+   struct device *dev = &data->udev->dev;
+   int irq, ret;
+
+   if (!of_match_device(btusb_match_table, dev))
+   return 0;
+
+   /* Move on if no IRQ specified */
+   irq = irq_of_parse_and_map(dev->of_node, 0);
+   if (!irq) {
+   bt_dev_dbg(hdev, "%s: no oob wake irq in DT", __func__);
+   return 0;
+   }
+
+   set_bit(BTUSB_OOB_WAKE_DISABLED, &data->flags);
+
+   ret = devm_request_irq(&hdev->dev, irq, btusb_oob_wake_handler,
+  IRQF_TRIGGER_LOW, "oob wake-on-bt", data);
+   if (ret) {
+   bt_dev_err(hdev, "%s: irq request failed", __func__);
+   return ret;
+   }
+
+   ret = device_init_wakeup(dev, true);
+   if (ret) {
+   bt_dev_err(hdev, "%s: failed to init_wakeup

Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function

2016-12-14 Thread Tom Herbert
On Wed, Dec 14, 2016 at 10:46 AM, Jason A. Donenfeld  wrote:
> SipHash is a 64-bit keyed hash function that is actually a
> cryptographically secure PRF, like HMAC. Except SipHash is super fast,
> and is meant to be used as a hashtable keyed lookup function.
>
"super fast" is relative. My quick test shows that this faster than
Toeplitz (good, but not exactly hard to achieve), but is about 4x
slower than jhash.

> SipHash isn't just some new trendy hash function. It's been around for a
> while, and there really isn't anything that comes remotely close to
> being useful in the way SipHash is. With that said, why do we need this?
>
I don't think we need advertising nor a lesson on hashing. It would be
much more useful if you just point us to the paper on siphash (which I
assume is http://cr.yp.to/siphash/siphash-20120918.pdf ?).

> There are a variety of attacks known as "hashtable poisoning" in which an
> attacker forms some data such that the hash of that data will be the
> same, and then proceeds to fill up all entries of a hashbucket. This is
> a realistic and well-known denial-of-service vector.
>
> Linux developers already seem to be aware that this is an issue, and
> various places that use hash tables in, say, a network context, use a
> non-cryptographically secure function (usually jhash) and then try to
> twiddle with the key on a time basis (or in many cases just do nothing
> and hope that nobody notices). While this is an admirable attempt at
> solving the problem, it doesn't actually fix it. SipHash fixes it.
>
Key rotation is important anyway; without any key rotation, if the
key in siphash is compromised by some external means, we would have an
insecure hash until the system reboots.

> (It fixes it in such a sound way that you could even build a stream
> cipher out of SipHash that would resist the modern cryptanalysis.)
>
> There are a modicum of places in the kernel that are vulnerable to
> hashtable poisoning attacks, either via userspace vectors or network
> vectors, and there's not a reliable mechanism inside the kernel at the
> moment to fix it. The first step toward fixing these issues is actually
> getting a secure primitive into the kernel for developers to use. Then
> we can, bit by bit, port things over to it as deemed appropriate.
>
> Secondly, a few places are using MD5 for creating secure sequence
> numbers, port numbers, or fast random numbers. Siphash is a faster, more
> fitting, and more secure replacement for MD5 in those situations.
>
> Dozens of languages are already using this internally for their hash
> tables. Some of the BSDs already use this in their kernels. SipHash is
> a widely known high-speed solution to a widely known problem, and it's
> time we catch up.
>
Maybe so, but we need to do due diligence before considering adopting
siphash as the primary hashing in the network stack. Consider that we
may very well perform a hash over L4 tuples on _every_ packet. We've
done a good job at limiting this to be at most one hash per packet,
but nevertheless the performance of the hash function must be taken
into account.

A few points:

1) My quick test shows siphash is about four times more expensive than
jhash. On my test system, computing a hash over an IPv4 tuple (two 32-bit
addresses and two 16-bit ports) takes 6.9 nsecs with Jenkins hash, 33
nsecs with siphash. Given that we have eliminated most of the packet
header hashes this might be tolerable, but still should be looking at
ways to optimize.
2) I like moving to use u64 (quad words) in the hash, this is an
improvement over Jenkins which is based on 32 bit words. If we put
this in the kernel we probably want to have several variants of
siphash for specific sizes (e.g. siphash1, siphash2, siphash3,
siphashn for hash over one, two, three, or n sixty four bit words).
3) I also tested siphash distribution and Avalanche Effect (these
really should have been covered in the paper :-( ). Both of these are
good with siphash; in fact, they have almost identical characteristics to
Jenkins hash.
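
(For illustration, such a check can be sketched against the siphash24()
interface from this patch — flip each input bit and count how many
output bits change; ~32 of 64 is the ideal avalanche. The sketch below
is an assumption about methodology, not the exact test that was run:)

	static void avalanche_check(void)
	{
		static const u8 key[SIPHASH24_KEY_LEN]; /* arbitrary all-zero test key */
		u64 in __aligned(SIPHASH24_ALIGNMENT) = 0x0123456789abcdefULL;
		u64 base = siphash24((const u8 *)&in, sizeof(in), key);
		int bit, flipped = 0;

		for (bit = 0; bit < 64; ++bit) {
			u64 tmp __aligned(SIPHASH24_ALIGNMENT) = in ^ (1ULL << bit);

			flipped += hweight64(base ^ siphash24((const u8 *)&tmp,
							      sizeof(tmp), key));
		}
		/* averaged over the 64 single-bit flips */
		pr_info("avg output bits flipped: %d/64\n", flipped / 64);
	}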

Tom

> Signed-off-by: Jason A. Donenfeld 
> Cc: Jean-Philippe Aumasson 
> Cc: Daniel J. Bernstein 
> Cc: Linus Torvalds 
> Cc: Eric Biggers 
> Cc: David Laight 
> ---
> Changes from v2->v3:
>
>   - There is now a fast aligned version of the function and a not-as-fast
> unaligned version. The requirements for each have been documented in
> a docbook-style comment. As well, the header now contains a constant
> for the expected alignment.
>
>   - The test suite has been updated to check both the unaligned and aligned
> version of the function.
>
>  include/linux/siphash.h |  30 ++
>  lib/Kconfig.debug   |   6 +-
>  lib/Makefile|   5 +-
>  lib/siphash.c   | 153 
> 
>  lib/test_siphash.c  |  85 +++
>  5 files changed, 274 insertions(+), 5 deletions(-)
>  create mode 100644 include/linux/siphash.h
>  create mode 100644 lib/sipha

Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform

2016-12-14 Thread Hannes Frederic Sowa
On 14.12.2016 19:06, Jason A. Donenfeld wrote:
> Hi David,
> 
> On Wed, Dec 14, 2016 at 6:56 PM, David Miller  wrote:
>> Just marking the structure __packed, whether necessary or not, makes
>> the compiler assume that the members are not aligned and causes
>> byte-by-byte accesses to be performed for words.
>> Never, _ever_, use __packed unless absolutely necessary, it pessimizes
>> the code on cpus that require proper alignment of types.
> 
> Oh, jimminy cricket, I did not realize that it made assignments
> byte-by-byte *always*. So what options am I left with? What
> immediately comes to mind are:
> 
> 1)
> 
> struct {
> u64 a;
> u32 b;
> u32 c;
> u16 d;
> u8 end[];

I don't think this helps. Did you test it? I don't see a reason why
padding could be left out between `d' and `end' because of the flexible
array member.

Bye,
Hannes




Re: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function

2016-12-14 Thread Jason A. Donenfeld
Hi Tom,

On Wed, Dec 14, 2016 at 8:18 PM, Tom Herbert  wrote:
> "super fast" is relative. My quick test shows that this faster than
> Toeplitz (good, but not exactly hard to achieve), but is about 4x
> slower than jhash.

Fast relative to other cryptographically secure PRFs.

>> SipHash isn't just some new trendy hash function. It's been around for a
>> while, and there really isn't anything that comes remotely close to
>> being useful in the way SipHash is. With that said, why do we need this?
> I don't think we need advertising nor a lesson on hashing. It would be
> much more useful if you just point us to the paper on siphash (which I
> assume is http://cr.yp.to/siphash/siphash-20120918.pdf ?).

Ugh. Sorry. It definitely wasn't my intention to give an uninvited
lesson or an annoying advert. For the former, I didn't want to make
any expectations about fields of knowledge, because I honest have no
idea. For the latter, I wrote that sentence to indicate that siphash
isn't just some newfangled hipster function, but something useful and
well established. I didn't mean it as a form of advertising. My
apologies if I've offended your sensibilities.

That cr.yp.to link is fine, or https://131002.net/siphash/siphash.pdf I believe.

> Key rotation is important anyway; without any key rotation, if the
> key in siphash is compromised by some external means, we would have an
> insecure hash until the system reboots.

I'm a bit surprised to read this. I've never designed a system to be
secure even in the event of remote arbitrary kernel memory disclosure,
and I wasn't aware this was generally considered an architectural
requirement for Linux.

In any case, if you want this, I suppose you can have it with siphash too.

> Maybe so, but we need to do due diligence before considering adopting
> siphash as the primary hashing in the network stack. Consider that we
> may very well perform a hash over L4 tuples on _every_ packet. We've
> done a good job at limiting this to be at most one hash per packet,
> but nevertheless the performance of the hash function must be taken
> into account.

I agree with you. It seems like each case is going to need to be
measured on a case-by-case basis. In this series I make the first use
of siphash in the secure sequence generation and get_random_int/long,
where siphash replaces md5, so there's a pretty clear performance win.
But for the jhash replacements indeed things are going to need to be
individually evaluated.

> 1) My quick test shows siphash is about four times more expensive than
> jhash. On my test system, computing a hash over an IPv4 tuple (two 32-bit
> addresses and two 16-bit ports) takes 6.9 nsecs with Jenkins hash, 33
> nsecs with siphash. Given that we have eliminated most of the packet
> header hashes this might be tolerable, but still should be looking at
> ways to optimize.
> 2) I like moving to use u64 (quad words) in the hash, this is an
> improvement over Jenkins which is based on 32 bit words. If we put
> this in the kernel we probably want to have several variants of
> siphash for specific sizes (e.g. siphash1, siphash2, siphash3,
> siphashn for hash over one, two, three, or n sixty four bit words).

I think your suggestion for (2) will contribute to further
optimizations for (1). In v2, I had another patch in there adding
siphash_1word, siphash_2words, etc, like jhash, but I implemented it
by taking u32 variables and then just concatenating these into a
buffer and passing them to the main siphash function. I removed it
from v3 because I thought that these kind of missed the whole point.
In particular:

a) siphash24_1word, siphash24_2words, siphash24_3words, etc should
take u64, not u32, since that's what siphash operates on natively
b) Rather than concatenating them in a buffer, I should write
specializations of the siphash24 function _especially_ for these size
inputs to avoid the copy and to reduce the book keeping.

I'll add these functions to v4 implemented like that.
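
To make the interface shape concrete, here is a sketch of (a) only —
the body is the naive copy-through-a-buffer fallback that (b) is
precisely about eliminating, so treat it as a placeholder for the real
inlined-rounds version:

	static inline u64 siphash24_1word(const u64 a,
					  const u8 key[SIPHASH24_KEY_LEN])
	{
		u64 buf __aligned(SIPHASH24_ALIGNMENT) = a;

		return siphash24((const u8 *)&buf, sizeof(buf), key);
	}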

Thanks for the useful feedback and benchmarks!

Jason


Re: [PATCH v2 3/4] secure_seq: use siphash24 instead of md5_transform

2016-12-14 Thread Jason A. Donenfeld
Hi Hannes,

On Wed, Dec 14, 2016 at 8:22 PM, Hannes Frederic Sowa
 wrote:
> I don't think this helps. Did you test it? I don't see reason why
> padding could be left out between `d' and `end' because of the flexible
> array member?

Because the type u8 doesn't have any alignment requirements, it can
nestle right up there cozily with the u16:

zx2c4@thinkpad ~ $ cat a.c
#include 
#include 
#include 
int main()
{
   struct {
   uint64_t a;
   uint32_t b;
   uint32_t c;
   uint16_t d;
   char x[];
   } a;
   printf("%zu\n", sizeof(a));
   printf("%zu\n", offsetof(typeof(a), x));
   return 0;
}
zx2c4@thinkpad ~ $ gcc a.c
zx2c4@thinkpad ~ $ ./a.out
24
18

Jason


RE: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-14 Thread Christoph Lameter
On Wed, 14 Dec 2016, David Laight wrote:

> If the kernel is doing ANY validation on the frames it must copy the
> data to memory the application cannot modify before doing the validation.
> Otherwise the application could change the data afterwards.

The application is not allowed to change the data after a work request has
been submitted to send the frame. Changes are possible after the
completion request has been received.

The kernel can enforce that by making the frame(s) read-only, so that
the app gets a page fault if it tries to do such a thing.
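
(As a userspace analogy of that enforcement — mprotect() standing in
for the kernel revoking write access at submit time; a sketch, not the
API of any particular ring implementation:)

	#include <sys/mman.h>
	#include <string.h>

	int main(void)
	{
		char *frame = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		memset(frame, 0xab, 64);		/* app fills the frame */
		mprotect(frame, 4096, PROT_READ);	/* "submit": write access revoked */
		frame[0] = 0;				/* late modification faults (SIGSEGV) */
		return 0;
	}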



Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function

2016-12-14 Thread Jason A. Donenfeld
Hi Hannes,

On Wed, Dec 14, 2016 at 4:09 PM, Hannes Frederic Sowa
 wrote:
> Yes, numbers would be very usable here. I am mostly concerned about
> small plastic router cases. E.g. assume you double packet processing
> time with a change of the hashing function; at what point is the actual
> packet processing more of an attack vector than the hashtable?

I agree. Looks like Tom did some very quick benchmarks. I'll do some
more precise benchmarks myself when we graduate from looking at md5
replacement (the easy case) to looking at jhash replacement (the
harder case).

>> With that said, siphash is here to replace uses of jhash where
>> hashtable poisoning vulnerabilities make it necessary. Where there's
>> no significant security improvement, if there's no speed improvement
>> either, then of course nothing's going to change.
>
> It still changes currently well working source. ;-)

I mean if siphash doesn't make things better in some way, we'll just
continue using jhash, so no source change or anything. In other words:
evolutionary conservative approach rather than hasty "replace 'em
all!" tomfoolery.

> MD5 is considered broken because its collision resistance is broken?
> SipHash doesn't even claim to have collision resistance (which we don't
> need here)?

Not just that, but it's not immediately clear to me that using MD5 as
a PRF the way it is now with md5_transform is even a straightforwardly
good idea.

> But I agree, certainly it could be a nice speed-up!

The benchmarks for the secure sequence number generation and the rng
are indeed really promising.

> I think you mean non-linearity.

Yea of course, editing typo, sorry.

> In general I am in favor to switch to siphash, but it would be nice to
> see some benchmarks with the specific kernel implementation also on some
> smaller 32 bit CPUs and especially without using any SIMD instructions
> (which might have been used in paper comparison).

Sure, agreed. Each proposed jhash replacement will need to be
benchmarked on little MIPS machines and x86 monsters alike, with
patches indicating PPS before and after.

Jason


  1   2   >