[net-next] arp: add macro to get drop_gratuitous_arp setting

2016-03-07 Thread Zhang Shengju
Add macro IN_DEV_DROP_GRATUITOUS_ARP to facilitate getting
drop_gratuitous_arp value.

Signed-off-by: Zhang Shengju 
---
 include/linux/inetdevice.h | 3 +++
 net/ipv4/arp.c | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index ee971f3..9d1dd2c 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -128,6 +128,9 @@ static inline void ipv4_devconf_setall(struct in_device *in_dev)
 #define IN_DEV_ARP_ANNOUNCE(in_dev)	IN_DEV_MAXCONF((in_dev), ARP_ANNOUNCE)
 #define IN_DEV_ARP_IGNORE(in_dev)	IN_DEV_MAXCONF((in_dev), ARP_IGNORE)
 #define IN_DEV_ARP_NOTIFY(in_dev)	IN_DEV_MAXCONF((in_dev), ARP_NOTIFY)
+#define IN_DEV_DROP_GRATUITOUS_ARP(in_dev) \
+	IN_DEV_ORCONF((in_dev), \
+		      DROP_GRATUITOUS_ARP)
 
 struct in_ifaddr {
struct hlist_node   hash;
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index c34c754..0bf5cca 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -740,7 +740,7 @@ static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
	 *	there will be an ARP proxy and gratuitous ARP frames are attacks
	 *	and thus should not be accepted.
	 */
-   if (sip == tip && IN_DEV_ORCONF(in_dev, DROP_GRATUITOUS_ARP))
+   if (sip == tip && IN_DEV_DROP_GRATUITOUS_ARP(in_dev))
goto out_free_skb;
 
 /*
-- 
1.8.3.1





Re: [PATCH v2] can: rcar_canfd: Add Renesas R-Car CAN FD driver

2016-03-07 Thread Oliver Hartkopp


On 03/07/2016 09:32 AM, Ramesh Shanmugasundaram wrote:

> + /* Ensure channel starts in FD mode */
> + if (!(priv->can.ctrlmode & CAN_CTRLMODE_FD)) {
> + netdev_err(ndev, "enable can fd mode for channel %d\n", ch);
> + goto fail_mode;
> + }

 What's the reason behind this check?

 A CAN FD capable CAN controller can be either configured to run as
 CAN 2.0 (Classic CAN) or as CAN FD controller.

 So why are you throwing an error here and producing an initialization
 failure?
>>>
>>> When this controller is configured in FD mode and used only with CAN
>>> 2.0 nodes, it still expects a DTSEG (data bitrate) configuration the
>>> same as NTSEG (nominal bitrate). This check, specifically in ndo_open,
>>> ensures both are configured and it will work fine with CAN 2.0 nodes
>>> (e.g.)
>>>
>>> "ip link set can0 up type can bitrate 1000000 dbitrate 1000000 fd on"
>>>
>>> If I don't have this check, a configuration like this
>>>
>>> "ip link set can0 up type can bitrate 100"
>>>
>>> will bring up the controller without DTSEG configured.

What about using some status flag, or setting the data bitrate equal to the
nominal bitrate unless a data bitrate is provided?

>>
>> That should bring up the controller in CAN 2.0 mode.
> 
> Yes, that's the user's intention but the manual states DTSEG still needs to
> be configured. In the above configuration, it will not be.
> Besides, this will not be a "pure" CAN 2.0 node, i.e. if a frame with length
> > 8 bytes is received the controller will "ACK" it, because in FD mode it
> can receive frames of up to 64 bytes.

Oh. We are probably mixing something up here (CAN frame formats & bitrates).

A CAN2.0 frame and a CAN FD frame have very different representations on the
wire! So if you see an FDF (former EDL) bit, this is a CAN FD frame, which
requires two bitrates (nominal/data bitrate) where the data bitrate has to be
greater than or equal to the nominal bitrate.

The fact that the data bitrate is equal to the nominal/arbitration bitrate has
nothing to do with CAN2.0 then. Going by your answer, this is not even "a pure
CAN2.0" node - it still looks like a CAN FD node with equal data/nominal
bitrates.

The fact that a CAN FD frame has a size of 8 bytes doesn't make it a CAN2.0
frame :-)

> 
> The controller does support a "pure" classical CAN mode, with a different
> register map of its own.

Is this a can_rcar controller register mapping then?

> Do you think pure CAN 2.0 mode support would be beneficial? I can submit
> this in the coming days on top of the current submission.
> 
> The current submission status is:
>  - Controller operates in CAN FD mode only.
>  - If needed to interoperate with CAN 2.0 nodes, the data bitrate still
> needs to be configured and it will work perfectly. However, it is not a
> "pure" CAN 2.0 node as mentioned above.

When you have a CAN FD /capable/ controller the idea is:

"ip link set can0 up type can bitrate 100"

The controller is in CAN2.0 mode:

1. It can send and receive CAN2.0 frames @1MBit/s.
2. The MTU is set to 16 (sizeof(struct can_frame)); CAN_CTRLMODE_FD is unset.
3. The CAN controller is not CAN FD tolerant (will produce error frames)

"ip link set can0 up type can bitrate 100 dbitrate 100 fd on"

1. It can send and receive CAN2.0 frames @1MBit/s.
2. It can send and receive CAN FD frames @1MBit/s (arbitration bitrate).
3. The MTU is set to 72 (sizeof(struct canfd_frame)); CAN_CTRLMODE_FD is set.

For CAN FD frames the data bitrate can be increased like:
"ip link set can0 up type can bitrate 100 dbitrate 400 fd on"

So when CAN_CTRLMODE_FD is unset the controller should act like a "pure
CAN2.0" node. When people configure a CAN FD controller with "fd on" and use
CAN2.0 frames all the time, that is ok too - but the controller is also able
to process CAN FD frames at the correct bitrate.
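
As a minimal sketch of that fallback (assuming the usual struct can_priv
fields; this is not code from the posted driver):

	/* illustrative only: mirror the nominal phase timing into the
	 * data phase when FD mode is off, so DTSEG is always valid
	 */
	if (!(priv->can.ctrlmode & CAN_CTRLMODE_FD))
		priv->can.data_bittiming = priv->can.bittiming;

That way ndo_open would not have to fail; the driver would satisfy the
manual's DTSEG requirement by reusing the nominal bit timing.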

Regards,
Oliver



Re: [RFC/RFT] mac80211: implement fq_codel for software queuing

2016-03-07 Thread Michal Kazior
On 7 March 2016 at 19:28, Dave Taht  wrote:
> On Mon, Mar 7, 2016 at 9:14 AM, Avery Pennarun  wrote:
>> On Mon, Mar 7, 2016 at 11:54 AM, Dave Taht  wrote:
[...]
>>> the underlying code needs to be striving successfully for per-station
>>> airtime fairness for this to work at all, and the driver/card
>>> interface nearly as tight as BQL is for the fq portion to behave
>>> sanely. I'd configure codel at a higher target and try to observe what
>>> is going on at the fq level til that got saner.
>>
>> That seems like two good goals.  So Emmanuel's BQL-like thing seems
>> like we'll need it soon.
>>
>> As for per-station airtime fairness, what's a good approximation of
>> that?  Perhaps round-robin between stations, one aggregate per turn,
>> where each aggregate has a maximum allowed latency?
>
> Strict round robin is a start, and simplest, yes. Sure.
>
> "Oldest station queues first" on a round (probably) has higher
> potential for maximizing txops, but requires more overhead. (shortest
> queue first would be bad). There's another algo based on last received
> packets from a station possibly worth fiddling with in the long run...
>
> as "maximum allowed latency" - well, to me that is eventually also a
> variable, based on the number of stations that have to be scheduled on
> that round. Trying to get away from 10 stations eating 5.7ms each +
> return traffic on a round would be nicer. If you want a constant, for
> now, aim for 2048us or 1TU.

The "one aggregate per turn" is a tricky.

I guess you can guarantee this sort of thing on ath9k/mt76.

This isn't the case for other drivers, e.g. ath10k has a flat tx
queue. You don't really know if the 40 frames you submitted will be
sent with 1, 2 or 3 aggregates. They might not be aggregated at all.

The best thing you can do is to estimate how many bytes you can fit into
a txop for the target sta-tid/ac, assuming you can get the last/avg tx
rate to a given station (should be doable on ath10k at least).
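
As a rough sketch of that estimate (plain C; the names and units are my
own assumptions, not driver API):

	#include <stdint.h>

	/* bytes that fit into one txop at the averaged rate:
	 * kbit/s * us = bits / 1000, then / 8 for bytes
	 */
	static uint32_t txop_byte_budget(uint32_t avg_rate_kbps,
					 uint32_t txop_us)
	{
		uint64_t bits = (uint64_t)avg_rate_kbps * txop_us;

		return (uint32_t)(bits / 8000);
	}

e.g. 300000 kbps over a 2048us txop gives a budget of ~76800 bytes.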

Moreover, for MU-MIMO you typically want to burst a few aggregates in a
txop to make the sounding pay off. And this is again tricky on a flat
tx queue where you don't really know if the target stations can do an
efficient MU transmission, and in the worst case you'll end up stacking
up 3 txops' worth of data in queues.


Oh, and the unfortunate thing is ath10k does offloaded powersaving
which means some frames can clog up tx queues unnecessarily until next
TBTT. This is something that will need to be addressed as well in tx
scheduling. Not sure how yet.

A quick idea - perhaps we could unify ps_tx_buf with txqs and make use
of txqs internally regardless of wake_tx_queue implementation?


[...]
> [1] I've published a lot of stuff showing how damaging 802.11e's edca
> scheduling can be - I lean towards, at most, 2-3 aggregates being in
> the hardware, essentially disabling the VO queue on 802.11n (not sure
> on ac), in favor of VI, promoting or demoting an assembled aggregate
> from BE to BK or VI as needed at the last second before submitting it
> to the hardware, trying harder to only have one aggregate outstanding
> to one station at a time, etc.

Makes sense, but (again) tricky for drivers such as ath10k which have
a flat tx queue. Perhaps I could maintain a simulation of aggregates
or some sort of barriers and hope it's "good enough".


Michał


Re: [PATCH v3 0/8] arm64: rockchip: Initial GeekBox enablement

2016-03-07 Thread Giuseppe CAVALLARO

Hi Dinh,

On 3/8/2016 12:22 AM, Dinh Nguyen wrote:
[snip]


I'm seeing the same issue on the SoCFPGA platform:

libphy: PHY stmmac-0: not found
eth0: Could not attach to PHY
stmmac_open: Cannot attach to PHY (error: -19)

If I just revert:

  "stmmac: Fix 'eth0: No PHY found' regression"

then the issue goes away.


do you have this patch "stmmac: first frame prep at the end of xmit
routine"? Or did you just revert

  "stmmac: Fix 'eth0: No PHY found' regression"

maybe we can sum up:

  "stmmac: Fix 'eth0: No PHY found' regression"
   introduces a problem when a box
   has a transceiver

  "stmmac: first frame prep at the end of xmit routine"
   breaks the tx path on arm64

I will check both patches on my side and let you know

peppe



Thanks,
Dinh





Re: [PATCH net 4/4] net: hns: remove useless head=ring->next_to_clean

2016-03-07 Thread Andy Shevchenko
On Tue, 2016-03-08 at 11:52 +0800, Lisheng wrote:
> From: Qianqian Xie 
> 
> The variable head in hns_nic_tx_fini_pro is assigned a value
> that is never used. This patch removes the useless assignment.
> 
> Signed-off-by: Qianqian Xie 
> ---
>  drivers/net/ethernet/hisilicon/hns/hns_enet.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
> index 3f77ff7..7b4ec2f 100644
> --- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
> +++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
> @@ -901,10 +901,8 @@ static int hns_nic_tx_poll_one(struct hns_nic_ring_data *ring_data,
>  static void hns_nic_tx_fini_pro(struct hns_nic_ring_data *ring_data)
>  {
>   struct hnae_ring *ring = ring_data->ring;
> - int head = ring->next_to_clean;
>  

This empty line is a leftover; I would prefer to just cut the assignment.

> - /* for hardware bug fixed */

This is not good. An explanation is needed for why you removed this comment.

> - head = readl_relaxed(ring->io_base + RCB_REG_HEAD);
> + int head = readl_relaxed(ring->io_base + RCB_REG_HEAD);
>  
>   if (head != ring->next_to_clean) {
>   ring_data->ring->q->handle->dev->ops->toggle_ring_irq(
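
A sketch of the shape being suggested here (my combination of the two
comments above, not a tested patch):

	/* for hardware bug fixed */
	int head = readl_relaxed(ring->io_base + RCB_REG_HEAD);

i.e. keep the comment explaining why the register read is needed and
drop only the dead initialization from ring->next_to_clean.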

-- 
Andy Shevchenko 
Intel Finland Oy



Re: [RFC/RFT] mac80211: implement fq_codel for software queuing

2016-03-07 Thread Michal Kazior
On 8 March 2016 at 00:06, Dave Taht  wrote:
> Dear Michal:
>
> Going through this patchset... (while watching it compile)
>
>
> +   if (!local->hw.txq_cparams.target)
> +   local->hw.txq_cparams.target = MS2TIME(5);
>
> MS2TIME(20) for now and/or add something saner to this than !*backlog

Yep, makes sense.


> target will not be a constant in the long run.
>
> +   if (now - custom_codel_get_enqueue_time(skb) < p->target ||
> +   !*backlog) {
> +   /* went below - stay below for at least interval */
> +   vars->first_above_time = 0;
> +   return false;
> +   }
>
> *backlog < some_sane_value_for_an_aggregate_for_this_station
>
> Unlike regular codel *backlog should be a ptr to the queuesize for
> this station, not the total queue.
>
> regular codel, by using the shared backlog for all queues, is trying
> to get to a 1 packet depth for all queues, (see commit:
> 865ec5523dadbedefbc5710a68969f686a28d928 ), and store the flow in the
> network, not the queue...
>
> BUT in wifi's case you want to provide good service to all stations,
> which is filling up an aggregate
> for each... (and varying the "sane_value_for_the_aggregate" to suit
> per sta service time requirements in a given round of all stations).

This is tricky. Drivers that use minstrel can probably do the estimate
rather easily.

However other drivers (e.g. ath10k) have offloaded rate control on the
device. There's currently no way of doing this calculation. I was
thinking of drivers exporting tx_rate to mac80211 in some way - either
via a simple sta->tx_rate scalar that the driver is responsible for
updating, or an EWMA that the driver updates (hopefully) periodically and
often enough. This should in theory at least allow an estimate of how
much data on average you can fit into a given time frame (e.g. a txop, or
a hardcoded 1ms).

I'll try looking more into this. Any hints/suggestion welcome.
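
A minimal sketch of that EWMA idea (the names and the 1/8 weight are my
assumptions, not existing mac80211 API):

	#include <stdint.h>

	struct sta_rate_ewma {
		uint32_t avg_kbps;	/* smoothed tx rate */
	};

	static void sta_rate_ewma_update(struct sta_rate_ewma *e,
					 uint32_t sample_kbps)
	{
		if (!e->avg_kbps)	/* first sample seeds the average */
			e->avg_kbps = sample_kbps;
		else			/* new samples get weight 1/8 */
			e->avg_kbps = e->avg_kbps - (e->avg_kbps >> 3) +
				      (sample_kbps >> 3);
	}

Combined with the txop length, mac80211 could then derive the per-station
byte budget discussed earlier in the thread.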


>
> ...
>
> +   fq->flows_cnt = 4096;
>
> regular fq_codel uses 1024 and there has not been much reason to
> change it. In the case of an AP which has more limited memory, 256 or
> 1024 would be a good setting, per station. I'd stick to 1024 for now.

Do note that the 4096 is shared _across_ station-tid queues. It is not
per-station. If you have 10 stations you still have 4096 flows
(actually 4096 + 16*10, because each tid - and there are 16 - has its
own fallback flow in case of a hash collision on the global flowmap, to
maintain per-sta-tid queuing).

With that in mind do you still think 1024 is enough?


> With large values for flows_cnt, fq dominates; for small values, aqm
> does. We did quite a lot of testing at 16 and 32 queues in the early
> days, with pretty good results, except when we didn't. Cake went whole
> hog with an 8 way set associative hash leading to "near perfect" fq,
> which, at the cost of more cpu overhead, could cut the number of
> queues down by a lot, also. Eric did "perfect" fq with sch_fq...

Out of curiosity - do you have any numbers to compare against
fq_codel? Like hash collision probability vs number of active flows?
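
A quick back-of-the-envelope on my side (assuming a uniform hash into m
buckets): the chance that a given flow shares a bucket with at least one
of the other k-1 active flows is 1 - (1 - 1/m)^(k-1), roughly (k-1)/m
for k << m.

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		const double m = 1024.0;	/* buckets */

		for (int k = 16; k <= 512; k *= 4)
			printf("flows=%3d  P(collision) ~= %.3f\n",
			       k, 1.0 - pow(1.0 - 1.0 / m, k - 1));
		return 0;
	}

So with 1024 buckets, a few hundred active flows already give a per-flow
collision probability in the tens of percent.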


> (btw: another way to test how codel is working is to set flows_cnt to
> 1. I will probably end up doing that at some point)
>
> +   fq->perturbation = prandom_u32();
> +   fq->quantum = 300;
>
> quantum 300 is a good compromise to maximize delivery of small packets
> from different flows. Probably the right thing on a station.
>
> It also has cpu overhead. Quantum=1514 is more of the right thing on an AP.
>
> (but to wax philosophical, per packet fairness rather than byte
> fairness probably spreads errors across more flows in a wifi aggregate
> than byte fairness, thus 300 remains a decent compromise if you can
> spare the cpu)
>
> ...
>
> where would be a suitable place to make (count, ecn_marks, drops)
> visible in this subsystem?

I was thinking of using debugfs for starters and then, once things
settle down, move to nl80211. Maybe we could re-use some enums like
TCA_FQ_CODEL_TARGET? Not sure if that's the best idea though. We might
need extra knobs to control, and we should keep in mind that we might
want to replace the queuing algorithm at some point in the future as
well (which will need a different set of knobs).


>
> ...
>
> Is this "per station" or per station, per 802.11e queue?

What do you have in mind by "this"?



Michał

>
> Dave Täht
> Let's go make home routers and wifi faster! With better software!
> https://www.gofundme.com/savewifi
>
>
> On Fri, Feb 26, 2016 at 5:09 AM, Michal Kazior  
> wrote:
>> Since 11n, aggregation became important to get the
>> best out of txops. However aggregation inherently
>> requires buffering and queuing. Once variable
>> medium conditions to different associated stations
>> are considered it becomes apparent that bufferbloat
>> can't be simply fought with qdiscs for wireless
>> drivers. 11ac with MU-MIMO makes the problem
>> worse because the bandwidth-delay product 

RE: [net-next:master 1058/1060] qede_main.c:undefined reference to `tcp_gro_complete'

2016-03-07 Thread Manish Chopra
> -Original Message-
> From: kbuild test robot [mailto:fengguang...@intel.com]
> Sent: Tuesday, March 08, 2016 6:45 AM
> To: Manish Chopra 
> Cc: kbuild-...@01.org; netdev ; Yuval Mintz
> 
> Subject: [net-next:master 1058/1060] qede_main.c:undefined reference to
> `tcp_gro_complete'
> 
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
> master
> head:   d66ab51442211158b677c2f12310c314d9587f74
> commit: 55482edc25f0606851de42e73618f813f310d009 [1058/1060] qede: Add
> slowpath/fastpath support and enable hardware GRO
> config: x86_64-randconfig-s0-03080757 (attached as .config)
> reproduce:
> git checkout 55482edc25f0606851de42e73618f813f310d009
> # save the attached .config to linux build tree
> make ARCH=x86_64
> 
> All errors (new ones prefixed by >>):
> 
>drivers/built-in.o: In function `qede_rx_int':
> >> qede_main.c:(.text+0x6101a0): undefined reference to `tcp_gro_complete'
> 
> ---

Seems to be a false alarm?
I tried compiling net-next with "make ARCH=x86_64" using the attached ".config"
after David applied the series.
I didn't see any error or warning as reported above.

Thanks,
Manish


Re: [PATCH 0/3] net: macb: Fix coding style issues

2016-03-07 Thread Alexander Stein
Hi,

On Monday 07 March 2016 08:17:36, Moritz Fischer wrote:
> this series deals with most of the checkpatch warnings
> generated for macb. There are two BUG_ON()'s that I didn't touch, yet,
> that were suggested by checkpatch, that I can address in a follow up
> commit if needed.

I think addressing those BUG_ON() warnings would be nice as they can affect a
running system pretty badly.

Best regards,
Alexander



Re: [PATCH net-next 2/3] ipv6: per netns fib6 walkers

2016-03-07 Thread Michal Kubecek
On Mon, Mar 07, 2016 at 04:28:26PM -0800, Cong Wang wrote:
> On Mon, Mar 7, 2016 at 4:26 PM, Cong Wang  wrote:
> > On Fri, Mar 4, 2016 at 2:59 AM, Michal Kubecek  wrote:
> >>  static void ipv6_route_seq_setup_walk(struct ipv6_route_iter *iter)
> >>  {
> >> +#ifdef CONFIG_NET_NS
> >> +   struct net *net = iter->p.net;
> >> +#else
> >> +   struct net *net = &init_net;
> >> +#endif
> >> +
> >
> > You should pass the struct net pointer to ipv6_route_seq_setup_walk()
> > instead of reading it by yourself.

I considered this. While it probably wouldn't bring any extra overhead
as the function is going to be inlined anyway, it didn't look really
nice. I guess I'll use read_pnet() as David suggested; I just didn't
realize the reason it's a macro in the !CONFIG_NET_NS case is to allow
passing a pointer to a non-existent struct member.

> > I don't find anyone actually using iter->p, it probably can be just removed.
> 
> Er, seq_file_net() uses it... but callers already call it.

Not only seq_file_net(). The whole infrastructure assumes private data
start with an instance of struct seq_net_private and seq_open_net()
initializes it.

Michal Kubecek


[PATCH v2 net-next 05/12] bpf: convert stackmap to pre-allocation

2016-03-07 Thread Alexei Starovoitov
It was observed that calling bpf_get_stackid() from a kprobe inside
slub or from spin_unlock causes a similar deadlock as with hashmap,
therefore convert stackmap to use pre-allocated memory.

The call_rcu is no longer a feasible mechanism, since delayed freeing
causes bpf_get_stackid() to fail unpredictably when the number of actual
stacks is significantly less than the user requested max_entries.
Since elements are no longer freed into slub, we can push elements onto
the freelist immediately and let them be recycled.
However the very unlikely race between user space map_lookup() and
program-side recycling is possible:
 cpu0                               cpu1
 ----                               ----
user does lookup(stackidX)
starts copying ips into buffer
                                    delete(stackidX)
                                    calls bpf_get_stackid()
                                    which recycles the element and
                                    overwrites with new stack trace

To avoid user space seeing a partial stack trace consisting of two
merged stack traces, do bucket = xchg(&smap->buckets[id], NULL); copy;
xchg(&smap->buckets[id], bucket); to preserve consistent stack trace
delivery to user space.
Now we can move the memset() zeroing of the left-over element value from
the critical path of bpf_get_stackid() into the slow-path of user space
lookup.
Also disallow lookup() from bpf programs, since it's useless and a
program shouldn't be messing with the collected stack trace.
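
The lookup side then has the following shape (a sketch based on the
description above and the hunks below, not the complete patch):

	/* take the bucket out of the map while copying */
	bucket = xchg(&smap->buckets[id], NULL);
	if (!bucket)
		return -ENOENT;
	trace_len = bucket->nr * sizeof(u64);
	memcpy(value, bucket->ip, trace_len);
	memset(value + trace_len, 0, map->value_size - trace_len);
	/* put it back; if an update recycled it meanwhile, the stale
	 * bucket goes back to the freelist
	 */
	old_bucket = xchg(&smap->buckets[id], bucket);
	if (old_bucket)
		pcpu_freelist_push(&smap->freelist, &old_bucket->fnode);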

Note that a similar race between user space lookup and kernel side updates
is also present in hashmap, but it's not a new race. bpf programs were
always allowed to modify hash and array map elements while user space
is copying them.

Fixes: d5a3b1f69186 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf.h   |  1 +
 kernel/bpf/stackmap.c | 86 ---
 kernel/bpf/syscall.c  |  2 ++
 3 files changed, 71 insertions(+), 18 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index efd1d4ca95c6..21ee41b92e8a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -195,6 +195,7 @@ int bpf_percpu_hash_update(struct bpf_map *map, void *key, void *value,
   u64 flags);
 int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value,
u64 flags);
+int bpf_stackmap_copy(struct bpf_map *map, void *key, void *value);
 
 /* memcpy that is used with 8-byte aligned pointers, power-of-8 size and
  * forced to use 'long' read/writes to try to atomically copy long counters.
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index f0a02c344358..499d9e933f8e 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -10,9 +10,10 @@
 #include 
 #include 
 #include 
+#include "percpu_freelist.h"
 
 struct stack_map_bucket {
-   struct rcu_head rcu;
+   struct pcpu_freelist_node fnode;
u32 hash;
u32 nr;
u64 ip[];
@@ -20,10 +21,34 @@ struct stack_map_bucket {
 
 struct bpf_stack_map {
struct bpf_map map;
+   void *elems;
+   struct pcpu_freelist freelist;
u32 n_buckets;
-   struct stack_map_bucket __rcu *buckets[];
+   struct stack_map_bucket *buckets[];
 };
 
+static int prealloc_elems_and_freelist(struct bpf_stack_map *smap)
+{
+   u32 elem_size = sizeof(struct stack_map_bucket) + smap->map.value_size;
+   int err;
+
+   smap->elems = vzalloc(elem_size * smap->map.max_entries);
+   if (!smap->elems)
+   return -ENOMEM;
+
+   err = pcpu_freelist_init(&smap->freelist);
+   if (err)
+   goto free_elems;
+
+   pcpu_freelist_populate(&smap->freelist, smap->elems, elem_size,
+  smap->map.max_entries);
+   return 0;
+
+free_elems:
+   vfree(smap->elems);
+   return err;
+}
+
 /* Called from syscall */
 static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 {
@@ -70,12 +95,22 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
smap->n_buckets = n_buckets;
smap->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
 
+   err = bpf_map_precharge_memlock(smap->map.pages);
+   if (err)
+   goto free_smap;
+
err = get_callchain_buffers();
if (err)
goto free_smap;
 
+   err = prealloc_elems_and_freelist(smap);
+   if (err)
+   goto put_buffers;
+
	return &smap->map;
 
+put_buffers:
+   put_callchain_buffers();
 free_smap:
kvfree(smap);
return ERR_PTR(err);
@@ -121,7 +156,7 @@ static u64 bpf_get_stackid(u64 r1, u64 r2, u64 flags, u64 r4, u64 r5)
ips = trace->ip + skip + init_nr;
hash = jhash2((u32 *)ips, trace_len / sizeof(u32), 0);
id = hash & (smap->n_buckets - 1);
-   bucket = rcu_dereference(smap->buckets[id]);
+   bucket = READ_ONCE(smap->buckets[id]);
 
if (bucket && bucket->hash == hash) {
if 

[PATCH v2 net-next 04/12] bpf: check for reserved flag bits in array and stack maps

2016-03-07 Thread Alexei Starovoitov
Suggested-by: Daniel Borkmann 
Signed-off-by: Alexei Starovoitov 
---
 kernel/bpf/arraymap.c | 2 +-
 kernel/bpf/stackmap.c | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index bd3bdf2486a7..76d5a794e426 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -53,7 +53,7 @@ static struct bpf_map *array_map_alloc(union bpf_attr *attr)
 
/* check sanity of attributes */
if (attr->max_entries == 0 || attr->key_size != 4 ||
-   attr->value_size == 0)
+   attr->value_size == 0 || attr->map_flags)
return ERR_PTR(-EINVAL);
 
if (attr->value_size >= 1 << (KMALLOC_SHIFT_MAX - 1))
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 8a60ee14a977..f0a02c344358 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -35,6 +35,9 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
if (!capable(CAP_SYS_ADMIN))
return ERR_PTR(-EPERM);
 
+   if (attr->map_flags)
+   return ERR_PTR(-EINVAL);
+
/* check sanity of attributes */
if (attr->max_entries == 0 || attr->key_size != 4 ||
value_size < 8 || value_size % 8 ||
-- 
2.8.0.rc1



[PATCH v2 net-next 03/12] bpf: pre-allocate hash map elements

2016-03-07 Thread Alexei Starovoitov
If a kprobe is placed on spin_unlock then calling kmalloc/kfree from
bpf programs is not safe, since the following deadlock is possible:
kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)
and deadlocks.

The following solutions were considered and some implemented, but
eventually discarded
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work

At the end pre-allocation of all map elements turned out to be the simplest
solution and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect the user space visible behavior.

Since it's impossible to tell whether a kprobe is triggered in a safe
location from the kmalloc point of view, use pre-allocation by default
and introduce the new BPF_F_NO_PREALLOC flag.

While testing per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.

Turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is a very common pattern used
in many of iovisor/bcc/tools, so there is an additional benefit of
pre-allocation, since such use cases are much faster.

Since all hash map elements are now pre-allocated we can remove the
atomic increment of htab->count and save a few more cycles.

Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.

Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:

1 cpu:
pcpu_ida        2.1M
pcpu_ida nolock 2.3M
bt  2.4M
kmalloc 1.8M
hlist+spinlock  2.3M
pcpu_freelist   2.6M

4 cpu:
pcpu_ida        1.5M
pcpu_ida nolock 1.8M
bt w/smp_align  1.7M
bt no/smp_align 1.1M
kmalloc 0.7M
hlist+spinlock  0.2M
pcpu_freelist   2.0M

8 cpu:
pcpu_ida        0.7M
bt w/smp_align  0.8M
kmalloc 0.4M
pcpu_freelist   1.5M

32 cpu:
kmalloc 0.13M
pcpu_freelist   0.49M

pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than existing percpu_ida, but not as fast as pcpu_freelist.

bt is a variant of block/blk-mq-tag.c simplified and customized
for the bpf use case. bt w/smp_align uses a cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates 'long'
bitmasks contiguously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist.

hlist+spinlock is the simplest free list with a single spinlock.
As expected it has very bad scaling in SMP.

kmalloc is the existing implementation, still available via the
BPF_F_NO_PREALLOC flag. It's significantly slower on a single cpu and
in the 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but it saves memory, so in cases where map->max_entries can be large
and the number of map updates/deletes per second is low, it may make
sense to use it.
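
For illustration, opting such a map out of pre-allocation with the
samples/bpf helper extended later in this series would look like:

	/* large, rarely-updated hash map: trade update speed for memory */
	int fd = bpf_create_map(BPF_MAP_TYPE_HASH,
				sizeof(long long), sizeof(long long),
				1024 * 1024, BPF_F_NO_PREALLOC);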

Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf.h  |   2 +
 include/uapi/linux/bpf.h |   3 +
 kernel/bpf/hashtab.c | 240 +--
 kernel/bpf/syscall.c |  15 ++-
 4 files changed, 186 insertions(+), 74 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4b070827200d..efd1d4ca95c6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -37,6 +37,7 @@ struct bpf_map {
u32 key_size;
u32 value_size;
u32 max_entries;
+   u32 map_flags;
u32 pages;
struct user_struct *user;
const struct bpf_map_ops *ops;
@@ -178,6 +179,7 @@ struct bpf_map *__bpf_map_get(struct fd f);
 void bpf_map_inc(struct bpf_map *map, bool uref);
 void bpf_map_put_with_uref(struct bpf_map *map);
 void bpf_map_put(struct bpf_map *map);
+int bpf_map_precharge_memlock(u32 pages);
 
 extern int sysctl_unprivileged_bpf_disabled;
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6496f98d3d68..7b4535e6325c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -101,12 +101,15 @@ enum bpf_prog_type {
 #define BPF_NOEXIST1 /* create new element if it didn't exist */
 #define BPF_EXIST  2 /* update existing element */
 
+#define BPF_F_NO_PREALLOC  (1U << 0)
+
 union bpf_attr {
struct { /* anonymous struct used by BPF_MAP_CREATE command */
__u32   map_type;   /* one of enum bpf_map_type */
__u32   key_size;   /* size of key in bytes */
__u32   value_size; /* size of value in bytes */
__u32   max_entries;/* max number 

[PATCH v2 net-next 09/12] samples/bpf: test both pre-alloc and normal maps

2016-03-07 Thread Alexei Starovoitov
extend test coverage to include pre-allocated and run-time allocated maps

Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/test_maps.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/samples/bpf/test_maps.c b/samples/bpf/test_maps.c
index 7bd9edd02d9b..47bf0858f9e4 100644
--- a/samples/bpf/test_maps.c
+++ b/samples/bpf/test_maps.c
@@ -468,7 +468,7 @@ static void test_map_parallel(void)
	assert(bpf_get_next_key(map_fd, &key, &next_key) == -1 && errno == ENOENT);
 }
 
-int main(void)
+static void run_all_tests(void)
 {
test_hashmap_sanity(0, NULL);
test_percpu_hashmap_sanity(0, NULL);
@@ -479,6 +479,14 @@ int main(void)
test_map_large();
test_map_parallel();
test_map_stress();
+}
+
+int main(void)
+{
+   map_flags = 0;
+   run_all_tests();
+   map_flags = BPF_F_NO_PREALLOC;
+   run_all_tests();
printf("test_maps: OK\n");
return 0;
 }
-- 
2.8.0.rc1



[PATCH v2 net-next 01/12] bpf: prevent kprobe+bpf deadlocks

2016-03-07 Thread Alexei Starovoitov
If a kprobe is placed within the update or delete hash map helpers
that hold the bucket spin lock, and the triggered bpf program tries to
grab the spinlock for the same bucket on the same cpu, it will
deadlock.
Fix it by extending the existing recursion prevention mechanism.

Note, map_lookup and other tracing helpers don't have this problem,
since they don't hold any locks and don't modify global data.
bpf_trace_printk has its own recursion check and is ok as well.

Signed-off-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 include/linux/bpf.h  |  3 +++
 kernel/bpf/syscall.c | 13 +
 kernel/trace/bpf_trace.c |  2 --
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 51e498e5470e..4b070827200d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct bpf_map;
 
@@ -163,6 +164,8 @@ bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp)
 const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
 
 #ifdef CONFIG_BPF_SYSCALL
+DECLARE_PER_CPU(int, bpf_prog_active);
+
 void bpf_register_prog_type(struct bpf_prog_type_list *tl);
 void bpf_register_map_type(struct bpf_map_type_list *tl);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c95a753c2007..dc99f6a000f5 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -18,6 +18,8 @@
 #include 
 #include 
 
+DEFINE_PER_CPU(int, bpf_prog_active);
+
 int sysctl_unprivileged_bpf_disabled __read_mostly;
 
 static LIST_HEAD(bpf_map_types);
@@ -347,6 +349,11 @@ static int map_update_elem(union bpf_attr *attr)
if (copy_from_user(value, uvalue, value_size) != 0)
goto free_value;
 
+   /* must increment bpf_prog_active to avoid kprobe+bpf triggering from
+* inside bpf map update or delete otherwise deadlocks are possible
+*/
+   preempt_disable();
+   __this_cpu_inc(bpf_prog_active);
if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH) {
err = bpf_percpu_hash_update(map, key, value, attr->flags);
} else if (map->map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
@@ -356,6 +363,8 @@ static int map_update_elem(union bpf_attr *attr)
err = map->ops->map_update_elem(map, key, value, attr->flags);
rcu_read_unlock();
}
+   __this_cpu_dec(bpf_prog_active);
+   preempt_enable();
 
 free_value:
kfree(value);
@@ -394,9 +403,13 @@ static int map_delete_elem(union bpf_attr *attr)
if (copy_from_user(key, ukey, map->key_size) != 0)
goto free_key;
 
+   preempt_disable();
+   __this_cpu_inc(bpf_prog_active);
rcu_read_lock();
err = map->ops->map_delete_elem(map, key);
rcu_read_unlock();
+   __this_cpu_dec(bpf_prog_active);
+   preempt_enable();
 
 free_key:
kfree(key);
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 4b8caa392b86..3e4ffb3ace5f 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -13,8 +13,6 @@
 #include 
 #include "trace.h"
 
-static DEFINE_PER_CPU(int, bpf_prog_active);
-
 /**
  * trace_call_bpf - invoke BPF program
  * @prog: BPF program
-- 
2.8.0.rc1



[PATCH v2 net-next 02/12] bpf: introduce percpu_freelist

2016-03-07 Thread Alexei Starovoitov
Introduce a simple percpu_freelist to keep a single list of elements
spread across per-cpu singly linked lists.

/* push element into the list */
void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *);

/* pop element from the list */
struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *);

The object is pushed onto the current cpu's list.
Pop first tries to get the object from the current cpu's list;
if it's empty it goes to a neighbour cpu's list.

For the bpf program usage pattern the collision rate is very low,
since programs push and pop the objects typically on the same cpu.
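
For illustration, a user embeds the node in its element type and converts
back with container_of() (a sketch, not code from this series):

	struct my_elem {
		struct pcpu_freelist_node fnode;
		u64 value;
	};

	/* alloc: pop a recycled element, NULL when all are in use */
	struct pcpu_freelist_node *n = pcpu_freelist_pop(&fl);
	struct my_elem *e = n ? container_of(n, struct my_elem, fnode)
			      : NULL;

	/* free: push it back for reuse */
	pcpu_freelist_push(&fl, &e->fnode);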

Signed-off-by: Alexei Starovoitov 
---
 kernel/bpf/Makefile  |   2 +-
 kernel/bpf/percpu_freelist.c | 100 +++
 kernel/bpf/percpu_freelist.h |  31 ++
 3 files changed, 132 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/percpu_freelist.c
 create mode 100644 kernel/bpf/percpu_freelist.h

diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 8a932d079c24..eed911d091da 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,7 +1,7 @@
 obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
diff --git a/kernel/bpf/percpu_freelist.c b/kernel/bpf/percpu_freelist.c
new file mode 100644
index 000000000000..5c51d1985b51
--- /dev/null
+++ b/kernel/bpf/percpu_freelist.c
@@ -0,0 +1,100 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include "percpu_freelist.h"
+
+int pcpu_freelist_init(struct pcpu_freelist *s)
+{
+   int cpu;
+
+   s->freelist = alloc_percpu(struct pcpu_freelist_head);
+   if (!s->freelist)
+   return -ENOMEM;
+
+   for_each_possible_cpu(cpu) {
+   struct pcpu_freelist_head *head = per_cpu_ptr(s->freelist, cpu);
+
+   raw_spin_lock_init(&head->lock);
+   head->first = NULL;
+   }
+   return 0;
+}
+
+void pcpu_freelist_destroy(struct pcpu_freelist *s)
+{
+   free_percpu(s->freelist);
+}
+
+static inline void __pcpu_freelist_push(struct pcpu_freelist_head *head,
+   struct pcpu_freelist_node *node)
+{
+   raw_spin_lock(&head->lock);
+   node->next = head->first;
+   head->first = node;
+   raw_spin_unlock(&head->lock);
+}
+
+void pcpu_freelist_push(struct pcpu_freelist *s,
+   struct pcpu_freelist_node *node)
+{
+   struct pcpu_freelist_head *head = this_cpu_ptr(s->freelist);
+
+   __pcpu_freelist_push(head, node);
+}
+
+void pcpu_freelist_populate(struct pcpu_freelist *s, void *buf, u32 elem_size,
+   u32 nr_elems)
+{
+   struct pcpu_freelist_head *head;
+   unsigned long flags;
+   int i, cpu, pcpu_entries;
+
+   pcpu_entries = nr_elems / num_possible_cpus() + 1;
+   i = 0;
+
+   /* disable irq to workaround lockdep false positive
+* in bpf usage pcpu_freelist_populate() will never race
+* with pcpu_freelist_push()
+*/
+   local_irq_save(flags);
+   for_each_possible_cpu(cpu) {
+again:
+   head = per_cpu_ptr(s->freelist, cpu);
+   __pcpu_freelist_push(head, buf);
+   i++;
+   buf += elem_size;
+   if (i == nr_elems)
+   break;
+   if (i % pcpu_entries)
+   goto again;
+   }
+   local_irq_restore(flags);
+}
+
+struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *s)
+{
+   struct pcpu_freelist_head *head;
+   struct pcpu_freelist_node *node;
+   int orig_cpu, cpu;
+
+   orig_cpu = cpu = raw_smp_processor_id();
+   while (1) {
+   head = per_cpu_ptr(s->freelist, cpu);
+   raw_spin_lock(&head->lock);
+   node = head->first;
+   if (node) {
+   head->first = node->next;
+   raw_spin_unlock(&head->lock);
+   return node;
+   }
+   raw_spin_unlock(&head->lock);
+   cpu = cpumask_next(cpu, cpu_possible_mask);
+   if (cpu >= nr_cpu_ids)
+   cpu = 0;
+   if (cpu == orig_cpu)
+   return NULL;
+   }
+}
diff --git a/kernel/bpf/percpu_freelist.h b/kernel/bpf/percpu_freelist.h
new file mode 100644
index 000000000000..3049aae8ea1e
--- /dev/null
+++ b/kernel/bpf/percpu_freelist.h
@@ -0,0 +1,31 @@
+/* Copyright (c) 2016 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU 

[PATCH v2 net-next 08/12] samples/bpf: add map_flags to bpf loader

2016-03-07 Thread Alexei Starovoitov
Note that the old loader is compatible with the new kernel;
map_flags is optional.

Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/bpf_helpers.h   |  1 +
 samples/bpf/bpf_load.c  |  3 ++-
 samples/bpf/fds_example.c   |  2 +-
 samples/bpf/libbpf.c|  5 +++--
 samples/bpf/libbpf.h|  2 +-
 samples/bpf/sock_example.c  |  2 +-
 samples/bpf/test_maps.c | 19 ---
 samples/bpf/test_verifier.c |  4 ++--
 8 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 811bcca0f29d..9363500131a7 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -61,6 +61,7 @@ struct bpf_map_def {
unsigned int key_size;
unsigned int value_size;
unsigned int max_entries;
+   unsigned int map_flags;
 };
 
 static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from, int len, int flags) =
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index d16864293c00..58f86bd11b3d 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -157,7 +157,8 @@ static int load_maps(struct bpf_map_def *maps, int len)
map_fd[i] = bpf_create_map(maps[i].type,
   maps[i].key_size,
   maps[i].value_size,
-  maps[i].max_entries);
+  maps[i].max_entries,
+  maps[i].map_flags);
if (map_fd[i] < 0) {
printf("failed to create a map: %d %s\n",
   errno, strerror(errno));
diff --git a/samples/bpf/fds_example.c b/samples/bpf/fds_example.c
index e2fd16c3d0f0..625e797be6ef 100644
--- a/samples/bpf/fds_example.c
+++ b/samples/bpf/fds_example.c
@@ -44,7 +44,7 @@ static void usage(void)
 static int bpf_map_create(void)
 {
return bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(uint32_t),
- sizeof(uint32_t), 1024);
+ sizeof(uint32_t), 1024, 0);
 }
 
 static int bpf_prog_create(const char *object)
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 65a8d48d2799..9969e35550c3 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -19,13 +19,14 @@ static __u64 ptr_to_u64(void *ptr)
 }
 
 int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
-  int max_entries)
+  int max_entries, int map_flags)
 {
union bpf_attr attr = {
.map_type = map_type,
.key_size = key_size,
.value_size = value_size,
-   .max_entries = max_entries
+   .max_entries = max_entries,
+   .map_flags = map_flags,
};
 
	return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 014aacf916e4..364582b77888 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -5,7 +5,7 @@
 struct bpf_insn;
 
 int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
-  int max_entries);
+  int max_entries, int map_flags);
 int bpf_update_elem(int fd, void *key, void *value, unsigned long long flags);
 int bpf_lookup_elem(int fd, void *key, void *value);
 int bpf_delete_elem(int fd, void *key);
diff --git a/samples/bpf/sock_example.c b/samples/bpf/sock_example.c
index a0ce251c5390..28b60baa9fa8 100644
--- a/samples/bpf/sock_example.c
+++ b/samples/bpf/sock_example.c
@@ -34,7 +34,7 @@ static int test_sock(void)
long long value = 0, tcp_cnt, udp_cnt, icmp_cnt;
 
map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value),
-   256);
+   256, 0);
if (map_fd < 0) {
printf("failed to create map '%s'\n", strerror(errno));
goto cleanup;
diff --git a/samples/bpf/test_maps.c b/samples/bpf/test_maps.c
index ad466ed33093..7bd9edd02d9b 100644
--- a/samples/bpf/test_maps.c
+++ b/samples/bpf/test_maps.c
@@ -2,6 +2,7 @@
  * Testsuite for eBPF maps
  *
  * Copyright (c) 2014 PLUMgrid, http://plumgrid.com
+ * Copyright (c) 2016 Facebook
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of version 2 of the GNU General Public
@@ -17,13 +18,16 @@
 #include 
 #include "libbpf.h"
 
+static int map_flags;
+
 /* sanity tests for map API */
 static void test_hashmap_sanity(int i, void *data)
 {
long long key, next_key, value;
int map_fd;
 
-   map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 2);
+   map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value),
+   2, map_flags);
if (map_fd < 0) {
printf("failed to create hashmap '%s'\n", strerror(errno));
  

[PATCH v2 net-next 06/12] samples/bpf: make map creation more verbose

2016-03-07 Thread Alexei Starovoitov
map creation is typically the first thing to fail when rlimits are
too low, there is not enough memory, etc.
Make this failure scenario more verbose.

Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/bpf_load.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index da86a8e0a95a..816bca5760a0 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -158,8 +158,11 @@ static int load_maps(struct bpf_map_def *maps, int len)
   maps[i].key_size,
   maps[i].value_size,
   maps[i].max_entries);
-   if (map_fd[i] < 0)
+   if (map_fd[i] < 0) {
+   printf("failed to create a map: %d %s\n",
+  errno, strerror(errno));
return 1;
+   }
 
if (maps[i].type == BPF_MAP_TYPE_PROG_ARRAY)
prog_array_fd = map_fd[i];
-- 
2.8.0.rc1



[PATCH v2 net-next 07/12] samples/bpf: move ksym_search() into library

2016-03-07 Thread Alexei Starovoitov
move ksym search from offwaketime into library to be reused
in other tests

Signed-off-by: Alexei Starovoitov 
---
 samples/bpf/bpf_load.c | 62 ++
 samples/bpf/bpf_load.h |  6 
 samples/bpf/offwaketime_user.c | 67 +-
 3 files changed, 69 insertions(+), 66 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 816bca5760a0..d16864293c00 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -346,3 +346,65 @@ void read_trace_pipe(void)
}
}
 }
+
#define MAX_SYMS 300000
+static struct ksym syms[MAX_SYMS];
+static int sym_cnt;
+
+static int ksym_cmp(const void *p1, const void *p2)
+{
+   return ((struct ksym *)p1)->addr - ((struct ksym *)p2)->addr;
+}
+
+int load_kallsyms(void)
+{
+   FILE *f = fopen("/proc/kallsyms", "r");
+   char func[256], buf[256];
+   char symbol;
+   void *addr;
+   int i = 0;
+
+   if (!f)
+   return -ENOENT;
+
+   while (!feof(f)) {
+   if (!fgets(buf, sizeof(buf), f))
+   break;
+   if (sscanf(buf, "%p %c %s", &addr, &symbol, func) != 3)
+   break;
+   if (!addr)
+   continue;
+   syms[i].addr = (long) addr;
+   syms[i].name = strdup(func);
+   i++;
+   }
+   sym_cnt = i;
+   qsort(syms, sym_cnt, sizeof(struct ksym), ksym_cmp);
+   return 0;
+}
+
+struct ksym *ksym_search(long key)
+{
+   int start = 0, end = sym_cnt;
+   int result;
+
+   while (start < end) {
+   size_t mid = start + (end - start) / 2;
+
+   result = key - syms[mid].addr;
+   if (result < 0)
+   end = mid;
+   else if (result > 0)
+   start = mid + 1;
+   else
+   return &syms[mid];
+   }
+
+   if (start >= 1 && syms[start - 1].addr < key &&
+   key < syms[start].addr)
+   /* valid ksym */
+   return &syms[start - 1];
+
+   /* out of range. return _stext */
+   return &syms[0];
+}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index cbd7c2b532b9..dfa57fe65c8e 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -23,5 +23,11 @@ extern int event_fd[MAX_PROGS];
 int load_bpf_file(char *path);
 
 void read_trace_pipe(void);
+struct ksym {
+   long addr;
+   char *name;
+};
 
+int load_kallsyms(void);
+struct ksym *ksym_search(long key);
 #endif
diff --git a/samples/bpf/offwaketime_user.c b/samples/bpf/offwaketime_user.c
index 17cf3024e22c..6f002a9c24fa 100644
--- a/samples/bpf/offwaketime_user.c
+++ b/samples/bpf/offwaketime_user.c
@@ -18,80 +18,15 @@
 #include "libbpf.h"
 #include "bpf_load.h"
 
-#define MAX_SYMS 300000
 #define PRINT_RAW_ADDR 0
 
-static struct ksym {
-   long addr;
-   char *name;
-} syms[MAX_SYMS];
-static int sym_cnt;
-
-static int ksym_cmp(const void *p1, const void *p2)
-{
-   return ((struct ksym *)p1)->addr - ((struct ksym *)p2)->addr;
-}
-
-static int load_kallsyms(void)
-{
-   FILE *f = fopen("/proc/kallsyms", "r");
-   char func[256], buf[256];
-   char symbol;
-   void *addr;
-   int i = 0;
-
-   if (!f)
-   return -ENOENT;
-
-   while (!feof(f)) {
-   if (!fgets(buf, sizeof(buf), f))
-   break;
-   if (sscanf(buf, "%p %c %s", &addr, &symbol, func) != 3)
-   break;
-   if (!addr)
-   continue;
-   syms[i].addr = (long) addr;
-   syms[i].name = strdup(func);
-   i++;
-   }
-   sym_cnt = i;
-   qsort(syms, sym_cnt, sizeof(struct ksym), ksym_cmp);
-   return 0;
-}
-
-static void *search(long key)
-{
-   int start = 0, end = sym_cnt;
-   int result;
-
-   while (start < end) {
-   size_t mid = start + (end - start) / 2;
-
-   result = key - syms[mid].addr;
-   if (result < 0)
-   end = mid;
-   else if (result > 0)
-   start = mid + 1;
-   else
-   return &syms[mid];
-   }
-
-   if (start >= 1 && syms[start - 1].addr < key &&
-   key < syms[start].addr)
-   /* valid ksym */
-   return &syms[start - 1];
-
-   /* out of range. return _stext */
-   return &syms[0];
-}
-
 static void print_ksym(__u64 addr)
 {
struct ksym *sym;
 
if (!addr)
return;
-   sym = search(addr);
+   sym = ksym_search(addr);
if (PRINT_RAW_ADDR)
printf("%s/%llx;", sym->name, addr);
else
-- 
2.8.0.rc1



[PATCH v2 net-next 0/12] bpf: map pre-alloc

2016-03-07 Thread Alexei Starovoitov
v1->v2:
. fix few issues spotted by Daniel
. converted stackmap into pre-allocation as well
. added a workaround for lockdep false positive
. added pcpu_freelist_populate to be used by hashmap and stackmap

this patch set switches the bpf hash map to use pre-allocation by default
and introduces a BPF_F_NO_PREALLOC flag to keep the old behavior for cases
where full map pre-allocation is too memory expensive.

Some time back Daniel Wagner reported crashes when a bpf hash map is
used to compute time intervals between preempt_disable->preempt_enable,
and recently Tom Zanussi reported a deadlock in the iovisor/bcc/funccount
tool if it's used to count the number of invocations of kernel
'*spin*' functions. Both problems are due to the recursive use of
slub and can only be solved by pre-allocating all map elements.

A lot of different solutions were considered. Many implemented,
but at the end pre-allocation seems to be the only feasible answer.
As far as pre-allocation goes it also was implemented 4 different ways:
- simple free-list with single lock
- percpu_ida with optimizations
- blk-mq-tag variant customized for bpf use case
- percpu_freelist
For bpf style of alloc/free patterns percpu_freelist is the best
and implemented in this patch set.
Detailed performance numbers in patch 3.
Patch 2 introduces percpu_freelist
Patch 1 fixes simple deadlocks due to missing recursion checks
Patch 5: converts stackmap to pre-allocation
Patches 6-9: prepare test infra
Patch 10: stress test for hash map infra. It attaches to spin_lock
functions and bpf_map_update/delete are called from different contexts
Patch 11: stress for bpf_get_stackid
Patch 12: map performance test

Reported-by: Daniel Wagner 
Reported-by: Tom Zanussi 

Alexei Starovoitov (12):
  bpf: prevent kprobe+bpf deadlocks
  bpf: introduce percpu_freelist
  bpf: pre-allocate hash map elements
  bpf: check for reserved flag bits in array and stack maps
  bpf: convert stackmap to pre-allocation
  samples/bpf: make map creation more verbose
  samples/bpf: move ksym_search() into library
  samples/bpf: add map_flags to bpf loader
  samples/bpf: test both pre-alloc and normal maps
  samples/bpf: add bpf map stress test
  samples/bpf: stress test bpf_get_stackid
  samples/bpf: add map performance test

 include/linux/bpf.h  |   6 +
 include/uapi/linux/bpf.h |   3 +
 kernel/bpf/Makefile  |   2 +-
 kernel/bpf/arraymap.c|   2 +-
 kernel/bpf/hashtab.c | 240 +++
 kernel/bpf/percpu_freelist.c | 100 
 kernel/bpf/percpu_freelist.h |  31 +
 kernel/bpf/stackmap.c|  89 ---
 kernel/bpf/syscall.c |  30 -
 kernel/trace/bpf_trace.c |   2 -
 samples/bpf/Makefile |   8 ++
 samples/bpf/bpf_helpers.h|   1 +
 samples/bpf/bpf_load.c   |  70 +++-
 samples/bpf/bpf_load.h   |   6 +
 samples/bpf/fds_example.c|   2 +-
 samples/bpf/libbpf.c |   5 +-
 samples/bpf/libbpf.h |   2 +-
 samples/bpf/map_perf_test_kern.c | 100 
 samples/bpf/map_perf_test_user.c | 155 +
 samples/bpf/offwaketime_user.c   |  67 +--
 samples/bpf/sock_example.c   |   2 +-
 samples/bpf/spintest_kern.c  |  68 +++
 samples/bpf/spintest_user.c  |  50 
 samples/bpf/test_maps.c  |  29 +++--
 samples/bpf/test_verifier.c  |   4 +-
 25 files changed, 895 insertions(+), 179 deletions(-)
 create mode 100644 kernel/bpf/percpu_freelist.c
 create mode 100644 kernel/bpf/percpu_freelist.h
 create mode 100644 samples/bpf/map_perf_test_kern.c
 create mode 100644 samples/bpf/map_perf_test_user.c
 create mode 100644 samples/bpf/spintest_kern.c
 create mode 100644 samples/bpf/spintest_user.c

-- 
2.8.0.rc1



Re: linux-next: manual merge of the crypto tree with the net-next tree

2016-03-07 Thread Stephen Rothwell
Hi David,

On Mon, 07 Mar 2016 11:08:25 + David Howells  wrote:
>
> Stephen Rothwell  wrote:
> 
> > Today's linux-next merge of the crypto tree got a conflict in:
> > 
> >   net/rxrpc/rxkad.c
> > 
> > between commit:
> > 
> >   0d12f8a4027d ("rxrpc: Keep the skb private record of the Rx header in 
> > host byte order")
> > 
> > from the net-next tree and commit:
> > 
> >   1afe593b4239 ("rxrpc: Use skcipher")
> > 
> > from the crypto tree.  
> 
> What's the best way to deal with this?  Should I take Herbert's
> 
>   [PATCH 18/26] rxrpc: Use skcipher
> 
> patch into my rxrpc tree also and pass it on to Dave?

Linus can deal with it when he merges the latter of the crypto or
net-next trees.  It might be worth a mention in the respective pull
requests.

-- 
Cheers,
Stephen Rothwell


[PATCH] include/net/inet_connection_sock.h: Use pr_devel() instead of pr_debug()

2016-03-07 Thread Nick Wang
When DYNAMIC_DEBUG is enabled, pr_debug() depends on KBUILD_MODNAME, which
in turn depends on the module's name in the Makefile.
Refer to the information in "scripts/Makefile.lib":

 # $(modname_flags) #defines KBUILD_MODNAME as the name of the module it will
 # end up in (or would, if it gets compiled in)
 # Note: Files that end up in two or more modules are compiled without the
 #   KBUILD_MODNAME definition. The reason is that any made-up name would
 #   differ in different configs.

File "inet_connection_sock.h" is a common share header that not can 
be use for one module, so use pr_devel instead of pr_debug is OK.

In file included from include/linux/printk.h:277:0,
 from include/linux/kernel.h:13,
 from include/linux/list.h:8,
 from include/linux/kobject.h:20,
 from include/linux/device.h:17,
 from include/rdma/ib_verbs.h:43,
 from 
/usr/src/drbd-9.0/drbd/drbd-kernel-compat/tests/have_ib_cq_init_attr.c:1:
include/net/inet_connection_sock.h: In function ‘inet_csk_clear_xmit_timer’:
include/linux/dynamic_debug.h:66:14: error: ‘KBUILD_MODNAME’ undeclared (first use in this function)
   .modname = KBUILD_MODNAME,   \
  ^
include/linux/dynamic_debug.h:76:2: note: in expansion of macro ‘DEFINE_DYNAMIC_DEBUG_METADATA’
  DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt);  \
  ^
include/linux/printk.h:283:2: note: in expansion of macro ‘dynamic_pr_debug’
  dynamic_pr_debug(fmt, ##__VA_ARGS__)
  ^
include/net/inet_connection_sock.h:213:3: note: in expansion of macro ‘pr_debug’
   pr_debug("%s", inet_csk_timer_bug_msg);
   ^

Signed-off-by: Nick Wang 
---
 include/net/inet_connection_sock.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 49dcad4..b59be52 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -210,7 +210,7 @@ static inline void inet_csk_clear_xmit_timer(struct sock *sk, const int what)
}
 #ifdef INET_CSK_DEBUG
else {
-   pr_debug("%s", inet_csk_timer_bug_msg);
+   pr_devel("%s", inet_csk_timer_bug_msg);
}
 #endif
 }
@@ -226,7 +226,7 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
 
if (when > max_when) {
 #ifdef INET_CSK_DEBUG
-   pr_debug("reset_xmit_timer: sk=%p %d when=0x%lx, caller=%p\n",
+   pr_devel("reset_xmit_timer: sk=%p %d when=0x%lx, caller=%p\n",
 sk, what, when, current_text_addr());
 #endif
when = max_when;
@@ -244,7 +244,7 @@ static inline void inet_csk_reset_xmit_timer(struct sock *sk, const int what,
}
 #ifdef INET_CSK_DEBUG
else {
-   pr_debug("%s", inet_csk_timer_bug_msg);
+   pr_devel("%s", inet_csk_timer_bug_msg);
}
 #endif
 }
-- 
1.8.5.6



Re: [PATCH next 3/3] net: Use l3_dev instead of skb->dev for L3 processing

2016-03-07 Thread Cong Wang
On Fri, Mar 4, 2016 at 2:12 PM, Mahesh Bandewar  wrote:
>
> Unfortunately we don't have a way to switch to ns after L3 processing.

I am totally aware of this; this is exactly why I said this might not be easy.

The question is how hard it is to implement one. My gut feeling is we only
need to change some data in the skb, something similar to skb_scrub_packet().
But I never even tried.

> So I would
> argue it the other way around. The way it is now breaks the _isolation_
> model. If the default-ns is responsible for the whole of L3 (in this
> situation) it does pretty well on egress, but there is no way to do that
> in the ingress path. IPtables is not the only thing; how about routing,
> how about IPsec? None of this will function well. So we need to have a
> generic solution to solve all these problems.

I don't understand why you question me on this; it is you who only cares
about iptables, judging from your cover letter for this patchset, not me.

The more subsystems are involved, the more struct net pointers you
potentially need to touch, and the less likely you can make it correct
by just switching skb->dev.


[PATCH net-next 4/4] cxgb4vf: Set number of queues in pci probe only

2016-03-07 Thread Hariprasad Shenai
Signed-off-by: Hariprasad Shenai 
---
 .../net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c|8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index 5a3b883..1cc8a7a 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -790,10 +790,6 @@ static int cxgb4vf_open(struct net_device *dev)
/*
 * Note that this interface is up and start everything up ...
 */
-   netif_set_real_num_tx_queues(dev, pi->nqsets);
-   err = netif_set_real_num_rx_queues(dev, pi->nqsets);
-   if (err)
-   goto err_unwind;
err = link_start(dev);
if (err)
goto err_unwind;
@@ -2831,10 +2827,14 @@ static int cxgb4vf_pci_probe(struct pci_dev *pdev,
 * must register at least one net device.
 */
for_each_port(adapter, pidx) {
+   struct port_info *pi = netdev_priv(adapter->port[pidx]);
netdev = adapter->port[pidx];
if (netdev == NULL)
continue;
 
+   netif_set_real_num_tx_queues(netdev, pi->nqsets);
+   netif_set_real_num_rx_queues(netdev, pi->nqsets);
+
err = register_netdev(netdev);
if (err) {
dev_warn(&pdev->dev, "cannot register net device %s,"
-- 
1.7.3



[PATCH net-next 0/4] cxgb4vf: Interrupt and queue configuration changes

2016-03-07 Thread Hariprasad Shenai
Hi,

This series fixes some issues and makes some changes in the queue and
interrupt configuration of the cxgb4vf driver. We need to enable interrupts
before we register our network devices, so that we don't lose link-up
interrupts. Allocate rx queues based on the interrupt type, and set the
number of tx/rx queues in the probe function only. Also add checks for
some invalid provisioning configurations.

This patch series has been created against net-next tree and includes
patches on cxgb4vf driver.

We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.

Thanks

Hariprasad Shenai (4):
  cxgb4vf: Enable interrupts before we register our network devices
  cxgb4vf: Configure queue based on resource and interrupt type
  cxgb4vf: Add a couple more checks for invalid provisioning
configurations
  cxgb4vf : Set number of queues in pci probe only

 .../net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c|  221 +++-
 1 files changed, 125 insertions(+), 96 deletions(-)

-- 
1.7.3



[PATCH net-next 2/4] cxgb4vf: Configure queue based on resource and interrupt type

2016-03-07 Thread Hariprasad Shenai
The Queue Set configuration code was always reserving room for a
Forwarded Interrupt Queue even in the cases where we weren't using one.
Move the logic that figures out how many Ports and Queue Sets we can
support into a helper; it depends on knowing our Virtual Function
resources and may be called a second time if we fall back from MSI-X to
MSI interrupt mode. This change fixes that problem.

Signed-off-by: Hariprasad Shenai 
---
 .../net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c|  165 +++-
 1 files changed, 94 insertions(+), 71 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c 
b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index fcafe34..17a3153 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -2176,6 +2176,73 @@ static void cleanup_debugfs(struct adapter *adapter)
/* nothing to do */
 }
 
+/* Figure out how many Ports and Queue Sets we can support.  This depends on
+ * knowing our Virtual Function Resources and may be called a second time if
+ * we fall back from MSI-X to MSI Interrupt Mode.
+ */
+static void size_nports_qsets(struct adapter *adapter)
+{
+   struct vf_resources *vfres = &adapter->params.vfres;
+   unsigned int ethqsets, pmask_nports;
+
+   /* The number of "ports" which we support is equal to the number of
+* Virtual Interfaces with which we've been provisioned.
+*/
+   adapter->params.nports = vfres->nvi;
+   if (adapter->params.nports > MAX_NPORTS) {
+   dev_warn(adapter->pdev_dev, "only using %d of %d maximum"
+" allowed virtual interfaces\n", MAX_NPORTS,
+adapter->params.nports);
+   adapter->params.nports = MAX_NPORTS;
+   }
+
+   /* We may have been provisioned with more VIs than the number of
+* ports we're allowed to access (our Port Access Rights Mask).
+* This is obviously a configuration conflict but we don't want to
+* crash the kernel or anything silly just because of that.
+*/
+   pmask_nports = hweight32(adapter->params.vfres.pmask);
+   if (pmask_nports < adapter->params.nports) {
+   dev_warn(adapter->pdev_dev, "only using %d of %d provisioned"
+" virtual interfaces; limited by Port Access Rights"
+" mask %#x\n", pmask_nports, adapter->params.nports,
+adapter->params.vfres.pmask);
+   adapter->params.nports = pmask_nports;
+   }
+
+   /* We need to reserve an Ingress Queue for the Asynchronous Firmware
+* Event Queue.  And if we're using MSI Interrupts, we'll also need to
+* reserve an Ingress Queue for a Forwarded Interrupts.
+*
+* The rest of the FL/Intr-capable ingress queues will be matched up
+* one-for-one with Ethernet/Control egress queues in order to form
+* "Queue Sets" which will be aportioned between the "ports".  For
+* each Queue Set, we'll need the ability to allocate two Egress
+* Contexts -- one for the Ingress Queue Free List and one for the TX
+* Ethernet Queue.
+*
+* Note that even if we're currently configured to use MSI-X
+* Interrupts (module variable msi == MSI_MSIX) we may get downgraded
+* to MSI Interrupts if we can't get enough MSI-X Interrupts.  If that
+* happens we'll need to adjust things later.
+*/
+   ethqsets = vfres->niqflint - 1 - (msi == MSI_MSI);
+   if (vfres->nethctrl != ethqsets)
+   ethqsets = min(vfres->nethctrl, ethqsets);
+   if (vfres->neq < ethqsets*2)
+   ethqsets = vfres->neq/2;
+   if (ethqsets > MAX_ETH_QSETS)
+   ethqsets = MAX_ETH_QSETS;
+   adapter->sge.max_ethqsets = ethqsets;
+
+   if (adapter->sge.max_ethqsets < adapter->params.nports) {
+   dev_warn(adapter->pdev_dev, "only using %d of %d available"
+" virtual interfaces (too few Queue Sets)\n",
+adapter->sge.max_ethqsets, adapter->params.nports);
+   adapter->params.nports = adapter->sge.max_ethqsets;
+   }
+}
+
 /*
  * Perform early "adapter" initialization.  This is where we discover what
  * adapter parameters we're going to be using and initialize basic adapter
@@ -2183,10 +2250,8 @@ static void cleanup_debugfs(struct adapter *adapter)
  */
 static int adap_init0(struct adapter *adapter)
 {
-   struct vf_resources *vfres = &adapter->params.vfres;
struct sge_params *sge_params = &adapter->params.sge;
struct sge *s = &adapter->sge;
-   unsigned int ethqsets;
int err;
u32 param, val = 0;
 
@@ -2295,69 +2360,18 @@ static int adap_init0(struct adapter *adapter)
return err;
}
 
-   /*
-* The number of "ports" which we support is equal to the number of
-* Virtual Interfaces with 

[PATCH net-next 3/4] cxgb4vf: Add a couple more checks for invalid provisioning configurations

2016-03-07 Thread Hariprasad Shenai
Signed-off-by: Hariprasad Shenai 
---
 .../net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c|5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c 
b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index 17a3153..5a3b883 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -2361,6 +2361,11 @@ static int adap_init0(struct adapter *adapter)
}
 
/* Check for various parameter sanity issues */
+   if (adapter->params.vfres.pmask == 0) {
+   dev_err(adapter->pdev_dev, "no port access configured/"
+   "usable!\n");
+   return -EINVAL;
+   }
if (adapter->params.vfres.nvi == 0) {
dev_err(adapter->pdev_dev, "no virtual interfaces configured/"
"usable!\n");
-- 
1.7.3



[PATCH net-next 1/4] cxgb4vf: Enable interrupts before we register our network devices

2016-03-07 Thread Hariprasad Shenai
This avoids a race condition on systems where network devices are set up
to be configured automatically: we could get the first Port Link Status
message from the firmware on the Asynchronous Firmware Event Queue before
we had enabled interrupts. If that happened, we would lose the interrupt
and never realize that the link had actually come up.

Signed-off-by: Hariprasad Shenai 
---
 .../net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c|   51 ++--
 1 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c 
b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index 91857b8..fcafe34 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -2771,6 +2771,24 @@ static int cxgb4vf_pci_probe(struct pci_dev *pdev,
}
}
 
+   /* See what interrupts we'll be using.  If we've been configured to
+* use MSI-X interrupts, try to enable them but fall back to using
+* MSI interrupts if we can't enable MSI-X interrupts.  If we can't
+* get MSI interrupts we bail with the error.
+*/
+   if (msi == MSI_MSIX && enable_msix(adapter) == 0)
+   adapter->flags |= USING_MSIX;
+   else {
+   err = pci_enable_msi(pdev);
+   if (err) {
+   dev_err(&pdev->dev, "Unable to allocate %s interrupts;"
+   " err=%d\n",
+   msi == MSI_MSIX ? "MSI-X or MSI" : "MSI", err);
+   goto err_free_dev;
+   }
+   adapter->flags |= USING_MSI;
+   }
+
/*
 * The "card" is now ready to go.  If any errors occur during device
 * registration we do not fail the whole "card" but rather proceed
@@ -2793,7 +2811,7 @@ static int cxgb4vf_pci_probe(struct pci_dev *pdev,
}
if (adapter->registered_device_map == 0) {
dev_err(&pdev->dev, "could not register any net devices\n");
-   goto err_free_dev;
+   goto err_disable_interrupts;
}
 
/*
@@ -2811,25 +2829,6 @@ static int cxgb4vf_pci_probe(struct pci_dev *pdev,
}
 
/*
-* See what interrupts we'll be using.  If we've been configured to
-* use MSI-X interrupts, try to enable them but fall back to using
-* MSI interrupts if we can't enable MSI-X interrupts.  If we can't
-* get MSI interrupts we bail with the error.
-*/
-   if (msi == MSI_MSIX && enable_msix(adapter) == 0)
-   adapter->flags |= USING_MSIX;
-   else {
-   err = pci_enable_msi(pdev);
-   if (err) {
-   dev_err(&pdev->dev, "Unable to allocate %s interrupts;"
-   " err=%d\n",
-   msi == MSI_MSIX ? "MSI-X or MSI" : "MSI", err);
-   goto err_free_debugfs;
-   }
-   adapter->flags |= USING_MSI;
-   }
-
-   /*
 * Now that we know how many "ports" we have and what their types are,
 * and how many Queue Sets we can support, we can configure our queue
 * resources.
@@ -2856,11 +2855,13 @@ static int cxgb4vf_pci_probe(struct pci_dev *pdev,
 * Error recovery and exit code.  Unwind state that's been created
 * so far and return the error.
 */
-
-err_free_debugfs:
-   if (!IS_ERR_OR_NULL(adapter->debugfs_root)) {
-   cleanup_debugfs(adapter);
-   debugfs_remove_recursive(adapter->debugfs_root);
+err_disable_interrupts:
+   if (adapter->flags & USING_MSIX) {
+   pci_disable_msix(adapter->pdev);
+   adapter->flags &= ~USING_MSIX;
+   } else if (adapter->flags & USING_MSI) {
+   pci_disable_msi(adapter->pdev);
+   adapter->flags &= ~USING_MSI;
}
 
 err_free_dev:
-- 
1.7.3



Re: 4.1.12 kernel crash in rtnetlink_put_metrics

2016-03-07 Thread subashab
> Hmm, if it was 4.1.X like in original reporter case, I might have thought
> something like commit 0a1f59620068 ("ipv6: Initialize rt6_info properly
> in ip6_blackhole_route()") ... any chance on reproducing this on a latest
> kernel?
>

Unfortunately, I haven't encountered a similar crash on newer kernels so far.



Re: [PATCH net] sctp: fix copying more bytes than expected in sctp_add_bind_addr

2016-03-07 Thread David Miller
From: Marcelo Ricardo Leitner 
Date: Mon, 7 Mar 2016 20:27:11 -0300

> Hi,
> 
> On 07-03-2016 20:17, kbuild test robot wrote:
>> Hi Marcelo,
>>
>> [auto build test WARNING on net/master]
>>
>> url:
>> https://github.com/0day-ci/linux/commits/Marcelo-Ricardo-Leitner/sctp-fix-copying-more-bytes-than-expected-in-sctp_add_bind_addr/20160308-052009
>>
>>
>> coccinelle warnings: (new ones prefixed by >>)
>>
 net/sctp/bind_addr.c:458:42-48: ERROR: application of sizeof to
 pointer
>>
>> Please review and possibly fold the followup patch.
> 
> Oops, nice catch, thanks.
> 
> I can fold it if Dave prefers, no problem. I'll wait for a
> confirmation.

Please respin your patch with this folded into it, thanks.


Re: [PATCH 01/11] rxrpc: Add a common object cache

2016-03-07 Thread David Miller
From: David Howells 
Date: Mon, 07 Mar 2016 22:45:14 +

> David Miller  wrote:
> 
>> I know you put a lot of time and effort into this, but I want to strongly
>> recommend against a garbage collected hash table for anything whatsoever.
>> 
>> Especially if the given objects are in some way created/destroyed/etc. by
>> operations triggerable remotely.
>> 
>> This can be DoS'd quite trivially, and that's why we have removed the ipv4
>> routing cache which did the same.
> 
> Hmmm...  You have a point.  What would you suggest instead?  At least with the
> common object cache code I have, I might be able to just change that.

Objects that are used for correct operation have no easily recyclable
property, you must hold onto them.  There has to be a set of resources
held and consumed at both endpoints for it to work properly ("I can't
DoS you without DoS'ing myself").

Where reclaimable tables work is for stuff that is near zero cost to
reconstitute.  A good example is the TCP metrics table.  When a TCP
metrics entry is reclaimed, it's not like we have to renegotiate a
security context when we try to talk to that end-host again.

If the concept of these open-ended objects is a fundamental aspect of
the protocol that's a serious shortcoming of RXRPC.
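
To make the distinction concrete, a minimal sketch (the types here are
purely illustrative, not from any tree):

/* Reclaimable: a pure hint, near zero cost to reconstitute.  Dropping
 * it under memory pressure just costs one fresh RTT measurement.
 */
struct metrics_entry {
	u32 srtt_hint;
};

/* Not reclaimable: state both endpoints must pin for the conversation
 * to work at all.  Rebuilding it means renegotiating a security
 * context, so it has to be refcounted and accounted, not garbage
 * collected behind the protocol's back.
 */
struct conn_state {
	atomic_t refcnt;
	void *security_ctx;
};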


[PATCH net 0/4] net: hns: bugs fix about hns driver

2016-03-07 Thread Lisheng
From: Qianqian Xie 

There are some incorrect usages of values in the hns driver, such as a
wrong cycle index, ignored return values, a missing format string and a
useless assignment. This patch set fixes them.


Qianqian Xie (4):
  net: hns: fix a bug for cycle index
  net: hns: bug fix for format string
  net: hns: bug fix for return values
  net: hns: remove useless head=ring->next_to_clean

 drivers/net/ethernet/hisilicon/hns/hns_dsaf_gmac.c |  3 ++-
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c | 12 ++--
 drivers/net/ethernet/hisilicon/hns/hns_enet.c  |  4 +---
 drivers/net/ethernet/hisilicon/hns/hns_ethtool.c   |  4 ++--
 4 files changed, 11 insertions(+), 12 deletions(-)

-- 
1.9.1



[PATCH net 1/4] net: hns: fix a bug for cycle index

2016-03-07 Thread Lisheng
From: Qianqian Xie 

The cycle index should vary with each iteration, but the fixed variable j
was used instead of the loop counter i. This patch fixes the bug.

Signed-off-by: Qianqian Xie 
---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
index 7fe1c1c..4583597 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
@@ -2182,17 +2182,17 @@ void hns_dsaf_get_regs(struct dsaf_device *ddev, u32 
port, void *data)
/* dsaf onode registers */
for (i = 0; i < DSAF_XOD_NUM; i++) {
p[311 + i] = dsaf_read_dev(ddev,
-   DSAF_XOD_ETS_TSA_TC0_TC3_CFG_0_REG + j * 0x90);
+   DSAF_XOD_ETS_TSA_TC0_TC3_CFG_0_REG + i * 0x90);
p[319 + i] = dsaf_read_dev(ddev,
-   DSAF_XOD_ETS_TSA_TC4_TC7_CFG_0_REG + j * 0x90);
+   DSAF_XOD_ETS_TSA_TC4_TC7_CFG_0_REG + i * 0x90);
p[327 + i] = dsaf_read_dev(ddev,
-   DSAF_XOD_ETS_BW_TC0_TC3_CFG_0_REG + j * 0x90);
+   DSAF_XOD_ETS_BW_TC0_TC3_CFG_0_REG + i * 0x90);
p[335 + i] = dsaf_read_dev(ddev,
-   DSAF_XOD_ETS_BW_TC4_TC7_CFG_0_REG + j * 0x90);
+   DSAF_XOD_ETS_BW_TC4_TC7_CFG_0_REG + i * 0x90);
p[343 + i] = dsaf_read_dev(ddev,
-   DSAF_XOD_ETS_BW_OFFSET_CFG_0_REG + j * 0x90);
+   DSAF_XOD_ETS_BW_OFFSET_CFG_0_REG + i * 0x90);
p[351 + i] = dsaf_read_dev(ddev,
-   DSAF_XOD_ETS_TOKEN_CFG_0_REG + j * 0x90);
+   DSAF_XOD_ETS_TOKEN_CFG_0_REG + i * 0x90);
}
 
p[359] = dsaf_read_dev(ddev, DSAF_XOD_PFS_CFG_0_0_REG + port * 0x90);
-- 
1.9.1



[PATCH net 2/4] net: hns: bug fix for format string

2016-03-07 Thread Lisheng
From: Qianqian Xie 

A format string is missing from the snprintf() call in
hns_gmac_get_strings(). This patch fixes it.

Signed-off-by: Qianqian Xie 
---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_gmac.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_gmac.c 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_gmac.c
index b8517b0..d0591d9 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_gmac.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_gmac.c
@@ -641,7 +641,8 @@ static void hns_gmac_get_strings(u32 stringset, u8 *data)
return;
 
for (i = 0; i < ARRAY_SIZE(g_gmac_stats_string); i++) {
-   snprintf(buff, ETH_GSTRING_LEN, g_gmac_stats_string[i].desc);
+   snprintf(buff, ETH_GSTRING_LEN, "%s",
+g_gmac_stats_string[i].desc);
buff = buff + ETH_GSTRING_LEN;
}
 }
-- 
1.9.1



[PATCH net 4/4] net: hns: remove useless head=ring->next_to_clean

2016-03-07 Thread Lisheng
From: Qianqian Xie 

The variable head in hns_nic_tx_fini_pro() is assigned
ring->next_to_clean, but that value is never used before being
overwritten. This patch removes the useless assignment.

Signed-off-by: Qianqian Xie 
---
 drivers/net/ethernet/hisilicon/hns/hns_enet.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c 
b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index 3f77ff7..7b4ec2f 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -901,10 +901,8 @@ static int hns_nic_tx_poll_one(struct hns_nic_ring_data 
*ring_data,
 static void hns_nic_tx_fini_pro(struct hns_nic_ring_data *ring_data)
 {
struct hnae_ring *ring = ring_data->ring;
-   int head = ring->next_to_clean;
 
-   /* for hardware bug fixed */
-   head = readl_relaxed(ring->io_base + RCB_REG_HEAD);
+   int head = readl_relaxed(ring->io_base + RCB_REG_HEAD);
 
if (head != ring->next_to_clean) {
ring_data->ring->q->handle->dev->ops->toggle_ring_irq(
-- 
1.9.1



[PATCH net 3/4] net: hns: bug fix for return values

2016-03-07 Thread Lisheng
From: Qianqian Xie 

The return values of the first two mdiobus_write() calls (via
phy_write()) are overwritten and thus ignored. This patch fixes it.

Signed-off-by: Qianqian Xie 
---
 drivers/net/ethernet/hisilicon/hns/hns_ethtool.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c 
b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
index 3df2284..cc5c545 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_ethtool.c
@@ -1000,8 +1000,8 @@ int hns_phy_led_set(struct net_device *netdev, int value)
struct phy_device *phy_dev = priv->phy;
 
retval = phy_write(phy_dev, HNS_PHY_PAGE_REG, HNS_PHY_PAGE_LED);
-   retval = phy_write(phy_dev, HNS_LED_FC_REG, value);
-   retval = phy_write(phy_dev, HNS_PHY_PAGE_REG, HNS_PHY_PAGE_COPPER);
+   retval |= phy_write(phy_dev, HNS_LED_FC_REG, value);
+   retval |= phy_write(phy_dev, HNS_PHY_PAGE_REG, HNS_PHY_PAGE_COPPER);
if (retval) {
netdev_err(netdev, "mdiobus_write fail !\n");
return retval;
-- 
1.9.1



[PATCH net] net: hns: set xge statistic reg as read only

2016-03-07 Thread Lisheng
From: Qianqian Xie 

As the user manual describes, XGE_DFX_CTRL_CFG.xge_dfx_ctrl_cfg should be
configured as zero if we want the XGE statistic registers to be read
only. But the older HiSilicon arm64 chip interprets this field
differently, so we need to identify the chip version and then configure
it correctly.

Signed-off-by: Qianqian Xie 
---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
index 9439f04..7fe1c1c 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_main.c
@@ -714,8 +714,9 @@ static void hns_dsaf_tbl_stat_en(struct dsaf_device 
*dsaf_dev)
  */
 static void hns_dsaf_rocee_bp_en(struct dsaf_device *dsaf_dev)
 {
-   dsaf_set_dev_bit(dsaf_dev, DSAF_XGE_CTRL_SIG_CFG_0_REG,
-DSAF_FC_XGE_TX_PAUSE_S, 1);
+   if (AE_IS_VER1(dsaf_dev->dsaf_ver))
+   dsaf_set_dev_bit(dsaf_dev, DSAF_XGE_CTRL_SIG_CFG_0_REG,
+DSAF_FC_XGE_TX_PAUSE_S, 1);
 }
 
 /* set msk for dsaf exception irq*/
-- 
1.9.1



[ethtool PATCH v4 08/11] kernel-copy.h: import kernel.h from net-next and use it

2016-03-07 Thread David Decotigny
From: David Decotigny 

This is required for recent versions of ethtool.h.

This covers kernel.h up to:

  commit b5d3755a22e0cc4c369c0985aef0c52c2477c1e7
  Author: Nicolas Dichtel 
  Date:   Fri Mar 4 11:52:16 2016 +0100

  uapi: define DIV_ROUND_UP for userland


Signed-off-by: David Decotigny 
---
 ethtool.c |  3 ++-
 internal.h|  4 ++--
 kernel-copy.h | 14 ++
 3 files changed, 18 insertions(+), 3 deletions(-)
 create mode 100644 kernel-copy.h

diff --git a/ethtool.c b/ethtool.c
index d349bee..47f0259 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -227,7 +227,8 @@ struct feature_defs {
struct feature_def def[0];
 };
 
-#define FEATURE_BITS_TO_BLOCKS(n_bits) DIV_ROUND_UP(n_bits, 32U)
+#define FEATURE_BITS_TO_BLOCKS(n_bits) \
+   __KERNEL_DIV_ROUND_UP(n_bits, 32U)
 #define FEATURE_WORD(blocks, index, field) ((blocks)[(index) / 32U].field)
 #define FEATURE_FIELD_FLAG(index)  (1U << (index) % 32U)
 #define FEATURE_BIT_SET(blocks, index, field)  \
diff --git a/internal.h b/internal.h
index e38d305..1c64306 100644
--- a/internal.h
+++ b/internal.h
@@ -35,6 +35,7 @@ typedef uint16_t u16;
 typedef uint8_t u8;
 typedef int32_t s32;
 
+#include "kernel-copy.h"
 #include "ethtool-copy.h"
 #include "net_tstamp-copy.h"
 
@@ -71,8 +72,7 @@ static inline u64 cpu_to_be64(u64 value)
 
 #define BITS_PER_BYTE  8
 #define BITS_PER_LONG  (BITS_PER_BYTE * sizeof(long))
-#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
-#define BITS_TO_LONGS(nr)  DIV_ROUND_UP(nr, BITS_PER_LONG)
+#define BITS_TO_LONGS(nr)  __KERNEL_DIV_ROUND_UP(nr, BITS_PER_LONG)
 
 static inline void set_bit(unsigned int nr, unsigned long *addr)
 {
diff --git a/kernel-copy.h b/kernel-copy.h
new file mode 100644
index 000..527549f
--- /dev/null
+++ b/kernel-copy.h
@@ -0,0 +1,14 @@
+#ifndef _LINUX_KERNEL_H
+#define _LINUX_KERNEL_H
+
+#include <linux/sysinfo.h>
+
+/*
+ * 'kernel.h' contains some often-used function prototypes etc
+ */
+#define __ALIGN_KERNEL(x, a)   __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
+#define __ALIGN_KERNEL_MASK(x, mask)   (((x) + (mask)) & ~(mask))
+
+#define __KERNEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
+
+#endif /* _LINUX_KERNEL_H */
-- 
2.7.0.rc3.207.g0ac5344



[ethtool PATCH v4 10/11] ethtool.c: add support for ETHTOOL_xLINKSETTINGS ioctls

2016-03-07 Thread David Decotigny
From: David Decotigny 

More info with kernel SHA1: 8d3f2806f8fbd9b22 "Merge branch
'ethtool-ksettings'".


Signed-off-by: David Decotigny 
---
 ethtool.c  | 682 +++--
 internal.h |  67 ++
 test-cmdline.c |  12 +
 3 files changed, 602 insertions(+), 159 deletions(-)

diff --git a/ethtool.c b/ethtool.c
index 47f0259..761252f 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -47,42 +47,6 @@
 #define MAX_ADDR_LEN   32
 #endif
 
-#define ALL_ADVERTISED_MODES   \
-   (ADVERTISED_10baseT_Half |  \
-ADVERTISED_10baseT_Full |  \
-ADVERTISED_100baseT_Half | \
-ADVERTISED_100baseT_Full | \
-ADVERTISED_1000baseT_Half |\
-ADVERTISED_1000baseT_Full |\
-ADVERTISED_1000baseKX_Full|\
-ADVERTISED_2500baseX_Full |\
-ADVERTISED_10000baseT_Full |   \
-ADVERTISED_10000baseKX4_Full | \
-ADVERTISED_10000baseKR_Full |  \
-ADVERTISED_10000baseR_FEC |\
-ADVERTISED_20000baseMLD2_Full |\
-ADVERTISED_20000baseKR2_Full | \
-ADVERTISED_40000baseKR4_Full | \
-ADVERTISED_40000baseCR4_Full | \
-ADVERTISED_40000baseSR4_Full | \
-ADVERTISED_40000baseLR4_Full | \
-ADVERTISED_56000baseKR4_Full | \
-ADVERTISED_56000baseCR4_Full | \
-ADVERTISED_56000baseSR4_Full | \
-ADVERTISED_56000baseLR4_Full)
-
-#define ALL_ADVERTISED_FLAGS   \
-   (ADVERTISED_Autoneg |   \
-ADVERTISED_TP |\
-ADVERTISED_AUI |   \
-ADVERTISED_MII |   \
-ADVERTISED_FIBRE | \
-ADVERTISED_BNC |   \
-ADVERTISED_Pause | \
-ADVERTISED_Asym_Pause |\
-ADVERTISED_Backplane | \
-ALL_ADVERTISED_MODES)
-
 #ifndef HAVE_NETIF_MSG
 enum {
NETIF_MSG_DRV   = 0x0001,
@@ -294,6 +258,45 @@ static void get_mac_addr(char *src, unsigned char *dest)
}
 }
 
+static int parse_hex_u32_bitmap(const char *s,
+   unsigned int nbits, u32 *result)
+{
+   const unsigned nwords = __KERNEL_DIV_ROUND_UP(nbits, 32);
+   size_t slen = strlen(s);
+   size_t i;
+
+   /* ignore optional '0x' prefix */
+   if ((slen > 2) && (
+   (0 == memcmp(s, "0x", 2))
+|| (0 == memcmp(s, "0X", 2)))) {
+   slen -= 2;
+   s += 2;
+   }
+
+   if (slen > 8*nwords)  /* up to 2 digits per byte */
+   return -1;
+
+   memset(result, 0, 4*nwords);
+   for (i = 0 ; i < slen ; ++i) {
+   const unsigned shift = (slen - 1 - i)*4;
+   u32 *dest = &result[shift / 32];
+   u32 nibble;
+
+   if ('a' <= s[i] && s[i] <= 'f')
+   nibble = 0xa + (s[i] - 'a');
+   else if ('A' <= s[i] && s[i] <= 'F')
+   nibble = 0xa + (s[i] - 'A');
+   else if ('0' <= s[i] && s[i] <= '9')
+   nibble = (s[i] - '0');
+   else
+   return -1;
+
+   *dest |= (nibble << (shift % 32));
+   }
+
+   return 0;
+}
+
 static void parse_generic_cmdline(struct cmd_context *ctx,
  int *changed,
  struct cmdline_info *info,
@@ -473,64 +476,157 @@ static int do_version(struct cmd_context *ctx)
return 0;
 }
 
-static void dump_link_caps(const char *prefix, const char *an_prefix, u32 mask,
-  int link_mode_only);
+/* link mode routines */
+
+static __ETHTOOL_DECLARE_LINK_MODE_MASK(all_advertised_modes);
+static __ETHTOOL_DECLARE_LINK_MODE_MASK(all_advertised_flags);
 
-static void dump_supported(struct ethtool_cmd *ep)
+static void init_global_link_mode_masks()
 {
-   u32 mask = ep->supported;
+   static const enum ethtool_link_mode_bit_indices
+   all_advertised_modes_bits[] = {
+   ETHTOOL_LINK_MODE_10baseT_Half_BIT,
+   ETHTOOL_LINK_MODE_10baseT_Full_BIT,
+   ETHTOOL_LINK_MODE_100baseT_Half_BIT,
+   ETHTOOL_LINK_MODE_100baseT_Full_BIT,
+   ETHTOOL_LINK_MODE_1000baseT_Half_BIT,
+   ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
+   ETHTOOL_LINK_MODE_1000baseKX_Full_BIT,
+   ETHTOOL_LINK_MODE_2500baseX_Full_BIT,
+   ETHTOOL_LINK_MODE_10000baseT_Full_BIT,
+   ETHTOOL_LINK_MODE_10000baseKX4_Full_BIT,
+   ETHTOOL_LINK_MODE_10000baseKR_Full_BIT,
+   

[ethtool PATCH v4 07/11] test-features.c: add braces around array initialization

2016-03-07 Thread David Decotigny
From: Maciej Żenczykowski 

This fixes:
  test-features.c:21:1: error: missing braces around initializer 
[-Werror=missing-braces]
   cmd_gssetinfo = { { ETHTOOL_GSSET_INFO, 0, 1ULL << ETH_SS_FEATURES }, 34 };
   ^


Signed-off-by: Maciej Żenczykowski 
Signed-off-by: David Decotigny 
---
 test-features.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/test-features.c b/test-features.c
index d7bd994..6ebb364 100644
--- a/test-features.c
+++ b/test-features.c
@@ -18,7 +18,7 @@ static const struct {
struct ethtool_sset_info cmd;
u32 data[1];
 }
-cmd_gssetinfo = { { ETHTOOL_GSSET_INFO, 0, 1ULL << ETH_SS_FEATURES }, 34 };
+cmd_gssetinfo = { { ETHTOOL_GSSET_INFO, 0, 1ULL << ETH_SS_FEATURES }, { 34 } };
 
 static const struct ethtool_value
 cmd_grxcsum_off = { ETHTOOL_GRXCSUM, 0 },
-- 
2.7.0.rc3.207.g0ac5344



[ethtool PATCH v4 06/11] test-common.c: fix test_realloc(NULL, ...)

2016-03-07 Thread David Decotigny
From: Maciej Żenczykowski 

This fixes:
  test-common.c: In function 'test_realloc':
  test-common.c:109:8: error: 'block' may be used uninitialized in this 
function [-Werror=maybe-uninitialized]
block = realloc(block, sizeof(*block) + size);
  ^


Signed-off-by: Maciej Żenczykowski 
Signed-off-by: David Decotigny 
---
 test-common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/test-common.c b/test-common.c
index adc3cd4..cd63d1d 100644
--- a/test-common.c
+++ b/test-common.c
@@ -100,7 +100,7 @@ void test_free(void *ptr)
 
 void *test_realloc(void *ptr, size_t size)
 {
-   struct list_head *block;
+   struct list_head *block = NULL;
 
if (ptr) {
block = (struct list_head *)ptr - 1;
-- 
2.7.0.rc3.207.g0ac5344



[ethtool PATCH v4 09/11] ethtool-copy.h: sync with net-next

2016-03-07 Thread David Decotigny
From: David Decotigny 

This covers changes up to:

  commit 14e2037902d65213842b4e40305ff54a64abbcb6
  Author: Nicolas Dichtel 
  Date:   Fri Mar 4 11:52:19 2016 +0100

  ethtool.h: define INT_MAX for userland


Signed-off-by: David Decotigny 
---
 ethtool-copy.h | 478 -
 1 file changed, 403 insertions(+), 75 deletions(-)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index d23ffc4..7c581ea 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -13,15 +13,19 @@
 #ifndef _LINUX_ETHTOOL_H
 #define _LINUX_ETHTOOL_H
 
+#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/if_ether.h>

+#include <limits.h> /* for INT_MAX */
+
 /* All structures exposed to userland should be defined such that they
  * have the same layout for 32-bit and 64-bit userland.
  */
 
 /**
- * struct ethtool_cmd - link control and status
+ * struct ethtool_cmd - DEPRECATED, link control and status
+ * This structure is DEPRECATED, please use struct ethtool_link_settings.
  * @cmd: Command number = %ETHTOOL_GSET or %ETHTOOL_SSET
  * @supported: Bitmask of %SUPPORTED_* flags for the link modes,
  * physical connectors and other link features for which the
@@ -31,7 +35,7 @@
  * physical connectors and other link features that are
  * advertised through autonegotiation or enabled for
  * auto-detection.
- * @speed: Low bits of the speed
+ * @speed: Low bits of the speed, 1Mb units, 0 to INT_MAX or SPEED_UNKNOWN
  * @duplex: Duplex mode; one of %DUPLEX_*
  * @port: Physical connector type; one of %PORT_*
  * @phy_address: MDIO address of PHY (transceiver); 0 or 255 if not
@@ -47,7 +51,7 @@
  * obsoleted by  ethtool_coalesce.  Read-only; deprecated.
  * @maxrxpkt: Historically used to report RX IRQ coalescing; now
  * obsoleted by  ethtool_coalesce.  Read-only; deprecated.
- * @speed_hi: High bits of the speed
+ * @speed_hi: High bits of the speed, 1Mb units, 0 to INT_MAX or SPEED_UNKNOWN
  * @eth_tp_mdix: Ethernet twisted-pair MDI(-X) status; one of
  * %ETH_TP_MDI_*.  If the status is unknown or not applicable, the
  * value will be %ETH_TP_MDI_INVALID.  Read-only.
@@ -215,6 +219,11 @@ enum tunable_id {
ETHTOOL_ID_UNSPEC,
ETHTOOL_RX_COPYBREAK,
ETHTOOL_TX_COPYBREAK,
+   /*
+* Add your fresh new tunable attribute above and remember to update
+* tunable_strings[] in net/core/ethtool.c
+*/
+   __ETHTOOL_TUNABLE_COUNT,
 };
 
 enum tunable_type_id {
@@ -537,6 +546,7 @@ struct ethtool_pauseparam {
  * now deprecated
  * @ETH_SS_FEATURES: Device feature names
* @ETH_SS_RSS_HASH_FUNCS: RSS hash function names
+ * @ETH_SS_PHY_STATS: Statistic names, for use with %ETHTOOL_GPHYSTATS
  */
 enum ethtool_stringset {
ETH_SS_TEST = 0,
@@ -545,6 +555,8 @@ enum ethtool_stringset {
ETH_SS_NTUPLE_FILTERS,
ETH_SS_FEATURES,
ETH_SS_RSS_HASH_FUNCS,
+   ETH_SS_TUNABLES,
+   ETH_SS_PHY_STATS,
 };
 
 /**
@@ -740,6 +752,56 @@ struct ethtool_usrip4_spec {
__u8proto;
 };
 
+/**
+ * struct ethtool_tcpip6_spec - flow specification for TCP/IPv6 etc.
+ * @ip6src: Source host
+ * @ip6dst: Destination host
+ * @psrc: Source port
+ * @pdst: Destination port
+ * @tclass: Traffic Class
+ *
+ * This can be used to specify a TCP/IPv6, UDP/IPv6 or SCTP/IPv6 flow.
+ */
+struct ethtool_tcpip6_spec {
+   __be32  ip6src[4];
+   __be32  ip6dst[4];
+   __be16  psrc;
+   __be16  pdst;
+   __u8tclass;
+};
+
+/**
+ * struct ethtool_ah_espip6_spec - flow specification for IPsec/IPv6
+ * @ip6src: Source host
+ * @ip6dst: Destination host
+ * @spi: Security parameters index
+ * @tclass: Traffic Class
+ *
+ * This can be used to specify an IPsec transport or tunnel over IPv6.
+ */
+struct ethtool_ah_espip6_spec {
+   __be32  ip6src[4];
+   __be32  ip6dst[4];
+   __be32  spi;
+   __u8tclass;
+};
+
+/**
+ * struct ethtool_usrip6_spec - general flow specification for IPv6
+ * @ip6src: Source host
+ * @ip6dst: Destination host
+ * @l4_4_bytes: First 4 bytes of transport (layer 4) header
+ * @tclass: Traffic Class
+ * @l4_proto: Transport protocol number (nexthdr after any Extension Headers)
+ */
+struct ethtool_usrip6_spec {
+   __be32  ip6src[4];
+   __be32  ip6dst[4];
+   __be32  l4_4_bytes;
+   __u8tclass;
+   __u8l4_proto;
+};
+
 union ethtool_flow_union {
struct ethtool_tcpip4_spec  tcp_ip4_spec;
struct ethtool_tcpip4_spec  udp_ip4_spec;
@@ -747,6 +809,12 @@ union ethtool_flow_union {
struct ethtool_ah_espip4_spec   ah_ip4_spec;
struct ethtool_ah_espip4_spec   esp_ip4_spec;
struct ethtool_usrip4_spec  usr_ip4_spec;
+   struct ethtool_tcpip6_spec  tcp_ip6_spec;
+   struct ethtool_tcpip6_spec  udp_ip6_spec;
+   struct ethtool_tcpip6_spec  sctp_ip6_spec;
+ 

[ethtool PATCH v4 03/11] ethtool.c: fix dump_regs heap corruption

2016-03-07 Thread David Decotigny
From: David Decotigny 

The 'regs' pointer is owned by do_gregs(), but updated internally inside
dump_regs() without propagating it back to do_gregs(): later free(regs)
in do_gregs() reclaims the wrong area. This commit moves the realloc()
inside do_gregs().


Signed-off-by: David Decotigny 
---
 ethtool.c | 46 +-
 1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/ethtool.c b/ethtool.c
index 9f80d5f..7c2b5cb 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -994,7 +994,6 @@ void dump_hex(FILE *file, const u8 *data, int len, int 
offset)
 }
 
 static int dump_regs(int gregs_dump_raw, int gregs_dump_hex,
-const char *gregs_dump_file,
 struct ethtool_drvinfo *info, struct ethtool_regs *regs)
 {
int i;
@@ -1004,25 +1003,6 @@ static int dump_regs(int gregs_dump_raw, int 
gregs_dump_hex,
return 0;
}
 
-   if (gregs_dump_file) {
-   FILE *f = fopen(gregs_dump_file, "r");
-   struct stat st;
-   size_t nread;
-
-   if (!f || fstat(fileno(f), &st) < 0) {
-   fprintf(stderr, "Can't open '%s': %s\n",
-   gregs_dump_file, strerror(errno));
-   return -1;
-   }
-
-   regs = realloc(regs, sizeof(*regs) + st.st_size);
-   regs->len = st.st_size;
-   nread = fread(regs->data, regs->len, 1, f);
-   fclose(f);
-   if (nread != 1)
-   return -1;
-   }
-
if (!gregs_dump_hex)
for (i = 0; i < ARRAY_SIZE(driver_list); i++)
if (!strncmp(driver_list[i].name, info->driver,
@@ -2711,7 +2691,31 @@ static int do_gregs(struct cmd_context *ctx)
free(regs);
return 74;
}
-   if (dump_regs(gregs_dump_raw, gregs_dump_hex, gregs_dump_file,
+
+   if (!gregs_dump_raw && gregs_dump_file != NULL) {
+   /* overwrite reg values from file dump */
+   FILE *f = fopen(gregs_dump_file, "r");
+   struct stat st;
+   size_t nread;
+
+   if (!f || fstat(fileno(f), &st) < 0) {
+   fprintf(stderr, "Can't open '%s': %s\n",
+   gregs_dump_file, strerror(errno));
+   free(regs);
+   return 75;
+   }
+
+   regs = realloc(regs, sizeof(*regs) + st.st_size);
+   regs->len = st.st_size;
+   nread = fread(regs->data, regs->len, 1, f);
+   fclose(f);
+   if (nread != 1) {
+   free(regs);
+   return 75;
+   }
+}
+
+   if (dump_regs(gregs_dump_raw, gregs_dump_hex,
  &drvinfo, regs) < 0) {
fprintf(stderr, "Cannot dump registers\n");
free(regs);
-- 
2.7.0.rc3.207.g0ac5344



[ethtool PATCH v4 05/11] marvell.c: fix strict alias warnings

2016-03-07 Thread David Decotigny
From: Maciej Żenczykowski 

Addresses the following warnings:
  marvell.c:426:2: error: dereferencing type-punned pointer will break 
strict-aliasing rules [-Werror=strict-aliasing]
  marvell.c:427:2: error: dereferencing type-punned pointer will break 
strict-aliasing rules [-Werror=strict-aliasing]
  marvell.c:428:2: error: dereferencing type-punned pointer will break 
strict-aliasing rules [-Werror=strict-aliasing]

Note: code appears endian-dependent, not fixed by this commit.


Signed-off-by: Maciej Żenczykowski 
Signed-off-by: David Decotigny 
---
 marvell.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/marvell.c b/marvell.c
index e583c82..af21188 100644
--- a/marvell.c
+++ b/marvell.c
@@ -381,7 +381,8 @@ static void dump_prefetch(const char *name, const void *r)
 
 int sky2_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs)
 {
-   const u32 *r = (const u32 *) regs->data;
+   const u16 *r16 = (const u16 *) regs->data;
+   const u32 *r32 = (const u32 *) regs->data;
int dual;
 
dump_pci(regs->data + 0x1c00);
@@ -390,15 +391,15 @@ int sky2_dump_regs(struct ethtool_drvinfo *info, struct 
ethtool_regs *regs)
 
printf("\nBus Management Unit\n");
printf("---\n");
-   printf("CSR Receive Queue 1  0x%08X\n", r[24]);
-   printf("CSR Sync Queue 1 0x%08X\n", r[26]);
-   printf("CSR Async Queue 10x%08X\n", r[27]);
+   printf("CSR Receive Queue 1  0x%08X\n", r32[24]);
+   printf("CSR Sync Queue 1 0x%08X\n", r32[26]);
+   printf("CSR Async Queue 10x%08X\n", r32[27]);
 
dual = (regs->data[0x11e] & 2) != 0;
if (dual) {
-   printf("CSR Receive Queue 2  0x%08X\n", r[25]);
-   printf("CSR Async Queue 20x%08X\n", r[29]);
-   printf("CSR Sync Queue 2 0x%08X\n", r[28]);
+   printf("CSR Receive Queue 2  0x%08X\n", r32[25]);
+   printf("CSR Async Queue 20x%08X\n", r32[29]);
+   printf("CSR Sync Queue 2 0x%08X\n", r32[28]);
}
 
dump_mac(regs->data);
@@ -423,9 +424,9 @@ int sky2_dump_regs(struct ethtool_drvinfo *info, struct 
ethtool_regs *regs)
dump_timer("TX status", regs->data + 0xec0);
dump_timer("ISR", regs->data + 0xed0);
 
-   printf("\nGMAC control 0x%04X\n", *(u32 *)(regs->data + 
0xf00));
-   printf("GPHY control 0x%04X\n", *(u32 *)(regs->data + 
0xf04));
-   printf("LINK control 0x%02hX\n", *(u16 *)(regs->data + 
0xf10));
+   printf("\nGMAC control 0x%04X\n", r32[0xf00 >> 2]);
+   printf("GPHY control 0x%04X\n", r32[0xf04 >> 2]);
+   printf("LINK control 0x%02hX\n", r16[0xf10 >> 1]);
 
dump_gmac("GMAC 1", regs->data + 0x2800);
dump_gmac_fifo("Rx GMAC 1", regs->data + 0xc40);
-- 
2.7.0.rc3.207.g0ac5344



[ethtool PATCH v4 02/11] ethtool.c: don't ignore fread() return value

2016-03-07 Thread David Decotigny
From: David Decotigny 

This addresses:
  ethtool.c:1116:8: warning: ignoring return value of ‘fread’, declared with 
attribute warn_unused_result [-Wunused-result]


Signed-off-by: David Decotigny 
---
 ethtool.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/ethtool.c b/ethtool.c
index 92c40b8..9f80d5f 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -1007,6 +1007,7 @@ static int dump_regs(int gregs_dump_raw, int 
gregs_dump_hex,
if (gregs_dump_file) {
FILE *f = fopen(gregs_dump_file, "r");
struct stat st;
+   size_t nread;
 
if (!f || fstat(fileno(f), &st) < 0) {
fprintf(stderr, "Can't open '%s': %s\n",
@@ -1016,8 +1017,10 @@ static int dump_regs(int gregs_dump_raw, int 
gregs_dump_hex,
 
regs = realloc(regs, sizeof(*regs) + st.st_size);
regs->len = st.st_size;
-   fread(regs->data, regs->len, 1, f);
+   nread = fread(regs->data, regs->len, 1, f);
fclose(f);
+   if (nread != 1)
+   return -1;
}
 
if (!gregs_dump_hex)
-- 
2.7.0.rc3.207.g0ac5344



[ethtool PATCH v4 04/11] ethtool.c: do_seeprom checks for params & stdin sanity

2016-03-07 Thread David Decotigny
From: David Decotigny 

Tested:
  On qemu e1000:
  $ dd if=/dev/zero bs=2 count=5 | /mnt/ethtool -E eth0 length 9
  too much data from stdin
  $ dd if=/dev/zero bs=2 count=5 | /mnt/ethtool -E eth0 length 11
  not enough data from stdin
  $ dd if=/dev/zero bs=2 count=5 | /mnt/ethtool -E eth0 length 10
  Cannot set EEPROM data: Bad address


Signed-off-by: David Decotigny 
---
 ethtool.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/ethtool.c b/ethtool.c
index 7c2b5cb..d349bee 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -2828,8 +2828,10 @@ static int do_seeprom(struct cmd_context *ctx)
if (seeprom_length == -1)
seeprom_length = drvinfo.eedump_len;
 
-   if (drvinfo.eedump_len < seeprom_offset + seeprom_length)
-   seeprom_length = drvinfo.eedump_len - seeprom_offset;
+   if (drvinfo.eedump_len < seeprom_offset + seeprom_length) {
+   fprintf(stderr, "offset & length out of bounds\n");
+   return 1;
+   }
 
eeprom = calloc(1, sizeof(*eeprom)+seeprom_length);
if (!eeprom) {
@@ -2844,8 +2846,18 @@ static int do_seeprom(struct cmd_context *ctx)
eeprom->data[0] = seeprom_value;
 
/* Multi-byte write: read input from stdin */
-   if (!seeprom_value_seen)
-   eeprom->len = fread(eeprom->data, 1, eeprom->len, stdin);
+   if (!seeprom_value_seen) {
+   if (1 != fread(eeprom->data, eeprom->len, 1, stdin)) {
+   fprintf(stderr, "not enough data from stdin\n");
+   free(eeprom);
+   return 75;
+   }
+   if ((fgetc(stdin) != EOF) || !feof(stdin)) {
+   fprintf(stderr, "too much data from stdin\n");
+   free(eeprom);
+   return 75;
+   }
+   }
 
err = send_ioctl(ctx, eeprom);
if (err < 0) {
-- 
2.7.0.rc3.207.g0ac5344



[ethtool PATCH v4 11/11] ethtool.c: support absence of v4 sockets

2016-03-07 Thread David Decotigny
From: David Decotigny 


Signed-off-by: David Decotigny 
---
 ethtool.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/ethtool.c b/ethtool.c
index 761252f..f9336e3 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -4615,6 +4615,9 @@ opt_found:
/* Open control socket. */
ctx.fd = socket(AF_INET, SOCK_DGRAM, 0);
if (ctx.fd < 0) {
+   ctx.fd = socket(AF_UNIX, SOCK_DGRAM, 0);
+   }
+   if (ctx.fd < 0) {
perror("Cannot get control socket");
return 70;
}
-- 
2.7.0.rc3.207.g0ac5344



[ethtool PATCH v4 00/11] add support for new ETHTOOL_xLINKSETTINGS ioctls

2016-03-07 Thread David Decotigny
From: David Decotigny 

This adds support for the new ETHTOOL_xLINKSETTINGS ioctls. This also
fixes a few compilation warnings as well as a heap corruption bug.

History:
  v4
review Ben Hutchings:
  using AF_UNIX instead of INET6 in the absence of v4 sockets
  use stdbool.h
  do_seeprom always fails when offset/length out of bounds
  sync to latest ethtool.h + kernel.h from net-next
  __SANE_USERSPACE_TYPES__ always defined
  cosmetic updates for var == const tests
  cosmetic updates for associativity in tests
  v3
TRUE/FALSE obvious-ification
  v2
added do_seeprom patch
added netdev@ as recipient
  v1
initial submission


# Patch Set Summary:

David Decotigny (7):
  ethtool.c: don't ignore fread() return value
  ethtool.c: fix dump_regs heap corruption
  ethtool.c: do_seeprom checks for params & stdin sanity
  kernel-copy.h: import kernel.h from net-next and use it
  ethtool-copy.h: sync with net-next
  ethtool.c: add support for ETHTOOL_xLINKSETTINGS ioctls
  ethtool.c: support absence of v4 sockets

Maciej Żenczykowski (4):
  internal.h: change to new sane kernel headers on 64-bit archs
  marvell.c: fix strict alias warnings
  test-common.c: fix test_realloc(NULL, ...)
  test-features.c: add braces around array initialization

 ethtool-copy.h  | 478 ++--
 ethtool.c   | 751 ++--
 internal.h  |  77 +-
 kernel-copy.h   |  14 ++
 marvell.c   |  21 +-
 test-cmdline.c  |  12 +
 test-common.c   |   2 +-
 test-features.c |   2 +-
 8 files changed, 1086 insertions(+), 271 deletions(-)
 create mode 100644 kernel-copy.h

-- 
2.7.0.rc3.207.g0ac5344



[ethtool PATCH v4 01/11] internal.h: change to new sane kernel headers on 64-bit archs

2016-03-07 Thread David Decotigny
From: Maciej Żenczykowski 

On ppc64, this fixes:
  In file included from ethtool-copy.h:22:0,
   from internal.h:32,
   from ethtool.c:29:
  .../include/linux/types.h:32:25: error: conflicting types for '__be64'
   typedef __u64 __bitwise __be64;
   ^
  In file included from ethtool.c:29:0:
  internal.h:23:28: note: previous declaration of '__be64' was here
   typedef unsigned long long __be64;
  ^
  ethtool.c: In function 'do_gstats':
  ethtool.c:3166:4: error: format '%llu' expects argument of type 'long long 
unsigned int', but argument 5 has type '__u64' [-Werror=format=]
  stats->data[i]);
  ^
  ethtool.c: In function 'print_indir_table':
  ethtool.c:3293:9: error: format '%llu' expects argument of type 'long long 
unsigned int', but argument 3 has type '__u64' [-Werror=format=]
   ctx->devname, ring_count->data);
   ^


Signed-off-by: Maciej Żenczykowski 
Signed-off-by: David Decotigny 
---
 internal.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/internal.h b/internal.h
index b5ef646..e38d305 100644
--- a/internal.h
+++ b/internal.h
@@ -3,6 +3,12 @@
 #ifndef ETHTOOL_INTERNAL_H__
 #define ETHTOOL_INTERNAL_H__
 
+/* Some platforms (eg. ppc64) need __SANE_USERSPACE_TYPES__ before
+ * <linux/types.h> to select 'int-ll64.h' and avoid compile warnings
+ * when printing __u64 with %llu.
+ */
+#define __SANE_USERSPACE_TYPES__
+
 #ifdef HAVE_CONFIG_H
 #include "ethtool-config.h"
 #endif
-- 
2.7.0.rc3.207.g0ac5344



[PATCHv2 net] bridge: a netlink notification should be sent whenever those attributes change

2016-03-07 Thread Xin Long
Now when we change the attributes of a bridge or bridge port via netlink,
a relevant netlink notification is sent, but if we change them
via ioctl or sysfs, no notification is sent.

We should ensure that whenever those attributes change, whether internally
or from sysfs/ioctl, a netlink notification is sent out to listeners.

Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.

Signed-off-by: Xin Long 
---
 net/bridge/br_ioctl.c| 40 
 net/bridge/br_sysfs_br.c |  5 +
 net/bridge/br_sysfs_if.c |  6 +-
 3 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/net/bridge/br_ioctl.c b/net/bridge/br_ioctl.c
index 263b4de..f8fc624 100644
--- a/net/bridge/br_ioctl.c
+++ b/net/bridge/br_ioctl.c
@@ -112,7 +112,9 @@ static int add_del_if(struct net_bridge *br, int ifindex, 
int isadd)
 static int old_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 {
struct net_bridge *br = netdev_priv(dev);
+   struct net_bridge_port *p = NULL;
unsigned long args[4];
+   int ret = -EOPNOTSUPP;
 
if (copy_from_user(args, rq->ifr_data, sizeof(args)))
return -EFAULT;
@@ -182,25 +184,29 @@ static int old_dev_ioctl(struct net_device *dev, struct 
ifreq *rq, int cmd)
if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
-   return br_set_forward_delay(br, args[1]);
+   ret = br_set_forward_delay(br, args[1]);
+   break;
 
case BRCTL_SET_BRIDGE_HELLO_TIME:
if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
-   return br_set_hello_time(br, args[1]);
+   ret = br_set_hello_time(br, args[1]);
+   break;
 
case BRCTL_SET_BRIDGE_MAX_AGE:
if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
-   return br_set_max_age(br, args[1]);
+   ret = br_set_max_age(br, args[1]);
+   break;
 
case BRCTL_SET_AGEING_TIME:
if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
-   return br_set_ageing_time(br, args[1]);
+   ret = br_set_ageing_time(br, args[1]);
+   break;
 
case BRCTL_GET_PORT_INFO:
{
@@ -240,20 +246,19 @@ static int old_dev_ioctl(struct net_device *dev, struct 
ifreq *rq, int cmd)
return -EPERM;
 
br_stp_set_enabled(br, args[1]);
-   return 0;
+   ret = 0;
+   break;
 
case BRCTL_SET_BRIDGE_PRIORITY:
if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
br_stp_set_bridge_priority(br, args[1]);
-   return 0;
+   ret = 0;
+   break;
 
case BRCTL_SET_PORT_PRIORITY:
{
-   struct net_bridge_port *p;
-   int ret;
-
if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
@@ -263,14 +268,11 @@ static int old_dev_ioctl(struct net_device *dev, struct 
ifreq *rq, int cmd)
else
ret = br_stp_set_port_priority(p, args[2]);
spin_unlock_bh(&br->lock);
-   return ret;
+   break;
}
 
case BRCTL_SET_PATH_COST:
{
-   struct net_bridge_port *p;
-   int ret;
-
if (!ns_capable(dev_net(dev)->user_ns, CAP_NET_ADMIN))
return -EPERM;
 
@@ -280,8 +282,7 @@ static int old_dev_ioctl(struct net_device *dev, struct 
ifreq *rq, int cmd)
else
ret = br_stp_set_path_cost(p, args[2]);
spin_unlock_bh(&br->lock);
-
-   return ret;
+   break;
}
 
case BRCTL_GET_FDB_ENTRIES:
@@ -289,7 +290,14 @@ static int old_dev_ioctl(struct net_device *dev, struct 
ifreq *rq, int cmd)
   args[2], args[3]);
}
 
-   return -EOPNOTSUPP;
+   if (!ret) {
+   if (p)
+   br_ifinfo_notify(RTM_NEWLINK, p);
+   else
+   netdev_state_change(br->dev);
+   }
+
+   return ret;
 }
 
 static int old_deviceless(struct net *net, void __user *uarg)
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index 6b80914..09608e6 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -44,6 +44,11 @@ static ssize_t store_bridge_parm(struct device *d,
return -EINVAL;
 
err = (*set)(br, val);
+   if (!err) {
+   rtnl_lock();
+   

Re: [PATCH net] sctp: use gfp insteaad of GFP_NOWAIT in idr_alloc_cyclic when sctp_assoc_set_id

2016-03-07 Thread Xin Long
On Mon, Mar 7, 2016 at 7:21 AM, Eric Dumazet  wrote:
> What is the problem of being not able to allocate memory at this point ?
>
Now that I think about it again, this patch cannot work, because of:
__sctp_connect()
    sctp_assoc_set_id(asoc, GFP_KERNEL)

i.e. a caller can pass GFP_KERNEL, which must not be used for the
allocation performed under spin_lock_bh().

thanks, Eric

> If really it bothers you (although we have thousands of other places it
> can happen), maybe add __GFP_NOWARN
>
> But whatever flag we use here, idr_alloc() _can_ fail.
>
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 2bf8ec92dde4..2ae3874e3696 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -1606,7 +1606,8 @@ int sctp_assoc_set_id(struct sctp_association *asoc, 
> gfp_t gfp)
> idr_preload(gfp);
> spin_lock_bh(&sctp_assocs_id_lock);
> /* 0 is not a valid assoc_id, must be >= 1 */
> -   ret = idr_alloc_cyclic(&sctp_assocs_id, asoc, 1, 0, GFP_NOWAIT);
> +   ret = idr_alloc_cyclic(&sctp_assocs_id, asoc, 1, 0,
> +  (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN);
> spin_unlock_bh(&sctp_assocs_id_lock);
> if (preload)
> idr_preload_end();
>
>


'ip tunnel add' uses the wrong interface name on error

2016-03-07 Thread John Morrissey
'ip tunnel add' uses the wrong interface name in its error output.
For example, if the requested interface already exists:

--
[jwm@boost:pts/8 ~> sudo ip tunnel add sit1 mode sit remote 10.10.10.11 local 
10.10.10.10
add tunnel "sit0" failed: No buffer space available
--

ip/tunnel.c:tnl_add_ioctl() only uses the passed interface name on
SIOCCHGTUNNEL, not SIOCADDTUNNEL. Shouldn't it always use the passed
interface name, if present?

--
int tnl_add_ioctl(int cmd, const char *basedev, const char *name, void *p)
{
struct ifreq ifr;
int fd;
int err;

if (cmd == SIOCCHGTUNNEL && name[0])
strncpy(ifr.ifr_name, name, IFNAMSIZ);
else
strncpy(ifr.ifr_name, basedev, IFNAMSIZ);
[...]
err = ioctl(fd, cmd, &ifr);
if (err)
fprintf(stderr, "add tunnel \"%s\" failed: %s\n", ifr.ifr_name,
strerror(errno));
--
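
One possible, untested tweak would be to keep issuing the SIOCADDTUNNEL
ioctl on basedev (the kernel takes the new name from the ip_tunnel_parm
passed in ifr_data), and only report the requested name in the error:

	err = ioctl(fd, cmd, &ifr);
	if (err)
		fprintf(stderr, "add tunnel \"%s\" failed: %s\n",
			name[0] ? name : ifr.ifr_name, strerror(errno));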

-john


[RFC PATCH net-next v2] tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In

2016-03-07 Thread Martin KaFai Lau
v2:
Rework based on recent fix by Eric:
commit a9d99ce28ed3 ("tcp: fix tcpi_segs_in after connection establishment")

v1:
Per RFC4898, they count segments sent/received
containing a positive length data segment (that includes
retransmission segments carrying data).  Unlike
tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments
carrying no data (e.g. pure ack).

The patch also updates the segs_in in tcp_fastopen_add_skb()
so that segs_in >= data_segs_in property is kept.

Together with retransmission data, tcpi_data_segs_out
gives a better signal on the rxmit rate.
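
For reference, a minimal userspace sketch of how the new counters would
be read once exported; linux/tcp.h is assumed to carry the two new
fields:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>

/* Print the new data-segment counters for a connected TCP socket fd;
 * error handling trimmed for brevity.
 */
static void print_data_segs(int fd)
{
	struct tcp_info info;
	socklen_t len = sizeof(info);

	memset(&info, 0, sizeof(info));
	if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0 &&
	    len >= sizeof(info))
		printf("data segs in/out: %u/%u\n",
		       info.tcpi_data_segs_in, info.tcpi_data_segs_out);
}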

Signed-off-by: Martin KaFai Lau 
Cc: Chris Rapier 
Cc: Eric Dumazet 
Cc: Marcelo Ricardo Leitner 
Cc: Neal Cardwell 
Cc: Yuchung Cheng 
---
 include/linux/tcp.h  |  6 ++
 include/net/tcp.h| 10 ++
 include/uapi/linux/tcp.h |  2 ++
 net/ipv4/tcp.c   |  2 ++
 net/ipv4/tcp_fastopen.c  |  4 
 net/ipv4/tcp_ipv4.c  |  2 +-
 net/ipv4/tcp_minisocks.c |  2 +-
 net/ipv4/tcp_output.c|  4 +++-
 net/ipv6/tcp_ipv6.c  |  2 +-
 9 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index bcbf51d..7be9b12 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -158,6 +158,9 @@ struct tcp_sock {
u32 segs_in;/* RFC4898 tcpEStatsPerfSegsIn
 * total number of segments in.
 */
+   u32 data_segs_in;   /* RFC4898 tcpEStatsPerfDataSegsIn
+* total number of data segments in.
+*/
u32 rcv_nxt;/* What we want to receive next */
u32 copied_seq; /* Head of yet unread data  */
u32 rcv_wup;/* rcv_nxt on last window update sent   */
@@ -165,6 +168,9 @@ struct tcp_sock {
u32 segs_out;   /* RFC4898 tcpEStatsPerfSegsOut
 * The total number of segments sent.
 */
+   u32 data_segs_out;  /* RFC4898 tcpEStatsPerfDataSegsOut
+* total number of data segments sent.
+*/
u64 bytes_acked;/* RFC4898 tcpEStatsAppHCThruOctetsAcked
 * sum(delta(snd_una)), or how many bytes
 * were acked.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e90db85..e2916cc 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1816,4 +1816,14 @@ static inline void skb_set_tcp_pure_ack(struct sk_buff 
*skb)
skb->truesize = 2;
 }
 
+static inline void tcp_segs_in(struct tcp_sock *tp, struct sk_buff *skb)
+{
+   u16 segs_in;
+
+   segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+   tp->segs_in += segs_in;
+   if (skb->len > tcp_hdrlen(skb))
+   tp->data_segs_in += segs_in;
+}
+
 #endif /* _TCP_H */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index fe95446..53e8e3f 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -199,6 +199,8 @@ struct tcp_info {
 
__u32   tcpi_notsent_bytes;
__u32   tcpi_min_rtt;
+   __u32   tcpi_data_segs_in;  /* RFC4898 tcpEStatsDataSegsIn */
+   __u32   tcpi_data_segs_out; /* RFC4898 tcpEStatsDataSegsOut */
 };
 
 /* for TCP_MD5SIG socket option */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f9faadb..6b01b48 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2728,6 +2728,8 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
info->tcpi_notsent_bytes = max(0, notsent_bytes);
 
info->tcpi_min_rtt = tcp_min_rtt(tp);
+   info->tcpi_data_segs_in = tp->data_segs_in;
+   info->tcpi_data_segs_out = tp->data_segs_out;
 }
 EXPORT_SYMBOL_GPL(tcp_get_info);
 
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index fdb286d..f583c85 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -131,6 +131,7 @@ static bool tcp_fastopen_cookie_gen(struct request_sock 
*req,
 void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
+   u16 segs_in;
 
if (TCP_SKB_CB(skb)->end_seq == tp->rcv_nxt)
return;
@@ -154,6 +155,9 @@ void tcp_fastopen_add_skb(struct sock *sk, struct sk_buff 
*skb)
 * as we certainly are not changing upper 32bit value (0)
 */
tp->bytes_received = skb->len;
+   segs_in = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+   tp->segs_in = segs_in;
+   tp->data_segs_in = segs_in;
 
if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
tcp_fin(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 4c8d58d..0b02ef7 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1650,7 +1650,7 @@ process:
   

Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Linus Torvalds
On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert  wrote:
>
> As I said previously, if alignment really is a factor then we can
> check up front if a buffer crosses a page boundary and call the slow
> path function (original code). I'm seeing a 1 nsec hit to add this
> check.

It shouldn't be a factor, and you shouldn't check for it. My code was
self-aligning, and had at most one unaligned access at the beginning
(the data of which was then used to align the rest).

Tom had a version that used that. Although now that I look back at it,
it seems to be broken by some confusion about the one-byte alignment
vs 8-byte alignment.

 Linus
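
(For context, the up-front check Tom refers to would look something like
the sketch below; do_csum_orig() is a stand-in name for the original
slow-path code, and len is assumed to be non-zero:)

static inline bool csum_crosses_page(const void *buff, unsigned int len)
{
	unsigned long a = (unsigned long)buff;

	/* do the first and the last byte live on different pages? */
	return ((a ^ (a + len - 1)) & PAGE_MASK) != 0;
}

/* e.g.: if (csum_crosses_page(buff, len)) return do_csum_orig(buff, len); */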


Re: [PATCH v2 net-next 11/13] kcm: Add memory limit for receive message construction

2016-03-07 Thread Sowmini Varadhan
On (03/07/16 14:11), Tom Herbert wrote:
> 
> Message assembly is performed on the TCP socket. This is logically
> equivalent of an application that performs a peek on the socket to find
> out how much memory is needed for a receive buffer. The receive socket
> buffer also provides the maximum message size which is checked.
> 
> The receive algorithm is something like:
> 
>1) Receive the first skbuf for a message (or skbufs if multiple are
>   needed to determine message length).
>2) Check the message length against the number of bytes in the TCP
>   receive queue (tcp_inq()).
>   - If all the bytes of the message are in the queue (incluing the
> skbuf received), then proceed with message assembly (it should
> complete with the tcp_read_sock)
> - Else, mark the psock with the number of bytes needed to
> complete the message.
>3) In TCP data ready function, if the psock indicates that we are
>   waiting for the rest of the bytes of a messages, check the number
>   of queued bytes against that.
> - If there are still not enough bytes for the message, just
> return
> - Else, clear the waiting bytes and proceed to receive the
> skbufs.  The message should now be received in one
> tcp_read_sock
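
(To make step 3 concrete, here is a minimal sketch of that check; the
psock field and actor names are made up for illustration:)

static void kcm_psock_data_ready(struct sock *sk)
{
	struct kcm_psock *psock = sk->sk_user_data;	/* assumed lookup */
	read_descriptor_t desc = { .arg.data = psock, .count = 1 };

	/* Step 3: still waiting for the rest of a message? */
	if (psock->rx_need_bytes && tcp_inq(sk) < psock->rx_need_bytes)
		return;

	psock->rx_need_bytes = 0;
	tcp_read_sock(sk, &desc, kcm_recv_actor);	/* assumed actor */
}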

AIUI, the above logic will make sure that ->sk_data_ready reads the entire
message in one shot.  For "very large" messages, TCP's windowing logic
will eventually kick in, and the sender (all threads in the sender
that share the single tcp socket)  will be throttled, correct?  

I suppose that (all sender threads being blocked behind one "too large"
message) may not be an unreasonable constraint, but is it possible to end
up with a deadlocked TCP (recv) socket, one for which the receiver
closed the window (so the sender TCP cannot send the remaining
bytes of the kcm message), but which cannot be drained because of #3 above?

BTW there are a couple of typos above;
s/skbuf/skbuff
s/incluing/including

--Sowmini



[net-next:master 1058/1060] qede_main.c:undefined reference to `tcp_gro_complete'

2016-03-07 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   d66ab51442211158b677c2f12310c314d9587f74
commit: 55482edc25f0606851de42e73618f813f310d009 [1058/1060] qede: Add 
slowpath/fastpath support and enable hardware GRO
config: x86_64-randconfig-s0-03080757 (attached as .config)
reproduce:
git checkout 55482edc25f0606851de42e73618f813f310d009
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/built-in.o: In function `qede_rx_int':
>> qede_main.c:(.text+0x6101a0): undefined reference to `tcp_gro_complete'

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data
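
(The usual resolution for this class of randconfig failure is to make the
call conditional on the symbol that provides it -- a sketch, assuming the
qede fix takes this shape; the actual patch may differ:)

	#ifdef CONFIG_INET
		tcp_gro_complete(skb);
	#endif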


Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Tom Herbert
On Mon, Mar 7, 2016 at 4:49 PM, Alexander Duyck
 wrote:
> On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert  wrote:
>> On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck
>>  wrote:
>>> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert  wrote:
 On Mon, Mar 7, 2016 at 5:56 AM, David Laight  
 wrote:
> From: Alexander Duyck
>  ...
>> Actually probably the easiest way to go on x86 is to just replace the
>> use of len with (len >> 6) and use decl or incl instead of addl or
>> subl, and lea instead of addq for the buff address.  None of those
>> instructions effect the carry flag as this is how such loops were
>> intended to be implemented.
>>
>> I've been doing a bit of testing and that seems to work without
>> needing the adcq until after you exit the loop, but doesn't give that
>> much of a gain in speed for dropping the instruction from the
>> hot-path.  I suspect we are probably memory bottle-necked already in
>> the loop so dropping an instruction or two doesn't gain you much.
>
> Right, any superscalar architecture gives you some instructions
> 'for free' if they can execute at the same time as those on the
> critical path (in this case the memory reads and the adc).
> This is why loop unrolling can be pointless.
>
> So the loop:
> 10: addc %rax,(%rdx,%rcx,8)
> inc %rcx
> jnz 10b
> could easily be as fast as anything that doesn't use the 'new'
> instructions that use the overflow flag.
> That loop might be measurable faster for aligned buffers.

 Tested by replacing the unrolled loop in my patch with just:

 if (len >= 8) {
 asm("clc\n\t"
 "0: adcq (%[src],%%rcx,8),%[res]\n\t"
 "decl %%ecx\n\t"
 "jge 0b\n\t"
 "adcq $0, %[res]\n\t"
 : [res] "=r" (result)
 : [src] "r" (buff), "[res]" (result), "c"
 ((len >> 3) - 1));
 }

 This seems to be significantly slower:

 1400 bytes: 797 nsecs vs. 202 nsecs
 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>>>
>>> You still need the loop unrolling as the decl and jge have some
>>> overhead.  You can't just get rid of it with a single call in a tight
>>> loop but it should improve things.  The gain from what I have seen
>>> ends up being minimal though.  I haven't really noticed all that much
>>> in my tests anyway.
>>>
>>> I have been doing some testing and the penalty for an unaligned
>>> checksum can get pretty big if the data-set is big enough.  I was
>>> messing around and tried doing a checksum over 32K minus some offset
>>> and was seeing a penalty of about 200 cycles per 64K frame.
>>>
>> Out of how many cycles to checksum 64K though?
>
> So the clock cycles I am seeing is ~16660 for unaligned vs 16416
> aligned.  So yeah the effect is only a 1.5% penalty for the total
> time.
>
>>> One thought I had is that we may want to look into making an inline
>>> function that we can call for compile-time defined lengths less than
>>> 64.  Maybe call it something like __csum_partial and we could then use
>>> that in place of csum_partial for all those headers that are a fixed
>>> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
>>> we might be able to look at taking care of alignment for csum_partial
>>> which will improve the skb_checksum() case without impacting the
>>> header pulling cases as much since that code would be inlined
>>> elsewhere.
>>>
>> As I said previously, if alignment really is a factor then we can
>> check up front if a buffer crosses a page boundary and call the slow
>> path function (original code). I'm seeing a 1 nsec hit to add this
>> check.
>
> Well I was just noticing there are a number of places we can get an
> even bigger benefit if we just bypass the need for csum_partial
> entirely.  For example the DSA code is calling csum_partial to extract
> 2 bytes.  Same thing for protocols such as VXLAN and the like.  If we
> could catch cases like these with a __builtin_constant_p check then we
> might be able to save some significant CPU time by avoiding the
> function call entirely and just doing some inline addition on the
> input values directly.
>
Sure, we could inline a switch function for common values (0, 2, 4, 8,
14, 16, 20, 40) maybe.
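
Something along these lines, where __csum_partial is a hypothetical name and
only the len == 2 case is spelled out:

	static inline __wsum __csum_partial(const void *buff, int len, __wsum sum)
	{
		/* tiny constant lengths fold to straight-line adds;
		 * everything else falls back to the out-of-line code
		 */
		if (__builtin_constant_p(len) && len == 2) {
			u32 result = (__force u32)sum + *(const u16 *)buff;

			result += result < (__force u32)sum; /* end-around carry */
			return (__force __wsum)result;
		}
		return csum_partial(buff, len, sum);
	}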

> - Alex


Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Alexander Duyck
On Mon, Mar 7, 2016 at 4:07 PM, Tom Herbert  wrote:
> On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck
>  wrote:
>> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert  wrote:
>>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight  
>>> wrote:
 From: Alexander Duyck
  ...
> Actually probably the easiest way to go on x86 is to just replace the
> use of len with (len >> 6) and use decl or incl instead of addl or
> subl, and lea instead of addq for the buff address.  None of those
> instructions effect the carry flag as this is how such loops were
> intended to be implemented.
>
> I've been doing a bit of testing and that seems to work without
> needing the adcq until after you exit the loop, but doesn't give that
> much of a gain in speed for dropping the instruction from the
> hot-path.  I suspect we are probably memory bottle-necked already in
> the loop so dropping an instruction or two doesn't gain you much.

 Right, any superscalar architecture gives you some instructions
 'for free' if they can execute at the same time as those on the
 critical path (in this case the memory reads and the adc).
 This is why loop unrolling can be pointless.

 So the loop:
 10: addc %rax,(%rdx,%rcx,8)
 inc %rcx
 jnz 10b
 could easily be as fast as anything that doesn't use the 'new'
 instructions that use the overflow flag.
 That loop might be measurable faster for aligned buffers.
>>>
>>> Tested by replacing the unrolled loop in my patch with just:
>>>
>>> if (len >= 8) {
>>> asm("clc\n\t"
>>> "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>>> "decl %%ecx\n\t"
>>> "jge 0b\n\t"
>>> "adcq $0, %[res]\n\t"
>>> : [res] "=r" (result)
>>> : [src] "r" (buff), "[res]" (result), "c"
>>> ((len >> 3) - 1));
>>> }
>>>
>>> This seems to be significantly slower:
>>>
>>> 1400 bytes: 797 nsecs vs. 202 nsecs
>>> 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>>
>> You still need the loop unrolling as the decl and jge have some
>> overhead.  You can't just get rid of it with a single call in a tight
>> loop but it should improve things.  The gain from what I have seen
>> ends up being minimal though.  I haven't really noticed all that much
>> in my tests anyway.
>>
>> I have been doing some testing and the penalty for an unaligned
>> checksum can get pretty big if the data-set is big enough.  I was
>> messing around and tried doing a checksum over 32K minus some offset
>> and was seeing a penalty of about 200 cycles per 64K frame.
>>
> Out of how many cycles to checksum 64K though?

So the clock cycles I am seeing is ~16660 for unaligned vs 16416
aligned.  So yeah the effect is only a 1.5% penalty for the total
time.

>> One thought I had is that we may want to look into making an inline
>> function that we can call for compile-time defined lengths less than
>> 64.  Maybe call it something like __csum_partial and we could then use
>> that in place of csum_partial for all those headers that are a fixed
>> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
>> we might be able to look at taking care of alignment for csum_partial
>> which will improve the skb_checksum() case without impacting the
>> header pulling cases as much since that code would be inlined
>> elsewhere.
>>
> As I said previously, if alignment really is a factor then we can
> check up front if a buffer crosses a page boundary and call the slow
> path function (original code). I'm seeing a 1 nsec hit to add this
> check.

Well I was just noticing there are a number of places we can get an
even bigger benefit if we just bypass the need for csum_partial
entirely.  For example the DSA code is calling csum_partial to extract
2 bytes.  Same thing for protocols such as VXLAN and the like.  If we
could catch cases like these with a __builtin_constant_p check then we
might be able to save some significant CPU time by avoiding the
function call entirely and just doing some inline addition on the
input values directly.

- Alex


linux-next: manual merge of the net-next tree with the net tree

2016-03-07 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  net/tipc/subscr.c

between commit:

  4de13d7ed6ff ("tipc: fix nullptr crash during subscription cancel")

from the net tree and commit:

  7c13c6224123 ("tipc: introduce tipc_subscrb_subscribe() routine")
(and following ones)

from the net-next tree.

I fixed it up (I used the net-next tree version as it is not obvious that
the net tree patch is still needed) and can carry the fix as necessary
(no action is required).

-- 
Cheers,
Stephen Rothwell


Re: [PATCH net-next 2/3] ipv6: per netns fib6 walkers

2016-03-07 Thread Cong Wang
On Mon, Mar 7, 2016 at 4:26 PM, Cong Wang  wrote:
> On Fri, Mar 4, 2016 at 2:59 AM, Michal Kubecek  wrote:
>>  static void ipv6_route_seq_setup_walk(struct ipv6_route_iter *iter)
>>  {
>> +#ifdef CONFIG_NET_NS
>> +   struct net *net = iter->p.net;
>> +#else
>> +   struct net *net = &init_net;
>> +#endif
>> +
>
> You should pass the struct net pointer to ipv6_route_seq_setup_walk()
> instead of reading it by yourself.
>
> I don't find anyone actually using iter->p, it probably can be just removed.

Er, seq_file_net() uses it... but callers already call it.


Re: [PATCH net-next 2/3] ipv6: per netns fib6 walkers

2016-03-07 Thread Cong Wang
On Fri, Mar 4, 2016 at 2:59 AM, Michal Kubecek  wrote:
>  static void ipv6_route_seq_setup_walk(struct ipv6_route_iter *iter)
>  {
> +#ifdef CONFIG_NET_NS
> +   struct net *net = iter->p.net;
> +#else
> +   struct net *net = &init_net;
> +#endif
> +

You should pass the struct net pointer to ipv6_route_seq_setup_walk()
instead of reading it by yourself.

I don't find anyone actually using iter->p, it probably can be just removed.
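
(i.e. something like the following sketch, with the walker-setup body elided;
seq_file_net() is the existing helper the callers already use:)

	static void ipv6_route_seq_setup_walk(struct net *net,
					      struct ipv6_route_iter *iter)
	{
		/* ... walker setup, using 'net' directly ... */
	}

	/* caller: */
	ipv6_route_seq_setup_walk(seq_file_net(seq), iter);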


[PATCH 4/5] ti: wl1251: Convert wl1251_notice to wiphy_info/pr_info

2016-03-07 Thread Joe Perches
Use the more common logging mechanisms.

Convert to wiphy_info as these wl1251_notice uses were actually
emitted at KERN_INFO.

Signed-off-by: Joe Perches 
---
 drivers/net/wireless/ti/wl1251/main.c   | 4 ++--
 drivers/net/wireless/ti/wl1251/sdio.c   | 2 +-
 drivers/net/wireless/ti/wl1251/wl1251.h | 3 ---
 3 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/net/wireless/ti/wl1251/main.c 
b/drivers/net/wireless/ti/wl1251/main.c
index a0576e7..593bed9 100644
--- a/drivers/net/wireless/ti/wl1251/main.c
+++ b/drivers/net/wireless/ti/wl1251/main.c
@@ -1469,7 +1469,7 @@ static int wl1251_register_hw(struct wl1251 *wl)
 
wl->mac80211_registered = true;
 
-   wl1251_notice("loaded");
+   wiphy_info(wl->hw->wiphy, "loaded\n");
 
return 0;
 }
@@ -1503,7 +1503,7 @@ int wl1251_init_ieee80211(struct wl1251 *wl)
goto out;
 
wl1251_debugfs_init(wl);
-   wl1251_notice("initialized");
+   wiphy_info(wl->hw->wiphy, "initialized\n");
 
ret = 0;
 
diff --git a/drivers/net/wireless/ti/wl1251/sdio.c 
b/drivers/net/wireless/ti/wl1251/sdio.c
index f48e985..cc52a97 100644
--- a/drivers/net/wireless/ti/wl1251/sdio.c
+++ b/drivers/net/wireless/ti/wl1251/sdio.c
@@ -382,7 +382,7 @@ static int __init wl1251_sdio_init(void)
 static void __exit wl1251_sdio_exit(void)
 {
sdio_unregister_driver(&wl1251_sdio_driver);
-   wl1251_notice("unloaded");
+   pr_info("unloaded\n");
 }
 
 module_init(wl1251_sdio_init);
diff --git a/drivers/net/wireless/ti/wl1251/wl1251.h 
b/drivers/net/wireless/ti/wl1251/wl1251.h
index 5d520d2..62a40cc 100644
--- a/drivers/net/wireless/ti/wl1251/wl1251.h
+++ b/drivers/net/wireless/ti/wl1251/wl1251.h
@@ -54,9 +54,6 @@ enum {
 
 #define DEBUG_DUMP_LIMIT 1024
 
-#define wl1251_notice(fmt, arg...) \
-   printk(KERN_INFO DRIVER_PREFIX fmt "\n", ##arg)
-
 #define wl1251_info(fmt, arg...) \
printk(KERN_DEBUG DRIVER_PREFIX fmt "\n", ##arg)
 
-- 
2.6.3.368.gf34be46



[PATCH 2/5] ti: wl1251: Convert wl1251_error to wiphy_err/pr_err

2016-03-07 Thread Joe Perches
Use the more common logging mechanisms.

Miscellanea:

o Coalesce formats
o Realign arguments
o Reflow to fit 80 columns
o Add #define pr_fmt when pr_ is used

Signed-off-by: Joe Perches 
---
 drivers/net/wireless/ti/wl1251/acx.c|  6 ++---
 drivers/net/wireless/ti/wl1251/boot.c   | 18 +++---
 drivers/net/wireless/ti/wl1251/cmd.c| 42 +
 drivers/net/wireless/ti/wl1251/event.c  |  2 +-
 drivers/net/wireless/ti/wl1251/init.c   | 13 +-
 drivers/net/wireless/ti/wl1251/io.c |  3 ++-
 drivers/net/wireless/ti/wl1251/main.c   | 41 ++--
 drivers/net/wireless/ti/wl1251/ps.c |  2 +-
 drivers/net/wireless/ti/wl1251/rx.c |  2 +-
 drivers/net/wireless/ti/wl1251/sdio.c   | 19 +--
 drivers/net/wireless/ti/wl1251/spi.c| 19 ---
 drivers/net/wireless/ti/wl1251/tx.c |  5 ++--
 drivers/net/wireless/ti/wl1251/wl1251.h |  3 ---
 13 files changed, 93 insertions(+), 82 deletions(-)

diff --git a/drivers/net/wireless/ti/wl1251/acx.c 
b/drivers/net/wireless/ti/wl1251/acx.c
index d6fbdda..23b4882 100644
--- a/drivers/net/wireless/ti/wl1251/acx.c
+++ b/drivers/net/wireless/ti/wl1251/acx.c
@@ -28,7 +28,7 @@ int wl1251_acx_frame_rates(struct wl1251 *wl, u8 ctrl_rate, 
u8 ctrl_mod,
ret = wl1251_cmd_configure(wl, ACX_FW_GEN_FRAME_RATES,
   rates, sizeof(*rates));
if (ret < 0) {
-   wl1251_error("Failed to set FW rates and modulation");
+   wiphy_err(wl->hw->wiphy, "Failed to set FW rates and 
modulation\n");
goto out;
}
 
@@ -74,7 +74,7 @@ int wl1251_acx_default_key(struct wl1251 *wl, u8 key_id)
ret = wl1251_cmd_configure(wl, DOT11_DEFAULT_KEY,
   default_key, sizeof(*default_key));
if (ret < 0) {
-   wl1251_error("Couldn't set default key");
+   wiphy_err(wl->hw->wiphy, "Couldn't set default key\n");
goto out;
}
 
@@ -208,7 +208,7 @@ int wl1251_acx_feature_cfg(struct wl1251 *wl, u32 
data_flow_options)
ret = wl1251_cmd_configure(wl, ACX_FEATURE_CFG,
   feature, sizeof(*feature));
if (ret < 0) {
-   wl1251_error("Couldn't set HW encryption");
+   wiphy_err(wl->hw->wiphy, "Couldn't set HW encryption\n");
goto out;
}
 
diff --git a/drivers/net/wireless/ti/wl1251/boot.c 
b/drivers/net/wireless/ti/wl1251/boot.c
index 2000cd5..456629a 100644
--- a/drivers/net/wireless/ti/wl1251/boot.c
+++ b/drivers/net/wireless/ti/wl1251/boot.c
@@ -53,7 +53,7 @@ int wl1251_boot_soft_reset(struct wl1251 *wl)
if (time_after(jiffies, timeout)) {
/* 1.2 check pWhalBus->uSelfClearTime if the
 * timeout was reached */
-   wl1251_error("soft reset timeout");
+   wiphy_err(wl->hw->wiphy, "soft reset timeout\n");
return -1;
}
 
@@ -231,7 +231,7 @@ int wl1251_boot_run_firmware(struct wl1251 *wl)
wl1251_debug(DEBUG_BOOT, "chip id after firmware boot: 0x%x", chip_id);
 
if (chip_id != wl->chip_id) {
-   wl1251_error("chip id doesn't match after firmware boot");
+   wiphy_err(wl->hw->wiphy, "chip id doesn't match after firmware 
boot\n");
return -EIO;
}
 
@@ -242,8 +242,7 @@ int wl1251_boot_run_firmware(struct wl1251 *wl)
acx_intr = wl1251_reg_read32(wl, ACX_REG_INTERRUPT_NO_CLEAR);
 
if (acx_intr == 0x) {
-   wl1251_error("error reading hardware complete "
-"init indication");
+   wiphy_err(wl->hw->wiphy, "error reading hardware 
complete init indication\n");
return -EIO;
}
/* check that ACX_INTR_INIT_COMPLETE is enabled */
@@ -255,8 +254,7 @@ int wl1251_boot_run_firmware(struct wl1251 *wl)
}
 
if (loop > INIT_LOOP) {
-   wl1251_error("timeout waiting for the hardware to "
-"complete initialization");
+   wiphy_err(wl->hw->wiphy, "timeout waiting for the hardware to 
complete initialization\n");
return -EIO;
}
 
@@ -304,7 +302,7 @@ int wl1251_boot_run_firmware(struct wl1251 *wl)
 
ret = wl1251_event_unmask(wl);
if (ret < 0) {
-   wl1251_error("EVENT mask setting failed");
+   wiphy_err(wl->hw->wiphy, "EVENT mask setting failed\n");
return ret;
}
 
@@ -333,13 +331,13 @@ static int wl1251_boot_upload_firmware(struct wl1251 *wl)
CHUNK_SIZE);
 
if ((fw_data_len % 4) != 0) {
-   wl1251_error("firmware length not multiple of four");
+   wiphy_err(wl->hw->wiphy, "firmware length 

[PATCH 5/5] ti: wl1251: Convert wl1251_info to wl1251_debug

2016-03-07 Thread Joe Perches
These logging messages are always emitted at KERN_DEBUG.

Add a DEBUG_ALWAYS enum to the debug type enum and convert the
macro and uses from wl1251_info( to wl1251_debug(DEBUG_ALWAYS,

Miscellanea:

o Remove the now unused wl1251_info macro

Signed-off-by: Joe Perches 
---
 drivers/net/wireless/ti/wl1251/init.c   | 10 +-
 drivers/net/wireless/ti/wl1251/main.c   |  4 ++--
 drivers/net/wireless/ti/wl1251/sdio.c   |  4 ++--
 drivers/net/wireless/ti/wl1251/wl1251.h |  6 ++
 4 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/net/wireless/ti/wl1251/init.c 
b/drivers/net/wireless/ti/wl1251/init.c
index 796ccf4..29a990d 100644
--- a/drivers/net/wireless/ti/wl1251/init.c
+++ b/drivers/net/wireless/ti/wl1251/init.c
@@ -411,11 +411,11 @@ int wl1251_hw_init(struct wl1251 *wl)
goto out_free_data_path;
 
wl_mem_map = wl->target_mem_map;
-   wl1251_info("%d tx blocks at 0x%x, %d rx blocks at 0x%x",
-   wl_mem_map->num_tx_mem_blocks,
-   wl->data_path->tx_control_addr,
-   wl_mem_map->num_rx_mem_blocks,
-   wl->data_path->rx_control_addr);
+   wl1251_debug(DEBUG_ALWAYS, "%d tx blocks at 0x%x, %d rx blocks at 0x%x",
+wl_mem_map->num_tx_mem_blocks,
+wl->data_path->tx_control_addr,
+wl_mem_map->num_rx_mem_blocks,
+wl->data_path->rx_control_addr);
 
return 0;
 
diff --git a/drivers/net/wireless/ti/wl1251/main.c 
b/drivers/net/wireless/ti/wl1251/main.c
index 593bed9..32b8c98 100644
--- a/drivers/net/wireless/ti/wl1251/main.c
+++ b/drivers/net/wireless/ti/wl1251/main.c
@@ -423,7 +423,7 @@ static int wl1251_op_start(struct ieee80211_hw *hw)
 
wl->state = WL1251_STATE_ON;
 
-   wl1251_info("firmware booted (%s)", wl->fw_ver);
+   wl1251_debug(DEBUG_ALWAYS, "firmware booted (%s)", wl->fw_ver);
 
/* update hw/fw version info in wiphy struct */
wiphy->hw_version = wl->chip_id;
@@ -442,7 +442,7 @@ static void wl1251_op_stop(struct ieee80211_hw *hw)
 {
struct wl1251 *wl = hw->priv;
 
-   wl1251_info("down");
+   wl1251_debug(DEBUG_ALWAYS, "down");
 
wl1251_debug(DEBUG_MAC80211, "mac80211 stop");
 
diff --git a/drivers/net/wireless/ti/wl1251/sdio.c 
b/drivers/net/wireless/ti/wl1251/sdio.c
index cc52a97..d2fb7d1 100644
--- a/drivers/net/wireless/ti/wl1251/sdio.c
+++ b/drivers/net/wireless/ti/wl1251/sdio.c
@@ -290,12 +290,12 @@ static int wl1251_sdio_probe(struct sdio_func *func,
wl1251_sdio_ops.enable_irq = wl1251_enable_line_irq;
wl1251_sdio_ops.disable_irq = wl1251_disable_line_irq;
 
-   wl1251_info("using dedicated interrupt line");
+   wl1251_debug(DEBUG_ALWAYS, "using dedicated interrupt line");
} else {
wl1251_sdio_ops.enable_irq = wl1251_sdio_enable_irq;
wl1251_sdio_ops.disable_irq = wl1251_sdio_disable_irq;
 
-   wl1251_info("using SDIO interrupt");
+   wl1251_debug(DEBUG_ALWAYS, "using SDIO interrupt");
}
 
ret = wl1251_init_ieee80211(wl);
diff --git a/drivers/net/wireless/ti/wl1251/wl1251.h 
b/drivers/net/wireless/ti/wl1251/wl1251.h
index 62a40cc..705573f 100644
--- a/drivers/net/wireless/ti/wl1251/wl1251.h
+++ b/drivers/net/wireless/ti/wl1251/wl1251.h
@@ -47,6 +47,7 @@ enum {
DEBUG_MAC80211  = BIT(11),
DEBUG_CMD   = BIT(12),
DEBUG_ACX   = BIT(13),
+   DEBUG_ALWAYS= BIT(31),
DEBUG_ALL   = ~0,
 };
 
@@ -54,12 +55,9 @@ enum {
 
 #define DEBUG_DUMP_LIMIT 1024
 
-#define wl1251_info(fmt, arg...) \
-   printk(KERN_DEBUG DRIVER_PREFIX fmt "\n", ##arg)
-
 #define wl1251_debug(level, fmt, arg...) \
do { \
-   if (level & DEBUG_LEVEL) \
+   if (level == DEBUG_ALWAYS || (level & DEBUG_LEVEL)) \
printk(KERN_DEBUG DRIVER_PREFIX fmt "\n", ##arg); \
} while (0)
 
-- 
2.6.3.368.gf34be46



[PATCH 3/5] ti: wl1251: Convert wl1251_warning to wiphy_warn

2016-03-07 Thread Joe Perches
Use the more common logging mechanism.

Miscellanea:

o Coalesce formats
o Realign arguments
o Reflow to fit 80 columns

Signed-off-by: Joe Perches 
---
 drivers/net/wireless/ti/wl1251/acx.c| 91 -
 drivers/net/wireless/ti/wl1251/cmd.c| 10 ++--
 drivers/net/wireless/ti/wl1251/init.c   |  4 +-
 drivers/net/wireless/ti/wl1251/main.c   | 26 ++
 drivers/net/wireless/ti/wl1251/rx.c |  4 +-
 drivers/net/wireless/ti/wl1251/tx.c |  4 +-
 drivers/net/wireless/ti/wl1251/wl1251.h |  3 --
 7 files changed, 82 insertions(+), 60 deletions(-)

diff --git a/drivers/net/wireless/ti/wl1251/acx.c 
b/drivers/net/wireless/ti/wl1251/acx.c
index 23b4882..22d3daa 100644
--- a/drivers/net/wireless/ti/wl1251/acx.c
+++ b/drivers/net/wireless/ti/wl1251/acx.c
@@ -103,7 +103,8 @@ int wl1251_acx_wake_up_conditions(struct wl1251 *wl, u8 
wake_up_event,
ret = wl1251_cmd_configure(wl, ACX_WAKE_UP_CONDITIONS,
   wake_up, sizeof(*wake_up));
if (ret < 0) {
-   wl1251_warning("could not set wake up conditions: %d", ret);
+   wiphy_warn(wl->hw->wiphy, "could not set wake up conditions: 
%d\n",
+  ret);
goto out;
}
 
@@ -144,7 +145,7 @@ int wl1251_acx_fw_version(struct wl1251 *wl, char *buf, 
size_t len)
 
ret = wl1251_cmd_interrogate(wl, ACX_FW_REV, rev, sizeof(*rev));
if (ret < 0) {
-   wl1251_warning("ACX_FW_REV interrogate failed");
+   wiphy_warn(wl->hw->wiphy, "ACX_FW_REV interrogate failed\n");
goto out;
}
 
@@ -181,7 +182,8 @@ int wl1251_acx_tx_power(struct wl1251 *wl, int power)
 
ret = wl1251_cmd_configure(wl, DOT11_CUR_TX_PWR, acx, sizeof(*acx));
if (ret < 0) {
-   wl1251_warning("configure of tx power failed: %d", ret);
+   wiphy_warn(wl->hw->wiphy, "configure of tx power failed: %d\n",
+  ret);
goto out;
}
 
@@ -265,10 +267,11 @@ int wl1251_acx_data_path_params(struct wl1251 *wl,
 resp, sizeof(*resp));
 
if (ret < 0) {
-   wl1251_warning("failed to read data path parameters: %d", ret);
+   wiphy_warn(wl->hw->wiphy, "failed to read data path parameters: 
%d\n",
+  ret);
goto out;
} else if (resp->header.cmd.status != CMD_STATUS_SUCCESS) {
-   wl1251_warning("data path parameter acx status failed");
+   wiphy_warn(wl->hw->wiphy, "data path parameter acx status 
failed\n");
ret = -EIO;
goto out;
}
@@ -293,7 +296,8 @@ int wl1251_acx_rx_msdu_life_time(struct wl1251 *wl, u32 
life_time)
ret = wl1251_cmd_configure(wl, DOT11_RX_MSDU_LIFE_TIME,
   acx, sizeof(*acx));
if (ret < 0) {
-   wl1251_warning("failed to set rx msdu life time: %d", ret);
+   wiphy_warn(wl->hw->wiphy, "failed to set rx msdu life time: 
%d\n",
+  ret);
goto out;
}
 
@@ -319,7 +323,7 @@ int wl1251_acx_rx_config(struct wl1251 *wl, u32 config, u32 
filter)
ret = wl1251_cmd_configure(wl, ACX_RX_CFG,
   rx_config, sizeof(*rx_config));
if (ret < 0) {
-   wl1251_warning("failed to set rx config: %d", ret);
+   wiphy_warn(wl->hw->wiphy, "failed to set rx config: %d\n", ret);
goto out;
}
 
@@ -343,7 +347,8 @@ int wl1251_acx_pd_threshold(struct wl1251 *wl)
 
ret = wl1251_cmd_configure(wl, ACX_PD_THRESHOLD, pd, sizeof(*pd));
if (ret < 0) {
-   wl1251_warning("failed to set pd threshold: %d", ret);
+   wiphy_warn(wl->hw->wiphy, "failed to set pd threshold: %d\n",
+  ret);
goto out;
}
 
@@ -368,7 +373,8 @@ int wl1251_acx_slot(struct wl1251 *wl, enum acx_slot_type 
slot_time)
 
ret = wl1251_cmd_configure(wl, ACX_SLOT, slot, sizeof(*slot));
if (ret < 0) {
-   wl1251_warning("failed to set slot time: %d", ret);
+   wiphy_warn(wl->hw->wiphy, "failed to set slot time: %d\n",
+  ret);
goto out;
}
 
@@ -397,7 +403,8 @@ int wl1251_acx_group_address_tbl(struct wl1251 *wl, bool 
enable,
ret = wl1251_cmd_configure(wl, DOT11_GROUP_ADDRESS_TBL,
   acx, sizeof(*acx));
if (ret < 0) {
-   wl1251_warning("failed to set group addr table: %d", ret);
+   wiphy_warn(wl->hw->wiphy, "failed to set group addr table: 
%d\n",
+  ret);
goto out;
}
 
@@ -423,8 +430,8 @@ int wl1251_acx_service_period_timeout(struct wl1251 *wl)
ret = wl1251_cmd_configure(wl, ACX_SERVICE_PERIOD_TIMEOUT,
 

[PATCH 1/5] ti: Convert wl1271_ logging macros to dev_ or pr_

2016-03-07 Thread Joe Perches
Use the more common logging mechanism passing wl->dev where
appropriate.  Remove the macros.  Add argument struct wl1271 *wl to
some functions to make these logging mechanisms work.

Miscellanea:

o Coalesce formats, add required trailing \n to formats
  Some formats already had previously incorrect \n uses
o Realign arguments
o Correct a couple typos and grammar defects
o Split a multiple line error message to multiple calls of dev_err
o Add #define pr_fmt when pr_ is used
o Remove unnecessary/duplicate pr_fmt use from wl1271_debug macro

Signed-off-by: Joe Perches 
---
 drivers/net/wireless/ti/wl12xx/acx.c  |   2 +-
 drivers/net/wireless/ti/wl12xx/cmd.c  |  20 +--
 drivers/net/wireless/ti/wl12xx/main.c |  34 ++--
 drivers/net/wireless/ti/wl12xx/scan.c |  24 +--
 drivers/net/wireless/ti/wl18xx/acx.c  |  25 +--
 drivers/net/wireless/ti/wl18xx/cmd.c  |  20 +--
 drivers/net/wireless/ti/wl18xx/debugfs.c  |   2 +-
 drivers/net/wireless/ti/wl18xx/event.c|   8 +-
 drivers/net/wireless/ti/wl18xx/main.c |  50 +++---
 drivers/net/wireless/ti/wl18xx/scan.c |  16 +-
 drivers/net/wireless/ti/wl18xx/tx.c   |   8 +-
 drivers/net/wireless/ti/wlcore/acx.c  | 132 
 drivers/net/wireless/ti/wlcore/boot.c |  45 +++---
 drivers/net/wireless/ti/wlcore/cmd.c  | 103 +++--
 drivers/net/wireless/ti/wlcore/debug.h|  14 +-
 drivers/net/wireless/ti/wlcore/debugfs.c  |  54 +++
 drivers/net/wireless/ti/wlcore/event.c|  14 +-
 drivers/net/wireless/ti/wlcore/main.c | 248 --
 drivers/net/wireless/ti/wlcore/ps.c   |  15 +-
 drivers/net/wireless/ti/wlcore/rx.c   |  26 ++--
 drivers/net/wireless/ti/wlcore/scan.c |   4 +-
 drivers/net/wireless/ti/wlcore/sysfs.c|   8 +-
 drivers/net/wireless/ti/wlcore/testmode.c |  14 +-
 drivers/net/wireless/ti/wlcore/tx.c   |  14 +-
 drivers/net/wireless/ti/wlcore/wlcore_i.h |   3 -
 25 files changed, 464 insertions(+), 439 deletions(-)

diff --git a/drivers/net/wireless/ti/wl12xx/acx.c 
b/drivers/net/wireless/ti/wl12xx/acx.c
index bea06b2..4a11158 100644
--- a/drivers/net/wireless/ti/wl12xx/acx.c
+++ b/drivers/net/wireless/ti/wl12xx/acx.c
@@ -42,7 +42,7 @@ int wl1271_acx_host_if_cfg_bitmap(struct wl1271 *wl, u32 
host_cfg_bitmap)
ret = wl1271_cmd_configure(wl, ACX_HOST_IF_CFG_BITMAP,
   bitmap_conf, sizeof(*bitmap_conf));
if (ret < 0) {
-   wl1271_warning("wl1271 bitmap config opt failed: %d", ret);
+   dev_warn(wl->dev, "wl1271 bitmap config opt failed: %d\n", ret);
goto out;
}
 
diff --git a/drivers/net/wireless/ti/wl12xx/cmd.c 
b/drivers/net/wireless/ti/wl12xx/cmd.c
index 7485dba..8f358d3 100644
--- a/drivers/net/wireless/ti/wl12xx/cmd.c
+++ b/drivers/net/wireless/ti/wl12xx/cmd.c
@@ -54,7 +54,7 @@ int wl1271_cmd_ext_radio_parms(struct wl1271 *wl)
 
ret = wl1271_cmd_test(wl, ext_radio_parms, sizeof(*ext_radio_parms), 0);
if (ret < 0)
-   wl1271_warning("TEST_CMD_INI_FILE_RF_EXTENDED_PARAM failed");
+   dev_warn(wl->dev, "TEST_CMD_INI_FILE_RF_EXTENDED_PARAM 
failed\n");
 
kfree(ext_radio_parms);
return ret;
@@ -73,7 +73,7 @@ int wl1271_cmd_general_parms(struct wl1271 *wl)
return -ENODEV;
 
if (gp->tx_bip_fem_manufacturer >= WL1271_INI_FEM_MODULE_COUNT) {
-   wl1271_warning("FEM index from INI out of bounds");
+   dev_warn(wl->dev, "FEM index from INI out of bounds\n");
return -EINVAL;
}
 
@@ -97,7 +97,7 @@ int wl1271_cmd_general_parms(struct wl1271 *wl)
 
ret = wl1271_cmd_test(wl, gen_parms, sizeof(*gen_parms), answer);
if (ret < 0) {
-   wl1271_warning("CMD_INI_FILE_GENERAL_PARAM failed");
+   dev_warn(wl->dev, "CMD_INI_FILE_GENERAL_PARAM failed\n");
goto out;
}
 
@@ -105,7 +105,7 @@ int wl1271_cmd_general_parms(struct wl1271 *wl)
gen_parms->general_params.tx_bip_fem_manufacturer;
 
if (gp->tx_bip_fem_manufacturer >= WL1271_INI_FEM_MODULE_COUNT) {
-   wl1271_warning("FEM index from FW out of bounds");
+   dev_warn(wl->dev, "FEM index from FW out of bounds\n");
ret = -EINVAL;
goto out;
}
@@ -140,7 +140,7 @@ int wl128x_cmd_general_parms(struct wl1271 *wl)
return -ENODEV;
 
if (gp->tx_bip_fem_manufacturer >= WL1271_INI_FEM_MODULE_COUNT) {
-   wl1271_warning("FEM index from ini out of bounds");
+   dev_warn(wl->dev, "FEM index from ini out of bounds\n");
return -EINVAL;
}
 
@@ -165,7 +165,7 @@ int wl128x_cmd_general_parms(struct wl1271 *wl)
 
ret = wl1271_cmd_test(wl, gen_parms, sizeof(*gen_parms), answer);
if (ret < 0) {
-   wl1271_warning("CMD_INI_FILE_GENERAL_PARAM failed");
+ 

[PATCH 0/5] wireless: ti: Convert specialized logging macros to kernel style

2016-03-07 Thread Joe Perches
Using the normal kernel logging mechanisms makes this code
a bit more like other wireless drivers.

Joe Perches (5):
  ti: Convert wl1271_ logging macros to dev_ or pr_
  ti: wl1251: Convert wl1251_error to wiphy_err/pr_err
  ti: wl1251: Convert wl1251_warning to wiphy_warn
  ti: wl1251: Convert wl1251_notice to wiphy_info/pr_info
  ti: wl1251: Convert wl1251_info to wl1251_debug

 drivers/net/wireless/ti/wl1251/acx.c  |  97 +++-
 drivers/net/wireless/ti/wl1251/boot.c |  18 +--
 drivers/net/wireless/ti/wl1251/cmd.c  |  52 ---
 drivers/net/wireless/ti/wl1251/event.c|   2 +-
 drivers/net/wireless/ti/wl1251/init.c |  27 ++--
 drivers/net/wireless/ti/wl1251/io.c   |   3 +-
 drivers/net/wireless/ti/wl1251/main.c |  75 +
 drivers/net/wireless/ti/wl1251/ps.c   |   2 +-
 drivers/net/wireless/ti/wl1251/rx.c   |   6 +-
 drivers/net/wireless/ti/wl1251/sdio.c |  25 +--
 drivers/net/wireless/ti/wl1251/spi.c  |  19 +--
 drivers/net/wireless/ti/wl1251/tx.c   |   9 +-
 drivers/net/wireless/ti/wl1251/wl1251.h   |  15 +-
 drivers/net/wireless/ti/wl12xx/acx.c  |   2 +-
 drivers/net/wireless/ti/wl12xx/cmd.c  |  20 +--
 drivers/net/wireless/ti/wl12xx/main.c |  34 ++--
 drivers/net/wireless/ti/wl12xx/scan.c |  24 +--
 drivers/net/wireless/ti/wl18xx/acx.c  |  25 +--
 drivers/net/wireless/ti/wl18xx/cmd.c  |  20 +--
 drivers/net/wireless/ti/wl18xx/debugfs.c  |   2 +-
 drivers/net/wireless/ti/wl18xx/event.c|   8 +-
 drivers/net/wireless/ti/wl18xx/main.c |  50 +++---
 drivers/net/wireless/ti/wl18xx/scan.c |  16 +-
 drivers/net/wireless/ti/wl18xx/tx.c   |   8 +-
 drivers/net/wireless/ti/wlcore/acx.c  | 132 
 drivers/net/wireless/ti/wlcore/boot.c |  45 +++---
 drivers/net/wireless/ti/wlcore/cmd.c  | 103 +++--
 drivers/net/wireless/ti/wlcore/debug.h|  14 +-
 drivers/net/wireless/ti/wlcore/debugfs.c  |  54 +++
 drivers/net/wireless/ti/wlcore/event.c|  14 +-
 drivers/net/wireless/ti/wlcore/main.c | 248 --
 drivers/net/wireless/ti/wlcore/ps.c   |  15 +-
 drivers/net/wireless/ti/wlcore/rx.c   |  26 ++--
 drivers/net/wireless/ti/wlcore/scan.c |   4 +-
 drivers/net/wireless/ti/wlcore/sysfs.c|   8 +-
 drivers/net/wireless/ti/wlcore/testmode.c |  14 +-
 drivers/net/wireless/ti/wlcore/tx.c   |  14 +-
 drivers/net/wireless/ti/wlcore/wlcore_i.h |   3 -
 38 files changed, 653 insertions(+), 600 deletions(-)

-- 
2.6.3.368.gf34be46



Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Tom Herbert
On Mon, Mar 7, 2016 at 3:52 PM, Alexander Duyck
 wrote:
> On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert  wrote:
>> On Mon, Mar 7, 2016 at 5:56 AM, David Laight  wrote:
>>> From: Alexander Duyck
>>>  ...
 Actually probably the easiest way to go on x86 is to just replace the
 use of len with (len >> 6) and use decl or incl instead of addl or
 subl, and lea instead of addq for the buff address.  None of those
 instructions effect the carry flag as this is how such loops were
 intended to be implemented.

 I've been doing a bit of testing and that seems to work without
 needing the adcq until after you exit the loop, but doesn't give that
 much of a gain in speed for dropping the instruction from the
 hot-path.  I suspect we are probably memory bottle-necked already in
 the loop so dropping an instruction or two doesn't gain you much.
>>>
>>> Right, any superscalar architecture gives you some instructions
>>> 'for free' if they can execute at the same time as those on the
>>> critical path (in this case the memory reads and the adc).
>>> This is why loop unrolling can be pointless.
>>>
>>> So the loop:
>>> 10: addc %rax,(%rdx,%rcx,8)
>>> inc %rcx
>>> jnz 10b
>>> could easily be as fast as anything that doesn't use the 'new'
>>> instructions that use the overflow flag.
>>> That loop might be measurable faster for aligned buffers.
>>
>> Tested by replacing the unrolled loop in my patch with just:
>>
>> if (len >= 8) {
>> asm("clc\n\t"
>> "0: adcq (%[src],%%rcx,8),%[res]\n\t"
>> "decl %%ecx\n\t"
>> "jge 0b\n\t"
>> "adcq $0, %[res]\n\t"
>> : [res] "=r" (result)
>> : [src] "r" (buff), "[res]" (result), "c"
>> ((len >> 3) - 1));
>> }
>>
>> This seems to be significantly slower:
>>
>> 1400 bytes: 797 nsecs vs. 202 nsecs
>> 40 bytes: 6.5 nsecs vs. 26.8 nsecs
>
> You still need the loop unrolling as the decl and jge have some
> overhead.  You can't just get rid of it with a single call in a tight
> loop but it should improve things.  The gain from what I have seen
> ends up being minimal though.  I haven't really noticed all that much
> in my tests anyway.
>
> I have been doing some testing and the penalty for an unaligned
> checksum can get pretty big if the data-set is big enough.  I was
> messing around and tried doing a checksum over 32K minus some offset
> and was seeing a penalty of about 200 cycles per 64K frame.
>
Out of how many cycles to checksum 64K though?

> One thought I had is that we may want to look into making an inline
> function that we can call for compile-time defined lengths less than
> 64.  Maybe call it something like __csum_partial and we could then use
> that in place of csum_partial for all those headers that are a fixed
> length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
> we might be able to look at taking care of alignment for csum_partial
> which will improve the skb_checksum() case without impacting the
> header pulling cases as much since that code would be inlined
> elsewhere.
>
As I said previously, if alignment really is a factor then we can
check up front if a buffer crosses a page boundary and call the slow
path function (original code). I'm seeing a 1 nsec hit to add this
check.
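
(The up-front test amounts to something like this sketch, with
csum_partial_slow standing in for the original out-of-line routine:)

	/* stay on the fast path only when [buff, buff + len) sits
	 * within a single page
	 */
	if (((unsigned long)buff ^ ((unsigned long)buff + len - 1)) & PAGE_MASK)
		return csum_partial_slow(buff, len, sum);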

Tom

> - Alex


Re: Unexpected tcpv6 connection resets since linux 4.4

2016-03-07 Thread Cong Wang
Cc'ing netdev

On Sun, Mar 6, 2016 at 7:10 AM, Andreas Messer  wrote:
> Hi there,
>
> I have updated two of my machines in the last weeks to linux 4.4.1 and linux
> 4.4.3. It seems that since then I get unexpected TCPv6 connection resets when
> connecting to these machines remotely. The issue occurs with sshd and with a
> http service. /etc/hosts.deny and /etc/hosts.allow are empty on both server
> machines. I'm not well versed in IPv6 and have no idea what's going on.
> Please find attached a network trace from one of the machines when
> connecting with ssh (on port 23 for debugging).

Sounds like the problem fixed by the following commit:

commit 9cf7490360bf2c46a16b7525f899e4970c5fc144
Author: Eric Dumazet 
Date:   Tue Feb 2 19:31:12 2016 -0800

tcp: do not drop syn_recv on all icmp reports


Thanks.


>
> Redirects should be accepted according to settings:
>
> root@banana:/proc/sys/net/ipv6# cat conf/wlan0/forwarding
> 0
> root@banana:/proc/sys/net/ipv6# cat conf/wlan0/accept_redirects
> 1
> root@banana:/proc/sys/net/ipv6# uname -a
> Linux banana 4.4.1-banana #3 SMP Wed Feb 17 23:03:38 CET 2016 x86_64 GNU/Linux
>
> Are there any new network settings or features? My network consists of an
> ethernet/wlan router where the subnets of wlan and ethernet are identical
> (forced by the router, but this was never a problem before). The problem
> occurs when connecting from an ethernet machine to a wlan machine and when
> connecting from a wlan machine to a wlan machine. At the moment it's not
> possible to establish a connection with these machines using IPv6.
>
> Thanks for help!
>
> Cheers
> Andreas


Re: [PATCH v5 net-next] net: Implement fast csum_partial for x86_64

2016-03-07 Thread Alexander Duyck
On Mon, Mar 7, 2016 at 9:33 AM, Tom Herbert  wrote:
> On Mon, Mar 7, 2016 at 5:56 AM, David Laight  wrote:
>> From: Alexander Duyck
>>  ...
>>> Actually probably the easiest way to go on x86 is to just replace the
>>> use of len with (len >> 6) and use decl or incl instead of addl or
>>> subl, and lea instead of addq for the buff address.  None of those
>>> instructions effect the carry flag as this is how such loops were
>>> intended to be implemented.
>>>
>>> I've been doing a bit of testing and that seems to work without
>>> needing the adcq until after you exit the loop, but doesn't give that
>>> much of a gain in speed for dropping the instruction from the
>>> hot-path.  I suspect we are probably memory bottle-necked already in
>>> the loop so dropping an instruction or two doesn't gain you much.
>>
>> Right, any superscalar architecture gives you some instructions
>> 'for free' if they can execute at the same time as those on the
>> critical path (in this case the memory reads and the adc).
>> This is why loop unrolling can be pointless.
>>
>> So the loop:
>> 10: addc %rax,(%rdx,%rcx,8)
>> inc %rcx
>> jnz 10b
>> could easily be as fast as anything that doesn't use the 'new'
>> instructions that use the overflow flag.
>> That loop might be measurable faster for aligned buffers.
>
> Tested by replacing the unrolled loop in my patch with just:
>
> if (len >= 8) {
> asm("clc\n\t"
> "0: adcq (%[src],%%rcx,8),%[res]\n\t"
> "decl %%ecx\n\t"
> "jge 0b\n\t"
> "adcq $0, %[res]\n\t"
> : [res] "=r" (result)
> : [src] "r" (buff), "[res]" (result), "c"
> ((len >> 3) - 1));
> }
>
> This seems to be significantly slower:
>
> 1400 bytes: 797 nsecs vs. 202 nsecs
> 40 bytes: 6.5 nsecs vs. 26.8 nsecs

You still need the loop unrolling as the decl and jge have some
overhead.  You can't just get rid of it with a single call in a tight
loop but it should improve things.  The gain from what I have seen
ends up being minimal though.  I haven't really noticed all that much
in my tests anyway.

I have been doing some testing and the penalty for an unaligned
checksum can get pretty big if the data-set is big enough.  I was
messing around and tried doing a checksum over 32K minus some offset
and was seeing a penalty of about 200 cycles per 64K frame.

One thought I had is that we may want to look into making an inline
function that we can call for compile-time defined lengths less than
64.  Maybe call it something like __csum_partial and we could then use
that in place of csum_partial for all those headers that are a fixed
length that we pull such as UDP, VXLAN, Ethernet, and the rest.  Then
we might be able to look at taking care of alignment for csum_partial
which will improve the skb_checksum() case without impacting the
header pulling cases as much since that code would be inlined
elsewhere.

- Alex


Re: [net-next PATCH 3/4] vxlan: Enforce IP ID verification on outer headers

2016-03-07 Thread Jesse Gross
On Mon, Mar 7, 2016 at 3:06 PM, Alex Duyck  wrote:
> On Mon, Mar 7, 2016 at 11:09 AM, David Miller  wrote:
>> From: Or Gerlitz 
>> Date: Mon, 7 Mar 2016 20:05:20 +0200
>>
>>> On Mon, Mar 7, 2016 at 7:22 PM, Alexander Duyck  wrote:
 This change enforces the IP ID verification on outer headers.  As a result
 if the DF flag is not set on the outer header we will force the flow to be
 flushed in the event that the IP ID is out of sequence with the existing
 flow.
>>>
>>> Can you please state the precise requirement for aggregation w.r.t IP
>>> IDs here? and point to where/how this is enforced, e.g for
>>> non-tunneled TCP GRO-ing?
>>
>> I also didn't see a nice "PATCH 0/4" posting explaining this series and
>> I'd really like to see that.
>
> Sorry about that.  I forgot to add the cover page when I sent this.
>
> The enforcement is coming from the IP and TCP layers.  If you take a
> look in inet_gro_receive we have the NAPI_GRO_CB(p)->flush_id value
> being populated based on the difference between the expected ID and
> the received one.  So for IPv4 we overwrite it, and for IPv6 we set it
> to 0.  The only consumer currently using it is TCP in tcp_gro_receive.
> The problem is that with tunnels we lose the data for the outer header
> when the inner one overwrites it; as a result we can currently put
> whatever we want in the outer IP ID and it will be accepted.
>
> The patch set is based off of a conversation several of us had on the
> list about doing TSO for tunnels and the fact that the IP IDs for the
> outer header have to advance.  It makes it easier for me to validate
> that I am doing things properly if GRO doesn't destroy the IP ID data
> for the outer headers.

In net/ipv4/af_inet.c:inet_gro_receive() there is the following
comment above where NAPI_GRO_CB(p)->flush_id is set:

/* Save the IP ID check to be included later when we get to
* the transport layer so only the inner most IP ID is checked.
* This is because some GSO/TSO implementations do not
* correctly increment the IP ID for the outer hdrs.
*/

There was a long discussion about this a couple of years ago and the
conclusion was that the inner IP ID is really the important one
in the case of encapsulation. Obviously, things like TCP/IP header
compression don't apply to the outer encapsulation header.
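
In miniature, the bookkeeping under discussion looks like this (a paraphrase
of the GRO logic, not a verbatim excerpt):

	/* inet_gro_receive(), conceptually: record how far the arriving
	 * packet's ID is from what the held flow 'p' expects, and let the
	 * transport layer decide whether that forces a flush
	 */
	NAPI_GRO_CB(p)->flush_id =
		(u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ ntohs(iph->id);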


Re: 4.1.12 kernel crash in rtnetlink_put_metrics

2016-03-07 Thread Daniel Borkmann

On 03/07/2016 11:15 PM, subas...@codeaurora.org wrote:

On , Daniel Borkmann wrote:

Hi Andrew,

thanks for the report!

( Making the trace a bit more readable ... )

[41358.475254] BUG: unable to handle kernel NULL pointer dereference at (null)
[41358.475333] IP: [] rtnetlink_put_metrics+0x50/0x180
[...]
Call Trace:
[41358.476522] [] ? __nla_reserve+0x23/0xe0
[41358.476557] [] ? __nla_put+0x9/0xb0
[41358.476595] [] ? fib_dump_info+0x15e/0x3e0
[41358.476636] [] ? irq_entries_start+0x639/0x678
[41358.476671] [] ? fib_table_dump+0xf3/0x180
[41358.476708] [] ? inet_dump_fib+0x7d/0x100
[41358.476746] [] ? netlink_dump+0x121/0x270
[41358.476781] [] ? skb_free_datagram+0x12/0x40
[41358.476818] [] ? netlink_recvmsg+0x244/0x360
[41358.476855] [] ? sock_recvmsg+0x1d/0x30
[41358.476890] [] ? sock_recvmsg_nosec+0x30/0x30
[41358.476924] [] ? ___sys_recvmsg+0x9c/0x120
[41358.476958] [] ? sock_recvmsg_nosec+0x30/0x30
[41358.476994] [] ? update_cfs_rq_blocked_load+0xc4/0x130
[41358.477030] [] ? hrtimer_forward+0xa4/0x1c0
[41358.477065] [] ? sockfd_lookup_light+0x1d/0x80
[41358.477099] [] ? __sys_recvmsg+0x3e/0x80
[41358.477134] [] ? SyS_socketcall+0xb1/0x2a0
[41358.477168] [] ? handle_irq_event+0x3c/0x60
[41358.477203] [] ? handle_edge_irq+0x7d/0x100
[41358.477238] [] ? rps_trigger_softirq+0x26/0x30
[41358.477273] [] ? flush_smp_call_function_queue+0x83/0x120
[41358.477307] [] ? syscall_call+0x7/0x7
[...]

Strange that rtnetlink_put_metrics() itself is not part of the above
call trace (it's an exported symbol).

So, your analysis suggests that metrics itself is NULL in this case?
(Can you confirm that?)

How frequently does this trigger? Are the observed call traces all of the same kind?

Is there an easy way to reproduce this?

I presume you don't use any per route congestion control settings, right?

Thanks,
Daniel


Hi Daniel

I am observing a similar crash as well. This is on a 3.10 based ARM64 kernel.
Unfortunately, the crash is occurring in a regression test rack, so I am not
sure of the exact test case to reproduce this crash. This seems to have
occurred twice so far with both cases having metrics as NULL.

 |  rt_=_0xFFC012DA4300 -> (
 |dst = (
 |  callback_head = (next = 0x0, func = 0xFF800262D040),
 |  child = 0xFFC03B8BC2B0,
 |  dev = 0xFFC012DA4318,
 |  ops = 0xFFC012DA4318,
 |  _metrics = 0,
 |  expires = 0,
 |  path = 0x0,
 |  from = 0x0,
 |  xfrm = 0x0,
 |  input = 0xFFC0AD498000,
 |  output = 0x00010401C411,
 |  flags = 0,
 |  pending_confirm = 0,
 |  error = 0,
 |  obsolete = 0,
 |  header_len = 3,
 |  trailer_len = 0,
 |  __pad2 = 4096,

168539.549000:   <6> Process ip (pid: 28473, stack limit = 0xffc04b584060)
168539.549006:   <2> Call trace:
168539.549016:   <2> [] rtnetlink_put_metrics+0x4c/0xec
168539.549027:   <2> [] rt6_fill_node.isra.34+0x2b8/0x3c8
168539.549035:   <2> [] rt6_dump_route+0x68/0x7c
168539.549043:   <2> [] fib6_dump_node+0x2c/0x74
168539.549051:   <2> [] fib6_walk_continue+0xf8/0x1b4
168539.549059:   <2> [] fib6_walk+0x5c/0xb8
168539.549067:   <2> [] inet6_dump_fib+0x104/0x234
168539.549076:   <2> [] netlink_dump+0x7c/0x1cc
168539.549084:   <2> [] __netlink_dump_start+0x128/0x170
168539.549093:   <2> [] rtnetlink_rcv_msg+0x12c/0x1a0
168539.549101:   <2> [] netlink_rcv_skb+0x64/0xc8
168539.549110:   <2> [] rtnetlink_rcv+0x1c/0x2c
168539.549117:   <2> [] netlink_unicast+0x108/0x1b8
168539.549125:   <2> [] netlink_sendmsg+0x27c/0x2d4
168539.549134:   <2> [] sock_sendmsg+0x8c/0xb0
168539.549143:   <2> [] SyS_sendto+0xcc/0x110

I am using the following patch as a workaround now. I do not have any
per route congestion control settings enabled.
Any pointers to debug this would be greatly appreciated.


Hmm, if it was 4.1.X like in the original reporter's case, I might have thought
something like commit 0a1f59620068 ("ipv6: Initialize rt6_info properly
in ip6_blackhole_route()") ... any chance on reproducing this on a latest
kernel?


diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a67310e..c63098e 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -566,7 +566,7 @@ int rtnetlink_put_metrics(struct sk_buff *skb, u32 *metrics)
 int i, valid = 0;

 mx = nla_nest_start(skb, RTA_METRICS);
-   if (mx == NULL)
+   if (mx == NULL || metrics == NULL)
 return -ENOBUFS;

 for (i = 0; i < RTAX_MAX; i++) {







Re: [PATCH net] sctp: fix copying more bytes than expected in sctp_add_bind_addr

2016-03-07 Thread Marcelo Ricardo Leitner

Hi,

Em 07-03-2016 20:17, kbuild test robot escreveu:

Hi Marcelo,

[auto build test WARNING on net/master]

url:
https://github.com/0day-ci/linux/commits/Marcelo-Ricardo-Leitner/sctp-fix-copying-more-bytes-than-expected-in-sctp_add_bind_addr/20160308-052009


coccinelle warnings: (new ones prefixed by >>)


net/sctp/bind_addr.c:458:42-48: ERROR: application of sizeof to pointer


Please review and possibly fold the followup patch.


Oops, nice catch, thanks.

I can fold it if Dave prefers, no problem. I'll wait for a confirmation.

  Marcelo


[PATCH net-next] net: dsa: mv88e6xxx: avoid writing the same mode

2016-03-07 Thread Vivien Didelot
There is no need to rewrite the 802.1Q port mode with the same value.
This avoids messages such as:

[  401.954836] dsa dsa@0 lan0: 802.1Q Mode: Disabled (was Disabled)

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 1aee42d..5f07524 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -1765,16 +1765,21 @@ int mv88e6xxx_port_vlan_filtering(struct dsa_switch 
*ds, int port,
 
old = ret & PORT_CONTROL_2_8021Q_MASK;
 
-   ret &= ~PORT_CONTROL_2_8021Q_MASK;
-   ret |= new & PORT_CONTROL_2_8021Q_MASK;
+   if (new != old) {
+   ret &= ~PORT_CONTROL_2_8021Q_MASK;
+   ret |= new & PORT_CONTROL_2_8021Q_MASK;
 
-   ret = _mv88e6xxx_reg_write(ds, REG_PORT(port), PORT_CONTROL_2, ret);
-   if (ret < 0)
-   goto unlock;
+   ret = _mv88e6xxx_reg_write(ds, REG_PORT(port), PORT_CONTROL_2,
+  ret);
+   if (ret < 0)
+   goto unlock;
+
+   netdev_dbg(ds->ports[port], "802.1Q Mode %s (was %s)\n",
+  mv88e6xxx_port_8021q_mode_names[new],
+  mv88e6xxx_port_8021q_mode_names[old]);
+   }
 
-   netdev_dbg(ds->ports[port], "802.1Q Mode: %s (was %s)\n",
-  mv88e6xxx_port_8021q_mode_names[new],
-  mv88e6xxx_port_8021q_mode_names[old]);
+   ret = 0;
 unlock:
mutex_unlock(&ps->smi_mutex);
 
-- 
2.7.2



[PATCH net-next] net: dsa: mv88e6xxx: rework port state setter

2016-03-07 Thread Vivien Didelot
Apply a few non-functional changes on the port state setter:

  * add a dynamic debug message with state names to track changes
  * explicit states checking instead of assuming their numeric values
  * lock mutex only once when changing several port states
  * use bitmap macros to declare and access port_state_update_mask

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 54 +++--
 drivers/net/dsa/mv88e6xxx.h |  2 +-
 2 files changed, 34 insertions(+), 22 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index d11c9d5..3a58a8a 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -1051,39 +1051,49 @@ static int _mv88e6xxx_atu_remove(struct dsa_switch *ds, 
u16 fid, int port,
return _mv88e6xxx_atu_move(ds, fid, port, 0x0f, static_too);
 }
 
-static int mv88e6xxx_set_port_state(struct dsa_switch *ds, int port, u8 state)
+static const char * const mv88e6xxx_port_state_names[] = {
+   [PORT_CONTROL_STATE_DISABLED] = "Disabled",
+   [PORT_CONTROL_STATE_BLOCKING] = "Blocking/Listening",
+   [PORT_CONTROL_STATE_LEARNING] = "Learning",
+   [PORT_CONTROL_STATE_FORWARDING] = "Forwarding",
+};
+
+static int _mv88e6xxx_port_state(struct dsa_switch *ds, int port, u8 state)
 {
-   struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
int reg, ret = 0;
u8 oldstate;
 
-   mutex_lock(&ps->smi_mutex);
-
reg = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_CONTROL);
-   if (reg < 0) {
-   ret = reg;
-   goto abort;
-   }
+   if (reg < 0)
+   return reg;
 
oldstate = reg & PORT_CONTROL_STATE_MASK;
+
if (oldstate != state) {
/* Flush forwarding database if we're moving a port
 * from Learning or Forwarding state to Disabled or
 * Blocking or Listening state.
 */
-   if (oldstate >= PORT_CONTROL_STATE_LEARNING &&
-   state <= PORT_CONTROL_STATE_BLOCKING) {
+   if ((oldstate == PORT_CONTROL_STATE_LEARNING ||
+oldstate == PORT_CONTROL_STATE_FORWARDING)
+   && (state == PORT_CONTROL_STATE_DISABLED ||
+   state == PORT_CONTROL_STATE_BLOCKING)) {
ret = _mv88e6xxx_atu_remove(ds, 0, port, false);
if (ret)
-   goto abort;
+   return ret;
}
+
reg = (reg & ~PORT_CONTROL_STATE_MASK) | state;
ret = _mv88e6xxx_reg_write(ds, REG_PORT(port), PORT_CONTROL,
   reg);
+   if (ret)
+   return ret;
+
+   netdev_dbg(ds->ports[port], "PortState %s (was %s)\n",
+  mv88e6xxx_port_state_names[state],
+  mv88e6xxx_port_state_names[oldstate]);
}
 
-abort:
-   mutex_unlock(&ps->smi_mutex);
return ret;
 }
 
@@ -1146,13 +1156,11 @@ int mv88e6xxx_port_stp_update(struct dsa_switch *ds, 
int port, u8 state)
break;
}
 
-   netdev_dbg(ds->ports[port], "port state %d [%d]\n", state, stp_state);
-
/* mv88e6xxx_port_stp_update may be called with softirqs disabled,
 * so we can not update the port state directly but need to schedule it.
 */
ps->ports[port].state = stp_state;
-   set_bit(port, &ps->port_state_update_mask);
+   set_bit(port, ps->port_state_update_mask);
schedule_work(>bridge_work);
 
return 0;
@@ -2228,11 +2236,15 @@ static void mv88e6xxx_bridge_work(struct work_struct 
*work)
ps = container_of(work, struct mv88e6xxx_priv_state, bridge_work);
ds = ((struct dsa_switch *)ps) - 1;
 
-   while (ps->port_state_update_mask) {
-   port = __ffs(ps->port_state_update_mask);
-   clear_bit(port, &ps->port_state_update_mask);
-   mv88e6xxx_set_port_state(ds, port, ps->ports[port].state);
-   }
+   mutex_lock(&ps->smi_mutex);
+
+   for (port = 0; port < ps->num_ports; ++port)
+   if (test_and_clear_bit(port, ps->port_state_update_mask) &&
+   _mv88e6xxx_port_state(ds, port, ps->ports[port].state))
+   netdev_warn(ds->ports[port], "failed to update state to 
%s\n",
+   
mv88e6xxx_port_state_names[ps->ports[port].state]);
+
+   mutex_unlock(&ps->smi_mutex);
 }
 
 static int mv88e6xxx_setup_port(struct dsa_switch *ds, int port)
diff --git a/drivers/net/dsa/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx.h
index d7b088d..3425616 100644
--- a/drivers/net/dsa/mv88e6xxx.h
+++ b/drivers/net/dsa/mv88e6xxx.h
@@ -426,7 +426,7 @@ struct mv88e6xxx_priv_state {
 
struct mv88e6xxx_priv_port  ports[DSA_MAX_PORTS];
 
-   unsigned long 

[PATCH net-next] net: dsa: mv88e6xxx: read then write PVID

2016-03-07 Thread Vivien Didelot
The port register 0x07 contains more options than just the default VID,
even though they are not used yet. So prefer a read-then-write operation
over a direct write.

This also allows to keep track of the change through dynamic debug.

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx.c | 30 ++
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 3a58a8a..1aee42d 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -1166,23 +1166,45 @@ int mv88e6xxx_port_stp_update(struct dsa_switch *ds, 
int port, u8 state)
return 0;
 }
 
-static int _mv88e6xxx_port_pvid_get(struct dsa_switch *ds, int port, u16 *pvid)
+static int _mv88e6xxx_port_pvid(struct dsa_switch *ds, int port, u16 *new,
+   u16 *old)
 {
+   u16 pvid;
int ret;
 
ret = _mv88e6xxx_reg_read(ds, REG_PORT(port), PORT_DEFAULT_VLAN);
if (ret < 0)
return ret;
 
-   *pvid = ret & PORT_DEFAULT_VLAN_MASK;
+   pvid = ret & PORT_DEFAULT_VLAN_MASK;
+
+   if (new) {
+   ret &= ~PORT_DEFAULT_VLAN_MASK;
+   ret |= *new & PORT_DEFAULT_VLAN_MASK;
+
+   ret = _mv88e6xxx_reg_write(ds, REG_PORT(port),
+  PORT_DEFAULT_VLAN, ret);
+   if (ret < 0)
+   return ret;
+
+   netdev_dbg(ds->ports[port], "DefaultVID %d (was %d)\n", *new,
+  pvid);
+   }
+
+   if (old)
+   *old = pvid;
 
return 0;
 }
 
+static int _mv88e6xxx_port_pvid_get(struct dsa_switch *ds, int port, u16 *pvid)
+{
+   return _mv88e6xxx_port_pvid(ds, port, NULL, pvid);
+}
+
 static int _mv88e6xxx_port_pvid_set(struct dsa_switch *ds, int port, u16 pvid)
 {
-   return _mv88e6xxx_reg_write(ds, REG_PORT(port), PORT_DEFAULT_VLAN,
-  pvid & PORT_DEFAULT_VLAN_MASK);
+   return _mv88e6xxx_port_pvid(ds, port, &pvid, NULL);
 }
 
 static int _mv88e6xxx_vtu_wait(struct dsa_switch *ds)
-- 
2.7.2



Re: [PATCH v3 0/8] arm64: rockchip: Initial GeekBox enablement

2016-03-07 Thread Dinh Nguyen
On Mon, Mar 7, 2016 at 11:15 AM, Andreas Färber  wrote:
> Am 07.03.2016 um 16:52 schrieb Giuseppe CAVALLARO:
>> On 3/7/2016 4:46 PM, Andreas Färber wrote:
>>> Am 07.03.2016 um 16:09 schrieb Giuseppe CAVALLARO:
 On 3/7/2016 3:27 PM, Andreas Färber wrote:
> Indeed, reverting Gabriel's commit fixes the observed error messages
>>> [...]
> However, I am unable to ping any hosts on the network now.

 hmm, this could be another problem. I wonder if you can
 check which recent patch is introducing the problem on ARM64.
 For example if this depends on Oct_2015 update.
>>>
>>> I've had success reverting drivers/net/ethernet/stmicro/ up to and
>>> including "stmmac: first frame prep at the end of xmit routine", i.e.
>>> top 7 commits.
>>
>> Andreas, I will check it and let you know asap.
>
> I verified that it's just these two commits that I need to revert:
>
> "stmmac: Fix 'eth0: No PHY found' regression"
> "stmmac: first frame prep at the end of xmit routine"
>
> Those in between don't cause conflicts and seem to work okay.
>

I'm seeing the same issue on the SoCFPGA platform:

libphy: PHY stmmac-0: not found
eth0: Could not attach to PHY
stmmac_open: Cannot attach to PHY (error: -19)

If I just revert:

 "stmmac: Fix 'eth0: No PHY found' regression"

then the issue goes away.

Thanks,
Dinh


Re: [PATCH net] sctp: fix copying more bytes than expected in sctp_add_bind_addr

2016-03-07 Thread kbuild test robot
Hi Marcelo,

[auto build test WARNING on net/master]

url:
https://github.com/0day-ci/linux/commits/Marcelo-Ricardo-Leitner/sctp-fix-copying-more-bytes-than-expected-in-sctp_add_bind_addr/20160308-052009


coccinelle warnings: (new ones prefixed by >>)

>> net/sctp/bind_addr.c:458:42-48: ERROR: application of sizeof to pointer

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


[PATCH] sctp: fix noderef.cocci warnings

2016-03-07 Thread kbuild test robot
net/sctp/bind_addr.c:458:42-48: ERROR: application of sizeof to pointer

 sizeof when applied to a pointer typed expression gives the size of
 the pointer

Generated by: scripts/coccinelle/misc/noderef.cocci

CC: Marcelo Ricardo Leitner 
Signed-off-by: Fengguang Wu 
---

 bind_addr.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/sctp/bind_addr.c
+++ b/net/sctp/bind_addr.c
@@ -455,7 +455,7 @@ static int sctp_copy_one_addr(struct net
(((AF_INET6 == addr->sa.sa_family) &&
  (flags & SCTP_ADDR6_ALLOWED) &&
  (flags & SCTP_ADDR6_PEERSUPP
-   error = sctp_add_bind_addr(dest, addr, sizeof(addr),
+   error = sctp_add_bind_addr(dest, addr, sizeof(*addr),
   SCTP_ADDR_SRC, gfp);
}
 


Re: Question on switchdev

2016-03-07 Thread Murali Karicheri
On 03/03/2016 05:32 PM, Murali Karicheri wrote:
> On 02/29/2016 05:29 PM, Andrew Lunn wrote:
>> On Mon, Feb 29, 2016 at 04:43:16PM -0500, Murali Karicheri wrote:
>>
>> Hi Murali
>>
>> Please can you get your email client to wrap lines at ~ 75 characters.
> Hi Andrew,
> 
> Thanks for responding. I have tried the instruction below and
> it doesn't seem to work for me. Do you know what I will have to
> set in thunderbird to do this?
> 
> http://arapulido.com/2009/12/01/enabling-line-wrapping-in-thunderbird/
> 
>>> TI Keystone netcp h/w has a switch. It has n slave ports and 1 host
>>> port. Currently the netcp driver disables the switch functionality
>>> which makes them appear as n nic ports. However we have requirement
>>> to add switch support in the driver. I have reviewed the
>>> experimental driver documentation
>>> Documentation/networking/switchdev.txt and would like to understand
>>> it better so that I can add this support to keystone netcp driver.
>>  
>>> NetCP h/w has a 1 (host port) x n (slave port) switch. It can do
>>> layer 2 forwarding between ports. In the switch mode, host driver
>>> provides the frame to the switch and switch uses the filter data
>>> base (AKA ALE table, Address Learning Engine table) to forward the
>>> packet. There is a piece of information available per frame (meta
>>> data) to decide if frame to be forwarded to a particular port or use
>>> the fdb for forward decisions.
>>
>> This makes is sound like a good fit for DSA.
>>
>> Documentation/networking/dsa/dsa.txt.
> 
> Let me check and get back to you on this and below after reading
> the above.
> 
> Murali
> 
>>
>> You probably need to implement a new tagging protocol in
>> net/dsa/tag_*.c and a driver in drivers/net/dsa/
>>
>>> 1. How does port netdev differ from regular netdev that carries data
>>>when registering netdev? Any example you can point to?
>>
>> They don't differ at all. You consider each port of the switch to be a
>> normal Linux interface.
>>
>>> 2. I assume port netdev will appear as an interface in ifconfig -a
>>>command and it is not assigned an IP address. Correct?
>>
>> The user can assign an address, if they want. It is a normal Linux
>> interface. They can also create a bridge, and add the interface to the
>> bridge. An advanced DSA driver will keep track of which interfaces are
>> in which bridge, and if possible, offload the bridge to the hardware.
>>
>>> 3. with 1xn switch, so we have n + 1 netdev registered with net
>>>core? I assume, only 1 netdev is for data plane and the rest are
>>>control plane. Is this correct?
>>
>> No. You only have netdev devices for the external ports of the
>> switch. The other port is known as the cpu port, and does not have a
>> netdev.
>>
>>> 4. We have bunch of port specific configuration that we would like
>>> to control or configure from use space using standard tools. For
>>> example, switch port state, flow control etc. Is that possible to
>>> add using this framework? ethtool update needed for this?
>>
>> The whole idea here is that the switch ports are normal Linux
>> interface. You use normal linux APIs to configure them. You probably
>> don't need to add any new features.
>>
>> One key thing to get your head around: the switch is a hardware
>> accelerator for the Linux stack. You have to think how you can make
>> your switch accelerate the Linux stack. It takes people a while to get
>> this.
>>
>>   Andrew
>>
> 
> 
Andrew,

From a high level, it looks like the netcp switch meets the description
of DSA hardware. Can you point me to a specific implementation that
I can use as an example to implement this for netcp?

-- 
Murali Karicheri
Linux Kernel, Keystone


Re: [RFC/RFT] mac80211: implement fq_codel for software queuing

2016-03-07 Thread Dave Taht
Dear Michal:

Going through this patchset... (while watching it compile)


+   if (!local->hw.txq_cparams.target)
+   local->hw.txq_cparams.target = MS2TIME(5);

MS2TIME(20) for now and/or add something saner to this than !*backlog

target will not be a constant in the long run.

+   if (now - custom_codel_get_enqueue_time(skb) < p->target ||
+   !*backlog) {
+   /* went below - stay below for at least interval */
+   vars->first_above_time = 0;
+   return false;
+   }


*backlog < some_sane_value_for_an_aggregate_for_this_station

Unlike regular codel *backlog should be a ptr to the queuesize for
this station, not the total queue.

regular codel, by using the shared backlog for all queues, is trying
to get to a 1 packet depth for all queues, (see commit:
865ec5523dadbedefbc5710a68969f686a28d928 ), and store the flow in the
network, not the queue...

BUT in wifi's case you want to provide good service to all stations,
which is filling up an aggregate
for each... (and varying the "sane_value_for_the_aggregate" to suit
per sta service time requirements in a given round of all stations).
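
A rough sketch of the per-station variant being suggested (field and
function names hypothetical, not this patch's actual code):

static bool sta_codel_should_drop(struct codel_vars *vars,
				  const struct codel_params *p,
				  u32 sta_backlog, u32 sta_aggr_limit,
				  codel_time_t now, codel_time_t enq_time)
{
	/* Compare against this station's queued bytes, not the
	 * device-wide backlog, and keep enough backlog to build one
	 * aggregate for the station.
	 */
	if (codel_time_before(now - enq_time, p->target) ||
	    sta_backlog <= sta_aggr_limit) {
		/* went below - stay below for at least interval */
		vars->first_above_time = 0;
		return false;
	}
	/* ... remainder of the usual codel state machine ... */
	return true;
}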

...

+   fq->flows_cnt = 4096;

regular fq_codel uses 1024 and there has not been much reason to
change it. In the case of an AP which has more limited memory, 256 or
1024 would be a good setting, per station. I'd stick to 1024 for now.

With large values for flows_cnt, fq, dominates, for small values, aqm
does. We did quite a lot of testing at 16 and 32 queues in the early
days, with pretty good results, except when we didn't. Cake went whole
hog with an 8 way set associative hash leading to "near perfect" fq,
which, at the cost of more cpu overhead, could cut the number of
queues down by a lot, also. Eric did "perfect" fq with sch_fq...

(btw: another way to test how codel is working is to set flows_cnt to
1. I will probably end up doing that at some point)

+   fq->perturbation = prandom_u32();
+   fq->quantum = 300;

quantum 300 is a good compromise to maximize delivery of small packets
from different flows. Probably the right thing on a station.

It also has cpu overhead. Quantum=1514 is more of the right thing on an AP.

(but to wax philosophical, per packet fairness rather than byte
fairness probably spreads errors across more flows in a wifi aggregate
than byte fairness, thus 300 remains a decent compromise if you can
spare the cpu)

...

where would be a suitable place to make (count, ecn_marks, drops)
visible in this subsystem?

...

Is this "per station" or per station, per 802.11e queue?

Dave Täht
Let's go make home routers and wifi faster! With better software!
https://www.gofundme.com/savewifi


On Fri, Feb 26, 2016 at 5:09 AM, Michal Kazior  wrote:
> Since 11n, aggregation has become important to get the
> best out of txops. However, aggregation inherently
> requires buffering and queuing. Once the variable
> medium conditions of different associated stations
> are considered, it becomes apparent that bufferbloat
> can't simply be fought with qdiscs for wireless
> drivers. 11ac with MU-MIMO makes the problem
> worse because the bandwidth-delay product becomes
> even greater.
>
> This is based on codel5 and sch_fq_codel.c. It may
> not be the Right Thing yet but it should at least
> provide a framework for more improvements.
>
> I guess dropping rate could factor in per-station
> rate control info but I don't know how this should
> exactly be done. HW rate control drivers would
> need extra work to take advantage of this.
>
> This obviously works only with drivers that use
> wake_tx_queue op.
>
> Note: This uses IFF_NO_QUEUE to get rid of qdiscs
> for wireless drivers that use mac80211 and
> implement wake_tx_queue op.
>
> Moreover the current txq_limit and latency setting
> might need tweaking. Either from userspace or be
> dynamically scaled with regard to, e.g. number of
> associated stations.
>
> FWIW This already works nicely with ath10k's (not
> yet merged) pull-push congestion control for
> MU-MIMO as far as throughput is concerned.
>
> Evaluating latency improvements is a little tricky
> at this point if a driver is using more queue
> layering and/or its firmware controls tx
> scheduling - hence I don't have any solid data on
> this. I'm open for suggestions though.
>
> It might also be a good idea to do the following
> in the future:
>
>  - make generic tx scheduling which does some RR
>over per-sta-tid queues and dequeues bursts of
>packets to form a PPDU to fit into designated
>txop timeframe and bytelimit
>
>This could in theory be shared and used by
>ath9k and (future) mt76.
>
>Moreover tx scheduling could factor in rate
>control info and keep per-station number of
>queued packets at a sufficient low threshold to
>avoid queue buildup for slow stations. Emmanuel
>already did similar experiment for iwlwifi's
>station mode and got promising results.
>
>  - make 

Re: [net-next PATCH 3/4] vxlan: Enforce IP ID verification on outer headers

2016-03-07 Thread Alex Duyck
On Mon, Mar 7, 2016 at 11:09 AM, David Miller  wrote:
> From: Or Gerlitz 
> Date: Mon, 7 Mar 2016 20:05:20 +0200
>
>> On Mon, Mar 7, 2016 at 7:22 PM, Alexander Duyck  wrote:
>>> This change enforces the IP ID verification on outer headers.  As a result
>>> if the DF flag is not set on the outer header we will force the flow to be
>>> flushed in the event that the IP ID is out of sequence with the existing
>>> flow.
>>
>> Can you please state the precise requirement for aggregation w.r.t IP
>> IDs here? and point to where/how this is enforced, e.g for
>> non-tunneled TCP GRO-ing?
>
> I also didn't see a nice "PATCH 0/4" posting explaining this series and
> I'd really like to see that.

Sorry about that.  I forgot to add the cover page when I sent this.

The enforcement is coming from the IP and TCP layers.  If you take a
look in inet_gro_receive we have the NAPI_GRO_CB(p)->flush_id value
being populated based on the difference between the expected ID and
the received one.  So for IPv4 we overwrite it, and for IPv6 we set it
to 0.  The only consumer currently using it is TCP in tcp_gro_receive.
The problem is that with tunnels we lose the data for the outer header when
the inner one overwrites it; as a result we can currently put whatever we
want in the outer IP ID and it will be accepted.
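
In rough, simplified form (not the exact net-next code) the per-flow value
populated in inet_gro_receive is:

	/* For a held flow 'p' with 'count' queued segments whose first
	 * segment carried id0, the candidate's IP ID ('id', host order)
	 * must equal id0 + count; any mismatch leaves flush_id nonzero
	 * and TCP's GRO layer flushes the flow.
	 */
	NAPI_GRO_CB(p)->flush_id =
		(u16)(ntohs(iph2->id) + NAPI_GRO_CB(p)->count) ^ id;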

The patch set is based off of a conversation several of us had on the
list about doing TSO for tunnels and the fact that the IP IDs for the
outer header have to advance.  It makes it easier for me to validate
that I am doing things properly if GRO doesn't destroy the IP ID data
for the outer headers.

- Alex


Re: [PATCH 01/11] rxrpc: Add a common object cache

2016-03-07 Thread David Howells
David Miller  wrote:

> I know you put a lot of time and effort into this, but I want to strongly
> recommend against a garbage collected hash table for anything whatsoever.
> 
> Especially if the given objects are in some way created/destroyed/etc. by
> operations triggerable remotely.
> 
> This can be DoS'd quite trivially, and that's why we have removed the ipv4
> routing cache which did the same.

Hmmm...  You have a point.  What would you suggest instead?  At least with the
common object cache code I have, I might be able to just change that.

Some thoughts/notes:

 (1) Connection objects must have a time delay before expiry after last use.

 A connection object represents a negotiated security context (involving
 sending CHALLENGE and RESPONSE packets) and stores a certain amount of
 crypto state set up that can be reused (potentially for up to 4 billion
 calls).

 The set up cost of a connection is therefore typically non-trivial (you
 can have a connection without any security, but this can only do
 anonymous operations since the negotiated security represents
 authentication as well as data encryption).

 Once I kill off an incoming connection object, I have to set up the
 connection object anew for the next call on the same connection.  Now,
 granted, it's always possible that there will be a new incoming call the
 moment I kill off a connection - but this is much more likely if the
 connection is killed off immediately.

 Similarly, outgoing connections are meant to be reusable, given the same
 parameters - but if, say, a client program is making a series of calls
 and I kill the connection off immediately a call is dead, then I have to
 set up a new connection for each call the client makes.

 The way AF_RXRPC currently works, userspace clients don't interact
 directly with connection and peer objects - only calls.  I'd rather not
 have to expose the management of those to userspace.

 (2) A connection also retains the final state of the call recently terminated
 on that connection in each call slot (channel) until that slot is reused.
 This allows re-sending of final ACK and ABORT packets.

 If I immediately kill off a connection, I can't do this.

 (3) A local endpoint object is a purely local affair, maximum count 1 per
 open AF_RXRPC socket.  These can be destroyed the moment all pinning
 sockets and connections are gone - but these aren't really a problem.

 (4) A peer object can be disposed of when all the connections using it are
 gone - at the cost of losing the determined MTU data.  That's probably
 fine, provided connections have delay before expiry.

 (5) Call objects can be disposed of immediately that they terminate and have
 communicated their last with userspace (I have to tell userspace that the
 identifier it gave us is released).  A call's last state is transferred
 to the parent connection object until a new call displaces it from the
 channel it was using.

 (6) Call objects have to persist for a while since a call involves the
 exchange of at least three packets (a minimum call is a request DATA
 packet with just an ID, a response DATA packet with no payload and then
 an ACK packet) and some communication with userspace.

 An attacker can just send us a whole bunch of request DATA packets, each
 with a different call/connection combination and attempt to run the
 server out of memory, no matter how the persistence is managed.

 (7) Why can't I have simple counters representing the maximum numbers of
 peer, connection and call objects in existence at any one time and return
 a BUSY packet to a remote client or EAGAIN to a local client if the
 counters are maxed out?

 I could probably also drive gc based on counter levels as well as expiry
 time (a rough sketch of that counter approach follows these notes).

 (8) Should I take it that I can't use RCU either as that also has a deferred
 garbage collection mechanism and so subject to being stuffed remotely?

 I really want to get spinlocks out of the incoming packet distribution
 path as that's driven from the data_ready handler of the transport
 socket.
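
Re (7), the counter capping could look something like this (sketch only,
names invented, not the actual af_rxrpc code):

static struct rxrpc_call *rxrpc_alloc_call_limited(gfp_t gfp)
{
	struct rxrpc_call *call;

	/* Hard cap instead of on-demand gc: refuse the call outright and
	 * let the caller send BUSY (or return -EAGAIN locally).
	 */
	if (atomic_inc_return(&rxrpc_n_calls) > rxrpc_max_calls) {
		atomic_dec(&rxrpc_n_calls);
		return NULL;
	}

	call = kzalloc(sizeof(*call), gfp);
	if (!call)
		atomic_dec(&rxrpc_n_calls);
	return call;
}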

David


[PATCH 2/2] sh_eth: advance 'rxdesc' later in sh_eth_ring_format()

2016-03-07 Thread Sergei Shtylyov
If dma_map_single() fails, 'rxdesc' should point to the last filled RX
descriptor, so that it can be marked as the last one; however, the driver
would have already advanced it by that time. In order to fix that, only
fill an RX descriptor once all the data for it is ready.

Signed-off-by: Sergei Shtylyov 

---
 drivers/net/ethernet/renesas/sh_eth.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: net/drivers/net/ethernet/renesas/sh_eth.c
===
--- net.orig/drivers/net/ethernet/renesas/sh_eth.c
+++ net/drivers/net/ethernet/renesas/sh_eth.c
@@ -1136,11 +1136,8 @@ static void sh_eth_ring_format(struct ne
break;
sh_eth_set_receive_align(skb);
 
-   /* RX descriptor */
-   rxdesc = &mdp->rx_ring[i];
/* The size of the buffer is a multiple of 32 bytes. */
buf_len = ALIGN(mdp->rx_buf_sz, 32);
-   rxdesc->len = cpu_to_le32(buf_len << 16);
dma_addr = dma_map_single(&ndev->dev, skb->data, buf_len,
  DMA_FROM_DEVICE);
if (dma_mapping_error(&ndev->dev, dma_addr)) {
@@ -1148,6 +1145,10 @@ static void sh_eth_ring_format(struct ne
break;
}
mdp->rx_skbuff[i] = skb;
+
+   /* RX descriptor */
+   rxdesc = &mdp->rx_ring[i];
+   rxdesc->len = cpu_to_le32(buf_len << 16);
rxdesc->addr = cpu_to_le32(dma_addr);
rxdesc->status = cpu_to_le32(RD_RACT | RD_RFP);
 



[PATCH 1/2] sh_eth: fix NULL pointer dereference in sh_eth_ring_format()

2016-03-07 Thread Sergei Shtylyov
In a low memory situation, if netdev_alloc_skb() fails on the first RX ring
loop iteration in sh_eth_ring_format(), 'rxdesc' is still NULL. Avoid a
kernel oops by adding a 'rxdesc' check after the loop.

Reported-by: Wolfram Sang 
Signed-off-by: Sergei Shtylyov 

---
 drivers/net/ethernet/renesas/sh_eth.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: net/drivers/net/ethernet/renesas/sh_eth.c
===
--- net.orig/drivers/net/ethernet/renesas/sh_eth.c
+++ net/drivers/net/ethernet/renesas/sh_eth.c
@@ -1163,7 +1163,8 @@ static void sh_eth_ring_format(struct ne
mdp->dirty_rx = (u32) (i - mdp->num_rx_ring);
 
/* Mark the last entry as wrapping the ring. */
-   rxdesc->status |= cpu_to_le32(RD_RDLE);
+   if (rxdesc)
+   rxdesc->status |= cpu_to_le32(RD_RDLE);
 
memset(mdp->tx_ring, 0, tx_ringsize);
 



[PATCH 0/2] sh_eth: fix couple of bugs in sh_eth_ring_format()

2016-03-07 Thread Sergei Shtylyov
Hello.

   Here's a set of 2 patches against DaveM's 'net.git' repo fixing two bugs
in sh_eth_ring_format()...

[1/2] sh_eth: fix NULL pointer dereference in sh_eth_ring_format()
[2/2] sh_eth: advance 'rxdesc' later in sh_eth_ring_format()

MBR, Sergei



Re: 4.1.12 kernel crash in rtnetlink_put_metrics

2016-03-07 Thread subashab

On , Daniel Borkmann wrote:

Hi Andrew,

thanks for the report!

( Making the trace a bit more readable ... )

[41358.475254]BUG:unable to handle kernel NULL pointer dereference at (null)

[41358.475333]IP:[]rtnetlink_put_metrics+0x50/0x180
[...]
CallTrace:
[41358.476522][]?__nla_reserve+0x23/0xe0
[41358.476557][]?__nla_put+0x9/0xb0
[41358.476595][]?fib_dump_info+0x15e/0x3e0
[41358.476636][]?irq_entries_start+0x639/0x678
[41358.476671][]?fib_table_dump+0xf3/0x180
[41358.476708][]?inet_dump_fib+0x7d/0x100
[41358.476746][]?netlink_dump+0x121/0x270
[41358.476781][]?skb_free_datagram+0x12/0x40
[41358.476818][]?netlink_recvmsg+0x244/0x360
[41358.476855][]?sock_recvmsg+0x1d/0x30
[41358.476890][]?sock_recvmsg_nosec+0x30/0x30
[41358.476924][]?___sys_recvmsg+0x9c/0x120
[41358.476958][]?sock_recvmsg_nosec+0x30/0x30
[41358.476994][]?update_cfs_rq_blocked_load+0xc4/0x130
[41358.477030][]?hrtimer_forward+0xa4/0x1c0
[41358.477065][]?sockfd_lookup_light+0x1d/0x80
[41358.477099][]?__sys_recvmsg+0x3e/0x80
[41358.477134][]?SyS_socketcall+0xb1/0x2a0
[41358.477168][]?handle_irq_event+0x3c/0x60
[41358.477203][]?handle_edge_irq+0x7d/0x100
[41358.477238][]?rps_trigger_softirq+0x26/0x30
[41358.477273][]?flush_smp_call_function_queue+0x83/0x120
[41358.477307][]?syscall_call+0x7/0x7
[...]

Strange that rtnetlink_put_metrics() itself is not part of the above
call trace (it's an exported symbol).

So, your analysis suggests that metrics itself is NULL in this case?
(Can you confirm that?)

How frequently does this trigger? Are the seen call traces all the same kind?


Is there an easy way to reproduce this?

I presume you don't use any per route congestion control settings, right?


Thanks,
Daniel


Hi Daniel

I am observing a similar crash as well. This is on a 3.10 based ARM64
kernel. Unfortunately, the crash is occurring in a regression test rack,
so I am not sure of the exact test case to reproduce this crash. This
seems to have occurred twice so far, with both cases having metrics as NULL.

|  rt_=_0xFFC012DA4300 -> (
|dst = (
|  callback_head = (next = 0x0, func = 0xFF800262D040),
|  child = 0xFFC03B8BC2B0,
|  dev = 0xFFC012DA4318,
|  ops = 0xFFC012DA4318,
|  _metrics = 0,
|  expires = 0,
|  path = 0x0,
|  from = 0x0,
|  xfrm = 0x0,
|  input = 0xFFC0AD498000,
|  output = 0x00010401C411,
|  flags = 0,
|  pending_confirm = 0,
|  error = 0,
|  obsolete = 0,
|  header_len = 3,
|  trailer_len = 0,
|  __pad2 = 4096,

168539.549000:   <6> Process ip (pid: 28473, stack limit = 0xffc04b584060)

168539.549006:   <2> Call trace:
168539.549016:   <2> [] rtnetlink_put_metrics+0x4c/0xec
168539.549027:   <2> [] rt6_fill_node.isra.34+0x2b8/0x3c8
168539.549035:   <2> [] rt6_dump_route+0x68/0x7c
168539.549043:   <2> [] fib6_dump_node+0x2c/0x74
168539.549051:   <2> [] fib6_walk_continue+0xf8/0x1b4
168539.549059:   <2> [] fib6_walk+0x5c/0xb8
168539.549067:   <2> [] inet6_dump_fib+0x104/0x234
168539.549076:   <2> [] netlink_dump+0x7c/0x1cc
168539.549084:   <2> [] __netlink_dump_start+0x128/0x170
168539.549093:   <2> [] rtnetlink_rcv_msg+0x12c/0x1a0
168539.549101:   <2> [] netlink_rcv_skb+0x64/0xc8
168539.549110:   <2> [] rtnetlink_rcv+0x1c/0x2c
168539.549117:   <2> [] netlink_unicast+0x108/0x1b8
168539.549125:   <2> [] netlink_sendmsg+0x27c/0x2d4
168539.549134:   <2> [] sock_sendmsg+0x8c/0xb0
168539.549143:   <2> [] SyS_sendto+0xcc/0x110

I am using the following patch as a workaround now. I do not have any
per route congestion control settings enabled.
Any pointers to debug this would be greatly appreciated.

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a67310e..c63098e 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -566,7 +566,7 @@ int rtnetlink_put_metrics(struct sk_buff *skb, u32 *metrics)

int i, valid = 0;

mx = nla_nest_start(skb, RTA_METRICS);
-   if (mx == NULL)
+   if (mx == NULL || metrics == NULL)
return -ENOBUFS;

for (i = 0; i < RTAX_MAX; i++) {





[PATCH v2 net-next 11/13] kcm: Add memory limit for receive message construction

2016-03-07 Thread Tom Herbert
Message assembly is performed on the TCP socket. This is logically
equivalent to an application that performs a peek on the socket to find
out how much memory is needed for a receive buffer. The receive socket
buffer also provides the maximum message size, which is checked.

The receive algorithm is something like:

   1) Receive the first skbuf for a message (or skbufs if multiple are
  needed to determine message length).
   2) Check the message length against the number of bytes in the TCP
  receive queue (tcp_inq()).
- If all the bytes of the message are in the queue (including the
  skbuf received), then proceed with message assembly (it should
  complete with the tcp_read_sock)
- Else, mark the psock with the number of bytes needed to
  complete the message.
   3) In TCP data ready function, if the psock indicates that we are
  waiting for the rest of the bytes of a messages, check the number
  of queued bytes against that.
- If there are still not enough bytes for the message, just
  return
- Else, clear the waiting bytes and proceed to receive the
  skbufs.  The message should now be received in one
  tcp_read_sock
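
A condensed sketch of that check (mirroring the kcm_tcp_recv and
data_ready changes in this patch; error handling elided):

	/* Step 2: after parsing the header, stop pulling skbs if the rest
	 * of the message is not yet queued on the TCP socket, and record
	 * how many bytes are still needed.
	 */
	if (rxm->full_len - rxm->accum_len > tcp_inq(psock->sk))
		psock->rx_need_bytes = rxm->full_len - rxm->accum_len;

	/* Step 3: in the psock data_ready callback, only restart the
	 * receive path once enough bytes have accumulated.
	 */
	if (psock->rx_need_bytes > 0 &&
	    tcp_inq(psock->sk) < psock->rx_need_bytes)
		return;
	psock->rx_need_bytes = 0;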

Signed-off-by: Tom Herbert 
---
 include/net/kcm.h |  4 
 net/kcm/kcmproc.c |  6 --
 net/kcm/kcmsock.c | 44 
 3 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/include/net/kcm.h b/include/net/kcm.h
index 39c7abe..d892956 100644
--- a/include/net/kcm.h
+++ b/include/net/kcm.h
@@ -28,6 +28,7 @@ struct kcm_psock_stats {
unsigned int rx_aborts;
unsigned int rx_mem_fail;
unsigned int rx_need_more_hdr;
+   unsigned int rx_msg_too_big;
unsigned int rx_bad_hdr_len;
unsigned long long reserved;
unsigned long long unreserved;
@@ -66,6 +67,7 @@ struct kcm_rx_msg {
int full_len;
int accum_len;
int offset;
+   int early_eaten;
 };
 
 /* Socket structure for KCM client sockets */
@@ -128,6 +130,7 @@ struct kcm_psock {
struct kcm_sock *rx_kcm;
unsigned long long saved_rx_bytes;
unsigned long long saved_rx_msgs;
+   unsigned int rx_need_bytes;
 
/* Transmit */
struct kcm_sock *tx_kcm;
@@ -190,6 +193,7 @@ static inline void aggregate_psock_stats(struct 
kcm_psock_stats *stats,
SAVE_PSOCK_STATS(rx_aborts);
SAVE_PSOCK_STATS(rx_mem_fail);
SAVE_PSOCK_STATS(rx_need_more_hdr);
+   SAVE_PSOCK_STATS(rx_msg_too_big);
SAVE_PSOCK_STATS(rx_bad_hdr_len);
SAVE_PSOCK_STATS(tx_msgs);
SAVE_PSOCK_STATS(tx_bytes);
diff --git a/net/kcm/kcmproc.c b/net/kcm/kcmproc.c
index 5eb9809..7638b35 100644
--- a/net/kcm/kcmproc.c
+++ b/net/kcm/kcmproc.c
@@ -331,7 +331,7 @@ static int kcm_stats_seq_show(struct seq_file *seq, void *v)
   mux_stats.rx_ready_drops);
 
seq_printf(seq,
-  "%-8s %-10s %-16s %-10s %-16s %-10s %-10s %-10s %-10s %-10s 
%-10s %-10s\n",
+  "%-8s %-10s %-16s %-10s %-16s %-10s %-10s %-10s %-10s %-10s 
%-10s %-10s %-10s\n",
   "Psock",
   "RX-Msgs",
   "RX-Bytes",
@@ -343,10 +343,11 @@ static int kcm_stats_seq_show(struct seq_file *seq, void *v)
   "RX-MemFail",
   "RX-NeedMor",
   "RX-BadLen",
+  "RX-TooBig",
   "TX-Aborts");
 
seq_printf(seq,
-  "%-8s %-10llu %-16llu %-10llu %-16llu %-10llu %-10llu %-10u 
%-10u %-10u %-10u %-10u\n",
+  "%-8s %-10llu %-16llu %-10llu %-16llu %-10llu %-10llu %-10u 
%-10u %-10u %-10u %-10u %-10u\n",
   "",
   psock_stats.rx_msgs,
   psock_stats.rx_bytes,
@@ -358,6 +359,7 @@ static int kcm_stats_seq_show(struct seq_file *seq, void *v)
   psock_stats.rx_mem_fail,
   psock_stats.rx_need_more_hdr,
   psock_stats.rx_bad_hdr_len,
+  psock_stats.rx_msg_too_big,
   psock_stats.tx_aborts);
 
return 0;
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 9ac2499..8bc38d3 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -375,6 +375,19 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
if (head) {
/* Message already in progress */
 
+   rxm = kcm_rx_msg(head);
+   if (unlikely(rxm->early_eaten)) {
+   /* Already some number of bytes on the receive sock
+* data saved in rx_skb_head, just indicate they
+* are consumed.
+*/
+   eaten = orig_len <= rxm->early_eaten ?
+   orig_len : rxm->early_eaten;
+   rxm->early_eaten -= eaten;
+
+  

[PATCH v2 net-next 09/13] kcm: Splice support

2016-03-07 Thread Tom Herbert
Implement kcm_splice_read. This is supported only for seqpacket.
Add kcm_seqpacket_ops and set splice read to kcm_splice_read.

Signed-off-by: Tom Herbert 
---
 net/kcm/kcmsock.c | 98 +--
 1 file changed, 96 insertions(+), 2 deletions(-)

diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index f938d7d..982ea5f 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -1256,6 +1256,76 @@ out:
return copied ? : err;
 }
 
+static ssize_t kcm_sock_splice(struct sock *sk,
+  struct pipe_inode_info *pipe,
+  struct splice_pipe_desc *spd)
+{
+   int ret;
+
+   release_sock(sk);
+   ret = splice_to_pipe(pipe, spd);
+   lock_sock(sk);
+
+   return ret;
+}
+
+static ssize_t kcm_splice_read(struct socket *sock, loff_t *ppos,
+  struct pipe_inode_info *pipe, size_t len,
+  unsigned int flags)
+{
+   struct sock *sk = sock->sk;
+   struct kcm_sock *kcm = kcm_sk(sk);
+   long timeo;
+   struct kcm_rx_msg *rxm;
+   int err = 0;
+   size_t copied;
+   struct sk_buff *skb;
+
+   /* Only support splice for SOCK_SEQPACKET */
+
+   timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
+
+   lock_sock(sk);
+
+   skb = kcm_wait_data(sk, flags, timeo, &err);
+   if (!skb)
+   goto err_out;
+
+   /* Okay, have a message on the receive queue */
+
+   rxm = kcm_rx_msg(skb);
+
+   if (len > rxm->full_len)
+   len = rxm->full_len;
+
+   copied = skb_splice_bits(skb, sk, rxm->offset, pipe, len, flags,
+kcm_sock_splice);
+   if (copied < 0) {
+   err = copied;
+   goto err_out;
+   }
+
+   KCM_STATS_ADD(kcm->stats.rx_bytes, copied);
+
+   rxm->offset += copied;
+   rxm->full_len -= copied;
+
+   /* We have no way to return MSG_EOR. If all the bytes have been
+* read we still leave the message in the receive socket buffer.
+* A subsequent recvmsg needs to be done to return MSG_EOR and
+* finish reading the message.
+*/
+
+   release_sock(sk);
+
+   return copied;
+
+err_out:
+   release_sock(sk);
+
+   return err;
+}
+
 /* kcm sock lock held */
 static void kcm_recv_disable(struct kcm_sock *kcm)
 {
@@ -1907,7 +1977,7 @@ static int kcm_release(struct socket *sock)
return 0;
 }
 
-static const struct proto_ops kcm_ops = {
+static const struct proto_ops kcm_dgram_ops = {
.family =   PF_KCM,
.owner =THIS_MODULE,
.release =  kcm_release,
@@ -1928,6 +1998,28 @@ static const struct proto_ops kcm_ops = {
.sendpage = sock_no_sendpage,
 };
 
+static const struct proto_ops kcm_seqpacket_ops = {
+   .family =   PF_KCM,
+   .owner =THIS_MODULE,
+   .release =  kcm_release,
+   .bind = sock_no_bind,
+   .connect =  sock_no_connect,
+   .socketpair =   sock_no_socketpair,
+   .accept =   sock_no_accept,
+   .getname =  sock_no_getname,
+   .poll = datagram_poll,
+   .ioctl =kcm_ioctl,
+   .listen =   sock_no_listen,
+   .shutdown = sock_no_shutdown,
+   .setsockopt =   kcm_setsockopt,
+   .getsockopt =   kcm_getsockopt,
+   .sendmsg =  kcm_sendmsg,
+   .recvmsg =  kcm_recvmsg,
+   .mmap = sock_no_mmap,
+   .sendpage = sock_no_sendpage,
+   .splice_read =  kcm_splice_read,
+};
+
 /* Create proto operation for kcm sockets */
 static int kcm_create(struct net *net, struct socket *sock,
  int protocol, int kern)
@@ -1938,8 +2030,10 @@ static int kcm_create(struct net *net, struct socket *sock,
 
switch (sock->type) {
case SOCK_DGRAM:
+   sock->ops = &kcm_dgram_ops;
+   break;
case SOCK_SEQPACKET:
-   sock->ops = &kcm_ops;
+   sock->ops = &kcm_seqpacket_ops;
break;
default:
return -ESOCKTNOSUPPORT;
-- 
2.6.5



[PATCH v2 net-next 10/13] kcm: Sendpage support

2016-03-07 Thread Tom Herbert
Implement kcm_sendpage. Set the sendpage op to kcm_sendpage in both the
dgram and seqpacket proto_ops.

Signed-off-by: Tom Herbert 
---
 net/kcm/kcmsock.c | 147 +-
 1 file changed, 145 insertions(+), 2 deletions(-)

diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 982ea5f..9ac2499 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -990,6 +990,149 @@ static void kcm_push(struct kcm_sock *kcm)
kcm_write_msgs(kcm);
 }
 
+static ssize_t kcm_sendpage(struct socket *sock, struct page *page,
+   int offset, size_t size, int flags)
+
+{
+   struct sock *sk = sock->sk;
+   struct kcm_sock *kcm = kcm_sk(sk);
+   struct sk_buff *skb = NULL, *head = NULL;
+   long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
+   bool eor;
+   int err = 0;
+   int i;
+
+   if (flags & MSG_SENDPAGE_NOTLAST)
+   flags |= MSG_MORE;
+
+   /* No MSG_EOR from splice, only look at MSG_MORE */
+   eor = !(flags & MSG_MORE);
+
+   lock_sock(sk);
+
+   sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
+
+   err = -EPIPE;
+   if (sk->sk_err)
+   goto out_error;
+
+   if (kcm->seq_skb) {
+   /* Previously opened message */
+   head = kcm->seq_skb;
+   skb = kcm_tx_msg(head)->last_skb;
+   i = skb_shinfo(skb)->nr_frags;
+
+   if (skb_can_coalesce(skb, i, page, offset)) {
+   skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], size);
+   skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
+   goto coalesced;
+   }
+
+   if (i >= MAX_SKB_FRAGS) {
+   struct sk_buff *tskb;
+
+   tskb = alloc_skb(0, sk->sk_allocation);
+   while (!tskb) {
+   kcm_push(kcm);
+   err = sk_stream_wait_memory(sk, &timeo);
+   if (err)
+   goto out_error;
+   }
+
+   if (head == skb)
+   skb_shinfo(head)->frag_list = tskb;
+   else
+   skb->next = tskb;
+
+   skb = tskb;
+   skb->ip_summed = CHECKSUM_UNNECESSARY;
+   i = 0;
+   }
+   } else {
+   /* Call the sk_stream functions to manage the sndbuf mem. */
+   if (!sk_stream_memory_free(sk)) {
+   kcm_push(kcm);
+   set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+   err = sk_stream_wait_memory(sk, &timeo);
+   if (err)
+   goto out_error;
+   }
+
+   head = alloc_skb(0, sk->sk_allocation);
+   while (!head) {
+   kcm_push(kcm);
+   err = sk_stream_wait_memory(sk, &timeo);
+   if (err)
+   goto out_error;
+   }
+
+   skb = head;
+   i = 0;
+   }
+
+   get_page(page);
+   skb_fill_page_desc(skb, i, page, offset, size);
+   skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
+
+coalesced:
+   skb->len += size;
+   skb->data_len += size;
+   skb->truesize += size;
+   sk->sk_wmem_queued += size;
+   sk_mem_charge(sk, size);
+
+   if (head != skb) {
+   head->len += size;
+   head->data_len += size;
+   head->truesize += size;
+   }
+
+   if (eor) {
+   bool not_busy = skb_queue_empty(&sk->sk_write_queue);
+
+   /* Message complete, queue it on send buffer */
+   __skb_queue_tail(&sk->sk_write_queue, head);
+   kcm->seq_skb = NULL;
+   KCM_STATS_INCR(kcm->stats.tx_msgs);
+
+   if (flags & MSG_BATCH) {
+   kcm->tx_wait_more = true;
+   } else if (kcm->tx_wait_more || not_busy) {
+   err = kcm_write_msgs(kcm);
+   if (err < 0) {
+   /* We got a hard error in write_msgs but have
+* already queued this message. Report an error
+* in the socket, but don't affect return value
+* from sendmsg
+*/
+   pr_warn("KCM: Hard failure on 
kcm_write_msgs\n");
+   report_csk_error(>sk, -err);
+   }
+   }
+   } else {
+   /* Message not complete, save state */
+   kcm->seq_skb = head;
+   kcm_tx_msg(head)->last_skb = skb;
+   }
+
+   KCM_STATS_ADD(kcm->stats.tx_bytes, size);
+
+   

[PATCH v2 net-next 05/13] net: Walk fragments in __skb_splice_bits

2016-03-07 Thread Tom Herbert
Add walking of fragments in __skb_splice_bits.

Signed-off-by: Tom Herbert 
---
 net/core/skbuff.c | 39 ---
 1 file changed, 16 insertions(+), 23 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7af7ec6..0df9f6a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1918,6 +1918,7 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
  struct splice_pipe_desc *spd, struct sock *sk)
 {
int seg;
+   struct sk_buff *iter;
 
/* map the linear part :
 * If skb->head_frag is set, this 'linear' part is backed by a
@@ -1944,6 +1945,19 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
return true;
}
 
+   skb_walk_frags(skb, iter) {
+   if (*offset >= iter->len) {
+   *offset -= iter->len;
+   continue;
+   }
+   /* __skb_splice_bits() only fails if the output has no room
+* left, so no point in going over the frag_list for the error
+* case.
+*/
+   if (__skb_splice_bits(iter, pipe, offset, len, spd, sk))
+   return true;
+   }
+
return false;
 }
 
@@ -1970,9 +1984,7 @@ ssize_t skb_socket_splice(struct sock *sk,
 
 /*
  * Map data from the skb to a pipe. Should handle both the linear part,
- * the fragments, and the frag list. It does NOT handle frag lists within
- * the frag list, if such a thing exists. We'd probably need to recurse to
- * handle that cleanly.
+ * the fragments, and the frag list.
  */
 int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
struct pipe_inode_info *pipe, unsigned int tlen,
@@ -1991,29 +2003,10 @@ int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
.ops = &nosteal_pipe_buf_ops,
.spd_release = sock_spd_release,
};
-   struct sk_buff *frag_iter;
int ret = 0;
 
-   /*
-* __skb_splice_bits() only fails if the output has no room left,
-* so no point in going over the frag_list for the error case.
-*/
-   if (__skb_splice_bits(skb, pipe, &offset, &tlen, &spd, sk))
-   goto done;
-   else if (!tlen)
-   goto done;
+   __skb_splice_bits(skb, pipe, &offset, &tlen, &spd, sk);
 
-   /*
-* now see if we have a frag_list to map
-*/
-   skb_walk_frags(skb, frag_iter) {
-   if (!tlen)
-   break;
-   if (__skb_splice_bits(frag_iter, pipe, &offset, &tlen, &spd, sk))
-   break;
-   }
-
-done:
if (spd.nr_pages)
ret = splice_cb(sk, pipe, &spd);
 
-- 
2.6.5



[PATCH v2 net-next 12/13] kcm: Add receive message timeout

2016-03-07 Thread Tom Herbert
This patch adds a receive timeout for message assembly on the attached TCP
sockets. The timeout is set when a new message is started and the whole
message has not yet been received by TCP (i.e. it is not in the receive
queue). If the complete message is subsequently received the timer is
cancelled; if the timer expires the RX side is aborted.

The timeout value is taken from the socket timeout (SO_RCVTIMEO) that is
set on the TCP socket (i.e. set by setsockopt before attaching the TCP
socket to KCM).
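
From userspace that might look like this (hedged sketch; kcm_attach and
SIOCKCMATTACH are the attach interface added in this series):

	/* Give message assembly on this connection 5 seconds to complete
	 * before KCM aborts the psock's RX side.
	 */
	struct timeval tv = { .tv_sec = 5 };
	struct kcm_attach attach = {
		.fd = tcp_fd,		/* connected TCP socket */
		.bpf_fd = bpf_prog_fd,	/* message parser program */
	};

	setsockopt(tcp_fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
	ioctl(kcm_fd, SIOCKCMATTACH, &attach);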

Signed-off-by: Tom Herbert 
---
 include/net/kcm.h |  3 +++
 net/kcm/kcmproc.c |  6 --
 net/kcm/kcmsock.c | 32 
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/include/net/kcm.h b/include/net/kcm.h
index d892956..95c425c 100644
--- a/include/net/kcm.h
+++ b/include/net/kcm.h
@@ -29,6 +29,7 @@ struct kcm_psock_stats {
unsigned int rx_mem_fail;
unsigned int rx_need_more_hdr;
unsigned int rx_msg_too_big;
+   unsigned int rx_msg_timeouts;
unsigned int rx_bad_hdr_len;
unsigned long long reserved;
unsigned long long unreserved;
@@ -130,6 +131,7 @@ struct kcm_psock {
struct kcm_sock *rx_kcm;
unsigned long long saved_rx_bytes;
unsigned long long saved_rx_msgs;
+   struct timer_list rx_msg_timer;
unsigned int rx_need_bytes;
 
/* Transmit */
@@ -194,6 +196,7 @@ static inline void aggregate_psock_stats(struct 
kcm_psock_stats *stats,
SAVE_PSOCK_STATS(rx_mem_fail);
SAVE_PSOCK_STATS(rx_need_more_hdr);
SAVE_PSOCK_STATS(rx_msg_too_big);
+   SAVE_PSOCK_STATS(rx_msg_timeouts);
SAVE_PSOCK_STATS(rx_bad_hdr_len);
SAVE_PSOCK_STATS(tx_msgs);
SAVE_PSOCK_STATS(tx_bytes);
diff --git a/net/kcm/kcmproc.c b/net/kcm/kcmproc.c
index 7638b35..7380087 100644
--- a/net/kcm/kcmproc.c
+++ b/net/kcm/kcmproc.c
@@ -331,7 +331,7 @@ static int kcm_stats_seq_show(struct seq_file *seq, void *v)
   mux_stats.rx_ready_drops);
 
seq_printf(seq,
-  "%-8s %-10s %-16s %-10s %-16s %-10s %-10s %-10s %-10s %-10s 
%-10s %-10s %-10s\n",
+  "%-8s %-10s %-16s %-10s %-16s %-10s %-10s %-10s %-10s %-10s 
%-10s %-10s %-10s %-10s\n",
   "Psock",
   "RX-Msgs",
   "RX-Bytes",
@@ -344,10 +344,11 @@ static int kcm_stats_seq_show(struct seq_file *seq, void *v)
   "RX-NeedMor",
   "RX-BadLen",
   "RX-TooBig",
+  "RX-Timeout",
   "TX-Aborts");
 
seq_printf(seq,
-  "%-8s %-10llu %-16llu %-10llu %-16llu %-10llu %-10llu %-10u 
%-10u %-10u %-10u %-10u %-10u\n",
+  "%-8s %-10llu %-16llu %-10llu %-16llu %-10llu %-10llu %-10u 
%-10u %-10u %-10u %-10u %-10u %-10u\n",
   "",
   psock_stats.rx_msgs,
   psock_stats.rx_bytes,
@@ -360,6 +361,7 @@ static int kcm_stats_seq_show(struct seq_file *seq, void *v)
   psock_stats.rx_need_more_hdr,
   psock_stats.rx_bad_hdr_len,
   psock_stats.rx_msg_too_big,
+  psock_stats.rx_msg_timeouts,
   psock_stats.tx_aborts);
 
return 0;
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 8bc38d3..40662d73 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -55,6 +55,8 @@ static void kcm_abort_rx_psock(struct kcm_psock *psock, int err,
 
/* Unrecoverable error in receive */
 
+   del_timer(&psock->rx_msg_timer);
+
if (psock->rx_stopped)
return;
 
@@ -351,6 +353,12 @@ static void unreserve_rx_kcm(struct kcm_psock *psock,
spin_unlock_bh(&mux->rx_lock);
 }
 
+static void kcm_start_rx_timer(struct kcm_psock *psock)
+{
+   if (psock->sk->sk_rcvtimeo)
+   mod_timer(&psock->rx_msg_timer, psock->sk->sk_rcvtimeo);
+}
+
 /* Macro to invoke filter function. */
 #define KCM_RUN_FILTER(prog, ctx) \
(*prog->bpf_func)(ctx, prog->insnsi)
@@ -500,6 +508,10 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 
if (!len) {
/* Need more header to determine length */
+   if (!rxm->accum_len) {
+   /* Start RX timer for new message */
+   kcm_start_rx_timer(psock);
+   }
rxm->accum_len += cand_len;
eaten += cand_len;
KCM_STATS_INCR(psock->stats.rx_need_more_hdr);
@@ -540,6 +552,11 @@ static int kcm_tcp_recv(read_descriptor_t *desc, struct sk_buff *orig_skb,
 * but don't consume yet per tcp_read_sock.
 */
 
+   if (!rxm->accum_len) {
+ 

[PATCH v2 net-next 08/13] kcm: Add statistics and proc interfaces

2016-03-07 Thread Tom Herbert
This patch adds various counters for KCM. These include counters for
messages and bytes received or sent, as well as counters for number of
attached/unattached TCP sockets and other error or edge events.

The statistics are exposed via a proc interface. /proc/net/kcm provides
statistics per KCM socket and per psock (attached TCP sockets).
/proc/net/kcm_stats provides aggregate statistics.

Signed-off-by: Tom Herbert 
---
 include/net/kcm.h |  94 
 net/kcm/Makefile  |   2 +-
 net/kcm/kcmproc.c | 422 ++
 net/kcm/kcmsock.c |  80 +++
 4 files changed, 597 insertions(+), 1 deletion(-)
 create mode 100644 net/kcm/kcmproc.c

diff --git a/include/net/kcm.h b/include/net/kcm.h
index 1bcae39..39c7abe 100644
--- a/include/net/kcm.h
+++ b/include/net/kcm.h
@@ -17,6 +17,42 @@
 
 extern unsigned int kcm_net_id;
 
+#define KCM_STATS_ADD(stat, count) ((stat) += (count))
+#define KCM_STATS_INCR(stat) ((stat)++)
+
+struct kcm_psock_stats {
+   unsigned long long rx_msgs;
+   unsigned long long rx_bytes;
+   unsigned long long tx_msgs;
+   unsigned long long tx_bytes;
+   unsigned int rx_aborts;
+   unsigned int rx_mem_fail;
+   unsigned int rx_need_more_hdr;
+   unsigned int rx_bad_hdr_len;
+   unsigned long long reserved;
+   unsigned long long unreserved;
+   unsigned int tx_aborts;
+};
+
+struct kcm_mux_stats {
+   unsigned long long rx_msgs;
+   unsigned long long rx_bytes;
+   unsigned long long tx_msgs;
+   unsigned long long tx_bytes;
+   unsigned int rx_ready_drops;
+   unsigned int tx_retries;
+   unsigned int psock_attach;
+   unsigned int psock_unattach_rsvd;
+   unsigned int psock_unattach;
+};
+
+struct kcm_stats {
+   unsigned long long rx_msgs;
+   unsigned long long rx_bytes;
+   unsigned long long tx_msgs;
+   unsigned long long tx_bytes;
+};
+
 struct kcm_tx_msg {
unsigned int sent;
unsigned int fragidx;
@@ -41,6 +77,8 @@ struct kcm_sock {
u32 done : 1;
struct work_struct done_work;
 
+   struct kcm_stats stats;
+
/* Transmit */
struct kcm_psock *tx_psock;
struct work_struct tx_work;
@@ -77,6 +115,8 @@ struct kcm_psock {
 
struct list_head psock_list;
 
+   struct kcm_psock_stats stats;
+
/* Receive */
struct sk_buff *rx_skb_head;
struct sk_buff **rx_skb_nextp;
@@ -86,15 +126,21 @@ struct kcm_psock {
struct delayed_work rx_delayed_work;
struct bpf_prog *bpf_prog;
struct kcm_sock *rx_kcm;
+   unsigned long long saved_rx_bytes;
+   unsigned long long saved_rx_msgs;
 
/* Transmit */
struct kcm_sock *tx_kcm;
struct list_head psock_avail_list;
+   unsigned long long saved_tx_bytes;
+   unsigned long long saved_tx_msgs;
 };
 
 /* Per net MUX list */
 struct kcm_net {
struct mutex mutex;
+   struct kcm_psock_stats aggregate_psock_stats;
+   struct kcm_mux_stats aggregate_mux_stats;
struct list_head mux_list;
int count;
 };
@@ -110,6 +156,9 @@ struct kcm_mux {
struct list_head psocks;/* List of all psocks on MUX */
int psocks_cnt; /* Total attached sockets */
 
+   struct kcm_mux_stats stats;
+   struct kcm_psock_stats aggregate_psock_stats;
+
/* Receive */
spinlock_t rx_lock cacheline_aligned_in_smp;
struct list_head kcm_rx_waiters; /* KCMs waiting for receiving */
@@ -122,4 +171,49 @@ struct kcm_mux {
struct list_head kcm_tx_waiters; /* KCMs waiting for a TX psock */
 };
 
+#ifdef CONFIG_PROC_FS
+int kcm_proc_init(void);
+void kcm_proc_exit(void);
+#else
+static int kcm_proc_init(void) { return 0; }
+static void kcm_proc_exit(void) { }
+#endif
+
+static inline void aggregate_psock_stats(struct kcm_psock_stats *stats,
+struct kcm_psock_stats *agg_stats)
+{
+   /* Save psock statistics in the mux when psock is being unattached. */
+
+#define SAVE_PSOCK_STATS(_stat) (agg_stats->_stat += stats->_stat)
+   SAVE_PSOCK_STATS(rx_msgs);
+   SAVE_PSOCK_STATS(rx_bytes);
+   SAVE_PSOCK_STATS(rx_aborts);
+   SAVE_PSOCK_STATS(rx_mem_fail);
+   SAVE_PSOCK_STATS(rx_need_more_hdr);
+   SAVE_PSOCK_STATS(rx_bad_hdr_len);
+   SAVE_PSOCK_STATS(tx_msgs);
+   SAVE_PSOCK_STATS(tx_bytes);
+   SAVE_PSOCK_STATS(reserved);
+   SAVE_PSOCK_STATS(unreserved);
+   SAVE_PSOCK_STATS(tx_aborts);
+#undef SAVE_PSOCK_STATS
+}
+
+static inline void aggregate_mux_stats(struct kcm_mux_stats *stats,
+  struct kcm_mux_stats *agg_stats)
+{
+   /* Save psock statistics in the mux when psock is being unattached. */
+
+#define SAVE_MUX_STATS(_stat) (agg_stats->_stat += stats->_stat)
+   SAVE_MUX_STATS(rx_msgs);
+   SAVE_MUX_STATS(rx_bytes);
+   SAVE_MUX_STATS(tx_msgs);
+ 

[PATCH v2 net-next 13/13] kcm: Add description in Documentation

2016-03-07 Thread Tom Herbert
Add kcm.txt to describe KCM and its interfaces.

Signed-off-by: Tom Herbert 
---
 Documentation/networking/kcm.txt | 285 +++
 1 file changed, 285 insertions(+)
 create mode 100644 Documentation/networking/kcm.txt

diff --git a/Documentation/networking/kcm.txt b/Documentation/networking/kcm.txt
new file mode 100644
index 000..3476ede
--- /dev/null
+++ b/Documentation/networking/kcm.txt
@@ -0,0 +1,285 @@
+Kernel Connection Multiplexor
+-----------------------------
+
+Kernel Connection Multiplexor (KCM) is a mechanism that provides a message
+based interface over TCP for generic application protocols. With KCM an
+application can efficiently send and receive application protocol messages
+over TCP using datagram sockets.
+
+KCM implements an NxM multiplexor in the kernel as diagrammed below:
+
+   +------------+   +------------+   +------------+   +------------+
+   | KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
+   +------------+   +------------+   +------------+   +------------+
+         |                |                |                |
+         +--------+       |                |       +--------+
+                  |       |                |       |
+               +------------------------------------+
+               |            Multiplexor             |
+               +------------------------------------+
+                 |        |        |        |       |
+       +---------+        |        |        |       +---------+
+       |                  |        |        |                 |
+  +----------+  +----------+  +----------+  +----------+  +----------+
+  |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |
+  +----------+  +----------+  +----------+  +----------+  +----------+
+       |             |             |             |             |
+  +----------+  +----------+  +----------+  +----------+  +----------+
+  | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |
+  +----------+  +----------+  +----------+  +----------+  +----------+
+
+KCM sockets
+-----------
+
+The KCM sockets provide the user interface to the multiplexor. All the KCM
+sockets bound to a multiplexor are considered to have equivalent function,
+and I/O operations in different sockets may be done in parallel without the
+need for synchronization between threads in userspace.
+
+Multiplexor
+-----------
+
+The multiplexor provides the message steering. In the transmit path, messages
+written on a KCM socket are sent atomically on an appropriate TCP socket.
+Similarly, in the receive path, messages are constructed on each TCP socket
+(Psock) and complete messages are steered to a KCM socket.
+
+TCP sockets & Psocks
+--------------------
+
+TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
+for each bound TCP socket, this structure holds the state for constructing
+messages on receive as well as other connection specific information for KCM.
+
+Connected mode semantics
+------------------------
+
+Each multiplexor assumes that all attached TCP connections are to the same
+destination and can use the different connections for load balancing when
+transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
+can be used to send and receive messages from the KCM socket.
+
+Socket types
+------------
+
+KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.
+
+Message delineation
+-------------------
+
+Messages are sent over a TCP stream with some application protocol message
+format that typically includes a header which frames the messages. The length
+of a received message can be deduced from the application protocol header
+(often just a simple length field).
+
+A TCP stream must be parsed to determine message boundaries. Berkeley Packet
+Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
+BPF program must be specified. The program is called at the start of receiving
+a new message and is given an skbuff that contains the bytes received so far.
+It parses the message header and returns the length of the message. Given this
+information, KCM will construct the message of the stated length and deliver it
+to a KCM socket.
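+
+For example, if messages begin with a 2-byte big-endian length field that
+covers the whole message, an illustrative (untested) parser written in
+BPF C (as in samples/bpf) would be:
+
+    int kcm_parse(struct __sk_buff *skb)
+    {
+            /* Return the message length taken from the header */
+            return load_half(skb, 0);
+    }
+
+If the length field did not cover the header itself, the program would add
+the header size before returning. The program is loaded via the bpf()
+syscall and its fd is handed to KCM when attaching the TCP socket.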
+
+TCP socket management
+---------------------
+
+When a TCP socket is attached to a KCM multiplexor data ready (POLLIN) and
+write space available (POLLOUT) events are handled by the multiplexor. If there
+is a state change (disconnection) or other error on a TCP socket, an error is
+posted on the TCP socket so that a POLLERR event happens and KCM discontinues
+using the socket. When the application gets the error notification for a
+TCP socket, it should unattach the socket from KCM and then handle the error
+condition (the typical response is to close the socket and create a new
+connection if necessary).
+
+KCM limits the maximum receive message size to be the size of the receive
+socket buffer on the attached TCP socket (the socket buffer size can be set by
+SO_RCVBUF). 

[PATCH v2 net-next 07/13] kcm: Kernel Connection Multiplexor module

2016-03-07 Thread Tom Herbert
This module implements the Kernel Connection Multiplexor.

Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
With KCM an application can efficiently send and receive application
protocol messages over TCP using datagram sockets.

For more information see the included Documentation/networking/kcm.txt

Signed-off-by: Tom Herbert 
---
 include/linux/socket.h   |6 +-
 include/net/kcm.h|  125 +++
 include/uapi/linux/kcm.h |   40 +
 net/Kconfig  |1 +
 net/Makefile |1 +
 net/kcm/Kconfig  |   10 +
 net/kcm/Makefile |3 +
 net/kcm/kcmsock.c| 2016 ++
 8 files changed, 2201 insertions(+), 1 deletion(-)
 create mode 100644 include/net/kcm.h
 create mode 100644 include/uapi/linux/kcm.h
 create mode 100644 net/kcm/Kconfig
 create mode 100644 net/kcm/Makefile
 create mode 100644 net/kcm/kcmsock.c

diff --git a/include/linux/socket.h b/include/linux/socket.h
index d834af2..73bf6c6 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -200,7 +200,9 @@ struct ucred {
 #define AF_ALG 38  /* Algorithm sockets*/
 #define AF_NFC 39  /* NFC sockets  */
 #define AF_VSOCK   40  /* vSockets */
-#define AF_MAX 41  /* For now.. */
+#define AF_KCM 41  /* Kernel Connection Multiplexor*/
+
+#define AF_MAX 42  /* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC  AF_UNSPEC
@@ -246,6 +248,7 @@ struct ucred {
 #define PF_ALG AF_ALG
 #define PF_NFC AF_NFC
 #define PF_VSOCK   AF_VSOCK
+#define PF_KCM AF_KCM
 #define PF_MAX AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
@@ -323,6 +326,7 @@ struct ucred {
 #define SOL_CAIF   278
 #define SOL_ALG279
 #define SOL_NFC280
+#define SOL_KCM281
 
 /* IPX options */
 #define IPX_TYPE   1
diff --git a/include/net/kcm.h b/include/net/kcm.h
new file mode 100644
index 000..1bcae39
--- /dev/null
+++ b/include/net/kcm.h
@@ -0,0 +1,125 @@
+/*
+ * Kernel Connection Multiplexor
+ *
+ * Copyright (c) 2016 Tom Herbert 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ */
+
+#ifndef __NET_KCM_H_
+#define __NET_KCM_H_
+
+#include 
+#include 
+#include 
+
+extern unsigned int kcm_net_id;
+
+struct kcm_tx_msg {
+   unsigned int sent;
+   unsigned int fragidx;
+   unsigned int frag_offset;
+   unsigned int msg_flags;
+   struct sk_buff *frag_skb;
+   struct sk_buff *last_skb;
+};
+
+struct kcm_rx_msg {
+   int full_len;
+   int accum_len;
+   int offset;
+};
+
+/* Socket structure for KCM client sockets */
+struct kcm_sock {
+   struct sock sk;
+   struct kcm_mux *mux;
+   struct list_head kcm_sock_list;
+   int index;
+   u32 done : 1;
+   struct work_struct done_work;
+
+   /* Transmit */
+   struct kcm_psock *tx_psock;
+   struct work_struct tx_work;
+   struct list_head wait_psock_list;
+   struct sk_buff *seq_skb;
+
+   /* Don't use bit fields here, these are set under different locks */
+   bool tx_wait;
+   bool tx_wait_more;
+
+   /* Receive */
+   struct kcm_psock *rx_psock;
+   struct list_head wait_rx_list; /* KCMs waiting for receiving */
+   bool rx_wait;
+   u32 rx_disabled : 1;
+};
+
+struct bpf_prog;
+
+/* Structure for an attached lower socket */
+struct kcm_psock {
+   struct sock *sk;
+   struct kcm_mux *mux;
+   int index;
+
+   u32 tx_stopped : 1;
+   u32 rx_stopped : 1;
+   u32 done : 1;
+   u32 unattaching : 1;
+
+   void (*save_state_change)(struct sock *sk);
+   void (*save_data_ready)(struct sock *sk);
+   void (*save_write_space)(struct sock *sk);
+
+   struct list_head psock_list;
+
+   /* Receive */
+   struct sk_buff *rx_skb_head;
+   struct sk_buff **rx_skb_nextp;
+   struct sk_buff *ready_rx_msg;
+   struct list_head psock_ready_list;
+   struct work_struct rx_work;
+   struct delayed_work rx_delayed_work;
+   struct bpf_prog *bpf_prog;
+   struct kcm_sock *rx_kcm;
+
+   /* Transmit */
+   struct kcm_sock *tx_kcm;
+   struct list_head psock_avail_list;
+};
+
+/* Per net MUX list */
+struct kcm_net {
+   struct mutex mutex;
+   struct list_head mux_list;
+   int count;
+};
+
+/* Structure for a MUX */
+struct kcm_mux {
+   struct list_head kcm_mux_list;
+   struct rcu_head rcu;
+   struct kcm_net *knet;
+
+   struct list_head kcm_socks; /* All KCM sockets on MUX */
+   int kcm_socks_cnt;

[PATCH v2 net-next 06/13] tcp: Add tcp_inq to get available receive bytes on socket

2016-03-07 Thread Tom Herbert
Create a common kernel function to get the number of bytes available
on a TCP socket. This is based on the SIOCINQ ioctl code, and tcp_ioctl
now calls the common function for that case.

Signed-off-by: Tom Herbert 
---
 include/net/tcp.h | 24 
 net/ipv4/tcp.c| 15 +--
 2 files changed, 25 insertions(+), 14 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e90db85..0302636 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1816,4 +1816,28 @@ static inline void skb_set_tcp_pure_ack(struct sk_buff *skb)
skb->truesize = 2;
 }
 
+static inline int tcp_inq(struct sock *sk)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+   int answ;
+
+   if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
+   answ = 0;
+   } else if (sock_flag(sk, SOCK_URGINLINE) ||
+  !tp->urg_data ||
+  before(tp->urg_seq, tp->copied_seq) ||
+  !before(tp->urg_seq, tp->rcv_nxt)) {
+
+   answ = tp->rcv_nxt - tp->copied_seq;
+
+   /* Subtract 1, if FIN was received */
+   if (answ && sock_flag(sk, SOCK_DONE))
+   answ--;
+   } else {
+   answ = tp->urg_seq - tp->copied_seq;
+   }
+
+   return answ;
+}
+
 #endif /* _TCP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f9faadb..a265f00 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -556,20 +556,7 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
return -EINVAL;
 
slow = lock_sock_fast(sk);
-   if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV))
-   answ = 0;
-   else if (sock_flag(sk, SOCK_URGINLINE) ||
-!tp->urg_data ||
-before(tp->urg_seq, tp->copied_seq) ||
-!before(tp->urg_seq, tp->rcv_nxt)) {
-
-   answ = tp->rcv_nxt - tp->copied_seq;
-
-   /* Subtract 1, if FIN was received */
-   if (answ && sock_flag(sk, SOCK_DONE))
-   answ--;
-   } else
-   answ = tp->urg_seq - tp->copied_seq;
+   answ = tcp_inq(sk);
unlock_sock_fast(sk, slow);
break;
case SIOCATMARK:
-- 
2.6.5



[PATCH v2 net-next 03/13] net: Allow MSG_EOR in each msghdr of sendmmsg

2016-03-07 Thread Tom Herbert
This patch allows setting MSG_EOR in each individual msghdr passed
in sendmmsg. This allows a sendmmsg to send multiple messages when
using SOCK_SEQPACKET.
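
Hedged usage sketch (illustrative only; kcm_fd, buf0/buf1 and lengths are
assumed): two messages pushed to a KCM SOCK_SEQPACKET socket in one call,
each explicitly terminated:

	struct iovec iov0 = { .iov_base = buf0, .iov_len = len0 };
	struct iovec iov1 = { .iov_base = buf1, .iov_len = len1 };
	struct mmsghdr msgs[2] = {};

	msgs[0].msg_hdr.msg_iov = &iov0;
	msgs[0].msg_hdr.msg_iovlen = 1;
	msgs[0].msg_hdr.msg_flags = MSG_EOR;	/* end of first message */

	msgs[1].msg_hdr.msg_iov = &iov1;
	msgs[1].msg_hdr.msg_iovlen = 1;
	msgs[1].msg_hdr.msg_flags = MSG_EOR;	/* end of second message */

	sendmmsg(kcm_fd, msgs, 2, 0);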

Signed-off-by: Tom Herbert 
---
 net/socket.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index 38a78d4..0dd4dd8 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1875,7 +1875,8 @@ static int copy_msghdr_from_user(struct msghdr *kmsg,
 
 static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg,
 struct msghdr *msg_sys, unsigned int flags,
-struct used_address *used_address)
+struct used_address *used_address,
+unsigned int allowed_msghdr_flags)
 {
struct compat_msghdr __user *msg_compat =
(struct compat_msghdr __user *)msg;
@@ -1901,6 +1902,7 @@ static int ___sys_sendmsg(struct socket *sock, struct user_msghdr __user *msg,
 
if (msg_sys->msg_controllen > INT_MAX)
goto out_freeiov;
+   flags |= (msg_sys->msg_flags & allowed_msghdr_flags);
ctl_len = msg_sys->msg_controllen;
if ((MSG_CMSG_COMPAT & flags) && ctl_len) {
err =
@@ -1979,7 +1981,7 @@ long __sys_sendmsg(int fd, struct user_msghdr __user *msg, unsigned flags)
if (!sock)
goto out;
 
-   err = ___sys_sendmsg(sock, msg, &msg_sys, flags, NULL);
+   err = ___sys_sendmsg(sock, msg, &msg_sys, flags, NULL, 0);
 
fput_light(sock->file, fput_needed);
 out:
@@ -2024,7 +2026,7 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
while (datagrams < vlen) {
if (MSG_CMSG_COMPAT & flags) {
err = ___sys_sendmsg(sock, (struct user_msghdr __user *)compat_entry,
-    &msg_sys, flags, &used_address);
+    &msg_sys, flags, &used_address, MSG_EOR);
if (err < 0)
break;
err = __put_user(err, &compat_entry->msg_len);
@@ -2032,7 +2034,7 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
} else {
err = ___sys_sendmsg(sock,
 (struct user_msghdr __user *)entry,
-     &msg_sys, flags, &used_address);
+     &msg_sys, flags, &used_address, MSG_EOR);
if (err < 0)
break;
err = put_user(err, &entry->msg_len);
-- 
2.6.5



[PATCH v2 net-next 04/13] net: Add MSG_BATCH flag

2016-03-07 Thread Tom Herbert
Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to
indicate that more messages will follow (i.e. a batch of messages is
being sent). This is similar to MSG_MORE except that the following
messages are not merged into one packet, they are sent individually.
sendmmsg is updated so that each contained message except for the
last one is marked as MSG_BATCH.

MSG_BATCH is a performance optimization in cases where a socket
implementation can benefit by transmitting packets in a batch.
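
A hedged sketch of explicit batching with plain sendmsg (msg[] and n are
assumed; MSG_BATCH on all but the last message hints that the socket may
defer flushing to the transport):

	for (i = 0; i < n; i++) {
		int flags = (i < n - 1) ? MSG_BATCH : 0;

		sendmsg(kcm_fd, &msg[i], flags);
	}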

Signed-off-by: Tom Herbert 
---
 include/linux/socket.h | 1 +
 net/socket.c   | 5 +
 2 files changed, 6 insertions(+)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 5bf59c8..d834af2 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -274,6 +274,7 @@ struct ucred {
 #define MSG_MORE   0x8000  /* Sender will send more */
 #define MSG_WAITFORONE 0x10000 /* recvmmsg(): block until 1+ packets avail */
 #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
+#define MSG_BATCH  0x40000 /* sendmmsg(): more messages coming */
 #define MSG_EOF MSG_FIN
 
 #define MSG_FASTOPEN   0x20000000  /* Send data in TCP SYN */
diff --git a/net/socket.c b/net/socket.c
index 0dd4dd8..886649c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2008,6 +2008,7 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
struct compat_mmsghdr __user *compat_entry;
struct msghdr msg_sys;
struct used_address used_address;
+   unsigned int oflags = flags;
 
if (vlen > UIO_MAXIOV)
vlen = UIO_MAXIOV;
@@ -2022,8 +2023,12 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
entry = mmsg;
compat_entry = (struct compat_mmsghdr __user *)mmsg;
err = 0;
+   flags |= MSG_BATCH;
 
while (datagrams < vlen) {
+   if (datagrams == vlen - 1)
+   flags = oflags;
+
if (MSG_CMSG_COMPAT & flags) {
err = ___sys_sendmsg(sock, (struct user_msghdr __user *)compat_entry,
			     &msg_sys, flags, &used_address, MSG_EOR);
-- 
2.6.5
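
A hypothetical kernel-side sketch of how a protocol's sendmsg handler
could exploit the flag; example_queue_msg() and example_flush_tx() are
made-up names standing in for a protocol's queueing and transmit paths,
not existing kernel APIs:

#include <linux/socket.h>
#include <net/sock.h>

/* Illustrative helpers, assumed to exist for this sketch. */
extern int example_queue_msg(struct sock *sk, struct msghdr *msg, size_t len);
extern void example_flush_tx(struct sock *sk);

static int example_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
{
	int err = example_queue_msg(sk, msg, len);

	if (err < 0)
		return err;

	/* MSG_BATCH means sendmmsg() will deliver more messages shortly,
	 * so defer the expensive transmit kick until the batch ends. */
	if (!(msg->msg_flags & (MSG_BATCH | MSG_MORE)))
		example_flush_tx(sk);

	return len;
}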



[PATCH v2 net-next 01/13] rcu: Add list_next_or_null_rcu

2016-03-07 Thread Tom Herbert
This is a convenience function that returns the next entry in an RCU
list or NULL if at the end of the list.

Signed-off-by: Tom Herbert 
---
 include/linux/rculist.h | 21 +
 1 file changed, 21 insertions(+)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 14ec165..17d4f84 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -319,6 +319,27 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
 })
 
 /**
+ * list_next_or_null_rcu - get the next element from a list
+ * @head:   the head for the list.
+ * @ptr:    the list head to take the next element from.
+ * @type:   the type of the struct this is embedded in.
+ * @member: the name of the list_head within the struct.
+ *
+ * Note that if the ptr is at the end of the list, NULL is returned.
+ *
+ * This primitive may safely run concurrently with the _rcu list-mutation
+ * primitives such as list_add_rcu() as long as it's guarded by
+ * rcu_read_lock().
+ */
+#define list_next_or_null_rcu(head, ptr, type, member) \
+({ \
+   struct list_head *__head = (head); \
+   struct list_head *__ptr = (ptr); \
+   struct list_head *__next = READ_ONCE(__ptr->next); \
+   likely(__next != __head) ? list_entry_rcu(__next, type, \
+ member) : NULL; \
+})
+
+/**
  * list_for_each_entry_rcu -   iterate over rcu list of given type
  * @pos:   the type * to use as a loop cursor.
  * @head:  the head for your list.
-- 
2.6.5
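
A minimal usage sketch, assuming an illustrative item type linked into
an RCU-protected list:

#include <linux/rculist.h>

struct item {
	int val;
	struct list_head node;	/* linked into an RCU-protected list */
};

/* Return the item following cur on head, or NULL at the end of the
 * list. Must be called under rcu_read_lock(). */
static struct item *item_next(struct list_head *head, struct item *cur)
{
	return list_next_or_null_rcu(head, &cur->node, struct item, node);
}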



[PATCH v2 net-next 02/13] net: Make sock_alloc exportable

2016-03-07 Thread Tom Herbert
Export it for cases where we want to create sockets by hand.

Signed-off-by: Tom Herbert 
---
 include/linux/net.h | 1 +
 net/socket.c| 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 0b4ac7d..49175e4 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -215,6 +215,7 @@ int __sock_create(struct net *net, int family, int type, int proto,
 int sock_create(int family, int type, int proto, struct socket **res);
int sock_create_kern(struct net *net, int family, int type, int proto, struct socket **res);
 int sock_create_lite(int family, int type, int proto, struct socket **res);
+struct socket *sock_alloc(void);
 void sock_release(struct socket *sock);
 int sock_sendmsg(struct socket *sock, struct msghdr *msg);
 int sock_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
diff --git a/net/socket.c b/net/socket.c
index c044d1e..38a78d4 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -533,7 +533,7 @@ static const struct inode_operations sockfs_inode_ops = {
  * NULL is returned.
  */
 
-static struct socket *sock_alloc(void)
+struct socket *sock_alloc(void)
 {
struct inode *inode;
struct socket *sock;
@@ -554,6 +554,7 @@ static struct socket *sock_alloc(void)
this_cpu_add(sockets_in_use, 1);
return sock;
 }
+EXPORT_SYMBOL(sock_alloc);
 
 /**
  * sock_release-   close a socket
-- 
2.6.5
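
A minimal sketch of the intended use, building a socket by hand;
example_ops and the socket type are assumptions, and a real caller
would still have to allocate a struct sock and attach it:

#include <linux/net.h>
#include <linux/err.h>

static struct socket *example_new_socket(const struct proto_ops *example_ops)
{
	struct socket *sock = sock_alloc();	/* returns NULL on failure */

	if (!sock)
		return ERR_PTR(-ENFILE);

	sock->type = SOCK_SEQPACKET;	/* assumption: record-oriented use */
	sock->ops = example_ops;

	return sock;
}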


