from:"Herbert Xu"

Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-20 Thread Herbert Xu

On Tue, Nov 20, 2018 at 10:18:17AM -0800, Eric Dumazet wrote:
>
> > Something like this?  Is it safe to linearize here?

It looks safe to me.  It's only unsafe if your skb is shared which
from my grepping does not appear to be the case (and it cannot be
shared if you're modifying skb->csum which all the callers that
I found were doing for obvious reasons).

If it's cloned skb_linearize will unclone it.

> I guess we should dump from skb->head instead, to show all headers.
>
> Maybe dump the mac header offset as well, so that we can ignore the
> padding bytes (NET_SKB_PAD) if needed.

Yes I agree.  It doesn't hurt to dump more data.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-19 Thread Herbert Xu

On Mon, Nov 19, 2018 at 11:25:47AM -0800, Cong Wang wrote:
> 
> Hmm, it calls csum_tcpudp_magic() directly instead of
> __skb_checksum_validate_complete(), but it also sets ip_summed
> to CHECKSUM_UNNECESSARY:
> 
>  20 if ((protocol == 0 && !csum_fold(skb->csum)) ||
>  21 !csum_tcpudp_magic(iph->saddr, iph->daddr,
>  22skb->len - dataoff, protocol,
>  23skb->csum)) {
>  24 skb->ip_summed = CHECKSUM_UNNECESSARY;
>  25 break;
>  26 }
> 
> which means the rx checksum fault won't be triggered.

This is just the case where the hardware checksum is valid.

For the other case, it just ignores the hardware checksum.  Perhaps
it should call rx checksum fault if the checksum turns out to be
correct in that case but it's not really a big deal.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH] rhashtable: detect when object movement between tables might have invalidated a lookup

2018-11-18 Thread Herbert Xu

On Mon, Nov 19, 2018 at 11:56:35AM +0800, Herbert Xu wrote:
>
> I take that back.  Because of your shift which cancels out the
> shift in NULLS_MARKER, it would appear that this should work just
> fine with RHT_NULLS_MARRKER(0), no? IOW, it would appear that
> 
>   RHT_NULLS_MARKER(0) = RHT_NULLS_MARKER(RHT_NULLS_MARKER(0))

My emails to Neil are bouncing:

ne...@suse.com
  host smtp.glb1.softwaregrp.com [15.124.2.87]
  SMTP error from remote mail server after RCPT TO::
  550 Cannot process address

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-18 Thread Herbert Xu

On Fri, Nov 16, 2018 at 01:32:50PM -0800, Cong Wang wrote:
>
> This is true only when there is a skb_checksum_init*() or
> skb_checksum_validate*() prior to it, it seems not true for
> nf_ip_checksum() where skb->csum is correctly set to pesudo header
> checksum but there is no validation of the original skb->csum.
> So this check should be still inverted there??
> 
> Or am I still missing anything here?

What do you mean? My copy of nf_ip_checksum seems to be doing the
right thing as far as verifying CHECKSUM_COMPLETED goes.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH] rhashtable: detect when object movement between tables might have invalidated a lookup

2018-11-18 Thread Herbert Xu

On Mon, Nov 19, 2018 at 11:54:15AM +0800, Herbert Xu wrote:
>
> > >> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> > >> index 30526afa8343..852ffa5160f1 100644
> > >> --- a/lib/rhashtable.c
> > >> +++ b/lib/rhashtable.c
> > >> @@ -1179,8 +1179,7 @@ struct rhash_head __rcu **rht_bucket_nested(const 
> > >> struct bucket_table *tbl,
> > >>  unsigned int hash)
> > >>  {
> > >>  const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *));
> > >> -static struct rhash_head __rcu *rhnull =
> > >> -(struct rhash_head __rcu *)NULLS_MARKER(0);
> > >> +static struct rhash_head __rcu *rhnull;
> > >
> > > I don't understand why you can't continue to do NULLS_MARKER(0) or
> > > RHT_NULLS_MARKER(0).
> > 
> > Because then the test
> > 
> > +   } while (he != RHT_NULLS_MARKER(head));
> > 
> > in __rhashtable_lookup() would always succeed, and it would loop
> > forever.
> 
> This change is only necessary because of your shifting change
> above, which AFAICS adds no real benefit.

I take that back.  Because of your shift which cancels out the
shift in NULLS_MARKER, it would appear that this should work just
fine with RHT_NULLS_MARRKER(0), no? IOW, it would appear that

RHT_NULLS_MARKER(0) = RHT_NULLS_MARKER(RHT_NULLS_MARKER(0))

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH] rhashtable: detect when object movement between tables might have invalidated a lookup

2018-11-18 Thread Herbert Xu

On Fri, Nov 16, 2018 at 05:59:19PM +1100, NeilBrown wrote:
>
> NULLS_MARKER assumes a hash value in which the bottom bits are most
> likely to be unique.  To convert this to a pointer which certainly not
> valid, it shifts left by 1 and sets the lsb.
> We aren't passing a hash value, but are passing an address instead.
> In this case the bottom 2 bits are certain to be 0, and the top bit
> could contain valuable information (on a 32bit system).
> The best way to turn a pointer into a certainly-invalid pointer
> is to just set the lsb.  By shifting right by one, we discard an
> uninteresting bit, preserve all the interesting bits, and effectively
> just set the lsb.
> 
> I could add a comment explaining that if you like.

The top-bit is most likely to be fixed and offer no real value.

> >> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> >> index 30526afa8343..852ffa5160f1 100644
> >> --- a/lib/rhashtable.c
> >> +++ b/lib/rhashtable.c
> >> @@ -1179,8 +1179,7 @@ struct rhash_head __rcu **rht_bucket_nested(const 
> >> struct bucket_table *tbl,
> >>unsigned int hash)
> >>  {
> >>const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *));
> >> -  static struct rhash_head __rcu *rhnull =
> >> -  (struct rhash_head __rcu *)NULLS_MARKER(0);
> >> +  static struct rhash_head __rcu *rhnull;
> >
> > I don't understand why you can't continue to do NULLS_MARKER(0) or
> > RHT_NULLS_MARKER(0).
> 
> Because then the test
> 
> + } while (he != RHT_NULLS_MARKER(head));
> 
> in __rhashtable_lookup() would always succeed, and it would loop
> forever.

This change is only necessary because of your shifting change
above, which AFAICS adds no real benefit.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-15 Thread Herbert Xu

On Thu, Nov 15, 2018 at 08:52:23PM -0800, Eric Dumazet wrote:
>
> It is very possible NIC provides an incorrect CHECKSUM_COMPLETE, in the
> case non zero trailer bytes were added by a buggy switch (or host)

We should probably change netdev_rx_csum_fault to print out at
least one complete packet plus the hardware-generated checksum.

That would make debugging these rare hardware faults much easier.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-15 Thread Herbert Xu

On Thu, Nov 15, 2018 at 06:23:38PM -0800, Cong Wang wrote:
>
> > Normally if the hardware's partial checksum is valid then we just
> > trust it and send the packet along.  However, if the partial
> > checksum is invalid we don't trust it and we will compute the
> > whole checksum manually which is what ends up in sum.
> 
> Not sure if I understand partial checksum here, but it is the
> CHECKSUM_COMPLETE case which I am trying to fix, not
> CHECKSUM_PARTIAL.

What I meant by partial checksum is the checksum produced by the
hardware on RX.  In the kernel we call that CHECKSUM_COMPLETE.
CHECKSUM_PARTIAL is the absence of the substantial part of the
checksum which is something we use in the kernel primarily for TX.

Yes the names are confusing :)

> So, in other word, a checksum *match* is the intended to detect
> this HW RX checksum fault?

Correct.  Or more likely it's probably a bug in either the driver
or if there are overlaying code such as VLAN then in that code.

Basically if the RX checksum is buggy, it's much more likely to
cause a valid packet to be rejected than to cause an invalid packet
to be accepted, because we still verify that checksum against the
pseudoheader.  So we only attempt to catch buggy hardware/drivers
by doing a second manual verification for the case where the packet
is flagged as invalid.

> Sure, my case is nearly same with Pawel's, except I have no vlan:
> https://marc.info/?l=linux-netdev=154086647601721=2

Can you please provide your backtrace?

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-15 Thread Herbert Xu

On Thu, Nov 15, 2018 at 03:16:02PM -0800, Cong Wang wrote:
> The following evidences indicate this check is likely wrong:
> 
> 1. In the assignment "skb->csum_valid = !sum", sum==0 indicates a valid 
> checksum.
> 
> 2. __skb_checksum_complete() always returns sum, and TCP packets are dropped
>only when it returns non-zero. So non-zero indicates a failure.
> 
> 3. In __skb_checksum_validate_complete(), we have a nearly same check, where
>zero is considered as success.
> 
> 4. csum_fold() already does the one’s complement, this indicates 0 should
>be considered as a successful validation.
> 
> 5. We have triggered this fault for many times, but InCsumErrors field in
>/proc/net/snmp remains 0.
> 
> Base on the above, non-zero should be used as a checksum mismatch.
> 
> I tested this with mlx5 driver, no warning or InCsumErrors after 1 hour.
> 
> Fixes: fb286bb2990a ("[NET]: Detect hardware rx checksum faults correctly")
> Cc: Herbert Xu 
> Cc: Tom Herbert 
> Cc: Eric Dumazet 
> Signed-off-by: Cong Wang 
> ---
>  net/core/datagram.c | 4 ++--
>  net/core/dev.c  | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index 57f3a6fcfc1e..e542a9a212a7 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -733,7 +733,7 @@ __sum16 __skb_checksum_complete_head(struct sk_buff *skb, 
> int len)
>   __sum16 sum;
>  
>   sum = csum_fold(skb_checksum(skb, 0, len, skb->csum));
> - if (likely(!sum)) {
> + if (unlikely(sum)) {
>   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
>   !skb->csum_complete_sw)
>   netdev_rx_csum_fault(skb->dev);

Normally if the hardware's partial checksum is valid then we just
trust it and send the packet along.  However, if the partial
checksum is invalid we don't trust it and we will compute the
whole checksum manually which is what ends up in sum.

netdev_rx_csum_fault is meant to warn about the situation where
a packet with a valid checksum (i.e., sum == 0) was given to us
by the hardware with a partial checksum that was invalid.

So changing it to sum here is wrong.

Can you give more information as to how you got the warnings with
mlx5? It sounds like there may be a real bug there because if you
are getting the warning then it means that a packet with an invalid
hardware-computed partial checksum passed the manual check and
was actually valid.  This implies that either the hardware or the
driver is broken.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH] net-xfrm: add build time cfg option to PF_KEY SHA256 to use RFC4868-compliant truncation

2018-10-17 Thread Herbert Xu

Maciej Żenczykowski  wrote:
>
> +#if IS_ENABLED(CONFIG_XFRM_HMAC_SHA256_RFC4868)
> +   .icv_truncbits = 128,
> +#else
>.icv_truncbits = 96,
> +#endif

Nack.  We don't want a build-time configuration knob for this.
This needs to be decided at run-time.

In fact you can already do this at run-time anyway through the
xfrm interface.  So please please please just ditch whatever that
you're using that's still glued to the long-obsolete (more than a
decade) af_key interface and switch it over to xfrm.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH net] inet: frags: rework rhashtable dismantle

2018-10-01 Thread Herbert Xu

On Mon, Oct 01, 2018 at 10:58:21AM -0700, Eric Dumazet wrote:
>
>  void inet_frags_exit_net(struct netns_frags *nf)
>  {
> + struct rhashtable_iter hti;
> + struct inet_frag_queue *fq;
> +
> + /* Since we want to cleanup the hashtable, make sure that
> +  * we wont trigger an automatic shrinking while in our
> +  * rhashtable_walk_next() loop.
> +  * Also make sure that no resize is in progress.
> +  */
>   nf->high_thresh = 0; /* prevent creation of new frags */
> + nf->rhashtable.p.automatic_shrinking = false;
> + cancel_work_sync(>rhashtable.run_work);
>  
> - rhashtable_free_and_destroy(>rhashtable, inet_frags_free_cb, NULL);
> + rhashtable_walk_enter(>rhashtable, );
> + rhashtable_walk_start();
> + while ((fq = rhashtable_walk_next()) != NULL) {
> + if (IS_ERR(fq)) /* should not happen */
> + break;
> + if (!del_timer_sync(>timer))
> + continue;
> +
> + spin_lock_bh(>lock);
> + inet_frag_kill(fq);
> + spin_unlock_bh(>lock);
> +
> + inet_frag_put(fq);
> + if (need_resched()) {
> + rhashtable_walk_stop();
> + cond_resched();
> + rhashtable_walk_start();
> + }
> + }

The walk interface was designed to handle read-only iteration
through the hash table.  While this probably works since the
actual freeing is delayed by RCU, it seems to be rather fragile.

How about using the dead flag but instead of putting it in the
rhashtable put it in netns_frags and have the timers check on that
before calling rhashtable_remove?

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH net] rhashtable: prevent work queue schedule while dismantling

2018-10-01 Thread Herbert Xu

On Mon, Oct 01, 2018 at 06:16:27AM -0700, Eric Dumazet wrote:
> syszbot found an interesting use-after-free [1] happening
> while IPv4 fragment rhashtable was destroyed at netns dismantle.
> 
> While no insertions can possibly happen at the time a dismantling
> netns is destroying this rhashtable, timers can still fire and
> attempt to remove elements from this rhashtable.

Hmm, I think that's your real problem.  rhashtable_free_and_destroy
doesn't take any locks with respect to the normal insertion/removal
path so it definitely isn't safe to call it while you're still
invoking the normal rhashtable remove function.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH - revised] rhashtable: detect when object movement might have invalidated a lookup

2018-07-15 Thread Herbert Xu

On Mon, Jul 16, 2018 at 11:23:43AM +1000, NeilBrown wrote:
>
> kmem_cache_free() directly.  For this, I need rhashtable to be safe if
> an object is deleted and immediately re-inserted into the same hash
> chain.

This means that

rcu_read_lock();
A = rhashtable_lookup();
use(A);
rcu_read_unlock();

A can turn into object B when it is used.  That is just too strange
for words.  Can we see some actual code on how this works?

For comparison, the existing net code where this happens A doesn't
actually change and it simply moves from one hashtable to another.

I'm relucant to add semantics that would restrain on how rhashtable
works unless we have real and valid use-cases for it.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v2] rhashtable: add restart routine in rhashtable_free_and_destroy()

2018-07-08 Thread Herbert Xu

On Sun, Jul 08, 2018 at 11:55:51AM +0900, Taehee Yoo wrote:
> rhashtable_free_and_destroy() cancels re-hash deferred work
> then walks and destroys elements. at this moment, some elements can be
> still in future_tbl. that elements are not destroyed.
> 
> test case:
> nft_rhash_destroy() calls rhashtable_free_and_destroy() to destroy
> all elements of sets before destroying sets and chains.
> But rhashtable_free_and_destroy() doesn't destroy elements of future_tbl.
> so that splat occurred.
> 
> test script:
>%cat test.nft
>table ip aa {
>  map map1 {
>  type ipv4_addr : verdict;
>  elements = {
>  0 : jump a0,
>  1 : jump a0,
>  2 : jump a0,
>  3 : jump a0,
>  4 : jump a0,
>  5 : jump a0,
>  6 : jump a0,
>  7 : jump a0,
>  8 : jump a0,
>  9 : jump a0,
>   }
>  }
>  chain a0 {
>  }
>}
>flush ruleset
>table ip aa {
>  map map1 {
>  type ipv4_addr : verdict;
>  elements = {
>  0 : jump a0,
>  1 : jump a0,
>  2 : jump a0,
>  3 : jump a0,
>  4 : jump a0,
>  5 : jump a0,
>  6 : jump a0,
>  7 : jump a0,
>  8 : jump a0,
>  9 : jump a0,
>  }
>  }
>  chain a0 {
>  }
>}
>flush ruleset
> 
>%while :; do nft -f test.nft; done
> 
> Splat looks like:
> [  200.795603] kernel BUG at net/netfilter/nf_tables_api.c:1363!
> [  200.806944] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
> [  200.812253] CPU: 1 PID: 1582 Comm: nft Not tainted 4.17.0+ #24
> [  200.820297] Hardware name: To be filled by O.E.M. To be filled by 
> O.E.M./Aptio CRB, BIOS 5.6.5 07/08/2015
> [  200.830309] RIP: 0010:nf_tables_chain_destroy.isra.34+0x62/0x240 
> [nf_tables]
> [  200.838317] Code: 43 50 85 c0 74 26 48 8b 45 00 48 8b 4d 08 ba 54 05 00 00 
> 48 c7 c6 60 6d 29 c0 48 c7 c7 c0 65 29 c0 4c 8b 40 08 e8 58 e5 fd f8 <0f> 0b 
> 48 89 da 48 b8 00 00 00 00 00 fc ff
> [  200.860366] RSP: :880118dbf4d0 EFLAGS: 00010282
> [  200.866354] RAX: 0061 RBX: 88010cdeaf08 RCX: 
> 
> [  200.874355] RDX: 0061 RSI: 0008 RDI: 
> ed00231b7e90
> [  200.882361] RBP: 880118dbf4e8 R08: ed002373bcfb R09: 
> ed002373bcfa
> [  200.890354] R10:  R11: ed002373bcfb R12: 
> dead0200
> [  200.898356] R13: dead0100 R14: bb62af38 R15: 
> dc00
> [  200.906354] FS:  7fefc31fd700() GS:88011b80() 
> knlGS:
> [  200.915533] CS:  0010 DS:  ES:  CR0: 80050033
> [  200.922355] CR2: 557f1c8e9128 CR3: 00010688 CR4: 
> 001006e0
> [  200.930353] Call Trace:
> [  200.932351]  ? nf_tables_commit+0x26f6/0x2c60 [nf_tables]
> [  200.939525]  ? nf_tables_setelem_notify.constprop.49+0x1a0/0x1a0 
> [nf_tables]
> [  200.947525]  ? nf_tables_delchain+0x6e0/0x6e0 [nf_tables]
> [  200.952383]  ? nft_add_set_elem+0x1700/0x1700 [nf_tables]
> [  200.959532]  ? nla_parse+0xab/0x230
> [  200.963529]  ? nfnetlink_rcv_batch+0xd06/0x10d0 [nfnetlink]
> [  200.968384]  ? nfnetlink_net_init+0x130/0x130 [nfnetlink]
> [  200.975525]  ? debug_show_all_locks+0x290/0x290
> [  200.980363]  ? debug_show_all_locks+0x290/0x290
> [  200.986356]  ? sched_clock_cpu+0x132/0x170
> [  200.990352]  ? find_held_lock+0x39/0x1b0
> [  200.994355]  ? sched_clock_local+0x10d/0x130
> [  200.999531]  ? memset+0x1f/0x40
> 
> V2:
>  - free all tables requested by Herbert Xu
> 
> Signed-off-by: Taehee Yoo 

Acked-by: Herbert Xu 

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH] rhashtable: add restart routine in rhashtable_free_and_destroy()

2018-07-03 Thread Herbert Xu

On Tue, Jul 03, 2018 at 10:19:09PM +0900, Taehee Yoo wrote:
>
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 0e04947..8ea27fa 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -1134,6 +1134,7 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
>   mutex_lock(>mutex);
>   tbl = rht_dereference(ht->tbl, ht);
>   if (free_fn) {
> +restart:
>   for (i = 0; i < tbl->size; i++) {
>   struct rhash_head *pos, *next;
>  
> @@ -1147,9 +1148,11 @@ void rhashtable_free_and_destroy(struct rhashtable *ht,
>   rht_dereference(pos->next, ht) : NULL)
>   rhashtable_free_one(ht, pos, free_fn, arg);
>   }
> + tbl = rht_dereference(tbl->future_tbl, ht);
> + if (tbl)
> + goto restart;
>   }
> -
> - bucket_table_free(tbl);
> + bucket_table_free(rht_dereference(ht->tbl, ht));

Good catch.  But don't we need to call bucket_table_free on all
the tables too rather than just the first table?

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH] crypto: chtls: generic handling of data and hdr

2018-05-26 Thread Herbert Xu

On Mon, May 14, 2018 at 04:41:38PM +0530, Atul Gupta wrote:
> removed redundant check and made TLS PDU and header recv
> handling common as received from HW.
> Ensure that only tls header is read in cpl_rx_tls_cmp
> read-ahead and skb is freed when entire data is processed.
> 
> Signed-off-by: Atul Gupta <atul.gu...@chelsio.com>
> Signed-off-by: Harsh Jain <ha...@chelsio.com>

Patch applied.  Thanks.
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH] lib/rhashtable: reorder some inititalization sequences

2018-05-14 Thread Herbert Xu

On Mon, May 14, 2018 at 10:52:13PM -0400, David Miller wrote:
> From: Davidlohr Bueso <d...@stgolabs.net>
> Date: Mon, 14 May 2018 08:13:32 -0700
> 
> > rhashtable_init() allocates memory at the very end of the
> > call, once everything is setup; with the exception of the
> > nelems parameter. However, unless the user is doing something
> > bogus with params for which -EINVAL is returned, memory
> > allocation is the only operation that can trigger the call
> > to fail.
> > 
> > Thus move bucket_table_alloc() up such that we fail back to
> > the caller asap, instead of doing useless checks. This is
> > safe as the the table allocation isn't using the halfly
> > setup 'ht' structure and bucket_table_alloc() call chain only
> > ends up using the ht->nulls_base member in INIT_RHT_NULLS_HEAD.
> > 
> > Also move the locking initialization down to the end.
> > 
> > Signed-off-by: Davidlohr Bueso <dbu...@suse.de>
> 
> The user potentially "doing something bogus" is why the most
> expensive part of the initialization (the memory allocation)
> is done after everything else is validated.
> 
> I think it's best to keep things as-is.

I agree.

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 6/8] rhashtable: further improve stability of rhashtable_walk

2018-05-07 Thread Herbert Xu

On Sun, May 06, 2018 at 07:50:54AM +1000, NeilBrown wrote:
>
> Do we?  How could we fix it for both rhashtable and rhltable?

Well I suggested earlier to insert the walker object into the
hash table, which would be applicable regardless of whether it
is a normal rhashtable of a rhltable.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 8/8] rhashtable: don't hold lock on first table throughout insertion.

2018-05-07 Thread Herbert Xu

On Mon, May 07, 2018 at 08:24:41AM +1000, NeilBrown wrote:
>
> This is true, but I don't see how it is relevant.
> At some point, each thread will find that the table they have just
> locked for their search key, has a NULL 'future_tbl' pointer.
> At the point, the thread can know that the key is not in any table,
> and that no other thread can add the key until the lock is released.

The updating of future_tbl is not synchronised with insert threads.
Therefore it is entirely possible that two inserters end up on
different tables as their "latest" table.  This must not be allowed
to occur.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 2/8] rhashtable: remove nulls_base and related code.

2018-05-07 Thread Herbert Xu

On Sun, May 06, 2018 at 07:37:49AM +1000, NeilBrown wrote:
> 
> I can see no evidence that this is required for anything, as it isn't
> use and I'm fairly sure that in it's current form - it cannot be used.

Search for nulls in net/ipv4.  This is definitely used throughout
the network stack.  As the aim is to convert as many existing hash
tables to rhashtable as possible, we want to keep this.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 8/8] rhashtable: don't hold lock on first table throughout insertion.

2018-05-05 Thread Herbert Xu

On Sun, May 06, 2018 at 08:00:49AM +1000, NeilBrown wrote:
>
> The insert function must (and does) take the lock on the bucket before
> testing if there is a "next" table.
> If one inserter finds that it has locked the "last" table (because there
> is no next) and successfully inserts, then the other inserter cannot
> have locked that table yet, else it would have inserted.  When it does,
> it will find what the first inserter inserted. 

If you release the lock to the first table then it may be deleted
by the resize thread.  Hence the other inserter may not have even
started from the same place.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 4/8] rhashtable: fix race in nested_table_alloc()

2018-05-05 Thread Herbert Xu

On Sun, May 06, 2018 at 07:48:20AM +1000, NeilBrown wrote:
>
> The spinlock protects 2 or more buckets.  The nested table contains at
> least 512 buckets, maybe more.
> It is quite possible for two insertions into 2 different buckets to both
> get their spinlock and both try to instantiate the same nested table.

I think you missed the fact that when we use nested tables the spin
lock table is limited to just a single page and hence corresponds
to the first level in the nested table.  Therefore it's always safe.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 6/8] rhashtable: further improve stability of rhashtable_walk

2018-05-05 Thread Herbert Xu

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> If the sequence:
>obj = rhashtable_walk_next(iter);
>rhashtable_walk_stop(iter);
>rhashtable_remove_fast(ht, >head, params);
>rhashtable_walk_start(iter);
> 
>  races with another thread inserting or removing
>  an object on the same hash chain, a subsequent
>  rhashtable_walk_next() is not guaranteed to get the "next"
>  object. It is possible that an object could be
>  repeated, or missed.
> 
>  This can be made more reliable by keeping the objects in a hash chain
>  sorted by memory address.  A subsequent rhashtable_walk_next()
>  call can reliably find the correct position in the list, and thus
>  find the 'next' object.
> 
>  It is not possible (certainly not so easy) to achieve this with an
>  rhltable as keeping the hash chain in order is not so easy.  When the
>  first object with a given key is removed, it is replaced in the chain
>  with the next object with the same key, and the address of that
>  object may not be correctly ordered.
>  No current user of rhltable_walk_enter() calls
>  rhashtable_walk_start() more than once, so no current code
>  could benefit from a more reliable walk of rhltables.
> 
>  This patch only attempts to improve walks for rhashtables.
>  - a new object is always inserted after the last object with a
>smaller address, or at the start
>  - when rhashtable_walk_start() is called, it records that 'p' is not
>'safe', meaning that it cannot be dereferenced.  The revalidation
>that was previously done here is moved to rhashtable_walk_next()
>  - when rhashtable_walk_next() is called while p is not NULL and not
>safe, it walks the chain looking for the first object with an
>address greater than p and returns that.  If there is none, it moves
>to the next hash chain.
> 
> Signed-off-by: NeilBrown <ne...@suse.com>

I'm a bit torn on this.  On the hand this is definitely an improvement
over the status quo.  On the other this does not work on rhltable and
we do have a way of fixing it for both rhashtable and rhltable.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 8/8] rhashtable: don't hold lock on first table throughout insertion.

2018-05-05 Thread Herbert Xu

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> rhashtable_try_insert() currently hold a lock on the bucket in
> the first table, while also locking buckets in subsequent tables.
> This is unnecessary and looks like a hold-over from some earlier
> version of the implementation.
> 
> As insert and remove always lock a bucket in each table in turn, and
> as insert only inserts in the final table, there cannot be any races
> that are not covered by simply locking a bucket in each table in turn.
> 
> When an insert call reaches that last table it can be sure that there
> is no match entry in any other table as it has searched them all, and
> insertion never happens anywhere but in the last table.  The fact that
> code tests for the existence of future_tbl while holding a lock on
> the relevant bucket ensures that two threads inserting the same key
> will make compatible decisions about which is the "last" table.
> 
> This simplifies the code and allows the ->rehash field to be
> discarded.
> We still need a way to ensure that a dead bucket_table is never
> re-linked by rhashtable_walk_stop().  This can be achieved by
> setting the ->size to 1.  This still allows lookup code to work (and
> correctly not find anything) but can never happen on an active bucket
> table (as the minimum size is 4).
> 
> Signed-off-by: NeilBrown <ne...@suse.com>

I'm not convinced this is safe.  There can be multiple tables
in existence.  Unless you hold the lock on the first table, what
is to stop two parallel inserters each from inserting into their
"last" tables which may not be the same table?

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 5/8] rhashtable: remove rhashtable_walk_peek()

2018-05-05 Thread Herbert Xu

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> This function has a somewhat confused behavior that is not properly
> described by the documentation.
> Sometimes is returns the previous object, sometimes it returns the
> next one.
> Sometimes it changes the iterator, sometimes it doesn't.
> 
> This function is not currently used and is not worth keeping, so
> remove it.
> 
> Signed-off-by: NeilBrown <ne...@suse.com>

I don't have a problem with this.  But it would be good to hear
from Tom Herbert who added this.
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 4/8] rhashtable: fix race in nested_table_alloc()

2018-05-05 Thread Herbert Xu

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> If two threads run nested_table_alloc() at the same time
> they could both allocate a new table.
> Best case is that one of them will never be freed, leaking memory.
> Worst case is hat entry get stored there before it leaks,
> and the are lost from the table.
> 
> So use cmpxchg to detect the race and free the unused table.
> 
> Fixes: da20420f83ea ("rhashtable: Add nested tables")
> Cc: sta...@vger.kernel.org # 4.11+
> Signed-off-by: NeilBrown <ne...@suse.com>

What about the spinlock that's meant to be held around this
operation?

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 3/8] rhashtable: use cmpxchg() to protect ->future_tbl.

2018-05-05 Thread Herbert Xu

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> Rather than borrowing one of the bucket locks to
> protect ->future_tbl updates, use cmpxchg().
> This gives more freedom to change how bucket locking
> is implemented.
> 
> Signed-off-by: NeilBrown <ne...@suse.com>

This looks nice.

> - spin_unlock_bh(old_tbl->locks);
> + rcu_assign_pointer(tmp, new_tbl);

Do we need this barrier since cmpxchg is supposed to provide memory
barrier semantics?

> + if (cmpxchg(_tbl->future_tbl, NULL, tmp) != NULL)
> +     return -EEXIST;

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 2/8] rhashtable: remove nulls_base and related code.

2018-05-05 Thread Herbert Xu

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> This "feature" is unused, undocumented, and untested and so
> doesn't really belong.  If a use for the nulls marker
> is found, all this code would need to be reviewed to
> ensure it works as required.  It would be just as easy to
> just add the code if/when it is needed instead.
> 
> This patch actually fixes a bug too.  The table resizing allows a
> table to grow to 2^31 buckets, but the hash is truncated to 27 bits -
> any growth beyond 2^27 is wasteful an ineffective.
> 
> This patch result in NULLS_MARKER(0) being used for all chains,
> and leave the use of rht_is_a_null() to test for it.
> 
> Signed-off-by: NeilBrown <ne...@suse.com>

I disagree.  This is a fundamental requirement for the use of
rhashtable in certain networking systems such as TCP/UDP.  So
we know that there will be a use for this.

As to the bug fix, please separate it out of the patch and resubmit.

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 1/8] rhashtable: silence RCU warning in rhashtable_test.

2018-05-05 Thread Herbert Xu

On Fri, May 04, 2018 at 01:54:14PM +1000, NeilBrown wrote:
> print_ht in rhashtable_test calls rht_dereference() with neither
> RCU protection or the mutex.  This triggers an RCU warning.
> So take the mutex to silence the warning.
> 
> Signed-off-by: NeilBrown <ne...@suse.com>

I don't think the mutex is actually needed but since we don't
have rht_dereference_raw and this is just test code:

Acked-by: Herbert Xu <herb...@gondor.apana.org.au>
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH] net/xfrm: Fix lookups for states with spi == 0

2018-05-02 Thread Herbert Xu

On Wed, May 02, 2018 at 01:41:36PM +0100, Dmitry Safonov wrote:
>
> But still it's possible to create ipsec with zero SPI.
> And it seems not making sense to search for a state with SPI hash if
> request has zero SPI.

Fair enough.  In fact a zero SPI is legal and defined for IPcomp.

The bug arose from this patch:

commit 7b4dc3600e4877178ba94c7fbf7e520421378aa6
Author: Masahide NAKAMURA <na...@linux-ipv6.org>
Date:   Wed Sep 27 22:21:52 2006 -0700

[XFRM]: Do not add a state whose SPI is zero to the SPI hash.

SPI=0 is used for acquired IPsec SA and MIPv6 RO state.
Such state should not be added to the SPI hash
because we do not care about it on deleting path.

Signed-off-by: Masahide NAKAMURA <na...@linux-ipv6.org>
Signed-off-by: YOSHIFUJI Hideaki <yoshf...@linux-ipv6.org>

I think it would be better to revert this.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH] net/xfrm: Fix lookups for states with spi == 0

2018-05-02 Thread Herbert Xu

On Wed, May 02, 2018 at 03:02:20AM +0100, Dmitry Safonov wrote:
> It seems to be a valid use case to add xfrm state without
> Security Parameter Indexes (SPI) value associated:
> ip xfrm state add src $src dst $dst proto $proto mode $mode sel src $src dst 
> $dst $algo
> 
> The bad thing is that it's currently impossible to get/delete the state
> without SPI: __xfrm_state_insert() obviously doesn't add hash for zero
> SPI in xfrm.state_byspi, and xfrm_user_state_lookup() will fail as
> xfrm_state_lookup() does lookups by hash.
> 
> It also isn't possible to workaround from userspace as
> xfrm_id_proto_match() will be always true for ah/esp/comp protos.
> 
> So, don't try looking up by hash if SPI == 0.
> 
> Cc: Steffen Klassert <steffen.klass...@secunet.com>
> Cc: Herbert Xu <herb...@gondor.apana.org.au>
> Cc: "David S. Miller" <da...@davemloft.net>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Dmitry Safonov <d...@arista.com>

A zero SPI is illegal for many IPsec protocols because that value
is used for other purposes, e.g., IKE encapsulation.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 6/6] rhashtable: add rhashtable_walk_prev()

2018-04-23 Thread Herbert Xu

On Thu, Apr 19, 2018 at 09:08:07AM +1000, NeilBrown wrote:
> 
> The need is essentially the same as the need for
> rhashtable_walk_peek().  The difference is that the actual behaviour can
> be documented briefly (and so understood easily) without enumerating
> multiple special cases.
> rhashtable_walk_peek() is much the same as
>  rhashtable_walk_prev() ?: rhashtable_walk_next()
> 
> The need arises when you need to "stop" a walk and then restart at the
> same object, not the next one. i.e. the last object returned before the
> "stop" wasn't acted on.
> This happens with seq_file if the buffer space for ->show() is not
> sufficient to format an object.  In the case, seq_file will stop() the
> iteration, make more space available (either by flushing or by
> reallocing) and will start again at the same location.
> If the seq_file client stored the rhashtable_iter in the seq_file
> private data, it can use rhasthable_walk_prev() to recover its position
> if that object is still in the hash table.  If it isn't still present,
> rhashtable_walk_next() can be used to get the next one.  In some cases
> it can be useful for the client to know whether it got the previous one
> or not.

I see.  I'm OK with this provided that you will also remove/convert
users of rhashtable_walk_peek.

I don't think we need two functions that do almost the same thing.

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 2/6] rhashtable: remove incorrect comment on r{hl, hash}table_walk_enter()

2018-04-23 Thread Herbert Xu

On Mon, Apr 23, 2018 at 11:39:17AM +1000, NeilBrown wrote:
>
> Sound fair.  Could you Ack the following?  Then I'll resend all the
> patches that have an ack.
> I've realised that the "further improve stability of rhashtable_walk"
> patch isn't actually complete, so I'll withdraw that for now.

Looks good to me.

> From e16c037398b6c057787437f3a0aaa7fd44c3bee3 Mon Sep 17 00:00:00 2001
> From: NeilBrown <ne...@suse.com>
> Date: Mon, 16 Apr 2018 11:05:39 +1000
> Subject: [PATCH] rhashtable: Revise incorrect comment on
>  r{hl,hash}table_walk_enter()
> 
> Neither rhashtable_walk_enter() or rhltable_walk_enter() sleep, though
> they do take a spinlock without irq protection.
> So revise the comments to accurately state the contexts in which
> these functions can be called.
> 
> Signed-off-by: NeilBrown <ne...@suse.com>

Acked-by: Herbert Xu <herb...@gondor.apana.org.au>

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 2/6] rhashtable: remove incorrect comment on r{hl, hash}table_walk_enter()

2018-04-18 Thread Herbert Xu

On Thu, Apr 19, 2018 at 08:56:28AM +1000, NeilBrown wrote:
>
> I don't want to do that - I just want the documentation to be correct
> (or at least, not be blatantly incorrect).  The function does not sleep,
> and is safe to call with spin locks held.
> Do we need to spell out when it can be called?  If so, maybe:
> 
>This function may be called from any process context, including
>non-preemptable context, but cannot be called from interrupts.

Just to make it perfectly clear, how about "cannot be called from
softirq or hardirq context"? Previously the not able to sleep part
completely ruled out any ambiguity but the new wording could confuse
people into thinking that this can be called from softirq context
where it would be unsafe if mixed with process context usage.

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 6/6] rhashtable: add rhashtable_walk_prev()

2018-04-18 Thread Herbert Xu

On Wed, Apr 18, 2018 at 04:47:02PM +1000, NeilBrown wrote:
> rhashtable_walk_prev() returns the object returned by
> the previous rhashtable_walk_next(), providing it is still in the
> table (or was during this grace period).
> This works even if rhashtable_walk_stop() and rhashtable_talk_start()
> have been called since the last rhashtable_walk_next().
> 
> If there have been no calls to rhashtable_walk_next(), or if the
> object is gone from the table, then NULL is returned.
> 
> This can usefully be used in a seq_file ->start() function.
> If the pos is the same as was returned by the last ->next() call,
> then rhashtable_walk_prev() can be used to re-establish the
> current location in the table.  If it returns NULL, then
> rhashtable_walk_next() should be used.
> 
> Signed-off-by: NeilBrown <ne...@suse.com>

Can you explain the need for this function and its difference
from the existing rhashtable_walk_peek?

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 1/6] rhashtable: remove outdated comments about grow_decision etc

2018-04-18 Thread Herbert Xu

On Wed, Apr 18, 2018 at 04:47:01PM +1000, NeilBrown wrote:
> grow_decision and shink_decision no longer exist, so remove
> the remaining references to them.
> 
> Signed-off-by: NeilBrown <ne...@suse.com>

Acked-by: Herbert Xu <herb...@gondor.apana.org.au>
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 2/6] rhashtable: remove incorrect comment on r{hl, hash}table_walk_enter()

2018-04-18 Thread Herbert Xu

On Wed, Apr 18, 2018 at 04:47:01PM +1000, NeilBrown wrote:
> Neither rhashtable_walk_enter() or rhltable_walk_enter() sleep, so
> remove the comments which suggest that they do.
> 
> Signed-off-by: NeilBrown <ne...@suse.com>
> ---
>  include/linux/rhashtable.h |3 ---
>  lib/rhashtable.c   |3 ---
>  2 files changed, 6 deletions(-)
> 
> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
> index 87d443a5b11d..b01d88e196c2 100644
> --- a/include/linux/rhashtable.h
> +++ b/include/linux/rhashtable.h
> @@ -1268,9 +1268,6 @@ static inline int rhashtable_walk_init(struct 
> rhashtable *ht,
>   * For a completely stable walk you should construct your own data
>   * structure outside the hash table.
>   *
> - * This function may sleep so you must not call it from interrupt
> - * context or with spin locks held.

It does a naked spin lock so even though we removed the memory
allocation you still mustn't call it from interrupt context.

Why do you need to do that anyway?

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-04-05 Thread Herbert Xu

On Fri, Apr 06, 2018 at 01:11:56PM +1000, NeilBrown wrote:
>
> You don't need to handle memory allocation failures at the point where
> you insert into the table - adding to a linked list requires no new
> memory.

You do actually.  The r in rhashtable stands for resizable.  We
cannot completely rely on a background hash table resize because
the insertions can be triggered from softirq context and be without
rate-limits of any kind (e.g., incoming TCP connections).

Therefore at some point you either have to wait for the resize or
fail the insertion.  Since we cannot sleep then failing is the only
option.

> The likelihood of the error isn't really the issue - it either can
> happen or it cannot.  If it can, it requires code to handle it.

Sure, but you just have to handle it the same way you would handle
a memory allocation failure, because that's what it essentially is.

I'm sorry if that means you have to restructure your code to do that
but that's what you pay for having a resizable hash table because
at some point you just have to perform a resize.

> Ahhh... I see what you are thinking now.  You aren't suggestion a
> sharded count that is summed periodically.  Rather you are suggesting that
> we divide the hash space into N regions each with their own independent
> counter, and resize the table if any one region reaches 70% occupancy -
> on the assumption that the other regions are likely to be close.  Is
> that right?

Yes.

> It would probably be dangerous to allow automatic shrinking (when just
> one drops below 30%) in that case, but it might be OK for growing.

At the greatest granularity it would be a counter per bucket, so
clearly we need to maintain some kind of bound on the granularity
so our sample space is not too small.

Automatic shrinking shouldn't be an issue because that's the slow
kind of resizing that we punt to kthread context and it can afford
to do a careful count.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [Cluster-devel] [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek

2018-04-04 Thread Herbert Xu

On Wed, Apr 04, 2018 at 11:46:28AM -0400, Bob Peterson wrote:
>
> The patches look good. The big question is whether to add them to this
> merge window while it's still open. Opinions?

We're still hashing out the rhashtable interface so I don't think
now is the time to rush things.

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek

2018-04-02 Thread Herbert Xu

On Tue, Apr 03, 2018 at 01:41:26PM +1000, NeilBrown wrote:
>
> Do we really need a rhashtable_walk_peek() interface?
> I imagine that a seqfile ->start function can do:
> 
>   if (*ppos == 0 && last_pos != 0) {
>   rhashtable_walk_exit();
> rhashtable_walk_enter(, );
> last_pos = 0;
>   }
>   rhashtable_walk_start();
>   if (*ppos == last_pos && iter.p)
>   return iter.p;
>   last_pos = *ppos;

We don't want users poking into the internals of iter.  If you're
suggesting we could simplify rhashtable_walk_peek to just this after
your patch then yes possibly.  You also need to ensure that not only
seqfs users continue to work but also netlink dump users which are
slightly different.

> It might be OK to have a function call instead of expecting people to
> use iter.p directly.

Yes that would definitely be the preferred option.

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v3 net-next 07/12] rhashtable: add schedule points

2018-03-30 Thread Herbert Xu

On Fri, Mar 30, 2018 at 05:53:04PM -0700, Eric Dumazet wrote:
> Rehashing and destroying large hash table takes a lot of time,
> and happens in process context. It is safe to add cond_resched()
> in rhashtable_rehash_table() and rhashtable_free_and_destroy()
> 
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Herbert Xu <herb...@gondor.apana.org.au>

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v2 net-next 07/12] rhashtable: add schedule points

2018-03-30 Thread Herbert Xu

On Fri, Mar 30, 2018 at 01:42:31PM -0700, Eric Dumazet wrote:
> Rehashing and destroying large hash table takes a lot of time,
> and happens in process context. It is safe to add cond_resched()
> in rhashtable_rehash_table() and rhashtable_free_and_destroy()
> 
> Signed-off-by: Eric Dumazet <eduma...@google.com>

Acked-by: Herbert Xu <herb...@gondor.apana.org.au>

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH net-next 4/6] inet: frags: use rhashtables for reassembly units

2018-03-30 Thread Herbert Xu

On Fri, Mar 30, 2018 at 06:30:42AM -0700, Eric Dumazet wrote:
> 
> I guess I will need to add a cond_resched() in it, right ?

I only ever run with preemption enabled :)

But yeah we should probably add some cond_rescheds to it.  While
you're at it you might want to add some to these functions too:
 - nested_table_free
 - bucket_table_alloc
 - rhashtable_rehash_table

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH net-next 4/6] inet: frags: use rhashtables for reassembly units

2018-03-30 Thread Herbert Xu

On Thu, Mar 29, 2018 at 10:22:39PM -0700, Eric Dumazet wrote:
>
>  void inet_frags_exit_net(struct netns_frags *nf)
>  {
> - struct inet_frags *f =nf->f;
> - unsigned int seq;
> - int i;
> -
> - nf->low_thresh = 0;
> + struct rhashtable_iter hti;
> + struct inet_frag_queue *fq;
>  
> -evict_again:
> - local_bh_disable();
> - seq = read_seqbegin(>rnd_seqlock);
> + nf->low_thresh = 0; /* prevent creation of new frags */
>  
> - for (i = 0; i < INETFRAGS_HASHSZ ; i++)
> - inet_evict_bucket(f, >hash[i]);
> + rhashtable_walk_enter(>rhashtable, );
> + do {
> + rhashtable_walk_start();
>  
> - local_bh_enable();
> - cond_resched();
> + while ((fq = rhashtable_walk_next()) && !IS_ERR(fq)) {
> + if (refcount_inc_not_zero(>refcnt)) {
> + spin_lock_bh(>lock);
> + inet_frag_kill(fq);
> + spin_unlock_bh(>lock);
> + inet_frag_put(fq);
> + }
> + }
>  
> - if (read_seqretry(>rnd_seqlock, seq) ||
> - sum_frag_mem_limit(nf))
> - goto evict_again;
> + rhashtable_walk_stop();
> + } while (cond_resched(), fq == ERR_PTR(-EAGAIN));
> + rhashtable_walk_exit();
> + rhashtable_destroy(>rhashtable);
>  }
>  EXPORT_SYMBOL(inet_frags_exit_net);

Instead of using the walk interface, how about
rhashtable_free_and_destroy?

>  void inet_frag_kill(struct inet_frag_queue *fq)
>  {
>   if (del_timer(>timer))
>   refcount_dec(>refcnt);
>  
>   if (!(fq->flags & INET_FRAG_COMPLETE)) {
> - fq_unlink(fq);
> + struct netns_frags *nf = fq->net;
> +
> + fq->flags |= INET_FRAG_COMPLETE;
> + rhashtable_remove_fast(>rhashtable, >node, 
> nf->f->rhash_params);
>   refcount_dec(>refcnt);
>   }
>  }

This means that the hash won't inline properly.  Don't know big
of an issue it is to you.  But you could fix it by doing the same
hack as rhashtable by making inet_frag_kill an inline function and
take the rhash_params as an explicit argument.

> -static struct inet_frag_queue *inet_frag_create(struct netns_frags *nf,
> - struct inet_frags *f,
> - void *arg)
> -{
> - struct inet_frag_queue *q;
> + timer_setup(>timer, f->frag_expire, 0);
> + spin_lock_init(>lock);
> + refcount_set(>refcnt, 3);
> + mod_timer(>timer, jiffies + nf->timeout);
>  
> - q = inet_frag_alloc(nf, f, arg);
> - if (!q)
> + err = rhashtable_insert_fast(>rhashtable, >node,
> +  f->rhash_params);

Ditto.

> -struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
> -struct inet_frags *f, void *key,
> -unsigned int hash)
> +/* TODO : call from rcu_read_lock() and no longer use 
> refcount_inc_not_zero() */
> +struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key)
>  {
> - struct inet_frag_bucket *hb;
> - struct inet_frag_queue *q;
> - int depth = 0;
> -
> - if (frag_mem_limit(nf) > nf->low_thresh)
> - inet_frag_schedule_worker(f);
> -
> - hash &= (INETFRAGS_HASHSZ - 1);
> - hb = >hash[hash];
> -
> - spin_lock(>chain_lock);
> - hlist_for_each_entry(q, >chain, list) {
> - if (q->net == nf && f->match(q, key)) {
> - refcount_inc(>refcnt);
> - spin_unlock(>chain_lock);
> - return q;
> - }
> - depth++;
> - }
> - spin_unlock(>chain_lock);
> + struct inet_frag_queue *fq;
>  
> - if (depth <= INETFRAGS_MAXDEPTH)
> - return inet_frag_create(nf, f, key);
> + rcu_read_lock();
>  
> - if (inet_frag_may_rebuild(f)) {
> - if (!f->rebuild)
> - f->rebuild = true;
> - inet_frag_schedule_worker(f);
> + fq = rhashtable_lookup(>rhashtable, key, nf->f->rhash_params);

Ditto.

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek

2018-03-29 Thread Herbert Xu

On Thu, Mar 29, 2018 at 06:52:34PM +0200, Andreas Gruenbacher wrote:
>
> Should rhashtable_walk_peek be kept around even if there are no more
> users? I have my doubts.

Absolutely.  All netlink dumps using rhashtable_walk_next are buggy
and need to switch over to rhashtable_walk_peek.  As otherwise
the object that triggers the out-of-space condition will be skipped
upon resumption.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek

2018-03-29 Thread Herbert Xu

On Thu, Mar 29, 2018 at 03:15:54PM +0200, Andreas Gruenbacher wrote:
>
> For all I know, Neil's latest plan is to get rhashtable_walk_peek
> replaced and removed because it is unfixable. This patch removes the
> one and only user.

His latest patch makes rhashtable_walk_peek stable in the face of
removals.

https://patchwork.ozlabs.org/patch/892534/

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek

2018-03-29 Thread Herbert Xu

On Thu, Mar 29, 2018 at 02:06:10PM +0200, Andreas Gruenbacher wrote:
> Here's a second version of the patch (now a patch set) to eliminate
> rhashtable_walk_peek in gfs2.
> 
> The first patch introduces lockref_put_not_zero, the inverse of
> lockref_get_not_zero.
> 
> The second patch eliminates rhashtable_walk_peek in gfs2.  In
> gfs2_glock_iter_next, the new lockref function from patch one is used to
> drop a lockref count as long as the count doesn't drop to zero.  This is
> almost always the case; if there is a risk of dropping the last
> reference, we must defer that to a work queue because dropping the last
> reference may sleep.

In light of Neil's latest patch, do we still need this?

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 2/2] rhashtable: improve rhashtable_walk stability when stop/start used.

2018-03-28 Thread Herbert Xu

On Thu, Mar 29, 2018 at 12:19:10PM +1100, NeilBrown wrote:
> When a walk of an rhashtable is interrupted with rhastable_walk_stop()
> and then rhashtable_walk_start(), the location to restart from is based
> on a 'skip' count in the current hash chain, and this can be incorrect
> if insertions or deletions have happened.  This does not happen when
> the walk is not stopped and started as iter->p is a placeholder which
> is safe to use while holding the RCU read lock.
> 
> In rhashtable_walk_start() we can revalidate that 'p' is still in the
> same hash chain.  If it isn't then the current method is still used.
> 
> With this patch, if a rhashtable walker ensures that the current
> object remains in the table over a stop/start period (possibly by
> elevating the reference count if that is sufficient), it can be sure
> that a walk will not miss objects that were in the hashtable for the
> whole time of the walk.
> 
> rhashtable_walk_start() may not find the object even though it is
> still in the hashtable if a rehash has moved it to a new table.  In
> this case it will (eventually) get -EAGAIN and will need to proceed
> through the whole table again to be sure to see everything at least
> once.
> 
> Signed-off-by: NeilBrown <ne...@suse.com>

Very nice!

Acked-by: Herbert Xu <herb...@gondor.apana.org.au>
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 1/2] rhashtable: fix insertion of in rhltable when duplicate found.

2018-03-28 Thread Herbert Xu

On Thu, Mar 29, 2018 at 12:19:09PM +1100, NeilBrown wrote:
> When rhltable_insert() finds an entry with the same key,
> it splices the new entry at the start of a list of entries with the
> same key.
> It stores the address of the new object in *pprev, but in general this
> is *not* the location where the match was found (though in a common
> case where the hash chain has one element, it will be).
> To fix this, pprev should be updated every time we find an object that
> doesn't match.
> 
> This patch changes the behaviour for non-slow insertion in that now
> insertion happens at the end of a chain rather than at the head.
> I don't think this is an important change.
> 
> Signed-off-by: NeilBrown <ne...@suse.com>

Thanks.  But Paul Blakey beat you to it :)

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=d3dcf8eb615537526bd42ff27a081d46d337816e
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-03-28 Thread Herbert Xu

On Thu, Mar 29, 2018 at 08:26:21AM +1100, NeilBrown wrote:
>
> I say "astronomically unlikely", you say "probability .. is extremely
> low".  I think we are in agreement here.
> 
> The point remains that if an error *can* be returned then I have to
> write code to handle it and test that code.  I'd rather not.

You have be able to handle errors anyway because of memory allocation
failures.  Ultimately if you keep inserting you will eventually
fail with ENOMEM.  So I don't see the issue with an additional
error value.

> > Even if it does happen we won't fail because we will perform
> > an immediate rehash.  We only fail if it happens right away
> > after the rehash (that is, at least another 16 elements have
> > been inserted and you're trying to insert a 17th element, all
> > while the new hash table has not been completely populated),
> > which means that somebody has figured out our hash secret and
> > failing in that case makes sense.

BTW, you didn't acknowledge this bit which I think is crucial to
how likely such an error is.

> I never suggested retrying, but I would have to handle it somehow.  I'd
> rather not.

...

> While I have no doubt that there are hashtables where someone could try
> to attack the hash, I am quite sure there are others where is such an
> attack is meaningless - any code which could generate the required range of
> keys, could do far worse things more easily.

Our network hashtable has to be secure against adversaries.  I
understand that this may not be important to your use-case.  However,
given the fact that the failure would only occur if an adversary
is present and actively attacking your hash table, I don't think
it has that much of a negative effect on your use-case either.

Of course if you can reproduce the EBUSY error without your
disable_count patch or even after you have fixed the issue I
have pointed out in your disable_count patch you can still reproduce
it then that would suggest a real bug and we would need to fix it,
for everyone.

> Yes, storing a sharded count in the spinlock table does seem like an
> appropriate granularity.  However that leads me to ask: why do we have
> the spinlock table?  Why not bit spinlocks in the hashchain head like
> include/linux/list_bl uses?

The spinlock table predates rhashtable.  Perhaps Thomas/Eric/Dave
can elucidate this.

> I don't understand how it can ever be "too late", though I appreciate
> that in some cases "sooner" is better than "later"
> If we give up on the single atomic_t counter, then we must accept that
> the number of elements could exceed any given value.  The only promise
> we can provide is that it wont exceed N% of the table size for more than
> T seconds.

Sure.  However, assuming we use an estimate that is hash-based,
that *should* be fairly accurate assuming that your hash function
is working in the first place.  It's completely different vs.
estimating based on a per-cpu count which could be wildly inaccurate.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.

2018-03-28 Thread Herbert Xu

On Wed, Mar 28, 2018 at 06:17:57PM +1100, NeilBrown wrote:
>
> Sounds like over-kill to me.
> It might be reasonable to have a CONFIG_DEBUG_RHASHTABLE which enables
> extra to code to catch misuse, but I don't see the justification for
> always performing these checks.
> The DEBUG code could just scan the chain (usually quite short) to see if
> the given element is present.  Of course it might have already been
> rehashed to the next table, so you would to allow for that possibility -
> probably check tbl->rehash.

No this is not meant to debug users incorrectly using the cursor.
This is a replacement of your continue interface by automatically
validating the cursor.

In fact we can make it even more reliable.  We can insert the walker
right into the bucket chain, that way the walking will always be
consistent.

The only problem is that we need be able to differentiate between
a walker, a normal object, and the end of the list.  I think it
should be doable.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-03-28 Thread Herbert Xu

On Wed, Mar 28, 2018 at 06:04:40PM +1100, NeilBrown wrote:
>
> I disagree.  My patch 6 only makes it common instead of exceedingly
> rare.  If any table in the list other than the first has a chain with 16
> elements, then trying to insert an element with a hash which matches
> that chain will fail with -EBUSY.  This is theoretically possible
> already, though astronomically unlikely.  So that case will never be
> tested for.

No that's not true.  If the table is correctly sized then the
probability of having a chain with 16 elements is extremely low.

Even if it does happen we won't fail because we will perform
an immediate rehash.  We only fail if it happens right away
after the rehash (that is, at least another 16 elements have
been inserted and you're trying to insert a 17th element, all
while the new hash table has not been completely populated),
which means that somebody has figured out our hash secret and
failing in that case makes sense.

> It is hard to know if it is necessary.  And making the new table larger
> will make the error less likely, but still won't make it impossible.  So
> callers will have to handle it - just like they currently have to handle
> -ENOMEM even though it is highly unlikely (and not strictly necessary).

Callers should not handle an ENOMEM error by retrying.  Nor should
they retry an EBUSY return value.

> Are these errors ever actually useful?  I thought I had convinced myself
> before that they were (to throttle attacks on the hash function), but
> they happen even less often than I thought.

The EBUSY error indicates that the hash table has essentially
degenereated into a linked list because somebody has worked out
our hash secret.

> Maybe. Reading a percpu counter isn't cheap.  Reading it whenever a hash
> chain reaches 16 is reasonable, but I think we would want to read it a
> lot more often than that.  So probably store the last-sampled time (with
> no locking) and only sample the counter if last-sampled is more than
>  jiffies - 10*HZ (???)

We could also take the spinlock table approach and have a counter
per bucket spinlock.  This should be sufficient as you'll contend
on the bucket spinlock table anyway.

This also allows us to estimate the total table size and not have
to always do a last-ditch growth when it's too late.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.

2018-03-28 Thread Herbert Xu

On Wed, Mar 28, 2018 at 08:54:41AM +1100, NeilBrown wrote:
>
> Possibly.
> I particularly want the interface to require that you pass the
> previously returned object to _continue. That makes it easy to see that
> the object is still being used.  If someone changes to code to delete
> the object before the _continue, there should be a strong hint that it
> won't work.
> 
> Maybe it would be better to make it a WARN_ON()
> 
>   if (!obj || WARN_ON(iter->p != obj))
>  iter->p = NULL;

This doesn't really protect against the case where obj is removed.
All it proves is that the user saved a copy of obj which we already
did anyway.

To detect an actual removal you'd need to traverse the list.

I have another idea: we could save insert the walkers into the
hash table chain at the end, essentially as a hidden list.  We
can mark it with a flag like rht_marker so that normal traversal
doesn't see it.

That way the removal code can simply traverse that list and inform
them that the iterator is invalid.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-03-28 Thread Herbert Xu

On Wed, Mar 28, 2018 at 08:34:19AM +1100, NeilBrown wrote:
>
> It is easy to get an -EBUSY insertion failure when .disable_count is
> enabled, and I did get that.  Blindly propagating that up caused lustre
> to get terribly confused - not too surprising really.

Right, so this failure mode is specific to your patch 6.

I think I see the problem.  As it currently stands, we never
need growing when we hit the bucket length limit.  We only do
rehashes.

With your patch, you must change it so that we actually try to
grow the table if necessary when we hit the bucket length limit.

Otherwise it will result in the EBUSY that you're seeing.

I laso think that we shouldn't make this a toggle.  If we're going
to do disable_count then it should be unconditionally done for
everyone.

However, I personally would prefer a percpu elem count instead of
disabling it completely.  Would that be acceptable to your use-case?

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 5/6] rhashtable: support guaranteed successful insertion.

2018-03-27 Thread Herbert Xu

On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
> The current rhashtable will fail an insertion if the hashtable
> it "too full", one of:
>  - table already has 2^31 elements (-E2BIG)
>  - a max_size was specified and table already has that
>many elements (rounded up to power of 2) (-E2BIG)
>  - a single chain has more than 16 elements (-EBUSY)
>  - table has more elements than the current table size,
>and allocating a new table fails (-ENOMEM)
>  - a new page needed to be allocated for a nested table,
>and the memory allocation failed (-ENOMEM).
> 
> A traditional hash table does not have a concept of "too full", and
> insertion only fails if the key already exists.  Many users of hash
> tables have separate means of limiting the total number of entries,
> and are not susceptible to an attack which could cause unusually large
> hash chains.  For those users, the need to check for errors when
> inserting objects to an rhashtable is an unnecessary burden and hence
> a potential source of bugs (as these failures are likely to be rare).

Did you actually encounter an insertion failure? The current code
should never fail an insertion until you actually run ouf memory.
That is unless you're using rhashtable when you should be using
rhlist instead.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.

2018-03-27 Thread Herbert Xu

On Tue, Mar 27, 2018 at 11:49:41AM -0400, David Miller wrote:
>
> Merely having an elevated reference count does not explicitly prevent
> the object from being removed from the hash table.
> 
> This invariant might hold for the particular user of the rhashtable
> instance, but it is not always the case.

I think having a new interface like this should work as if the
user knows that the element has not been removed from the hash
table then it should be safe to continue from it.

Of course people may misuse it but even using the current interface
correctly may result in lost objects due to concurrent removals.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 4/6] rhashtable: allow a walk of the hash table without missing objects.

2018-03-27 Thread Herbert Xu

On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
>
> -int rhashtable_walk_start_check(struct rhashtable_iter *iter)
> +int rhashtable_walk_start_continue(struct rhashtable_iter *iter, struct 
> rhash_head *obj)
>   __acquires(RCU)
>  {
>   struct rhashtable *ht = iter->ht;
>  
>   rcu_read_lock();
>  
> + if (!obj || iter->p != obj)
> + iter->p = NULL;

Why bother with this check at all? Couldn't we make it so that
if you call continue then you continue with the cursor otherwise
you set it to NULL as we currently do.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 3/6] rhashtable: reset intr when rhashtable_walk_start sees new table

2018-03-27 Thread Herbert Xu

On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
> The documentation claims that when rhashtable_walk_start_check()
> detects a resize event, it will rewind back to the beginning
> of the table.  This is not true.  We need to set ->slot and
> ->skip to be zero for it to be true.
> 
> Signed-off-by: NeilBrown <ne...@suse.com>

Acked-by: Herbert Xu <herb...@gondor.apana.org.au>
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 1/6] rhashtable: improve documentation for rhashtable_walk_peek()

2018-03-27 Thread Herbert Xu

On Tue, Mar 27, 2018 at 10:33:04AM +1100, NeilBrown wrote:
> The documentation for rhashtable_walk_peek() wrong.  It claims to
> return the *next* entry, whereas it in fact returns the *previous*
> entry.
> However if no entries have yet been returned - or if the iterator
> was reset due to a resize event, then rhashtable_walk_peek()
> *does* return the next entry, but also advances the iterator.
> 
> I suspect that this interface should be discarded and the one user
> should be changed to not require it.  Possibly this patch should be
> seen as a first step in that conversation.
> 
> This patch mostly corrects the documentation, but does make a
> small code change so that the documentation can be correct without
> listing too many special cases.  I don't think the one user will
> be affected by the code change.
> 
> Signed-off-by: NeilBrown <ne...@suse.com>

We should cc Tom Herbert too since he wrote this code.

> ---
>  lib/rhashtable.c |   17 +
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 3825c30aaa36..24a57ca494cb 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -853,13 +853,17 @@ void *rhashtable_walk_next(struct rhashtable_iter *iter)
>  EXPORT_SYMBOL_GPL(rhashtable_walk_next);
>  
>  /**
> - * rhashtable_walk_peek - Return the next object but don't advance the 
> iterator
> + * rhashtable_walk_peek - Return the previously returned object without 
> advancing the iterator
>   * @iter:Hash table iterator
>   *
> - * Returns the next object or NULL when the end of the table is reached.
> + * Returns the last object returned, or NULL if no object has yet been 
> returned.
> + * If the previously returned object has since been removed, then some other 
> arbitrary
> + * object maybe returned, or possibly NULL will be returned.  In that case, 
> the
> + * iterator might be advanced.
>   *
>   * Returns -EAGAIN if resize event occurred.  Note that the iterator
> - * will rewind back to the beginning and you may continue to use it.
> + * will rewind back to the beginning and rhashtable_walk_next() should be
> + * used to get the next object.
>   */
>  void *rhashtable_walk_peek(struct rhashtable_iter *iter)
>  {
> @@ -880,7 +884,12 @@ void *rhashtable_walk_peek(struct rhashtable_iter *iter)
>* the table hasn't changed.
>*/
>   iter->skip--;
> - }
> + } else
> + /* ->skip is only zero after rhashtable_walk_start()
> +  * or when the iterator is reset.  In this case there
> +  * is no previous object to return.
> +  */
> + return NULL;
>  
>   return __rhashtable_walk_find_next(iter);
>  }
> 
> 

-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v11 crypto 00/12] Chelsio Inline TLS

2018-03-19 Thread Herbert Xu

On Sun, Mar 18, 2018 at 10:36:02AM -0400, David Miller wrote:
>
> Herbert, is it OK for this entire series to go via net-next?

Sure, although there could be conflicts since the chelsio driver
seems to be changing quite fast.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH net] test_rhashtable: Add missing rcu_read_lock()

2018-03-09 Thread Herbert Xu

On Thu, Mar 08, 2018 at 07:10:47PM +0200, Paul Blakey wrote:
>
> I know but I wanted to show the inner structure for the failure case, iirc
> walk doesn't provide this.

Fair enough.

Acked-by: Herbert Xu <herb...@gondor.apana.org.au>
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH] net: xfrm: use preempt-safe this_cpu_read() in ipcomp_alloc_tfms()

2018-03-08 Thread Herbert Xu

On Thu, Mar 08, 2018 at 08:31:47AM -0800, Eric Dumazet wrote:
>
> Because that would not be nice.
> 
> this_cpu_read() is faster than going through raw_smp_processor_id() and
> per_cpu_ptr()

I see.  In that case I have no objections to this patch.

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH net] test_rhashtable: Add missing rcu_read_lock()

2018-03-08 Thread Herbert Xu

On Thu, Mar 08, 2018 at 01:54:57PM +0200, Paul Blakey wrote:
> Suppress "suspicious rcu_dereference_protected() usage!" on duplicate
> insertion test.
> 
> Fixes: 499ac3b60f65 ('test_rhashtable: add test case for rhl_table with 
> duplicate objects')
> Signed-off-by: Paul Blakey <pa...@mellanox.com>

This shouldn't be doing a table walk in the first place.

You should convert it to using the table walk interface.

Direct table walks are forbidden because they will break when
you hit a resize.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v2 net] net: xfrm: use preempt-safe this_cpu_read() in ipcomp_alloc_tfms()

2018-03-08 Thread Herbert Xu

On Wed, Mar 07, 2018 at 02:42:53PM -0800, Greg Hackmann wrote:
> f7c83bcbfaf5 ("net: xfrm: use __this_cpu_read per-cpu helper") added a
> __this_cpu_read() call inside ipcomp_alloc_tfms().
> 
> At the time, __this_cpu_read() required the caller to either not care
> about races or to handle preemption/interrupt issues.  3.15 tightened
> the rules around some per-cpu operations, and now __this_cpu_read()
> should never be used in a preemptible context.  On 3.15 and later, we
> need to use this_cpu_read() instead.
> 
> syzkaller reported this leading to the following kernel BUG while
> fuzzing sendmsg:

Please explain why we can't revert f7c83bcbfaf5 instead.

Your patch contradicts the comment above the line that you're
changing.

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH] net: xfrm: use preempt-safe this_cpu_read() in ipcomp_alloc_tfms()

2018-03-07 Thread Herbert Xu

On Wed, Mar 07, 2018 at 11:24:16AM -0800, Greg Hackmann wrote:
> f7c83bcbfaf5 ("net: xfrm: use __this_cpu_read per-cpu helper") added a
> __this_cpu_read() call inside ipcomp_alloc_tfms().  Since this call was
> introduced, the rules around per-cpu accessors have been tightened and
> __this_cpu_read() cannot be used in a preemptible context.
> 
> syzkaller reported this leading to the following kernel BUG while
> fuzzing sendmsg:

How about reverting f7c83bcbfaf5 instead?

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH net v3 2/2] test_rhashtable: add test case for rhltable with duplicate objects

2018-03-05 Thread Herbert Xu

On Sun, Mar 04, 2018 at 05:29:49PM +0200, Paul Blakey wrote:
> Tries to insert duplicates in the middle of bucket's chain:
> bucket 1:  [[val 21 (tid=1)]] -> [[ val 1 (tid=2),  val 1 (tid=0) ]]
> 
> Reuses tid to distinguish the elements insertion order.
> 
> Signed-off-by: Paul Blakey <pa...@mellanox.com>

Acked-by: Herbert Xu <herb...@gondor.apana.org.au>
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH net v3 1/2] rhashtable: Fix rhlist duplicates insertion

2018-03-05 Thread Herbert Xu

On Sun, Mar 04, 2018 at 05:29:48PM +0200, Paul Blakey wrote:
> When inserting duplicate objects (those with the same key),
> current rhlist implementation messes up the chain pointers by
> updating the bucket pointer instead of prev next pointer to the
> newly inserted node. This causes missing elements on removal and
> travesal.
> 
> Fix that by properly updating pprev pointer to point to
> the correct rhash_head next pointer.
> 
> Issue: 1241076
> Change-Id: I86b2c140bcb4aeb10b70a72a267ff590bb2b17e7
> Fixes: ca26893f05e8 ('rhashtable: Add rhlist interface')
> Signed-off-by: Paul Blakey <pa...@mellanox.com>

Acked-by: Herbert Xu <herb...@gondor.apana.org.au>
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH net v2 2/2] test_rhashtable: add test case for rhltable with duplicate objects

2018-03-04 Thread Herbert Xu

On Sun, Mar 04, 2018 at 03:26:49PM +0200, Paul Blakey wrote:
> Tries to insert duplicates in the middle of bucket's chain:
> bucket 1:  [[val 21 (tid=1)]] -> [[ val 1 (tid=2),  val 1 (tid=0) ]]
> 
> Reuses tid to distinguish the elements insertion order.
> 
> Signed-off-by: Paul Blakey <pa...@mellanox.com>

Acked-by: Herbert Xu <herb...@gondor.apana.org.au>
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH net v2 1/2] rhashtable: Fix rhlist duplicates insertion

2018-03-04 Thread Herbert Xu

On Sun, Mar 04, 2018 at 03:26:48PM +0200, Paul Blakey wrote:
> When inserting duplicate objects (those with the same key),
> current rhlist implementation messes up the chain pointers by
> updating the bucket pointer instead of prev next pointer to the
> newly inserted node. This causes missing elements on removal and
> travesal.
> 
> Fix that by properly updating pprev pointer to point to
> the correct rhash_head next pointer.
> 
> Fixes: ca26893f05e8 ('rhashtable: Add rhlist interface')
> Signed-off-by: Paul Blakey <pa...@mellanox.com>

Oops, replied to the wrong email.

Acked-by: Herbert Xu <herb...@gondor.apana.org.au>

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH net 1/2] rhashtable: Fix rhltable duplicates insertion

2018-03-04 Thread Herbert Xu

On Sun, Mar 04, 2018 at 02:34:26PM +0200, Paul Blakey wrote:
> When inserting duplicate objects (those with the same key),
> current rhashtable implementation messes up the chain pointers by
> updating the bucket pointer instead of prev next pointer to the
> newly inserted node. This causes missing elements on removal and
> travesal.
> 
> Fix that by properly updating pprev pointer to point to
> the correct rhash_head next pointer.
> 
> Fixes: ca26893f05e8 ('rhashtable: Add rhlist interface')
> Signed-off-by: Paul Blakey <pa...@mellanox.com>

Ah I see, thanks for catching this!

Acked-by: Herbert Xu <herb...@gondor.apana.org.au>

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH net 1/2] rhashtable: Fix rhltable duplicates insertion

2018-03-04 Thread Herbert Xu

On Sun, Mar 04, 2018 at 02:34:26PM +0200, Paul Blakey wrote:
> When inserting duplicate objects (those with the same key),
> current rhashtable implementation messes up the chain pointers by
> updating the bucket pointer instead of prev next pointer to the
> newly inserted node. This causes missing elements on removal and
> travesal.
> 
> Fix that by properly updating pprev pointer to point to
> the correct rhash_head next pointer.
> 
> Fixes: ca26893f05e8 ('rhashtable: Add rhlist interface')
> Signed-off-by: Paul Blakey <pa...@mellanox.com>

Nack.  You must not insert objects with the same key through
rhashtable.  The reason is that we cannot reliably fetch all
of the objects with the same key during a resize.

If you need duplicate objects, you should use rhlist.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 2/3] tcp: Add ESP encapsulation support

2018-01-17 Thread Herbert Xu

On Tue, Jan 16, 2018 at 11:28:23AM +0100, Steffen Klassert wrote:
>
> > Is there any chance this can be done with almost no change in TCP
> > stack, using a layer model ? ( net/kcm comes to mind )
> 
> Herbert, would this be an option or is this not possible?

Yes it can be done.  I'm working on it.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[PATCH 3/3] ipsec: Add ESP over TCP encapsulation support

2018-01-11 Thread Herbert Xu

This patch adds support for ESP over TCP encapsulation per RFC8229.

Most of the input processing is done in the TCP stack and not in
this patch, which is similar to UDP encapsulation.

On the output side, there are two potential levels of indirection.
Firstly all packets are fed through a tasklet in order to avoid
TCP socket lock recursion.  They're then processed directly if
the TCP socket is not owned by user-space.  If it is owned then
we'll place the packet in a queue (tp->encap_out) for processing
when the socket lock is released.

The first outbound packet will trigger a socket lockup for a
matching TCP socket.  If the TCP connection drops we will repeat
the lookup as needed.  The TCP socket is cached in the xfrm state
and is read using RCU.

Note that unlike normal IPsec packets, once we hit a TCP xfrm
state, the xfrm stack is short-circuited and its journey will
continue through the TCP stack, after which a new IPsec lookup
will be done.  This is different from how UDP encapsulation is
done.  This means that if you're doing nested IPsec then you
will need to construct the policies with this in mind.  That is,
start with a new policy whenever TCP encapsulation is done.

Signed-off-by: Herbert Xu <herb...@gondor.apana.org.au>
---

 include/net/xfrm.h|7 +
 net/ipv4/esp4.c   |  208 --
 net/xfrm/xfrm_input.c |   21 +++--
 net/xfrm/xfrm_state.c |3 
 4 files changed, 228 insertions(+), 11 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index ae35991..3694536 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -180,6 +180,7 @@ struct xfrm_state {
 
/* Data for encapsulator */
struct xfrm_encap_tmpl  *encap;
+   struct sock __rcu   *encap_sk;
 
/* Data for care-of address */
xfrm_address_t  *coaddr;
@@ -210,6 +211,9 @@ struct xfrm_state {
u32 replay_maxage;
u32 replay_maxdiff;
 
+   /* Copy of encap_type from encap to avoid locking. */
+   u16 encap_type;
+
/* Replay detection notification timer */
struct timer_list   rtimer;
 
@@ -1570,6 +1574,9 @@ struct xfrmk_spdinfo {
 int xfrm_prepare_input(struct xfrm_state *x, struct sk_buff *skb);
 int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type);
 int xfrm_input_resume(struct sk_buff *skb, int nexthdr);
+int xfrm_trans_queue_net(struct net *net, struct sk_buff *skb,
+int (*finish)(struct net *, struct sock *,
+  struct sk_buff *));
 int xfrm_trans_queue(struct sk_buff *skb,
 int (*finish)(struct net *, struct sock *,
   struct sk_buff *));
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 61fe6e4..0544e4e 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -9,13 +9,16 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -30,6 +33,11 @@ struct esp_output_extra {
u32 esphoff;
 };
 
+struct esp_tcp_sk {
+   struct sock *sk;
+   struct rcu_head rcu;
+};
+
 #define ESP_SKB_CB(__skb) ((struct esp_skb_cb *)&((__skb)->cb[0]))
 
 static u32 esp4_get_mtu(struct xfrm_state *x, int mtu);
@@ -118,6 +126,143 @@ static void esp_ssg_unref(struct xfrm_state *x, void *tmp)
put_page(sg_page(sg));
 }
 
+static void esp_free_tcp_sk(struct rcu_head *head)
+{
+   struct esp_tcp_sk *esk = container_of(head, struct esp_tcp_sk, rcu);
+
+   sock_put(esk->sk);
+   kfree(esk);
+}
+
+static struct sock *esp_find_tcp_sk(struct xfrm_state *x)
+{
+   struct xfrm_encap_tmpl *encap = x->encap;
+   struct esp_tcp_sk *esk;
+   __be16 sport, dport;
+   struct sock *nsk;
+   struct sock *sk;
+
+   sk = rcu_dereference(x->encap_sk);
+   if (sk && sk->sk_state == TCP_ESTABLISHED)
+   return sk;
+
+   spin_lock_bh(>lock);
+   sport = encap->encap_sport;
+   dport = encap->encap_dport;
+   nsk = rcu_dereference_protected(x->encap_sk,
+   lockdep_is_held(>lock));
+   if (sk && sk == nsk) {
+   esk = kmalloc(sizeof(*esk), GFP_ATOMIC);
+   if (!esk) {
+   spin_unlock_bh(>lock);
+   return ERR_PTR(-ENOMEM);
+   }
+   RCU_INIT_POINTER(x->encap_sk, NULL);
+   esk->sk = sk;
+   call_rcu(>rcu, esp_free_tcp_sk);
+   }
+   spin_unlock_bh(>lock);
+
+   /* XXX We don't support bound_dev_if. */
+   sk = inet_lookup_established(xs_net(x), _hashinfo, x->id.daddr.a4,
+dport, x->props.saddr.a4, sport, 0);
+
+   if (!sk)
+   retur

[PATCH 2/3] tcp: Add ESP encapsulation support

2018-01-11 Thread Herbert Xu

This patch adds the plumbing in TCP for ESP encapsulation support
per RFC8229.

The patch mostly deals with inbound processing, as well as enabling
TCP encapsulation on a socket through setsockopt.  The outbound
processing is dealt with in the ESP code as is done for UDP.

The inbound processing is split into two halves.  First of all,
the softirq path directly intercepts ESP packets and feeds them
into the IPsec stack.  Most of the time the packet will be freed
right away if it contains complete ESP packets.  However, if
the message is incomplete or it contains non-ESP data, then the
skb will be added to the receive queue.  We also add packets to
the receive queue if it is currently non-emtpy, in order to
preserve sequence number continuity and minimise the changes
to the TCP code.

On the user-space facing side, packets marked as ESP-only are
skipped and not visible to user-space.  However, some ESP data
may seep through.  For example, if we receive a partial message
then we will always give it to user-space regardless of whether
it turns out to be ESP or not.  So user-space should be prepared
to skip ESP messages (SPI != 0).

There is a little bit of code dealing with the encapsulation side.
In particular, if encapsulation data comes in while the socket
is owned by user-space, the packets will be stored in tp->encap_out
and processed during release_sock.

Signed-off-by: Herbert Xu <herb...@gondor.apana.org.au>
---

 include/linux/tcp.h  |   15 ++
 include/net/tcp.h|   27 +++
 include/uapi/linux/tcp.h |1 
 include/uapi/linux/udp.h |1 
 net/ipv4/tcp.c   |   68 +
 net/ipv4/tcp_input.c |  326 +--
 net/ipv4/tcp_ipv4.c  |1 
 net/ipv4/tcp_output.c|   48 ++
 8 files changed, 473 insertions(+), 14 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index ca4a636..1360a0e 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -225,7 +225,8 @@ struct tcp_sock {
fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
fastopen_no_cookie:1, /* Allow send/recv SYN+data without a 
cookie */
is_sack_reneg:1,/* in recovery from loss with SACK reneg? */
-   unused:2;
+   encap:1,/* TCP IKE/ESP encapsulation */
+   encap_lenhi_valid:1;
u8  nonagle : 4,/* Disable Nagle algorithm? */
thin_lto: 1,/* Use linear timeouts for thin streams */
unused1 : 1,
@@ -373,6 +374,16 @@ struct tcp_sock {
 */
struct request_sock *fastopen_rsk;
u32 *saved_syn;
+
+#ifdef CONFIG_XFRM
+/* TCP ESP encapsulation */
+   struct sk_buff *encap_in;
+   struct sk_buff_head encap_out;
+   u32 encap_seq;
+   u32 encap_last;
+   u16 encap_backlog;
+   u8  encap_lenhi;
+#endif
 };
 
 enum tsq_enum {
@@ -384,6 +395,7 @@ enum tsq_enum {
TCP_MTU_REDUCED_DEFERRED,  /* tcp_v{4|6}_err() could not call
* tcp_v{4|6}_mtu_reduced()
*/
+   TCP_ESP_DEFERRED,  /* esp_output_tcp_encap2 queued packets */
 };
 
 enum tsq_flags {
@@ -393,6 +405,7 @@ enum tsq_flags {
TCPF_WRITE_TIMER_DEFERRED   = (1UL << TCP_WRITE_TIMER_DEFERRED),
TCPF_DELACK_TIMER_DEFERRED  = (1UL << TCP_DELACK_TIMER_DEFERRED),
TCPF_MTU_REDUCED_DEFERRED   = (1UL << TCP_MTU_REDUCED_DEFERRED),
+   TCPF_ESP_DEFERRED   = (1UL << TCP_ESP_DEFERRED),
 };
 
 static inline struct tcp_sock *tcp_sk(const struct sock *sk)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6da880d..6513ae2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -327,6 +327,7 @@ int tcp_sendpage_locked(struct sock *sk, struct page *page, 
int offset,
size_t size, int flags);
 ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 size_t size, int flags);
+int tcp_encap_output(struct sock *sk, struct sk_buff *skb);
 void tcp_release_cb(struct sock *sk);
 void tcp_wfree(struct sk_buff *skb);
 void tcp_write_timer_handler(struct sock *sk);
@@ -399,6 +400,7 @@ int compat_tcp_setsockopt(struct sock *sk, int level, int 
optname,
  char __user *optval, unsigned int optlen);
 void tcp_set_keepalive(struct sock *sk, int val);
 void tcp_syn_ack_timeout(const struct request_sock *req);
+void tcp_cleanup_rbuf(struct sock *sk, int copied);
 int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
int flags, int *addr_len);
 void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
@@ -789,7 +791,8 @@ struct tcp_skb_cb {
__u8txstamp_ack:1,  /* Record TX timestamp for ack? */
eor:1,  /* Is skb MSG_EOR marked? */

[PATCH 1/3] skbuff: Avoid sleeping in skb_send_sock_locked

2018-01-11 Thread Herbert Xu

For a function that needs to be called with the socket spinlock
held, sleeping would seem to be a bad idea.  This function does
in fact avoid sleeping when calling kernel_sendpage_locked on the
page part of the skb.  However, it doesn't do that when sending
the linear part.  Resulting in sleeping when the socket send buffer
is full.

This patch fixes it by setting the MSG_DONTWAIT flag when calling
kernel_sendmsg_locked.

Signed-off-by: Herbert Xu <herb...@gondor.apana.org.au>
---

 net/core/skbuff.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6b0ff39..8197b7a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2279,6 +2279,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff 
*skb, int offset,
kv.iov_base = skb->data + offset;
kv.iov_len = slen;
memset(, 0, sizeof(msg));
+   msg.msg_flags = MSG_DONTWAIT;
 
ret = kernel_sendmsg_locked(sk, , , 1, slen);
if (ret <= 0)

[PATCH 0/3] ipsec: Add ESP over TCP encapsulation

2018-01-11 Thread Herbert Xu

Hi:

This series of patches add basic support for ESP over TCP (RFC 8229).
Note that this does not include TLS support but it could be added in
future.

Here is an iproute patch to setup xfrm states with this:

diff --git a/ip/ipxfrm.c b/ip/ipxfrm.c
index 12c2f72..f3fb1e2 100644
--- a/ip/ipxfrm.c
+++ b/ip/ipxfrm.c
@@ -738,6 +738,9 @@ void xfrm_xfrma_print(struct rtattr *tb[], __u16 family,
case 2:
fprintf(fp, "espinudp ");
break;
+   case 6:
+   fprintf(fp, "espintcp ");
+   break;
default:
fprintf(fp, "%u ", e->encap_type);
break;
@@ -1182,6 +1185,8 @@ int xfrm_encap_type_parse(__u16 *type, int *argcp, char 
***argvp)
*type = 1;
else if (strcmp(*argv, "espinudp") == 0)
*type = 2;
+   else if (strcmp(*argv, "espintcp") == 0)
+   *type = 6;
else
invarg("ENCAP-TYPE value is invalid", *argv);
 

Here is a sample program for setting up the TCP socket to use this.
Note that it doesn't do the magic word as required by RFC 8229 so
you'll need to add that for a real key manager.

#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define TCP_ENCAP 35

int main(int argc, char **argv)
{
struct sockaddr_in addr = {
.sin_family = AF_INET,
.sin_port = htons(4500),
};
char buf[4096];
int one = 1;
int err;
int s;

s = socket(AF_INET, SOCK_STREAM, 0);
if (s < 0)
error(-1, errno, "socket");

if (bind(s, (struct sockaddr *), sizeof(addr)) < 0)
error(-1, errno, "bind");

if (argc > 1) {
addr.sin_addr.s_addr = inet_addr(argv[1]);
if (connect(s, (struct sockaddr *), sizeof(addr)) < 0)
error(-1, errno, "connect");
} else {
if (listen(s, 0) < 0)
error(-1, errno, "listen");

s = accept(s, NULL, 0);
if (s < 0)
error(-1, errno, "accept");
}

if (setsockopt(s, SOL_TCP, TCP_NODELAY, , sizeof(one)) < 0)
error(-1, errno, "TCP_NODELAY");

if (setsockopt(s, SOL_TCP, TCP_ENCAP, NULL, 0) < 0)
error(-1, errno, "TCP_ENCAP");

while ((err = read(s, buf, sizeof(buf))) > 0)
;

if (err < 0)
error(-1, errno, "read");

return 0;
}


Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[PATCH v2] xfrm: Return error on unknown encap_type in init_state

2018-01-05 Thread Herbert Xu

On Fri, Jan 05, 2018 at 09:32:47AM +0100, Steffen Klassert wrote:
>
> Looks like we catch the unknown mode in __xfrm_init_state().
> But in any case, if we want to return -EINVAL on unknown mode,
> we should do it for IPv6 and for IPv4.

OK, how about this one then:

---8<---
Currently esp will happily create an xfrm state with an unknown
encap type for IPv4, without setting the necessary state parameters.
This patch fixes it by returning -EINVAL.

There is a similar problem in IPv6 where if the mode is unknown
we will skip initialisation while returning zero.  However, this
is harmless as the mode has already been checked further up the
stack.  This patch removes this anomaly by aligning the IPv6
behaviour with IPv4 and treating unknown modes (which cannot
actually happen) as transport mode.

Fixes: 38320c70d282 ("[IPSEC]: Use crypto_aead and authenc in ESP")
Signed-off-by: Herbert Xu <herb...@gondor.apana.org.au>

diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index d57aa64..61fe6e4 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -981,6 +981,7 @@ static int esp_init_state(struct xfrm_state *x)
 
switch (encap->encap_type) {
default:
+   err = -EINVAL;
goto error;
case UDP_ENCAP_ESPINUDP:
x->props.header_len += sizeof(struct udphdr);
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index a902ff8..1a7f00c 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -890,13 +890,12 @@ static int esp6_init_state(struct xfrm_state *x)
x->props.header_len += IPV4_BEET_PHMAXLEN +
   (sizeof(struct ipv6hdr) - 
sizeof(struct iphdr));
break;
+   default:
case XFRM_MODE_TRANSPORT:
break;
case XFRM_MODE_TUNNEL:
x->props.header_len += sizeof(struct ipv6hdr);
break;
-   default:
-   goto error;
}
 
align = ALIGN(crypto_aead_blocksize(aead), 4);
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[PATCH v2] xfrm: Use __skb_queue_tail in xfrm_trans_queue

2018-01-04 Thread Herbert Xu

On Thu, Jan 04, 2018 at 12:20:26PM +0100, Artem Savkov wrote:

> Right, thats a better solution.
> 
> Reported-and-tested-by: Artem Savkov <asav...@redhat.com>

Thanks!

But I just realised that this patch is based on my dirty tree.
So here is a rebased version:

---8<---
We do not need locking in xfrm_trans_queue because it is designed
to use per-CPU buffers.  However, the original code incorrectly
used skb_queue_tail which takes the lock.  This patch switches
it to __skb_queue_tail instead.

Reported-and-tested-by: Artem Savkov <asav...@redhat.com>
Fixes: acf568ee859f ("xfrm: Reinject transport-mode packets...")
Signed-off-by: Herbert Xu <herb...@gondor.apana.org.au>

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 444fa37..9dbf425 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -508,7 +508,7 @@ int xfrm_trans_queue(struct sk_buff *skb,
return -ENOBUFS;
 
XFRM_TRANS_SKB_CB(skb)->finish = finish;
-   skb_queue_tail(>queue, skb);
+   __skb_queue_tail(>queue, skb);
tasklet_schedule(>tasklet);
return 0;
 }
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

xfrm: Return error on unknown switch in init_state

2018-01-04 Thread Herbert Xu

Currently esp will happily create an xfrm state with an unknown
encap type for IPv4 or an unknown mode for IPv6, without setting
the necessary state parameters.  This patch fixes it by returning
-EINVAL.

Fixes: 38320c70d282 ("[IPSEC]: Use crypto_aead and authenc in ESP")
Signed-off-by: Herbert Xu <herb...@gondor.apana.org.au>

diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index d57aa64..61fe6e4 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -981,6 +981,7 @@ static int esp_init_state(struct xfrm_state *x)
 
switch (encap->encap_type) {
default:
+   err = -EINVAL;
goto error;
case UDP_ENCAP_ESPINUDP:
x->props.header_len += sizeof(struct udphdr);
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index a902ff8..f2130ff 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -896,6 +896,7 @@ static int esp6_init_state(struct xfrm_state *x)
x->props.header_len += sizeof(struct ipv6hdr);
break;
default:
+   err = -EINVAL;
goto error;
    }
 
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH] xfrm: init skb_head lock for transport-mode packets

2018-01-04 Thread Herbert Xu

On Thu, Jan 04, 2018 at 11:36:28AM +0100, Artem Savkov wrote:
> Commit acf568ee859f "xfrm: Reinject transport-mode packets through tasklet"
> adds an sk_buff_head queue, but never initializes trans->queue.lock, which
> results in a "spinlock bad magic" BUG on skb_queue_tail() call in
> xfrm_trans_queue.
> Use skb_queue_head_init() instead of __skb_queue_head_init() to properly
> initialize said lock.
> 
> Signed-off-by: Artem Savkov <asav...@redhat.com>

Thanks for catching this.  But we don't need the lock as this
is meant to be per-CPU only.  So we should remove the locking
instead:

---8<---
xfrm: Use __skb_queue_tail in xfrm_trans_queue

We do not need locking in xfrm_trans_queue because it is designed
to use per-CPU buffers.  However, the original code incorrectly
used skb_queue_tail which takes the lock.  This patch switches
it to __skb_queue_tail instead.

Reported-by: Artem Savkov <asav...@redhat.com>
Fixes: acf568ee859f ("xfrm: Reinject transport-mode packets...")
Signed-off-by: Herbert Xu <herb...@gondor.apana.org.au>

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 098f47a..1eb0bba 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -511,7 +511,7 @@ int xfrm_trans_queue_net(struct net *net, struct sk_buff 
*skb,
 
XFRM_TRANS_SKB_CB(skb)->finish = finish;
XFRM_TRANS_SKB_CB(skb)->net = net;
-   skb_queue_tail(>queue, skb);
+   __skb_queue_tail(>queue, skb);
tasklet_schedule(>tasklet);
return 0;
 }
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

xfrm: Forbid state updates from changing encap type

2017-12-25 Thread Herbert Xu

Currently we allow state updates to competely replace the contents
of x->encap.  This is bad because on the user side ESP only sets up
header lengths depending on encap_type once when the state is first
created.  This could result in the header lengths getting out of
sync with the actual state configuration.

In practice key managers will never do a state update to change the
encapsulation type.  Only the port numbers need to be changed as the
peer NAT entry is updated.

Therefore this patch adds a check in xfrm_state_update to forbid
any changes to the encap_type.

Signed-off-by: Herbert Xu <herb...@gondor.apana.org.au>

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 065d896..dc1cdde 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1533,8 +1533,12 @@ int xfrm_state_update(struct xfrm_state *x)
err = -EINVAL;
spin_lock_bh(>lock);
if (likely(x1->km.state == XFRM_STATE_VALID)) {
-   if (x->encap && x1->encap)
+   if (x->encap && x1->encap &&
+   x->encap->encap_type == x1->encap->encap_type)
memcpy(x1->encap, x->encap, sizeof(*x1->encap));
+   else if (x->encap || x1->encap)
+   goto fail;
+
if (x->coaddr && x1->coaddr) {
memcpy(x1->coaddr, x->coaddr, sizeof(*x1->coaddr));
}
@@ -1551,6 +1555,8 @@ int xfrm_state_update(struct xfrm_state *x)
x->km.state = XFRM_STATE_DEAD;
__xfrm_state_put(x);
}
+
+fail:
spin_unlock_bh(>lock);
 
xfrm_state_put(x1);
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH 2/3] rhashtable: Add rhashtable_walk_curr

2017-12-18 Thread Herbert Xu

On Mon, Dec 18, 2017 at 02:31:21PM +0100, Andreas Gruenbacher wrote:
> When iterating through an rhashtable is stopped with
> rhashtable_walk_stop and then resumed with rhashtable_walk_start, there
> currently is no way to get back to the current object and thus revisit
> the object rhashtable_walk_next has previously returned.
> 
> This functionality is useful when dumping an rhashtable via the seq file
> interface: seq_read will convert one object after the other.  When an
> object doesn't fit in the remaining buffer space anymore, user-space
> will be returned all objects that have been fully converted so far.
> Upon the next read from user-space, the object that didn't fit
> previously will be revisited.
> 
> Signed-off-by: Andreas Gruenbacher <agrue...@redhat.com>

Doesn't the helper that Tom Herbert just added do exactly this?

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

xfrm: Reinject transport-mode packets through tasklet

2017-12-14 Thread Herbert Xu

This is an old bugbear of mine:

https://www.mail-archive.com/netdev@vger.kernel.org/msg03894.html

By crafting special packets, it is possible to cause recursion
in our kernel when processing transport-mode packets at levels
that are only limited by packet size.

The easiest one is with DNAT, but an even worse one is where
UDP encapsulation is used in which case you just have to insert
an UDP encapsulation header in between each level of recursion.

This patch avoids this problem by reinjecting tranport-mode packets
through a tasklet.

Fixes: b05e106698d9 ("[IPV4/6]: Netfilter IPsec input hooks")
Signed-off-by: Herbert Xu <herb...@gondor.apana.org.au>

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index dc28a98..ae35991 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1570,6 +1570,9 @@ struct xfrmk_spdinfo {
 int xfrm_prepare_input(struct xfrm_state *x, struct sk_buff *skb);
 int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type);
 int xfrm_input_resume(struct sk_buff *skb, int nexthdr);
+int xfrm_trans_queue(struct sk_buff *skb,
+int (*finish)(struct net *, struct sock *,
+  struct sk_buff *));
 int xfrm_output_resume(struct sk_buff *skb, int err);
 int xfrm_output(struct sock *sk, struct sk_buff *skb);
 int xfrm_inner_extract_output(struct xfrm_state *x, struct sk_buff *skb);
diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c
index e50b7fe..bcfc00e 100644
--- a/net/ipv4/xfrm4_input.c
+++ b/net/ipv4/xfrm4_input.c
@@ -23,6 +23,12 @@ int xfrm4_extract_input(struct xfrm_state *x, struct sk_buff 
*skb)
return xfrm4_extract_header(skb);
 }
 
+static int xfrm4_rcv_encap_finish2(struct net *net, struct sock *sk,
+  struct sk_buff *skb)
+{
+   return dst_input(skb);
+}
+
 static inline int xfrm4_rcv_encap_finish(struct net *net, struct sock *sk,
 struct sk_buff *skb)
 {
@@ -33,7 +39,11 @@ static inline int xfrm4_rcv_encap_finish(struct net *net, 
struct sock *sk,
 iph->tos, skb->dev))
goto drop;
}
-   return dst_input(skb);
+
+   if (xfrm_trans_queue(skb, xfrm4_rcv_encap_finish2))
+   goto drop;
+
+   return 0;
 drop:
kfree_skb(skb);
return NET_RX_DROP;
diff --git a/net/ipv6/xfrm6_input.c b/net/ipv6/xfrm6_input.c
index fe04e23..841f4a0 100644
--- a/net/ipv6/xfrm6_input.c
+++ b/net/ipv6/xfrm6_input.c
@@ -32,6 +32,14 @@ int xfrm6_rcv_spi(struct sk_buff *skb, int nexthdr, __be32 
spi,
 }
 EXPORT_SYMBOL(xfrm6_rcv_spi);
 
+static int xfrm6_transport_finish2(struct net *net, struct sock *sk,
+  struct sk_buff *skb)
+{
+   if (xfrm_trans_queue(skb, ip6_rcv_finish))
+   __kfree_skb(skb);
+   return -1;
+}
+
 int xfrm6_transport_finish(struct sk_buff *skb, int async)
 {
struct xfrm_offload *xo = xfrm_offload(skb);
@@ -56,7 +64,7 @@ int xfrm6_transport_finish(struct sk_buff *skb, int async)
 
NF_HOOK(NFPROTO_IPV6, NF_INET_PRE_ROUTING,
dev_net(skb->dev), NULL, skb, skb->dev, NULL,
-   ip6_rcv_finish);
+   xfrm6_transport_finish2);
return -1;
 }
 
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 347ab31..444fa37 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -8,15 +8,29 @@
  *
  */
 
+#include 
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 
+struct xfrm_trans_tasklet {
+   struct tasklet_struct tasklet;
+   struct sk_buff_head queue;
+};
+
+struct xfrm_trans_cb {
+   int (*finish)(struct net *net, struct sock *sk, struct sk_buff *skb);
+};
+
+#define XFRM_TRANS_SKB_CB(__skb) ((struct xfrm_trans_cb *)&((__skb)->cb[0]))
+
 static struct kmem_cache *secpath_cachep __read_mostly;
 
 static DEFINE_SPINLOCK(xfrm_input_afinfo_lock);
@@ -25,6 +39,8 @@
 static struct gro_cells gro_cells;
 static struct net_device xfrm_napi_dev;
 
+static DEFINE_PER_CPU(struct xfrm_trans_tasklet, xfrm_trans_tasklet);
+
 int xfrm_input_register_afinfo(const struct xfrm_input_afinfo *afinfo)
 {
int err = 0;
@@ -467,9 +483,41 @@ int xfrm_input_resume(struct sk_buff *skb, int nexthdr)
 }
 EXPORT_SYMBOL(xfrm_input_resume);
 
+static void xfrm_trans_reinject(unsigned long data)
+{
+   struct xfrm_trans_tasklet *trans = (void *)data;
+   struct sk_buff_head queue;
+   struct sk_buff *skb;
+
+   __skb_queue_head_init();
+   skb_queue_splice_init(>queue, );
+
+   while ((skb = __skb_dequeue()))
+   XFRM_TRANS_SKB_CB(skb)->finish(dev_net(skb->dev), NULL, skb);
+}
+
+int xfrm_trans_queue(struct sk_buff *skb,
+int (*finish)(struct net *, struct sock *,
+  struct sk_buff *))
+{
+   struct xfrm_trans_ta

Re: [PATCH v2 net-next 0/5] rhashtable: New features in walk and bucket

2017-12-11 Thread Herbert Xu

On Tue, Dec 05, 2017 at 02:47:58PM -0500, David Miller wrote:
>
> I'll allow Herbert time to think about this some more as he requested
> in comments against the first version of this series.

Sorry for the late response.  Tom's changes look good to me.

We should also fix up all existing rhashtable users that dump
through netlink to use the new peek interface.

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v2 net-next 2/5] rhashtable: Add rhastable_walk_peek

2017-12-11 Thread Herbert Xu

On Mon, Dec 04, 2017 at 10:31:42AM -0800, Tom Herbert wrote:
> This function is like rhashtable_walk_next except that it only returns
> the current element in the inter and does not advance the iter.
> 
> This patch also creates __rhashtable_walk_find_next. It finds the next
> element in the table when the entry cached in iter is NULL or at the end
> of a slot. __rhashtable_walk_find_next is called from
> rhashtable_walk_next and rhastable_walk_peek.
> 
> end_of_table is an added field to the iter structure. This indicates
> that the end of table was reached (walker.tbl being NULL is not a
> sufficient condition for end of table).
> 
> Signed-off-by: Tom Herbert <t...@quantonium.net>

Acked-by: Herbert Xu <herb...@gondor.apana.org.au>
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v2 net-next 1/5] rhashtable: Change rhashtable_walk_start to return void

2017-12-11 Thread Herbert Xu

On Mon, Dec 04, 2017 at 10:31:41AM -0800, Tom Herbert wrote:
> Most callers of rhashtable_walk_start don't care about a resize event
> which is indicated by a return value of -EAGAIN. So calls to
> rhashtable_walk_start are wrapped wih code to ignore -EAGAIN. Something
> like this is common:
> 
>ret = rhashtable_walk_start(rhiter);
>if (ret && ret != -EAGAIN)
>goto out;
> 
> Since zero and -EAGAIN are the only possible return values from the
> function this check is pointless. The condition never evaluates to true.
> 
> This patch changes rhashtable_walk_start to return void. This simplifies
> code for the callers that ignore -EAGAIN. For the few cases where the
> caller cares about the resize event, particularly where the table can be
> walked in mulitple parts for netlink or seq file dump, the function
> rhashtable_walk_start_check has been added that returns -EAGAIN on a
> resize event.
> 
> Signed-off-by: Tom Herbert <t...@quantonium.net>

Acked-by: Herbert Xu <herb...@gondor.apana.org.au>
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: Transport mode xfrm_gro

2017-12-05 Thread Herbert Xu

On Wed, Dec 06, 2017 at 02:37:17PM +1100, Herbert Xu wrote:
>
> So why is xfrm_input in the xfrm_gro case trying to reinject the
> skb into the network stack?

Nevermind, I see now that transport_finish has code to skip xfrm_gro
packets.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Transport mode xfrm_gro

2017-12-05 Thread Herbert Xu

Hi Steffen:

I'm looking at the function xfrm_input near the end where it deals
with transport mode packets:

err = x->inner_mode->afinfo->transport_finish(skb, xfrm_gro || 
async);
if (xfrm_gro) {
if (skb->sp)
skb->sp->olen = 0;
skb_dst_drop(skb);
gro_cells_receive(_cells, skb);
return err;
}

This looks wrong because in transport mode, transport_finish is
well within its rights to consume and free the skb.  For example,
IPv4 transport_finish eventually calls xfrm4_rcv_encap_finish which
does:

if (!skb_dst(skb)) {
const struct iphdr *iph = ip_hdr(skb);

if (ip_route_input_noref(skb, iph->daddr, iph->saddr,
 iph->tos, skb->dev))
goto drop;
}
return dst_input(skb);
drop:
kfree_skb(skb);
return NET_RX_DROP;

Whichever path it takes the skb is either gone or belongs to someone
else.

So why is xfrm_input in the xfrm_gro case trying to reinject the
skb into the network stack?

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH net-next 1/5] rhashtable: Don't reset walker table in rhashtable_walk_start

2017-12-01 Thread Herbert Xu

On Thu, Nov 30, 2017 at 04:03:01PM -0800, Tom Herbert wrote:
> Remove the code that resets the walker table. The walker table should
> only be initialized in the walk init function or when a future table is
> encountered. If the walker table is NULL this is the indication that
> the walk has completed and this information can be used to break a
> multi-call walk in the table (e.g. successive calls to nelink_dump
> that are dumping elements of an rhashtable).
> 
> This also allows us to change rhashtable_walk_start to return void
> since the only error it was returning was -EAGAIN for a table change.
> This patch changes all the callers of rhashtable_walk_start to expect
> void which eliminates logic needed to check the return value for a
> rare condition. Note that -EAGAIN will be returned in a call
> to rhashtable_walk_next which seems to always follow the start
> of the walk so there should be no behavioral change in doing this.
> 
> Signed-off-by: Tom Herbert <t...@quantonium.net>

Doesn't this mean that if a walk encounters a rehash you may end up
missing half or more of the hash table?

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH net-next 2/5] rhashtable: Add rhastable_walk_peek

2017-11-30 Thread Herbert Xu

On Thu, Nov 30, 2017 at 05:15:16PM -0800, Tom Herbert wrote:
>
> We don't need a guarantee of stability, but what I am seeing is that
> we're consisitently dropping entries on when doing a multi-part
> netlink walk. We start iterating over the table filling in the netlink
> info. But eventually the netlink info fills up and returns an error.
> netlink dump gets called again but now the iter of the table returns
> the object following the one that would have overflowed the netlink
> buffer. So the result I was seeing is that we dropped one object in in
> each pass.

Thanks Tom! This information is very useful.

It sounds like this problem isn't specific to ila and would exist
for all rhashtable users that dump through netlink.  Let me think
about this a little bit more.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH net-next 2/5] rhashtable: Add rhastable_walk_peek

2017-11-30 Thread Herbert Xu

On Thu, Nov 30, 2017 at 04:03:02PM -0800, Tom Herbert wrote:
> This function is like rhashtable_walk_next except that it only returns
> the current element in the inter and does not advance the iter.
> 
> This patch also creates __rhashtable_walk_find_next. It finds the next
> element in the table when the entry cached in iter is NULL or at the end
> of a slot. __rhashtable_walk_find_next is called from
> rhashtable_walk_next and rhastable_walk_peek.
> 
> Signed-off-by: Tom Herbert <t...@quantonium.net>

Hi Tom:

Could you add some motivation for this feature into the patch
description? As it is it's difficult to deduce why we would want
to add something like this given that hashtable walks are always
unstable and there is no guarantee that two calls to peek or a
peek followed by a normal walk will see the same entry.

Thanks,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [5/8] crypto: remove unused hardirq.h

2017-11-28 Thread Herbert Xu

On Sat, Nov 18, 2017 at 07:02:18AM +0800, Yang Shi wrote:
> Preempt counter APIs have been split out, currently, hardirq.h just
> includes irq_enter/exit APIs which are not used by crypto at all.
> 
> So, remove the unused hardirq.h.
> 
> Signed-off-by: Yang Shi <yan...@alibaba-inc.com>
> Cc: Herbert Xu <herb...@gondor.apana.org.au>
> Cc: "David S. Miller" <da...@davemloft.net>
> Cc: linux-cry...@vger.kernel.org

Patch applied.  Thanks.
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [crypto v3 2/2] chcr: Add support for Inline IPSec

2017-11-28 Thread Herbert Xu

On Thu, Nov 16, 2017 at 04:57:08PM +0530, Atul Gupta wrote:
> register xfrmdev_ops callbacks, Send IPsec tunneled data
> to HW for inline processing.
> The driver use hardware crypto accelerator to encrypt and
> generate ICV for the transmitted packet in Inline mode.
> 
> Signed-off-by: Atul Gupta <atul.gu...@chelsio.com>
> Signed-off-by: Harsh Jain <ha...@chelsio.com>
> Signed-off-by: Ganesh Goudar <ganes...@chelsio.com>
> ---
> V2: Fixed the build warnings and created patch against cryptodev
> to avoid possible merge conflicts
> V3: Fixed a build warning and added flag for data packet

Patch applied.  Thanks.
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [crypto v3 1/2] cxgb4: Add support for Inline IPSec Tx

2017-11-28 Thread Herbert Xu

On Thu, Nov 16, 2017 at 04:56:39PM +0530, Atul Gupta wrote:
> Added Tx routine for ULD
> - define interface for ULD Tx.
> 
> Export routines used for Tx data
> - Routines common for data transmit are used by cxgb4 and chcr
>   drivers.
> - EXPORT routines enable transmit from chcr driver.
> 
> Signed-off-by: Atul Gupta <atul.gu...@chelsio.com>
> Signed-off-by: Ganesh Goudar <ganes...@chelsio.com>
> ---
> V2: Fixed the build warnings and created patch against cryptodev
> to avoid possible merge conflicts
> V3: Fixed a build warning and added flag for data packet

Patch applied.  Thanks.
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: GRO disabled with IPv4 options

2017-11-16 Thread Herbert Xu

On Thu, Nov 16, 2017 at 04:12:43PM +0100, Cristian Klein wrote:
>
> Does somebody know the rationale for this? Is it because IPv4
> options are rarely used, hence implementing GRO in that case does
> not pay off or are there some caveats? Specifically would it make

Precisely.  GRO is about optimising for the common case.  At the
time there was no impetus to support IP options.

> sense to do GRO when the IPv4 options are byte-identical in
> consecutive packets?

Yes there is no reason why we can't do this.  As long as it doesn't
penalise the non-IP-option case too much.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH v2 2/2] chcr: Add support for Inline IPSec

2017-11-15 Thread Herbert Xu

On Thu, Nov 16, 2017 at 01:19:52PM +0800, kbuild test robot wrote:
> Hi Atul,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on cryptodev/master]
> [also build test ERROR on next-20171115]
> [cannot apply to v4.14]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Atul-Gupta/cxgb4-Add-support-for-Inline-IPSec-Tx/20171112-012558
> base:   
> https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git 
> master
> config: x86_64-randconfig-g0-11160917 (attached as .config)
> compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64 
> 
> All errors (new ones prefixed by >>):
> 
>drivers/crypto/chelsio/chcr_core.o: In function `chcr_uld_tx_handler':
> >> drivers/crypto/chelsio/chcr_core.c:195: undefined reference to 
> >> `chcr_ipsec_xmit'
>drivers/crypto/chelsio/chcr_core.o: In function `chcr_uld_add':
> >> drivers/crypto/chelsio/chcr_core.c:169: undefined reference to 
> >> `chcr_add_xfrmops'

Atul, can you fix this and resubmit your patches please?

Thanks!
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: kernel BUG at net/key/af_key.c:LINE!

2017-11-09 Thread Herbert Xu

On Fri, Nov 10, 2017 at 01:30:38PM +1100, Herbert Xu wrote:
> 
> I found the problem.  This crap is coming from clone_policy.  Now
> let me where this code came from.

---8<---
Subject: xfrm: Copy policy family in clone_policy

The syzbot found an ancient bug in the IPsec code.  When we cloned
a socket policy (for example, for a child TCP socket derived from a
listening socket), we did not copy the family field.  This results
in a live policy with a zero family field.  This triggers a BUG_ON
check in the af_key code when the cloned policy is retrieved.

This patch fixes it by copying the family field over.

Reported-by: syzbot <syzkal...@googlegroups.com>
Signed-off-by: Herbert Xu <herb...@gondor.apana.org.au>

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 8cafb3c..c238959 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1306,6 +1306,7 @@ static struct xfrm_policy *clone_policy(const struct 
xfrm_policy *old, int dir)
newp->xfrm_nr = old->xfrm_nr;
newp->index = old->index;
newp->type = old->type;
+   newp->family = old->family;
memcpy(newp->xfrm_vec, old->xfrm_vec,
   newp->xfrm_nr*sizeof(struct xfrm_tmpl));
    spin_lock_bh(>xfrm.xfrm_policy_lock);
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: kernel BUG at net/key/af_key.c:LINE!

2017-11-09 Thread Herbert Xu

On Fri, Nov 10, 2017 at 01:11:45PM +1100, Herbert Xu wrote:
>
> Oh and this is an important clue.  We have two policies with
> identical index values.  The index value is meant to be unique
> so clearly something funny is going on.

I found the problem.  This crap is coming from clone_policy.  Now
let me where this code came from.
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: kernel BUG at net/key/af_key.c:LINE!

2017-11-09 Thread Herbert Xu

On Fri, Nov 10, 2017 at 01:04:59PM +1100, Herbert Xu wrote:
> 
> By castrating the reproducer to not perform a pfkey dump I have
> captured the corrupted policy via xfrm:
> 
> src ???/0 dst ???/0 uid 0
> socket in action allow index 2083 priority 0 ptype main share any 
> flag  (0x)
> lifetime config:
>   limit: soft 0(bytes), hard 0(bytes)
>   limit: soft 0(packets), hard 0(packets)
>   expire add: soft 0(sec), hard 0(sec)
>   expire use: soft 0(sec), hard 0(sec)
> lifetime current:
>   0(bytes), 0(packets)
>   add 2017-11-10 09:58:17 use 2017-11-10 09:58:20
> tmpl src ac14:bb:: dst ::
> proto 0 spi 0x(0) reqid 0(0x) mode transport
> level 5 share any 
> enc-mask  auth-mask  comp-mask 
> 
> For comparison here is a good policy that was also created by the
> reproducer:
> 
> src fe80::bb/0 dst ::/0 uid 0
> socket in action allow index 2083 priority 0 ptype main share any 
> flag  (0x)
> lifetime config:
>   limit: soft 0(bytes), hard 0(bytes)
>   limit: soft 0(packets), hard 0(packets)
>   expire add: soft 0(sec), hard 0(sec)
>   expire use: soft 0(sec), hard 0(sec)
> lifetime current:
>   0(bytes), 0(packets)
>   add 2017-11-10 09:58:17 use 2017-11-10 09:58:17
> tmpl src ac14:bb:: dst ::
> proto 0 spi 0x(0) reqid 0(0x) mode transport
> level 5 share any 
> enc-mask  auth-mask  comp-mask 

Oh and this is an important clue.  We have two policies with
identical index values.  The index value is meant to be unique
so clearly something funny is going on.

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: kernel BUG at net/key/af_key.c:LINE!

2017-11-09 Thread Herbert Xu

On Thu, Nov 09, 2017 at 10:38:57PM +1100, Herbert Xu wrote:
> 
> The xfrm code path is meant to forbid the creation of such a policy.
> I don't currently see how this is bypassing that check.  But
> clearly it has found a way through the check since it's crashing.

By castrating the reproducer to not perform a pfkey dump I have
captured the corrupted policy via xfrm:

src ???/0 dst ???/0 uid 0
socket in action allow index 2083 priority 0 ptype main share any flag  
(0x)
lifetime config:
  limit: soft 0(bytes), hard 0(bytes)
  limit: soft 0(packets), hard 0(packets)
  expire add: soft 0(sec), hard 0(sec)
  expire use: soft 0(sec), hard 0(sec)
lifetime current:
  0(bytes), 0(packets)
  add 2017-11-10 09:58:17 use 2017-11-10 09:58:20
tmpl src ac14:bb:: dst ::
proto 0 spi 0x(0) reqid 0(0x) mode transport
level 5 share any 
enc-mask  auth-mask  comp-mask 

For comparison here is a good policy that was also created by the
reproducer:

src fe80::bb/0 dst ::/0 uid 0
socket in action allow index 2083 priority 0 ptype main share any flag  
(0x)
lifetime config:
  limit: soft 0(bytes), hard 0(bytes)
  limit: soft 0(packets), hard 0(packets)
  expire add: soft 0(sec), hard 0(sec)
  expire use: soft 0(sec), hard 0(sec)
lifetime current:
  0(bytes), 0(packets)
  add 2017-11-10 09:58:17 use 2017-11-10 09:58:17
tmpl src ac14:bb:: dst ::
proto 0 spi 0x(0) reqid 0(0x) mode transport
level 5 share any 
enc-mask  auth-mask  comp-mask 

Cheers,
-- 
Email: Herbert Xu <herb...@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 2197 matches

Mail list logo