Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling

2018-04-02 Thread Vincent Bernat
 ❦  2 avril 2018 22:05 +0300, Julian Anastasov  :

>   Sorry to say it but may be you missed the discussion
> on lvs-devel about the new MH scheduler implemented by Inju Song:
>
> https://www.spinics.net/lists/lvs-devel/msg04928.html
> http://archive.linuxvirtualserver.org/html/lvs-devel/2018-03/msg00023.html
>
>   In the last 6 months we fixed all issues and I acked
> v4 just yesterday:
>
> http://archive.linuxvirtualserver.org/html/lvs-devel/2018-04/msg3.html

For some reason, "ipvs maglev" in Google doesn't yield those
results. But having code already reviewed is great news for me: I can
use it with confidence. :)
-- 
Make sure your code "does nothing" gracefully.
- The Elements of Programming Style (Kernighan & Plauger)


Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling

2018-04-02 Thread Julian Anastasov

Hello,

On Mon, 2 Apr 2018, Vincent Bernat wrote:

> Based on Google's Maglev algorithm [1][2], this scheduler builds a
> lookup table in a way disruption is minimized when a change
> occurs. This helps in case of active/active setup without
> synchronization. Like for classic source hashing, this lookup table is
> used to assign connections to a real server.
> 
> Both source address and port are used to compute the hash (unlike sh
> where this is optional).
> 
> Weights are correctly handled. Unlike sh, servers with a weight of 0
> are considered as absent. Also, unlike sh, when a server becomes
> unavailable due to a threshold, no fallback is possible: doing so
> would seriously impair the the usefulness of using a consistent hash.
> 
> There is a small hack to detect when all real servers have a weight of
> 0. It relies on the fact it is not possible for the weight of a real
> server to change during the execution of the assignment. I believe
> this is the case as modifications through netlink are subject to a
> mutex, but the use of atomic_read() is unsettling.
> 
> The value of 65537 for the hash table size is currently not modifiable
> at compile-time. This is the value suggested in the Maglev
> paper. Another possible value is 257 (for small tests) and 655373 (for
> very large setups).
> 
> [1]: https://research.google.com/pubs/pub44824.html
> [2]: 
> https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/

Sorry to say it but may be you missed the discussion
on lvs-devel about the new MH scheduler implemented by Inju Song:

https://www.spinics.net/lists/lvs-devel/msg04928.html
http://archive.linuxvirtualserver.org/html/lvs-devel/2018-03/msg00023.html

In the last 6 months we fixed all issues and I acked
v4 just yesterday:

http://archive.linuxvirtualserver.org/html/lvs-devel/2018-04/msg3.html

This scheduler supports:

- tables with different size (prime): IP_VS_MH_TAB_INDEX
- gcd of weights: ip_vs_mh_gcd_weight
- shifted weights: ip_vs_mh_shift_weight
- weight can be changed any time

> Signed-off-by: Vincent Bernat 
> ---
>  include/net/ip_vs.h|  27 
>  net/netfilter/ipvs/Kconfig |  13 ++
>  net/netfilter/ipvs/Makefile|   1 +
>  net/netfilter/ipvs/ip_vs_csh.c | 339 
> +
>  net/netfilter/ipvs/ip_vs_sh.c  |  32 +---
>  5 files changed, 381 insertions(+), 31 deletions(-)
>  create mode 100644 net/netfilter/ipvs/ip_vs_csh.c

Regards

--
Julian Anastasov 


Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling

2018-04-02 Thread Vincent Bernat
 ❦  2 avril 2018 10:33 -0700, Eric Dumazet  :

>> +static inline u32
>> +ip_vs_csh_permutation(struct ip_vs_dest *d, int j)
>> +{
>> +u32 offset, skip;
>> +__be32 addr_fold = d->addr.ip;
>> +
>> +#ifdef CONFIG_IP_VS_IPV6
>> +if (d->af == AF_INET6)
>> +addr_fold = d->addr.ip6[0]^d->addr.ip6[1]^
>> +d->addr.ip6[2]^d->addr.ip6[3];
>> +#endif
>> +addr_fold = ntohl(addr_fold) + ntohs(d->port);
>> +offset = hash_32(addr_fold, 32) % IP_VS_CSH_TAB_SIZE;
>> +skip = (hash_32(addr_fold + 1, 32) % (IP_VS_CSH_TAB_SIZE - 1)) + 1;
>> +return (offset + j * skip) % IP_VS_CSH_TAB_SIZE;
>> +}
>> +
>
> This does not look very strong to me, particularly the IPv6 folding
>
> I would rather use __ipv6_addr_jhash() instead of ipv6_addr_hash(),
> even if it is hard coded ;)

I can switch to ipv6_addr_hash(). However, switching to
__ipv6_addr_jhash seems useless as I would need to hardcode the initial
value: people use source hashing to get the same result from one host to
another. Am I missing something?
-- 
Each module should do one thing well.
- The Elements of Programming Style (Kernighan & Plauger)


Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling

2018-04-02 Thread Eric Dumazet


On 04/02/2018 10:20 AM, Vincent Bernat wrote:

> +static inline u32
> +ip_vs_csh_permutation(struct ip_vs_dest *d, int j)
> +{
> + u32 offset, skip;
> + __be32 addr_fold = d->addr.ip;
> +
> +#ifdef CONFIG_IP_VS_IPV6
> + if (d->af == AF_INET6)
> + addr_fold = d->addr.ip6[0]^d->addr.ip6[1]^
> + d->addr.ip6[2]^d->addr.ip6[3];
> +#endif
> + addr_fold = ntohl(addr_fold) + ntohs(d->port);
> + offset = hash_32(addr_fold, 32) % IP_VS_CSH_TAB_SIZE;
> + skip = (hash_32(addr_fold + 1, 32) % (IP_VS_CSH_TAB_SIZE - 1)) + 1;
> + return (offset + j * skip) % IP_VS_CSH_TAB_SIZE;
> +}
> +

This does not look very strong to me, particularly the IPv6 folding

I would rather use __ipv6_addr_jhash() instead of ipv6_addr_hash(), even if it 
is hard coded ;)





[PATCH net-next v1] ipvs: add consistent source hashing scheduling

2018-04-02 Thread Vincent Bernat
Based on Google's Maglev algorithm [1][2], this scheduler builds a
lookup table in a way disruption is minimized when a change
occurs. This helps in case of active/active setup without
synchronization. Like for classic source hashing, this lookup table is
used to assign connections to a real server.

Both source address and port are used to compute the hash (unlike sh
where this is optional).

Weights are correctly handled. Unlike sh, servers with a weight of 0
are considered as absent. Also, unlike sh, when a server becomes
unavailable due to a threshold, no fallback is possible: doing so
would seriously impair the the usefulness of using a consistent hash.

There is a small hack to detect when all real servers have a weight of
0. It relies on the fact it is not possible for the weight of a real
server to change during the execution of the assignment. I believe
this is the case as modifications through netlink are subject to a
mutex, but the use of atomic_read() is unsettling.

The value of 65537 for the hash table size is currently not modifiable
at compile-time. This is the value suggested in the Maglev
paper. Another possible value is 257 (for small tests) and 655373 (for
very large setups).

[1]: https://research.google.com/pubs/pub44824.html
[2]: 
https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/

Signed-off-by: Vincent Bernat 
---
 include/net/ip_vs.h|  27 
 net/netfilter/ipvs/Kconfig |  13 ++
 net/netfilter/ipvs/Makefile|   1 +
 net/netfilter/ipvs/ip_vs_csh.c | 339 +
 net/netfilter/ipvs/ip_vs_sh.c  |  32 +---
 5 files changed, 381 insertions(+), 31 deletions(-)
 create mode 100644 net/netfilter/ipvs/ip_vs_csh.c

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index eb0bec043c96..2184b43b7320 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -195,6 +195,33 @@ static inline int ip_vs_addr_equal(int af, const union 
nf_inet_addr *a,
return a->ip == b->ip;
 }
 
+static inline __be16 ip_vs_get_port(const struct sk_buff *skb, struct 
ip_vs_iphdr *iph)
+{
+   __be16 _ports[2], *ports;
+
+   /* At this point we know that we have a valid packet of some kind.
+* Because ICMP packets are only guaranteed to have the first 8
+* bytes, let's just grab the ports.  Fortunately they're in the
+* same position for all three of the protocols we care about.
+*/
+   switch (iph->protocol) {
+   case IPPROTO_TCP:
+   case IPPROTO_UDP:
+   case IPPROTO_SCTP:
+   ports = skb_header_pointer(skb, iph->len, sizeof(_ports),
+  &_ports);
+   if (unlikely(!ports))
+   return 0;
+
+   if (likely(!ip_vs_iph_inverse(iph)))
+   return ports[0];
+   else
+   return ports[1];
+   default:
+   return 0;
+   }
+}
+
 #ifdef CONFIG_IP_VS_DEBUG
 #include 
 
diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
index b32fb0dbe237..bfd091e020af 100644
--- a/net/netfilter/ipvs/Kconfig
+++ b/net/netfilter/ipvs/Kconfig
@@ -225,6 +225,19 @@ config IP_VS_SH
  If you want to compile it in kernel, say Y. To compile it as a
  module, choose M here. If unsure, say N.
 
+config IP_VS_CSH
+   tristate "consistent source hashing scheduling"
+   ---help---
+ The consistent source hashing scheduling algorithm assigns
+ network connections to the servers through looking up a
+ statically assigned hash table by their source IP addresses
+ and ports. The hash table is built to minimize disruption
+ when a server becomes unavailable. It relies on the Maglev
+ algorithm.
+
+ If you want to compile it in kernel, say Y. To compile it as a
+ module, choose M here. If unsure, say N.
+
 config IP_VS_SED
tristate "shortest expected delay scheduling"
---help---
diff --git a/net/netfilter/ipvs/Makefile b/net/netfilter/ipvs/Makefile
index c552993fa4b9..7d0badf86dfe 100644
--- a/net/netfilter/ipvs/Makefile
+++ b/net/netfilter/ipvs/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_IP_VS_LBLC) += ip_vs_lblc.o
 obj-$(CONFIG_IP_VS_LBLCR) += ip_vs_lblcr.o
 obj-$(CONFIG_IP_VS_DH) += ip_vs_dh.o
 obj-$(CONFIG_IP_VS_SH) += ip_vs_sh.o
+obj-$(CONFIG_IP_VS_CSH) += ip_vs_csh.o
 obj-$(CONFIG_IP_VS_SED) += ip_vs_sed.o
 obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o
 
diff --git a/net/netfilter/ipvs/ip_vs_csh.c b/net/netfilter/ipvs/ip_vs_csh.c
new file mode 100644
index ..c70db61ba934
--- /dev/null
+++ b/net/netfilter/ipvs/ip_vs_csh.c
@@ -0,0 +1,339 @@
+/*
+ * IPVS:Consistent Hashing scheduling module using Google's Maglev
+ *
+ * Authors: Vincent Bernat 
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under