Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling
 ❦ 2 April 2018 22:05 +0300, Julian Anastasov:

> Sorry to say it but maybe you missed the discussion
> on lvs-devel about the new MH scheduler implemented by Inju Song:
>
> https://www.spinics.net/lists/lvs-devel/msg04928.html
> http://archive.linuxvirtualserver.org/html/lvs-devel/2018-03/msg00023.html
>
> In the last 6 months we fixed all issues and I acked
> v4 just yesterday:
>
> http://archive.linuxvirtualserver.org/html/lvs-devel/2018-04/msg3.html

For some reason, "ipvs maglev" in Google doesn't yield those
results. But having code already reviewed is great news for me: I can
use it with confidence. :)
-- 
Make sure your code "does nothing" gracefully.
            - The Elements of Programming Style (Kernighan & Plauger)
Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling
Hello,

On Mon, 2 Apr 2018, Vincent Bernat wrote:

> Based on Google's Maglev algorithm [1][2], this scheduler builds a
> lookup table in a way that minimizes disruption when a change
> occurs. This helps in active/active setups without
> synchronization. As with classic source hashing, this lookup table is
> used to assign connections to a real server.
>
> Both source address and port are used to compute the hash (unlike sh,
> where the port is optional).
>
> Weights are handled correctly. Unlike sh, servers with a weight of 0
> are considered absent. Also unlike sh, when a server becomes
> unavailable due to a threshold, no fallback is possible: doing so
> would seriously impair the usefulness of a consistent hash.
>
> There is a small hack to detect when all real servers have a weight of
> 0. It relies on the fact that the weight of a real server cannot
> change during the execution of the assignment. I believe
> this is the case as modifications through netlink are subject to a
> mutex, but the use of atomic_read() is unsettling.
>
> The value of 65537 for the hash table size is currently not modifiable
> at compile-time. This is the value suggested in the Maglev
> paper. Other possible values are 257 (for small tests) and 655373 (for
> very large setups).
>
> [1]: https://research.google.com/pubs/pub44824.html
> [2]: https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/

Sorry to say it but maybe you missed the discussion
on lvs-devel about the new MH scheduler implemented by Inju Song:

https://www.spinics.net/lists/lvs-devel/msg04928.html
http://archive.linuxvirtualserver.org/html/lvs-devel/2018-03/msg00023.html

In the last 6 months we fixed all issues and I acked
v4 just yesterday:

http://archive.linuxvirtualserver.org/html/lvs-devel/2018-04/msg3.html

This scheduler supports:

- tables with different size (prime): IP_VS_MH_TAB_INDEX
- gcd of weights: ip_vs_mh_gcd_weight
- shifted weights: ip_vs_mh_shift_weight
- weight can be changed any time

> Signed-off-by: Vincent Bernat
> ---
>  include/net/ip_vs.h            |  27
>  net/netfilter/ipvs/Kconfig     |  13 ++
>  net/netfilter/ipvs/Makefile    |   1 +
>  net/netfilter/ipvs/ip_vs_csh.c | 339 +
>  net/netfilter/ipvs/ip_vs_sh.c  |  32 +---
>  5 files changed, 381 insertions(+), 31 deletions(-)
>  create mode 100644 net/netfilter/ipvs/ip_vs_csh.c

Regards

--
Julian Anastasov
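
The "gcd of weights" item in the list above can be illustrated with a small userspace sketch. It is not taken from the MH scheduler: the helper names gcd_uint, weights_gcd and weights_normalize are hypothetical, and the only idea assumed is that dividing every weight by the greatest common divisor keeps the ratios between servers while shrinking the numbers a table-filling loop has to iterate over.

```c
#include <stddef.h>

/* Euclid's algorithm; gcd_uint(x, 0) == x, so zero weights are neutral. */
static unsigned int gcd_uint(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/* Hypothetical helper: gcd of all weights, or 0 if every weight is 0. */
static unsigned int weights_gcd(const unsigned int *w, size_t n)
{
	unsigned int g = 0;
	size_t i;

	for (i = 0; i < n; i++)
		g = gcd_uint(g, w[i]);
	return g;
}

/* Hypothetical helper: divide each weight by the common gcd in place. */
static void weights_normalize(unsigned int *w, size_t n)
{
	unsigned int g = weights_gcd(w, n);
	size_t i;

	if (g <= 1)	/* all zero, or already in lowest terms */
		return;
	for (i = 0; i < n; i++)
		w[i] /= g;
}
```

With weights 100, 200, 0 and 300, the gcd is 100 and the normalized weights are 1, 2, 0 and 3, so a server with weight 0 stays at 0 and the others keep their 1:2:3 ratio.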
Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling
 ❦ 2 April 2018 10:33 -0700, Eric Dumazet:

>> +static inline u32
>> +ip_vs_csh_permutation(struct ip_vs_dest *d, int j)
>> +{
>> +	u32 offset, skip;
>> +	__be32 addr_fold = d->addr.ip;
>> +
>> +#ifdef CONFIG_IP_VS_IPV6
>> +	if (d->af == AF_INET6)
>> +		addr_fold = d->addr.ip6[0]^d->addr.ip6[1]^
>> +			d->addr.ip6[2]^d->addr.ip6[3];
>> +#endif
>> +	addr_fold = ntohl(addr_fold) + ntohs(d->port);
>> +	offset = hash_32(addr_fold, 32) % IP_VS_CSH_TAB_SIZE;
>> +	skip = (hash_32(addr_fold + 1, 32) % (IP_VS_CSH_TAB_SIZE - 1)) + 1;
>> +	return (offset + j * skip) % IP_VS_CSH_TAB_SIZE;
>> +}
>> +
>
> This does not look very strong to me, particularly the IPv6 folding
>
> I would rather use __ipv6_addr_jhash() instead of ipv6_addr_hash(),
> even if it is hard coded ;)

I can switch to ipv6_addr_hash(). However, switching to
__ipv6_addr_jhash seems useless as I would need to hardcode the initial
value: people use source hashing to get the same result from one host to
another. Am I missing something?
-- 
Each module should do one thing well.
            - The Elements of Programming Style (Kernighan & Plauger)
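
As a side note on the quoted function, the following userspace sketch shows why the (offset + j * skip) % size construction yields a permutation of the table slots when the size is prime and skip lies in [1, size - 1]. The names TAB_SIZE, permutation and is_full_permutation are hypothetical, 257 is the "small tests" size mentioned in the patch description, and the 64-bit intermediate is a choice of this sketch rather than something the patch does.

```c
#include <stdint.h>

#define TAB_SIZE 257u	/* small prime, assumed here for illustration */

/* Slot visited at step j of a server's preference list. */
static uint32_t permutation(uint32_t offset, uint32_t skip, uint32_t j)
{
	/* 64-bit arithmetic so that j * skip cannot wrap around */
	return (uint32_t)(((uint64_t)offset + (uint64_t)j * skip) % TAB_SIZE);
}

/* Returns 1 if steps 0..TAB_SIZE-1 visit every slot exactly once. */
static int is_full_permutation(uint32_t offset, uint32_t skip)
{
	unsigned char seen[TAB_SIZE] = {0};
	uint32_t j;

	for (j = 0; j < TAB_SIZE; j++) {
		uint32_t s = permutation(offset, skip, j);

		if (seen[s])
			return 0;	/* slot visited twice: not a permutation */
		seen[s] = 1;
	}
	return 1;
}
```

Any skip between 1 and TAB_SIZE - 1 passes the check because the size is prime, while a skip of 0 fails since every step lands on the same slot. With a table size of 65537, j * skip can reach 2^32 if j runs up to the table size, which is why the sketch widens the multiplication.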
Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling
On 04/02/2018 10:20 AM, Vincent Bernat wrote:

> +static inline u32
> +ip_vs_csh_permutation(struct ip_vs_dest *d, int j)
> +{
> +	u32 offset, skip;
> +	__be32 addr_fold = d->addr.ip;
> +
> +#ifdef CONFIG_IP_VS_IPV6
> +	if (d->af == AF_INET6)
> +		addr_fold = d->addr.ip6[0]^d->addr.ip6[1]^
> +			d->addr.ip6[2]^d->addr.ip6[3];
> +#endif
> +	addr_fold = ntohl(addr_fold) + ntohs(d->port);
> +	offset = hash_32(addr_fold, 32) % IP_VS_CSH_TAB_SIZE;
> +	skip = (hash_32(addr_fold + 1, 32) % (IP_VS_CSH_TAB_SIZE - 1)) + 1;
> +	return (offset + j * skip) % IP_VS_CSH_TAB_SIZE;
> +}
> +

This does not look very strong to me, particularly the IPv6 folding

I would rather use __ipv6_addr_jhash() instead of ipv6_addr_hash(),
even if it is hard coded ;)
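
One weakness of plain XOR folding, which may or may not be the exact concern raised here, is that it ignores the order of the four 32-bit words: swapping two words of an IPv6 address gives the same folded value. The userspace sketch below mirrors only the folding expression from the quoted code; xor_fold is a hypothetical name and the addresses used with it are arbitrary examples.

```c
#include <stdint.h>

/*
 * XOR-fold the four 32-bit words of an IPv6 address into one word,
 * as the folding expression in the quoted patch does.
 */
static uint32_t xor_fold(const uint32_t ip6[4])
{
	return ip6[0] ^ ip6[1] ^ ip6[2] ^ ip6[3];
}
```

For example, the word arrays {0x20010db8, 0, 1, 2} and {0x20010db8, 0, 2, 1} describe two different addresses but fold to the same value, so two such real servers would start from the same input to hash_32() unless their ports differ.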
[PATCH net-next v1] ipvs: add consistent source hashing scheduling
Based on Google's Maglev algorithm [1][2], this scheduler builds a
lookup table in a way that minimizes disruption when a change
occurs. This helps in active/active setups without
synchronization. As with classic source hashing, this lookup table is
used to assign connections to a real server.

Both source address and port are used to compute the hash (unlike sh,
where the port is optional).

Weights are handled correctly. Unlike sh, servers with a weight of 0
are considered absent. Also unlike sh, when a server becomes
unavailable due to a threshold, no fallback is possible: doing so
would seriously impair the usefulness of a consistent hash.

There is a small hack to detect when all real servers have a weight of
0. It relies on the fact that the weight of a real server cannot
change during the execution of the assignment. I believe
this is the case as modifications through netlink are subject to a
mutex, but the use of atomic_read() is unsettling.

The value of 65537 for the hash table size is currently not modifiable
at compile-time. This is the value suggested in the Maglev
paper. Other possible values are 257 (for small tests) and 655373 (for
very large setups).
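
For readers unfamiliar with Maglev, the table construction described above can be sketched in userspace roughly as follows. This is an illustration under stated assumptions, not the code from ip_vs_csh.c: the names TAB_SIZE, slot_at and maglev_populate are hypothetical, each server's offset and skip are taken as inputs rather than derived from a hash, and weights are handled by giving a server "weight" turns per round, which is only one possible approach. Each skip must lie in [1, TAB_SIZE - 1], otherwise the fill loop would not terminate.

```c
#include <stdint.h>
#include <stdlib.h>

#define TAB_SIZE 257u	/* small prime; the patch uses 65537 */

/* Slot visited by a server at step j of its preference list. */
static uint32_t slot_at(uint32_t offset, uint32_t skip, uint32_t j)
{
	return (uint32_t)(((uint64_t)offset + (uint64_t)j * skip) % TAB_SIZE);
}

/*
 * Fill table[0..TAB_SIZE-1] with server indices. Servers take turns
 * claiming the next free slot of their own preference list; a server
 * with weight w gets w turns per round (an assumption of this sketch).
 * Returns 0 on success, -1 if all weights are 0 or allocation fails.
 */
static int maglev_populate(int *table, const uint32_t *offset,
			   const uint32_t *skip, const unsigned int *weight,
			   size_t n)
{
	uint32_t *next, filled = 0;
	unsigned int total = 0, turn;
	size_t i;

	for (i = 0; i < n; i++)
		total += weight[i];
	if (total == 0)
		return -1;	/* no usable server */

	next = calloc(n, sizeof(*next));
	if (!next)
		return -1;
	for (i = 0; i < TAB_SIZE; i++)
		table[i] = -1;	/* -1 marks a free slot */

	while (filled < TAB_SIZE) {
		for (i = 0; i < n; i++) {
			for (turn = 0; turn < weight[i] && filled < TAB_SIZE; turn++) {
				uint32_t c = slot_at(offset[i], skip[i], next[i]++);

				/* skip slots already claimed by any server */
				while (table[c] >= 0)
					c = slot_at(offset[i], skip[i], next[i]++);
				table[c] = (int)i;
				filled++;
			}
		}
	}
	free(next);
	return 0;
}
```

With three servers of equal weight, every slot is assigned and the shares differ by at most one slot (86, 86 and 85 of 257). When every weight is 0 the function returns -1 instead of looping, which is the situation the "small hack" in the description has to detect.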

[1]: https://research.google.com/pubs/pub44824.html
[2]: https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/

Signed-off-by: Vincent Bernat
---
 include/net/ip_vs.h            |  27
 net/netfilter/ipvs/Kconfig     |  13 ++
 net/netfilter/ipvs/Makefile    |   1 +
 net/netfilter/ipvs/ip_vs_csh.c | 339 +
 net/netfilter/ipvs/ip_vs_sh.c  |  32 +---
 5 files changed, 381 insertions(+), 31 deletions(-)
 create mode 100644 net/netfilter/ipvs/ip_vs_csh.c

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index eb0bec043c96..2184b43b7320 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -195,6 +195,33 @@ static inline int ip_vs_addr_equal(int af, const union nf_inet_addr *a,
 	return a->ip == b->ip;
 }
 
+static inline __be16 ip_vs_get_port(const struct sk_buff *skb, struct ip_vs_iphdr *iph)
+{
+	__be16 _ports[2], *ports;
+
+	/* At this point we know that we have a valid packet of some kind.
+	 * Because ICMP packets are only guaranteed to have the first 8
+	 * bytes, let's just grab the ports.  Fortunately they're in the
+	 * same position for all three of the protocols we care about.
+	 */
+	switch (iph->protocol) {
+	case IPPROTO_TCP:
+	case IPPROTO_UDP:
+	case IPPROTO_SCTP:
+		ports = skb_header_pointer(skb, iph->len, sizeof(_ports),
+					   &_ports);
+		if (unlikely(!ports))
+			return 0;
+
+		if (likely(!ip_vs_iph_inverse(iph)))
+			return ports[0];
+		else
+			return ports[1];
+	default:
+		return 0;
+	}
+}
+
 #ifdef CONFIG_IP_VS_DEBUG
 #include
 
diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
index b32fb0dbe237..bfd091e020af 100644
--- a/net/netfilter/ipvs/Kconfig
+++ b/net/netfilter/ipvs/Kconfig
@@ -225,6 +225,19 @@ config IP_VS_SH
 	  If you want to compile it in kernel, say Y. To compile it as a
 	  module, choose M here. If unsure, say N.
 
+config IP_VS_CSH
+	tristate "consistent source hashing scheduling"
+	---help---
+	  The consistent source hashing scheduling algorithm assigns
+	  network connections to the servers through looking up a
+	  statically assigned hash table by their source IP addresses
+	  and ports. The hash table is built to minimize disruption
+	  when a server becomes unavailable. It relies on the Maglev
+	  algorithm.
+
+	  If you want to compile it in kernel, say Y. To compile it as a
+	  module, choose M here. If unsure, say N.
+
 config IP_VS_SED
 	tristate "shortest expected delay scheduling"
 	---help---
diff --git a/net/netfilter/ipvs/Makefile b/net/netfilter/ipvs/Makefile
index c552993fa4b9..7d0badf86dfe 100644
--- a/net/netfilter/ipvs/Makefile
+++ b/net/netfilter/ipvs/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_IP_VS_LBLC) += ip_vs_lblc.o
 obj-$(CONFIG_IP_VS_LBLCR) += ip_vs_lblcr.o
 obj-$(CONFIG_IP_VS_DH) += ip_vs_dh.o
 obj-$(CONFIG_IP_VS_SH) += ip_vs_sh.o
+obj-$(CONFIG_IP_VS_CSH) += ip_vs_csh.o
 obj-$(CONFIG_IP_VS_SED) += ip_vs_sed.o
 obj-$(CONFIG_IP_VS_NQ) += ip_vs_nq.o
 
diff --git a/net/netfilter/ipvs/ip_vs_csh.c b/net/netfilter/ipvs/ip_vs_csh.c
new file mode 100644
index ..c70db61ba934
--- /dev/null
+++ b/net/netfilter/ipvs/ip_vs_csh.c
@@ -0,0 +1,339 @@
+/*
+ * IPVS:        Consistent Hashing scheduling module using Google's Maglev
+ *
+ * Authors:     Vincent Bernat
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under