Re: [PATCH RFC/RFT net-next 00/17] net: Convert neighbor tables to per-namespace

2018-07-25 Thread Eric W. Biederman
David Ahern  writes:

> On 7/25/18 11:38 AM, Eric W. Biederman wrote:
>> 
>> Absolutely NOT.  Global thresholds are exactly correct given the fact
>> you are running on a single kernel.
>> 
>> Memory is not free (Even though we are swimming in enough of it memory
>> rarely matters).  One of the few remaining challenges is for containers
>> is finding was to limit resources in such a way that one application
>> does not mess things up for another container during ordinary usage.
>> 
>> It looks like the neighbour tables absolutely are that kind of problem,
>> because the artificial limits are too strict.   Completely giving up on
>> limits does not seem right approach either.  We need to fix the limits
>> we have (perhaps making them go away entirely), not just apply a
>> band-aid.  Let's get to the bottom of this and make the system better.
>
> Eric: yes, they all share the global resource of memory and there should
> be limits on how many entries a remote entity can create.
>
> Network namespaces can provide a separation such that one namespace does
> not disrupt networking in another. It is absolutely appropriate to do
> so. Your rigid stance is inconsistent given the basic meaning of a
> network namespace and the parallels to this same problem -- bridges,
> vxlans, and ip fragments. Only neighbor tables are not per-device or per
> namespace; your insistence on global limits is missing the mark and wrong.

That is not what I said.  Let me rephrase and see if you understand.

The problem appears to be of lots of devices.  Fundamentally if you use
lots of network devices today unless you adjust gc_thresh3 you will run
out of neighbour table entries.

The problem has a bigger scope than what you are looking at.

If you fix the core problem you won't see the problem in the context
of network namespaces either.

Default limits should be something that will never be hit unless
something goes crazy.  We are hitting them.  Therefore by definition
there is a bug in these limits.


And yes there is absolutely a place for global limits on things like
inodes, file descriptors etc, that does not care about which part of the
kernel you are in.  However hitting those limits in normal operation is
a bug.

We have ourselves a bug.

Eric

p.s. I wrote the definition of network namespaces and it absolutely does
have room for global limits.   One of the things Linus has periodically
yelled at me about is that there are not enough of them.




Re: [PATCH RFC/RFT net-next 00/17] net: Convert neighbor tables to per-namespace

2018-07-25 Thread David Ahern
On 7/24/18 11:14 AM, David Miller wrote:
> From: David Ahern 
> Date: Tue, 24 Jul 2018 09:14:01 -0600
> 
>> I get the impression there is no longer a strong resistance against
>> moving the tables to per namespace, but deciding what is the right
>> approach to handle backwards compatibility. Correct? Changing the
>> accounting is inevitably going to be noticeable to some use case(s), but
>> with sysctl settings it is a simple runtime update once the user knows
>> to make the change.
>>
>> neighbor entries round up to 512 byte allocations, so with the current
>> gc_thresh defaults (128/512/1024) 512k can be consumed. Using those
>> limits per namespace seems high which is why I suggested a per-namespace
>> default of (16/32/64) which amounts to 32k per namespace limit by
>> default. Open to other suggestions as well.
> 
> No objection from me about going to per-ns neigh tables.
> 
> About the defaults, I wonder if we can scale them to the amount of
> memory given to the ns or something like that?  I bet this will better
> match the intended use of the ns.
> 

Not sure how to do that. I am not aware of memory allocations to a
network namespace. As I understand it containers use cgroups to control
memory use, but I am not aware of any direct ties to namespace.


Re: [PATCH RFC/RFT net-next 00/17] net: Convert neighbor tables to per-namespace

2018-07-25 Thread David Ahern
On 7/25/18 11:38 AM, Eric W. Biederman wrote:
> 
> Absolutely NOT.  Global thresholds are exactly correct given the fact
> you are running on a single kernel.
> 
> Memory is not free (Even though we are swimming in enough of it memory
> rarely matters).  One of the few remaining challenges is for containers
> is finding was to limit resources in such a way that one application
> does not mess things up for another container during ordinary usage.
> 
> It looks like the neighbour tables absolutely are that kind of problem,
> because the artificial limits are too strict.   Completely giving up on
> limits does not seem right approach either.  We need to fix the limits
> we have (perhaps making them go away entirely), not just apply a
> band-aid.  Let's get to the bottom of this and make the system better.

Eric: yes, they all share the global resource of memory and there should
be limits on how many entries a remote entity can create.

Network namespaces can provide a separation such that one namespace does
not disrupt networking in another. It is absolutely appropriate to do
so. Your rigid stance is inconsistent given the basic meaning of a
network namespace and the parallels to this same problem -- bridges,
vxlans, and ip fragments. Only neighbor tables are not per-device or per
namespace; your insistence on global limits is missing the mark and wrong.


Re: [PATCH RFC/RFT net-next 00/17] net: Convert neighbor tables to per-namespace

2018-07-25 Thread Eric W. Biederman
David Ahern  writes:

> On 7/25/18 6:33 AM, Eric W. Biederman wrote:
>> Cong Wang  writes:
>> 
>>> On Tue, Jul 24, 2018 at 8:14 AM David Ahern  wrote:

 On 7/19/18 11:12 AM, Cong Wang wrote:
> On Thu, Jul 19, 2018 at 9:16 AM David Ahern  wrote:
>>
>> Chatting with Nikolay about this and he brought up a good corollary - ip
>> fragmentation. It really is a similar problem in that memory is consumed
>> as a result of packets received from an external entity. The ipfrag
>> sysctls are per namespace with a limit that non-init_net namespaces can
>> not set high_thresh > the current value of init_net. Potential memory
>> consumed by fragments scales with the number of namespaces which is the
>> primary concern with making neighbor tables per namespace.
>
> Nothing new, already discussed:
> https://marc.info/?l=linux-netdev=140391416215988=2
>
> :)
>

 Neighbor tables, bridge fdbs, vxlan fdbs and ip fragments all consume
 local memory resources due to received packets. bridge and vxlan fdb's
 are fairly straightforward analogs to neighbor entries; they are per
 device with no limits on the number of entries. Fragments have memory
 limits per namespace. So neighbor tables are the only ones with this
 strict limitation and concern on memory consumption.

 I get the impression there is no longer a strong resistance against
 moving the tables to per namespace, but deciding what is the right
 approach to handle backwards compatibility. Correct? Changing the
 accounting is inevitably going to be noticeable to some use case(s), but
 with sysctl settings it is a simple runtime update once the user knows
 to make the change.
>>>
>>> This question definitely should go to Eric Biederman who was against
>>> my proposal.
>>>
>>> Let's add Eric into CC.
>> 
>> Given that the entries are per device and the devices are per-namespace,
>> semantically neighbours are already kept in a per-namespace manner.  So
>> this is all about making the code not honoring global resource limits.
>> Making the code not honor gc_thresh3.
>> 
>> Skimming through the code today the default for gc_thresh3 is 1024.
>> Which means that we limit the neighbour tables to 1024 entries per
>> protocol type.
>> 
>> There are some pretty compelling reasons especially with ipv4 to keep
>> the subnet size down.  Arp storms are a real thing.
>> 
>> I don't know off the top of my head what the reasons for limiting the
>> neighbour table sizes.  I would be much more comfortable with a patchset
>> like this if we did some research and figured out the reasons why
>> we have a global limit.  Then changed the code to remove those limits.
>> 
>> When the limits are gone.  When the code can support large subnets
>> without tuning.  We we don't have to worry about someone scanning an all
>> addresses in an ipv6 subnet and causing a DOS on working machines.
>> I think it is completely appropriate to look to see if something per
>> network namespace needs to happen.
>> 
>> So please let's address the limits, not the fact that some specific
>> corner case ran into them.
>> 
>> If we are going to neuter gc_thresh3 let's go as far as removing it
>> entirely.  If we are going to make the neighbour table per something
>> let's make it per network device.  If we can afford the multiple hash
>> tables then a hash table per device is better.   Perhaps we want to move
>> to rhash tables while we look at this, instead of an old hand grown
>> version of resizable hash table.
>
> Given the uses cases with increasing number of devices (> 10,000),
> per-device tables will have more problems than per namespace - in
> reference to your concern in the last paragraph below.
>
>> 
>> Unless I misread something all your patchset did is reshuffle code and
>> data structures so that gc_thresh3 does not apply accross namespaces.
>> That does not feel like it really fixes anything.  That just lies to
>> people.
>
> This patch set fixes the lie that network namespaces provide complete
> isolation when in fact one namespace can evict neighbor entries from
> another. An arp storm you are concerned about in one namespace impacts
> all containers.

Network namespaces can not provide complete isolation.  They share the
same kernel and they do not dedicate resources to each other.
Namespaces in general are about the names.  They are about sharing a
machine efficiently.

I humbly suggest that anyone who wants ``complete'' isolation to use vm
at the very least.

I do think the limits on the neighbour table are quite likely too
strict.  We should be able to relax them and continue to have a
networking stack that works for everyone.

> It starts by removing the proliferation of open coded references to
> arp_tbl and nd_tbl, moving them behind the existing neigh_find_table.
> From there (patches 14-16) it makes the tables per-namespace and hence
> makes the gc_thresh parameters which 

Re: [PATCH RFC/RFT net-next 00/17] net: Convert neighbor tables to per-namespace

2018-07-25 Thread David Ahern
On 7/25/18 6:33 AM, Eric W. Biederman wrote:
> Cong Wang  writes:
> 
>> On Tue, Jul 24, 2018 at 8:14 AM David Ahern  wrote:
>>>
>>> On 7/19/18 11:12 AM, Cong Wang wrote:
 On Thu, Jul 19, 2018 at 9:16 AM David Ahern  wrote:
>
> Chatting with Nikolay about this and he brought up a good corollary - ip
> fragmentation. It really is a similar problem in that memory is consumed
> as a result of packets received from an external entity. The ipfrag
> sysctls are per namespace with a limit that non-init_net namespaces can
> not set high_thresh > the current value of init_net. Potential memory
> consumed by fragments scales with the number of namespaces which is the
> primary concern with making neighbor tables per namespace.

 Nothing new, already discussed:
 https://marc.info/?l=linux-netdev=140391416215988=2

 :)

>>>
>>> Neighbor tables, bridge fdbs, vxlan fdbs and ip fragments all consume
>>> local memory resources due to received packets. bridge and vxlan fdb's
>>> are fairly straightforward analogs to neighbor entries; they are per
>>> device with no limits on the number of entries. Fragments have memory
>>> limits per namespace. So neighbor tables are the only ones with this
>>> strict limitation and concern on memory consumption.
>>>
>>> I get the impression there is no longer a strong resistance against
>>> moving the tables to per namespace, but deciding what is the right
>>> approach to handle backwards compatibility. Correct? Changing the
>>> accounting is inevitably going to be noticeable to some use case(s), but
>>> with sysctl settings it is a simple runtime update once the user knows
>>> to make the change.
>>
>> This question definitely should go to Eric Biederman who was against
>> my proposal.
>>
>> Let's add Eric into CC.
> 
> Given that the entries are per device and the devices are per-namespace,
> semantically neighbours are already kept in a per-namespace manner.  So
> this is all about making the code not honoring global resource limits.
> Making the code not honor gc_thresh3.
> 
> Skimming through the code today the default for gc_thresh3 is 1024.
> Which means that we limit the neighbour tables to 1024 entries per
> protocol type.
> 
> There are some pretty compelling reasons especially with ipv4 to keep
> the subnet size down.  Arp storms are a real thing.
> 
> I don't know off the top of my head what the reasons for limiting the
> neighbour table sizes.  I would be much more comfortable with a patchset
> like this if we did some research and figured out the reasons why
> we have a global limit.  Then changed the code to remove those limits.
> 
> When the limits are gone.  When the code can support large subnets
> without tuning.  We we don't have to worry about someone scanning an all
> addresses in an ipv6 subnet and causing a DOS on working machines.
> I think it is completely appropriate to look to see if something per
> network namespace needs to happen.
> 
> So please let's address the limits, not the fact that some specific
> corner case ran into them.
> 
> If we are going to neuter gc_thresh3 let's go as far as removing it
> entirely.  If we are going to make the neighbour table per something
> let's make it per network device.  If we can afford the multiple hash
> tables then a hash table per device is better.   Perhaps we want to move
> to rhash tables while we look at this, instead of an old hand grown
> version of resizable hash table.

Given the uses cases with increasing number of devices (> 10,000),
per-device tables will have more problems than per namespace - in
reference to your concern in the last paragraph below.

> 
> Unless I misread something all your patchset did is reshuffle code and
> data structures so that gc_thresh3 does not apply accross namespaces.
> That does not feel like it really fixes anything.  That just lies to
> people.

This patch set fixes the lie that network namespaces provide complete
isolation when in fact one namespace can evict neighbor entries from
another. An arp storm you are concerned about in one namespace impacts
all containers.

It starts by removing the proliferation of open coded references to
arp_tbl and nd_tbl, moving them behind the existing neigh_find_table.
>From there (patches 14-16) it makes the tables per-namespace and hence
makes the gc_thresh parameters which are per-table now
per-table-per-namespace.

So it removes the global thresholds because the global ones are just
wrong given the meaning of a network namespace and provides the more
appropriate per-namespace limits.

> 
> Further unless I misread something you are increasing the number of
> timers to 3 per namespace.  If I create create a thousand network
> namespaces that feels like it will hurt system performance overall.

It seems to me the timers are per neighbor entry not table. The per
table ones are for proxies.


Re: [PATCH RFC/RFT net-next 00/17] net: Convert neighbor tables to per-namespace

2018-07-25 Thread Eric W. Biederman
Cong Wang  writes:

> On Tue, Jul 24, 2018 at 8:14 AM David Ahern  wrote:
>>
>> On 7/19/18 11:12 AM, Cong Wang wrote:
>> > On Thu, Jul 19, 2018 at 9:16 AM David Ahern  wrote:
>> >>
>> >> Chatting with Nikolay about this and he brought up a good corollary - ip
>> >> fragmentation. It really is a similar problem in that memory is consumed
>> >> as a result of packets received from an external entity. The ipfrag
>> >> sysctls are per namespace with a limit that non-init_net namespaces can
>> >> not set high_thresh > the current value of init_net. Potential memory
>> >> consumed by fragments scales with the number of namespaces which is the
>> >> primary concern with making neighbor tables per namespace.
>> >
>> > Nothing new, already discussed:
>> > https://marc.info/?l=linux-netdev=140391416215988=2
>> >
>> > :)
>> >
>>
>> Neighbor tables, bridge fdbs, vxlan fdbs and ip fragments all consume
>> local memory resources due to received packets. bridge and vxlan fdb's
>> are fairly straightforward analogs to neighbor entries; they are per
>> device with no limits on the number of entries. Fragments have memory
>> limits per namespace. So neighbor tables are the only ones with this
>> strict limitation and concern on memory consumption.
>>
>> I get the impression there is no longer a strong resistance against
>> moving the tables to per namespace, but deciding what is the right
>> approach to handle backwards compatibility. Correct? Changing the
>> accounting is inevitably going to be noticeable to some use case(s), but
>> with sysctl settings it is a simple runtime update once the user knows
>> to make the change.
>
> This question definitely should go to Eric Biederman who was against
> my proposal.
>
> Let's add Eric into CC.

Given that the entries are per device and the devices are per-namespace,
semantically neighbours are already kept in a per-namespace manner.  So
this is all about making the code not honoring global resource limits.
Making the code not honor gc_thresh3.

Skimming through the code today the default for gc_thresh3 is 1024.
Which means that we limit the neighbour tables to 1024 entries per
protocol type.

There are some pretty compelling reasons especially with ipv4 to keep
the subnet size down.  Arp storms are a real thing.

I don't know off the top of my head what the reasons for limiting the
neighbour table sizes.  I would be much more comfortable with a patchset
like this if we did some research and figured out the reasons why
we have a global limit.  Then changed the code to remove those limits.

When the limits are gone.  When the code can support large subnets
without tuning.  We we don't have to worry about someone scanning an all
addresses in an ipv6 subnet and causing a DOS on working machines.
I think it is completely appropriate to look to see if something per
network namespace needs to happen.

So please let's address the limits, not the fact that some specific
corner case ran into them.

If we are going to neuter gc_thresh3 let's go as far as removing it
entirely.  If we are going to make the neighbour table per something
let's make it per network device.  If we can afford the multiple hash
tables then a hash table per device is better.   Perhaps we want to move
to rhash tables while we look at this, instead of an old hand grown
version of resizable hash table.

Unless I misread something all your patchset did is reshuffle code and
data structures so that gc_thresh3 does not apply accross namespaces.
That does not feel like it really fixes anything.  That just lies to
people.

Further unless I misread something you are increasing the number of
timers to 3 per namespace.  If I create create a thousand network
namespaces that feels like it will hurt system performance overall.

Eric




Re: [PATCH RFC/RFT net-next 00/17] net: Convert neighbor tables to per-namespace

2018-07-24 Thread David Miller
From: David Ahern 
Date: Tue, 24 Jul 2018 09:14:01 -0600

> I get the impression there is no longer a strong resistance against
> moving the tables to per namespace, but deciding what is the right
> approach to handle backwards compatibility. Correct? Changing the
> accounting is inevitably going to be noticeable to some use case(s), but
> with sysctl settings it is a simple runtime update once the user knows
> to make the change.
> 
> neighbor entries round up to 512 byte allocations, so with the current
> gc_thresh defaults (128/512/1024) 512k can be consumed. Using those
> limits per namespace seems high which is why I suggested a per-namespace
> default of (16/32/64) which amounts to 32k per namespace limit by
> default. Open to other suggestions as well.

No objection from me about going to per-ns neigh tables.

About the defaults, I wonder if we can scale them to the amount of
memory given to the ns or something like that?  I bet this will better
match the intended use of the ns.


Re: [PATCH RFC/RFT net-next 00/17] net: Convert neighbor tables to per-namespace

2018-07-17 Thread Cong Wang
On Tue, Jul 17, 2018 at 10:43 AM David Ahern  wrote:
>
> On 7/17/18 11:40 AM, Cong Wang wrote:
> > On Tue, Jul 17, 2018 at 5:11 AM  wrote:
> >>
> >> From: David Ahern 
> >>
> >> Nikita Leshenko reported that neighbor entries in one namespace can
> >> evict neighbor entries in another. The problem is that the neighbor
> >> tables have entries across all namespaces without separate accounting
> >> and with global limits on when to scan for entries to evict.
> >
> > It is nothing new, people including me already noticed this before.
> >
> >
> >>
> >> Resolve by making the neighbor tables for ipv4, ipv6 and decnet per
> >> namespace and making the accounting and threshold limits per namespace.
> >
> >
> > The last discussion about this a long time ago concluded that neigh
> > table entries are controllable by remote, so after moving it to per netns,
> > it would be easier to DOS the host.
> >
>
> There are still limits on the total number of entries and with
> per-namespace limits an admin has better control.

Per-netns limit is *exactly* the problem here.

Quote from David Miller:
"
From: ebied...@xmission.com (Eric W. Biederman)
Date: Wed, 25 Jun 2014 18:17:08 -0700

> I disagree that removing a global DOS prevention check is a benefit.
> Certainly large semantics changes like that should not happen without
> being discussed in the patch description.

Agreed, this is the most important core issue.

If we just make these things per netns, then as a result if you create
N namespaces we will allow N times more neighbour entries to be
sitting in the system at once.

Actually, I'm really surprised the limits get hit and this actually
causes problems.
"

You can see the original discussion here:
https://marc.info/?l=linux-netdev=140356141019653=2