Re: DPDK and energy efficiency

2021-02-23 Thread Etienne-Victor Depasquale
Hello Robert,

Your statement that DPDK “keeps utilization at 100% regardless of packet
> activity” is just not correct.  You further presuppose "widespread DPDK's
> core operating inefficiency” without any data to back up the operating
> inefficiency assertion.
>

Your statement is incorrect.
I have provided references (please see my earlier e-mails) that investigate
the operation of DPDK.
These references are peer-reviewed research into a
perceived problem with the deployment of DPDK.
If the power consumption incurred while running DPDK were a corner case,
then there would be little to no research value in investigating such
behavior.

Please don’t mislead the community into believing that DPDK == power bad
>
I have to object to this statement: whether you intended it or not, it
implies malice or, at best, amateurish behaviour.

Everything following is informational.  Stop here if so inclined.
>
Please stop delving into the detail of DPDK's facilities without regard
for your logical omission:
whether or not those facilities are available, DPDK's deployment profile
(meaning: how it is being used in general), as indicated by the references
I've provided,
is leading to high power inefficiency on cores partitioned to the data
plane.

The takeaway is that DPDK (and similar) doesn’t guarantee runaway power
> bills.
>
Of course it doesn't.
Even the second question of that bare-bones survey tried to communicate
this much.

If you have questions, I’d be happy to discuss off line
>
I would be happy to answer your objections in detail off line too.
Just let me know.

Cheers,

Etienne


On Wed, Feb 24, 2021 at 12:12 AM Robert Bays  wrote:

> Hi Etienne,
>
> Your statement that DPDK “keeps utilization at 100% regardless of packet
> activity” is just not correct.  You further presuppose "widespread DPDK's
> core operating inefficiency” without any data to back up the operating
> inefficiency assertion.  Your statements, taken at face value, lead people to
> believe that if a project uses DPDK it’s going to increase their power
> costs.  And that’s just not the case.  Please don’t mislead the community
> into believing that DPDK == power bad.
>
> Everything following is informational.  Stop here if so inclined.
>
> DPDK does not dictate CPU utilization or power consumption, the
> application leveraging DPDK does.  It’s the application that decides how to
> poll packets.  If an application implements DPDK using only a tight polling
> loop, then it will keep CPU cores that are running DPDK threads at 100%.
> But only the most simple and/or bespoke (think trading) applications are
> implemented this way.  You don’t need tight polling all the time to get the
> performance gains provided by DPDK or similar environments.  The vast
> majority of applications that this audience would actually install in their
> networks do not do tight polling all the time and therefore don’t consume
> 100% of the CPU all the time.   An interesting, helpful research effort you
> could lead would be to survey the ecosystem to catalog those applications
> that do fall into the power hungry category and help them to change their
> code.
>
> Intel DPDK application development guidelines don’t pre-suppose tight
> polling all the time and offer at least two methods for optimizing power
> against throughput.  The older method is to use adaptive polling;
> increasing the polling frequency as traffic load increases.  This keeps cpu
> utilization low when packet load is light and increases it as traffic
> levels warrant.  The second method is to use P-states and/or C-states to
> put the processor into lower power modes when traffic loads are lighter.
> We have found that adaptive polling works better across a larger pool of
> hardware types, and therefore that is what DANOS uses, amongst other
> things.
>
> Further, performance and power consumption are dictated by a multivariate
> set of application decisions including: design patterns such as single
> thread run to completion models vs. passing mbufs between multiple threads,
> buffer sizes and cache management algorithms, combining and/or separating
> tx/rx threads, binding threads to specific lcores, reserved cores for DPDK
> threads, hyperthreading, kernel schedulers, hypervisor schedulers,
> interface drivers, etc.  All of these are application specific, not DPDK
> generic.  Well written applications that leverage DPDK provide knobs for
> the user to tune these settings for their specific environment and use
> case.  None of this is unique to DPDK.  Solution designs were cribbed from
> previous technologies.
>
> The takeaway is that DPDK (and similar) doesn’t guarantee runaway power
> bills.  Power consumption is dictated by the application.  Look for well
> behaved applications and everything will be alright.
>
> If you have questions, I’d be happy to discuss off line.
>
> Thanks,
> Robert.
>
>
> > On Feb 22, 2021, at 11:27 PM, Etienne-Victor Depasquale wrote:

Re: CGNAT

2021-02-23 Thread JORDI PALET MARTINEZ via NANOG
I did this "economics" exercise for a customer with 25,000,000 subscribers
(DSL, GPON and cellular). Even updating/replacing the CPEs, the cost of 464XLAT
deployment was cheaper than CGN or anything else.

Also, if you consider the cost of buying more IPv4 addresses instead of
investing that money in CGN: once you avoid the CGN troubles (such as your
IPv4 addresses being blacklisted by Sony and others, and the consequent
operation/management expenses to rotate IPv4 addresses in the CGN, resolve
customer problems, etc.), it becomes cheaper than CGN boxes.

It's easy to predict that you will buy CGN now and tomorrow you will need to
buy some new IPv4 addresses because of that blacklisting.

Regards,
Jordi
@jordipalet
 
 

El 24/2/21 3:13, "NANOG en nombre de Owen DeLong via NANOG" 
 escribió:



> On Feb 22, 2021, at 6:44 AM, na...@jima.us wrote:
> 
> While I don't doubt the accuracy of Lee's presentation at the time, at 
least two base factors have changed since then:
> 
> - Greater deployment of IPv6 content (necessitating less CGN capacity per 
user)

This is only true if the ISP in question is implementing IPv6 alongside 
their CGN deployment and only if they get a significant uptake of IPv6 
capability by their end users.

> - Increased price of Legacy IP space on the secondary market (changing 
the formula) -- strictly speaking, this presentation was still in "primary 
market" era for LACNIC/ARIN/AFRINIC

While that’s true, even at current prices, IPv4 addresses are cheaper to 
buy and/or lease than CGN.

> IPv6 migration is not generally aided by CGNAT, but CGNAT deployment is 
generally aided by IPv6 deployment; to reiterate the earlier point, any ISPs 
deploying CGNAT without first deploying IPv6 are burning cash.

Yep.

I still think that implementing CGN is a good way to burn cash vs. the 
alternatives, but YMMV.

Owen

> 
> - Jima
> 
> From: NANOG On Behalf Of Owen DeLong
> Sent: Sunday, February 21, 2021 16:59
> To: Steve Saner
> Cc: nanog@nanog.org
> Subject: Re: CGNAT
> 
> 
> On Feb 18, 2021, at 8:38 AM, Steve Saner wrote:
> 
>> We are starting to look at CGNAT solutions. The primary motivation at 
the moment is to extend current IPv4 resources, but IPv6 migration is also a 
factor.
> 
> IPv6 Migration is generally not aided by CGNAT.
> 
> In general, the economics today still work out to make purchasing or 
leasing addresses more favorable than CGNAT.
> 
> It’s a bit dated by now, but still very relevant, see Lee Howard’s 
excellent research presented at the 2012 Rocky
> mountain v6 task force meeting:
> 
> https://www.rmv6tf.org/wp-content/uploads/2012/11/TCO-of-CGN1.pdf
> 
> Owen
> 
> 
> We've been in touch with A10. Just wondering if there are some 
alternative vendors that anyone would recommend. We'd probably be looking at a 
solution to support 5k to 15k customers and bandwidth up to around 30-40 gig as 
a starting point. A solution that is as transparent to user experience as 
possible is a priority.
> 
> Thanks
> 
> -- 
> Steve Saner
> ideatek HUMAN AT OUR VERY FIBER
> 









Re: Famous operational issues

2021-02-23 Thread Valdis Klētnieks
On Tue, 23 Feb 2021 20:46:38 -0800, Randy Bush said:
> maybe late '60s or so, we had a few 2314 dasd monsters[0].  think maybe
> 4m x 2m with 9 drives with removable disk packs.
>
> a grave shift operator gets errors on a drive and wonders if maybe they
> swap it into another spindle.  no luck, so swapped those two drives with
> two others.  one more iteration, and they had wiped out the entire
> array.  at that point they called me; so i missed the really creative
> part.

I suspect every S/360 site that had 2314's had an operator who did that, as I
was witness to the same thing.  For at least a decade after that debacle, the
Manager of Operations was awarding Gold, Silver, and Bronze Danny awards for
operational screw-ups. (The 2314 event was the sole Platinum Danny :)

And yes, on IBM 4341 consoles it was all too easy to hit the EPO button on the
keyboard; we got guards for the consoles after one of our operators nailed the
button a second time in a month.

And to tie the S/360 and 4341 together - we were one of the last sites that was
still running an S/360 Mod 65J.  And plans came through for a new server room
on the top floor of a new building.  Architect comes through, measures the S/360
and all the peripherals for floorspace and power/cooling - and the CPU, plus
*4* meg of memory, and 3 strings of 2314 drives chewed a lot of both.

Construction starts.   Meanwhile, IBM announces the 4341, and offers us a real
sweetheart deal because even at the high maintenance charges we were paying,
IBM was losing money. Something insane like the system and peripherals and
first 3 years of maintenance, for less than the old system per-year
maintenance. Oh, and the power requirements are like 10% of the 360s.

So we take delivery of the new system and it's looking pitiful, just one box
and 2 small strings of disk in 10K square feet.  Lots of empty space. Do all
the migrations to the new system over the summer, and life is good.   Until
fall and winter arrive, and we discover there is zero heat in the room, and the
ceiling is uninsulated, and it's below zero outside because this is way upstate
NY.  And if there was a 360 in the room, it would *still* be needing cooling
rather than heating. But it's a 4341 that's shedding only 10% of the heat...

Finally, one February morning, the 4341 throws a thermal check. Air was too
cold at the intakes.  Our IBM CE did a double-take because he'd been doing IBM
mainframes for 3 decades and had never seen a thermal check for too cold
before.

Lots of legal action threatened against the architect, who simply said "If you
had *told* me that the system was being replaced, I'd have put heat in the
room". A settlement was reached, revised plans were drawn up, there was a whole
mess of construction to get ductwork and insulation and other stuff into place,
and life was good for the decade or so before I left for a better gig.




Re: Famous operational issues

2021-02-23 Thread Randy Bush
maybe late '60s or so, we had a few 2314 dasd monsters[0].  think maybe
4m x 2m with 9 drives with removable disk packs.

a grave shift operator gets errors on a drive and wonders if maybe they
swap it into another spindle.  no luck, so swapped those two drives with
two others.  one more iteration, and they had wiped out the entire
array.  at that point they called me; so i missed the really creative
part.

[0] https://www.ibm.com/ibm/history/exhibits/storage/storage_2314.html

randy

---
ra...@psg.com
`gpg --locate-external-keys --auto-key-locate wkd ra...@psg.com`
signatures are back, thanks to dmarc header mangling


Re: Famous operational issues

2021-02-23 Thread Adam Kennedy via NANOG
While we're talking about raid types...

A few acquisitions ago, between 2006-2010, I worked at a Wireless ISP in
Northern Indiana. Our CEO decided to sell Internet service to school
systems because the e-rate funding was too much to resist. He had the idea
to install towers on the schools and sell service off that while paying the
school for roof rights. About two years into the endeavor, I wake up one
morning and walk to my car. Two FBI agents get out of an unmarked towncar.
About an hour later, they let me go to the office where I found an entire
barrage of FBI agents. It was a full raid and not the kind you want to see.
Hard drives were involved and being made redundant, but the redundant
copies were labeled and placed into boxes that were carried out to SUVs
that were as dark as the morning coffee these guys drank. There were a lot
of drives, all of our servers were in our server room at the office. There
were roughly five or six racks of varying amounts of equipment in each.

After some questioning and assisting them in their cataloging adventure,
the agents left us with a ton of questions and just enough equipment to
keep the customers connected. CEO became extremely paranoid at this point.
He told us to prepare to move servers to a different building. He went into
a tailspin trying to figure out where he could hide the servers to keep
things going without the bank or FBI seizing the assets. He was extremely
worried the bank would close the office down. We started moving all network
routing around to avoid using the office as our primary DIA.

One morning I get into the office and we hear the words we've been
dreading: "We're moving the servers". The plan was to move them to a tower
site that had a decent-sized shack on site. Connectivity was decent, we had
a licensed 11GHz microwave backhaul capable of about 155mbps. The site was
part of the old MCI microwave long-distance network in the 80s and 90s. It
had redundant air conditioners, a large propane tank, and a generator
capable of keeping the site alive for about three days. We were told not to
notify any customers, which became problematic because two customers had
servers colocated in our building. We consolidated the servers into three
racks and managed to get things prepared with a decent UPS in each rack.
CEO decided to move the servers at nightfall to "avoid suspicion". Our
office was in an unsavory part of town, moving anything at night was
suspicious. So, under the cover of half-ass darkness, we loaded the racks
onto a flatbed truck and drove them 20 minutes to the tower. While we
unloaded the racks, an electrician we knew was wiring up the L5-20 outlets
for the UPS in each rack. We got the racks plugged in, servers powered up,
and then the two customers came that had colocated equipment. They got
their equipment powered up and all seemed ok.

Back at the office the next day we were told to gather our workstations and
start working from home. I've been working from home ever since and quite
enjoy it, but that's beside the point.

Summer starts and I tell the CEO we need to repair the AC units because
they are failing. He ignores it, claiming he doesn't want to lose money the
bank could take at any minute. About a month later, a nice hot summer day
rolls in and the AC units both die. I stumble upon an old portable AC unit
and put that at the site. Temperatures rise to 140F ambient. Server
overheat alarms start going off, things start failing. Our colocation
customers are extremely upset. They pull their servers and drop service.
The heat subsides, CEO finally pays to repair one of the AC units.

Eventually, the company declares bankruptcy and goes into liquidation.
Luckily another WISP catches wind of it, buys the customers and assets, and
hires me. My happiest day that year was moving all the servers into a
better-suited home, a real data center. I don't know what happened to the
CEO, but I know that I'll never trust anything he has his hands in ever
again.

Adam Kennedy
Systems Engineer
adamkenn...@watchcomm.net | 800-589-3837 x120
Watch Communications | www.watchcomm.net

3225 W Elm St, Suite A
Lima, OH 45805





On Tue, Feb 23, 2021 at 8:55 PM brutal8z via NANOG  wrote:

> My war story.
>
> At one of our major POPs in DC we had a row of 7513's, and one of them had
> intermittent problems. I had replaced every piece of removable card/part in
> it over time, and it kept failing. Even the vendor flew in a team to the
> site to try to figure out what was wrong. It was finally decided to replace
> the whole router (about 200lbs?). Being the local field tech, that was my
> job. On the night of the maintenance at 3am, the work started. I switched
> off the rack power, which included a 2511 terminal 

Re: CGNAT

2021-02-23 Thread Owen DeLong via NANOG



> On Feb 22, 2021, at 6:44 AM, na...@jima.us wrote:
> 
> While I don't doubt the accuracy of Lee's presentation at the time, at least 
> two base factors have changed since then:
> 
> - Greater deployment of IPv6 content (necessitating less CGN capacity per 
> user)

This is only true if the ISP in question is implementing IPv6 alongside their 
CGN deployment and only if they get a significant uptake of IPv6 capability by 
their end users.

> - Increased price of Legacy IP space on the secondary market (changing the 
> formula) -- strictly speaking, this presentation was still in "primary 
> market" era for LACNIC/ARIN/AFRINIC

While that’s true, even at current prices, IPv4 addresses are cheaper to buy 
and/or lease than CGN.

> IPv6 migration is not generally aided by CGNAT, but CGNAT deployment is 
> generally aided by IPv6 deployment; to reiterate the earlier point, any ISPs 
> deploying CGNAT without first deploying IPv6 are burning cash.

Yep.

I still think that implementing CGN is a good way to burn cash vs. the 
alternatives, but YMMV.

Owen

> 
> - Jima
> 
> From: NANOG On Behalf Of Owen DeLong
> Sent: Sunday, February 21, 2021 16:59
> To: Steve Saner
> Cc: nanog@nanog.org
> Subject: Re: CGNAT
> 
> 
> On Feb 18, 2021, at 8:38 AM, Steve Saner wrote:
> 
>> We are starting to look at CGNAT solutions. The primary motivation at the 
>> moment is to extend current IPv4 resources, but IPv6 migration is also a 
>> factor.
> 
> IPv6 Migration is generally not aided by CGNAT.
> 
> In general, the economics today still work out to make purchasing or leasing 
> addresses more favorable than CGNAT.
> 
> It’s a bit dated by now, but still very relevant, see Lee Howard’s excellent 
> research presented at the 2012 Rocky
> mountain v6 task force meeting:
> 
> https://www.rmv6tf.org/wp-content/uploads/2012/11/TCO-of-CGN1.pdf
> 
> Owen
> 
> 
> We've been in touch with A10. Just wondering if there are some alternative 
> vendors that anyone would recommend. We'd probably be looking at a solution 
> to support 5k to 15k customers and bandwidth up to around 30-40 gig as a 
> starting point. A solution that is as transparent to user experience as 
> possible is a priority.
> 
> Thanks
> 
> -- 
> Steve Saner
> ideatek HUMAN AT OUR VERY FIBER
> 



Re: Famous operational issues

2021-02-23 Thread brutal8z via NANOG
My war story.

At one of our major POPs in DC we had a row of 7513's, and one of them
had intermittent problems. I had replaced every piece of removable
card/part in it over time, and it kept failing. Even the vendor flew in
a team to the site to try to figure out what was wrong. It was finally
decided to replace the whole router (about 200lbs?). Being the local
field tech, that was my job. On the night of the maintenance at 3am, the
work started. I switched off the rack power, which included a 2511
terminal server that was connected to half the routers in the row and
started to remove the router. A few minutes later I got a text, "You're
taking out the wrong router!" You can imagine the "Damn it, what have I
done?" feeling that runs through your mind and the way your heart stops
for a moment.

Okay, I wasn't taking out the wrong router. But unknown at the time,
terminal servers when turned off, had a nasty habit of sending a break
to all the routers it was connected to, and all those routers
effectively stopped. The remote engineer that was in charge saw the
whole POP go red and assumed I was the cause. I was, but not because of
anything I could have known about. I had to power cycle the downed
routers to bring them back on-line, and then continue with the
maintenance. A disaster to all involved, but the router got replaced.

I gave a very detailed account of my actions in the postmortem. It was
clear they thought I had turned off the wrong rack/router and wasn't being
honest about it. I was adamant that I had done exactly what I said, and even
swore I would fess up if I had erred, and always would, even if it
cost me the job. I rarely made mistakes, if any, so it was an easy thing
for me to say. For the next two weeks, everyone that was aware of the work
gave me the side eye.

About a week after that, the same thing happened to another field tech
in another state. That helped my case. They used my account to figure
out that it was the TS that caused the problem. A few of those who had
questioned me harshly admitted that my account helped them figure out
the cause.

And the worst part of this story? That router, completely replaced,
still had the same intermittent problem as before. It was a DC powered
POP, so they were all wired with the same clean DC power. In the end
they chalked it up to cosmic rays and gave up on it. I believe this
break issue was unique to the DC powered 2511's, and that we were the
first to use them, but I might be wrong on that.


On 2/16/21 2:37 PM, John Kristoff wrote:
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps  the
> most notorious and likely to top many lists including mine.  So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session.  I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective.  I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John



Re: Texas internet connectivity declining due to blackouts

2021-02-23 Thread Rich Kulawiec
On Mon, Feb 22, 2021 at 08:44:32PM +0200, Saku Ytti wrote:
> On Mon, 22 Feb 2021 at 20:28, Rich Kulawiec  wrote:
> 
> > right: artificial sweeteners are safe, WMDs were in Iraq, and Anna Nicole
> 
> Hope you meant to write 'unsafe', as the conspiracy theory is that
> aspartame is unsafe, the science says it is safe.

Those last three points are a quote from a movie -- which is why I included
the shout-out to Levon Helm (warning, spoilers, the quote's at about 2:00):

Shooter: Levon Helm as Mr Rate - YouTube
https://www.youtube.com/watch?v=xVw8FPIvOZc

I included them as a joke because anyone who disputes AGW at this
point *is* a joke and should be laughed out of the room.

Less snarktastically, a very good starting point for people who want to
understand the science of global warming is this document:

Climate Change 2013: The Physical Science Basis
https://www.ipcc.ch/site/assets/uploads/2018/02/WG1AR5_all_final.pdf

It's exhaustive in its coverage (which is why it's 1500+ pages) and reading
it will require basic literacy in math/stat and physical science.  But it's
part of the required homework for anyone trying to understand this topic.
The 2021 version is now in preparation and if things go well, it should be
out mid-to-late summer.

Another highly useful document is:

Fourth National Climate Assessment
https://nca2018.globalchange.gov/downloads/NCA4_2018_FullReport.pdf

which also clocks in at 1500+ pages.  This document has both a broader focus
(for example, it discusses impacts and mitigation) and a narrower focus
(it's US-centric).  It's also written for a more general audience and
requires less math/science background for comprehension, so I recommend
that anybody who struggles to get through the one above try this instead.

Also: this one is arguably more useful for NANOG/operational/planning
purposes.  I think it'd be a good read for anyone who's trying to figure
out what's going to happen to their physical assets/locations, or for
anyone who's trying to plan where to put things and how to build them.

Additional resources:

Climate Change and Infrastructure, Urban Systems, and Vulnerability
Technical Report for the US DoE
Thomas Wilbanks and Steven J. Fernandez

(This one is also useful for NANOG denizens.  Chapter 5 on risk mitigation
strategies is particularly interesting.)

The Encyclopedia of Global Warming and Climate Change, 2ed 
S. George Philander, editor

(A general reference.  Having this handy while reading the IPCC report
I mentioned above can be helpful.)

Atmospheric Thermodynamics - Elementary Physics and Chemistry
Gerald R. North and Tatiana L. Erukhimova

(You'll need integral and differential calculus for this one, and a previous
course in introductory thermodynamics will help.  This is not about climate
-- or weather -- per se, but it provides some of the fundamental science
necessary to understand both.)

Attribution of Extreme Weather Events in the Context of Climate Change
Committee of Extreme Weather Events and Climate Change Attribution

(Attribution science is relatively new but is making rapid progress.  The
ability to look back and demonstrate causal relationships is going to be
invaluable as we look forward.)

rsk


Re: Famous operational issues

2021-02-23 Thread bzs


Anyone remember when DEC delivered a new VMS version (V5 I think)
whose backups didn't work, couldn't be restored?

BU did, the hard way, when the engineering dept's faculty and student
disk failed.

DEC actually paid thousands of dollars for typist services to come and
re-enter whatever was on paper and could be re-entered.

I think that was the day I won the Unix vs VMS wars at BU anyhow.

-- 
-Barry Shein

Software Tool & Die| b...@theworld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD   | 800-THE-WRLD
The World: Since 1989  | A Public Information Utility | *oo*


Re: Famous operational issues

2021-02-23 Thread scott



On 2/23/2021 12:22 PM, Justin Streiner wrote:

An interesting sub-thread to this could be:
Have you ever unintentionally crashed a device by running a perfectly 
innocuous command?

---


There was that time in the late 1990s when I took most of a global
network down several times by typing "show ip bgp regexp " on almost
all of the core routers.  It turned out to be a Cisco bug.  I looked
for a reference, but cannot find one.  Ahh, the earlier days of
the commercial internet...gotta love 'em.

scott


Re: Famous operational issues

2021-02-23 Thread Warren Kumari
On Tue, Feb 23, 2021 at 5:14 PM Justin Streiner  wrote:

> Beyond the widespread outages, I have so many personal war stories that
> it's hard to pick a favorite.
>
> My first job out of college in the mid-late 90s was at an ISP in
> Pittsburgh that I joined pretty early in its existence, and everyone did a
> bit of everything. I was hired to do sysadmin stuff, networking, pretty
> much whatever was needed. About a year after I started, we brought up a new
> mail system with an external RAID enclosure for the mail store itself.  One
> day, we saw indications that one of the disks in the RAID enclosure was
> starting to fail, so I scheduled a maintenance window to replace the disk
> and let the controller rebuild the data and integrate it back into the RAID
> set.  No big worries, right?
>
> It's Tuesday at about 2 AM.
>
> Well, the kernel on the RAID controller itself decided that when I pulled
> the failing drive would be a fine time to panic, and more or less turn
> itself into a bit-blender, and take all the mailstore down with it.  After
> a few hours of watching fsck make no progress on anything, in terms of
> trying to un-fsck the mailstore, we made the decision in consultation with
> the CEO to pull the plug on trying to bring the old RAID enclosure back to
> life, and focus on finding suitable replacement hardware and rebuild from
> scratch.  We also discovered that the most recent backups of the mailstore
> were over a month old :(
>
> I think our CEO ended up driving several hours to procure a suitable
> enclosure.  By the time we got the enclosure installed, filesystems built,
> and got whatever tape backups we had restored, and tested the integrity of
> the system, it was now Thursday around 8 AM. Coincidentally, that was the
> same day the company hosted a big VIP gathering (the mayor was there, along
> with lots of investors and other bigwigs), so I had to come back and put on
> a suit to hobnob with the VIPs after getting a total of 6 hours of sleep in
> about the previous 3 days.  I still don't know how I got home that night
> without wrapping my vehicle around a utility pole (due to being over-tired,
> not due to alcohol).
>
> Many painful lessons learned over that stretch of days, as often the case
> as a company grows from startup mode and builds more robust technology and
> business processes as a consequence of growth.
>

Oh, dear. RAID... that triggered 2 stories.
1: I worked at a small ISP in Westchester, NY. One day I'm doing stuff, and
want to kill process 1742, so I type 'kill -9 1' ... and then, before
pressing enter, I get distracted by our "Cisco AGS+ monitor" (a separate
story). After I get back to my desk I unlock my terminal, and call over a
friend to show just how close I'd gotten to making something go Boom. He
says "Nah, BSD is cleverer than that. I'm sure the kill command has some
check in to stop you killing init.". I disagree. He disagrees. I disagree
again. He calls me stupid. I bet him a soda.
He proves his point by typing 'su; kill -9 1' in the window he's logged
into -- and our primary NFS server (with all of the user sites)
obediently kills off init, and all of the child processes we run over
to the front of the box and hit the power switch, while desperately looking
for a monitor and keyboard to watch it boot.
It does the BIOS checks, and then stops on the RAID controller, complaining
about the fact that there are *2* dead drives, and that the array is now
sad.
This makes no sense. I can understand one drive not recovering from a power
outage, but 2 seems a bit unlikely, especially because the machine hadn't
been beeping or anything like that... We try turning it off and on again a
few times, no change... We pull the machine out of the rack and rip the
cover off.
Sure enough, there is a RAID card - but the piezo-buzzer on it is, for some
reason, wrapped in a bunch of napkins, held in place with electrical tape.
I pull that off, and there is also some paper towel jammed into the hole
in the buzzer, and bits of a broken pencil...

After replacing the drives and starting an rsync restore from a backup
server, we investigate more...
It turns out that a few months ago(!) the machine had started beeping. The
night crew naturally found this annoying, and so they'd gone investigating
and discovered that it was this machine, and lifted the lid while still in
the rack. They traced the annoying noise to this small black thingie, and
poked it until it stopped, thus solving the problem once and for
all... yay!





2: I used to work at a company which was in one of the buildings next to
the twin-towers. For various clever reasons, they had their "datacenter" in
a corner of the office space... anyway, the planes hit, power goes out and
the building is evacuated - luckily no one is injured, but the entire
company/site is down. After a few weeks, my friend Joe is able to arrange
with a fire marshal to get access to the building so he can go and grab the
disks 

Re: Famous operational issues

2021-02-23 Thread Eric Kuhnke
I would be more interested in seeing someone who HASN'T crashed a Cisco
6500/7600, particularly one with a long uptime, by typing in a supposedly
harmless 'show' command.


On Tue, Feb 23, 2021 at 2:26 PM Justin Streiner  wrote:

> An interesting sub-thread to this could be:
>
> Have you ever unintentionally crashed a device by running a perfectly
> innocuous command?
> 1. Crashed a 6500/Sup2 by typing "show ip dhcp binding".
> 2. "clear interface XXX" on a Nexus 7K triggered a cascading/undocument
> Sev1 bug that caused two linecards to crash and reload, and take down about
> two dozen buildings on campus at the .edu where I used to work.
> 3. For those that ever had the misfortune of using early versions of the
> "bcc" command shell* on Bay Networks routers, which was intended to make
> the CLI make look and feel more like a Cisco router, you have my
> condolences.  One would reasonably expect "delete ?" to respond with a list
> of valid arguments for that command.  Instead, it deleted, well...
> everything, and prompted an on-site restore/reboot.
>
> BCC originally stood for "Bay Command Console", but we joked that it
> really stood for "Blatant Cisco Clone".
>
> On Tue, Feb 16, 2021 at 2:37 PM John Kristoff  wrote:
>
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
>> Which examples would make up your top three?
>>
>> To get things started, I'd suggest the AS 7007 event is perhaps  the
>> most notorious and likely to top many lists including mine.  So if
>> that is one for you I'm asking for just two more.
>>
>> I'm particularly interested in this as the first step in developing a
>> future NANOG session.  I'd be particularly interested in any issues
>> that also identify key individuals that might still be around and
>> interested in participating in a retrospective.  I already have someone
>> that is willing to talk about AS 7007, which shouldn't be hard to guess
>> who.
>>
>> Thanks in advance for your suggestions,
>>
>> John
>>
>


Re: CGNAT

2021-02-23 Thread Mark Andrews
IPv4AAS will also work easily for any ISP on the planet.

CGNAT requires IPv4 address space between the CGNAT and the customer CPE which 
doesn’t overlap with the Internet nor with the space behind the CPE (no, you 
can’t use RFC 1918).  100.64/10 gives you ~4M addresses (2^22 = 4,194,304) 
which fit these criteria, but that isn’t enough without reuse for the larger 
ISPs.

IPv4AAS uses no IPv4 addresses between the B4/NAT64/… and the CPE.

You should also be able to outsource IPv4AAS to specialist providers.  There 
is no requirement to run your own hardware.  You just populate your 
DHCPv6/RA/IPV4ONLY.ARPA with the correct information and the traffic will be 
delivered to the specialist providers.  I’m not sure if there are any out there 
yet but it is a business opportunity for anyone that is already running such 
boxes.
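
For illustration, the IPV4ONLY.ARPA mechanism (RFC 7050) can be exercised from
any host on a NAT64 network; a minimal sketch using standard C resolver calls
(error handling trimmed, and the printed address is only an example):

    /* Resolve the well-known name ipv4only.arpa over AAAA; a DNS64 answer
     * embeds 192.0.0.170/171 and so reveals the provider's NAT64 prefix. */
    #include <stdio.h>
    #include <netdb.h>
    #include <arpa/inet.h>

    int main(void)
    {
        struct addrinfo hints = { .ai_family = AF_INET6 }, *res, *p;
        char buf[INET6_ADDRSTRLEN];

        if (getaddrinfo("ipv4only.arpa", NULL, &hints, &res) != 0)
            return 1;
        for (p = res; p != NULL; p = p->ai_next) {
            struct sockaddr_in6 *s6 = (struct sockaddr_in6 *)p->ai_addr;
            inet_ntop(AF_INET6, &s6->sin6_addr, buf, sizeof(buf));
            printf("%s\n", buf);  /* e.g. 64:ff9b::c000:aa on a NAT64 network */
        }
        freeaddrinfo(res);
        return 0;
    }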

Mark

> On 24 Feb 2021, at 09:43, Owen DeLong  wrote:
> 
> That’s provably not true if the IPv4AAS implementation is done carefully.
> 
> Owen
> 
> 
>> On Feb 19, 2021, at 12:11 PM, Tony Wicks  wrote:
>> 
>> Because then a large part of the Internet won't work
>> 
>> From: NANOG  on behalf of Mark 
>> Andrews 
>> Sent: Saturday, 20 February 2021, 9:04 am
>> To: Steve Saner
>> Cc: nanog@nanog.org
>> Subject: Re: CGNAT
>> 
>> Why not go whole hog and provide IPv4 as a service? That way you are not 
>> waiting for your customers to turn up IPv6 to take the load off your NAT box.
>> 
>> Yes, you can do it dual stack but you have waited so long you may as well 
>> miss that step along the deployment path.
>> -- 
>> Mark Andrews
>> 
>>> On 20 Feb 2021, at 01:55, Steve Saner  wrote:
>>> 
>>> 
>>> We are starting to look at CGNAT solutions. The primary motivation at the 
>>> moment is to extend current IPv4 resources, but IPv6 migration is also a 
>>> factor.
>>> 
>>> We've been in touch with A10. Just wondering if there are some alternative 
>>> vendors that anyone would recommend. We'd probably be looking at a solution 
>>> to support 5k to 15k customers and bandwidth up to around 30-40 gig as a 
>>> starting point. A solution that is as transparent to user experience as 
>>> possible is a priority.
>>> 
>>> Thanks
>>> 
>>> -- 
>>> Steve Saner
>>> ideatek HUMAN AT OUR VERY FIBER
>>> 
>> 
> 

-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742  INTERNET: ma...@isc.org



Re: DPDK and energy efficiency

2021-02-23 Thread Robert Bays
Hi Etienne,

Your statement that DPDK “keeps utilization at 100% regardless of packet 
activity” is just not correct.  You further presuppose "widespread DPDK's core 
operating inefficiency” without any data to back up the operating inefficiency 
assertion.  Your statements, taken at face value, lead people to believe that 
if a project uses DPDK it’s going to increase their power costs.  And that’s 
just not the case.  Please don’t mislead the community into believing that DPDK 
== power bad.

Everything following is informational.  Stop here if so inclined.

DPDK does not dictate CPU utilization or power consumption, the application 
leveraging DPDK does.  It’s the application that decides how to poll packets.  
If an application implements DPDK using only a tight polling loop, then it will 
keep CPU cores that are running DPDK threads at 100%.  But only the most simple 
and/or bespoke (think trading) applications are implemented this way.  You 
don’t need tight polling all the time to get the performance gains provided by 
DPDK or similar environments.  The vast majority of applications that this 
audience would actually install in their networks do not do tight polling all 
the time and therefore don’t consume 100% of the CPU all the time.   An 
interesting, helpful research effort you could lead would be to survey the 
ecosystem to catalog those applications that do fall into the power hungry 
category and help them to change their code.  

Intel DPDK application development guidelines don’t presuppose tight polling 
all the time and offer at least two methods for optimizing power against 
throughput.  The older method is to use adaptive polling: increasing the 
polling frequency as traffic load increases.  This keeps CPU utilization low 
when packet load is light and increases it as traffic levels warrant.  The 
second method is to use P-states and/or C-states to put the processor into 
lower power modes when traffic loads are lighter.  We have found that adaptive 
polling works better across a larger pool of hardware types, and therefore that 
is what DANOS uses, amongst other things.  
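
For illustration, a minimal sketch of the adaptive-polling idea (this is not
DANOS code; the DPDK calls are real, but the burst size, idle threshold and
sleep interval are illustrative assumptions):

    /* Back off when the RX queue is empty; ramp back up under load. */
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_cycles.h>
    #include <rte_pause.h>

    #define BURST 32

    static void adaptive_poll(uint16_t port, uint16_t queue)
    {
        struct rte_mbuf *pkts[BURST];
        unsigned int idle = 0;

        for (;;) {
            uint16_t n = rte_eth_rx_burst(port, queue, pkts, BURST);
            if (n == 0) {
                if (++idle > 1000)
                    rte_delay_us_sleep(100); /* sleep: lets the core reach a C-state */
                else
                    rte_pause();             /* cheap PAUSE hint; stays in C0 */
                continue;
            }
            idle = 0;
            /* hand pkts[0..n-1] to the datapath, which must forward or free them */
        }
    }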

Further, performance and power consumption are dictated by a multivariate set 
of application decisions including: design patterns such as single thread run 
to completion models vs. passing mbufs between multiple threads, buffer sizes 
and cache management algorithms, combining and/or separating tx/rx threads, 
binding threads to specific lcores, reserved cores for DPDK threads, 
hyperthreading, kernel schedulers, hypervisor schedulers, interface drivers, 
etc.  All of these are application specific, not DPDK generic.  Well written 
applications that leverage DPDK provide knobs for the user to tune these 
settings for their specific environment and use case.  None of this is unique to 
DPDK.  Solution designs were cribbed from previous technologies.

The takeaway is that DPDK (and similar) doesn’t guarantee runaway power bills.  
Power consumption is dictated by the application.  Look for well behaved 
applications and everything will be alright.

If you have questions, I’d be happy to discuss off line.

Thanks,
Robert.


> On Feb 22, 2021, at 11:27 PM, Etienne-Victor Depasquale  
> wrote:
> 
> Sorry, last line should have been:
> "intended to get an impression of how widespread ***knowledge of*** DPDK's 
> core operating inefficiency is",
> not:
> "intended to get an impression of how widespread DPDK's core operating 
> inefficiency is"
> 
> On Tue, Feb 23, 2021 at 8:22 AM Etienne-Victor Depasquale  
> wrote:
> Beyond RX/TX CPU affinity, in DANOS you can further tune power consumption by 
> changing the adaptive polling rate.  It doesn’t, per the survey, "keep 
> utilization at 100% regardless of packet activity.” 
> Robert, you seem to be conflating DPDK 
> with DANOS' power control algorithms that modulate DPDK's default behaviour.
> 
> Let me know what you think; otherwise, I'm pretty confident that DPDK does:
> "keep utilization at 100% regardless of packet activity.”   
> 
> Keep in mind that this is a bare-bones survey intended for busy, 
> knowledgeable people (the ones you'd find on NANOG) -
> not a detailed breakdown of modes of operation of DPDK or DANOS.
> DPDK has been designed for fast I/O that's unencumbered by the trappings of 
> general-purpose OSes, 
> and that's the impression that needs to be forefront.
> Power control, as well as any other dimensions of modulation, 
> are detailed modes of operation that are well beyond the scope of a 
> bare-bones 2-question survey 
> intended to get an impression of how widespread DPDK's core operating 
> inefficiency is.
> 
> Cheers,
> 
> Etienne 
> 
> On Mon, Feb 22, 2021 at 10:20 PM Robert Bays  wrote:
> Beyond RX/TX CPU affinity, in DANOS you can further tune power consumption by 
> changing the adaptive polling rate.  It doesn’t, per the survey, "keep 
> utilization at 100% regardless of packet activity.”  Adaptive polling changes 
> in DPDK optimize for 

Re: DPDK and energy efficiency

2021-02-23 Thread Etienne-Victor Depasquale
>
> This is way too deep in the weeds of developing with the DPDK
> libraries for your audience here to have much in the way of useful
> comment.  This is an operators group.
>

Fair enough, and thank you for stepping on the brakes :)

Honestly, I didn't intend to get embroiled in this. The questions were
bare-bones and relate to common use of DPDK.

Over and out. I'll post the results on Friday evening CET.

Cheers,

Etienne

On Tue, Feb 23, 2021 at 11:38 PM William Herrin  wrote:

> On Tue, Feb 23, 2021 at 2:22 PM Etienne-Victor Depasquale
>  wrote:
> >> DPDK doesn't inherently do much in the way of power management.
> >
> > I agree - it doesn't. That's not what it was made for.
> >
> >> Note that DPDK applications are usually intended to run in very-high
> >> data rate environments where no gains are likely to be realized by
> >> avoiding a busy-wait loop.
> >
> > That's not what research shows.
> >
> > Use of LPI states is proposed for power management under high data rate
> > conditions in [5], and in [6] use of the low-power instruction "halt" is
> > investigated and found to save power under such conditions.
>
> Howdy,
>
> This is way too deep in the weeds of developing with the DPDK
> libraries for your audience here to have much in the way of useful
> comment.  This is an operators group.
>
> If anyone is interested, the techniques DPDK offers application
> authors to manage power on the dataplane cores are described here:
>
> https://doc.dpdk.org/guides/prog_guide/power_man.html
>
> The main thing devs do, since it's easy, is add a call to rte_pause()
> in any empty polling loop. IIRC, that just calls the CPU PAUSE
> instruction which doesn't actually pause anything but saves a little
> power by de-pipelining and, if hyperthreading is enabled, releasing
> the core to run the alternate thread.
>
> Regards,
> Bill Herrin
>
>
>
> --
> William Herrin
> b...@herrin.us
> https://bill.herrin.us/
>


-- 
Ing. Etienne-Victor Depasquale
Assistant Lecturer
Department of Communications & Computer Engineering
Faculty of Information & Communication Technology
University of Malta
Web. https://www.um.edu.mt/profile/etiennedepasquale


Re: DPDK and energy efficiency

2021-02-23 Thread Etienne-Victor Depasquale
>
> This comes from OVS code and shows OVS thread spinning, not DPDK PMD.
> Blame the OVS application for not using e.g. _mm_pause() and burning
> the CPU like crazy.
>


OK, I'm citing a bit more from the same reference:
"By tracing back to the function’s caller
in the PMD thread main(void *f_),
we found that the thread kept spinning on the following code block:

    for ( ; ; ) {
        for ( i = 0; i < poll_cnt; i++) {
            dp_netdev_process_rxq_port(pmd, list[i].port, poll_list[i].rx);
        }
    }

This indicates that the [PMD] thread was continuously
monitoring and executing the receiving data path."

Cheers,

Etienne

On Tue, Feb 23, 2021 at 10:33 PM Pawel Malachowski <
pawmal-na...@freebsd.lublin.pl> wrote:

> > > No, it is not PMD that runs the processor in a polling loop.
> > > It is the application itself, thay may or may not busy loop,
> > > depending on application programmers choice.
> >
> > From one of my earlier references [2]:
> >
> > "we found that a poll mode driver (PMD)
> > thread accounted for approximately 99.7 percent
> > CPU occupancy (a full core utilization)."
> >
> > And further on:
> >
> > "we found that the thread kept spinning on the following code block:
> >
> > for ( ; ; ) {
> >     for ( i = 0; i < poll_cnt; i++) {
> >         dp_netdev_process_rxq_port(pmd, list[i].port, poll_list[i].rx);
> >     }
> > }
> > This indicates that the thread was continuously
> > monitoring and executing the receiving data path."
>
> This comes from OVS code and shows OVS thread spinning, not DPDK PMD.
> Blame the OVS application for not using e.g. _mm_pause() and burning
> the CPU like crazy.
>
>
> For comparison, take a look at top+i7z output from DPDK-based 100G DDoS
> scrubber currently lifting some low traffic using cores 1-13 on 16 core
> host. It uses naive DPDK::rte_pause() throttling to enter C1.
>
> Tasks: 342 total,   1 running, 195 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  6.6 us,  0.6 sy,  0.0 ni, 89.7 id,  3.1 wa,  0.0 hi,  0.0 si,  0.0 st
>
> Core [core-id] :  Actual Freq (Mult.)  C0%    Halt(C1)%  C3%   C6%    Temp  VCore
> Core  1 [0]    :  1467.73 (14.68x)     2.15   5.35       1     92.3   43    0.6724
> Core  2 [1]    :  1201.09 (12.01x)     11.7   93.9       0     0      39    0.6575
> Core  3 [2]    :  1200.06 (12.00x)     11.8   93.8       0     0      42    0.6543
> Core  4 [3]    :  1200.14 (12.00x)     11.8   93.8       0     0      41    0.6549
> Core  5 [4]    :  1200.10 (12.00x)     11.8   93.8       0     0      41    0.6526
> Core  6 [5]    :  1200.12 (12.00x)     11.8   93.8       0     0      40    0.6559
> Core  7 [6]    :  1201.01 (12.01x)     11.8   93.8       0     0      41    0.6559
> Core  8 [7]    :  1201.02 (12.01x)     11.8   93.8       0     0      43    0.6525
> Core  9 [8]    :  1201.00 (12.01x)     11.8   93.8       0     0      41    0.6857
> Core 10 [9]    :  1201.04 (12.01x)     11.8   93.8       0     0      40    0.6541
> Core 11 [10]   :  1201.95 (12.02x)     13.6   92.9       0     0      40    0.6558
> Core 12 [11]   :  1201.02 (12.01x)     11.8   93.8       0     0      42    0.6526
> Core 13 [12]   :  1204.97 (12.05x)     17.6   90.8       0     0      45    0.6814
> Core 14 [13]   :  1248.39 (12.48x)     28.2   84.7       0     0      41    0.6855
> Core 15 [14]   :  2790.74 (27.91x)     91.9   0          1     1      41    0.8885  <-- not PMD
> Core 16 [15]   :  1262.29 (12.62x)     13.1   34.9       1.7   56.2   43    0.6616
>
> $ dataplanectl stats fcore | grep total
> fcore total idle 393788223887 work 860443658 (0.2%) (forced-idle
> 7458486526622) recv 202201388561 drop 61259353721 (30.3%) limit 269909758
> (0.1%) pass 140606076622 (69.6%) ingress 66048460 (0.0%/0.0%) sent
> 162580376914 (80.4%/100.0%) overflow 0 (0.0%) sampled 628488188/628488188
>
>
>
> --
> Pawel Malachowski
> @pawmal80
>


-- 
Ing. Etienne-Victor Depasquale
Assistant Lecturer
Department of Communications & Computer Engineering
Faculty of Information & Communication Technology
University of Malta
Web. https://www.um.edu.mt/profile/etiennedepasquale


Re: CGNAT

2021-02-23 Thread Owen DeLong via NANOG
That’s provably not true if the IPv4AAS implementation is done carefully.

Owen


> On Feb 19, 2021, at 12:11 PM, Tony Wicks  wrote:
> 
> Because then a large part of the Internet won't work
> 
> From: NANOG  on behalf of Mark 
> Andrews 
> Sent: Saturday, 20 February 2021, 9:04 am
> To: Steve Saner
> Cc: nanog@nanog.org
> Subject: Re: CGNAT
> 
> Why not go whole hog and provide IPv4 as a service? That way you are not 
> waiting for your customers to turn up IPv6 to take the load off your NAT box.
> 
> Yes, you can do it dual stack but you have waited so long you may as well 
> miss that step along the deployment path.
> -- 
> Mark Andrews
> 
>> On 20 Feb 2021, at 01:55, Steve Saner  wrote:
>> 
>> 
>> We are starting to look at CGNAT solutions. The primary motivation at the 
>> moment is to extend current IPv4 resources, but IPv6 migration is also a 
>> factor.
>> 
>> We've been in touch with A10. Just wondering if there are some alternative 
>> vendors that anyone would recommend. We'd probably be looking at a solution 
>> to support 5k to 15k customers and bandwidth up to around 30-40 gig as a 
>> starting point. A solution that is as transparent to user experience as 
>> possible is a priority.
>> 
>> Thanks
>> 
>> -- 
>> Steve Saner
>> ideatek HUMAN AT OUR VERY FIBER
>> 
> 



Re: Texas internet connectivity declining due to blackouts

2021-02-23 Thread Tom Beecher
The issue is that while there is a lot of information out there detailing
the risks of variable rate supply plans, the majority of consumers are not
equipped to properly understand that risk; these are complex markets in the
best of times. Many of these companies are also borderline predatory in how
they market their services. It's the standard model you see in many
industries: highlight the savings, fine-print or hide the risk, and then
when the consumer gets screwed, point and say 'well, they should have
understood what they signed up for!'. That's complete trash when it comes
to life-critical utilities.


On Wed, Feb 17, 2021 at 7:25 PM Yang Yu  wrote:

> On Wed, Feb 17, 2021 at 10:46 AM John Sage  wrote:
> > This article is an interest description of Texas electricity pricing for
> > one provider and for the market in general:
> >
> https://www.dallasnews.com/business/energy/2021/02/16/electricity-retailer-griddys-unusual-plea-to-texas-customers-leave-now-before-you-get-a-big-bill/
>
> That is far from the market in general.
>
> Most people use a fixed rate plan (can easily find one without rebate
> for <10c/kwh after taxes & fees). The customer would have to make an
> explicit decision to pick a variable/market rate plan (excluded by
> default on http://powertochoose.org/) with higher risk and cheaper
> electricity when the wholesale price is low.
>
>
> http://www.puc.texas.gov/consumer/facts/factsheets/elecfacts/Electricplans.pdf
>
> >Changing Rate (Variable) Plans have rates per kWh that can vary according
> to a method determined solely by the provider and may be dependent on
> market changes and other exceptions beyond the provider's control
> >Market Rate (Indexed) Plans have rates per kWh that can vary according to
> pre-defined publicly available indices or information and other exceptions
> beyond the provider's control
>
>
> > The highest the price can go to is $9/kWh (which has only ever happened
> 0.005% of the time.) Most of the time though, 96.9% to be exact, it is
> below the Texas Average of 6.8¢/kWh
> https://www.griddy.com/texas/learn-more#learn-pricing
>


Re: DPDK and energy efficiency

2021-02-23 Thread William Herrin
On Tue, Feb 23, 2021 at 2:22 PM Etienne-Victor Depasquale
 wrote:
>> DPDK doesn't inherently do much in the way of power management.
>
> I agree - it doesn't. That's not what it was made for.
>
>> Note that DPDK applications are usually intended to run in very-high
>> data rate environments where no gains are likely to be realized by
>> avoiding a busy-wait loop.
>
> That's not what research shows.
>
> Use of LPI states is proposed for power management under high data rate
> conditions in [5], and in [6] use of the low-power instruction "halt" is
> investigated and found to save power under such conditions.

Howdy,

This is way too deep in the weeds of developing with the DPDK
libraries for your audience here to have much in the way of useful
comment.  This is an operators group.

If anyone is interested, the techniques DPDK offers application
authors to manage power on the dataplane cores are described here:

https://doc.dpdk.org/guides/prog_guide/power_man.html

The main thing devs do, since it's easy, is add a call to rte_pause()
in any empty polling loop. IIRC, that just calls the CPU PAUSE
instruction which doesn't actually pause anything but saves a little
power by de-pipelining and, if hyperthreading is enabled, releasing
the core to run the alternate thread.
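
For illustration, the pattern amounts to something like this sketch (not code
from the DPDK docs; rte_eth_rx_burst() and rte_pause() are real DPDK calls,
while the surrounding loop and variable names are assumed):

    /* Otherwise-empty poll loop with the PAUSE hint described above. */
    while (rte_eth_rx_burst(port, queue, pkts, BURST) == 0)
        rte_pause();  /* de-pipelines the core; with SMT, yields to the sibling */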

Regards,
Bill Herrin



-- 
William Herrin
b...@herrin.us
https://bill.herrin.us/


Re: Famous operational issues

2021-02-23 Thread Shawn L via NANOG

That brings back memories... I had a similar experience.  First month on the 
job, a large Sun RAID array storing ~5k mailboxes dies in the middle of the 
afternoon.  So I start troubleshooting and determine it's most likely a bad 
disk.  The CEO walked into the server room right about the time I had 20 disks 
laid out on a table.  He had a fit and called the desktop support guy to come 
and 'show me how to fix a pc'.
 
Never mind the fact that we had a 90% ready-to-go replacement box sitting at 
another site, and we just needed to either go get it, or bring the disks to 
it. So we sat there until the desktop guy, who was 30 minutes away, got 
there.  He took one look at it and said 'never touched that thing before, looks 
like he knows what he's doing' and pointed to me.  4 hours later we were 
driving the new server to the data center, strapped down in the back of a 
pickup.  Fun times.
 
 
-Original Message-
From: "Justin Streiner" 
Sent: Tuesday, February 23, 2021 5:11pm
To: "John Kristoff" 
Cc: "NANOG" 
Subject: Re: Famous operational issues



Beyond the widespread outages, I have so many personal war stories that it's 
hard to pick a favorite.
My first job out of college in the mid-late 90s was at an ISP in Pittsburgh 
that I joined pretty early in its existence, and everyone did a bit of 
everything. I was hired to do sysadmin stuff, networking, pretty much whatever 
was needed. About a year after I started, we brought up a new mail system with 
an external RAID enclosure for the mail store itself.  One day, we saw 
indications that one of the disks in the RAID enclosure was starting to fail, 
so I scheduled a maintenance window to replace the disk and let the controller 
rebuild the data and integrate it back into the RAID set.  No big worries, 
right?

It's Tuesday at about 2 AM.

Well, the kernel on the RAID controller itself decided that when I pulled the 
failing drive would be a fine time to panic, and more or less turn itself into 
a bit-blender, and take all the mailstore down with it.  After a few hours of 
watching fsck make no progress on anything, in terms of trying to un-fsck the 
mailstore, we made the decision in consultation with the CEO to pull the plug 
on trying to bring the old RAID enclosure back to life, and focus on finding 
suitable replacement hardware and rebuild from scratch.  We also discovered 
that the most recent backups of the mailstore were over a month old :(

I think our CEO ended up driving several hours to procure a suitable enclosure. 
 By the time we got the enclosure installed, filesystems built, and got 
whatever tape backups we had restored, and tested the integrity of the system, 
it was now Thursday around 8 AM. Coincidentally, that was the same day the 
company hosted a big VIP gathering (the mayor was there, along with lots of 
investors and other bigwigs), so I had to come back and put on a suit to hobnob 
with the VIPs after getting a total of 6 hours of sleep in about the previous 3 
days.  I still don't know how I got home that night without wrapping my vehicle 
around a utility pole (due to being over-tired, not due to alcohol).

Many painful lessons learned over that stretch of days, as is often the case as a 
company grows from startup mode and builds more robust technology and business 
processes as a consequence of growth.

jms


On Tue, Feb 16, 2021 at 2:37 PM John Kristoff  wrote:

 Friends,

 I'd like to start a thread about the most famous and widespread Internet
 operational issues, outages or implementation incompatibilities you
 have seen.

 Which examples would make up your top three?

 To get things started, I'd suggest the AS 7007 event is perhaps  the
 most notorious and likely to top many lists including mine.  So if
 that is one for you I'm asking for just two more.

 I'm particularly interested in this as the first step in developing a
 future NANOG session.  I'd be particularly interested in any issues
 that also identify key individuals that might still be around and
 interested in participating in a retrospective.  I already have someone
 that is willing to talk about AS 7007, which shouldn't be hard to guess
 who.

 Thanks in advance for your suggestions,

 John

Re: DPDK and energy efficiency

2021-02-23 Thread Etienne-Victor Depasquale
Oh dear ... instead of "and in [6]", I should have written "and in [3]".

On Tue, Feb 23, 2021 at 11:21 PM Etienne-Victor Depasquale 
wrote:

>> DPDK doesn't inherently do much in the way of power management.
>
> I agree - it doesn't. That's not what it was made for.
>
>> Note that DPDK applications are usually intended to run in very-high
>> data rate environments where no gains are likely to be realized by
>> avoiding a busy-wait loop.
>
> That's not what research shows.
>
> Use of LPI states is proposed for power management under high data rate
> conditions in [5] and
> in [6], use of the low-power instruction halt is investigated and
> found to save power under such conditions.
>
> Cheers,
>
> Etienne
>
> [3] X. Li, W. Cheng, T. Zhang, F. Ren, and B. Yang, “Towards Power
> Efficient High Performance Packet I/O,”
> IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 4, pp.
> 981–996, April 2020,
> ISSN:1558-2183. DOI: 10.1109/TPDS.2019.2957746
>
> [5] R. Bolla, R. Bruschi, F. Davoli, and J. F. Pajo, “A Model-Based
> Approach Towards Real-Time Analytics in NFV Infrastructures,”
> IEEE Transactions on Green Communications and Networking, vol. 4, no. 2,
> pp. 529–541, Jun. 2020, ISSN: 2473-2400.
> DOI: 10.1109/TGCN.2019.2961192.
>
>
> On Tue, Feb 23, 2021 at 11:04 PM William Herrin  wrote:
>
>> On Mon, Feb 22, 2021 at 11:24 PM Etienne-Victor Depasquale
>>  wrote:
>> >>
>> >> Beyond RX/TX CPU affinity, in DANOS you can further tune power
>> consumption by changing the adaptive polling rate.  It doesn’t, per the
>> survey, "keep utilization at 100% regardless of packet activity.”
>> >
>> > Robert, you seem to be conflating DPDK
>> > with DANOS' power control algorithms that modulate DPDK's default
>> behaviour.
>> > Keep in mind that this is a bare-bones survey intended for busy,
>> knowledgeable people (the ones you'd find on NANOG) -
>>
>> Hi,
>>
>> Since you understand that, I'm not really clear what you're asking in
>> the survey.
>>
>> DPDK doesn't inherently do much in the way of power management. The
>> polling loops are in the application side of the software, not the
>> DPDK libraries or NIC driver. It's up to the application author to
>> decide to detect idleness in the polling loop and take action to
>> reduce CPU load. If they go for a simple busy-wait, the dataplane
>> cores run at 100% all the time regardless of packet load. This has the
>> expected impact on the server's power consumption.
>>
>> Note that DPDK applications are usually intended to run in very-high
>> data rate environments where no gains are likely to be realized by
>> avoiding a busy-wait loop.
>>
>> Regards,
>> Bill Herrin
>>
>>
>> --
>> William Herrin
>> b...@herrin.us
>> https://bill.herrin.us/
>>
>
>
> --
> Ing. Etienne-Victor Depasquale
> Assistant Lecturer
> Department of Communications & Computer Engineering
> Faculty of Information & Communication Technology
> University of Malta
> Web. https://www.um.edu.mt/profile/etiennedepasquale
>


-- 
Ing. Etienne-Victor Depasquale
Assistant Lecturer
Department of Communications & Computer Engineering
Faculty of Information & Communication Technology
University of Malta
Web. https://www.um.edu.mt/profile/etiennedepasquale


Re: Famous operational issues

2021-02-23 Thread Justin Streiner
An interesting sub-thread to this could be:

Have you ever unintentionally crashed a device by running a perfectly
innocuous command?
1. Crashed a 6500/Sup2 by typing "show ip dhcp binding".
2. "clear interface XXX" on a Nexus 7K triggered a cascading/undocumented
Sev1 bug that caused two linecards to crash and reload, and took down about
two dozen buildings on campus at the .edu where I used to work.
3. For those that ever had the misfortune of using early versions of the
"bcc" command shell* on Bay Networks routers, which was intended to make
the CLI look and feel more like a Cisco router, you have my
condolences.  One would reasonably expect "delete ?" to respond with a list
of valid arguments for that command.  Instead, it deleted, well...
everything, and prompted an on-site restore/reboot.

BCC originally stood for "Bay Command Console", but we joked that it really
stood for "Blatant Cisco Clone".

On Tue, Feb 16, 2021 at 2:37 PM John Kristoff  wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps  the
> most notorious and likely to top many lists including mine.  So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session.  I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective.  I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>


Re: DPDK and energy efficiency

2021-02-23 Thread Etienne-Victor Depasquale
> DPDK doesn't inherently do much in the way of power management.

I agree - it doesn't. That's not what it was made for.

> Note that DPDK applications are usually intended to run in very-high
> data rate environments where no gains are likely to be realized by
> avoiding a busy-wait loop.

That's not what research shows.

Use of LPI states is proposed for power management under high data rate
conditions in [5] and
in [6], use of the low-power instruction halt is investigated and found
to save power under such conditions.

Cheers,

Etienne

[3] X. Li, W. Cheng, T. Zhang, F. Ren, and B. Yang, “Towards Power
Efficient High Performance Packet I/O,”
IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 4, pp.
981–996, April 2020,
ISSN:1558-2183. DOI: 10.1109/TPDS.2019.2957746

[5] R. Bolla, R. Bruschi, F. Davoli, and J. F. Pajo, “A Model-Based
Approach Towards Real-Time Analytics in NFV Infrastructures,”
IEEE Transactions on Green Communications and Networking, vol. 4, no. 2,
pp. 529–541, Jun. 2020, ISSN: 2473-2400.
DOI: 10.1109/TGCN.2019.2961192.


On Tue, Feb 23, 2021 at 11:04 PM William Herrin  wrote:

> On Mon, Feb 22, 2021 at 11:24 PM Etienne-Victor Depasquale
>  wrote:
> >>
> >> Beyond RX/TX CPU affinity, in DANOS you can further tune power
> consumption by changing the adaptive polling rate.  It doesn’t, per the
> survey, "keep utilization at 100% regardless of packet activity.”
> >
> > Robert, you seem to be conflating DPDK
> > with DANOS' power control algorithms that modulate DPDK's default
> behaviour.
> > Keep in mind that this is a bare-bones survey intended for busy,
> knowledgeable people (the ones you'd find on NANOG) -
>
> Hi,
>
> Since you understand that, I'm not really clear what you're asking in
> the survey.
>
> DPDK doesn't inherently do much in the way of power management. The
> polling loops are in the application side of the software, not the
> DPDK libraries or NIC driver. It's up to the application author to
> decide to detect idleness in the polling loop and take action to
> reduce CPU load. If they go for a simple busy-wait, the dataplane
> cores run at 100% all the time regardless of packet load. This has the
> expected impact on the server's power consumption.
>
> Note that DPDK applications are usually intended to run in very-high
> data rate environments where no gains are likely to be realized by
> avoiding a busy-wait loop.
>
> Regards,
> Bill Herrin
>
>
> --
> William Herrin
> b...@herrin.us
> https://bill.herrin.us/
>


-- 
Ing. Etienne-Victor Depasquale
Assistant Lecturer
Department of Communications & Computer Engineering
Faculty of Information & Communication Technology
University of Malta
Web. https://www.um.edu.mt/profile/etiennedepasquale


Re: Famous operational issues

2021-02-23 Thread Justin Streiner
Beyond the widespread outages, I have so many personal war stories that
it's hard to pick a favorite.

My first job out of college in the mid-late 90s was at an ISP in Pittsburgh
that I joined pretty early in its existence, and everyone did a bit of
everything. I was hired to do sysadmin stuff, networking, pretty much
whatever was needed. About a year after I started, we brought up a new mail
system with an external RAID enclosure for the mail store itself.  One day,
we saw indications that one of the disks in the RAID enclosure was starting
to fail, so I scheduled a maintenance window to replace the disk and let
the controller rebuild the data and integrate it back into the RAID set.
No big worries, right?

It's Tuesday at about 2 AM.

Well, the kernel on the RAID controller itself decided that when I pulled
the failing drive would be a fine time to panic, and more or less turn
itself into a bit-blender, and take all the mailstore down with it.  After
a few hours of watching fsck make no progress on anything, in terms of
trying to un-fsck the mailstore, we made the decision in consultation with
the CEO to pull the plug on trying to bring the old RAID enclosure back to
life, and focus on finding suitable replacement hardware and rebuild from
scratch.  We also discovered that the most recent backups of the mailstore
were over a month old :(

I think our CEO ended up driving several hours to procure a suitable
enclosure.  By the time we got the enclosure installed, filesystems built,
and got whatever tape backups we had restored, and tested the integrity of
the system, it was now Thursday around 8 AM. Coincidentally, that was the
same day the company hosted a big VIP gathering (the mayor was there, along
with lots of investors and other bigwigs), so I had to come back and put on
a suit to hobnob with the VIPs after getting a total of 6 hours of sleep in
about the previous 3 days.  I still don't know how I got home that night
without wrapping my vehicle around a utility pole (due to being over-tired,
not due to alcohol).

Many painful lessons learned over that stretch of days, as is often the case
as a company grows from startup mode and builds more robust technology and
business processes as a consequence of growth.

jms

On Tue, Feb 16, 2021 at 2:37 PM John Kristoff  wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps  the
> most notorious and likely to top many lists including mine.  So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session.  I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective.  I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>


Re: DPDK and energy efficiency

2021-02-23 Thread William Herrin
On Mon, Feb 22, 2021 at 11:24 PM Etienne-Victor Depasquale
 wrote:
>>
>> Beyond RX/TX CPU affinity, in DANOS you can further tune power consumption 
>> by changing the adaptive polling rate.  It doesn’t, per the survey, "keep 
>> utilization at 100% regardless of packet activity.”
>
> Robert, you seem to be conflating DPDK
> with DANOS' power control algorithms that modulate DPDK's default behaviour.
> Keep in mind that this is a bare-bones survey intended for busy, 
> knowledgeable people (the ones you'd find on NANOG) -

Hi,

Since you understand that, I'm not really clear what you're asking in
the survey.

DPDK doesn't inherently do much in the way of power management. The
polling loops are in the application side of the software, not the
DPDK libraries or NIC driver. It's up to the application author to
decide to detect idleness in the polling loop and take action to
reduce CPU load. If they go for a simple busy-wait, the dataplane
cores run at 100% all the time regardless of packet load. This has the
expected impact on the server's power consumption.
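A sketch of what "taking action" can look like, using an escalating
backoff (the names and thresholds here are illustrative assumptions,
not from any real product):

    #include <unistd.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_pause.h>

    /* Hypothetical adaptive loop: spin briefly on empty polls, then
     * sleep so the core can reach a low-power state. */
    static void lcore_adaptive_loop(uint16_t port_id, uint16_t queue_id)
    {
        struct rte_mbuf *bufs[32];
        unsigned int idle_polls = 0;

        for (;;) {
            uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, bufs, 32);
            if (nb_rx > 0) {
                idle_polls = 0;
                /* ... process the packets ... */
                continue;
            }
            if (++idle_polls < 1000)
                rte_pause();   /* stay responsive to a new burst */
            else
                usleep(10);    /* trade a little latency for power */
        }
    }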

Note that DPDK applications are usually intended to run in very-high
data rate environments where no gains are likely to be realized by
avoiding a busy-wait loop.

Regards,
Bill Herrin


-- 
William Herrin
b...@herrin.us
https://bill.herrin.us/


Re: Famous operational issues

2021-02-23 Thread Justin Streiner
On Thu, Feb 18, 2021 at 5:38 PM Warren Kumari  wrote:

>
> 2: A somewhat similar thing would happen with the Ascend TNT Max, which
> had side-to-side airflow. These were dial termination boxes, and so people
> would install racks and racks of them. The first one would draw in cool air
> on the left, heat it up and ship it out the right. The next one over would
> draw in warm air on the left, heat it up further, and ship it out the
> right... Somewhere there is a fairly famous photo of a rack of TNT Maxes,
> with the final one literally on fire, and still passing packets.
>

We had several racks of TNTs at the peak of our dial POP phase, and I
believe we ended up designing baffles for the sides of those racks to pull
in cool air from the front of the rack to the left side of the chassis and
exhaust it out the back from the right side.  It wasn't perfect, but it did
the job.

The TNTs with channelized T3 interfaces were a great way to terminate lots
of modems in a reasonable amount of rack space with minimal cabling.

Thank you
jms


Re: DPDK and energy efficiency

2021-02-23 Thread Pawel Malachowski
> > No, it is not the PMD that runs the processor in a polling loop.
> > It is the application itself, that may or may not busy-loop,
> > depending on the application programmer's choice.
> 
> From one of my earlier references [2]:
> 
> "we found that a poll mode driver (PMD)
> thread accounted for approximately 99.7 percent
> CPU occupancy (a full core utilization)."
> 
> And further on:
> 
> "we found that the thread kept spinning on the following code block:
> 
> *for ( ; ; ) {for ( i = 0; i < poll_cnt; i ++) {dp_netdev_process_rxq_port
> (pmd, list[i].port, poll_list[i].rx) ;}}*
> This indicates that the thread was continuously
> monitoring and executing the receiving data path."

This comes from OVS code and shows OVS thread spinning, not DPDK PMD.
Blame the OVS application for not using e.g. _mm_pause() and burning
the CPU like crazy.
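For illustration, the kind of change being suggested looks something
like this (a hypothetical sketch, not actual OVS code; poll_one_port is
a stand-in for the per-port receive work):

    #include <emmintrin.h>   /* _mm_pause() */

    /* Stand-in for per-port RX processing; returns packets handled. */
    static int poll_one_port(int i) { (void)i; return 0; }

    static void pmd_spin_loop(int poll_cnt)
    {
        for (;;) {
            int pkts = 0;
            for (int i = 0; i < poll_cnt; i++)
                pkts += poll_one_port(i);
            if (pkts == 0)
                _mm_pause();  /* x86 PAUSE: eases pipeline and power
                               * pressure instead of spinning hot */
        }
    }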


For comparison, take a look at top+i7z output from a DPDK-based 100G DDoS
scrubber currently handling some light traffic using cores 1-13 on a 16-core
host. It uses naive DPDK rte_pause() throttling to enter C1.

Tasks: 342 total,   1 running, 195 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.6 us,  0.6 sy,  0.0 ni, 89.7 id,  3.1 wa,  0.0 hi,  0.0 si,  0.0 st

Core [core-id]   Actual Freq (Mult.)   C0%    Halt(C1)%   C3%    C6%   Temp   VCore
Core  1 [0]:     1467.73 (14.68x)      2.15    5.35       1      92.3   43    0.6724
Core  2 [1]:     1201.09 (12.01x)     11.7    93.9        0       0     39    0.6575
Core  3 [2]:     1200.06 (12.00x)     11.8    93.8        0       0     42    0.6543
Core  4 [3]:     1200.14 (12.00x)     11.8    93.8        0       0     41    0.6549
Core  5 [4]:     1200.10 (12.00x)     11.8    93.8        0       0     41    0.6526
Core  6 [5]:     1200.12 (12.00x)     11.8    93.8        0       0     40    0.6559
Core  7 [6]:     1201.01 (12.01x)     11.8    93.8        0       0     41    0.6559
Core  8 [7]:     1201.02 (12.01x)     11.8    93.8        0       0     43    0.6525
Core  9 [8]:     1201.00 (12.01x)     11.8    93.8        0       0     41    0.6857
Core 10 [9]:     1201.04 (12.01x)     11.8    93.8        0       0     40    0.6541
Core 11 [10]:    1201.95 (12.02x)     13.6    92.9        0       0     40    0.6558
Core 12 [11]:    1201.02 (12.01x)     11.8    93.8        0       0     42    0.6526
Core 13 [12]:    1204.97 (12.05x)     17.6    90.8        0       0     45    0.6814
Core 14 [13]:    1248.39 (12.48x)     28.2    84.7        0       0     41    0.6855
Core 15 [14]:    2790.74 (27.91x)     91.9     0          1       1     41    0.8885  <-- not PMD
Core 16 [15]:    1262.29 (12.62x)     13.1    34.9        1.7    56.2   43    0.6616

$ dataplanectl stats fcore | grep total
fcore total idle 393788223887 work 860443658 (0.2%) (forced-idle 7458486526622)
  recv 202201388561 drop 61259353721 (30.3%) limit 269909758 (0.1%)
  pass 140606076622 (69.6%) ingress 66048460 (0.0%/0.0%)
  sent 162580376914 (80.4%/100.0%) overflow 0 (0.0%) sampled 628488188/628488188



-- 
Pawel Malachowski
@pawmal80


Re: Famous operational issues

2021-02-23 Thread Warren Kumari
On Mon, Feb 22, 2021 at 7:31 PM  wrote:

>
> At Boston Univ we discovered the hard way that a security guard's
> walkie-talkie could cause a $5,000 (or $10K for the big machine room)
> Halon dump.
>

At one of the AOL datacenters there was some convoluted fire marshal reason
why a specific door could not be locked "during business hours" (?!), and
so there was a guard permanently stationed outside. The door was all the
way around the back of the building, and so basically never used - and so
the guard would fall asleep outside it with a piece of cardboard saying
"Please wake me before entering". He was a nice guy (and it was less faff
than the main entrance), and so we'd either sneak in and just not tell
anyone, or talk loudly while going round the corner so he could pretend to
have been awake the whole time...

W




>
> Took a couple of times before we figured out the connection, though; once
> we did, someone made it to the hold button before it actually dumped.
>
> Speaking of halon one very hot day I'm goofing off drinking coffee at
> a nearby sub shop when the owner tells me someone from the computing
> center was on the phone, that never happened before.
>
> Some poor operator was holding off the halon shot, it's a deadman's switch
> (well, button), and the building was doing its 110dB thing. Could I come
> help? The building was being evac'd.
>
> So my boss, who wasn't the sharpest knife in the drawer, follows me down
> as I enter, and I'm sweating like a pig with a floor panel sucker,
> trying to figure out which zone tripped.
>
> And he shouts at me over the alarms: WHY TF DOES IT DO THIS?! Angrily.
>
> I answered: well, maybe THERE'S A FIRE!!!
>
> At which point I notice the back of my shoulder is really bothering
> me, which I say to him, and he says hmmm there's a big bee on your
> back maybe he's stinging you?
>
> Fun day.
>
> --
> -Barry Shein
>
> Software Tool & Die| b...@theworld.com |
> http://www.TheWorld.com
> Purveyors to the Trade | Voice: +1 617-STD-WRLD   | 800-THE-WRLD
> The World: Since 1989  | A Public Information Utility | *oo*
>


-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra


Re: DPDK and energy efficiency

2021-02-23 Thread Etienne-Victor Depasquale
>
> Probably yeah.  Have you assessed the lifetime cost of running a
> multicore CPU at 100% vs at 10%, particularly as you're likely to have
> multiples of these devices in operation?
>

Spot on.

On Tue, Feb 23, 2021 at 6:07 PM Nick Hilliard  wrote:

> Shane Ronan wrote on 23/02/2021 16:59:
> > For use cases where DPDK matters, are you really concerned with power
> > consumption?
>
> Probably yeah.  Have you assessed the lifetime cost of running a
> multicore CPU at 100% vs at 10%, particularly as you're likely to have
> multiples of these devices in operation?
>
> Nick
>


-- 
Ing. Etienne-Victor Depasquale
Assistant Lecturer
Department of Communications & Computer Engineering
Faculty of Information & Communication Technology
University of Malta
Web. https://www.um.edu.mt/profile/etiennedepasquale


Re: DPDK and energy efficiency

2021-02-23 Thread Nick Hilliard

Shane Ronan wrote on 23/02/2021 16:59:
For use cases where DPDK matters, are you really concerned with power 
consumption?


Probably yeah.  Have you assessed the lifetime cost of running a 
multicore CPU at 100% vs at 10%, particularly as you're likely to have 
multiples of these devices in operation?
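Back-of-envelope, the difference adds up (every number below is an 
assumption for illustration, not a measurement):

    #include <stdio.h>

    /* Rough lifetime energy-cost delta for one busy-polling server. */
    int main(void)
    {
        const double watts_busy  = 200.0;  /* package power at ~100% */
        const double watts_idle  = 60.0;   /* mostly idle, C-states */
        const double usd_per_kwh = 0.10;
        const double hours       = 5.0 * 365.0 * 24.0;  /* 5 years */

        double delta_kwh = (watts_busy - watts_idle) * hours / 1000.0;
        printf("Extra cost per device: $%.0f\n", delta_kwh * usd_per_kwh);
        /* prints ~$613, before cooling overhead and device multiples */
        return 0;
    }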


Nick


Re: DPDK and energy efficiency

2021-02-23 Thread Shane Ronan
For use cases where DPDK matters, are you really concerned with power
consumption?


On Tue, Feb 23, 2021 at 11:48 AM Nick Hilliard  wrote:

> Etienne-Victor Depasquale wrote on 23/02/2021 16:03:
> > "we found that a poll mode driver (PMD)
> > thread accounted for approximately 99.7 percent
> > CPU occupancy (a full core utilization)."
>
> interrupt-driven network drivers generally can't compete with polled
> mode drivers at higher throughputs on generic CPU / PCI card systems.
> On this style of config, you optimise your driver parameters based on
> what works best under the specific conditions.
>
> Polled mode drivers have been around for a while, e.g.
>
> > https://svnweb.freebsd.org/base?view=revision&revision=87902
>
> Nick
>


Re: DPDK and energy efficiency

2021-02-23 Thread Nick Hilliard

Etienne-Victor Depasquale wrote on 23/02/2021 16:03:

"we found that a poll mode driver (PMD)
thread accounted for approximately 99.7 percent
CPU occupancy (a full core utilization)."


interrupt-driven network drivers generally can't compete with polled 
mode drivers at higher throughputs on generic CPU / PCI card systems. 
On this style of config, you optimise your driver parameters based on 
what works best under the specific conditions.


Polled mode drivers have been around for a while, e.g.


https://svnweb.freebsd.org/base?view=revision&revision=87902


Nick


Re: DPDK and energy efficiency

2021-02-23 Thread Etienne-Victor Depasquale
>
> No, it is not the PMD that runs the processor in a polling loop.
> It is the application itself, that may or may not busy-loop,
> depending on the application programmer's choice.
>

From one of my earlier references [2]:

"we found that a poll mode driver (PMD)
thread accounted for approximately 99.7 percent
CPU occupancy (a full core utilization)."

And further on:

"we found that the thread kept spinning on the following code block:




*for ( ; ; ) {for ( i = 0; i < poll_cnt; i ++) {dp_netdev_process_rxq_port
(pmd, list[i].port, poll_list[i].rx) ;}}*
This indicates that the thread was continuously
monitoring and executing the receiving data path."


[2] S. Fu, J. Liu, and W. Zhu, “Multimedia Content Delivery with Network
Function Virtualization: The Energy Perspective,”
 IEEE MultiMedia, vol. 24, no. 3, pp. 38–47, 2017, ISSN: 1941-0166.
DOI: 10.1109/MMUL.2017.3051514.

On Tue, Feb 23, 2021 at 12:59 PM Pawel Malachowski <
pawmal-na...@freebsd.lublin.pl> wrote:

> On Mon, Feb 22, 2021 at 12:45:52PM +0100, Etienne-Victor Depasquale
> wrote:
>
> > Every research paper I've read indicates that, regardless of whether it
> has
> > packets to process or not, DPDK PMDs (poll-mode drivers) prevent the CPU
> > from falling into an LPI (low-power idle).
> >
> > When it has no packets to process, the PMD runs the processor in a
> polling
> > loop that keeps utilization of the running core at 100%.
>
> No, it is not PMD that runs the processor in a polling loop.
> It is the application itself, thay may or may not busy loop,
> depending on application programmers choice.
>
>
> --
> Pawel Malachowski
> @pawmal80
>


-- 
Ing. Etienne-Victor Depasquale
Assistant Lecturer
Department of Communications & Computer Engineering
Faculty of Information & Communication Technology
University of Malta
Web. https://www.um.edu.mt/profile/etiennedepasquale


Re: CGNAT

2021-02-23 Thread Kevin Burke
Hi Steve

We are looking at implementing a similar solution with A10 for CGNAT.

> We've been in touch with A10. Just wondering if there are some alternative 
> vendors that anyone would recommend. We'd probably be looking at a solution to 
> support 5k to 15k customers and bandwidth up to around 30-40 gig as a starting 
> point. A solution that is as transparent to user experience as possible is a 
> priority.

The numbers below are for a similar target of subscribers and peak bandwidth.

We assumed a couple of numbers:
Current Peak Bandwidth = 40G
Remaining IPv4 traffic after migration = 20% (Seen references to 10% or 20% on 
this forum)
Future Bandwidth Growth = 2x (no data behind this assumption)
Future CGNAT’ed bandwidth = 15Gbps
Equipment & budget lifecycle = 7Yr
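A quick sketch of how those assumptions combine (assuming the future 
figure is simply peak x growth x IPv4 share; the rounding down to 
15Gbps is my reading, not a stated formula):

    #include <stdio.h>

    /* Sizing arithmetic from the assumptions listed above. */
    int main(void)
    {
        const double peak_gbps  = 40.0;  /* current peak bandwidth */
        const double growth     = 2.0;   /* future bandwidth growth */
        const double ipv4_share = 0.20;  /* IPv4 left after migration */

        printf("Future CGNAT'ed bandwidth: ~%.0f Gbps\n",
               peak_gbps * growth * ipv4_share);  /* prints ~16 */
        return 0;
    }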

Getting that data led us to this price comparison:

Solution              Lifecycle/Term   Annual Cost/Sub   Product Lifecycle Cost/Sub
Lease IPv4 Cogent           7              $4.45                 $31.13
A10 CGNAT 15Gb 7Yr          7              $1.21                 $ 8.47
A10 CGNAT 40Gb 7Yr          7              $1.95                 $13.68
Purchase @ $25 7Yr          7              $3.57                 $25.00


The current plan is to implement an A10 CGNAT solution after upgrading our network 
for IPv6.  In the interim we will have to lease IPv4 space to tide us over.

I would be curious to see how others estimate the costs of various 
approaches.  Feel free to ping me off-list for more specific numbers.

Kevin Burke
802-540-0979
Burlington Telecom
200 Church St, Burlington, VT

From: NANOG  on behalf of 
Steve Saner 
Date: Friday, February 19, 2021 at 9:56 AM
To: "nanog@nanog.org" 
Subject: CGNAT

We are starting to look at CGNAT solutions. The primary motivation at the 
moment is to extend current IPv4 resources, but IPv6 migration is also a factor.

We've been in touch with A10. Just wondering if there are some alternative 
vendors that anyone would recommend. We'd probably be looking at a solution to 
support 5k to 15k customers and bandwidth up to around 30-40 gig as a starting 
point. A solution that is as transparent to user experience as possible is a 
priority.

Thanks

--
Steve Saner
ideatek HUMAN AT OUR VERY FIBER

This email transmission, and any documents, files or previous email messages 
attached to it may contain confidential information. If the reader of this 
message is not the intended recipient or the employee or agent responsible for 
delivering the message to the intended recipient, you are hereby notified that 
any dissemination, distribution or copying of this communication is strictly 
prohibited. If you are not, or believe you may not be, the intended recipient, 
please advise the sender immediately by return email or by calling 
620.543.5026. Then take all steps necessary to permanently 
delete the email and all attachments from your computer system.


Wave Peering Contacts

2021-02-23 Thread Michael Crute
Is there anyone from Wave Broadband (AS11404) around here that can contact me 
off-list? Trying to resolve some public IX peering issues. None of the usual 
contacts are responding.

~mike


Re: DPDK and energy efficiency

2021-02-23 Thread Pawel Malachowski
On Mon, Feb 22, 2021 at 12:45:52PM +0100, Etienne-Victor Depasquale 
wrote:

> Every research paper I've read indicates that, regardless of whether it has
> packets to process or not, DPDK PMDs (poll-mode drivers) prevent the CPU
> from falling into an LPI (low-power idle).
> 
> When it has no packets to process, the PMD runs the processor in a polling
> loop that keeps utilization of the running core at 100%.

No, it is not the PMD that runs the processor in a polling loop.
It is the application itself, that may or may not busy-loop,
depending on the application programmer's choice.


-- 
Pawel Malachowski
@pawmal80


Re: STOP USING FONT SIZE SMALL Was: Re: LOAs for Cross Connects - Something like PeeringDB for XC

2021-02-23 Thread Douglas Fischer
Hmmm...
I don't know why this is happening,
considering my default set-up in the Gmail interface is defined to use
Normal size.
https://pasteboard.co/JPG2ZoK.png

In fact, I had not even realized that this mailing list forwards emails in
the exact format they were generated. Usually, mailman is set to convert
everything to plain text.

But thank you, Mama, for teaching me good manners.
I will try to behave better, and search for how to fix the size...
P.S.: If you could help me fix that in Gmail, I can show you the zoom
function in browsers or mail readers.



On Mon, Feb 22, 2021 at 9:40 PM Mark Andrews  wrote:

> Really, does anyone here think that it is good form to send email with
> font size *SMALL*?
> If your MUA does this by default complain to the developers.  The default
> should be “medium”.
> If the font is too big on your screen change the magnification *you*
> choose to display to *yourself*,
> don’t change the font size you send to everyone else.
>
> Mark
>
> new,monospace;font-size:small">Well... I must confess that I had some
> difficulty on the first understanding of what is proposed.But
>
>
> > On 23 Feb 2021, at 04:03, Douglas Fischer 
> wrote:
> >
> > Well... I must confess that I had some difficulty on the first
> understanding of what is proposed.
> >
> > But after the 4 reads, I saw that this "spaghetti" thing is more
> powerful than I could imagine!
> >
> >
> > Please correct me if I'm no right:
> > But it looks like a "crypto sign and publishes" anything related to an
> organization.
> >
> > Yes, I think that with some effort CrossConnect LOAs can be fitted
> inside of it...
> > I'm not sure if it is the better solution for the scope of LOAs, but
> certainly is a valid discussion.
> >
> >
> > What is bubbling in my mind is the standard data model for each type of
> different attribute that can exist...
> > Who will define that?
> >
> >
> >
> > On Mon, Feb 22, 2021 at 12:26 PM Christopher Morrow <
> morrowc.li...@gmail.com> wrote:
> > On Mon, Feb 22, 2021 at 9:19 AM Douglas Fischer
> >  wrote:
> > >
> > > I believe that almost everyone in here knows that LOAs for Cross
> Connects in Datacenters and Telecom Rooms can be a pain...
> > >
> > > I don't know if I'm suggesting something that already exists.
> > > Or even if I'm suggesting something that could be unpopular for some
> reason.
> > >
> > > But every time I need to deal with some Cross-Connect LOA, and mostly
> when we face some rework on data mistakes, I dream of a "PeeringDB for
> Cross Connects".
> > >
> >
> > are you asking about something like this:
> >   https://datatracker.ietf.org/doc/draft-spaghetti-sidrops-rpki-rsc/
> >
> > Which COULD be used to, as an AS holder:
> >   "sign something to be sent between you and the colo and your intended
> peer"
> >
> > that you could sign (with your rpki stuffs) and your peer could also
> > sign with their 'rpki stuffs', and which the colo provider could
> > automatically validate and action upon final signature(s) received.
> >
> > > So, this mail is a question and also a suggestion.
> > >
> > >
> > > There is something like an "online notary's office" exclusive for
> Cross-Connect LOAs?
> > >  - Somewhere Organizations can register information authorizing
> connections of Port on their Places (Cages, Racks, etc)...
> > >
> >
> > The RPKI data today doesn't contain information about
> > cages/ports/patch-panels, so possibly the spaghetti draft isn't a
> > terrific fit?
> >
> > > If it doesn't exist. What would be necessary for that?
> > > Mostly considering the PeeringDB work model.
> > >  - OpenSource.
> > >  - Free access to the tool, and sponsors to keep the project alive.
> > >  - API driven, with some Web-gui.
> > > And considering some data-modeling.
> > >  - Most of the data being Open/Public (Organizations,
> Facilities(Datacenters and/or Telecom-Rooms), Presence on Facilities, etc).
> > >  - Access control to Information that can not be public (A-side
> organization, Z-Side Organization, PatchPanel/Port).
> > > And some workflow
> > >  - Cross Connect Requirement/Authorization from A-Side
> > >  - Acceptance/Authorization from Z-side.
> > >  - Acceptance/Authorization from Facilities involved (could be more
> than one)
> > >  - Execution/Activation notice from Facilities.
> > >
> > >
> > > --
> > > Douglas Fernando Fischer
> > > Engº de Controle e Automação
> >
> >
> > --
> > Douglas Fernando Fischer
> > Engº de Controle e Automação
>
> --
> Mark Andrews, ISC
> 1 Seymour St., Dundas Valley, NSW 2117, Australia
> PHONE: +61 2 9871 4742  INTERNET: ma...@isc.org
>
>

-- 
Douglas Fernando Fischer
Engº de Controle e Automação