Re: Disaster Recovery Process

2021-10-06 Thread Wolfgang Tremmel
And a layer 8 item from me:

- put a number (as in money) into the process, up to which anything spent by 
anyone working on the recovery is covered.

It has to be a concrete number, because if you write "all costs are covered" 
it makes the recovery person second-guess whether the airplane ticket or spare 
part they just bought is really covered.

Additionally, you should put an "approver" on the team who can approve higher 
costs on very short notice.
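
To make both points concrete, the limit and the approver can be captured as data 
in the recovery runbook rather than as prose. A minimal sketch in Python, assuming 
a machine-readable runbook; the field names, amounts and contacts are illustrative, 
not anyone's actual policy:

from dataclasses import dataclass

@dataclass(frozen=True)
class RecoverySpendPolicy:
    """Pre-approved spending rules for a disaster-recovery runbook (illustrative)."""
    preapproved_limit_eur: int   # anything up to this is automatically covered
    escalation_approver: str     # person who can approve larger amounts fast
    approver_contact: str        # out-of-band contact, also on the printed sheet

    def needs_approval(self, amount_eur: int) -> bool:
        """Return True if this expense must go to the escalation approver."""
        return amount_eur > self.preapproved_limit_eur

# Illustrative values only -- every organisation picks its own number and approver.
POLICY = RecoverySpendPolicy(
    preapproved_limit_eur=5_000,
    escalation_approver="VP Operations (on-call)",
    approver_contact="see printed emergency contact sheet",
)

if __name__ == "__main__":
    # airplane ticket, spare part, emergency contractor
    for expense in (350, 4_800, 12_000):
        route = "escalate" if POLICY.needs_approval(expense) else "pre-approved"
        print(f"EUR {expense}: {route}")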

Wolfgang

> On 5. Oct 2021, at 16:05, Karl Auer  wrote:
> 
> On Tue, 2021-10-05 at 08:50 -0400, Jared Mauch wrote:
>> A few reminders for people:
>> [excellent list snipped]
> 
> I'd add one "soft" list item:

-- 
Wolfgang Tremmel 

Phone +49 69 1730902 0  | wolfgang.trem...@de-cix.net
Executive Directors: Harald A. Summa and Sebastian Seifert | Trade Registry: AG 
Cologne, HRB 51135
DE-CIX Management GmbH | Lindleystrasse 12 | 60314 Frankfurt am Main | Germany 
| www.de-cix.net



Re: Disaster Recovery Process

2021-10-05 Thread Jamie Dahl
The NIMS/ICS system works very well for issues like this.  I utilize ICS 
regularly in my Search and Rescue world, and the last two companies I worked 
for utilize(d) it extensively during outages.  It allows folks from various 
disciplines, roles and backgrounds to come in and apply a divide-and-conquer 
methodology to incidents, and it can be scaled up/scaled out as necessary.  
Phrases like "Incident Commander" have been around for a few decades and are 
concepts used regularly by FEMA, CalFire and others on natural-disaster-style 
incidents.  But those of you who may be EMComm folks probably already knew 
that ;-). 



this was pounded out on my iPhone and i have fat fingers plus  two left thumbs 
:)

We have to remember that what we observe is not nature herself, but nature 
exposed to our method of questioning.


> On Oct 5, 2021, at 10:11, jim deleskie  wrote:
> 
> 
> World broke.  Crazy $$ per hour down time.  Doors open with a fire axe.  
> Glass breaks super easy too, and is much less expensive than adding 15 min 
> to the failure.
> 
> -jim
> 
>> On Tue., Oct. 5, 2021, 7:05 p.m. Jeff Shultz,  wrote:
>> 7. Make sure any access controlled rooms have physical keys that are 
>> available at need - and aren't secured by the same access control that they 
>> are to circumvent. . 
>> 8. Don't make your access control dependent on internet access - always have 
>> something on the local network  it can fall back to. 
>> 
>> That last thing, that apparently their access control failed, locking people 
>> out when either their outward facing DNS and/or BGP routes went goodbye, is 
>> perhaps the most astounding thing to me - making your access control into an 
>> IoT device without (apparently) a quick workaround for a failure in the "I" 
>> part.
>> 
>>> On Tue, Oct 5, 2021 at 6:01 AM Jared Mauch  wrote:
>>> 
>>> 
>>> > On Oct 4, 2021, at 4:53 PM, Jorge Amodio  wrote:
>>> > 
>>> > How come such a large operation does not have out-of-band access in 
>>> > case of emergencies ???
>>> > 
>>> > 
>>> 
>>> I mentioned to someone yesterday that most OOB systems _are_ the internet.  
>>> It doesn’t always seem like you need things like modems or dial-backup, or 
>>> access to these services, except when you do it’s critical/essential.
>>> 
>>> A few reminders for people:
>>> 
>>> 1) Program your co-workers into your cell phone
>>> 2) Print out an emergency contact sheet
>>> 3) Have a backup conference bridge/system that you test
>>>   - if zoom/webex/ms are down, where do you go?  Slack?  Google meet? Audio 
>>> bridge?
>>>   - No judgement, but do test the system!
>>> 4) Know how to access the office and who is closest.  
>>>   - What happens if they are in the hospital, sick or on vacation?
>>> 5) Complacency is dangerous
>>>   - When the tools “just work” you never imagine the tools won’t work.  I’m 
>>> sure the lessons learned will be long internally.  
>>>   - I hope they share them externally so others can learn.
>>> 6) No really, test the backup process.
>>> 
>>> 
>>> 
>>> * interlude *
>>> 
>>> Back at my time at 2914 - one reason we all had T1’s at home was largely so 
>>> we could get in to the network should something bad happen.  My home IP 
>>> space was in the router ACLs.  Much changed since those early days as this 
>>> network became more reliable.  We’ve seen large outages in the past 2 years 
>>> of platforms, carriers, etc.. (the Aug 30th 2020 issue is still firmly in 
>>> my memory).  
>>> 
>>> Plan for the outages and make sure you understand your playbook.  It may be 
>>> from snow day to all hands on deck.  Test it at least once, and ideally 
>>> with someone who will challenge a few assumptions (eg: that the cell 
>>> network will be up)
>>> 
>>> - Jared
>> 
>> 
>> -- 
>> Jeff Shultz
>> 
>> 


Re: Disaster Recovery Process

2021-10-05 Thread jim deleskie
I don't see posting in a DR process thread about thinking to use alternative
entry methods to locked doors as spreading false information.  If you do,
well, mail filters are simple.

-jim

On Tue., Oct. 5, 2021, 7:35 p.m. Niels Bakker, 
wrote:

> * deles...@gmail.com (jim deleskie) [Tue 05 Oct 2021, 19:13 CEST]:
> >World broke.  Crazy $$ per hour down time.  Doors open with a fire axe.
>
> Please stop spreading fake news.
>
> https://twitter.com/MikeIsaac/status/1445196576956162050
> |need to issue a correction: the team dispatched to the Facebook site
> |had issues getting in because of physical security but did not need to
> |use a saw/ grinder.
>
>
> -- Niels.
>


Re: Disaster Recovery Process

2021-10-05 Thread Niels Bakker

* deles...@gmail.com (jim deleskie) [Tue 05 Oct 2021, 19:13 CEST]:

World broke.  Crazy $$ per hour down time.  Doors open with a fire axe.


Please stop spreading fake news.

https://twitter.com/MikeIsaac/status/1445196576956162050
|need to issue a correction: the team dispatched to the Facebook site
|had issues getting in because of physical security but did not need to
|use a saw/ grinder.


-- Niels.


Re: Disaster Recovery Process

2021-10-05 Thread Warren Kumari
On Tue, Oct 5, 2021 at 1:07 PM Jeff Shultz  wrote:

> 7. Make sure any access controlled rooms have physical keys that are
> available at need - and aren't secured by the same access control that they
> are to circumvent. .
> 8. Don't make your access control dependent on internet access - always
> have something on the local network  it can fall back to.
>
> That last thing, that apparently their access control failed, locking
> people out when either their outward facing DNS and/or BGP routes went
> goodbye, is perhaps the most astounding thing to me - making your access
> control into an IoT device without (apparently) a quick workaround for a
> failure in the "I" part.
>

Keep in mind that the "some employees couldn't get into their offices" story
has been filtered through the public press and seems to have grown into "OMG!
Lolz! No-one can fix the Facebook because no-one can reach the
turn-it-off-and-on-again button".
Facebook has many office buildings, and needs to be able to add and revoke
employee access as people are hired and quit, etc. Just because the press
said that some random employees were unable to enter their office building
doesn't actually mean that 1) this was a datacenter and they really needed
access, or 2) no-one was able to enter, or 3) this actually caused issues
with recovery.
Important buildings have security people who have controller-locked cards
and/or physical keys, offices != datacenter, etc.

I'm quite sure that this part of the story is a combination of some small
tidbit of information that a non-technical reporter was able to understand,
mixed with some "Hah. Look at those idiots, even I know to keep a spare key
under the doormat" schadenfreude.

W

>
> On Tue, Oct 5, 2021 at 6:01 AM Jared Mauch  wrote:
>
>>
>>
>> > On Oct 4, 2021, at 4:53 PM, Jorge Amodio  wrote:
>> >
>> > How come such a large operation does not have out-of-band access in
>> > case of emergencies ???
>> >
>> >
>>
>> I mentioned to someone yesterday that most OOB systems _are_ the
>> internet.  It doesn’t always seem like you need things like modems or
>> dial-backup, or access to these services, except when you do it’s
>> critical/essential.
>>
>> A few reminders for people:
>>
>> 1) Program your co-workers into your cell phone
>> 2) Print out an emergency contact sheet
>> 3) Have a backup conference bridge/system that you test
>>   - if zoom/webex/ms are down, where do you go?  Slack?  Google meet?
>> Audio bridge?
>>   - No judgement, but do test the system!
>> 4) Know how to access the office and who is closest.
>>   - What happens if they are in the hospital, sick or on vacation?
>> 5) Complacency is dangerous
>>   - When the tools “just work” you never imagine the tools won’t work.
>> I’m sure the lessons learned will be long internally.
>>   - I hope they share them externally so others can learn.
>> 6) No really, test the backup process.
>>
>>
>>
>> * interlude *
>>
>> Back at my time at 2914 - one reason we all had T1’s at home was largely
>> so we could get in to the network should something bad happen.  My home IP
>> space was in the router ACLs.  Much changed since those early days as this
>> network became more reliable.  We’ve seen large outages in the past 2 years
>> of platforms, carriers, etc.. (the Aug 30th 2020 issue is still firmly in
>> my memory).
>>
>> Plan for the outages and make sure you understand your playbook.  It may
>> be from snow day to all hands on deck.  Test it at least once, and ideally
>> with someone who will challenge a few assumptions (eg: that the cell
>> network will be up)
>>
>> - Jared
>
>
>
> --
> Jeff Shultz
>
>
>


-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra


Re: Disaster Recovery Process

2021-10-05 Thread jim deleskie
World broke.  Crazy $$ per hour down time.  Doors open with a fire axe.
Glass breaks super easy too, and is much less expensive than adding 15 min to
the failure.

-jim

On Tue., Oct. 5, 2021, 7:05 p.m. Jeff Shultz, 
wrote:

> 7. Make sure any access controlled rooms have physical keys that are
> available at need - and aren't secured by the same access control that they
> are to circumvent. .
> 8. Don't make your access control dependent on internet access - always
> have something on the local network  it can fall back to.
>
> That last thing, that apparently their access control failed, locking
> people out when either their outward facing DNS and/or BGP routes went
> goodbye, is perhaps the most astounding thing to me - making your access
> control into an IoT device without (apparently) a quick workaround for a
> failure in the "I" part.
>
> On Tue, Oct 5, 2021 at 6:01 AM Jared Mauch  wrote:
>
>>
>>
>> > On Oct 4, 2021, at 4:53 PM, Jorge Amodio  wrote:
>> >
>> > How come such a large operation does not have out-of-band access in
>> > case of emergencies ???
>> >
>> >
>>
>> I mentioned to someone yesterday that most OOB systems _are_ the
>> internet.  It doesn’t always seem like you need things like modems or
>> dial-backup, or access to these services, except when you do it’s
>> critical/essential.
>>
>> A few reminders for people:
>>
>> 1) Program your co-workers into your cell phone
>> 2) Print out an emergency contact sheet
>> 3) Have a backup conference bridge/system that you test
>>   - if zoom/webex/ms are down, where do you go?  Slack?  Google meet?
>> Audio bridge?
>>   - No judgement, but do test the system!
>> 4) Know how to access the office and who is closest.
>>   - What happens if they are in the hospital, sick or on vacation?
>> 5) Complacency is dangerous
>>   - When the tools “just work” you never imagine the tools won’t work.
>> I’m sure the lessons learned will be long internally.
>>   - I hope they share them externally so others can learn.
>> 6) No really, test the backup process.
>>
>>
>>
>> * interlude *
>>
>> Back at my time at 2914 - one reason we all had T1’s at home was largely
>> so we could get in to the network should something bad happen.  My home IP
>> space was in the router ACLs.  Much changed since those early days as this
>> network became more reliable.  We’ve seen large outages in the past 2 years
>> of platforms, carriers, etc.. (the Aug 30th 2020 issue is still firmly in
>> my memory).
>>
>> Plan for the outages and make sure you understand your playbook.  It may
>> be from snow day to all hands on deck.  Test it at least once, and ideally
>> with someone who will challenge a few assumptions (eg: that the cell
>> network will be up)
>>
>> - Jared
>
>
>
> --
> Jeff Shultz
>
>
>


Re: Disaster Recovery Process

2021-10-05 Thread Jeff Shultz
7. Make sure any access-controlled rooms have physical keys that are
available when needed - and aren't secured by the same access control that
they are meant to circumvent.
8. Don't make your access control dependent on internet access - always
have something on the local network it can fall back to.

That last thing - that their access control apparently failed, locking
people out when their outward-facing DNS and/or BGP routes went goodbye - is
perhaps the most astounding thing to me: making your access control into an
IoT device without (apparently) a quick workaround for a failure in the "I"
part.
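
A minimal sketch of point 8, assuming a badge controller that syncs a local
allow-list while connectivity is up; the endpoint, cache path and field names
are made up for illustration and are not any vendor's actual API:

import json
import urllib.request
from urllib.error import URLError

# Hypothetical endpoint and cache path, for illustration only.
CLOUD_AUTH_URL = "https://access.example.net/api/v1/check"
LOCAL_CACHE_PATH = "/var/lib/door-controller/badge-cache.json"

def badge_allowed(badge_id: str, door_id: str, timeout: float = 2.0) -> bool:
    """Check a badge against the central service, falling back to a local cache.

    The cache is assumed to be synced periodically while connectivity is up,
    so the door keeps working (with slightly stale data) during an outage.
    """
    try:
        req = urllib.request.Request(
            CLOUD_AUTH_URL,
            data=json.dumps({"badge": badge_id, "door": door_id}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.load(resp).get("allowed", False)
    except (URLError, TimeoutError, OSError):
        # Internet/DNS/BGP is gone: consult the last-synced local allow-list.
        try:
            with open(LOCAL_CACHE_PATH) as fh:
                cache = json.load(fh)
            return badge_id in cache.get(door_id, [])
        except (OSError, json.JSONDecodeError):
            # No usable cache either: that's what the physical key (item 7) is for.
            return False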

On Tue, Oct 5, 2021 at 6:01 AM Jared Mauch  wrote:

>
>
> > On Oct 4, 2021, at 4:53 PM, Jorge Amodio  wrote:
> >
> > How come such a large operation does not have out-of-band access in
> > case of emergencies ???
> >
> >
>
> I mentioned to someone yesterday that most OOB systems _are_ the
> internet.  It doesn’t always seem like you need things like modems or
> dial-backup, or access to these services, except when you do it’s
> critical/essential.
>
> A few reminders for people:
>
> 1) Program your co-workers into your cell phone
> 2) Print out an emergency contact sheet
> 3) Have a backup conference bridge/system that you test
>   - if zoom/webex/ms are down, where do you go?  Slack?  Google meet?
> Audio bridge?
>   - No judgement, but do test the system!
> 4) Know how to access the office and who is closest.
>   - What happens if they are in the hospital, sick or on vacation?
> 5) Complacency is dangerous
>   - When the tools “just work” you never imagine the tools won’t work.
> I’m sure the lessons learned will be long internally.
>   - I hope they share them externally so others can learn.
> 6) No really, test the backup process.
>
>
>
> * interlude *
>
> Back at my time at 2914 - one reason we all had T1’s at home was largely
> so we could get in to the network should something bad happen.  My home IP
> space was in the router ACLs.  Much changed since those early days as this
> network became more reliable.  We’ve seen large outages in the past 2 years
> of platforms, carriers, etc.. (the Aug 30th 2020 issue is still firmly in
> my memory).
>
> Plan for the outages and make sure you understand your playbook.  It may
> be from snow day to all hands on deck.  Test it at least once, and ideally
> with someone who will challenge a few assumptions (eg: that the cell
> network will be up)
>
> - Jared



-- 
Jeff Shultz




Re: Disaster Recovery Process

2021-10-05 Thread Sean Donelan

On Wed, 6 Oct 2021, Karl Auer wrote:

I'd add one "soft" list item:

- in your emergency plan, have one or two people nominated who are VERY
high up in the organisation. Their lines need to be open to the
decisionmakers in the emergency team(s). Their job is to put the fear
of a vengeful god into any idiot who tries to interfere with the
recovery process by e.g. demanding status reports at ten-minute
intervals.


A good idea I learned was to designate separate "executive" and "incident 
command" conference rooms.


Executives are only allowed in the executive conference room.  Executives 
are NOT allowed in any NOC/SOC/operations areas.  The executive conference 
room was well stocked with coffee, snacks, TVs, monitors, paper and 
easels.


An executive was anyone with a CxO, General Counsel, EVP, VP, etc. title. 
You know who you are :-)


One operations person (e.g. the Director of Operations or a designee for the 
shift) would brief the executives when they wanted something, and take their 
suggestions back to the incident room.  The Incident Commander was God as far 
as the incident was concerned, with a pre-approved emergency budget 
authorization.


One compromise: we did allow one lawyer in the incident command conference 
room, but it was NOT the corporate General Counsel.


Re: Disaster Recovery Process

2021-10-05 Thread Jared Mauch



> On Oct 5, 2021, at 10:05 AM, Karl Auer  wrote:
> 
> On Tue, 2021-10-05 at 08:50 -0400, Jared Mauch wrote:
>> A few reminders for people:
>> [excellent list snipped]
> 
> I'd add one "soft" list item:
> 
> - in your emergency plan, have one or two people nominated who are VERY
> high up in the organisation. Their lines need to be open to the
> decisionmakers in the emergency team(s). Their job is to put the fear
> of a vengeful god into any idiot who tries to interfere with the
> recovery process by e.g. demanding status reports at ten-minute
> intervals.

At $dayjob we split the technical updates onto a different bridge from the 
business updates.

There is a dedicated team that coordinates the entire thing; incidents can be 
low severity (risk) or high severity (whole-business impacting).

They provide the timeline to the next update and communicate what tasks are 
being done.  There’s even training on how to be an SME in the environment.  

Nothing is perfect, but this runs very smoothly at $dayjob.

- Jared



Re: Disaster Recovery Process

2021-10-05 Thread Karl Auer
On Tue, 2021-10-05 at 08:50 -0400, Jared Mauch wrote:
> A few reminders for people:
> [excellent list snipped]

I'd add one "soft" list item:

- in your emergency plan, have one or two people nominated who are VERY
high up in the organisation. Their lines need to be open to the
decisionmakers in the emergency team(s). Their job is to put the fear
of a vengeful god into any idiot who tries to interfere with the
recovery process by e.g. demanding status reports at ten-minute
intervals.

Regards, K.

-- 
~~~
Karl Auer (ka...@biplane.com.au)
http://www.biplane.com.au/kauer

GPG fingerprint: 61A0 99A9 8823 3A75 871E 5D90 BADB B237 260C 9C58
Old fingerprint: 2561 E9EC D868 E73C 8AF1 49CF EE50 4B1D CCA1 5170





Disaster Recovery Process

2021-10-05 Thread Jared Mauch



> On Oct 4, 2021, at 4:53 PM, Jorge Amodio  wrote:
> 
> How come such a large operation does not have out-of-band access in case 
> of emergencies ???
> 
> 

I mentioned to someone yesterday that most OOB systems _are_ the internet.  It 
doesn’t always seem like you need things like modems or dial-backup, or access 
to these services, except when you do it’s critical/essential.

A few reminders for people:

1) Program your co-workers into your cell phone
2) Print out an emergency contact sheet
3) Have a backup conference bridge/system that you test
  - if zoom/webex/ms are down, where do you go?  Slack?  Google meet? Audio 
bridge?
  - No judgement, but do test the system!
4) Know how to access the office and who is closest.  
  - What happens if they are in the hospital, sick or on vacation?
5) Complacency is dangerous
  - When the tools “just work” you never imagine the tools won’t work.  I’m 
sure the lessons learned will be long internally.  
  - I hope they share them externally so others can learn.
6) No really, test the backup process.



* interlude *

Back at my time at 2914 - one reason we all had T1’s at home was largely so we 
could get in to the network should something bad happen.  My home IP space was 
in the router ACLs.  Much changed since those early days as this network became 
more reliable.  We’ve seen large outages in the past 2 years of platforms, 
carriers, etc.. (the Aug 30th 2020 issue is still firmly in my memory).  
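
For anyone codifying that idea today, a small sketch of keeping operator home
prefixes in a management ACL; the prefixes are documentation space and the
rendered IOS-flavoured output is only a template, not a tested device config:

import ipaddress

# Hypothetical operator home prefixes, kept current in version control.
HOME_PREFIXES = [
    ipaddress.ip_network("192.0.2.0/29"),        # documentation space only
    ipaddress.ip_network("2001:db8:1234::/48"),
]

def may_reach_mgmt(src: str) -> bool:
    """True if a source address falls inside one of the permitted home prefixes."""
    addr = ipaddress.ip_address(src)
    return any(addr in net for net in HOME_PREFIXES if addr.version == net.version)

def render_acl(name: str = "MGMT-IN") -> str:
    """Render an illustrative, IOS-flavoured standard ACL from the prefix list.

    Exact syntax varies by platform; treat this as a template to review, not gospel.
    """
    lines = [f"ip access-list standard {name}"]
    lines += [f" permit {net.network_address} {net.hostmask}"
              for net in HOME_PREFIXES if net.version == 4]
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_acl())
    print(may_reach_mgmt("192.0.2.5"))   # True with the example prefix above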

Plan for the outages and make sure you understand your playbook.  It may be 
from snow day to all hands on deck.  Test it at least once, and ideally with 
someone who will challenge a few assumptions (eg: that the cell network will be 
up)

- Jared