
[sysadmin] on-call schedule - Always you


Posts

  • zagdrobzagdrob Registered User regular
    I eavesdropped on a call my wife had yesterday with her new job, now working in IT (one of us, one of us), where they were discussing disaster recovery plans. Part of the discussion had to do with the backup data center being close (< 1 mile) to the primary data center, and how it's not really redundant.

    One of the people on the call got very, very concerned about what would happen if there was a nuclear attack on the city, how widely spaced the data centers are, what nations might attack with which kilotonnage / megatonnage, and how far apart the sites would need to be. It was a hilarious spiral because you had two of 'those guys' who were feeding off each other. They were talking about nukemaps and overlays of how far apart / what would be needed and how reinforced the structure would be.

    I did get a bit of respect for my wife's new boss because he gave them a minute or two to riff on it before bringing it all back to earth by telling them that he's worried about a natural disaster like a large tornado, not a nuclear attack, because there is no situation where, if there is a nuclear attack, data center redundancy for those services matters in the short term. He gave a concise and reasonable answer about why and then moved everyone on to legit and reasonable disaster recovery concerns.

    I wish my boss did as good a job keeping meetings on track when people start going into the weeds.

  • ThawmusThawmus +Jackface Registered User regular
    twmjr wrote: »
    Thawmus wrote: »
    Feral wrote: »
    schuss wrote: »
    Feral wrote: »
    What's your plan if Azure is inaccessible?

    Considering if that happens, Azure AD is down and basically the entirety of the business world is down, I'd say "go to an early lunch".

    There are a lot of situations where your cloud resources in Azure might be inaccessible without it being 'azure, everywhere, is down'

    Fiber cut outside our HQ as the example I used above

    I mean if you have Azure and you don't have redundant Internet access then I don't know what the fuck you're doing.
    would you be interested in my new book? it's called:

    "When End to End Circuit Diversity Isn't Really End to End: How to Make Customers Angry with a Single Backhoe"

    While I agree with this on its face, if you're getting transport links to different datacenters through different fiber providers, and they're not shared providers, you really avoid this problem. If you want to be really picky you can also use different backbone providers at those datacenters as well (although backbone transports from datacenter to datacenter are extremely cheap and awesome and hard to pass up). Not only does this offer you more flexibility in these scenarios, but depending on how old the fiber in your area is, it can also have the merit of being cheaper and faster than your current service. This is a strategy I've proposed to ownership and it literally came in at $2,000 less/month for the entire enterprise for 4x the throughput.
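    To make the "diverse circuits" point concrete, here's a minimal sketch, not anything from an actual deployment (provider, backbone, and conduit names are all made up), of the kind of check two circuits should pass before anyone calls them redundant:

    ```python
    # Toy model: two circuits are only "diverse" if they share no obvious
    # single point of failure. All names below are placeholders.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Circuit:
        name: str
        last_mile_provider: str
        backbone_provider: str
        building_entrance: str  # which conduit/entrance the fiber physically uses

    def shared_failure_points(a: Circuit, b: Circuit) -> list:
        """Return the single points of failure the circuits share (empty == actually diverse)."""
        shared = []
        if a.last_mile_provider == b.last_mile_provider:
            shared.append("same last-mile provider")
        if a.backbone_provider == b.backbone_provider:
            shared.append("same backbone provider")
        if a.building_entrance == b.building_entrance:
            shared.append("same building entrance (one backhoe gets both)")
        return shared

    primary = Circuit("HQ primary", "FiberCoA", "BackboneX", "north vault")
    backup = Circuit("HQ backup", "FiberCoB", "BackboneX", "south vault")

    problems = shared_failure_points(primary, backup)
    print("diverse" if not problems else "not really diverse: " + ", ".join(problems))
    ```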

    If the problem is so big that it affects multiple backbone providers, or Azure's end is down in any way, you go back in the flowchart to: "Not our problem, time to pass the case of beer around the office." At some point, you pitched Azure, or someone else did, and upper management signed off on it. Either you explained the risks, or you didn't. If the concern is that upper management won't remember these risks or care when shit's down, again I circle back to, "Not my problem." If they want to fire you because they're fucking stupid then they can enjoy trying to hire IT staff in today's environment. Goes X2 Combo if they're unwilling to let IT staff work from home. In the meantime you collect the 20 jobs on Indeed that offer pay raises and spin the wheel.

    If your goal is to accomplish the herculean task of having more redundancy and uptime than a megacorp, I humbly request that you ask for a lot more fucking money to do your job.

    Twitch: Thawmus83
  • wunderbarwunderbar What Have I Done? Registered User regular
    That's great. I've definitely been in meetings like that before. At some point you just have to say "if x happens literally no one is going to care if our business is still running that day"


    In a purely practical sense, I did work for a company that built a new head office in the blast zone of an oil refinery. Being in the meetings about how the company had to buy a specific kind of blast-proof glass, or have concrete walls be so many inches thicker than in normal buildings, etc., was actually kind of fascinating.

    The end result of that was we ended up having to install about 2x as many wifi access points as you would think we would need, and we had to contract a company to install a bunch of cellular repeaters in the office because, and this may surprise you all, we had basically built a giant faraday cage. Before we got the repeaters installed I had no cell signal in my office. It was great.

    XBL: thewunderbar PSN: thewunderbar NNID: thewunderbar Steam: wunderbar87 Twitter: wunderbar
  • shadowaneshadowane Registered User regular
    edited January 2022
    This image is so true:
    6ghcfknz0tpu.jpg

    shadowane on
  • FeralFeral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉRegistered User regular
    Thawmus wrote: »
    Feral wrote: »
    schuss wrote: »
    Feral wrote: »
    What's your plan if Azure is inaccessible?

    Considering if that happens, Azure AD is down and basically the entirety of the business world is down, I'd say "go to an early lunch".

    There are a lot of situations where your cloud resources in Azure might be inaccessible without it being 'azure, everywhere, is down'

    Fiber cut outside our HQ as the example I used above

    I mean if you have Azure and you don't have redundant Internet access then I don't know what the fuck you're doing.

    You're probably working for a company whose attitude towards the cloud is "let the cloud handle it so we don't have to invest in on-prem infrastructure"
    schuss wrote: »
    I don't think anything about that is materially different from how you'd handle in an on-prem era

    Right. That's what I'm saying. Moving to cloud doesn't abrogate the need for a disaster recovery plan. Not that cloud is necessarily worse than on-prem.

    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.
  • FeralFeral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉRegistered User regular
    edited January 2022
    Don't hyperfocus too much on the Internet outage. That's an example.

    There were two widespread Azure outages in 2021 (one in April, another in October).

    I've encountered a lot of smaller-scale outages over the years. Stuff that doesn't make it to the news, but is localized to specific customers or specific nodes.

    Nor am I picking on Azure. My point is that managers and execs will drink the marketing koolaid that cloud is sooooooo reliable so we don't have to spend any resources or time or labor or thought on DR, diversifying our infrastructure, building out hybrid, training employees on offline procedures, etc.

    If your execs are genuinely on board with "take an early lunch" or "hang up a sign" or "pass around a case of beer" as your disaster plans, then great. In my experience, that's not really what happens - they assume incorrectly that they'll never have an outage again... then when the outage inevitably happens, they bark at IT to fix it.

    Feral on
    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.
  • ThawmusThawmus +Jackface Registered User regular
    I mean I feel like the challenge has less to do with Azure and more to do with stupid execs and unfortunately there's no cure for that.

    If your position is that Azure and/or cloud services are uniquely positioned to let execs be stupider than normal, then I'm afraid I'd have to disagree, they really have no lower limit on dumb shit and if they weren't putting faith in the cloud they'd be putting faith in you, whether they give you resources or not, whether you made promises to them or not, to give them 100% uptime. I'm not sure I understand why one is preferable over the other if the goal is literally to keep management off your back.

    Twitch: Thawmus83
  • FeralFeral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉRegistered User regular
    edited January 2022
    Thawmus wrote: »
    I mean I feel like the challenge has less to do with Azure and more to do with stupid execs and unfortunately there's no cure for that.

    If your position is that Azure and/or cloud services are uniquely positioned to let execs be stupider than normal, then I'm afraid I'd have to disagree, they really have no lower limit on dumb shit and if they weren't putting faith in the cloud they'd be putting faith in you, whether they give you resources or not, whether you made promises to them or not, to give them 100% uptime. I'm not sure I understand why one is preferable over the other if the goal is literally to keep management off your back.

    My position is that disaster plans like "take a nap," "take an early lunch," "crack open a beer," aren't acceptable to the majority of organizations or customers, in my admittedly anecdotal experience.

    To put it a different way, the dichotomy of "if they weren't putting faith in the cloud they'd be putting faith in you" doesn't really exist. Maybe it does in some places. But in most, IT = cloud and cloud = IT. I don't get to say "it's in the cloud, it's not my problem."

    Feral on
    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.
  • ThawmusThawmus +Jackface Registered User regular
    Feral wrote: »
    Thawmus wrote: »
    I mean I feel like the challenge has less to do with Azure and more to do with stupid execs and unfortunately there's no cure for that.

    If your position is that Azure and/or cloud services are uniquely positioned to let execs be stupider than normal, then I'm afraid I'd have to disagree, they really have no lower limit on dumb shit and if they weren't putting faith in the cloud they'd be putting faith in you, whether they give you resources or not, whether you made promises to them or not, to give them 100% uptime. I'm not sure I understand why one is preferable over the other if the goal is literally to keep management off your back.

    My position is that disaster plans like "take a nap," "take an early lunch," "crack open a beer," aren't acceptable to the majority of organizations or customers, in my admittedly anecdotal experience.

    To put it a different way, the dichotomy of "if they weren't putting faith in the cloud they'd be putting faith in you" doesn't really exist. Maybe it does in some places. But in most, IT = cloud and cloud = IT. I don't get to say "it's in the cloud, it's not my problem."

    These are pretty clearly hyperbole, man. None of us are actually doing that shit in an outage. I'm sending emails or calling execs and letting them know what the situation is, at a minimum. But the situation being communicated can absolutely be, "There's a mass outage, it's out of our hands."

    And if it can't be that, if that kind of communication is 100% unacceptable, man you should find a new job because even my shitty bosses will accept that kind of answer. The worst they'll do is ask to have a meeting 3 days later to discuss what happened and what can be done about it going forward, and even then when I explain the pros and cons of our strategies, they'll either accept it or ask for changes, and not get all hrrgllbrrgll. And I'm not even in a position to blame shit on a megacorp, I've got zero cloud services running in our org, often to my chagrin.

    Twitch: Thawmus83
  • FeralFeral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉRegistered User regular
    edited January 2022
    Thawmus wrote: »
    Feral wrote: »
    Thawmus wrote: »
    I mean I feel like the challenge has less to do with Azure and more to do with stupid execs and unfortunately there's no cure for that.

    If your position is that Azure and/or cloud services are uniquely positioned to let execs be stupider than normal, then I'm afraid I'd have to disagree, they really have no lower limit on dumb shit and if they weren't putting faith in the cloud they'd be putting faith in you, whether they give you resources or not, whether you made promises to them or not, to give them 100% uptime. I'm not sure I understand why one is preferable over the other if the goal is literally to keep management off your back.

    My position is that disaster plans like "take a nap," "take an early lunch," "crack open a beer," aren't acceptable to the majority of organizations or customers, in my admittedly anecdotal experience.

    To put it a different way, the dichotomy of "if they weren't putting faith in the cloud they'd be putting faith in you" doesn't really exist. Maybe it does in some places. But in most, IT = cloud and cloud = IT. I don't get to say "it's in the cloud, it's not my problem."

    These are pretty clearly hyperbole, man. None of us are actually doing that shit in an outage. I'm sending emails or calling execs and letting them know what the situation is, at a minimum. But the situation being communicated can absolutely be, "There's a mass outage, it's out of our hands."

    I know "take a nap" is hyperbole. I'm speaking to any variation of "it's out of our hands."

    Sometimes "it's out of our hands" is a cromulent answer, for low criticality services. But in my experience, if a cloud service is high-criticality (eg, it shuts down the business when it's down), then it very much is my problem and it very much is in my (and my team's) hands.

    Thawmus wrote: »
    And if it can't be that, if that kind of communication is 100% unacceptable, man you should find a new job because even my shitty bosses will accept that kind of answer.

    I'm not basing this on a single job or a single boss. I'm basing it on different management and different companies. I admit my experience is anecdotal, but it's pretty vast. Any technical problem that causes a major work stoppage is usually going to be IT's problem.

    Feral on
    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.
  • ThawmusThawmus +Jackface Registered User regular
    See stuff like this makes me really think if my job here ever folds up I'm just out of this industry. Even if everything is completely shitted up, and you're completely at the mercy of other parties fixing it, you can't tell your bosses that? What do they expect you to do? Do they know? What do they do? Do you put on a performance? I don't understand.

    Like if your exchange server takes a shit, I understand them being up your ass until you fix it. I hate it to death, but I understand it. But if a major datacenter gets hit by a tornado and your entire state is without Internet for a week or three, what is the DR strategy they want from you? What is their strategy for that hellish scenario? Because that shit has happened. That shit can still happen. Almost all of Nebraska runs out of one datacenter in Omaha. One good tornado, and the downtime is estimated to be 3 weeks for the whole goddamn state, all carriers down. What is your team supposed to do, then? What is the expectation? Because if it's not telling them the goddamn truth, and working with them on a strategy for recovery, I'd be fucking outta there.

    Because I'm curious, when these things happen, who is communicating to the heads of these companies, is it you, or are you giving these answers to another manager who is interpreting and relaying this shit to someone else? Because I gotta tell you, I've seen some night and day difference now that I'm the one calling the owner and CEO and telling them what's up. I don't know what the fuck they were being told before but it sure as shit wasn't what I was telling my supervisor during these events.

    Twitch: Thawmus83
  • schussschuss Registered User regular
    The thing is - business critical apps and capabilities will cost real money to have a good DR plan. That means running active/active or similar. Things like fiber cuts? Either pay for the alternate provider or plan on employees relocating for the duration. Your critical tier should be a minimum of services, as the random employee lunch portal etc. can be down for a day.
    That said - critical services need to have failover designed in at the development stage, not just foisted on sysadmins or a similar tier, because the interconnected nature of today's world means that if it's not architected with failover baked in, it's going to be an absolute nightmare in a real DR scenario. Also, if you are truly gaming out "Azure is down" scenarios and have the decree of "it needs to be up", get ready to sign big contracts with AWS or GCP as an alternate provider AND put the handcuffs on your devs as you can no longer use native services given the strategy. Oh, and now you need to replicate your permissions and security layers as well, have fun. The whole point of the cloud is risk transfer to the cloud provider, so if people want to ensure it's up no matter what, it's not as simple as having an alternate deploy location. If this is truly needed, you also need to do it for the whole company and systematically enforce which calls and tech is allowed in critical tier apps as part of code review.
    At some point, it's ok being down for a period if the proverbial neck to choke isn't yours, so when execs sign big cloud contracts, it's important that they understand what problems are no longer yours.
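    As a purely illustrative aside (the service names and targets below are invented, not from anyone's actual shop), the "critical tier should be a minimum of services" idea gets much easier to enforce once the tiers and recovery targets are written down somewhere explicit, even in something as small as this:

    ```python
    # Sketch of a service catalog with criticality tiers and recovery targets,
    # so DR spend can be matched to what actually has to stay up.
    from dataclasses import dataclass

    @dataclass
    class Service:
        name: str
        tier: int          # 1 = active/active, 2 = warm standby, 3 = "down for a day is fine"
        rto_hours: float   # recovery time objective

    catalog = [
        Service("claims-processing", tier=1, rto_hours=0.5),
        Service("email", tier=2, rto_hours=4),
        Service("employee-lunch-portal", tier=3, rto_hours=72),
    ]

    for svc in sorted(catalog, key=lambda s: s.tier):
        print(f"{svc.name:<25} tier {svc.tier}  RTO {svc.rto_hours}h")
    ```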

  • FeralFeral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉRegistered User regular
    edited January 2022
    Thawmus wrote: »
    See stuff like this makes me really think if my job here ever folds up I'm just out of this industry. Even if everything is completely shitted up, and you're completely at the mercy of other parties fixing it, you can't tell your bosses that? What do they expect you to do? Do they know? What do they do? Do you put on a performance? I don't understand.

    Well, some real-world emergency examples:

    - Internet was down due to fiber cut. CTO tells one part of IT department to go to an AT&T store, buy and activate all the wifi hotspots they can, while the other part of the IT department starts pulling old laptops out of our spares (and recycling) pile and provisioning them for users to be used with the hotspots. Meanwhile I just keep yelling at the ISP over the phone to fix it, even though I know that doesn't do any good, because that's what the CTO is telling me to keep doing.
    - A web app hosted on Azure and managed by a third-party is down. The first step here is to confirm that there's nothing "on our end" that could be causing the outage. So make sure there's no firewall rule blocking it. I know there isn't, but I have to check anyway. Same for web filtering. Same for routing tables. Once I've checked each and every downstream device, then IT starts implementing a workaround. In a particular situation that I'm thinking of, that workaround was taking a backup of the database behind this web app, importing it into a local SQL Server VM, getting that local copy of the SQL database online, and exporting CSVs of the data contained therein and putting them up on network drives so at least they can be read by the users who would normally rely on that web app. (A rough sketch of that export step follows this list.)
    - Very commonly at my current job: I have to get on the phone and call up the vendor responsible, and stay on the phone with them, and escalate to supervisors, and so on and so forth, because if there's nothing else I can do, at the very least I can "light a fire" or "keep the pressure on" to get the service back up. Personally, I know that this is typically useless, and probably counterproductive, but I have to show that I'm doing whatever I can to hasten the process.
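    Here's a rough, hypothetical sketch of the CSV-export workaround from the second example above. The server name, database, table list, and share path are placeholders, and it assumes pyodbc plus the Microsoft ODBC driver are installed:

    ```python
    # Dump tables from a restored local SQL Server copy to CSVs on a network
    # share so users get read-only access while the hosted app is down.
    import csv
    import pyodbc  # assumes the Microsoft ODBC driver is available

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localsqlvm;DATABASE=WebAppCopy;Trusted_Connection=yes;"
    )

    tables = ["dbo.Orders", "dbo.Customers"]  # whatever the users actually need

    for table in tables:
        cur = conn.cursor()
        cur.execute(f"SELECT * FROM {table}")
        out_path = rf"\\fileserver\outage-reports\{table.replace('dbo.', '')}.csv"
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cur.description])  # header row
            while True:
                rows = cur.fetchmany(5000)
                if not rows:
                    break
                writer.writerows(rows)
        cur.close()

    conn.close()
    ```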

    Non-emergency scenarios:

    - In the second example, we ended up setting up a VPN tunnel to Azure and doing SQL database synchronization to keep a live copy of the database in our on-prem datacenter. We didn't replicate the web app, so when the hosted web app failed, users were heavily impacted. But both IT and that department had people savvy enough in SQL to export reports and give users enough data to keep them productive through the outage. (I still had to go through the whole song and dance of proving that it's not "on our end" every time there was a problem with it.)
    - O365: we are implementing O365 this year. We started it last year but had to backburner it for other priorities. Our plan is to do hybrid Exchange; if O365 has a problem, then we still have the on-prem servers and vice versa.

    General principles that go into a lot of the cloud and/or hosted service adoptions I've been involved with:

    - implementing redundancy that isn't reliant on a single cloud provider. Maybe the redundancy is Azure + on-prem, or maybe it's Azure + Rackspace. For example, we have one critical system right now that has instances on Azure and instances on a smaller hosting provider (similar to Rackspace), and these instances replicate data with each other.
    - Having backups that aren't also stored on that cloud provider, so we can restore from backup if necessary (see the sketch after this list).
    - Reports or data exports on file shares or in other databases so end users don't have to wait for us to restore from backup to at least have read-only access.
    - Making sure staff is trained in offline procedures so they don't just sit on their hands when the hosted service is down.
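    A minimal sketch of the off-provider backup copy idea, purely illustrative (the paths are placeholders, and a real job would also need retention, scheduling, and alerting):

    ```python
    # Copy the newest backup file to a second, independent location
    # (different provider or on-prem) and verify the copy by checksum.
    import hashlib
    import shutil
    from pathlib import Path

    def sha256(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        return h.hexdigest()

    source_dir = Path(r"\\azure-backup-share\nightly")   # primary backup target
    offsite_dir = Path(r"\\onprem-nas\backup-copies")    # independent copy

    latest = max(source_dir.glob("*.bak"), key=lambda p: p.stat().st_mtime)
    copy = offsite_dir / latest.name
    shutil.copy2(latest, copy)

    if sha256(latest) != sha256(copy):
        raise RuntimeError(f"checksum mismatch copying {latest.name}")
    print(f"copied and verified {latest.name}")
    ```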

    If all else fails:

    - Getting decision-makers to explicitly commit in writing when they accept the risk of relying entirely on a single cloud provider. Preferably in the form of a formal risk assessment, but sometimes just in an email, so when the crisis is over and folks have calmed down, you have something to point at to say (paraphrased of course), "Look, I told you this would happen someday, and you accepted that risk, so you really shouldn't have been crawling so far up my ass during the outage." That doesn't necessarily stop them from freaking the fuck out when the cloud is inaccessible, but I've found it helps.

    Thawmus wrote: »
    Like if your exchange server takes a shit, I understand them being up your ass until you fix it. I hate it to death, but I understand it. But if a major datacenter gets hit by a tornado and your entire state is without Internet for a week or three, what is the DR strategy they want from you? What is their strategy for that hellish scenario? Because that shit has happened. That shit can still happen. Almost all of Nebraska runs out of one datacenter in Omaha. One good tornado, and the downtime is estimated to be 3 weeks for the whole goddamn state, all carriers down. What is your team supposed to do, then? What is the expectation? Because if it's not telling them the goddamn truth, and working with them on a strategy for recovery, I'd be fucking outta there.

    Because I'm curious, when these things happen, who is communicating to the heads of these companies, is it you, or are you giving these answers to another manager who is interpreting and relaying this shit to someone else? Because I gotta tell you, I've seen some night and day difference now that I'm the one calling the owner and CEO and telling them what's up. I don't know what the fuck they were being told before but it sure as shit wasn't what I was telling my supervisor during these events.

    Here's what I'm getting at, and why the tornado isn't analogous. Microsoft is an IT vendor. Amazon is an IT vendor. AWS and Azure are IT services. That means that we're responsible when they fail.

    The CTO or CIO or VP of IT Operations chose Azure, they're presenting Azure to the organization as the foundation for the services (in an ITIL sense) that IT offers to the business.

    Maybe the decision to go to Azure or AWS wasn't made by IT. That sucks, but users (including other department managers) aren't going to understand that. That's why this isn't just a situation of one bad boss or one bad company. Cloud is just another way for IT to manage our infrastructure. But we're still managing it.

    Non-techies are usually more understanding about bona fide natural disasters. IT didn't implement the tornado.


    Feral on
    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.
  • FeralFeral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉRegistered User regular
    edited January 2022
    schuss wrote: »
    The thing is - business critical apps and capabilities will cost real money to have a good DR plan. That means running active/active or similar. Things like fiber cuts? Either pay for the alternate provider or plan on employees relocating for the duration. Your critical tier should be a minimum of services, as the random employee lunch portal etc. can be down for a day.
    That said - critical services need to have failover designed in at the development stage, not just foisted on sysadmins or a similar tier, because the interconnected nature of today's world means that if it's not architected with failover baked in, it's going to be an absolute nightmare in a real DR scenario. Also, if you are truly gaming out "Azure is down" scenarios and have the decree of "it needs to be up", get ready to sign big contracts with AWS or GCP as an alternate provider AND put the handcuffs on your devs as you can no longer use native services given the strategy. Oh, and now you need to replicate your permissions and security layers as well, have fun. The whole point of the cloud is risk transfer to the cloud provider, so if people want to ensure it's up no matter what, it's not as simple as having an alternate deploy location. If this is truly needed, you also need to do it for the whole company and systematically enforce which calls and tech is allowed in critical tier apps as part of code review.
    At some point, it's ok being down for a period if the proverbial neck to choke isn't yours, so when execs sign big cloud contracts, it's important that they understand what problems are no longer yours.

    BTW, I don't work for tech companies, I work for non-tech companies. So my company can't really do code review or implement coding standards on devs.

    (at my current job we do have devs that work on internal tools, and web devs, and such, but none of them are doing any sort of cloud-native or cloud-first apps)

    We can (and do) ask questions like, "Can this software work in a hybrid environment (cloud + on-prem)?" "Am I tied to a single cloud or hosting provider?" "Can we run geographically-distributed redundant instances of this software?" etc.

    Feral on
    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.
  • schussschuss Registered User regular
    Feral wrote: »
    schuss wrote: »
    The thing is - business critical apps and capabilities will cost real money to have a good DR plan. That means running active/active or similar. Things like fiber cuts? Either pay for the alternate provider or plan on employees relocating for the duration. Your critical tier should be a minimum of services, as the random employee lunch portal etc. can be down for a day.
    That said - critical services need to have failover designed in at the development stage, not just foisted on sysadmins or a similar tier, because the interconnected nature of today's world means that if it's not architected with failover baked in, it's going to be an absolute nightmare in a real DR scenario. Also, if you are truly gaming out "Azure is down" scenarios and have the decree of "it needs to be up", get ready to sign big contracts with AWS or GCP as an alternate provider AND put the handcuffs on your devs as you can no longer use native services given the strategy. Oh, and now you need to replicate your permissions and security layers as well, have fun. The whole point of the cloud is risk transfer to the cloud provider, so if people want to ensure it's up no matter what, it's not as simple as having an alternate deploy location. If this is truly needed, you also need to do it for the whole company and systematically enforce which calls and tech is allowed in critical tier apps as part of code review.
    At some point, it's ok being down for a period if the proverbial neck to choke isn't yours, so when execs sign big cloud contracts, it's important that they understand what problems are no longer yours.

    BTW, I don't work for tech companies, I work for non-tech companies. So my company can't really do code review or implement coding standards on devs.

    (at my current job we do have devs that work on internal tools, and web devs, and such, but none of them are doing any sort of cloud-native or cloud-first apps)

    We can (and do) ask questions like, "Can this software work in a hybrid environment (cloud + on-prem)?" "Am I tied to a single cloud or hosting provider?" "Can we run geographically-distributed redundant instances of this software?" etc.

    I work for an insurance company. If you aren't doing code reviews, don't develop software. Full stop. Everything should get two sets of eyes before hitting a production environment.

    In your examples:
    3rd party managed app is down - that's the 3rd party's problem and there should be performance standards in the contract. If you're fighting vendors constantly, you need better contracts people to ensure that when they go down they're the ones eating the cost and headache.
    Internet down - to Thawmus' point - there should be mitigation plans that aren't "go fucking waste money at AT&T". That can be table-planned pretty easily with equipment or redundancy secured ahead of time.

    Cloud is a way to transfer the risk and responsibility of uptime and scaling to the cloud vendor.

    All the things you describe are management not knowing how to manage modern technology environments. If IT isn't in the conversation and hasn't agreed upon SLAs covering your outages vs. cloud outages before contracts are signed, your management is out to lunch and has no business being decision-makers, as they clearly have no agency in broader decisions or standards that directly impact their staff and products.

  • FeralFeral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉRegistered User regular
    edited January 2022
    schuss wrote: »
    All the things you describe are management not knowing how to manage modern technology environments.

    Well, yeah. Of course. Most managers & execs don't know how to manage tech. (The ones that do often have other fatal flaws.) That's not a surprise for anybody in this thread. Or really anybody who has worked in corporate America for any length of time. Most businesses are kakistocracies, in general, and especially when it comes to technology.

    Feral on
    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.
  • schussschuss Registered User regular
    Feral wrote: »
    schuss wrote: »
    All the things you describe are management not knowing how to manage modern technology environments.

    Well, yeah. Of course. Most managers & execs don't know how to manage tech. (The ones that do often have other fatal flaws.) That's not a surprise for anybody in this thread. Or really anybody who has worked in corporate America for any length of time. Most businesses are kakistocracies, in general, and especially when it comes to technology.

    No, your technology management is what I'm referring to. Part of that job is breaking down minimum investment and standards around technology things, just as a marketing or ops manager does the same for their discipline.
    The contract stuff should be standard, as would you stand on your building ops person to fix things if the janitors don't show up or go after their contract and hire new ones (if needed)?

  • FeralFeral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉRegistered User regular
    edited January 2022
    schuss wrote: »
    Feral wrote: »
    schuss wrote: »
    All the things you describe are management not knowing how to manage modern technology environments.

    Well, yeah. Of course. Most managers & execs don't know how to manage tech. (The ones that do often have other fatal flaws.) That's not a surprise for anybody in this thread. Or really anybody who has worked in corporate America for any length of time. Most businesses are kakistocracies, in general, and especially when it comes to technology.

    No, your technology management is what I'm referring to. Part of that job is breaking down minimum investment and standards around technology things, just as a marketing or ops manager does the same for their discipline.
    The contract stuff should be standard, as would you stand on your building ops person to fix things if the janitors don't show up or go after their contract and hire new ones (if needed)?

    Well, let's run with this analogy for a moment.

    Let's say that the VP of Facilities signed a contract with a custodial firm that, for whatever reason, once in a while left a mess so bad that it caused a work stoppage. I dunno, let's imagine that we're working onsite, and garbage piles up in the break rooms so badly that it starts to stink and people sitting in the vicinity have to vacate.

    Do you think the VP of Facilities would get away with saying, "oh, it's okay, we have a clause in our contract with them that says that we get a refund if they fuck up like this?" (Basically, an SLA.) Do you think that the business (either the execs, or the users) would tolerate the facilities staff sitting around and acting like this is an acceptable situation to be in once every year or two?

    No, that response wouldn't be tolerable to anybody. The goal of the custodial service is to keep the building clean. The goal of IT is to keep systems online. If the facilities VP's contracted custodians failed to take out the trash, somebody down the org chart in facilities would have to get up from their desk and do it themselves. When system-critical infrastructure goes down, even if a contractor is at fault, IT has to mobilize and find a way to work around it. Not because that's the CTO's (or the VP of Facilities's) expectation, but because that's the expectation from the organization at large.

    Users do not care what your SLAs are. Their managers don't care what your SLAs are. Non-IT executives don't care what your SLAs are. The board doesn't care what your SLAs are. "We've transferred the risk of a system outage to our outsourced infrastructure vendor" isn't an acceptable answer to the VP of HR who is on the phone to the CTO, griping about how it's open enrollment season and they can't get to their benefits documents. And in my experience, if the CTO is on the receiving end of that phone call, he's going to seagull in and start lighting fires under IT to fix it - even if it was the CTO's decision to put us in that position in the first place.

    Does that make the CTO a bad manager? Yeah, probably. There's no law that says that CTOs or CIOs have to be good managers. So when you say "that's bad management!" my reaction is "Yeah? So what? Most managers are bad in some way or another." Good managers are like good cops: the ones who are individually good are overpowered by systemic problems and perverse incentives.

    The friction here is how he's a bad manager. Your position seems to be (if you don't mind me paraphrasing) that he's a bad manager because he's making lower-level IT staff responsible for a system outage that is beyond our control. My position is a bit different. IT is responsible for keeping critical systems online (regardless of the vendors or the contracts or the SLAs). No amount of expectation management will change that. He's a bad manager for putting IT in a position where we couldn't fulfill our expectations. He'd still be a bad manager even if he didn't come in and start ordering people to buy hotspots from AT&T.

    You and I seem to have at least partly compatible outlooks? You said this:
    The thing is - business critical apps and capabilities will cost real money to have a good DR plan. That means running active/active or similar. Things like fiber cuts? Either pay for the alternate provider or plan on employees relocating for the duration. Your critical tier should be a minimum of services, as the random employee lunch portal etc. can be down for a day.

    Which is compatible with my position here.

    The CTO in my example was a bad manager when he rejected my proposal for redundant Internet service, prior to that (inevitable) outage. The AT&T hotspot incident was just the predictable result of him being a grasshopper carried by ants in that Aesop's fable.

    Feral on
    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.
  • That_GuyThat_Guy I don't wanna be that guy Registered User regular
    God, Feral. Your job sounds like an absolute nightmare. If your bosses are breathing down your neck so hard that you have to pretend to look busy during an outage you have no control over, is it even worth it to keep the job? Do you even have time to take breaks during the day?

  • SiliconStewSiliconStew Registered User regular
    You seem to be assuming 100% uptime is both possible and a reasonable expectation, and implying every possible problem can be directly fixed by your own staff. Of course people want things to work all the time; that doesn't make it a reasonable request, and failure to manage and communicate those expectations throughout the organization, on uptime or on what fixes can reasonably be implemented during an outage, is a failure in IT management.

    Yes, you need to work to remove single points of failure from your environment to the extent possible, but if it's an outage with a 3rd party, the only thing you may be reasonably able to do is keep people in the org up to date on repair ETA's. We use O365 because MS can provide far better infrastructure redundancy than we can for business-critical email, yet their SLA is still only three 9s. When O365 goes down, IT's response, and CxO expectations, should not be to attempt to reimplement your enterprise email system in a panic. You suggest people temporarily use alternative communication means and wait for MS to fix the issue. To counter your garbage collection example, you don't suggest your own IT staff fix MS' problems.
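    For scale, the back-of-the-envelope arithmetic on what a 99.9% ("three nines") SLA actually permits:

    ```python
    # What "three nines" allows in plain downtime terms.
    sla = 0.999
    print(f"allowed downtime per year:  {(1 - sla) * 365 * 24:.1f} hours")          # ~8.8 hours
    print(f"allowed downtime per month: {(1 - sla) * 30 * 24 * 60:.0f} minutes")    # ~43 minutes
    ```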

    Just remember that half the people you meet are below average intelligence.
  • FeralFeral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉRegistered User regular
    That_Guy wrote: »
    God, Feral. Your job sounds like an absolute nightmare. If your bosses are breathing down your neck so hard that you have to pretend to look busy during an outage you have no control over, is it even worth it to keep the job? Do you even have time to take breaks during the day?

    I'm not basing this on a single job.

    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.
  • FeralFeral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉRegistered User regular
    edited January 2022
    We use O365 because MS can provide far better infrastructure redundancy than we can for business-critical email, yet their SLA is still only three 9s.

    Why not both?

    When O365 is inaccessible (rare, but it does happen), the on-prem servers in your hybrid Exchange deployment can take over.

    Bonus points: many companies have to run their own SMTP servers anyway for legacy devices and applications that need to relay but don't support the authentication & encryption requirements of O365. I hate them, but we have a few of these. Or they point their backup solution to their on-prem mailbox server instead of trying to implement an O365-enabled backup solution.
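    As a hand-wavy illustration of that relay/fallback idea (hostnames, credentials, and recipients below are placeholders; this just shows the fallback logic, not how you'd actually wire up mail flow):

    ```python
    # Try O365 SMTP submission first; if it's unreachable, fall back to the
    # legacy on-prem relay. Everything here is a placeholder.
    import smtplib
    from email.message import EmailMessage

    def send(msg: EmailMessage) -> None:
        try:
            with smtplib.SMTP("smtp.office365.com", 587, timeout=10) as s:
                s.starttls()
                s.login("alerts@contoso.example", "app-password-here")
                s.send_message(msg)
        except (OSError, smtplib.SMTPException):
            # O365 unreachable or rejecting: use the on-prem relay instead
            with smtplib.SMTP("relay.internal.contoso.example", 25, timeout=10) as s:
                s.send_message(msg)

    msg = EmailMessage()
    msg["From"] = "alerts@contoso.example"
    msg["To"] = "oncall@contoso.example"
    msg["Subject"] = "relay failover test"
    msg.set_content("If you can read this, one of the two paths worked.")
    send(msg)
    ```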

    Feral on
    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.
  • That_GuyThat_Guy I don't wanna be that guy Registered User regular
    Feral wrote: »
    That_Guy wrote: »
    God, Feral. Your job sounds like an absolute nightmare. If your bosses are breathing down your neck so hard that you have to pretend to look busy during an outage you have no control over, is it even worth it to keep the job? Do you even have time to take breaks during the day?

    I'm not basing this on a single job.

    Your experience in IT has been very different from mine. I'm used to being buried under a pile of work but not like that. Sometimes you just have to tell people "no" or "just wait". In the most diplomatic way possible of course.

    Your example of the fiber cut is one I can personally directly address. Going out and buying a bunch of wifi hotspots while others dust off old laptops is probably the least effective way of addressing the problem. I'd argue that simply doing nothing would be a better solution. When I am in that situation I diplomatically encourage everyone to chill the fuck out and/or work from home until they fix it. Then I would get on my distributor's website and overnight-order a 4G modem (like a FortiExtender) that I can plug directly into the firewall and set up as a failover connection. That way people can keep using their normal workstations in the event of another outage and no one has to run around like a chicken with its head cut off. If the company can't survive a partial day of downtime, that company shouldn't survive.
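    (Tangent, and purely a toy: once a failover WAN like that is in place, even a dumb watcher script is enough to confirm the primary path is actually the thing that died. The targets and the Linux ping flags here are just an example, not anyone's production monitoring.)

    ```python
    # Minimal connectivity watcher: ping a couple of external anchors and log
    # when nothing answers, so the failover path can be sanity-checked.
    import subprocess
    import time

    TARGETS = ["1.1.1.1", "8.8.8.8"]

    def path_is_up() -> bool:
        for host in TARGETS:
            # -c 1 / -W 2: one echo request, 2 second wait (Linux ping syntax)
            result = subprocess.run(
                ["ping", "-c", "1", "-W", "2", host],
                stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
            )
            if result.returncode == 0:
                return True
        return False

    while True:
        if not path_is_up():
            print(time.strftime("%H:%M:%S"), "primary path down - failover should be carrying traffic")
        time.sleep(30)
    ```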

  • Inquisitor77Inquisitor77 2 x Penny Arcade Fight Club Champion A fixed point in space and timeRegistered User regular
    The thing is, if it's really mission-critical that you have 100% uptime then you make that part of the cost of doing business or you eventually go out of business. Hospitals literally have their own power generators because they know that if the power grid goes down they need to keep machines running or people will die. I'm guessing there are also regulatory responsibilities in play as well.

    If your business requires that kind of uptime then the people in charge should already know what kinds of DR they need and how much it will cost them. They are the ones who ultimately decide the risk:reward/cost:benefit. So if the shit hits the fan and a data center goes down and you didn't have backups in place and you told them that not having backups meant that you would lose all your data, then that's entirely on them.

    But the question is whether you gave them the information to make that decision in the first place. If they signed off either way, then that's on them. But the expectation that you can somehow pull money and resources out of your ass to get 100% uptime across every known possible scenario is just not a thing, and if you work in a place where that is a thing then that's entirely unreasonable and not a good place to work because all shit will roll downhill to you.

  • wunderbarwunderbar What Have I Done? Registered User regular
    For me, it comes down to the simple fact that the vast majority of companies are not nearly as mission critical as they think they are. The good companies realize this.

    Yes, moving to a cloud provider is risk mitigation, and in every case where I've been involved in the decision-making matrix it always comes down to the fact that yes, Azure/AWS/Google Cloud can go down, and if they go down that will impact our business. But if Azure goes down there will be hundreds of engineers at Microsoft whose sole job is to, you know, fix it. Whereas if on-prem infrastructure goes down, it's up to what is likely a too-small IT team to fix. Personally, I'd rather have Microsoft trying to fix my email outage instead of me. It's not that I can't fix most problems given enough time, but generally I'd rather leave that work to the people who actually make the product.

    XBL: thewunderbar PSN: thewunderbar NNID: thewunderbar Steam: wunderbar87 Twitter: wunderbar
  • schussschuss Registered User regular
    Feral wrote: »
    schuss wrote: »
    Feral wrote: »
    schuss wrote: »
    All the things you describe are management not knowing how to manage modern technology environments.

    Well, yeah. Of course. Most managers & execs don't know how to manage tech. (The ones that do often have other fatal flaws.) That's not a surprise for anybody in this thread. Or really anybody who has worked in corporate America for any length of time. Most businesses are kakistocracies, in general, and especially when it comes to technology.

    No, your technology management is what I'm referring to. Part of that job is breaking down minimum investment and standards around technology things, just as a marketing or ops manager does the same for their discipline.
    The contract stuff should be standard, as would you stand on your building ops person to fix things if the janitors don't show up or go after their contract and hire new ones (if needed)?

    Well, let's run with this analogy for a moment.

    Let's say that the VP of Facilities signed a contract with a custodial firm that, for whatever reason, once in a while left a mess so bad that it caused a work stoppage. I dunno, let's imagine that we're working onsite, and garbage piles up in the break rooms so badly that it starts to stink and people sitting in the vicinity have to vacate.

    Do you think the VP of Facilities would get away with saying, "oh, it's okay, we have a clause in our contract with them that says that we get a refund if they fuck up like this?" (Basically, an SLA.) Do you think that the business (either the execs, or the users) would tolerate the facilities staff sitting around and acting like this is an acceptable situation to be in once every year or two?

    No, that response wouldn't be tolerable to anybody. The goal of the custodial service is to keep the building clean. The goal of IT is to keep systems online. If the facilities VP's contracted custodians failed to take out the trash, somebody down the org chart in facilities would have to get up from their desk and do it themselves. When system-critical infrastructure goes down, even if a contractor is at fault, IT has to mobilize and find a way to work around it. Not because that's the CTO's (or the VP of Facilities's) expectation, but because that's the expectation from the organization at large.

    Users do not care what your SLAs are. Their managers don't care what your SLAs are. Non-IT executives don't care what your SLAs are. The board doesn't care what your SLAs are. "We've transferred the risk of a system outage to our outsourced infrastructure vendor" isn't an acceptable answer to the VP of HR who is on the phone to the CTO, griping about how it's open enrollment season and they can't get to their benefits documents. And in my experience, if the CTO is on the receiving end of that phone call, he's going to seagull in and start lighting fires under IT to fix it - even if it was the CTO's decision to put us in that position in the first place.

    Does that make the CTO a bad manager? Yeah, probably. There's no law that says that CTOs or CIOs have to be good managers. So when you say "that's bad management!" my reaction is "Yeah? So what? Most managers are bad in some way or another." Good managers are like good cops: the ones who are individually good are overpowered by systemic problems and perverse incentives.

    The friction here is how he's a bad manager. Your position seems to be (if you don't mind me paraphrasing) that he's a bad manager because he's making lower-level IT staff responsible for a system outage that is beyond our control. My position is a bit different. IT is responsible for keeping critical systems online (regardless of the vendors or the contracts or the SLAs). No amount of expectation management will change that. He's a bad manager for putting IT in a position where we couldn't fulfill our expectations. He'd still be a bad manager even if he didn't come in and start ordering people to buy hotspots from AT&T.

    You and I seem to have at least partly compatible outlooks? You said this:
    The thing is - business critical apps and capabilities will cost real money to have a good DR plan. That means running active/active or similar. Things like fiber cuts? Either pay for the alternate provider or plan on employees relocating for the duration. Your critical tier should be a minimum of services, as the random employee lunch portal etc. can be down for a day.

    Which is compatible with my position here.

    The CTO in my example was a bad manager when he rejected my proposal for redundant Internet service, prior to that (inevitable) outage. The AT&T hotspot incident was just the predictable result of him being a grasshopper carried by ants in that Aesop's fable.

    A big part of technical management is communicating with non-technical management and setting expectations around what happens. If non-IT executives don't know or care what your SLAs are, either they weren't agreed to properly (as SLAs should have the business stakeholders as signatories) or they were never really communicated. You hold the line and set expectations with people so that when shit goes bad, you're able to clearly articulate "we're currently engaged with the vendor responsible and ensuring that this is resolved with all known haste. We do not own or maintain these systems due to the contractual decisions, so be aware this will be managed by *incident manager/director person* in concert with the vendor."
    Making your people run around because you have not built the rapport with your peers is a bad excuse. This is what I'm talking about. Cloud and other related items are MORE expensive by a significant margin if you're still going to act like it's On-Prem (as EC2 instances will be more than similar hardware in your datacenter, you're paying for the management and scaling).
    In the example - Facilities goes after the vendor and finds a new one, not paying the first as they built the contract to have performance standards that say "hey, you owe us X if you don't perform as described." Words and contracts matter a lot, as does the enforcement of them. Just this morning we were discussing two large vendors not performing on major shit that are being kicked to the curb and we'll have to eat some small additional cost, but because of contract structure we're able to basically come out neutral. The things I'm describing are a significant portion of an IT execs job. The reactions you're describing indicate IT is not seen as a peer or a group with agency, but just a whipping post any time things go wrong. That is not IT's job. IT's job is building and maintaining a resilient set of environments for work to happen in and things to run in.
    "We've transferred the risk of a system outage to our outsourced infrastructure vendor" is a perfectly acceptable response provided you have some level of management going on to ensure it's being addressed. The CTO does not report to the VP of HR, so unless a specific promise was made around uptime during a period, it's business as usual. You are not responsible for keeping a system online that you have no control over.

    Note that as I say all this - I understand it's not easy to get to a place that non-tech folks understand the complexities of maintaining even simple systems or relative hidden costs or benefits of various platforms or strategies. In my company we literally sat down every exec and ran them through very specific education on how modern technology works so everyone could work better together.

  • FeralFeral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉRegistered User regular
    edited January 2022
    schuss wrote: »
    The reactions you're describing indicate IT is not seen as a peer or a group with agency, but just a whipping post any time things go wrong. That is not IT's job. IT's job is building and maintaining a resilient set of environments for work to happen in and things to run in.

    I agree: IT's job is building and maintaining a resilient set of environments for work to happen in and things to run in.

    And yes, you're accurate. IT is (often) not seen as a peer or a group with agency, but just a whipping post any time things go wrong.

    I think where we disagree is the causality of that relationship.

    When IT actually does build a resilient set of environments, our reputation improves, we gain respect and political traction.

    When IT relies on excuses and passing the buck, our reputation declines, and we are more likely to be seen as a whipping post.

    It's not a linear causal arrow. It's a self-perpetuating cycle.

    IT departments have poor reputations in a lot of companies. I think that attitudes like (paraphrased) "well, when Azure/AWS/ISP/etc fail, we have no control over it so our job is just to communicate to the business that it's a problem with our vendor" contribute to that poor reputation. Even if it seems like coworkers & management accept that messaging in the moment, or even if it seems like they signed off on that ahead of time, a lot of people hear that and it gets filed in their brains as "IT can't help me and they're making excuses."

    I'll expound on that:
    schuss wrote: »
    Words and contracts matter a lot, as does the enforcement of them. Just this morning we were discussing two large vendors not performing on major shit that are being kicked to the curb and we'll have to eat some small additional cost, but because of contract structure we're able to basically come out neutral. The things I'm describing are a significant portion of an IT execs job.

    Words and contracts matter a lot. Shared cultural understandings and implicit assumptions matter almost as much.

    Let me use an example that doesn't involve an outage. My company established a VPN tunnel with a contractor so their employees could access our network resources. While on a Zoom call with one of the contractor's internal system administrators, he strongly asked us if we could reduce the complexity of our VPN tunnel's pre-shared key. We said no. A few days later, he asked us if we could just set all the Active Directory passwords for the contractors' user accounts (on our network) to the same password - a (simple, stupid) password that he suggested. We again said no.

    Our contract covered some basic cybersecurity verbiage. The problem here wasn't that our passwords were short and simple (because we wouldn't let them be.) The problem here was that this contractor, who had (limited, controlled) access to our environment, and (limited) access to our internal data, had a system administrator on staff who didn't know or didn't care what good password policy is. If his attitude is just "set everybody's password to contoso1, please" then who knows what he's setting up on his network. Who knows which of his own critical systems have stupid, shared passwords. And while we could treat this contractor's staff and systems with limited trust, they would unavoidably be handling some of our confidential data.
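    For contrast, the boring, correct version of what he was asking for takes about ten lines (purely illustrative; the contractor user names are made up):

    ```python
    # Generate a strong IPsec pre-shared key and unique per-user temporary
    # passwords instead of handing everyone "contoso1".
    import secrets
    import string

    def strong_psk(length: int = 48) -> str:
        alphabet = string.ascii_letters + string.digits
        return "".join(secrets.choice(alphabet) for _ in range(length))

    def temp_password() -> str:
        # long random string; users are forced to reset it at first logon anyway
        return secrets.token_urlsafe(18)

    print("IPsec PSK:", strong_psk())
    for user in ["contractor1", "contractor2", "contractor3"]:
        print(user, temp_password())
    ```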

    We had a discussion with the vendor about that and they pledged to do some additional cybersecurity training. But we didn't drop the vendor over it, as much as I would have liked to. (Not merely for this, but for other reasons as well.)

    Now, I'm sure that on an Internet forum, and in hindsight, we could say "well, that's poor vendor management" or "you should have covered that in the contract." Anybody who says that is missing the point. No contract, and no vendor management process, is going to cover everything in explicit detail. I would love to get a full SOC report from an independent security auditor for every vendor I ever engage with. I would love it if every vendor were contractually required to read and implement the NIST cybersecurity framework before they come within a metaphorical mile of our data. It ain't gonna happen.

    The employee handbook doesn't have to say "don't shit in the urinal." If somebody shits in the urinal, they don't get to say "nobody told me not to." We have a shared, cultural understanding that shitting in the urinal is beyond the pale. Just like we have a shared cultural understanding in this thread that running to AT&T for hotspots when the Internet goes down is beyond the pale, or using shared short stupid passwords for critical systems containing confidential data is beyond the pale.

    (Unfortunately, because of shit like the Equifax breach, widespread underinvestment in IT and in cybersecurity, and the lack of meaningful punishment for companies that drop the ball, the anglosphere is starting to develop a shared cultural assumption that data privacy is impossible. But that's its own separate tangent.)

    And I've found that there is often friction between the culture of IT people and the culture of non-IT people when it comes to third-party outages. In IT, we see it as normal and natural that when a third-party vendor fails, it's not up to us to bring them back up; we just have to wait for them to fix the problem. Outside of IT, most people don't know what Azure is, they just want to get their work done (the same way they don't know what a JPEG is, they just want a picture of a gosh dang hot dog).

    Thankfully, most people understand Internet outages, just like they understand power outages and natural disasters and inclement weather. What I've found most non-IT people don't understand is why an Internet outage takes down some systems and not others. Unless a system is accessed through a web browser and branded with some other company's name and logo, most people don't intuit that it's cloud-hosted.

    Explicit agreements between stakeholders, expectation management, formal risk assessments, and SLAs help a lot. I don't mean to imply that they are useless. But we're still swimming against humans' habits and biases, which is always difficult.
    Thawmus wrote: »
    See stuff like this makes me really think if my job here ever folds up I'm just out of this industry.

    I've been trying to get out of this industry for 20 years.

    https://www.youtube.com/watch?v=UneS2Uwc6xw

    Feral on
    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.
  • schussschuss Registered User regular
    Yeah man, I feel that pain. We audit vendors and our customers audit us on cybersecurity controls. In your current culture, the stuff that's flying gets treated as acceptable because they clearly don't understand or care about security practices, which makes you the bad guy in all of this. It took us YEARS, but we're at a reasonable point where every exec is semi-scared shitless of cybersecurity breaches/events and security is just everyone's albatross to deal with, because if we're insecure, there's material business loss: no other company will trust us with their confidential info. We do the same to vendors - if their controls look like shit, they aren't allowed anywhere near confidential or restricted data.
    Unfortunately, many have not gotten out of the "just make it work" era and accept second-rate controls and management. That's why, again, I'm not putting this on you - you're stuck between knowing the right thing and being pushed to do the wrong things, so you're trying to actively mitigate from the middle. Management should be the one holding the line on security and other factors as "this is non-negotiable, and here are the very real losses in dollars AND reputation that WILL happen if you don't listen to me".

    For the non-IT folk, I usually fall back on traffic-jam or other physical-breakage metaphors - like, if there's a massive pile-up, is it OK that they're late to work or a meeting? Probably. Unavoidable and unforeseeable, but it's been known to happen. 3rd parties going down is similar. Bridging that gap is the most important thing for management and any product or portfolio ownership functions to be doing, as it gives people a path to understanding instead of "WHAT THE FUCK THIS IS BROKEN FIX IT" toxic culture bullshit.

  • AiouaAioua Ora Occidens Ora OptimaRegistered User regular
    edited January 2022
    I'm not sure I agree with your general point, Feral... Like, I guess I agree when you're talking about the kind of attitudes we instill in IT workers who will eventually become management and leadership: then yes, they should be thinking first about how to build an environment as resilient as possible with the resources they are allocated.

    And that's really the rub, if the people holding the pursestrings don't want to pay for a highly available environment and also want to complain when things go down, then there's no winning.

    Like you're basically saying that IT gains nothing from negotiating SLAs with leadership because leadership will never honor their side of the agreement... But that leaves you either under-delivering as you attempt to build a smaller, more resilient system without enough resources, or just stuck in a shit situation with a shit company.

    Like this is why I was so happy to get out of IT because the only way to win in a bad company is doing twice as much with half the money, and fuck that.

    edit:
    I dunno man, like... I work on a team that does management of major outages of a Product You've Heard Of for a Big Old Tech Firm and when us-east-1 went down the other week we absolutely sat on our hands, sending out updates. Our boss was ok with it, our director was ok with it, our VP was ok with it. Nobody was trying to scramble and get the product running on some backup tech stack. And we're not even now trying to add more redundancy in case of a similar outage because it just isn't worth the money and our leadership understands that.
    I think it's fair to say that if you have shitty leadership you need to plan for them to throw IT under the bus despite previous assurances. But you should also be planning your exit from that abusive company with objectively bad leadership.

    Aioua on
    life's a game that you're bound to lose / like using a hammer to pound in screws
    fuck up once and you break your thumb / if you're happy at all then you're god damn dumb
    that's right we're on a fucked up cruise / God is dead but at least we have booze
    bad things happen, no one knows why / the sun burns out and everyone dies
  • SiliconStewSiliconStew Registered User regular
    That you believe communicating about issues to the company and managing expectations throughout the business somehow contributes to a "bad reputation" for IT is rather telling.

    Management doesn't support you on security issues? You've failed to communicate its importance to the business. Communicating the enormous costs of fines, ransoms, downtime, data loss, lost business, or lost trust/reputation is typically the best approach to get buy-in from the Board/CxO/VP levels. Users don't understand why things are broken? You are failing to explain it in terms they can understand. And beyond that, all most people are looking for is the answer to "Is it being worked on?" and "When will it be back up?".

    Everything you describe stems from a communication problem in your company. Work on that communication - and yes, it can take a lot of effort and time to get there - and all those issues you've mentioned go away. If IT management isn't working on improving communication, you have a management problem. If they have been working on communication but continue to get no buy-in, then you have a likely unfixable cultural problem at the top and you should probably be looking for work elsewhere for your own sake.

    Just remember that half the people you meet are below average intelligence.
  • wunderbarwunderbar What Have I Done? Registered User regular
    unrelated to any of that, funny story that isn't super funny at all and I hate everything.

    Our business insurance is coming up for renewal, which leads to the usual "fill out this questionnaire" type stuff. Remember, I'm less than a month at this job, with 2 weeks of that disrupted by Christmas.

    One question on the insurance form was "do you have any unsupported software?" The answer to that is yes, which had me worried when answering it. We have some Server 2008 R2 VMs, but more importantly our ERP system, which the business runs on, came out of support at the end of 2021. We are planning on moving off of it in Q4 and that project is underway, but we are running on unsupported software, and even with a plan to move off of it I was worried about pushback from the insurance company.

    Another question is whether or not we have MFA enabled on user accounts. The answer to that is no, and we currently have no plans to roll it out. Not that I don't want to, but again, I'm less than a month into this job, so I said no current plans to roll it out just because I'm still trying to grasp what my capacity is and I know rolling out MFA to this infrastructure is going to take some work.

    Insurance company comes back and says that our answers aren't acceptable and they will not insure us. But not because of the unsupported software... because of the lack of MFA. That gave me a bit of a head tilt, but whatever. I assumed that it was probably because we don't at least have a plan to move to MFA. Without getting too far into it, the ERP project involves actually moving it into Azure, on Dynamics 365. That is due in Q4, and once we have that done we won't have any major workloads on prem, so instead of going through the effort of getting MFA working on our legacy systems I said I thought a reasonable timeline was to roll out MFA in conjunction with the ERP project in Q4. The business, and the rest of the IT team, agreed, so we took that to the insurance company.

    Not good enough for the insurance company. They've given us a month to come up with a new plan, or to switch providers. And if you're thinking it's just one insurance company being difficult, a second one we quoted also said that rolling out MFA in Q4 2022 is not an acceptable timeline.

    So I guess I'm getting MFA working on this hacky, hybrid cloud on prem deployment with a ton of real old stuff in it. Stuff like email is easy, just turn it on in O365. But the on prem stuff is going to require quite a bit of work that is suddenly my highest priority.

    Insurance companies are the worst. I fully agree that MFA is important and I use it on every personal service I can. But to not insure a business because they won't have MFA rolled out until Q4? What the hell?
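    And to be clear, it's not the mechanics I'm grumbling about. The authenticator-app flavour of MFA is basically just a shared random secret plus a time-based code, something like this minimal Python sketch with the pyotp library (the account and issuer names are made up). The pain is wiring it into every crusty on-prem entry point, not the math.

    ```python
    # pip install pyotp
    import pyotp

    # Enrollment: generate a per-user secret and hand it to the authenticator app,
    # usually as a QR code built from the provisioning URI below.
    secret = pyotp.random_base32()
    totp = pyotp.TOTP(secret)
    print("Provisioning URI:", totp.provisioning_uri(name="jdoe@example.com",
                                                     issuer_name="ExampleCorp VPN"))

    # Login: the app shows a 6-digit code, the server verifies it against the secret.
    code = totp.now()                      # stand-in for what the user types in
    print("Current code:", code)
    print("Verified:", totp.verify(code))  # True within the 30-second window
    ```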

    XBL: thewunderbar PSN: thewunderbar NNID: thewunderbar Steam: wunderbar87 Twitter: wunderbar
  • SiliconStewSiliconStew Registered User regular
    edited January 2022
    We had to fully implement MFA for insurance this year, but it was specifically for cyber security coverage. But they also told us it was going to be a requirement for renewal nearly a year in advance. I wonder if your predecessor ignored prior warnings, given that this is both a surprise to you and that they are only giving you 30 days to implement.

    I will say Duo Security was really easy to implement for admin access to our on-prem server infrastructure. Azure MFA covered all our normal users but took a while to roll out fully due to all the MFA/Modern Auth incompatible mail accounts/apps on people's phones/tablets. We weren't going to just turn it on and break everyone all at once.
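    If anyone else is staring down the same rollout, what helped us scope it was pulling the Azure AD sign-in logs to see who was still on legacy protocols before enforcing anything. A rough Python sketch against the Microsoft Graph sign-ins report is below; it assumes you've already obtained a token with AuditLog.Read.All some other way, and the list of "legacy" client strings is illustrative, so check what actually shows up in your tenant's clientAppUsed field.

    ```python
    # Rough sketch: list users still signing in with legacy (non-modern-auth) clients.
    # ACCESS_TOKEN is a placeholder; acquire it via your usual OAuth flow with
    # AuditLog.Read.All. The LEGACY set below is illustrative, not exhaustive.
    import requests

    ACCESS_TOKEN = "..."
    LEGACY = {"IMAP4", "POP3", "Authenticated SMTP", "Exchange ActiveSync", "Other clients"}

    url = "https://graph.microsoft.com/v1.0/auditLogs/signIns?$top=500"
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
    offenders = {}

    while url:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        for signin in data.get("value", []):
            client = signin.get("clientAppUsed") or ""
            if client in LEGACY:
                offenders.setdefault(signin.get("userPrincipalName"), set()).add(client)
        url = data.get("@odata.nextLink")  # follow paging until there are no more pages

    for user, clients in sorted(offenders.items()):
        print(f"{user}: {', '.join(sorted(clients))}")
    ```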

    Edit: Though I agree that some insurance requirements are just stupid. For example, they required us to block MS RDS Gateway access but didn't care at all that we have Citrix Access despite both providing the same level of secure remote access.

    SiliconStew on
    Just remember that half the people you meet are below average intelligence.
  • schussschuss Registered User regular
    As someone who works for an insurance company - this is absolutely because you have some form of cyber and continuity coverage. Years of saying "hey, do this" didn't work, so many providers just moved to requiring the measures you mention, to reduce attack vectors and make underwriting feasible.

  • wunderbarwunderbar What Have I Done? Registered User regular
    SiliconStew wrote: »
    We had to fully implement MFA for insurance this year, but it was specifically for cyber security coverage. But they also told us it was going to be a requirement for renewal nearly a year in advance. I wonder if your predecessor ignored prior warnings, given that this is both a surprise to you and that they are only giving you 30 days to implement.

    I will say Duo Security was really easy to implement for admin access to our on-prem server infrastructure. Azure MFA covered all our normal users but took a while to roll out fully due to all the MFA/Modern Auth incompatible mail accounts/apps on people's phones/tablets. We weren't going to just turn it on and break everyone all at once.

    I have experience with Duo, and it's my fallback option. But I'd like to go with Microsoft Authenticator since we are almost entirely in the Microsoft ecosystem, and that would have been good enough. In the medium term it looks like I can get Microsoft Authenticator working with our firewall VPN by jumping through some hoops. That'll let me say we have 2FA on our VPN, which is how remote users connect to the on-prem resources we have left.

    XBL: thewunderbar PSN: thewunderbar NNID: thewunderbar Steam: wunderbar87 Twitter: wunderbar
  • FeralFeral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉRegistered User regular
    schuss wrote: »
    As someone who works for an insurance company - this is absolutely because you have some form of cyber and continuity coverage. Years of saying "hey, do this" didn't work, so many providers just moved the method you mention to reduce attack vectors to make underwriting feasible.

    Yeah. I'm glad insurance companies and regulatory agencies are cracking down harder on stuff like this. Seems to be the most effective way to get movement.

    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.
  • DarkewolfeDarkewolfe Registered User regular
    The dark secret on the other part of that is that most large orgs have unsupported software. The real strength is knowing where it is, where it's exposed, what risk there is, and having a plan to deal with it.

    What is this I don't even.
  • MyiagrosMyiagros Registered User regular
    Security audits are just silly with their questions sometimes. One I saw recently was whether default passwords are changed on new devices which I replied Yes to. They came back a few days later asking how we know they are changed..... because I changed them myself...?

    iRevert wrote: »
    Because if you're going to attempt to squeeze that big black monster into your slot you will need to be able to take at least 12 inches or else you're going to have a bad time...
    Steam: MyiagrosX27
  • FeralFeral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉRegistered User regular
    Myiagros wrote: »
    Security audits are just silly with their questions sometimes. One I saw recently was whether default passwords are changed on new devices which I replied Yes to. They came back a few days later asking how we know they are changed..... because I changed them myself...?

    Other common answers to that question, depending on the size and maturity of the organization:
    • The techs doing the deployments have a checklist they have to fill out when they deploy a new device, and that is one of the items on the checklist. We do spot checks of newly deployed devices at random to make sure they're complying with the checklist.
    • We pass every new device through a vulnerability scan (e.g., Nessus) capable of detecting default passwords on common OSes/devices/applications. We save the results of that scan in our inventory management. (A cruder DIY version of that kind of check is sketched below.)
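    And if you're somewhere in between - no formal checklist audits, no vulnerability scanner budget - even a crude script that tries known default logins against newly deployed gear beats nothing. Here's a minimal, SSH-only Python sketch with paramiko; the hosts and credential pairs are placeholders, and obviously you only point this at devices you own.

    ```python
    # pip install paramiko
    # Flag devices that still accept a known default SSH credential.
    # Hosts and credential pairs below are placeholders - only scan gear you own.
    import paramiko

    HOSTS = ["10.0.0.10", "10.0.0.11"]                       # newly deployed devices
    DEFAULT_CREDS = [("admin", "admin"), ("admin", "password"), ("root", "root")]

    def accepts_default_login(host: str, user: str, pw: str) -> bool:
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            client.connect(host, username=user, password=pw, timeout=5,
                           allow_agent=False, look_for_keys=False)
            return True                     # login worked: default creds still active
        except Exception:
            return False                    # auth failure, timeout, no SSH, etc.
        finally:
            client.close()

    for host in HOSTS:
        for user, pw in DEFAULT_CREDS:
            if accepts_default_login(host, user, pw):
                print(f"WARNING: {host} still accepts default login {user}/{pw}")
    ```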

    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.
  • FeralFeral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉRegistered User regular
    Most of these security questionnaires are written from the presumption that you're a larger organization and you either have an IT manager filling them out, or you have a separate information security department watching over IT. They aren't written for the techs and admins who are actually doing the day-to-day work. The idea is that your IT people might make errors (or even more flagrantly flout procedures), so you need somebody watching them and verifying they're getting their work done.

    The reality I've found in medium-sized orgs is that even if you have a manager hypothetically capable of filling out the security questionnaire, he'll often delegate that directly to techs/admins. "Hey, Myiagros, you deploy new phones and laptops, so can you take a look at page 10 and answer controls 16-24 about our device deployment procedures?" That's not how it's supposed to go, but ¯\_(ツ)_/¯

    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.
  • FeralFeral MEMETICHARIZARD interior crocodile alligator ⇔ ǝɹʇɐǝɥʇ ǝᴉʌoɯ ʇǝloɹʌǝɥɔ ɐ ǝʌᴉɹp ᴉRegistered User regular
    Darkewolfe wrote: »
    The dark secret on the other part of that is that most large orgs have unsupported software. The real strength is knowing where it is, where it's exposed, what risk there is, and having a plan to deal with it.

    Yeah, in regards to wunderbar's insurance company, I don't know for sure, but I'd guess that they don't need to see MFA everywhere. We don't have MFA everywhere, and we haven't run into any issues with insurance or any complaints from our security folks. (Sometimes our pentesters suggest it, but always in a tone of voice that's like, 'yeah, we know that's asking an awful lot.')

    When we get into the details, they're usually looking for MFA on:
    - Remote entry points (VPNs, OWA, etc)
    - High-sensitivity systems (password vaults, backup systems, jump boxes, perimeter firewalls)
    - End-of-life devices (Windows Server 2008, old Cisco switches that no longer get updates)
    - Privileged accounts (domain administrators)

    For us, our strategy with EOL stuff is that we just want it gone. We endeavor to retire everything before it goes EOL. (Hardware EOL, i.e. the end of hardware support, isn't great, but it is manageable. It's end of software support, specifically the end of security patches, that forces our lifecycle.) We don't always succeed. But in the minority of cases where we miss the deadline, we're much more focused on migrating off the old device than on adding MFA or otherwise shoring it up.
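    In practice that just means every item in our inventory carries an end-of-support date and a dumb report nags us about it. Something like this minimal Python sketch (the CSV layout and sample rows are invented for illustration) is honestly most of the work:

    ```python
    # Flag inventory items that are past, or within 180 days of, end of software support.
    # The CSV layout (name, eol_date) and the sample data are illustrative only.
    import csv
    from datetime import date, timedelta
    from io import StringIO

    INVENTORY_CSV = """name,eol_date
    ERP-APP01,2021-12-31
    FILESRV-2008R2,2020-01-14
    CORE-SW-01,2025-10-31
    """

    today = date.today()
    soon = today + timedelta(days=180)

    for row in csv.DictReader(StringIO(INVENTORY_CSV)):
        eol = date.fromisoformat(row["eol_date"].strip())
        if eol < today:
            print(f"PAST EOL : {row['name']} (support ended {eol})")
        elif eol <= soon:
            print(f"EOL SOON : {row['name']} (support ends {eol})")
    ```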

    every person who doesn't like an acquired taste always seems to think everyone who likes it is faking it. it should be an official fallacy.

    the "no true scotch man" fallacy.