CleanMPG down for what yesterday?

Discussion in 'Website news & discussions' started by xcel, Aug 11, 2011.

  1. xcel

    xcel PZEV, there's nothing like it :) Staff Member

    Hi All:

    In case you did not know, CleanMPG was down for about 7 hours yesterday. It was caused by a local blackout in the Dallas area combined with a failed automatic transfer switch (ATS) to an emergency backup bus, which in turn took the whole server farm out.

    We will review this further, but here is their apology letter, received a few hours ago.
    Wayne
     
  2. JusBringIt

    JusBringIt Be Inspired

    I was wondering wth happened. I started going through withdrawal :eek:
     
  3. rfruth

    rfruth Well-Known Member

    7 hours, so this wasn't a rolling blackout?
     
  4. herm

    herm Well-Known Member

    anything to do with the solar storm activity?
     
  5. WriConsult

    WriConsult Super Moderator

    I hope they're giving their customers a free month of service, at a minimum. Can't blame the service provider for the blackout, but they should be responsible for the UPS not working. Being down for most of a day can really clobber an online-based business, as some of the end consumers who couldn't find the missing site won't be coming back.
     
  6. Chuck

    Chuck just the messenger

    I think it was, given the near-record summer heat North Texas has been getting.
     
  7. bestmapman

    bestmapman Fighting untruth and misinformation

    How is the service with them, and are they going to compensate you for the lost revenue?
     
  8. xcel

    xcel PZEV, there's nothing like it :) Staff Member

    Hi Jud:

    Their service has been excellent, and because I only make about $25 per day, I lost approximately $18.

    Wayne
     
  9. xcel

    xcel PZEV, there's nothing like it :) Staff Member

    Hi All:

    Here are the full details regarding the "whys and hows" of the CleanMPG downtime last Wednesday.

    Dear 1-800-HOSTING Customers,

    Thank you for your patience and understanding with the equipment failure this week. We apologize for the disruption to your business and the stress and frustration that you experienced. As promised, we have compiled this Reason for Outage report as part of our after-action assessment.

    What Happened: On Wednesday, August 10, 2011 at 11:01AM CDT, the primary 1-800-HOSTING Dallas datacenter (DFW02) experienced an equipment failure in one of the automatic transfer switches (ATS) at service entrance #2, which supports some of our long-term customers. The ATS device was damaged and would not pass either commercial or generator power, whether automatically or in bypass mode. Thus, to restore the power connection, a temporary replacement ATS had to be put into service.

    The datacenter’s standard redundant power offering has commercial power backed up by diesel generator and UPS. Each of the six ATSs is tied to its own generator and service entrance. The five other ATSs and service entrances at the facility were unaffected.

    The ATS failure at service entrance #2 affected customers with single-circuit connectivity (one power supply). Customers with redundant circuits (true A/B dual power supplies) draw from two ATSs, so the B circuit automatically handled the load.
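
    As a rough illustration of the single-feed versus A/B dual-feed behavior described above (the feed names and values below are hypothetical, not the datacenter's actual configuration):

    # Feed "A" hangs off the failed ATS at service entrance #2; feed "B" is healthy.
    FEEDS_UP = {"A": False, "B": True}

    def server_has_power(feeds):
        # A server stays up if at least one of its connected feeds is still energized.
        return any(FEEDS_UP[f] for f in feeds)

    single_feed_server = ["A"]      # one power supply, on the failed circuit
    dual_feed_server = ["A", "B"]   # true A/B dual power supplies

    print(server_has_power(single_feed_server))  # False: offline until the temporary ATS is in service
    print(server_has_power(dual_feed_server))    # True: the B circuit carries the load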

    Also, one of the five distribution switch cluster pairs, each of which connects to its own A/B power source, experienced a failure condition when it lost a stack member. This resulted in degraded network service to many downstream access switches, servers and internal systems until traffic was re-routed around the offending distribution cluster.

    Response Actions: As soon as this incident occurred, we worked to mobilize the proper professionals in the facility and in our extended teams. The on-site electrical contractors and technical team worked quickly with the general contractors and UPS contractors to assess the situation and determine the fastest course of action to bring customers back online.

    As part of the protocol, they first conducted a thorough check of the affected ATS as well as the supporting PDU, UPS, transformer, generator, service entrance, HVAC, and electrical systems. It was determined that all other equipment was functioning properly and that the failure was limited to the ATS device. This step was important to ensure that the problem did not affect other equipment or replicate at other service entrances.

    It was further determined that the ATS would need extensive repairs and that the best scenario for customers would be to install a temporary ATS. As the ATS changeover involved high-voltage power, it was important that the team move cautiously and deliberately to ensure the safety of the employees, contractors and customers in the building, and to protect both customer-owned equipment and our own. Safely bringing the new unit online was the top priority.

    After the temporary ATS was installed and tested, the team brought up the HVAC, UPS and PDU individually to ensure that there was no damage to those devices. Then, the team restored power to customer equipment. Power was restored as of 6:31PM CDT.

    The UPSs were placed in bypass mode on the diesel generator to allow the batteries to fully charge. The transition from diesel generator to commercial power occurred at 9:00PM CDT with no customer impact.

    Technicians worked with customers and, based on internal checks and alerts, helped bring online any equipment that did not come back up when power was restored, and reset devices whose breakers tripped during the restoration. This process continued throughout the evening.

    Assessment: As part of the after-action assessment, the datacenter management team has debriefed with the entire on-site technical team and electrical contractors, as well as the equipment manufacturer, UPS contractors and general contractors, to assess the ATS failure. While an ATS failure is rare, it is even rarer for an ATS to fail in a way that also prevents it from going into bypass mode.

    While the ATS could be repaired, the decision was made to order a new replacement ATS. This is certainly a more expensive option, but it is the option that provides the best long-term stability for our customers.

    Lessons Learned: Thankfully the datacenter has experienced few issues in its 11 years of operation, though any issue is one too many. As part of the after-action review, additional improvements were made to the existing emergency/disaster recovery plans.

    The technical team and the HVAC, electrical and general contractors brought exceptionally fast, sophisticated thinking and action to get customers back in business as quickly as possible. Working with power of that size and scale is complex at any time, especially under pressure, and the response shows the merit, knowledge and resolve of these individuals. Thank you to the technical team and all of the contractors for a job well done in safely restoring power.

    Next Steps: Once the datacenter receives and tests the new ATS, we will schedule a maintenance window to replace the equipment. We will provide advance notice and timelines to minimize the disruption to your service.

    Future Prevention: Equipment can and will fail. Unfortunately, that is a simple fact of life we have to live with. However, precautions can be taken to mitigate the risk of such occurrences by implementing additional redundancies.

    1-800-HOSTING can offer additional redundancies, ranging from simple dual power supplies in a server to connecting each of those power supplies to an independent power plant, preventing similar problems in the future. We can also put multiple servers in a load-balanced environment, or even host additional servers in multiple datacenters. Feel free to contact me or our sales department if you are interested in learning more about high-availability configurations.
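
    As a rough, hypothetical sketch of what such redundancy buys (the hostnames and health-check URLs below are made up for illustration and are not part of our actual offering), client-side failover between servers in two datacenters can be as simple as checking each in turn:

    import urllib.request

    # Illustrative health-check endpoints in two datacenters (hypothetical names).
    HOSTS = [
        "http://dfw.example.com/health",   # primary datacenter
        "http://iad.example.com/health",   # second datacenter
    ]

    def first_healthy(hosts, timeout=3):
        # Return the first host whose health check answers HTTP 200, or None.
        for url in hosts:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status == 200:
                        return url
            except OSError:
                continue  # unreachable (for example, that datacenter lost power)
        return None

    print(first_healthy(HOSTS))  # in practice, a load balancer or DNS layer does this automatically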

    Communicating more effectively:
    1. We will create an offsite status site, independent of our traditional datacenter locations and networks, that you can visit should you not be able to reach our network status page.

    2. Call overflow – It is rare that you are unable to reach us, but we plan to partner with an external contact center to answer calls regarding status updates during times of extremely high call volume. Often, the people who can restore your services are out doing exactly that, and unfortunately that means a longer-than-acceptable hold time for customers looking for status updates, since we have fewer people available to answer your inquiries.

    3. More prompt updates to our 411 status message when you call in, as well as to the network status message on our website.

    Once again, I want to apologize for the pain that this may have caused you, your customers, employees, and others affected by this. I know I spoke with many of you during this event, and believe me when I say I understand what this does to your businesses and operations. I look forward to answering any questions you may have.

    Feel free to reply to this email with any questions, concerns, or suggestions. You may also reach me directly at chris@800hosting.com.

    On behalf of the entire 1-800-HOSTING staff, we thank you for your patience, understanding, and patronage as customers over the years, and I want you to know that we value both you and your business.

    Christopher Shaffer
    Director, Network Services
    1-800-HOSTING, Inc.

    Wayne
     
