From the ProQuest blog:
As CIO of ProQuest, I’d like to apologize for the interruption in our services that began on July 17 at around 3:50 PM EDT. It’s our mission to provide better research, better learning, and better insights. Regrettably, a highly unusual incident with a key piece of hardware caused an outage of some ProQuest systems.
Our teams worked through the night to restore service, and all ProQuest services are now operating normally. We are working closely with our vendors to fully analyze the root cause of this incident and to ensure that any issues identified are addressed quickly.
This outage is especially disappointing for me because every year we invest millions of dollars to build and deliver the best products possible, with substantial investments in robust platforms from leading hardware and cloud vendors. While we are proud of our long track record of availability of ProQuest Platforms, I assure you that we will do everything we can to learn from this event and continue to improve our products and services.
To minimize further customer impact, we have decided to delay the ProQuest maintenance window scheduled for July 29, 2017. We will provide an update shortly with the new timing for the window.
Below, I’ve included a summary of what happened, additional technical details, and an FAQ that attempts to answer questions you may have.
If you have additional questions, please don’t hesitate to contact me at richard.belanger@proquest.com.
Richard C. Belanger
CIO, ProQuest
Summary
An investigation by the ProQuest engineering team determined that the issue was caused by the failure of a core hardware component: a fault-tolerant storage platform. Although the platform has multiple layers of redundancy, we experienced a complete failure with no warning. The failure impacted not only the storage environment but also the management environment, which significantly delayed the restoration process. Over 1,200 virtual servers were impacted by this outage, and all of them required intervention to be restarted and brought back online. While this is a significant number of servers, it represents just 20% of our overall environment.
We are committed to preventing the recurrence of such an outage and are undertaking an architectural review to evaluate changes to our environment to improve the resiliency of our products.