Selected IVLE services affected by database maintenance on Saturday 7 April from 0300 hrs to 0700 hrs

The following IVLE services will be unavailable on Saturday 7 April from 0300 hrs to 0700 hrs (4 hours) due to maintenance:

  • GEM Module Listing
  • IVLE Profile Page
  • IVLE Gradebook Import
  • Modules Taken
  • NUS IT Care Admin Module

Please contact the IVLE Team for any enquiries.

We apologize for any inconvenience caused. Thank you for your understanding and cooperation.

Various services affected by server relocation, from Friday 6 April, 2000 hrs to Monday 9 April, 0800 hrs

The following services will be disrupted from Friday 6 April, 2000 hrs to Monday 9 April, 0800 hrs (60 hours) due to server relocation:

  • IVLE Multimedia (non-streaming media viewing & uploading)
  • nuscast.nus.edu.sg (Webcast Events)
  • emodule.nus.edu.sg
  • staffpages.nus.edu.sg
  • courses.nus.edu.sg (k:\ drive)
  • courseware.nus.edu.sg
  • breeze.nus.edu.sg
  • screencast.nus.edu.sg (Camtasia Relay Uploading)

If you have any queries, please call IT Care at 6516 2080.

We apologize for any inconvenience caused, and thank you for your understanding and cooperation.

Selected IVLE services affected by database maintenance on Saturday 24 March from 0300 hrs to 0700 hrs

The following IVLE services will be unavailable on Saturday 24 March from 0300 hrs to 0700 hrs (4 hours) due to ISIS maintenance:

  • GEM Module Listing
  • IVLE Profile Page
  • IVLE Gradebook Import
  • Modules Taken
  • NUS IT Care Admin Module

Please contact the IVLE Team for any enquiries.

We apologize for any inconvenience caused. Thank you for your understanding and cooperation.

Summary of findings on IVLE disruption on 9 March 2012

Introduction

A data center exercise was scheduled from 9 March, 2000 hrs to 12 March, 0800 hrs. During this exercise, all servers in the data center had to be powered off. As IVLE has infrastructure redundancies built into it, it is generally resilient to such exercises, and they do not normally affect the running of IVLE.

I sincerely apologize for the disruption and stress this incident has caused. We will also be learning from this disruption to ensure that it does not happen again.

 

Hardware redundancies in IVLE

Every tier of IVLE is redundant: 2 load balancers, more than 2 web servers, and 2 database servers.

The load balancers are configured to fail over automatically, so the loss of one should not disrupt service. All our web servers are identically configured, so if one goes down, any other web server can take its place with zero or minimal service disruption to users.

The database servers are configured with database mirroring and a witness database server, which also gives us automatic failover. We chose mirroring over a failover cluster because of the long downtimes involved in automatic cluster failovers.
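The rule behind mirroring with a witness can be illustrated with a simplified model. This is only a sketch of the general "two out of three" quorum idea, not SQL Server's actual implementation; the function name and flags below are hypothetical.

```python
def principal_can_serve(mirror_reachable: bool, witness_reachable: bool) -> bool:
    """Simplified quorum model (an assumption for illustration).

    The principal counts as one vote and needs at least one partner
    (mirror or witness) to hold a majority of the three servers.
    """
    votes = 1 + int(mirror_reachable) + int(witness_reachable)
    return votes >= 2


# All three servers up: the database is served normally.
print(principal_can_serve(True, True))
# Mirror down, witness still up: the principal keeps serving.
print(principal_can_serve(False, True))
# Mirror and witness both down: the principal stops serving.
print(principal_can_serve(False, False))
```

Under this model, losing any single server is tolerated, but losing both partners at once leaves the remaining server without a majority.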

 

Overview of the disruption

IVLE went down at 2006 hrs on 9 March 2012 when both the secondary database server and the witness server were powered off for the exercise. With only the primary database running, there was a high chance of data corruption, so the primary database disabled all connections to preserve data integrity.

Email notifications were sent out indicating a database failure, but they were mixed in with false positives generated by the exercise and went unnoticed until 2100 hrs, when a team member ran a follow-up test to confirm that IVLE was working after the power test.

Once the team realized that IVLE was down, several members were quickly mobilized to find the root cause and bring it back up as soon as possible. They soon realized that the failure was caused by the database being taken offline.

As the team did not want to compromise data integrity, they looked for the cause of the problem before forcing the database back online, and traced it to the mirroring configuration. The systems team was then notified to power on the secondary database server so that mirroring could resume and all pending changes could be applied.

Once the secondary database server came back online, we ran checks to make sure that everything was in order. We then re-enabled the connections, and IVLE was up and running again at 2139 hrs.
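For reference, the intervals in this timeline can be worked out with a short calculation. The timestamps come from the account above; the variable and function names are chosen purely for illustration.

```python
from datetime import datetime

# Timestamps taken from the timeline above (all on 9 March 2012).
went_down = datetime(2012, 3, 9, 20, 6)   # IVLE went down
detected = datetime(2012, 3, 9, 21, 0)    # follow-up test found the outage
restored = datetime(2012, 3, 9, 21, 39)   # connections re-enabled


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


time_to_detect = minutes_between(went_down, detected)    # 54 minutes
time_to_restore = minutes_between(detected, restored)    # 39 minutes
total_downtime = minutes_between(went_down, restored)    # 93 minutes

print(time_to_detect, time_to_restore, total_downtime)
```

More than half of the total downtime passed before the outage was noticed.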

 

Moving on

Prevention: We are exploring the new high availability and disaster recovery options offered in Microsoft SQL Server 2012 to see how they can work in our context.

Detection: Preliminary checks are now run during data center exercises to confirm that IVLE is working correctly. A failed check will trigger both email and SMS notifications.
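A check of this kind could look roughly like the sketch below. It is not our actual monitoring code: the function names, the probe, and the notifier callables are all hypothetical, and real email and SMS senders would be plugged in where the placeholders are.

```python
import urllib.request
from typing import Callable, Iterable


def http_probe(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts and HTTP error responses.
        return False


def run_check(probe: Callable[[], bool],
              notifiers: Iterable[Callable[[str], None]],
              service: str = "IVLE") -> bool:
    """Run one health probe and call every notifier if it fails."""
    healthy = probe()
    if not healthy:
        message = f"{service} health check failed"
        for notify in notifiers:
            notify(message)
    return healthy
```

A caller would pass a probe such as `lambda: http_probe(some_url)` together with one function per notification channel, so that a single failed check reaches the team by both email and SMS.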

Notifications: From now on, we will use our Facebook group to post any unscheduled downtimes and updates on them. Scheduled downtimes will still be posted using the NUS Message of the Day (MOTD) and here at CIT System Updates.

 

Conclusion

We will strive to learn from such incidents and do our best to provide a reliable service, good communication and good customer support to NUS staff and students.

 

Sincerely,

Jeffery Tay and the IVLE Team