High Availability

Continuous Service

Businesses and organizations usually host their IT services in large data centers. The data centers, in turn, are responsible for providing the highest level of reliability and availability to business owners. Field studies put the annual cost of downtime and data loss at over $100 billion. Even a few minutes of service downtime can damage a business's reputation and, in some cases, lead to bankruptcy. Many causes can threaten the continuous service and reliability of a storage system, including unmanaged events such as human errors, processor and board failures, power failures, and network failures, and managed events such as updates, upgrades, backups, and recovery. Storage designers should be aware of the possibility of these incidents and ensure minimum downtime for both managed and unmanaged events.

Availability metrics in data storage systems

SAN storage systems are expected to provide continuous 24/7 service, and storage manufacturers should report the availability their products provide. Storage availability depends on the availability of all hardware and software components in the storage stack and the data center. The table below shows the conventional metrics used to report storage system availability:

Availability   Unavailability   Downtime per year      Downtime per week
99%            1%               Less than 4 days       Less than 2 hours
99.9%          0.1%             Less than 9 hours      Less than 11 minutes
99.99%         0.01%            Less than 1 hour       About 1 minute
99.999%        0.001%           Less than 6 minutes    About 6 seconds
99.9999%       0.0001%          About 32 seconds       About 0.6 seconds
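
The downtime columns follow directly from the availability percentage: downtime is the unavailable fraction multiplied by the length of the period. The sketch below recomputes them, assuming a 365-day year; the function and constant names are illustrative and do not come from any storage product.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year
SECONDS_PER_WEEK = 7 * 24 * 3600


def downtime_seconds(availability_percent, period_seconds):
    """Expected downtime in seconds over a period for a given availability."""
    unavailability = 1.0 - availability_percent / 100.0
    return unavailability * period_seconds


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
        per_year = downtime_seconds(pct, SECONDS_PER_YEAR)
        per_week = downtime_seconds(pct, SECONDS_PER_WEEK)
        print(f"{pct}%: {per_year:.1f} s/year, {per_week:.2f} s/week")
```

For example, 99.999% availability corresponds to roughly 315 seconds (a little over 5 minutes) of downtime per year.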

Major unavailability causes in data storage systems

  • Administrative and operational human errors
  • Maintenance, update, and upgrade services
  • Hardware/software extension
  • Failure of hardware/software components
  • Disasters such as floods, earthquakes, and fires

Methodologies for increasing availability

A data storage system should provide continuous service by tolerating the failure of hardware and software, using appropriate redundancies and dependability mechanisms such as:

  • Redundant hardware/software
  • Automatic error/failure detection
  • System reconfiguration
  • Automatic test and fault detection

The availability of data storage systems can be improved by mechanisms that decrease the component failure rate, limit the effects of a failure at the system level, shorten the repair/recovery time, and remove single points of failure (SPF). Disaster recovery mechanisms such as remote backups and mirrors also contribute.
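
The effect of these mechanisms can be illustrated with the standard steady-state availability model, MTBF / (MTBF + MTTR). The sketch below is a simplification that assumes components fail independently; the function names and the numbers in the usage note are hypothetical and are not taken from any measured system.

```python
from math import prod


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one component from MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(availabilities):
    """All components are required: each one is a single point of failure."""
    return prod(availabilities)


def redundant(availabilities):
    """Any one component suffices: the group fails only if all of them fail."""
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    controller = availability(mtbf_hours=9990, mttr_hours=10)  # hypothetical
    print(f"single controller:     {controller:.6f}")
    print(f"two in series:         {series([controller, controller]):.6f}")
    print(f"two redundant:         {redundant([controller, controller]):.6f}")
```

In this model a lower failure rate raises the MTBF, faster repair lowers the MTTR, and redundancy replaces a series element with a redundant group. With two components of availability 0.99 each, the series combination drops to 0.9801 while the redundant pair rises to 0.9999.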

Disk subsystem availability

The disk subsystem of a SAN storage system is composed of different components, each of which can be an SPF. Hence, the failure of any single component can result in the failure of the entire system, and fault tolerance mechanisms should be applied to all components of the disk subsystem. Some major fault tolerance mechanisms are listed below:

  1. Using RAID mechanisms: without redundancy, the data on a failed disk is permanently lost. RAID mechanisms distribute user data across several disks and store parity on one or two redundant disks to tolerate disk failures. In a RAID architecture, the data of a failed disk can be recovered from the data and parity on the remaining operating disks.
  2. Each disk locally uses a Hamming code for error detection and correction.
  3. Each disk is connected to the controllers via dual channels. Hence, the failure of one channel can be tolerated.
  4. The disk subsystem can have redundant controllers to tolerate controller failures.
  5. Redundancy can be applied to the power supply, cooling system, and backup batteries.
  6. Multiple paths between the hosts and the storage system keep the connection available even if one path fails.
  7. Periodic instant copies can help in failure cases such as logical errors. For example, if the address table is accidentally lost, the system can be recovered from a recent instant copy.
  8. With a remote mirroring facility, disasters and terrorist attacks do not have to result in data loss.
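
The recovery idea behind parity-based RAID (item 1) can be sketched with XOR parity. This is a minimal illustration of a single-parity scheme, as used in RAID 4 or RAID 5, and not the implementation of any particular controller; the block contents and the helper name xor_blocks are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical contents of one stripe spread over three data disks.
data_disks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data_disks)

# Simulate the loss of disk 1 and rebuild it from the survivors plus parity.
failed = 1
survivors = [block for i, block in enumerate(data_disks) if i != failed]
rebuilt = xor_blocks(survivors + [parity])

if __name__ == "__main__":
    print("rebuilt block matches original:", rebuilt == data_disks[failed])
```

Because XOR is its own inverse, XOR-ing the surviving blocks with the parity block yields exactly the missing block. Tolerating two simultaneous disk failures requires a second, independent redundancy block, which this sketch does not cover.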

With LUN masking, access by other hosts to a virtual space is controlled or restricted, which prevents unintentional modification or removal of data by those hosts.
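
Conceptually, LUN masking is an access table that maps each LUN to the hosts allowed to see it. The sketch below models only that idea; the class name, method names, and host identifiers are hypothetical, and real storage systems expose this through their own management interfaces.

```python
class LunMaskingTable:
    """Toy model: each LUN is visible only to explicitly granted hosts."""

    def __init__(self):
        self._allowed = {}  # LUN id -> set of host identifiers

    def grant(self, lun, host):
        """Allow a host to access a LUN."""
        self._allowed.setdefault(lun, set()).add(host)

    def revoke(self, lun, host):
        """Remove a host's access to a LUN (no error if it had none)."""
        self._allowed.get(lun, set()).discard(host)

    def can_access(self, lun, host):
        """Access is denied unless the host was explicitly granted the LUN."""
        return host in self._allowed.get(lun, set())


if __name__ == "__main__":
    table = LunMaskingTable()
    table.grant(lun=0, host="host-a")  # hypothetical host identifier
    print(table.can_access(0, "host-a"))
    print(table.can_access(0, "host-b"))
```

The deny-by-default check is the design point: a host that was never granted a LUN cannot modify or remove its data by accident.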

