How to avoid delays on HA storage during RAID malfunction

Troubleshooting
June 03, 2015

Though RAID arrays offer increased redundancy, capacity, and performance over standard disk systems, some administrators assume that a RAID will never fail, which is wrong. Should a hard drive or volume fail on the RAID array, it is good practice to fix the problem as soon as possible. Since, in most cases, StarWind VSAN HA devices are located on top of a RAID array, a RAID malfunction can affect the performance and availability of the underlying storage, which is critical for HA storage operation. Operational delays on the RAID array become significant when its state changes to “Degraded,” resulting in I/O delays and a slowdown of the whole environment.

RAID performance degradation can affect StarWind HA storage behavior in the following ways.

1. HA storage high response time – events about delays in read or write operations are logged in the StarWind and Windows Application logs. VMs/databases may stop responding for a while or hang completely.

Cause:

background tasks running during high storage workload;
disk(s) in a predictive failure state;
a long queue of I/O operations on the underlying storage;

Troubleshooting steps:

reschedule background tasks so that they do not run during high storage workload;
check the physical disks and RAID health state;
review RAID cache configuration settings;
stop and disable the StarWind service on the problematic node before fixing (rebuilding) the RAID;

2. Critical device response time – StarWind HA devices become “Not synchronized” for 30 minutes on the problematic node because of high storage response time. The issue can also appear as looping synchronization. VMs/databases may stop responding for a while or hang completely.

Cause:

RAID controller background tasks running during high storage workload;
RAID in a “Degraded” state;

Troubleshooting steps:

reschedule background tasks so that they do not run during high storage workload;
check the physical disks and RAID health state;
stop and disable the StarWind service on the problematic node before fixing (rebuilding) the RAID;

3. HA device in a “Non-active” state on one of the nodes – StarWind Management Console reports the “Non-active” state, and the iSCSI initiator has trouble connecting to the targets on the problematic node.

Cause:

RAID in a “Failed” state;
RAID controller malfunction;

Troubleshooting steps:

check the physical disks and RAID health state;
stop and disable the StarWind service on the problematic node before fixing (rebuilding) the RAID;
investigate the RAID controller logs and contact the hardware vendor to fix the issue;
recreate the HA device replica on the problematic node;
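
The three scenarios above can be kept as a small lookup table, for example in an internal runbook script. The following is only a sketch: the dictionary keys and function names are hypothetical and not part of any StarWind tool, and the causes and steps are simply copied from the lists above.

```python
# Illustrative runbook lookup. Keys and function names are hypothetical;
# the causes and steps are taken from the three scenarios in this article.
RESCHEDULE = "reschedule background tasks away from high storage workload"
CHECK_HEALTH = "check the physical disks and RAID health state"
STOP_SERVICE = "stop and disable the StarWind service before rebuilding the RAID"

SCENARIOS = {
    "high_response_time": {
        "causes": [
            "background tasks during high storage workload",
            "disk(s) in a predictive failure state",
            "long queue of I/O operations on the underlying storage",
        ],
        "steps": [
            RESCHEDULE,
            CHECK_HEALTH,
            "review RAID cache configuration settings",
            STOP_SERVICE,
        ],
    },
    "not_synchronized": {
        "causes": [
            "RAID controller background tasks during high storage workload",
            "RAID in a Degraded state",
        ],
        "steps": [RESCHEDULE, CHECK_HEALTH, STOP_SERVICE],
    },
    "non_active": {
        "causes": ["RAID in a Failed state", "RAID controller malfunction"],
        "steps": [
            CHECK_HEALTH,
            STOP_SERVICE,
            "investigate RAID controller logs and contact the hardware vendor",
            "recreate the HA device replica on the problematic node",
        ],
    },
}


def troubleshooting_steps(symptom):
    """Return the ordered checklist for a symptom key (KeyError if unknown)."""
    return list(SCENARIOS[symptom]["steps"])


def common_steps():
    """Return the steps shared by every scenario, in their original order."""
    first = SCENARIOS["high_response_time"]["steps"]
    return [
        step
        for step in first
        if all(step in s["steps"] for s in SCENARIOS.values())
    ]


if __name__ == "__main__":
    for step in troubleshooting_steps("non_active"):
        print("-", step)
```

Calling `common_steps()` shows that checking disk and RAID health and stopping the StarWind service before a rebuild apply to every scenario, which makes them a reasonable starting point regardless of the symptom.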

NOTE: The StarWind VSAN service may be started again and the replica recreated only after the RAID issues have been fixed (i.e., all faulty drives have been replaced and the RAID rebuild has finished).
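
The condition in the note can be expressed as a simple pre-flight check before the service is restarted. This is a minimal sketch with hypothetical names; how the number of faulty drives and the rebuild status are obtained depends on the RAID controller's own tools and is assumed to be done elsewhere.

```python
def may_restart_service(faulty_drives_remaining, rebuild_finished):
    """Return True only when the note's conditions are both met.

    faulty_drives_remaining: number of faulty drives not yet replaced
        (hypothetical input, read from the RAID controller's tooling).
    rebuild_finished: True once the RAID rebuild has completed.
    """
    if faulty_drives_remaining < 0:
        raise ValueError("faulty_drives_remaining must not be negative")
    return faulty_drives_remaining == 0 and rebuild_finished


if __name__ == "__main__":
    # Rebuild still running: keep the service stopped and disabled.
    print(may_restart_service(0, False))
    # All drives replaced and rebuild done: safe to start and recreate replica.
    print(may_restart_service(0, True))
```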