Monitoring Domains
- Availability Monitoring
- Performance Monitoring
- Resource Usage Monitoring
- Alerting
- Healthy - when both the HA functions of the controller cluster are still being ensured and no critical errors are being reported by the monitoring system for a service.
- Degraded - when one or more critical errors are reported by the monitoring system for a service but the HA functions of the controller cluster are still being ensured.
- Failed - when both the HA functions of the controller cluster are not being ensured anymore and one or more critical errors are being reported by the monitoring system for a service.
Monitoring Activities
- Services, Processes and Clusters Checks
- Metering
- Logs Processing
- Logs Indexing
- OpenStack Notifications Processing
- Diagnosing versus Alerting. I think people normally will ignore Diagnosing and put lots of effort on Alerting.
- Time Synchronization
Hardware and System Monitoring
- IPMI
- Components temperature
- Fan rotation
- Components voltage
- Power supply status (redundancy check)
- Power status (on or off)
- Disks Monitoring (rely on the S.M.A.R.T interface)
- Host Monitoring
- Disk Usage Monitoring
- Soft RAID Monitoring
- Filesystem Usage Monitoring
- CPU Usage Monitoring
- RAM Usage Monitoring
- Swap Usage Monitoring
- Process Statistics Monitoring
- Network Interface Card (NIC) Monitoring
- Firewall (iptables) Monitoring
Virtual Machine Monitoring
- Block IO
- read_reqs
- read_bytes
- write_reqs
- write_bytes
- Network IO
- rx_bytes
- rx_packets
- rx_errors
- rx_drops
- tx_bytes
- tx_packets
- tx_errors
- tx_drops
- CPU
- cputime
- vcputime
- systemtime
- usertime
- VM Network Traffic (sFlow)