Monitoring Guide

Monitoring Domains

  • Availability Monitoring
  • Performance Monitoring
  • Resource Usage Monitoring
  • Alerting
    • Healthy - the HA functions of the controller cluster are still ensured and the monitoring system reports no critical errors for the service.
    • Degraded - the monitoring system reports one or more critical errors for the service, but the HA functions of the controller cluster are still ensured.
    • Failed - the HA functions of the controller cluster are no longer ensured and the monitoring system reports one or more critical errors for the service (a minimal classification sketch follows this list).
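
The three states above reduce to two facts per service: whether the HA functions of the controller cluster are still ensured, and whether the monitoring system reports any critical errors. A minimal classification sketch in Python; the type and field names are illustrative, not taken from any particular monitoring tool.

    from dataclasses import dataclass
    from enum import Enum

    class ServiceState(Enum):
        HEALTHY = "healthy"
        DEGRADED = "degraded"
        FAILED = "failed"

    @dataclass
    class ServiceStatus:
        ha_ensured: bool        # HA functions of the controller cluster still ensured
        critical_errors: int    # number of critical errors reported for the service

    def classify(status: ServiceStatus) -> ServiceState:
        # Healthy: HA intact and no critical errors reported.
        if status.ha_ensured and status.critical_errors == 0:
            return ServiceState.HEALTHY
        # Degraded: critical errors reported, but HA is still intact.
        if status.ha_ensured:
            return ServiceState.DEGRADED
        # Failed: HA no longer ensured (the definition above also expects
        # critical errors to be present in this state).
        return ServiceState.FAILED

    # Example: one critical error while HA is still intact -> degraded.
    print(classify(ServiceStatus(ha_ensured=True, critical_errors=1)))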

Monitoring Activities

  • Services, Processes and Clusters Checks
  • Metering
  • Logs Processing
  • Logs Indexing
  • OpenStack Notifications Processing
  • Diagnosing versus Alerting. I think people normally ignore Diagnosing and put most of their effort into Alerting.
  • Time Synchronization

Hardware and System Monitoring

  • IPMI
    • Components temperature
    • Fan rotation
    • Components voltage
    • Power supply status (redundancy check)
    • Power status (on or off)
  • Disk Monitoring (relies on the S.M.A.R.T. interface)
  • Host Monitoring
  • Disk Usage Monitoring
  • Soft RAID Monitoring
  • Filesystem Usage Monitoring
  • CPU Usage Monitoring (for CPU, RAM, swap, filesystem, and process checks, see the psutil sketch after this list)
  • RAM Usage Monitoring
  • Swap Usage Monitoring
  • Process Statistics Monitoring
  • Network Interface Card (NIC) Monitoring
  • Firewall (iptables) Monitoring
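
Most of the host-level usage checks above (filesystem, CPU, RAM, swap, process statistics, NIC counters) can be collected with the cross-platform psutil library, while IPMI sensors and S.M.A.R.T. data require out-of-band tools such as ipmitool and smartctl. The sketch below is a minimal illustration, not a complete monitoring agent; the function name and returned keys are assumptions.

    import psutil

    def host_usage_snapshot(mountpoint="/"):
        """Collect basic host usage figures for the checks listed above."""
        return {
            "cpu_percent": psutil.cpu_percent(interval=1),          # CPU Usage Monitoring
            "ram_percent": psutil.virtual_memory().percent,         # RAM Usage Monitoring
            "swap_percent": psutil.swap_memory().percent,           # Swap Usage Monitoring
            "fs_percent": psutil.disk_usage(mountpoint).percent,    # Filesystem Usage Monitoring
            "process_count": len(psutil.pids()),                    # Process Statistics Monitoring
            "nic_counters": psutil.net_io_counters(pernic=True),    # NIC Monitoring (bytes, packets, errors, drops)
        }

    if __name__ == "__main__":
        # A real check would compare these values against alert thresholds.
        for name, value in host_usage_snapshot().items():
            print(name, value)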

Virtual Machine Monitoring

  • Block IO (these Block IO, Network IO, and CPU counters can be read via libvirt; see the sketch after this list)
    • read_reqs
    • read_bytes
    • write_reqs
    • write_bytes
  • Network IO
    • rx_bytes
    • rx_packets
    • rx_errors
    • rx_drops
    • tx_bytes
    • tx_packets
    • tx_errors
    • tx_drops
  • CPU
    • cputime
    • vcputime
    • systemtime
    • usertime
  • VM Network Traffic (sFlow)
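
The per-VM block I/O, network I/O, and CPU counters above map closely onto what the libvirt Python bindings expose per domain. A minimal sketch, assuming a local QEMU/KVM hypervisor reachable at qemu:///system and the libvirt-python package; the device names 'vda' and 'vnet0' are assumptions and will differ per deployment.

    import libvirt

    # Read-only connection to the local hypervisor (assumed qemu:///system).
    conn = libvirt.openReadOnly("qemu:///system")

    for dom in conn.listAllDomains():
        if not dom.isActive():
            continue

        # Block IO: read_reqs, read_bytes, write_reqs, write_bytes (plus errors).
        rd_req, rd_bytes, wr_req, wr_bytes, errs = dom.blockStats("vda")

        # Network IO: rx/tx bytes, packets, errors, drops.
        (rx_bytes, rx_packets, rx_errors, rx_drops,
         tx_bytes, tx_packets, tx_errors, tx_drops) = dom.interfaceStats("vnet0")

        # CPU: cumulative cpu_time / system_time / user_time in nanoseconds.
        cpu_stats = dom.getCPUStats(True)[0]

        print(dom.name(), rd_bytes, wr_bytes, rx_bytes, tx_bytes, cpu_stats)

    conn.close()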

OpenStack Ceilometer, cloud performance, and hardware requirements

Performance testing results summary

We performed Ceilometer benchmark tests and collected results primarily in the 20-node lab configuration. As expected, we found that the main load on the cloud (i.e., on the nodes running Ceilometer, MongoDB, and related controllers) resulted from polling. Our goal was to determine some guidelines for setting the polling interval to provide the greatest information granularity possible without imperiling overall system performance.

Polling load (and on average, all Ceilometer load on the cloud) actually depends on two factors:

* Number of resources from which metrics are collected. In our benchmark testing, we used VMs as units of measurement, and we tried 360, 1000, and 2000 VMs.

* Polling interval. Generally speaking, the smaller the polling interval, the bigger the load (see the rough sample-volume estimate below).

Together, these imply that for the purposes of our benchmark tests, we could use minimally configured VMs, since in this case, any given VM served merely as a unit for information collection. The VMs we created and polled were set up as single CPU systems, each having 128MB of RAM.
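
Putting the two factors together, a back-of-the-envelope estimate of the sample volume is straightforward. The sketch below is only illustrative: the number of meters collected per VM and the resulting on-disk size depend on the Ceilometer pipeline configuration, and the meters_per_vm figure used here is hypothetical.

    def samples_per_day(n_vms, polling_interval_s, meters_per_vm=10):
        """Rough number of samples written per day; meters_per_vm is hypothetical."""
        polls_per_day = 86400 / polling_interval_s
        return int(n_vms * meters_per_vm * polls_per_day)

    # 2000 VMs polled every minute versus every ten minutes:
    print(samples_per_day(2000, 60))   # 28,800,000 samples per day
    print(samples_per_day(2000, 600))  # 2,880,000 samples per day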

Results and recommendations

This section summarizes some significant results and recommendations. (See the section "Lab configurations, testing processes, and data collected" for specifics of the data collected.)

Test results showed that the load from 2000 VMs with a 1-minute polling interval is permissible for Ceilometer configured with MongoDB.

It’s important to note two key points. First, the IO load in this case was too heavy to run the MongoDB instances on the cloud controllers (as we did); the iostat utility indicated a peak load of close to 100% for MongoDB. Second, a large number of data samples are written to the database: after only one day of running, the MongoDB cluster held 170 GB per device.

To avoid this problem, if you use 2000 VMs with a 1-minute polling interval (or a configuration with a similar or greater load), we recommend running the MongoDB instances as a replica set on separate, dedicated nodes.

If you are using 1000 VMs with 1-minute polling, the IO load is lighter. In this case, MongoDB isn’t blocking other IO operations and works correctly alongside other services.

Some key takeaways from the OpenStack User Survey, April 2016

What’s the OpenStack User Survey?

A snapshot of OpenStack users’ attitudes and deployments.

Takeaway

  • Use Net Promoter Scores (NPS) to generate an accurate comparison. What’s NPS, btw? It’s the percentage of respondents who are promoters minus the percentage who are detractors on a 0-10 "how likely are you to recommend" question.
  • The most dominant category in the user survey remains information technology (68%).
  • Most industries have more than half of respondents running OpenStack in production, which is consistent with the overall 65% of deployments recorded in this survey at a production stage.
  • Community members were asked to select their top reasons for choosing OpenStack and rank these in terms of priority. The vast majority (66%) still focus first on cost, just one point off last cycle’s response.
  • Why do users recommend OpenStack and why don’t they? On the positive side, community support, avoiding vendor lock-in, consistency, stability, and the importance of open source were key drivers. On the negative side, complexity, difficulty in deployment, inconsistency, and lack of stability were cited.
  • Which workloads and frameworks are running on OpenStack? Software development and testing remains the top use case.
  • What container and PaaS tools are used to manage OpenStack applications? Kubernetes (42%) surged ahead of CloudFoundry (24%) in this cycle, increasing 8 points to be the top Platform-as-a-Service (PaaS) tool, while CloudFoundry lost 11 points.

Metrics Definitions

Metrics measured in each OpenStack program:

  • Commit: defined as the action that performs a change in the source code. Bots, merges and other types of automatic activity are removed from the records. In addition, when aggregating several git repositories, this metric only counts unique revisions (unique hashes found in the git repositories). Finally, all branches are included in the analysis.
  • Submitted changeset: a changeset is the process of peer reviewing source code changes. A commit is not merged into the master code of a given project until it is approved by at least one core reviewer of that project. A submitted changeset is defined as any changeset submitted to the Gerrit system. However, given the limitations of the current version of the tool, with at least 5,900 changesets detected as having an erroneous creation date, this metric counts the first patchset upload action.
  • Merged and abandoned changesets: a merge is defined as a patchset that was finally submitted to the source code. An abandoned changeset is a potential merge that was finally dismissed by developers from becoming part of the source code. This status is found in the status of the final patchset. However, although a patchset can be merged or abandoned, this action can be reverted. If a patchset presents several of these changes in the same period of time, only one of them is counted (the very last one). On the other hand, if those changes take place in different periods of analysis, both statuses are counted.
  • Open and closed tickets: a ticket in Launchpad is counted as closed if its status is ‘Fix Released’. The rest of the tickets are counted as open tickets.
  • Active Core Reviewer: a core reviewer can use +2 or -2 actions when reviewing code. However, developers who do not use those actions during a given period cannot be measured as core reviewers. Thus, this metric covers ‘active’ core reviewers: developers who have actually used the +2 or -2 review action. The metric is also filtered by branch, using only ‘master’. This helps to detect the actual core reviewers in each of the projects.
  • Authors: a developer is counted as an author if she is the owner of a patchset sent for review that is merged into the source code. As previously indicated, automatic commits, such as those made by bots, are removed from this analysis.
  • Efficiency closing issues: this metric is a derivation of the Backlog Management Index (BMI), which measures the number of closed tickets relative to the number of opened tickets in a period of time. Values under 1.0 indicate that fewer issues are being closed than are arriving; higher values indicate a better maintenance effort by the community (see the sketch after this list).
  • Efficiency closing changesets: this metric is a derivation of the Backlog Management Index, named the Review Efficiency Index (REI). Analogously to the BMI, it measures the number of closed changesets (merged or abandoned) relative to the total number of new changesets.
  • Time to Merge: the time between the first upload of the first patchset (the submitted changeset, as defined above) and the moment the last patchset of the changeset is merged into the code, as indicated in the comments in the Gerrit tool. This metric is expressed in days.
  • Patchsets per changeset: the total number of iterations of a changeset until it is abandoned or merged.
  • Time waiting for the reviewer or the submitter: a changeset is waiting for a reviewer action when a new patchset upload or a new changeset arrives in the system. On the other hand, a submitter action is required when a negative verification or review action takes place (Verified -1/-2 or Code-Review -1/-2). In addition, when a Code-Review +2 action takes place, it is assumed that the changeset is closing and no more time is registered for either the reviewer or the submitter. For this analysis, patchsets flagged as work in progress are ignored.
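
Both efficiency metrics above are simple ratios of closed to newly opened items in a period. A minimal sketch, assuming the per-period counts are already available; the figures below are made up for illustration.

    def backlog_management_index(closed_tickets, opened_tickets):
        """BMI: closed tickets divided by tickets opened in the same period."""
        return closed_tickets / opened_tickets

    def review_efficiency_index(merged, abandoned, new_changesets):
        """REI: closed changesets (merged + abandoned) over new changesets."""
        return (merged + abandoned) / new_changesets

    # Hypothetical monthly figures: BMI < 1.0 means the backlog is growing.
    print(backlog_management_index(closed_tickets=80, opened_tickets=100))        # 0.8
    print(review_efficiency_index(merged=120, abandoned=30, new_changesets=140))  # ~1.07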

Metrics measured in the general overview:

  • Community structure (core, regular and casual developers): developers are ordered in descending order by the number of commits authored in a given period. Core developers are those who together account for the first 80% of the total commits. Regular developers are those between 80% and 95% of the commits. Casual developers make up the remaining 5%. Bots are ignored in this list of developers.
  • Developers per month: average number of developers per month, ignoring bots.
  • Emails sent: number of emails sent by people to the various mailing lists. Bots are not counted.
  • People sending emails: number of people sending those emails, ignoring bots.
  • People initiating threads: a thread is defined as a list of emails that share the same root. A thread may consist of a single email.
  • Top threads: this list provides the longest threads in terms of number of emails that have a common root email.
  • Questions, answers and comments in Askbot.
  • People asking questions in Askbot: number of people posting a new question.
  • Top visited questions.
  • Top tags: each of the questions has a list of associated tags. The top tags are those with the highest number of repetitions aggregating all of the questions.
  • Messages and people in IRC: this analysis does not count as messages those entries in the IRC channels that merely record people entering or leaving the channel.