Performance monitoring
You can monitor your cluster performance using the following views:
| View | Description |
|---|---|
| Metrics | Displays a comprehensive array of performance metrics, including for YSQL and YCQL API performance. |
| Queries>Slow Queries | Shows YSQL queries run on the cluster, sorted by running time. Identify slower running database operations and evaluate query execution times over time. |
| Queries>Live Queries | Shows the queries that are currently "in-flight" on your cluster. |
| Cluster Load | Shows the load on the cluster at a glance. Use this view to answer the question Was the system overloaded, and why. |
| Insights | Scans clusters for performance optimizations, including index and schema changes, and to detect potentially hot nodes. |
Use these tools to monitor the performance of your cluster and to determine whether the configuration needs to change. For information on changing or scaling your cluster, refer to Scale and configure clusters.
Overview
The cluster Overview tab displays a summary of the cluster infrastructure, along with time series charts of four key performance metrics for all the nodes in the cluster - Operations/sec, Average Latency, CPU Usage, and Disk Usage, averaged over all the nodes in the cluster.
You can show metrics by region and by node, for the past hour, 6 hours, 12 hours, 24 hours, or 7 days.
You can enable alerts for CPU usage and disk usage. Refer to Alerts.
The following table describes the metrics available on the cluster Overview tab.
| Graph | Description | Use |
|---|---|---|
| Operations/sec | The number of YB-TServer read and write operations per second. | Spikes in read operations are normal during backups and scheduled maintenance. If the count drops significantly below average, it might indicate an application connection failure. If the count is much higher than average, it could indicate a DDoS, security incident, and so on. Coordinate with your application team because there could be legitimate reasons for dips and spikes. |
| Average Latency (ms) | Read: the average latency of read operations at the tablet level. Write: the average latency of write operations at the tablet level. |
When latency starts to degrade, performance may be impacted by the storage layer. |
| CPU Usage (%) | The percentage of CPU use being consumed by the tablet or master server Yugabyte processes, as well as other processes, if any. In general, CPU usage is a measure of all processes running on the server. | High CPU use could indicate a problem and may require debugging by Yugabyte Support. An alert is issued when node CPU use exceeds 70% (Warning) or 90% (Severe) on average for at least 5 minutes. |
| Disk Usage (GB) | Shows the amount of disk space provisioned for and used by the cluster. | Typically you would scale up at 80%, but consider this metric in the context of your environment. For example, usage can be higher on larger disks. An alert is issued when the free storage on any node in the cluster falls below 40% (Warning) and 25% (Severe). |