Stats

OpenTSDB exposes a number of metrics about its own performance through several API endpoints. The main stats are available from the GUI via the “Stats” tab, from the HTTP API at /api/stats, or from the legacy API at /stats. The Telnet-style API also supports the “stats” command for fetching them from the command line. These stats can be published right back into OpenTSDB at any interval you like.
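As a rough illustration of publishing stats back into OpenTSDB, the sketch below turns an /api/stats response into Telnet-style put commands. It assumes each JSON entry carries metric, timestamp, value and tags fields; the function name and the sample values are hypothetical and only serve the example.

```python
import json


def stats_to_put_lines(stats_json):
    """Convert an /api/stats JSON array into Telnet-style put commands.

    Assumes each entry has "metric", "timestamp", "value" and "tags" keys.
    """
    lines = []
    for stat in json.loads(stats_json):
        # Sort tags so the output is stable from one run to the next.
        tags = " ".join(f"{k}={v}" for k, v in sorted(stat["tags"].items()))
        lines.append(
            f"put {stat['metric']} {stat['timestamp']} {stat['value']} {tags}"
        )
    return lines


# Hypothetical sample; the host name and value are made up.
sample = (
    '[{"metric": "tsd.connectionmgr.connections", "timestamp": 1369350222, '
    '"value": "1", "tags": {"type": "open", "host": "tsd-1"}}]'
)

for line in stats_to_put_lines(sample):
    print(line)
```

The resulting lines could then be written to a TSD's Telnet port by whatever scheduler you already use.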

Additional stats available include JVM information, storage details (e.g. per-region-client HBase stats) and executed query details. See /api/stats for more details about the other endpoints.

All metrics from the main stats endpoint include a host tag containing the name of the host where the TSD is running. If the tsd.stats.canonical configuration flag is set, the tag name changes to fqdn and the TSD tries to resolve its host name to a fully qualified domain name. Currently all stats are integer values. Each request fetches statistics in real time, so the timestamp reflects the current time on the TSD host.
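The following sketch shows one way a consumer might read a single stat in its Telnet-style text form and cope with either the host or the fqdn tag. The line layout (metric, timestamp, value, then tag pairs) is an assumption, and the function names and sample lines are hypothetical.

```python
def parse_stat_line(line):
    """Split one stats line into metric, timestamp, integer value and tags.

    Assumes the layout: <metric> <timestamp> <value> <tagk=tagv> ...
    """
    metric, timestamp, value, *tag_pairs = line.split()
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    return {
        "metric": metric,
        "timestamp": int(timestamp),
        "value": int(value),  # stats are currently all integers
        "tags": tags,
    }


def reporting_host(stat):
    """Return the TSD's name from the fqdn tag if present, else the host tag."""
    tags = stat["tags"]
    return tags.get("fqdn", tags.get("host"))


# Hypothetical sample lines; host names and values are made up.
short = parse_stat_line("tsd.rpc.received 1369350222 42 type=http host=tsd-1")
full = parse_stat_line(
    "tsd.rpc.received 1369350222 42 type=http fqdn=tsd-1.example.com"
)
print(reporting_host(short), reporting_host(full))
```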

Note

The /api/stats endpoint is a good place to execute a health check for your TSD as it will execute a query to storage for fetching UID stats. If the TSD is unable to reach the backing store, the API will return an exception.
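A health check along these lines could be sketched as follows. This is a minimal example, not a recommended implementation: the function name, the timeout, and the injectable opener parameter (added so the check can be exercised without a live TSD) are all assumptions, as is treating any non-200 status or unparsable body as unhealthy.

```python
import json
import urllib.request


def tsd_is_healthy(base_url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if GET <base_url>/api/stats answers 200 with a JSON array."""
    url = base_url.rstrip("/") + "/api/stats"
    try:
        with opener(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            stats = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers connection failures and urllib's URLError/HTTPError;
        # ValueError covers a body that is not valid JSON.
        return False
    return isinstance(stats, list)
```

A caller would pass the TSD's base URL, for example `tsd_is_healthy("http://localhost:4242")`, where the host and port are placeholders for your own deployment.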

Metric

Tags

Type

Description

tsd.connectionmgr.connections

type=open

Gauge

The number of currently open Telnet and HTTP connections.

tsd.connectionmgr.connections

type=total

Counter

The total number of connections made to OpenTSDB. This includes all Telnet and HTTP connections.

tsd.connectionmgr.exceptions

type=closed

Counter

The total number of exceptions caused by writes to a channel that was already closed. This can occur if a query takes too long, the client closes their connection gracefully, and the TSD attempts to write to the socket. This includes all Telnet and HTTP connections.

tsd.connectionmgr.exceptions

type=reset

Counter

The total number of exceptions caused by a client disconnecting without closing the socket. This includes all Telnet and HTTP connections.

tsd.connectionmgr.exceptions

type=timeout

Counter

The total number of exceptions caused by a socket inactivity timeout, i.e. the TSD neither wrote nor received data from a socket within the timeout period. This includes all Telnet and HTTP connections.

tsd.connectionmgr.exceptions

type=unknown

Counter

The total number of exceptions with an unknown cause. Check the logs for details. This includes all Telnet and HTTP connections.

tsd.rpc.received

type=telnet

Counter

The total number of Telnet RPC requests received

tsd.rpc.received

type=http

Counter

The total number of HTTP RPC requests received

tsd.rpc.received

type=http_plugin

Counter

The total number of HTTP RPC requests received and handled by a plugin instead of the built-in APIs. (v2.2)

tsd.rpc.exceptions

Counter

The total number of exceptions caught during RPC calls. These may be caused by user error or by bugs.

tsd.http.latency_50pct

type=all

Gauge

The time it took, in milliseconds, to answer HTTP requests for the 50th percentile cases

tsd.http.latency_75pct

type=all

Gauge

The time it took, in milliseconds, to answer HTTP requests for the 75th percentile cases

tsd.http.latency_90pct

type=all

Gauge

The time it took, in milliseconds, to answer HTTP requests for the 90th percentile cases

tsd.http.latency_95pct

type=all

Gauge

The time it took, in milliseconds, to answer HTTP requests for the 95th percentile cases

tsd.http.latency_50pct

type=graph

Gauge

The time it took, in milliseconds, to answer graphing requests for the 50th percentile cases

tsd.http.latency_75pct

type=graph

Gauge

The time it took, in milliseconds, to answer graphing requests for the 75th percentile cases

tsd.http.latency_90pct

type=graph

Gauge

The time it took, in milliseconds, to answer graphing requests for the 90th percentile cases

tsd.http.latency_95pct

type=graph

Gauge

The time it took, in milliseconds, to answer graphing requests for the 95th percentile cases

tsd.http.latency_50pct

type=gnuplot

Gauge

The time it took, in milliseconds, to generate the GnuPlot graphs for the 50th percentile cases

tsd.http.latency_75pct

type=gnuplot

Gauge

The time it took, in milliseconds, to generate the GnuPlot graphs for the 75th percentile cases

tsd.http.latency_90pct

type=gnuplot

Gauge

The time it took, in milliseconds, to generate the GnuPlot graphs for the 90th percentile cases

tsd.http.latency_95pct

type=gnuplot

Gauge

The time it took, in milliseconds, to generate the GnuPlot graphs for the 95th percentile cases

tsd.http.graph.requests

cache=disk

Counter

The total number of graph requests satisfied from the disk cache

tsd.http.graph.requests

cache=miss

Counter

The total number of graph requests that were not cached and required a fetch from storage

tsd.http.query.invalid_requests

Counter

The total number of data queries sent to the /api/query endpoint that were invalid due to user errors such as using the wrong HTTP method, missing parameters or using metrics and tags without UIDs. (v2.2)

tsd.http.query.exceptions

Counter

The total number of data queries sent to the /api/query endpoint that threw an exception due to bad user input or an underlying error. See logs for details. (v2.2)

tsd.http.query.success

Counter

The total number of data queries sent to the /api/query endpoint that completed successfully. Note that these may have returned an empty result. (v2.2)

tsd.rpc.received

type=put

Counter

The total number of put requests for writing data points

tsd.rpc.errors

type=hbase_errors

Counter

The total number of RPC errors caused by HBase exceptions

tsd.rpc.errors

type=invalid_values

Counter

The total number of RPC errors caused by invalid put values from user requests, such as a string instead of a number

tsd.rpc.errors

type=illegal_arguments

Counter

The total number of RPC errors caused by bad data from the user

tsd.rpc.errors

type=socket_writes_blocked

Counter

The total number of times the TSD was unable to write back to the Telnet socket due to a full buffer. If this happens, it likely means a number of exceptions were occurring. (v2.2)

tsd.rpc.errors

type=unknown_metrics

Counter

The total number of RPC errors caused by attempts to put a metric without an assigned UID. This only increments if auto metrics is disabled.

tsd.uid.cache-hit

kind=metrics

Counter

The total number of successful cache lookups for metric UIDs

tsd.uid.cache-miss

kind=metrics

Counter

The total number of failed cache lookups for metric UIDs that required a call to storage

tsd.uid.cache-size

kind=metrics

Gauge

The current number of cached metric UIDs

tsd.uid.ids-used

kind=metrics

Counter

The current number of assigned metric UIDs. (NOTE: if random metric UID generation is enabled ids-used will always be 0)

tsd.uid.ids-available

kind=metrics

Counter

The current number of available metric UIDs, decrements as UIDs are assigned. (NOTE: if random metric UID generation is enabled ids-used will always be 0)

tsd.uid.random-collisions

kind=metrics

Counter

How many times metric UIDs attempted a reassignment due to a collision with an existing UID. (v2.2)

tsd.uid.cache-hit

kind=tagk

Counter

The total number of successful cache lookups for tagk UIDs

tsd.uid.cache-miss

kind=tagk

Counter

The total number of failed cache lookups for tagk UIDs that required a call to storage

tsd.uid.cache-size

kind=tagk

Gauge

The current number of cached tagk UIDs

tsd.uid.ids-used

kind=tagk

Counter

The current number of assigned tagk UIDs

tsd.uid.ids-available

kind=tagk

Counter

The current number of available tagk UIDs, decrements as UIDs are assigned.

tsd.uid.cache-hit

kind=tagv

Counter

The total number of successful cache lookups for tagv UIDs

tsd.uid.cache-miss

kind=tagv

Counter

The total number of failed cache lookups for tagv UIDs that required a call to storage

tsd.uid.cache-size

kind=tagv

Gauge

The current number of cached tagv UIDs

tsd.uid.ids-used

kind=tagv

Counter

The current number of assigned tagv UIDs

tsd.uid.ids-available

kind=tagv

Counter

The current number of available tagv UIDs, decrements as UIDs are assigned.

tsd.jvm.ramfree

Gauge

The number of bytes reported as free by the JVM’s Runtime.freeMemory()

tsd.jvm.ramused

Gauge

The number of bytes reported as used by the JVM’s Runtime.totalMemory()

tsd.hbase.latency_50pct

method=put

Gauge

The time it took, in milliseconds, to execute a Put call for the 50th percentile cases

tsd.hbase.latency_75pct

method=put

Gauge

The time it took, in milliseconds, to execute a Put call for the 75th percentile cases

tsd.hbase.latency_90pct

method=put

Gauge

The time it took, in milliseconds, to execute a Put call for the 90th percentile cases

tsd.hbase.latency_95pct

method=put

Gauge

The time it took, in milliseconds, to execute a Put call for the 95th percentile cases

tsd.hbase.latency_50pct

method=scan

Gauge

The time it took, in milliseconds, to execute a Scan call for the 50th percentile cases

tsd.hbase.latency_75pct

method=scan

Gauge

The time it took, in milliseconds, to execute a Scan call for the 75th percentile cases

tsd.hbase.latency_90pct

method=scan

Gauge

The time it took, in milliseconds, to execute a Scan call for the 90th percentile cases

tsd.hbase.latency_95pct

method=scan

Gauge

The time it took, in milliseconds, to execute a Scan call for the 95th percentile cases

tsd.hbase.root_lookups

Counter

The total number of root lookups performed by the client

tsd.hbase.meta_lookups

type=uncontended

Counter

The total number of uncontended meta table lookups performed by the client

tsd.hbase.meta_lookups

type=contended

Counter

The total number of contended meta table lookups performed by the client

tsd.hbase.rpcs

type=increment

Counter

The total number of Increment requests performed by the client

tsd.hbase.rpcs

type=delete

Counter

The total number of Delete requests performed by the client

tsd.hbase.rpcs

type=get

Counter

The total number of Get requests performed by the client

tsd.hbase.rpcs

type=put

Counter

The total number of Put requests performed by the client

tsd.hbase.rpcs

type=rowLock

Counter

The total number of Row Lock requests performed by the client

tsd.hbase.rpcs

type=openScanner

Counter

The total number of Open Scanner requests performed by the client

tsd.hbase.rpcs

type=scan

Counter

The total number of Scan requests performed by the client. These indicate a scan->next() call.

tsd.hbase.rpcs.batched

Counter

The total number of batched requests sent by the client

tsd.hbase.flushes

Counter

The total number of flushes performed by the client

tsd.hbase.connections.created

Counter

The total number of connections made by the client to region servers

tsd.hbase.nsre

Counter

The total number of No Such Region Exceptions caught. These can happen when a region server crashes, is taken offline, or when a region splits.

tsd.hbase.nsre.rpcs_delayed

Counter

The total number of calls delayed due to an NSRE that were later successfully executed

tsd.hbase.region_clients.open

Counter

The total number of connections opened to region servers since the TSD started. If this number is climbing the region servers may be crashing and restarting. (v2.2)

tsd.hbase.region_clients.idle_closed

Counter

The total number of connections to region servers that were closed due to idle connections. This indicates nothing was read from or written to a server in some time and the TSD will reconnect when it needs to. (v2.2)

tsd.compaction.count

type=trivial

Counter

The total number of trivial compactions performed by the TSD

tsd.compaction.count

type=complex

Counter

The total number of complex compactions performed by the TSD

tsd.compaction.duplicates

type=identical

Counter

The total number of data points found during compaction that were duplicates at the same time and with the same value. (v2.2)

tsd.compaction.duplicates

type=variant

Counter

The total number of data points found during compaction that were duplicates at the same time but with a different value. (v2.2)

tsd.compaction.queue.size

Gauge

How many rows of data are currently in the queue to be compacted. (v2.2)

tsd.compaction.errors

type=read

Counter

The total number of rows that couldn’t be read from storage due to an error of some sort. (v2.2)

tsd.compaction.errors

type=put

Counter

The total number of rows that couldn’t be re-written to storage due to an error of some sort. (v2.2)

tsd.compaction.errors

type=delete

Counter

The total number of rows that couldn’t have the old non-compacted data deleted from storage due to an error of some sort. (v2.2)

tsd.compaction.writes

type=read

Counter

The total number of writes back to storage of compacted values. (v2.2)

tsd.compaction.deletes

type=read

Counter

The total number of delete calls made to storage to remove old data that has been compacted. (v2.2)
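Counters in the table above are often more useful when combined. As one illustration, the sketch below derives a UID cache hit ratio per kind from tsd.uid.cache-hit and tsd.uid.cache-miss. It assumes the stats have already been decoded into dictionaries with metric, value and tags keys; the function name and sample numbers are hypothetical.

```python
def uid_cache_hit_ratios(stats):
    """Return {kind: hits / (hits + misses)} for each UID kind with lookups."""
    hits, misses = {}, {}
    for stat in stats:
        kind = stat["tags"].get("kind")
        if stat["metric"] == "tsd.uid.cache-hit":
            hits[kind] = int(stat["value"])
        elif stat["metric"] == "tsd.uid.cache-miss":
            misses[kind] = int(stat["value"])

    ratios = {}
    for kind in hits.keys() & misses.keys():
        total = hits[kind] + misses[kind]
        if total:  # skip kinds with no lookups to avoid dividing by zero
            ratios[kind] = hits[kind] / total
    return ratios


# Hypothetical decoded stats; the values are made up.
sample_stats = [
    {"metric": "tsd.uid.cache-hit", "value": "90", "tags": {"kind": "metrics"}},
    {"metric": "tsd.uid.cache-miss", "value": "10", "tags": {"kind": "metrics"}},
    {"metric": "tsd.uid.cache-hit", "value": "0", "tags": {"kind": "tagk"}},
    {"metric": "tsd.uid.cache-miss", "value": "0", "tags": {"kind": "tagk"}},
]
print(uid_cache_hit_ratios(sample_stats))
```

Because both inputs are counters that accumulate over the life of the TSD, this ratio covers the whole period since startup; comparing two samples taken some time apart would give a ratio for just that window.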