OpenTSDB

Getting Started

This page will walk you through the setup process to get OpenTSDB running. It assumes you've read and understood the overview. With no prior experience, it should take about 15 minutes to get OpenTSDB running, including the time needed to set up HBase on a single node.

Setting up OpenTSDB

OpenTSDB comes pre-packaged with all the necessary dependencies except the JDK and Gnuplot.

The runtime dependencies for OpenTSDB are:
  • A JDK
  • Gnuplot (custom open-source license), version 4.2 minimum, 4.4 recommended, installed in your PATH

Additional compile-time dependencies:
  • GWT 2.4 (ASLv2)
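
A quick way to confirm both prerequisites are in place before going further (exact version strings will vary):
java -version        # any working JDK
gnuplot --version    # should report 4.2 or later, ideally 4.4+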

Before getting started, you need an instance of HBase 0.94 (ASLv2) up and running. If you don't already have one, you can get started quickly with a single-node HBase instance. Earlier versions of HBase will work too, but staying on the latest release is strongly recommended.
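
If you need to set up that single-node instance, a minimal sketch looks like this (substitute the actual 0.94.X tarball you downloaded; standalone mode needs no configuration to get started):
tar xzf hbase-0.94.X.tar.gz
cd hbase-0.94.X
./bin/start-hbase.sh    # starts HBase in standalone mode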

Almost all the following instructions can be copy-pasted directly into a terminal on a Linux or Mac OS X (or otherwise POSIXy) machine. You will need to edit the placeholders which are typeset like-this. A Bourne shell (such as bash or zsh) is assumed. No special privileges are required.

Checkout, compile & start OpenTSDB

OpenTSDB uses the usual build process that consists of running ./bootstrap (only once, when you first check out the code), followed by ./configure and make. There is a handy shell script named build.sh that will take care of all of that for you, and build OpenTSDB in a new subdirectory named build:
git clone git://github.com/OpenTSDB/opentsdb.git
cd opentsdb
./build.sh
From there on, you can use the command-line tool by invoking ./build/tsdb or you can run make install to install OpenTSDB on your system. Should you ever change your mind, there is also make uninstall, so there are no strings attached.
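
If you'd rather run the steps build.sh wraps yourself, they are the ones described above (a sketch; build.sh additionally keeps the build in the separate build subdirectory):
./bootstrap     # only once, when you first check out the code
./configure
make
make install    # optional; 'make uninstall' undoes it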

If this is the first time you've run OpenTSDB with your HBase instance, you first need to create the necessary HBase tables:

env COMPRESSION=NONE HBASE_HOME=path/to/hbase-0.94.X ./src/create_table.sh
This will create two tables: tsdb and tsdb-uid. If you're just evaluating OpenTSDB, don't worry about compression for now. In production / at scale, make sure you use COMPRESSION=lzo and have LZO enabled.
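
For example, the production variant of the table-creation step (this assumes LZO support is compiled into your HBase installation):
env COMPRESSION=lzo HBASE_HOME=path/to/hbase-0.94.X ./src/create_table.sh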

Now start a TSD (Time Series Daemon):

tsdtmp=${TMPDIR-'/tmp'}/tsd    # For best performance, make sure
mkdir -p "$tsdtmp"             # your temporary directory uses tmpfs
./build/tsdb tsd --port=4242 --staticroot=build/staticroot --cachedir="$tsdtmp"
If you're using a real HBase cluster, you will also need to pass the --zkquorum flag to specify the comma-separated list of hosts serving your ZooKeeper quorum. The --cachedir can be purged periodically, e.g. by a cron job.
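
For example (the ZooKeeper hostnames are placeholders):
./build/tsdb tsd --port=4242 --staticroot=build/staticroot --cachedir="$tsdtmp" \
  --zkquorum=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
# Example crontab entry that purges cache files older than a day:
0 3 * * * find /tmp/tsd -type f -mtime +1 -delete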

At this point you can access the TSD's web interface through 127.0.0.1:4242 (if it's running on your local machine).
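
You can also confirm the TSD is answering on its telnet-style port with the stats command used later on this page:
echo stats | nc -w 1 localhost 4242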

Using OpenTSDB

Create your first metrics

Metrics need to be registered before you can start storing data points for them.
./tsdb mkmetric mysql.bytes_received mysql.bytes_sent
This will create 2 metrics: mysql.bytes_received and mysql.bytes_sent.

New tags, on the other hand, are automatically registered whenever they're used for the first time. Right now OpenTSDB only allows you to have up to 2^24 = 16,777,216 different metrics, 16,777,216 different tag names and 16,777,216 different tag values. This is because each one of those is assigned a UID on 3 bytes. Metric names, tag names and tag values have their own UID spaces, which is why you can have 16,777,216 of each kind. The size of each space is configurable, but there is no knob that exposes this configuration parameter right now. So bear in mind that using a user ID or an event ID as a tag value will not work right now if you have a large site.

Start collecting data

So now that we have our 2 metrics, we can start sending data to the TSD. Let's write a little shell script to collect some data off of MySQL and send it to the TSD:
cat >mysql-collector.sh <<\EOF
#!/bin/bash
set -e
while true; do
  mysql -u USER -pPASS --batch -N --execute "SHOW STATUS LIKE 'bytes%'" \
  | awk -F"\t" -v now=`date +%s` -v host=`hostname` \
    '{ print "put mysql." tolower($1) " " now " " $2 " host=" host }'
  sleep 15
done | nc -w 30 host.name.of.tsd PORT
EOF
chmod +x mysql-collector.sh
nohup ./mysql-collector.sh &
Every 15 seconds, the script will collect 2 data points from MySQL and send them to the TSD. You can use a smaller sleep interval for more real-time monitoring, but remember you can't have sub-second precision, so you must sleep at least 1 second before producing another data point.

What does the script do? If you're not a big fan of shell and awk scripting, it may not be obvious how this works. But it's simple. The set -e command simply instructs bash to exit with an error if any of the commands fail. This simplifies error handling. The script then enters an infinite loop. In this loop, we query MySQL to retrieve 2 of its status variables:

$ mysql -u USER -pPASS --execute "SHOW STATUS LIKE 'bytes%'"
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| Bytes_received | 133   |
| Bytes_sent     | 190   |
+----------------+-------+
The --batch -N flags ask the mysql command to remove the human-friendly fluff so we don't have to filter it out ourselves. Then the output is piped to awk, which is told to split fields on tabs (-F"\t") because with the --batch flag that's what mysql will use. We also create a couple of variables: one named now, initialized to the current timestamp, and another named host, set to the hostname of the local machine. Then, for every line, we print put mysql., followed by the lower-case form of the first word, then a space, then the current timestamp, then the second word (the value), another space, and finally host= and the current hostname. Rinse and repeat every 15 seconds. The -w 30 parameter given to nc simply sets a timeout on the connection to the TSD.
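
Put together, with the sample values above, each iteration emits lines like these (hostname illustrative):
put mysql.bytes_received 1288946927 133 host=foo
put mysql.bytes_sent 1288946927 190 host=foo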

Bear in mind, this is just an example; in practice you can use tcollector's MySQL collector.

If you don't have a MySQL server to monitor, you can try this instead to collect basic load metrics from your Linux servers.

cat >loadavg-collector.sh <<\EOF
#!/bin/bash
set -e
while true; do
  awk -v now=`date +%s` -v host=`hostname` \
    '{ print "put proc.loadavg.1m " now " " $1 " host=" host;
       print "put proc.loadavg.5m " now " " $2 " host=" host }' /proc/loadavg
  sleep 15
done | nc -w 30 host.name.of.tsd PORT
EOF
chmod +x loadavg-collector.sh
nohup ./loadavg-collector.sh &
This will store a reading of the 1-minute and 5-minute load average of your server in OpenTSDB by sending simple "telnet-style commands" to the TSD:
put proc.loadavg.1m 1288946927 0.36 host=foo
put proc.loadavg.5m 1288946927 0.62 host=foo
put proc.loadavg.1m 1288946942 0.43 host=foo
put proc.loadavg.5m 1288946942 0.62 host=foo

Batch imports

Let's imagine that you have a cron job that crunches gigabytes of application logs every day or every hour to extract profiling data. For instance, you could be logging the time taken to process a request, and your cron job would compute an average for every 30 second window. Maybe you're particularly interested in 2 types of requests handled by your application, so you'll compute separate averages for those requests, and another average for every other request type. So your cron job may produce an output file that looks like this:
1288900000 42 foo
1288900000 51 bar
1288900000 69 other
1288900030 40 foo
1288900030 59 bar
1288900030 80 other
The first column is a timestamp, the second the average latency for that 30 second window, and the third the type of request we're talking about. If you run your cron job on a day's worth of logs, you'll end up with 8640 such lines. In order to import those into OpenTSDB, you need to adjust your cron job slightly to produce its output in the following format:
myservice.latency.avg 1288900000 42 reqtype=foo
myservice.latency.avg 1288900000 51 reqtype=bar
myservice.latency.avg 1288900000 69 reqtype=other
myservice.latency.avg 1288900030 40 reqtype=foo
myservice.latency.avg 1288900030 59 reqtype=bar
myservice.latency.avg 1288900030 80 reqtype=other
Notice we're simply associating each data point with the name of a metric (myservice.latency.avg) and naming the tag that represents the request type. If each server has its own logs and you process them separately, you may want to add another tag to each line like the host=foo tag we saw in the previous section. This way you'll be able to plot the latency of each server individually, in addition to your average latency across the board and/or per request type.
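
For example, the first line above with such a host tag added (hostname illustrative):
myservice.latency.avg 1288900000 42 reqtype=foo host=web42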

In order to import a data file in the format above (metric timestamp value tags) simply run the following command:

./tsdb import your-file
If your data file is large, consider gzip'ing it first. This can be as simple as piping the output of your cron job to gzip -9 >output.gz instead of writing directly to a file. The import command is able to read gzip'ed files and it greatly helps performance for large batch imports.
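
For example (the log-crunching script name is hypothetical):
./crunch-logs.sh | gzip -9 >output.gz    # cron job writes compressed output
./tsdb import output.gz                  # import reads the gzip'ed file directly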

Self monitoring

Each TSD exports some stats about itself through the simple stats command. You can collect those stats and feed them back to the TSD every few seconds. First, create the necessary metrics:
echo stats | nc -w 1 localhost 4242 \
  | awk '{ print $1 }' | sort -u \
  | xargs ./tsdb mkmetric
This requests the stats from the TSD (assuming it's running on the local host and listening on port 4242), extracts the names of the metrics from the stats, and assigns them UIDs.
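
Each line of the stats output is already in the metric timestamp value tags format used throughout this page, which is why the script below only needs to prefix each line with put to feed it back into the TSD. A stats line looks something like this (values illustrative):
tsd.rpc.received 1288946927 32 type=telnet host=foo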

Then you can use this simple script to collect stats and store them in OpenTSDB:

#!/bin/bash
INTERVAL=15
while :; do
  echo stats || exit
  sleep $INTERVAL
done | nc -w 30 localhost $1 \
  | sed 's/^/put /' \
  | nc -w 30 localhost $1
This way you will collect and store stats from the TSD every 15 seconds.
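
Following the same pattern as the collectors above, save the script, make it executable, and start it in the background, passing the TSD's port as its argument (the file name is illustrative):
chmod +x tsd-stats-collector.sh
nohup ./tsd-stats-collector.sh 4242 &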