Instrumenting your monitoring checks with New Relic

This post is part 3 of 3 in a series on monitoring scalability.

In parts 1 and 2 of this series I talked about check latency and how you can mitigate its effects by splitting data collection + storage out from alerting, while looking at monitoring systems through the prism of an MVC web application.

This final post in the series provides a concrete example of how to instrument your monitoring checks so you can identify which exact parts of your checks are inducing latency in your monitoring system.

When debugging performance bottlenecks, I tend to use a simple but effective workflow:

  1. observe the system
  2. analyse the results
  3. optimise the bottleneck that is having the most impact
  4. rinse and repeat until the system is performing within the expected performance parameters

What if we continue to look at monitoring checks as micro MVC web applications? What tools exist to aid this optimisation workflow, and how can we hook instrumentation into our checks?

The crème de la crème of web app performance monitoring + optimisation tools is New Relic, boasting an incredibly rich feature set that lets you drill down deep into your application while also providing a high level view of app-wide performance.

But is it possible to hook New Relic into applications that aren't web apps? Let's give it a go.

Here's an example monitoring check:

#!/usr/bin/env ruby
#
# Usage: check.rb <time>

class Check
  attr_reader :opts

  def initialize(opts={})
    @opts = opts
  end

  def model(opts={})
    i = opts[:time]
    sleep(1)
    raise [Exception, RuntimeError, StandardError][rand(3)] if rand(i) == 1
    return i
  end

  def view(data)
    i = data
    sleep(rand(i) / 5.0) # float division, so small values of i still sleep
    raise [Exception, RuntimeError, ArgumentError][rand(3)] if rand(i) == 2

    puts "OK: we made it!"
  end

  def run
    data = model(@opts)
    view(data)
  end
end

Check.new(:time => ARGV[0].to_i).run

As you can see, it's flat out like a lizard drinking, inducing latency by sleeping and spicing things up by randomly throwing exceptions. All things considered, it's actually a pretty good example of a monitoring check that aims to misbehave.

Let's start instrumenting!

First up we need to load some libraries:

#!/usr/bin/env ruby

require 'rubygems'
require 'newrelic_rpm'

class Check
  include NewRelic::Agent::Instrumentation::ControllerInstrumentation

Reading through the New Relic API documentation...

# When the app environment loads, so does the Agent. However, the
# Agent will only connect to the service if a web front-end is found. If
# you want to selectively monitor ruby processes that don't use
# web plugins, then call this method in your code and the Agent
# will fire up and start reporting to the service.

...it looks like we need to manually start up the agent:

class Check
  # ...
end

NewRelic::Agent.manual_start

Now we need to tell the New Relic agent what to instrument. The API provides methods to do this at the transaction and method level:

class Check
  # ...

  add_transaction_tracer :run,   :name => 'run', :class_name => '#{self.class}'
  add_method_tracer      :model, 'Nagios/#{self.class.name}/model'
  add_method_tracer      :view,  'Nagios/#{self.class.name}/view'
end

In New Relic parlance, a transaction is an end-to-end process made up of many smaller units of work, and a method is an individual unit of work. In this monitoring check scenario, a transaction is a single invocation of the check.
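To make the distinction concrete without pulling in the agent, here's a rough sketch of the same idea using only the standard library: the whole run is timed as one transaction, and model and view are timed as the methods inside it. TimedCheck, measure, and the sleep durations are all made up for illustration; this is not how New Relic implements its tracers.

```ruby
# Sketch only: mimics the transaction/method split with the standard library.
# TimedCheck and its timings hash are illustrative names, not New Relic API.
class TimedCheck
  attr_reader :timings

  def initialize
    @timings = {}
  end

  # Record how long the block takes under the given label,
  # and pass the block's return value through.
  def measure(label)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    @timings[label] = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end

  def model
    sleep(0.05) # stand-in for fetching data
    42
  end

  def view(data)
    sleep(0.01) # stand-in for formatting the result
    "OK: got #{data}"
  end

  # The "transaction": one end-to-end invocation of the check.
  def run
    measure(:run) do
      data = measure(:model) { model }
      measure(:view) { view(data) }
    end
  end
end

check = TimedCheck.new
puts check.run
check.timings.each { |label, secs| puts format('%-5s %.3fs', label, secs) }
```

The run timing always covers the model and view timings, which is the relationship the transaction view in New Relic visualises.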

When using the New Relic agent with Rails, by default it captures the query parameters passed to the controller action. This helps massively when debugging why a certain transaction takes longer to complete on particular inputs.

Wouldn't it be cool if we could treat the command line arguments to the monitoring check as query parameters to the controller action? That way we could identify which services are running slowly and holding up the check.

Turns out this is just another option to add_transaction_tracer:

add_transaction_tracer :run, :name => 'run', :class_name => '#{self.class}', :params => 'self.opts'

Provided you store all your options in an instance variable with an attr_reader, you can capture whatever data is passed to the check on execution.

One piece of data the New Relic agent captures is an Apdex score for each request. An Apdex score is a measurement of user satisfaction when interacting with an application or service.

In this particular scenario, the "user" is actually a monitoring system, so the score may not be that meaningful. Let's disable it for now:

class Check
  # ...

  newrelic_ignore_apdex
end

So far everything has been very smooth - we've taken an existing check and added some instrumentation points with New Relic - but we're about to hit a complication.

Internally the New Relic agent spawns a separate thread from which it sends all this instrumented data to the New Relic service. Establishing a connection to the New Relic service actually takes a while (15+ seconds in the worst cases), which doesn't quite fit the paradigm we're working in where monitoring checks are returning sub-second results.

Essentially this means that we're collecting all this interesting data with the New Relic agent but it's never actually sent to the New Relic service.

In the PHP world this is a very real problem as PHP processes will exit at the end of each request. In the PHP edition of New Relic there's quite a cute workaround for exactly this problem - each PHP process sends data to a daemon running in the background that buffers it and sends it to New Relic at a regular interval.

Let's emulate this functionality in Ruby:

at_exit do
  NewRelic::Agent.save_data
end

This will serialise the captured data to log/newrelic_agent_store.db as a marshalled Ruby object. The last step is to send this data to New Relic at a regular interval:

#!/usr/bin/env ruby
#
# Usage: collector.rb
#

require 'rubygems'
require 'newrelic_rpm'

module NewRelic
  module Agent
    def self.connected?
      agent.connected?
    end
  end
end

$stdout.sync = true
NewRelic::Agent.manual_start

print "Waiting to connect to the NewRelic service"
until NewRelic::Agent.connected? do
  print '.'
  sleep 1
end
puts

NewRelic::Agent.load_data
NewRelic::Agent.shutdown(:force_send => true)

This waits for the New Relic agent to establish a connection to the New Relic service, loads the data serialised by the checks, and sends it to New Relic.
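If you're curious what this buffer-and-forward pattern looks like stripped of New Relic, here's a minimal sketch using Marshal and a spool file. The Spool module, the file name, and the record layout are my assumptions for illustration, not the agent's actual storage format.

```ruby
require 'tmpdir'

# Sketch only: a stand-in for the save/load hand-off between short-lived
# checks and a long-lived collector. Names and layout are hypothetical.
module Spool
  # Append one record to the spool file (what a check would do at exit).
  def self.save(path, record)
    File.open(path, 'ab') do |f|
      f.flock(File::LOCK_EX) # several checks may exit at the same time
      Marshal.dump(record, f)
    end
  end

  # Read every buffered record and empty the spool (the collector's job).
  def self.drain(path)
    return [] unless File.exist?(path)
    records = []
    File.open(path, 'r+b') do |f|
      f.flock(File::LOCK_EX)
      records << Marshal.load(f) until f.eof?
      f.truncate(0)
    end
    records
  end
end

path = File.join(Dir.mktmpdir, 'check_spool.db')
Spool.save(path, { :check => 'check.rb', :time => 5, :duration => 1.02 })
Spool.save(path, { :check => 'check.rb', :time => 5, :duration => 1.31 })
Spool.drain(path).each { |record| puts record.inspect }
```

The checks only ever pay for a local file append, and the slow network connection is paid for once by the collector.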

Just for testing, we can run our pseudo collector like this:

while true; do echo "Sending" && ruby collector.rb && echo "Sleeping 30" && sleep 30 ; done

And invoke the monitoring check like this:

while true ; do RACK_ENV=development bundle exec ruby check.rb 5 ; done

Now we've got all this set up, we can log into New Relic to view some pretty visualisations of our monitoring check latency:

New Relic dashboard screenshot

New Relic automatically identifies which transactions are the slowest, and lets you deep dive to identify where the slowness is:

New Relic transaction deep dive screenshot

If you haven't got a brass razoo there are plenty of Open Source alternatives to New Relic, but you'll have to do a bit more grunt work to get them going.

This post concludes this series on monitoring scalability! The TL;DR series summary:

  • Check latency is the monitoring system killer.
  • Even in simple environments check latency slows down your monitoring system and obfuscates incidents.
  • To eliminate latency, separate data collection from alerting.
  • Make your monitoring checks as non-blocking as possible.
  • Whenever debugging monitoring performance problems, think of your monitoring system as an MVC web app.
  • Instrument your monitoring checks to identify sources of latency.

You can find the above code examples on GitHub.

If you've enjoyed this series of posts, you can find more of my keen insights, witty banter, and Australian colloquialisms on Twitter, or subscribe to my blog.

monitoring system == web app (when diagnosing performance bottlenecks)

This post is part 2 of 3 in a series on monitoring scalability.

In part 1 of this series I talked about check latency, and how it can batter you operationally if it gets out of hand.

In this post I'm going to propose an alternative way of looking at monitoring systems that can hopefully shed light on some typical performance bottlenecks.

Architecturally, monitoring systems and web applications share many of the same design characteristics:

  • A check is a request to an action on a controller
  • Actions fetch data from a model, and expose a result through a view

Overview diagram of monitoring system/web application request lifecycle

If you look at monitoring systems through this prism, many monitoring performance and scalability problems become simpler to understand:

  • Poorly optimised actions can take a variable amount of time to return a response
  • You get the best performance out of your monitoring system by optimising actions that are slow, and working towards a consistent throughput across all your monitoring checks

Diagram explaining how latency at one end of the pipeline affects the other

Bearing this in mind, what methodologies do we use to remove performance bottlenecks from a web application? Can we apply those same techniques to monitoring systems?

One very common technique is to precompile data to eliminate computationally expensive operations when serving up a result. The precompilation should almost always be a separate process from the main process serving requests.

This has multiple benefits:

  • You shift the computationally expensive and latency inducing work in a monitoring check to a separate process. This makes achieving a low and consistent monitoring check response time vastly easier.
  • You can throw specialised hardware at particular parts of the monitoring pipeline. For example, use a SAN with a huge memory cache or SSDs exclusively in your data storage layer to speed up reads + writes, and beefy multicore machines in your alerting layer to increase your check parallelism.

Diagram explaining where to focus optimisation efforts

Separating data collection + storage from thresholding + notifications is the most crucial part of ensuring consistent check throughput in your monitoring system

In September of 2011 Stephen Nelson-Smith covered why this separation is so important in his article We alert on what we draw. The article can be boiled down to "Your graphs and your alerts should be created from the same data source. This simplifies incident response and analysis."

The other advantage that Stephen didn't cover was the massive throughput boost this gives your monitoring system. It's tempting to say that the throughput boost is a bigger advantage than the operational gains, however the two are inextricably linked. You have massive operational issues if your monitoring system is "running late" on executing monitoring checks, but you've got Buckley's chance of effectively responding to incidents if you have no visibility of those incidents.

My preference is to collect + store the data with collectd + OpenTSDB, however the DevOps community as a whole seems to be very keen on Ganglia + Graphite. YMMV, do your research and use what's best for you.

The most time consuming part of adopting this separation strategy is reworking your monitoring checks to fetch from these data stores. I'd highly recommend writing a small DSL for doing common things like fetching data and comparing results.
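As a rough idea of what such a DSL could look like, here's a minimal sketch. DATA_STORE is a hypothetical in-memory hash standing in for a real store, and the method names (fetch, warning_above, critical_above) are invented for this example.

```ruby
# Sketch only: a tiny check DSL. DATA_STORE is a hypothetical in-memory
# stand-in for a real store such as OpenTSDB or Graphite.
DATA_STORE = { 'web01.load' => 0.7, 'web01.disk_used_pct' => 93.0 }

class CheckDSL
  SEVERITY = [:ok, :warning, :critical]
  attr_reader :status, :message

  # Evaluate the block in the context of a fresh check and return the check.
  def self.evaluate(&block)
    check = new
    check.instance_eval(&block)
    check
  end

  def initialize
    @status  = :ok
    @message = 'OK'
  end

  # Look up the latest value for a metric (raises KeyError if unknown).
  def fetch(metric)
    @metric = metric
    @value  = DATA_STORE.fetch(metric)
  end

  def warning_above(limit)
    escalate(:warning, limit) if @value > limit
  end

  def critical_above(limit)
    escalate(:critical, limit) if @value > limit
  end

  private

  # Only ever raise the severity, so the order of thresholds doesn't matter.
  def escalate(level, limit)
    return if SEVERITY.index(level) <= SEVERITY.index(@status)
    @status  = level
    @message = "#{level.to_s.upcase}: #{@metric} is #{@value} (limit #{limit})"
  end
end

result = CheckDSL.evaluate do
  fetch 'web01.disk_used_pct'
  warning_above 80
  critical_above 90
end
puts result.message
```

Keeping the fetching and comparing in one place means a change of data store only touches the DSL, not every check.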

No approach is perfect, and separating your data from your alerting introduces a different set of problems.

Even by separating the collection from the alerting, your monitoring checks are still essentially going to block when retrieving data from your storage layer. Keeping in mind you will never be able to truly eliminate blocking checks, it is imperative you ensure these new checks block as little as possible, otherwise you'll be subjecting yourself to the same problems.
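One way to bound how long a check can block is to wrap the data store call in a timeout and report UNKNOWN when it fires. This is a sketch under assumptions: bounded_check and the threshold of 90 are hypothetical, and the exit codes follow the Nagios plugin convention. Where your data store client offers its own connect and read timeouts, those are generally a safer choice than Timeout.

```ruby
require 'timeout'

# Sketch only: bound the time a check may spend waiting on the data store.
# Exit codes follow the Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).
UNKNOWN = 3

# Runs the block that talks to the data store; returns [exit_code, message].
# The threshold of 90 is an arbitrary value for illustration.
def bounded_check(limit_seconds)
  value = Timeout.timeout(limit_seconds) { yield }
  value > 90 ? [2, "CRITICAL: #{value}"] : [0, "OK: #{value}"]
rescue Timeout::Error
  [UNKNOWN, "UNKNOWN: data store did not answer within #{limit_seconds}s"]
rescue StandardError => e
  [UNKNOWN, "UNKNOWN: data store unreachable (#{e.class})"]
end

p bounded_check(0.5) { 42 }                         # healthy store
p bounded_check(0.1) { sleep 1; 42 }                # store is hanging
p bounded_check(0.5) { raise Errno::ECONNREFUSED }  # store is down
```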

Write your checks with the expectation that your data store will become unreachable. The biggest drawback to separation is that when your data store becomes unreachable, all of your checks will fail simultaneously.

Diagram explaining where things will break

Operationally this can be a complete nightmare. I have seen many a pager and mobile phone melt under a deluge of notifications saying that data for a check could not be read.

There are two workarounds for this problem:

  • Set up a parent check for all your monitoring checks that simply reads a value out of the data store, and goes critical if the data store can't be accessed. If your monitoring system does parenting properly and you have a good check throughput, this should minimise the explosion of alerts.
  • Build a manual or automatic notification kill switch into your monitoring system so if the shit does hit the fan and your storage layer disappears, you don't suffer from information overload and do something fatally stupid.
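Both workarounds boil down to a bit of logic in front of whatever sends your notifications. A minimal sketch, with a hypothetical Notifier class and made-up check names:

```ruby
# Sketch only: suppress child notifications while the parent check
# (the data store itself) is failing. All names here are hypothetical.
class Notifier
  attr_reader :sent

  def initialize(kill_switch = false)
    @kill_switch = kill_switch
    @sent = []
  end

  # results maps check name => :ok or :critical; parent names the check
  # that every other check depends on.
  def process(results, parent)
    return if @kill_switch # notifications are switched off entirely

    if results[parent] != :ok
      @sent << "#{parent} is #{results[parent]}; child alerts suppressed"
      return
    end

    results.each do |name, state|
      @sent << "#{name} is #{state}" unless state == :ok
    end
  end
end

results = {
  'datastore'  => :critical,
  'web01.load' => :critical,
  'db01.disk'  => :critical
}
notifier = Notifier.new
notifier.process(results, 'datastore')
puts notifier.sent
```

With the data store down, three failing checks produce a single notification instead of three.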

So how do you ensure your monitoring checks aren't suffering from check latency?

In the next post in this series, we'll look at instrumenting your monitoring checks themselves to identify which parts of the checks have bottlenecks.

Monitoring Sucks. Latency Sucks More.

This post is part 1 of 3 in a series on monitoring scalability.

The Monitoring Sucks conversation has been an awesome step in the right direction for defining a common language for describing monitoring concepts and documenting the available tools.

The reasons monitoring sucks are many and varied - poor configuration, poor visualisation, poor scalability, poor data retention - there is a lot of well-founded hate for the available tools (some of which I have authored!)

I want to take a closer look into a problem I grapple with on a daily basis as part of my job: monitoring scalability.

What do I mean by "monitoring scalability"?

For a monitoring system to be considered scalable, I would expect it to execute large volumes of monitoring checks under a variety of conditions (good + bad) with a consistent throughput.

Why is monitoring scalability a problem? Are there deeper, subtler problems that underlie monitoring system architectures in general?

Nagios handles 6000+ checks like a champ. I say this with a completely straight face. At Bulletproof, we have several large instances of Nagios that have been running for years with thousands of checks.

There is one caveat, and it is pretty massive - if your monitoring checks take a variable amount of time to return a result (they have high check latency), you will get reduced throughput, and thus your incident response times become unreliable. This leads to a lack of trust in the monitoring system which can kill you operationally if you don't nip it in the bud.

Let's work through some of the scalability problems by looking at a hypothetical and simplified monitoring system:

Imagine you have a very small monitoring system with 150 checks running. The type of check is irrelevant (in Nagios parlance they could be "service" or "host" checks), however each check is scheduled to be executed every 300 seconds (for the sake of argument, let's just ignore that a 300 second interval is way too long).

To simplify this hypothetical, let's posit that all the checks are running serially in a single thread, and each check takes 1 second to execute and return a result.

At this point, you're golden. All checks are executing in 150 seconds, well within the 300 second window.

Now double the number of checks to 300.

That's one check executed every second. All the checks execute within the execution window, but things are getting tight, and you don't have any spare capacity to add more checks.

Worst of all: what happens when the check response time goes up to 2 seconds? Now you can only execute 50% of your checks within the 300 second window, and your monitoring is 300 seconds "behind".

Now you're suffering from check latency - a world of pain filled with plenty of insidious edge cases to cut yourself on.

My favourite edge case is when a service failure occurs just after a check has executed and returned an OK result. In the above hypothetical, you would be unaware of the failure for 599 seconds. In a monitoring system suffering heavily from check latency, that period of time could be much much longer. Furthermore, the problem is amplified when you're using soft/hard states to eliminate false-positives.
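The numbers in this hypothetical fall out of some simple arithmetic, sketched here (cycle_stats is just an illustrative helper for the serial, single-threaded case):

```ruby
# Worked arithmetic for the serial, single-threaded hypothetical.
def cycle_stats(checks, seconds_per_check, interval)
  cycle = checks * seconds_per_check # time to run every check once
  {
    :cycle_seconds     => cycle,
    :percent_in_window => [100.0, 100.0 * interval / cycle].min,
    :seconds_behind    => [0, cycle - interval].max,
    # A failure that lands just after a check's OK result goes unseen for
    # just under a full cycle or interval, whichever is longer.
    :worst_case_blind  => [cycle, interval].max - 1
  }
end

[[150, 1], [300, 1], [300, 2]].each do |checks, secs|
  puts "#{checks} checks x #{secs}s: #{cycle_stats(checks, secs, 300).inspect}"
end
```

For 300 checks at 2 seconds each this gives a 600 second cycle, 50% of checks inside the window, 300 seconds behind, and a worst case of 599 seconds before a failure is noticed.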

The above hypothetical is a tad contrived as pretty much all monitoring systems execute checks in parallel, but it illustrates the scalability challenges even in a simple scenario.

Executing checks in parallel certainly helps stave off this type of bottleneck, but as you increase the number of checks and the parallelism of your monitoring system, you start running into operating system limitations such as context switching, memory exhaustion (if you use a language that gobbles up memory), or simply running out of CPU time to execute all the checks.
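Here's a small sketch of the effect of parallelism, using a fixed pool of worker threads and sleep as a stand-in for a check waiting on the network. run_checks is a made-up helper, not a real scheduler, and it only models checks that spend their time waiting rather than burning CPU.

```ruby
require 'thread'

# Sketch only: run fake checks across a fixed pool of worker threads.
# Returns [number of completed checks, elapsed seconds].
def run_checks(durations, workers)
  queue = Queue.new
  durations.each_with_index { |duration, id| queue << [id, duration] }
  results = Queue.new

  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  threads = Array.new(workers) do
    Thread.new do
      loop do
        begin
          id, duration = queue.pop(true) # non-blocking pop
        rescue ThreadError
          break                          # queue drained, worker is done
        end
        sleep(duration)                  # stand-in for the check's work
        results << id
      end
    end
  end
  threads.each(&:join)

  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [results.size, elapsed]
end

count, serial   = run_checks([0.05] * 8, 1)
_,     parallel = run_checks([0.05] * 8, 4)
puts format('%d checks: 1 worker %.2fs, 4 workers %.2fs', count, serial, parallel)
```

The pool size is the knob that eventually runs into the operating system limits described above.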

The other enormous gotcha is that when catastrophic failures happen, it's very common to have monitoring checks that simply time out because various network resources between your monitoring server and the machine you're checking are down or misbehaving.

The last thing you want in an emergency situation is delayed alerts that may hide the root cause or feed you bad information.

So how do you mitigate check latency problems to improve your monitoring scalability?

In the next post in this series, we'll look at monitoring systems as a type of complex web application, and investigate some performance optimisation techniques you can apply.

Treetop PEG for Puppet resources

Earlier this year at Puppet Camp EU, Randall Hansen ran an open space session on improving the Puppet user experience.

Lots of sharp edges were identified, but one issue that I raised was the annoying need for trailing commas to break up parameters in resource declarations.

I chatted about this briefly with Luke and for a laugh I decided to write a Treetop Parsing Expression Grammar (PEG) for Puppet resources that supported newlines as the parameter delimiter:

# puppet.treetop
grammar Puppet
  rule resource
    whitespace
    type
    whitespace
    open
    whitespace
    name
    whitespace
    parameters
    whitespace
    close
    whitespace
    {
      def resource_type
        type.text_value
      end

      def resource_name
        name.word.text_value
      end
    }
  end

  rule type
    word
  end

  rule open
    "{"
  end

  rule close
    "}"
  end

  rule name
    quotes word quotes ":"
    {
      def name
        word
      end
    }
  end

  rule word
    [a-zA-Z]+
  end

  rule quotes
   "'" / '"'
  end

  rule parameters
    newline* (whitespace parameter comma_or_newline*)*
  end

  rule parameter
    whitespace
    word
    whitespace
    arrow
    whitespace
    word
    whitespace
  end

  rule arrow
    "=>"
  end

  rule comma_or_newline
    comma / newline
  end

  rule comma
    ","
  end

  rule newline
    "\n"
  end

  rule whitespace
    "\s"* / "\n"+
  end
end

It's throwaway code, but as far as I'm aware it's relatively idiomatic Treetop.

It came in handy earlier this week when explaining PEGs to a new recruit into the R&D team at work.

Said recruit suggested that I publish it, as there aren't too many examples of Treetop PEGs floating around.

To run the PEG over an example snippet:

#!/usr/bin/env ruby

require 'rubygems'
require 'bundler/setup'
require 'polyglot'
require 'treetop'

Treetop.load "puppet"

snippet = <<-SNIPPET
  package { "foobar":
    ensure => present, another => bar, spoons => doom
    foo    => bar
  }
SNIPPET

parser = PuppetParser.new
if @root = parser.parse(snippet.strip)
  puts 'success'
  p @root.resource_type
  p @root.resource_name
else
  puts 'failure'
  puts parser.failure_reason
  puts parser.failure_column
  puts parser.failure_line
  puts parser.failure_index
end

Gemfile for running it and all the above code is in a Gist.
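If you just want to see the comma-or-newline delimiter rule in isolation, it can be approximated in plain Ruby with a regular expression. This is a sketch rather than a replacement for the grammar: parse_parameters is a hypothetical helper and it only handles bare word => word pairs.

```ruby
# Sketch only: the "comma or newline" delimiter rule, using a regular
# expression instead of Treetop. Handles only bare word => word pairs.
def parse_parameters(body)
  body.split(/[,\n]/).each_with_object({}) do |chunk, params|
    next if chunk.strip.empty? # skip blank lines and trailing commas
    key, value = chunk.split('=>', 2).map(&:strip)
    params[key] = value
  end
end

body = <<-BODY
    ensure => present, another => bar, spoons => doom
    foo    => bar
BODY
p parse_parameters(body)
```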

Standing desk adventures

I've been using a standing desk for a bit over four months now, and thus far it's been quite successful.

I decided to test out the idea because I spend a lot of time in front of the screen, and my back has been getting progressively sorer over the last year. Not wanting to transform into the Hunchback of Notre Dame before I turn 30, and aware of the current research suggesting that sitting down for long stretches increases the risk of heart disease, a standing desk seemed like a good alternative.

Because I work from home and the office, I actually have two standing desk setups - one for each location.

The home setup is incredibly makeshift, with the monitor placed on a box of our things left over from the last move, and the laptop sitting on a discontinued IKEA storage box not too dissimilar from the current Prant offering.

The reason for the home setup dodginess is twofold:

  • I wanted to try out the standing desk thing without a large financial commitment
  • We're between houses, and have to use what we've got on hand

Once I discovered that the standing desk was the way I wanted to work for the foreseeable future, I decided to up the ante and buy a real desk for work.

There are plenty of purpose built standing desk options that are far beyond my budget, so the search was on for finding a desk for a reasonable price.

I stumbled across a Frankenstein IKEA desk on Lifehacker, but it:

  • Was too long for the space in the office
  • Required a non-trivial amount of construction with tools not on hand

The same Lifehacker article linked to another blog about repurposing an Utby kitchen table as a standing desk:

Standing desktop and base

This was the model I settled on.

The Utby kitchen table is sold as two separate products, the 105cm high stainless steel underframe (not to be confused with the 90cm one), and the 120x60x3.4cm Vika Amon table top.

At the time the local IKEA store did not have stock of the Vika Amon table top, so based on the advice of a shop assistant I picked up the Galant table top instead, minus the normal Galant frame.

I was assured in-store this would be a reasonable substitute, however as I was finishing off the construction in the office I discovered that the Galant table top I purchased is 2cm thick, as opposed to the Vika Amon which is 3.4cm. This meant that the supplied screws to mount the table top to the underframe would have broken through the surface, so alternative screws were required.

Chalk that one up to a lack of research.

To date I'm very impressed with the desk, with the bottom rail of the frame providing a suitable place to rest my feet against and store my bag behind.

Standing desk at office

As for the longer term effects of using a standing desk 12 hours a day 5 days a week: I haven't been able to find many others sharing their experiences. The sum of what you generally read online is "I just switched over to a standing desk a few hours ago and it's feeling great!!!".

In my experience, the two key factors are:

  • Start out with a comfortable, well worn pair of shoes
  • Make sure the surface you stand on isn't too hard (timber floorboards or carpet are best)

My desk at home has me standing on tiles in Chucks, and boy can I feel it if I'm working long hours. If I use the desk for more than 12 hours at a time, my feet start aching pretty badly.

I see this as a good thing though: if my feet are aching, it means I need to stop work for the day.

I've tried working around this by wearing in different types of shoes (Birkenstock Arizonas and Shimano SH-MT40s), but I generally end up with a pretty nasty headache within an hour.

Once we move into our new house I'll be working on timber floorboards, so I have my fingers crossed that the pain will ease up.

I find that I'm shifting my weight between legs every 5-10 minutes, and am much more inclined to bop along to music now that I'm standing up.

There's also an unspoken advantage to having a standing desk in a busy office environment: people will interrupt you for much shorter periods of time if they have nowhere to sit.

If you're doing pair programming this can be pretty brutal on your partner if they're not used to standing up for long stretches, but it has a distinct advantage when you're trying to shut the world out and keep in the zone.

I also find that by the end of the day I have a mild tingly sensation in my calves, not too dissimilar from the sensation felt when returning from a bike ride with lots of climbs.

Since setting up the standing desk in the office, there are two more setups that have shown up on IKEA Hackers:

I quite like the idea of the extra storage space gained with the CD-riser design, and may opt for that design when we move.

Would I go back to sitting at a desk? Not in the foreseeable future.

I find most of my back pain has gone, and I now value sitting a lot more. :-)

Would I recommend standing desks for others? If you are working long hours in front of a screen and have trouble finding a comfortable setup to work from, it might be worth a shot.

Devops Down Under 2011 videos online

The final videos from Devops Down Under 2011 have just been uploaded to Vimeo.

The list of videos is:

Thanks again to all this year's sponsors, speakers, and attendees for making the conference awesome!

On assholes, ideas, and actions

Almost 2 weeks ago Rusty Russell wrote about assholes in the Open Source community.

Some people interpreted Rusty's post as a tacit toleration of assholes, with Matt Zimmerman commenting on the greader post I shared:

If we didn't tolerate assholes in our community, we would still be producing great software.

I don't believe Rusty's point is that "tolerance of assholes is a necessary evil". His point is that people of all walks of life are flawed, and the tech community is far from being a utopian exception.

The distinction is between ideas and actions.

I have no problem with people holding ideas or beliefs I find offensive or plain wrong. It's that lack of homogeny that makes life interesting and keeps me thinking. It becomes an issue if said person takes action on those ideas and that action hurts other people.

Hence, if you're being an asshole to me or someone else I won't tolerate that and will call you out on it. If you believe that the world is flat or that fasting cures cancer I'll vigorously argue against your position but I won't wield the ban hammer on the work you do.

Getting consistent media source access on Boxee with CIFS

Although a bit late to the party, I've been using Boxee at home for watching movies and TV shows for almost a year now. Its indexing engine is pretty accurate, and the killer feature has to be the Boxee remote app for the iPhone.

I've tried using XBMC but found it suffers from the common open source problem of exposing all its knobs and dials to make everything infinitely tweakable for the end user (also known as: the KDE user experience). Although the free Boxee release is drifting towards an unmaintained state now that Boxee are focusing on the Boxee Box, it's simple and stable enough to use that everyone in my family can get their heads around it in under a few minutes.

The one bug that's consistently annoyed me has been disappearing CIFS shares when a configured media source (a file server) exists on a different subnet to the Boxee itself. This only happened recently when we moved house and put our file server on a separate subnet that's not DHCP serviced.

I scratched my head for a little while on how to solve this. The obvious solution is to shorten the DHCP lease range and put the file server on the same subnet as the Boxee machine, but I'm cautious of what other CIFS bugs lurk beneath the surface, and it feels vastly more complicated than it should be.

The simplest solution I could find was to pass off handling CIFS shares to Linux and add a plain old directory as a media source:

# /etc/fstab: static file system information.
#
# <file system>         <mount point>   <type>  <options>           <dump>  <pass>
proc                    /proc           proc    nodev,noexec,nosuid 0       0
/dev/sda1               /               ext4    errors=remount-ro   0       1
/dev/sda5               none            swap    sw                  0       0
//jules/videos          /media/videos   cifs    guest               0       0

This bypasses Boxee's dodgy CIFS code completely, and has the added benefit of abstracting away the source from Boxee, making it vastly easier to shuffle file shares and disks around without Boxee getting upset.

Testing daemons with Cucumber

Sometimes when writing daemons and doing outside-in testing, you want to fire up the daemon in the background, interact with it, and test the interactions match the behaviour you're expecting.

The biggest challenge is starting and stopping the daemon reliably - you don't want daemons hanging around between tests, consuming extra resources, and mucking up state.

Your test might look something like this:

Feature: Some daemon
  As an operator
  I want to run my daemon
  And have it do things

  Scenario: Command line tool
    When I start my daemon with "kelpie start"
    Then a daemon called "kelpie" should be running

  Scenario: Seeing where my daemon is getting its data from
    When I start my daemon with "kelpie start"
    Then I should see "Using data from /.*mysql" on the terminal

I've found this problem can be handled quite nicely with IO.popen and an at_exit callback.

When /^I start my daemon with "([^"]*)"$/ do |cmd|
  @root = Pathname.new(File.dirname(__FILE__)).parent.parent.expand_path
  command = "#{@root.join('bin')}/#{cmd}"

  @pipe = IO.popen(command, "r")
  sleep 2 # so the daemon has a chance to boot

  # clean up the daemon when the tests finish
  at_exit do
    Process.kill("KILL", @pipe.pid)
  end
end

Then /^a daemon called "([^"]*)" should be running$/ do |daemon|
  `ps -eo cmd |grep ^#{daemon}`.size.should > 0
end

The other way to do this is with the usual backtick method, and poke at $?.

command = "kelpie"
output = `#{command}`
pid = $?.pid

at_exit do
  Process.kill("KILL", pid)
end

The issue here is blocking - if the daemon is doing its job, that command won't return at all, and you certainly won't see any output from the command.

IO.popen's main advantage here is that it spawns a subprocess to execute the command, which won't block Ruby.

So how do you get at the output of the daemon? Easy - we've bound the IO.popen instance to @pipe, so we can just interact with that.

Then /^I should see "([^"]*)" on the terminal$/ do |string|
  output = @pipe.read(250)
  output.should =~ /#{string}/
end

The above code is fairly naive and you'll have to tweak just how much data you read, otherwise Ruby will block on reading that pipe.
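One way to avoid blocking is to wait on the pipe with IO.select and give up after a deadline. A sketch, assuming a hypothetical read_until helper and a one-line Ruby script standing in for the daemon:

```ruby
require 'rbconfig'

# Sketch only: collect a child's output until it matches a pattern or a
# deadline passes, instead of blocking on a fixed-size read.
def read_until(io, pattern, wait_seconds)
  output = String.new
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + wait_seconds
  until output =~ pattern
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    break if remaining <= 0
    break unless IO.select([io], nil, nil, remaining) # nil means timed out
    begin
      output << io.read_nonblock(4096)
    rescue IO::WaitReadable
      next  # nothing to read after all, wait again
    rescue EOFError
      break # the child closed its stdout
    end
  end
  output
end

# A stand-in "daemon": prints one line, then keeps running.
daemon = [RbConfig.ruby, '-e',
          '$stdout.sync = true; puts "Using data from mysql"; sleep 30']
pipe = IO.popen(daemon, 'r')
begin
  puts read_until(pipe, /Using data from .*mysql/, 5)
ensure
  Process.kill('KILL', pipe.pid)
  pipe.close
end
```

In a step definition you would call the helper with @pipe and the expected string, then assert on whatever it returns.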

The last gotcha is that Ruby buffers output to STDOUT by default, so if the daemon you're testing is also written in Ruby, you may not see anything on that pipe even though the daemon has executed its puts and print statements.

You can disable buffered output by including this statement somewhere in your daemon (I like putting it just after requires):

$stdout.sync = true

The times, they are a-changin'

It's been a stupidly long time since I've updated this thing, so here it goes!

30 weeks 5 days

Julia and I are expecting! The baby's due on October 19, so not long to go.

New House

We have a new house. We're situated between Julia's parents and mine, which will work out really well when the baby arrives.

On the work front, I've manned up and gotten a Real Job. After a brief stint at Rails Machine (awesome dudes, you can't go past them for Ruby on Rails hosting and scalability consulting), I'm now working at Bulletproof Networks, being paid to hack on my various open source monitoring projects.

In open source land, there's been a major new release of Visage, cucumber-nagios has had several bugfix and minor feature releases, and Visage also has a brand spanking new website. Due to the baby and new job I had to cancel appearances at Agile 2010 and FrOSCon, however Stephen Nelson-Smith stepped up to the plate and gave a great talk on Visage and Reconnoiter (Theora, h264).

The baby means we'll be calling Sydney home for a while, however (fingers crossed) the new year should bode well for travel!

Behaviour driven infrastructure through Cucumber

Martin Englund posted an open question to the Puppet mailing list a few days ago asking how people are verifying their systems are built as expected:

When you write code, you always use unit testing & integration testing to verify that the application is working as expected, but why don't we use that when we install a system?

What are you using to verify that your system is correctly configured and behaves the way you want?

He linked to a blog post demonstrating how he was verifying his machines using Cucumber.

Coincidentally, about a week earlier at Devopsdays in Gent, I was talking to Felix Kronlage and Bernd Ahlers from bytemine about doing similar things through testing SSH and mail delivery with cucumber-nagios.

It's pretty cool people are thinking about doing BDD/TDD with infrastructure, and it's even cooler that the tools are at the point where doing this is actually possible.

When doing software testing, your testing tool is normally separate from the language and libraries you're building the software with (but almost always written in the same language). When testing your infrastructure, I think it makes perfect sense to apply this practice.

So to practise Behaviour Driven Infrastructure right now, you can use Cucumber as the testing tool, and Puppet as the programming language.

One advantage of practising BDD within sysadmin world is that the testing tools aren't closely coupled to the language our systems are built with - i.e. if you hate Puppet you can use Cfengine, and if Cucumber isn't cutting it use PyUnit.

But to something tangible!

Building on Martin's excellent examples, I've pushed out a new version of cucumber-nagios that includes some basic SSH interaction steps, so you can start building behavioural tests for your infrastructure:

Feature: example.org ssh logins
  As a user of example.org
  I need to login remotely

  Scenario: Basic login
    Given I have no public keys set
    Then I can ssh to "example.org" with the following credentials: 
     | username | password    |
     | lindsay  | spoonofdoom |

  Scenario: Login to multiple hosts
    Given I have no public keys set
    Then I can ssh to the following hosts with these credentials: 
     | hostname           | username | password      |
     | example.org        | matthew  | spladeofpain  |
     | mail.example.org   | john     | forkoffury    |
     | web04.example.org  | steve    | sporkofpork   |

  Scenario: Login with a key
    Given I have the following public keys: 
     | keyfile                |
     | /home/user/.ssh/id_dsa |
    Then I can ssh to the following hosts with these credentials: 
     | hostname         | username |
     | example.org      | matthew  |
     | mail.example.org | mark     |
    
  Scenario: Login with an inline key
    Then I can ssh to the following hosts with these credentials: 
     | hostname         | username | keyfile                |
     | example.org      | luke     | /home/luke/.ssh/id_dsa |
     | mail.example.org | john     | /home/john/.ssh/id_dsa |

The above example shows there are lots of ways to test the same thing (all depending on what you're trying to achieve), but there is now also support for executing shell commands remotely:

  Scenario: Checking /etc/passwd
    When I ssh to "example.org" with the following credentials: 
     | username | password      | keyfile                 |
     | jacob    | spifeofstrife | /home/jacob/.ssh/id_dsa |
    And I run "cat /etc/passwd" 
    Then I should see "jacob" in the output

I don't expect you would do a cat /etc/passwd in a real test, however the step definition is a good example of how to interact with an established SSH connection:

When /^I run "([^\"]*)"$/ do |command|
  @output = @connection.exec!(command)
end

Then /^I should see "([^\"]*)" in the output$/ do |string|
  @output.should =~ /#{string}/
end
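One thing to be aware of: the step interpolates the expected text straight into a regular expression, so characters such as . or * in the step argument are treated as pattern syntax. If you want a literal match, one option is to escape the text first. A sketch in plain Ruby, where output_matches? is a hypothetical helper and not part of cucumber-nagios:

```ruby
# Hypothetical helper: literal match by escaping the text before building the regex.
def output_matches?(output, string)
  !!(output =~ /#{Regexp.escape(string)}/)
end

# Made-up sample output, standing in for what exec! would return.
passwd = "root:x:0:0:root:/root:/bin/bash\njacob:x:1000:1000::/home/jacob:/bin/sh\n"

literal_hit  = output_matches?(passwd, "jacob") # the literal text is present
wildcard_hit = output_matches?(passwd, "j.cob") # "." is no longer a wildcard
```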

You'd use this to write specific tests for checking system behaviour, such as local user logins vs LDAP logins, or the presence of a daemon.

So the resulting process may look something like this:

  1. Use cucumber-nagios to write a specification of how you expect your infrastructure to behave.
  2. Hook your new cucumber-nagios checks into Nagios.
  3. Start writing your manifests/cookbooks.
  4. Run your configuration management tool on the node you're configuring.
  5. Iterate until your monitoring system is silent.

Not only do you have a functional definition of how your machines work that you can use to build your machines, but if your systems deviate from the expected behaviour at any point in the future, you'll get an alert from your monitoring system.

Maintaining both a configuration management system and a set of integration tests might get annoying after a while, but if you ever decide to migrate to another configuration management system or move your machines into the cloud you'd have a set of tests you could apply immediately.

This could also be useful for moving existing machines into a configuration management system. Write a set of integration tests for your unmanaged machines, run your configuration management system over the existing machines, then see if anything is broken.

I'd be interested to hear how this process or similar works for people!

Slides from Devopsdays 2009

On cucumber-nagios:

And Flapjack:

Using Cucumber as a scripting language

Yesterday at the excellent Devopsdays in Gent, Belgium, I proposed an open session to flesh out an idea I had a few weeks ago - to use Cucumber as a general scripting language.

Cucumber's Given/When/Then steps are well suited to procedural tasks like shell scripts, and you would be writing your "scripts" in straightforward language that non-technical users such as managers and clients could understand. Also, as writing a scenario without a Then to close it feels unbalanced, you'd get in the mindset of testing the actions of your "scripts" fairly quickly.

With little more than the hypothesis above, a group of us found a room and started modeling some scenarios. Our focus was on file manipulation, as it was a low hanging fruit and something most scripts do.

We came up with this:

Feature: Copy files around
  
  Scenario: A single file
    Given I am in "/tmp"
    And the file "spoons" exists
    When I copy the file "spoons" to "forks"
    Then the file "forks" should exist
    And the file "forks" should be readable

  Scenario: Multiple files
    Given I am in "/tmp"
    Given the following table of tasty fruit:
      | filename                |
      | apples                  |
      | oranges                 |
      | bananas                 |
      | ananas                  |
      | file with lots o spaces |
      | spoons of : doom        |
    When I create the directory "/tmp/some_other_dir"
    When I copy the tasty fruit in the table to "/tmp/some_other_dir"
    Then the tasty fruit in the table should exist in "/tmp/some_other_dir"

The first scenario is fairly self explanatory, but the second one is where the interesting stuff starts happening.

In the implementation of the "following table" step, we create an instance variable that persists the list of files between steps. This way, we can reference the "tasty fruit" throughout our other steps:

Given /^the following table of (.+):$/ do |name, table|
  # ||= keeps tables stored by earlier steps in the same scenario
  @tables ||= {}
  @tables[name] = table.hashes
end

We use the (.+) regex to capture the name of the table so we can poke at it later on. This design lets you easily use multiple tables throughout your steps that won't conflict with one another:

  Scenario: Multiple files from multiple tables
    Given the following table of tasty fruit:
      | filename |
      | apples   | 
      | oranges  |
    And the following table of baggy baggage:
      | filename |
      | suitcase | 
      | backpack |
    When I copy the baggy baggage in the table to "/tmp/some_other_dir"
    And I copy the tasty fruit in the table to "/tmp/some_other_dir"
    Then the tasty fruit in the table should exist in "/tmp/some_other_dir"
    And the baggy baggage in the table should exist in "/tmp/some_other_dir"

Other steps can reference data in the table by accepting a name and looking it up in the hash of tables:

Then /^the (.+) in the table should exist in "([^\"]*)"$/ do |name, destination|
  @tables[name].each do |file|
    File.exist?(File.join(destination, file["filename"])).should be_true
  end
end
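Stripped of Cucumber, the pattern is just a hash of named row lists held in an instance variable. The sketch below uses a hypothetical TableWorld class as a stand-in for the per-scenario object that steps run in, and plain hashes in place of table.hashes:

```ruby
# Hypothetical stand-in for the object Cucumber evaluates steps in.
class TableWorld
  # Store the rows of a named table; ||= keeps tables registered by earlier steps.
  def remember_table(name, rows)
    @tables ||= {}
    @tables[name] = rows
  end

  # Return the filenames recorded under a table name.
  def filenames(name)
    @tables.fetch(name).map { |row| row["filename"] }
  end
end

world = TableWorld.new
world.remember_table("tasty fruit", [{ "filename" => "apples" }, { "filename" => "oranges" }])
world.remember_table("baggy baggage", [{ "filename" => "suitcase" }, { "filename" => "backpack" }])

fruit   = world.filenames("tasty fruit")
baggage = world.filenames("baggy baggage")
```

Because the hash is only created if it doesn't exist yet, registering the second table leaves the first one in place.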

We also looked at handling permission problems:

  Scenario: Do things I'm not allowed to
    When I create the directory "/usr/bin/wtf"

Here the step will raise an Errno::EACCES exception, and as Cucumber uses a pretty formatter by default, the failed step will appear in red.

Finally we tried copying files with a glob. The initial implementation I banged out was very Unix focused (it used *, which is a very explicit globbing syntax), so we scrapped that idea and wrote our intentions in plain English:

  Scenario: Copy based on a pattern
    Given I am in "/tmp"
    When I create the directory "/tmp/pattern_dir"
    And I copy files beginning with the letters z,y,x to "/tmp/pattern_dir"
    Then they should exist there

The implementation is obvious, and is very understandable (and seemingly powerful) to someone with no knowledge of globbing.
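For reference, one way that step could be implemented is with a character class glob. This is a sketch under assumptions: copy_files_beginning_with is a hypothetical helper, the file names are made up, and the code from the session may well have looked different.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: copy files in source_dir whose names start with one of the letters.
def copy_files_beginning_with(letters, source_dir, destination)
  # ["z", "y", "x"] becomes the glob "[zyx]*"
  pattern = File.join(source_dir, "[#{letters.join}]*")
  files = Dir.glob(pattern).select { |path| File.file?(path) }
  FileUtils.cp(files, destination)
  files.map { |path| File.basename(path) }.sort
end

copied = nil
landed = nil
Dir.mktmpdir do |source|
  Dir.mktmpdir do |destination|
    %w[zebra yak xylophone apple].each { |name| FileUtils.touch(File.join(source, name)) }
    # "z,y,x" is how the letters appear in the step text.
    copied = copy_files_beginning_with("z,y,x".split(","), source, destination)
    landed = Dir.children(destination).sort
  end
end
```

The glob stays hidden inside the step definition, so the person writing the scenario never has to see it.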

People who have used Cucumber in web development will likely note that the above implementation is an example of tightly coupled steps, which is sometimes regarded as an anti-pattern. I'm of the opinion that this is a lot more painful in a web development context than in a procedural/scripting tool one.

From my recollection of Euruko earlier this year, when Aslak was asked whether he considers it an antipattern, he said it can be ok to use depending on the problem you're trying to solve, so I take that as tacit permission that it is ok in this context. :-)

I posted the results of the session to a Gist yesterday, and I have also published a repo with a bundler-ready install process, so people can hack on it more.

After the session I remembered that the feature file doesn't actually have to start with Feature, so it's possible to write standalone scenarios one after another.

When wrapping up, someone in the room pointed out that our implementation actually went one better than being readable by non-technical users - they could probably write the scripts themselves.

This is pretty powerful, and coupled with Cucumber's very cool step generation when running scenarios with undefined steps, makes it very easy to start prototyping a standard library of human readable scripting commands.

There was chatter on the Cucumber mailing list a few weeks ago about providing alternate interfaces for writing and executing Cucumber features, and it could be cool to see a drag-and-drop interface with a library of common tasks that calls out to Cucumber to execute them. You could even build something quite beautiful with HotCocoa.

Anyhow, if you think anything mentioned above is a cool idea, check out the code and start hacking!

cucumber-nagios 0.5.0

I've just released a new version of cucumber-nagios, and this release is quite a milestone!

Big changes in this release include:

  • Removal of the ghetto bundler in favour of wycats/carllerche's bundler.

    In previous releases, you'd use a rake task to freeze in dependencies. This produced all sorts of weird problems when new versions of the dependencies were released, it didn't handle gems with C extensions that well, and could be very slow if you ran it multiple times.

    Now that bundler has started maturing, cucumber-nagios has made the switch. It eliminates all the aforementioned issues, and integrates cleanly with RubyGems.

  • Renaming of the gem to cucumber-nagios from auxesis-cucumber-nagios, as GitHub have discontinued building gems. The gem is now published on Gemcutter.

  • The project generator now prints out helpful instructions when you generate a new project.

  • cucumber-nagios projects have built-in steps for benchmarking response times. The following example explains it best:

Feature: slashdot.org
  To keep the geek masses satisfied
  Slashdot must be responsive
    
  Scenario: Visiting a responsive front page
    Given I am benchmarking
    When I go to http://slashdot.org/
    Then the elapsed time should be less than 5 seconds

  • A --debug switch can be passed to cucumber-nagios to print out the command line built and executed. This can be useful when writing your features.

  • Removal of several unnecessary support files, and cleanups of helpers and Cucumber's World object setup, in line with an updated version of Webrat.

  • Refactoring of the Nagios formatter for Cucumber to use Cucumber 0.4.0's formatter interface. For users, this simply means cucumber-nagios now works with Cucumber 0.4.0 (the latest at time of this release).
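To give a feel for what the benchmarking steps in the list above are doing, the timing logic can be sketched in plain Ruby. The Benchmarking class and its method names are hypothetical and are not the step definitions shipped with cucumber-nagios; a short sleep stands in for the HTTP request.

```ruby
# Hypothetical sketch of the timing behind "Given I am benchmarking".
class Benchmarking
  # Run the block and record how long it took, using a monotonic clock.
  def measure
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    @elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end

  # Mirrors "Then the elapsed time should be less than N seconds".
  def elapsed_less_than?(seconds)
    @elapsed < seconds
  end
end

benchmark = Benchmarking.new
benchmark.measure { sleep 0.1 } # stand-in for fetching the page
within_limit = benchmark.elapsed_less_than?(5)
```

A monotonic clock is used here so the measurement isn't thrown off if the system clock is adjusted mid-check.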

Although I've done a fair amount of testing, there will invariably be bugs, which can be reported on GitHub.

Switching to Jekyll

After a quick migration, I've switched this blog from WordPress to Jekyll.

I've done this for several reasons:

Cool things now that I've migrated:

  • I can version control my blog.
  • My blog content is flat file, so I just edit the content and push. This also means my blog can be easily distributed and backed up.
  • Pulling in Flickr photos, Last.fm listening and tweets no longer blocks the page load. I wrote a cute little MooTools class to display the info, and a cron job to fetch it in the background.
  • Comments are all preserved, as I switched to Disqus several weeks ago. The WordPress => Disqus import was mind numbingly easy using the Disqus plugin.

If you want minimalism in your blogging engine and full control over its appearance, Jekyll might be worth checking out.