Designing a monitoring and control system for 200+ servers

A few months ago I had to design a proactive monitoring system that could handle 200+ servers with ease. The idea was not to build a simple monitor that passively watched the server farms and notified the admins when some threshold was reached.

Keeping a team watching the servers 24/7 has its problems; if the system could lighten their load, that would be great.

I wanted the system to have some capability to react according to the scenario it faced at the moment. This scenario is represented by all the sensor readings loaded at the time, and it may be contained within a single server or span a whole farm/cluster. With this reactive capability, humans are notified only of situations the system couldn't handle/contain.

Sorry if I offended someone with the project name (Skynet), too many movies… lol. But FYI, it has a TTS library used for many things, one of them being saying “hasta la vista, baby” 😛

Architecture

  • Starting with the core piece: it was written in Java for two main reasons. The first was that at the time I had only a few days to implement a prototype, and since I have years of experience in Java, it is where I was most productive.
  • The second reason was “Reflection”. I know many other languages let you inspect and execute code at runtime, but again, previous experience with the technology allowed me to cut corners. Runtime inspection/execution was mandatory, since I wanted to be able to add components/sensors/… at any time and, more importantly, abstract all of this (see the sketch below).
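
To make that concrete, here is a minimal sketch of loading a sensor by class name at runtime via reflection. The Sensor interface and class names here are hypothetical illustrations, not Skynet's actual API.

```java
import java.lang.management.ManagementFactory;

// Hypothetical plugin contract; Skynet's real interfaces are not shown in the post.
interface Sensor {
    String name();
    double read();
}

// Example sensor: system load average via the JMX OperatingSystemMXBean.
class LoadAverageSensor implements Sensor {
    public String name() { return "load-average"; }
    public double read() {
        return ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
    }
}

public class SensorLoader {
    // Instantiate a sensor by class name at runtime, so new sensor types
    // can be dropped onto the classpath without touching the core.
    static Sensor load(String className) throws ReflectiveOperationException {
        return (Sensor) Class.forName(className)
                             .getDeclaredConstructor()
                             .newInstance();
    }

    public static void main(String[] args) throws Exception {
        // A server profile would list sensor class names like this one.
        Sensor s = load("LoadAverageSensor");
        System.out.printf("%s = %.2f%n", s.name(), s.read());
    }
}
```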

Skynet Schematic

Input sources

  • Currently Skynet has many input sources. The main one is SSH sessions opened to each server, which allow monitoring everything on each server; according to each server's profile, the right set of sensors is loaded at runtime using reflection.
  • These SSH sessions are, of course, also used by Skynet to actively interact with the servers: for example, blocking an IP, keeping mail queues clean, stopping some non-critical services if a server is under stress, etc. (a rough sketch follows this list). All this is done automatically, and if the problem fails to be contained, then humans are alerted.
  • The second main input source is Mail. This is great since end-users/customers can interact with the system without knowing it and without human intervention, for example requesting an IP unblock from a server in a shared hosting cluster.
  • There are many others, like RSS feeds, SMS and so on. RSS feed support is a funny story: Skynet actively scans defacement feeds (like zone-h and others) for IPs belonging to any of the servers connected to it. If a match is found, it alerts the admins, allowing them to warn the website owner.
  • Applications are endless.
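
As promised above, here is a rough sketch of one such SSH reaction. It assumes the JSch library (the post doesn't name the SSH client used), and the host, user, key path and iptables command are all placeholder values.

```java
import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class SshReaction {
    // Runs a single command over SSH with key authentication.
    // Host/user/key are placeholders, not Skynet's real configuration.
    static int run(String host, String user, String keyPath, String command) throws Exception {
        JSch jsch = new JSch();
        jsch.addIdentity(keyPath);               // private-key auth, no passwords
        Session session = jsch.getSession(user, host, 22);
        session.setConfig("StrictHostKeyChecking", "yes"); // host must be in known_hosts
        session.connect(5000);
        ChannelExec channel = (ChannelExec) session.openChannel("exec");
        channel.setCommand(command);
        channel.connect();
        while (!channel.isClosed()) Thread.sleep(50); // wait for the command to finish
        int status = channel.getExitStatus();
        channel.disconnect();
        session.disconnect();
        return status;
    }

    public static void main(String[] args) throws Exception {
        // Example reaction: block an abusive IP (hypothetical values).
        run("web01.example.com", "skynet", "/dev/shm/skynet_key",
            "iptables -I INPUT -s 203.0.113.7 -j DROP");
    }
}
```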

Data

  • All events and readings are stored in an offsite Redis instance, which adds persistence.
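
The post doesn't describe the key layout, but here is a minimal sketch of how readings and events could be pushed into that Redis instance using the Jedis client; the host and key names are hypothetical.

```java
import redis.clients.jedis.Jedis;

public class EventStore {
    public static void main(String[] args) {
        // Offsite Redis instance; host/port are placeholders.
        try (Jedis redis = new Jedis("redis.example.com", 6379)) {
            long now = System.currentTimeMillis();
            // Hypothetical layout: one list of timestamped readings per server/sensor...
            redis.rpush("readings:web01:load-average", now + ":0.42");
            // ...and a global event log, capped so it doesn't grow forever.
            redis.lpush("events", now + " web01 blocked ip 203.0.113.7");
            redis.ltrim("events", 0, 9999);
        }
    }
}
```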

Output

  • The current version has modules for SMS, Mail and Twitter. Twitter is used almost like a timeline log of each action Skynet takes, and since there is a Twitter client in almost any electronic device nowadays, it's the perfect on-the-go log solution. (The feed is kept private.)
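
A sketch of that Twitter timeline log, assuming the Twitter4J library (the post doesn't name one) with OAuth credentials already configured in twitter4j.properties; the log message is an invented example.

```java
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;

public class TwitterLog {
    // Posts one action to the (private) timeline log.
    static void log(String action) {
        try {
            Twitter twitter = TwitterFactory.getSingleton();
            twitter.updateStatus(action);
        } catch (TwitterException e) {
            // Logging must never take the core down; report and move on.
            System.err.println("tweet failed: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        log("[web01] mail queue flushed (2311 frozen messages removed)");
    }
}
```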

Security

  • The machines where the Skynet core runs are in a secure location without any direct inbound connections from the web. Since SSH sessions are used to talk with the servers, there would be a real danger if the location were compromised.
  • Key authentication is used, and the keys are kept only in volatile memory. If the power goes down they are lost, so even if someone steals the machines, they will not be able to re-establish the sessions with the servers from the new location.
  • It is totally autonomous, accepting only an emergency shutdown in case something starts to deviate. This shutdown command is not sent directly to Skynet, since there is no direct connection to it from the outside; instead, it is saved in a location that Skynet polls for emergency commands. (Botnet style; a sketch follows.)
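
A minimal sketch of that botnet-style polling loop, using the modern java.net.http client for brevity; the dead-drop URL and the command format are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EmergencyPoller {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        // Hypothetical dead-drop URL: Skynet reaches out, nothing connects in.
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("https://drop.example.com/skynet/command")).build();
        while (true) {
            HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
            if (resp.statusCode() == 200 && "SHUTDOWN".equals(resp.body().trim())) {
                System.out.println("emergency shutdown requested");
                System.exit(0);
            }
            Thread.sleep(60_000); // check once a minute
        }
    }
}
```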

Web Architecture

  • Here comes my favorite part of all this. That Redis instance had to be accessed somehow, and for me the only web that makes sense (for this kind of thing) is realtime.
  • To achieve realtime (and bragging rights) you have to build it in full Javascript, so I needed a good async data controller on the server side; this was the big opportunity for Node.JS in this project.
  • Node.JS allowed me to build something with socket.io really quickly, since some code was reused in the web clients. This allowed quick, painless and direct realtime access to the data in the Redis instance.
  • Add a few cool UI libraries into the pan (like Google Chart, jQuery and jGrowl) and a realtime dashboard was built overnight.

After Skynet was online and “reactive”, human intervention in maintenance tasks and in solving simple event scenarios dropped drastically. More importantly, it filters the problems, solving the simple ones and passing only the harder ones to the sysadmins, boosting productivity.

5 Comments

  1. open-source, please? 😀

    • Yeah, I'm thinking of releasing something, but first I need to work on the deployment/installation. Atm it isn't a pain-free task; I would love to see more people involved.
      Probably at Codebits I will have news about this 🙂

      • I've been doing Java (mostly EE, but also SE) for the last 5 years and am really interested in assessing this solution and most probably contributing, if you open it. Feel free to reach me by e-mail and perhaps I can help with the current issues.

  2. Nice write-up. Please fix the typo on socket.io, it’s not sockets.io.

