On Talk of Monitoring…

Over the years as a developer and ops guy, I’ve had to be on-call a lot. Much of my hate stems from being part of small teams that had to monitor hundreds of instances without the resources of a NOC or adequate monitoring/alerting software. However, since I like to set the bar high, I still believe that a team of a few admins should be all it takes to react to problems (read: it shouldn’t take a NOC). With a very high signal-to-noise ratio, the tools require minimal human oversight and everyone gets to sleep better at night.

I’ve seen attempts to have incidents automatically trigger creation of tickets. It fails because the philosophy is not in place: What should be a ticket? What should be an email? What should get auto-resolved? What counts as a failure? It’s all about cutting out the clutter. Enabling ticketing across the board is a bad idea; it should be enabled on a one-off basis per human-necessitated alert until the noise is silenced. The goal with all operations (especially as it relates to virtualized resources) should be to automate ourselves out of a job. The reality is we’ll never fully accomplish that goal (so we’ll keep our jobs). My feeling is that dashboards convey status so much better than individual alerts. Alerts should only call our attention to look at the dashboards.

The only things that should go to email or enter into tickets are issues which are directly and immediately actionable by a human.

For example, alerts that often go to humans, but generally should fix themselves:

  • Load spikes (autoscale)
  • Disk out of space due to /var/log or /tmp filling up (log rotate)
  • Process dies (monit, supervisord)
  • Instance dies (terminate, launch new one)
  • High swap (only a problem if SLA degraded, probably due to high i/o as a result)

Things that should go to humans (and could probably be tickets):

  • Replace disk in server (not disk failure, raid took care of that)
  • Resize database server
  • MySQL master failure
  • Death of any physical hardware
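
The split between the two lists above can be sketched as a small routing table. This is only an illustration in Python: the alert names and the `Route` enum are hypothetical and not taken from any particular monitoring tool, and the remediation hints simply mirror the parentheticals in the first list.

```python
from enum import Enum


class Route(Enum):
    AUTO_REMEDIATE = "auto-remediate"
    TICKET = "ticket"
    DASHBOARD = "dashboard-only"


# Hypothetical alert names mapped to the automated fix suggested above.
AUTO_FIXES = {
    "load_spike": "autoscale",
    "disk_full_logs": "rotate logs",
    "process_died": "restart via monit/supervisord",
    "instance_died": "terminate and launch a replacement",
}

# Hypothetical alert names that need a person (and probably a ticket).
HUMAN_ONLY = {
    "replace_disk",
    "resize_database",
    "mysql_master_failure",
    "hardware_death",
}


def route_alert(kind: str) -> Route:
    """Decide where an alert goes; anything unknown stays on the dashboard."""
    if kind in AUTO_FIXES:
        return Route.AUTO_REMEDIATE
    if kind in HUMAN_ONLY:
        return Route.TICKET
    return Route.DASHBOARD
```

Anything not explicitly listed (high swap, for example) defaults to the dashboard, so nobody gets emailed unless a human is actually needed.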

Here’s why I hate singular email alerts: E.g. “CRITICAL: MySQL replication has fallen behind”

Replication doesn’t slow down by itself. It’s because of any number of external factors:

  1. out of disk space
  2. high disk i/o (slow disks, too much swap usage, raid failure, mysql backup, other process)
  3. too slow disks
  4. blocked queries
  5. network connectivity/latency problems
  6. locked tables

Now, all of those things (1-6) should be monitored. If we alerted on each one, however, we would get a barrage of emails for every issue (rather than a single incident). It is harder to sift through a stream of emails than it is to view a simple dashboard that shows everything that’s wrong. It’s also much harder to identify how bad a problem is if we have to judge severity by the rate of emails, and it’s hard to on-board a new hire who has to learn how to read the signal-to-noise ratio from emails. Whether something is wrong is best determined by a reduction in an SLA, which triggers an alert. That way every time we get an email, we know that a human is needed.

My Philosophy on Site Monitoring and Alerting

Here’s my high-level philosophy on how site monitoring and alerting should be implemented:

  1. Numeric SLAs should be used as the benchmark for success. They are an aggregate of things like application response time, network performance, data integrity, data timeliness, high-availability, etc.
  2. Multiple SLAs can be used for various aspects of a site’s operations (frontend, backend, network, db, etc.) as they correspond to engineering divisions.
  3. An incident is defined as any event where the SLA drops below a certain threshold.
  4. Incidents get passed along to humans (otherwise it can wait or should get auto-corrected)
  5. SLA thresholds are broken into internal (what we see) and external (what end-users see). Internal thresholds are lower than external thresholds.
  6. If a serious outage or degradation of service happens without affecting the SLA or triggering the threshold, then the SLA calculation is broken and must get fixed.
  7. Dashboards should exist that summarize all outstanding issues on one screen (everything that could possibly be wrong).
  8. Dashboards should be displayed on office wallboards (big TVs or projectors) to communicate QoS/SLAs and open issues for all to see.
  9. On a day-to-day basis, issues on the dashboard should get addressed by admins at their leisure, unless there’s an incident associated with them. Issues without incidents are low priority because SLAs are not affected.
  10. Every issue on the dashboard should be linked to a knowledge-base on the specific topic. This knowledge-base includes documentation and a running discussion about how to resolve the issue or what has been done in the past to rectify the situation.
  11. Any issues that get repeatedly triggered which cannot be fixed at root should have automated resolutions.
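
Points 1 and 3 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive SLA formula: the component names, the 0-to-1 scoring scale, the weights, and the threshold value are all hypothetical.

```python
from typing import Mapping


def sla_score(components: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of per-component scores, each assumed to be in [0, 1]."""
    total_weight = sum(weights[name] for name in components)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(components[name] * weights[name] for name in components)
    return weighted / total_weight


def is_incident(score: float, threshold: float) -> bool:
    """An incident is any event where the SLA drops below the threshold."""
    return score < threshold


# Hypothetical inputs: response time is healthy, availability is degraded.
score = sla_score(
    {"response_time": 1.0, "availability": 0.5},
    {"response_time": 1.0, "availability": 1.0},
)
page_a_human = is_incident(score, threshold=0.9)
```

With one aggregate number per tier, the only email anyone receives is the one saying the number crossed the line; everything else lives on the dashboard.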

Ahhh… monitoring. How do I hate thee? Let me count the ways.

Ahhh… monitoring. How do I hate thee? Let me count the ways. I hate thee to the depth and breadth and height My soul can reach, when feeling out of sight… and, if God choose, I shall but hate thee better after death.

Let’s start with saying all the things I hate about the way most monitoring works or is implemented:

  1. I hate how every time a server starts running out of disk space (e.g. 20% free), an alert fires off an email even though there’s 200GB free and it took 3 years to fill up the disk. This means it probably isn’t urgent to respond to it immediately. The alert should only fire if the rate at which the disk fills up is problematic. Better yet, the alert should tell me when the disk will be full at the given rate.

  2. I hate when every alert triggers an email. This leads to email alert fatigue. Many times I see people (myself included) judging the state of operations by how large the unread count is in their inbox. Clearly too much noise.

  3. I hate it when alerts aren’t actionable enough. If they were actionable enough, then most of the time they should be auto-recoverable/scriptable so no human needs to get involved. If it can’t be automatically fixed, it should automatically open up a ticket or re-open an existing ticket depending on the nature of the problem. Tickets too frequently re-opened need to be permanently addressed.

  4. I hate how every time something fails, I am notified about it. I only care to have my inbox violated if it affects the SLA/QoS and if it’s happening with regularity. If it’s happening with regularity, it should create a ticket for me.

  5. I hate that monitoring systems are usually more oriented to the state of individual machines, when I care more about the aggregate performance of a cluster of machines performing some function. The system shouldn’t email about any one node. This should just be reported on a dashboard somewhere.

  6. I hate 90’s looking RRD charts. I want to see pretty charts. Not just rendered images. I want to interact with the data like I do when looking at stocks. There are great graphing packages that do this. Also, I want to be able to hyperlink to any chart at any point in time so it can be referred to in tickets or discussions.

  7. I hate that my charts don’t move. I want my charts to stream so I can leave them open on my screen or on a wallboard in the office for all to see.

  8. I hate to have to re-provision a server to add new checks or monitors. I should be able to add checks/alerts to a tag of machines and it just works. I don’t want to have to manage the HA of the plugins. They should be controlled by a master process like supervisord and not be standalone.

  9. I hate having to configure alerting thresholds manually because I don’t always know what they are. I want to be told what the range of values is so that I can pick what’s critical. Of course, the system should automatically figure out what normal operations look like and I’d only need to tune them on special occasions. Boundary.com does this very well.

  10. I hate when looking at a chart and I see an anomaly, that I cannot create an alert right then and there. I don’t want to have to figure out where to set up that alert.

  11. I hate that information about resolving an issue is never tightly coupled with the alert. When a critical alert fires, I should be able to see all tickets and documents related to the service. Admins should be able to communicate acknowledgements, status/progress, and rectifications all in one place. I think similar tickets should automatically pop up when an incident is reported so admins can immediately delve into fixing the problem. This makes on-boarding even easier for new hires.

  12. I hate that systems like Nagios weren’t built from the ground up to be highly available. The monitoring system should be the most HA component in the infrastructure. It should not offload HA to MySQL, since making MySQL HA is non-trivial and rarely ever automatic. An interesting design would be to build a monitoring product on top of Zookeeper.

  13. I hate relying on email to view alerts. I think infrastructure monitoring has more similarities to application monitoring than people think. Apps like Sentry, Airbrake, Exceptional, NewRelic have a nice way of bubbling up information without creating an information overload.

  14. I hate that SLAs aren’t always used as the barometer of success. I like SLAs because an SLA is a single number that summarizes the state of operations. A well designed SLA will encompass everything about your system. SLAs can be broken down to frontend, backend, and data tiers. These should be graphed and prominently displayed for all to see on wallboards around the office. If a serious problem arises that doesn’t affect the SLA, then the SLA needs to be adjusted accordingly.

  15. I hate that I cannot enforce that on-call people get phone calls in PagerDuty when alerted. I want to make it a requirement. I’ve seen many times where admins said: “I didn’t get woken up by the SMS” or “I didn’t notice that I got an email”, or “I misread the alert and thought it was something else so I went back to bed”.

  16. I hate how ugly all monitoring applications are. They look like they were written by admins. This doesn’t have to be the case any more. A little bit of bootstrap and jQuery goes a long way. There are lots of wonderful free/paid charting libraries.

  17. I hate having to manage escalations both on PagerDuty and in the monitoring system. I think they should be integrated. Hooking directly up to Twilio/Plivo might be asking too much though, so I’m not sure how to fix it.

  18. I hate that I cannot see a calendar view of who is on-call (like in PagerDuty). This should be tied into PTO schedules so that an honest mistake doesn’t lead to prolonged downtime.

  19. I hate how Nagios won’t reload/restart if there’s a single configuration file error, including a check associated with an empty host group.

  20. I hate that there are many ways to disable a check in Nagios, but they aren’t in sync. This goes for many other aspects of Nagios configuration.

  21. I hate the email format of most alerts. When I get the email, it should provide me with enough information to act from my phone without having to open up my laptop and view 4 other dashboards. If the SLA drops, show me all open issues and give me links to the knowledge-base and any open ticket(s) associated with the issues.

  22. I hate to use another monitoring system to monitor the monitoring system. It should be self-contained. Pingdom monitoring the login page of the monitoring system should be enough to know everything is working.

  23. I hate that we have to pay NewRelic $$$/host/month (no matter how small or big) to monitor our web applications, but that there is no open source alternative that is robust enough.

  24. I hate that software deployments don’t automatically schedule downtime for associated services.

  25. I hate false alarms.
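
Item 1 in the list above (alert on the fill rate and report when the disk will be full) is easy to sketch. The following Python is a rough illustration assuming two free-space samples and a linear projection; the function names and the 72-hour horizon are my own placeholders, not something any particular monitoring system provides.

```python
from typing import Optional


def hours_until_full(free_then_gb: float, free_now_gb: float,
                     hours_elapsed: float) -> Optional[float]:
    """Linearly project hours until the disk is full; None if it is not filling."""
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    # GB consumed per hour between the two samples.
    rate = (free_then_gb - free_now_gb) / hours_elapsed
    if rate <= 0:
        return None
    return free_now_gb / rate


def should_alert(free_then_gb: float, free_now_gb: float,
                 hours_elapsed: float, horizon_hours: float = 72.0) -> bool:
    """Alert only when the projected time-to-full falls inside the horizon."""
    eta = hours_until_full(free_then_gb, free_now_gb, hours_elapsed)
    return eta is not None and eta <= horizon_hours
```

A disk with 200GB free that loses 10GB a day projects to roughly 20 days of headroom and stays quiet, while one that drops from 100GB to 40GB free in 12 hours has about 8 hours left and fires.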

What I love about monitoring:

  1. That I am no longer on-call. =)

Dust.js: Client-Side Templating

On Hacker News today there was a mention of how LinkedIn is using the Dust.js templating system. Dust.js is doing what XSLT always promised to do but never delivered on.

In the post, they make a pretty compelling argument for why to use client-side templating. Their problems are all similar to the ones I’ve seen countless times, anywhere a services-based architecture is employed, like at TV.com.

At LinkedIn they have a services-based architecture with applications written in a slew of frameworks across multiple languages. Because of this, it’s hard to re-use visual components. Requiring that all components be written in the same language reduces productivity of developers and hinders their ability to rapidly prototype new features, but having the templates reimplemented in each language makes maintenance a chore. What LinkedIn settled on doing was to only require applications to produce JSON responses (lightweight and efficient to render) and rely on client-side (browser) rendering to transform the layout templates with JavaScript. This puts the expensive cost of rendering on the browser (totally scalable), and allows LinkedIn to cache the layout on CDNs to accelerate delivery (in addition to the CSS, images, JS, etc.).

Some might argue that well written HTML+CSS is essentially the same thing. It’s not. First, HTML is verbose; the amount of data needed to express even a simple document far exceeds that of JSON because it includes layout information. Second, generating the HTML is an expensive operation on the server: it involves parsing source templates (ERB, JSP, Jinja, Smarty, etc.) and assembling them into a string object with lots of concatenation. Lastly, no matter how “light” you make the HTML, you’re still mixing layout with data so you can’t statically cache the layout on a CDN without using ESI (fancy XSLT).

Compare this to simply requiring apps to produce JSON responses. JSON is lightweight and efficient to generate. Most scripting languages have JSON libraries implemented as native C extensions, so serialization happens in C.
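
To make the separation concrete, here is a minimal Python sketch of the idea. It is not Dust.js syntax and not LinkedIn’s implementation: `string.Template` stands in for the cacheable layout, the `render` function plays the role the browser would play, and the field names are made up for illustration.

```python
import html
import json
from string import Template


def service_response(name: str, headline: str) -> str:
    """What a backend app returns: data only, no layout."""
    return json.dumps({"name": name, "headline": headline})


# The layout is static text, so it could be cached on a CDN like any other asset.
PROFILE_TEMPLATE = Template("<h1>$name</h1><p>$headline</p>")


def render(template: Template, payload: str) -> str:
    """Merge JSON data into the layout, escaping values for HTML."""
    data = json.loads(payload)
    safe = {key: html.escape(str(value)) for key, value in data.items()}
    return template.substitute(safe)


page = render(PROFILE_TEMPLATE, service_response("Ada", "Engineer & writer"))
```

The backend never touches markup, so the same JSON could feed a template rendered in any language, or in the browser.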

Lastly, if the thought of templating done entirely in the browser is not appealing (perhaps for SEO reasons), there’s always Node.js which can sit in between; however, this negates the CDN-effect for template caching and puts the computational onus back into your datacenter / cloud.

I’m not a “frontend” guy, so maybe that’s why this immediately appeals to me. I’m curious what frontend developers think about this approach?

Why I Love PayTrust

There is a service I use that doesn’t get enough credit. The service acts as your permanent billing address, bill payer, and bill minder. You point anyone who sends you a bill to your PayTrust address. PayTrust then ensures any invoices received get paid on time and reminds you if you haven’t paid a bill or received an expected statement. If they receive something that cannot be scanned, such as a credit card, they forward it to your current mailing address.

Here’s what I like about it:

  • Cheap! Only $9.95/mo.
  • Multiple funding sources (e.g. any checking account you want to use such as your brokerage account or traditional bank checking account)
  • Late bill reminders if no statement is received within N days of your last one.
  • Permanent billing address (PO BOX 1819*****, SIOUX FALLS, SD, 57186) that doesn’t change when you move.
  • Paperless. All mail gets scanned and is available on their site for viewing or download.
  • View new or existing bills while out of town. I can go out of town for a month and not worry about anything.
  • Integrates with Mint.com “Real Balance” so you can see what your available account balances are. Kind of basic, but still a good thing.
  • Does both E-Bills and paper bills. E-bills is where it logs into the account and downloads the PDFs for you (go green!)
  • Backed by Intuit & been around for more than 10 years, so it feels more legit than some small little startup.
  • I get almost no advertisements in the mail as they all go to PayTrust and get shredded.
  • 24 hour phone support and no wait times. It’s almost like a direct line into their offices.
  • They research any payment issues such as payment not received or missing statements
  • They don’t float your money. Money leaves your account the day the check clears, not a few days before like normal banks do with online bill pay.
  • I use it as my billing address for “private” registration for DNS.
  • I receive a CD once a year with all my bills that I hold on to for safe-keeping and store in Evernote.
  • Automatically pay statements with variable monthly amounts such as a gas bill (No bank offered Bill Pay service can do that!)
  • Pay exactly what you want to pay. For example, pay off your monthly credit card bill so long as it is under your maximum threshold.
  • No need to update 50 vendors with your new credit card the next time you lose it or it expires.
  • No need to login to multiple sites to manage your automatic payment profiles. Manage them all on PayTrust.
* I received no credit or commission for this post.

MobileMe Synchronization Nightmares

For many months now, MobileMe has stopped syncing. This is a big problem when you depend on your phone to be in sync with your laptop so that you don’t miss scheduled appointments, or when you do, to have the contact number of the person you need to contact!

I am embarrassed to say how many hours I’ve spent on resolving this issue. I’ve tried every trick in the book to get it working, from resetting sync settings and creating a new user account to reinstalling OS X! I’ve spent 4 hours just on MobileMe support and even gone in to meet with a tech at the Genius Bar. It turns out that my problem has eluded just about everyone.

I’m very happy to announce that I have a fix! It turns out that it’s possible for your keychain to get corrupted on MobileMe. When this happens, it will synchronize your corrupted keychain locally, thereby breaking Sync Services.

Possible solutions:

  1. One solution is to reset Sync Data on the problematic computer, then re-synchronize the keychain on MobileMe with the local computer’s active keychain. This will very likely not work, because either you cannot even check the box to synchronize data or the wheel spins forever when you try to register your computer. However, if it works for you, then when it prompts you to merge, choose to use “this computer’s” data and not the data from MobileMe. The downside is of course that you will lose any passwords stored in your keychain on MobileMe.

  2. The most likely fix to your problem will be to reset your keychain. You will lose all of your stored passwords. (http://support.apple.com/kb/ts1544)

To easily recover from this situation in the future, back up your ~/Library/Keychains/*.keychain files. The next time corruption happens (and it will!), just restore your keychain files from a backup so that you won’t lose all of your stored passwords. I am sure that with a little more investigation, one could delete individual entries from the keychain to determine which key in particular is breaking the sync.
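
The backup itself can be a few lines of script. Below is a minimal Python sketch that copies every *.keychain file from one folder to another; the function name and the destination folder in the usage comment are placeholders of mine, so point it wherever your backups actually live.

```python
import shutil
from pathlib import Path
from typing import List


def backup_keychains(source: Path, dest: Path) -> List[Path]:
    """Copy every *.keychain file in source into dest; return the copies made."""
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for item in sorted(source.glob("*.keychain")):
        if item.is_file():
            # copy2 also preserves timestamps and permission bits.
            copied.append(Path(shutil.copy2(item, dest / item.name)))
    return copied


# Example usage (the destination path is hypothetical):
# backup_keychains(Path.home() / "Library" / "Keychains",
#                  Path.home() / "keychain-backup")
```

Restoring is the same copy in the opposite direction, done while the affected applications are closed.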


May 25 15:14:26 Quark mobilemesyncclient[1647]: POST / (FAILED), httpStatusCode:-1, errorType:100 (domain=Error domain 3, code=-9813), transactionState:5, txnId:869FFDB9-A01D-429F-808E-1DAD905BBEF7, auto-retries=0, manual-retries=0
May 25 15:14:26 Quark com.apple.syncservices.SyncServer[1325]: 2010-05-25 15:14:26.719 mobilemesyncclient[1647:903] POST / (FAILED), httpStatusCode:-1, errorType:100 (domain=Error domain 3, code=-9813), transactionState:5, txnId:869FFDB9-A01D-429F-808E-1DAD905BBEF7, auto-retries=0, manual-retries=0
May 25 15:14:26 Quark mobilemesyncclient[1647]: DMMKPATH /Library/Application Support/SyncServices/Clients (FAILED), httpStatusCode:-1, errorType:100 (domain=Error domain 3, code=-9813), transactionState:5, txnId:834B96C7-499F-42AD-AD31-FC23D0FED14D, auto-retries=0, manual-retries=0
May 25 15:14:26 Quark com.apple.syncservices.SyncServer[1325]: 2010-05-25 15:14:26.856 mobilemesyncclient[1647:903] DMMKPATH /Library/Application Support/SyncServices/Clients (FAILED), httpStatusCode:-1, errorType:100 (domain=Error domain 3, code=-9813), transactionState:5, txnId:834B96C7-499F-42AD-AD31-FC23D0FED14D, auto-retries=0, manual-retries=0
May 25 15:14:26 Quark mobilemesyncclient[1647]: PROPFIND /Library/Application Support/SyncServices/Clients/9B259C74-23C2-4D3C-AD7C-7C2AED9BB334.client (FAILED), httpStatusCode:-1, errorType:100 (domain=Error domain 3, code=-9813), transactionState:5, txnId:FBDED450-B0CE-47C0-BD3E-CFBC122F5FC0, auto-retries=0, manual-retries=0
May 25 15:14:26 Quark com.apple.syncservices.SyncServer[1325]: 2010-05-25 15:14:26.990 mobilemesyncclient[1647:368f] PROPFIND /Library/Application Support/SyncServices/Clients/9B259C74-23C2-4D3C-AD7C-7C2AED9BB334.client (FAILED), httpStatusCode:-1, errorType:100 (domain=Error domain 3, code=-9813), transactionState:5, txnId:FBDED450-B0CE-47C0-BD3E-CFBC122F5FC0, auto-retries=0, manual-retries=0

Disputing Credit Card Charges

Recently, I learned how broken the dispute resolution process is with Bank of America. I asked them to send me the receipts for 5 recent charges from the same Mexican Food place across the street from me. I love Mexican food, but there’s no way I could have eaten 5 meals in a row from this place (esp when I wasn’t even working from home that week), so something was awry.

Herein lies the rub. If they send me the receipts (which include signatures), I can no longer dispute the charges because now I’m in possession of the sales receipts. Even if the signatures on the receipts clearly show an invalid signature, I have forgone the ability to issue a charge back. I asked the customer support if they could visually verify that the 5 signatures closely matched all my other signatures, but that was impossible. They didn’t have the ability to view the receipts.

I find this totally absurd. Granted, using a signature as a form of authorization is incredibly fallible, but the fact that it’s not even relevant when disputing a charge raises the question: why do we even need to sign the receipt at all? The back of my credit card is not signed, so that means that they would have been required to view a valid form of identification, which obviously wasn’t the case and is rarely ever the case. I suppose it’s all a moot point. BofA reversed the charges regardless, which was nice of them. I would have just preferred a more thorough process.

Google Exposes Pings

I thought this was a pretty neat discovery. You can download an XML feed of all blogs which have pinged (xmlrpc) Google in the last 5 minutes. Optionally, you can add a parameter ?last=120, if you wanted just the last 120 seconds. The limit is 300 seconds (or 5 minutes).
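
For scripting against the feed, the `last` parameter can be built and clamped like this. It is a small Python sketch: the base URL in the example is a placeholder rather than the real feed address, and the lower bound of 1 second is my own assumption; only the `last` parameter name and the 300-second cap come from the description above.

```python
from urllib.parse import urlencode

MAX_WINDOW_SECONDS = 300  # the feed covers at most the last 5 minutes


def feed_url(base_url: str, last_seconds: int = MAX_WINDOW_SECONDS) -> str:
    """Append ?last=N to the feed URL, clamping N to the 1..300 range."""
    window = max(1, min(int(last_seconds), MAX_WINDOW_SECONDS))
    return f"{base_url}?{urlencode({'last': window})}"


# Placeholder address; substitute the actual feed URL.
example = feed_url("https://example.invalid/changes.xml", 120)
```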


What Apps


Frustrated with the low quality and infrequent updates of current app review sites for Facebook, we at Launch 10 decided to start our own blog. We review hundreds of apps every month on our own time, so we decided to share our thoughts on the apps we check out. Our blog will generally be updated several times a day, so check back often. As OpenSocial gets adopted by all the major social networks, we’ll be covering the new apps that get released there as well.