Blog

  • A Year Hosting in AWS

    A year ago I changed employers, and in turn moved from a classic bare metal hosting environment to a solely cloud environment. It was a huge change, and pretty intimidating at first, requiring me to re-learn a lot of the practices I was used to as well as pick up entirely new methods and systems. While a lot of the same concepts in system administration are still valid, the flexibility of cloud hosting has taken a while to get used to — an old dog had to learn new tricks.

    The migration is deceptive: it’s relatively easy to approach AWS (or at least EC2) the same way you’ve always approached your hosting. Figure out how to create an instance and you’ve got yourself the rough equivalent of the physical server you’re used to. It can be controlled in much the same way, and at that basic level it’s mostly a matter of getting used to the differences in capabilities between what seem like similar CPU/RAM configurations and your old servers.

    One of the breakthroughs in using AWS (again, I’m talking largely EC2 at this point) is the realisation that instances are disposable in a way a physical server has never been. In the past, a server represented a certain amount of blood, sweat and tears, and as such replacing or reinstalling was not something done on a whim. Arguments had to be made to justify the time involved, short-term and smaller projects had to be bundled together on shared servers in order to maximize resource use, and the cost of both labour and hardware/software had to be justified.

    With cloud, so much of that effort and worry is gone. A week-long project can easily be given resources just for itself, allowing for segregation amongst projects where useful. A project requiring more resources can now be migrated to a higher powered instance and the old instance terminated; no more ancient hardware cluttering a store room, or having to be re-used for projects of lesser importance. Likewise, hardware concerns are now virtually non-existent. So much is handled by Amazon that I no longer have to struggle to find an obscure (and dodgy) driver for an odd RAID or network adapter. With a standardized hardware stack I can count on everything just working; and if something about an instance does seem problematic (perhaps a hardware failure within AWS is causing an issue?), then I can just stop that instance and spin up a new one (most likely placing my instance on new hardware as a result). Did you just accidentally install something all over the OS instead of in its own isolated location? Kill the instance and revert to your snapshot, getting back to the same spot in minutes. Minor annoyances, mis-judgements or mis-steps in structure can easily be addressed without the need for a major re-engineering effort. The hosting structure suddenly becomes much more dynamic and fluid than it has been.

    Taking the worry of the data centre out of my job has been amazing: I am ecstatic to know that I will never have to make a run for a data centre in the middle of the night, to sit in a noisy, temperature-extreme room for several hours while I diagnose a server error. I don’t have to make sure I’m always within X minutes of the data centre when on call — as long as I have an Internet connection I’m capable of doing anything and everything to do with our instances. I recognize that part of this could also be accomplished by moving to a managed hosting environment, but this lets me remain in control of each instance while freeing me from the hardware level.

    Just recently there have been a couple of great posts that cover a lot of what I’ve had to learn over the last year, as well as providing some new information and goals for the year ahead.

    It’s worth noting that AWS is NOT a panacea, and it’s not for every company or situation. It does fit our organisation extremely well, and I think that the ecosystem is especially attractive to start-ups, empowering them with a variety of services that are both easily accessed and consumed. Another thing I’ve found is that there’s a real sense of community amongst the AWS client base. This may be true in other large hosting environments, but I know it’s not something I found at any of the colocation companies I hosted with in the past; people might identify with one another as clients of those hosts, but they didn’t tend to share knowledge and advice with the same freedom as I’ve found AWS clients to do.

    The past year has been fantastic: it’s forced me to push myself to learn a lot of new stuff, and being in AWS has given me the flexibility to test and iterate at a speed I wasn’t able to in a classic hosting environment. I’m really excited by how much our systems have progressed and improved; we’ve made a lot of fairly major changes and they’ve worked well for us. On top of it all, AWS has reduced costs for us a couple of times over the last year, making it so much easier for me to meet my budgets.

    The next year is going to be just as exciting as this one just past, if not more so. I have a good base of knowledge within AWS, so it’ll be time to start pushing forward again into areas where I’m a little less certain of myself. Here goes!

  • AWS re:Invent 2013

    With our heavy use of AWS, it made great sense to attend AWS re:Invent this year. This conference, now in its second year, is put on by Amazon to inform and educate on their web services. It is a great place for customers to get a glimpse of the inner workings, as well as to swap stories and ideas.

    I planned ahead and packed sessions into every possible time slot available, as well as signing up for a Bootcamp put on by Trend Micro. Days are full of the breakout sessions, keynotes and conference floor, while evenings are filled with parties and networking.

    The conference was a huge blast, with some amazing speakers, tonnes of great information and education, as well as exciting parties (deadmau5!). One of the stand-out things about the conference is the accessibility of people of all levels within their organizations; from CEOs to backend engineers, it was really refreshing how approachable everyone was. Everyone was genuinely excited to be there, thrilled about AWS, and eager to brag about their use case.

    Sitting down afterwards to evaluate the conference, I know it was a great investment for us: I got a LOT of information, and had some amazing discussions with other AWS clients of varying experience. Obvious as it sounds, it was really helpful to sit down and understand that many people are making the same mistakes I am, and in turn making the same advances. There were plenty of “oh, we could do THAT!” moments, accompanied by a satisfying number of “you should give what I’ve done a try to solve your problem” ones. A few things I’ll think about when we return next year:

    • It’s easy to underestimate your skill and familiarity level; challenge yourself and take sessions that sound difficult, or deal with subject matter you’re not particularly familiar with. While there were gems in some of the lower level courses, or in ones dealing with services I had familiarity and experience with, the real gain was in the higher level courses and the ones with which I had at best a passing experience.
    • Careful with the Bootcamps: not to disparage the speakers/presenters, but given that the sessions were sponsored, they obviously had a responsibility to push their company’s products (and to be fair, often the product was the explicit purpose of the bootcamp). I heard a lot of people complaining about these sessions. I lucked out this year, but would definitely be careful in the future. Some official AWS sponsored Bootcamps would be really enticing I think.
    • Make my way into more casual conversations; this year it was overwhelming: my first year using AWS, 8000+ people — I did a poor job of socializing and picking up information from as many people as possible. There are a TONNE of smart people at this event, and the vast majority are wandering the conference floor with you, talk to them!
    • Look into the certification and AWS training; I didn’t really register either of these items in my head until towards the end of the event. Both looked to have a lot of value and I’ll be looking into them over the next while (as they don’t have to be done at the conference) – but it’d be great to carve out some time to do some of this stuff next year as well.
    • 3 days (plus the bootcamp day) was great, and in some ways probably about as long as I could handle; but at the same time, I would have gladly taken more sessions and spent more time learning from all the great resources available at the conference. In order to attend as many of the sessions as I wanted, I felt like I didn’t get much chance to hit the conference floor and chat with vendors. I think having multiple people attend from our company would help with this – spread the load, but I’d also not argue against an extra day of conference!

    Thanks to everyone involved in the organization and execution of the conference, it was an amazing experience and I’m extremely glad to have had the chance to attend.

    Eagerly looking forward to re:Invent 2014, see you there!

  • Setting up VPN

    Up to now I’ve had little interaction with VPN; companies I’ve worked at either haven’t had much need for it, or it’s been avoided in order to prevent bringing office/home environments into the scope of audits. As I’ve moved forward with our AWS setup and started a migration towards using VPC rather than EC2 Classic, I’ve seen good use cases for VPN and decided to give it a go, helping both myself and our staff interact with our services without exposing network ports more than necessary.

    I’ve used OpenVPN as it is very easy to get into a useable state very quickly. At the OS level we’re largely using the Amazon Linux AMI (essentially RHEL) but this is pretty simple to translate over to other Linux distributions.

    The first step is to install the packages we need:

    yum install --enablerepo=epel openvpn easy-rsa

    Next, copy the default server configuration and then edit it as necessary.

    • We’ve chosen to use TCP rather than UDP for our VPN connections; this is due to our connections being cross-continent and wanting to avoid errors in data transmission. UDP is faster, and if your VPN connection is a short hop it may be the better option; see “proto tcp” or “proto udp” in the server config.
    • We adjusted “dh dh1024.pem” to “dh dh2048.pem” in order to have a larger key size, for some additional security.
    • We changed the “server 10.8.0.0 255.255.255.0” line as well; this is a subnet that the OpenVPN server will manage for itself and clients connecting to it. The default here is probably good unless you’re running and connecting to multiple VPN servers at once. Be careful of IP space conflicts here.
    • Finally, you’ll need to adjust the route that is pushed to clients connecting to the VPN server — this should be a subnet that encompasses all the subnets your VPN clients will be trying to connect to. (ex. push “route 10.0.0.0 255.255.255.0”).
    cp /usr/share/doc/openvpn-2.3.2/sample/sample-config-files/server.conf /etc/openvpn
    vi /etc/openvpn/server.conf
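
    Put together, the directives we touched end up looking roughly like this in /etc/openvpn/server.conf (the addresses below are examples, not a prescription — match them to your own network layout):

    # example values only -- adjust the subnets to your environment
    # we chose TCP; "proto udp" is faster and fine for short hops
    proto tcp
    # 2048-bit Diffie-Hellman parameters, matching the key size built below
    dh dh2048.pem
    # the subnet the OpenVPN server manages for itself and its clients
    server 10.8.0.0 255.255.255.0
    # the route handed to clients; it should cover the subnets they need to reach
    push "route 10.0.0.0 255.255.255.0"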

    Next, we’ll copy the easy-rsa config and scripts into a working location, and adjust entries within the /etc/openvpn/easy-rsa/vars file to match our environment. Set KEY_SIZE=2048 to match the OpenVPN server configuration, and set KEY_COUNTRY, KEY_PROVINCE, KEY_CITY, KEY_ORG, and KEY_EMAIL to valid information for your organization.

    cp -rf /usr/share/easy-rsa/2.0 /etc/openvpn/easy-rsa
    mkdir /etc/openvpn/easy-rsa/keys
    vi /etc/openvpn/easy-rsa/vars
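
    The relevant lines in the vars file end up looking roughly like this (the organisation details below are placeholders, not ours):

    # placeholder values -- substitute your own organisation's details
    export KEY_SIZE=2048
    export KEY_COUNTRY="CA"
    export KEY_PROVINCE="BC"
    export KEY_CITY="Vancouver"
    export KEY_ORG="Example Org"
    export KEY_EMAIL="admin@example.com"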

    Now we need to build the server SSL keys and place them in the default locations:

    cd /etc/openvpn/easy-rsa
    source ./vars
    ./clean-all
    ./build-ca
    ./build-key-server server
    ./build-dh
    cp /etc/openvpn/easy-rsa/keys/{dh2048.pem,ca.crt,server.crt,server.key} /etc/openvpn

    Some network options on the server need to be set now to allow the routing to work. Enable IP forwarding by editing /etc/sysctl.conf and changing “net.ipv4.ip_forward=0” to “net.ipv4.ip_forward=1”. The iptables rules also need to be adjusted to allow for NAT; the subnet listed in the command here should match the subnet used in the “server” line of /etc/openvpn/server.conf.

    # NAT the VPN subnet (from the "server" line) out of the instance's primary interface
    iptables -t nat -A POSTROUTING -s 10.8.0.0/24 -o eth0 -j MASQUERADE
    iptables-save > /etc/sysconfig/iptables
    # apply the net.ipv4.ip_forward change made in /etc/sysctl.conf
    sysctl -p

    Make sure that OpenVPN is set to start on boot, and start the service up:

    chkconfig openvpn on
    service openvpn start

    Now we need to create keys for any clients that will be connecting to the VPN. Generate a different key for each client/user, as this will make for easier management in the future, especially if someone needs to be denied access. The name passed to the script is arbitrary; you can use their name, initials, whatever works for your organization.

    cd /etc/openvpn/easy-rsa
    source ./vars
    ./build-key <client>

    You’ll also need to create a configuration file for each of your users. This is pretty generic and easy to adjust for each key, and many client software packages will allow the file to be imported, making for an easy setup for end users. You should provide the users with this configuration file, their own cert and key (saved in /etc/openvpn/easy-rsa/keys/), and the server’s cert (located at /etc/openvpn/ca.crt).

    remote <vpn hostname or ip> 1194
    ca ca.crt
    cert <user>.crt
    key <user>.key
    client
    dev tun
    proto tcp
    resolv-retry infinite
    nobind
    persist-key
    persist-tun
    ns-cert-type server
    comp-lzo
    verb 3

    There are lots of options available on the client side, and the choice really comes down to the individual and their OS of choice. Tunnelblick, OpenVPN itself, Viscosity… Find one you like, or that you’re able to support your users on, and go with it. Most have the ability to accept OpenVPN configurations, and you’ll be able to just feed the client the above configuration and the associated certificates/keys.

  • Exploring logs

    Having SaltStack implemented (or at least in its early stages) has also allowed me to start the process of centralising our logs and providing an interface for interacting with them. Logstash seems to be the open-source tool of choice for shipping and indexing server logs, so we’ve gone with that, initially using Amazon SQS as our delivery agent and Kibana3 as our UI.

    The SQS implementation proved to be a little bit too buggy, usually resulting in log collection from SQS to our elasticsearch instance suddenly ceasing; as a result I’ve switched to an installation of Redis to handle the buffering of our logs and its implementation within logstash has been much more stable.

    Using SaltStack I was able to quickly push a logstash configuration to all of our web instances (our first area of focus for our logs), then install, enable and start the logstash service. Now I have a flood of logs coming in to an elasticsearch node, which I (and our devops team) can easily query using Kibana3 to drill down on problems and questions. This process would have been much more tedious without the help of SaltStack: Apache log formats needed to be updated several times, and tags identifying each instance meant that the configuration couldn’t be quite uniform across all the instances. Using SaltStack I was able to template the configuration file and apply the instance ID to it as it was deployed, all within the few minutes it took to read about Jinja2 templating, update a single configuration template, and execute one command to both copy the configuration to all the instances and reload the Apache configuration.
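
    As a rough sketch of the approach (the file paths, state names and template layout here are illustrative, not our exact tree), the Salt side boils down to a templated file.managed plus a service that watches it:

    # /srv/salt/logstash/init.sls -- illustrative layout only
    logstash-shipper-config:
      file.managed:
        - name: /etc/logstash/shipper.conf
        - source: salt://logstash/files/shipper.conf.jinja
        - template: jinja

    logstash:
      service.running:
        - enable: True
        - watch:
          - file: logstash-shipper-config

    Inside shipper.conf.jinja the per-instance tag is just a grain lookup (for example, tags => [ "{{ grains['id'] }}" ] in the file input), so every shipped event carries the minion ID without any hand-editing.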

    Having logs in an easily searched and displayed interface has been a great help to the team: we’re able to share searches we’ve made to illustrate patterns we’ve found, and provide links in tickets to searches and graphs that document problems. We’ve unearthed issues more quickly after code deployments, and been able to find and document problems that occurred so infrequently that they were hard to discover when the logs were spread across so many instances.

    [Screenshot: Kibana 3]

    Now the focus will be to get more in: system logs, S3/AWS logs, and hopefully even our application logs as well. This should make everything much easier to trace for myself and our dev team.

  • A high sodium diet

    Since starting my new position I’ve wanted to spend time implementing change control and some sort of centralised administration; but there has been (as always) a plethora of other tasks vying for my attention, many of which have a more direct impact on the viability of the business and services. Finally, this last week I’ve had the time to sit down and work on it, and it seemed to coincide with my (re)discovery of SaltStack.

    I have toyed with Chef and Puppet several times in the last year, and neither really provided the functionality I’d hoped for — this isn’t to say that there is anything wrong with them, just that they didn’t match the idea of a tool I had floating around in my head. SaltStack is what I’d been looking for, and I finally have found the time, company and environment where I can choose to use it.

    I can easily administer groups of instances from the command line on a master node:

    • running remote commands (e.g. salt '*' cmd.run "uptime" lets me see the uptime on all connected minions),
    • pulling up information about the instances (e.g. salt '*' status.uptime does the same as the previous remote command),
    • or administering packages (e.g. salt '*' pkg.list_upgrades gives me a list of packages eligible for upgrade on the minions).

    Better yet, I can easily define the groups of instances I want to target with quick regular expressions and SaltStack’s grains. From here, it’s an easy next step to start defining your instances: what files should be present and where, what packages each instance should have installed, what services it should be running, etc.
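
    For a flavour of what that targeting looks like (the patterns and grain values here are examples, not our real layout):

    # match minions whose ID matches a regular expression
    salt -E 'web[0-9]+' cmd.run "uptime"

    # match on a grain instead, e.g. every CentOS minion
    salt -G 'os:CentOS' pkg.list_upgrades

    # and apply the defined states to a matched group
    salt -G 'os:CentOS' state.highstate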

    Implementing SaltStack has been a painless process, and seeing it come together is very exciting! Installation of the minions was quick and really the only manual part of the process; packages exist for all the distros I’m responsible for, which made things much easier. The master was also extremely easy to configure to a base starting point, connecting all the minions and authorizing them with ease. At no point during the process did I encounter any difficulty in bringing minions online with the master, and being able to issue remote commands or use test.ping to quickly confirm connections gave me a quick and easy way to verify that they were accessible.

    With everything installed I’ve started building configurations on my master to match what should be on the minions. Now, at a glance, I can see that groups of instances that were supposed to be homogeneous have discrepancies — luckily nothing service-impacting, but with SaltStack I can correct the issue in seconds: define the configuration files I want to be the same and the packages that need to be present and running, then run salt against the instances in question; seconds later I can be certain that the configurations are standard across the group.

    Watching Salt work is an exciting process (at this point anyway); it’s amazingly fast at deploying changes/rules to large groups of instances, with changes being applied where necessary, services enabled, some services started that weren’t running (WHAT!)… Having your instances come in line with centralised policies is extremely satisfying, knowing that when you need to make a change to a configuration, whether to adjust logging or to replace an SSL certificate, it can be made manually in one place and then simply and quickly applied to all the necessary instances.

    It’s still early on, but the flexibility provided within Salt is amazing: I can template configuration files so that they’re dynamically generated based on qualities of the instances they’re being applied to. Need a logfile to have the instance name in the filename? Done. Need one package name on your CentOS instances, and a different package on your Debian instances? Done. I’m really excited to reach the point where the entire set of instances I’m responsible for is fully defined in here. For now I’m happy to be able to run a command across them all from the central host; I have a new level of control to make my job easier.
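
    As a rough sketch of that kind of templating (the package and state names are illustrative, not pulled from our tree), a state file can branch on the os_family grain before the YAML is even parsed:

    {% if grains['os_family'] == 'Debian' %}
    {% set apache_pkg = 'apache2' %}
    {% else %}
    {% set apache_pkg = 'httpd' %}
    {% endif %}

    apache:
      pkg.installed:
        - name: {{ apache_pkg }}
      service.running:
        - name: {{ apache_pkg }}
        - enable: True
        - require:
          - pkg: apache

    Salt renders the Jinja first and then parses the resulting YAML, so the same state applies cleanly to both families.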

  • Databases, backups and more!

    The biggest change from my last position to the current one? There’s always something going on here, and we’re always trying to evolve and expand our systems. One of the first things I wanted to do upon starting here was to migrate our AWS instances to VPC rather than classic EC2 — it seemed to me that this provides us with much better and more dynamic control over ingress and egress. Our databases were the largest speed bump in the migration to VPC: they hold a huge amount of data, and with them offline our entire application is down. We had 10 database instances (5 master-master pairs) with which our software interacts, and they’re all heavily dependent on each other; in other words, it was a situation of moving them all or not moving them at all – moving one-by-one promised a much larger potential downtime.

    Part of the reason for the 10 database instances was historical, representing a situation that was no longer valid in our application: if an instance was capable, there was no reason we couldn’t have them all on one, instead of spread around. With the trending data we pulled into Zabbix I was able to determine that one of our reserved instances (an m2.4xlarge) was more than capable of handling our existing load, and that it still gave us lots of space for growth. Having taken the time to push a lot of items into Zabbix paid off here: I was able to prove to myself (and my colleagues) the feasibility of what I was proposing. Queries per second, IOPS, MySQL operations, load — it was all there, and even combining peaks we had plenty of space.

    Some of my research suggested that our biggest concern within EC2 would be the IOPS on the EBS volumes. We had already purchased non-IOPS-provisioned reserved instances and the decision was made to ride them out (we’re largely through our contract), so I wanted to make sure we got the highest possible IO on a non-provisioned-IOPS reserved instance. In the end I used a software (mdadm) RAID 10 of 4 EBS volumes, which performance benchmarks suggested would be capable of delivering high IO for our databases. With AWS the process is nice and easy: just create the volumes, attach them to the instance, and use mdadm to create the RAID.

    # create and attach four 256 GB EBS volumes (repeat for each volume)
    ec2-create-volume -z <ZONE> -s 256
    ec2-attach-volume -d /dev/sdj{#} -i <INSTANCE-ID> <VOLUME-ID>
    # assemble the four volumes into a RAID 10 array and persist the config
    mdadm -v --create /dev/md0 --chunk=256 --level=raid10 --raid-devices=4 /dev/sdj*
    mdadm --detail --scan >> /etc/mdadm.conf

    After that, it’s just a matter of creating a file-system on the RAID, and then adding it to your fstab before mounting it.
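
    Something along these lines finishes the job (ext4 and the /data mount point are just example choices):

    # create a filesystem on the array, persist the mount, and mount it
    mkfs.ext4 /dev/md0
    mkdir -p /data
    echo "/dev/md0 /data ext4 defaults,noatime 0 0" >> /etc/fstab
    mount /data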

    Another useful item I came across was a tool called ec2-consistent-snapshot, which allows for full warm MySQL backups by locking the tables, freezing the file-system, and shipping an EBS snapshot to S3. Snapshots are incremental, so we only pay for the difference in AWS charges, and thanks to AWS magic any subset of snapshots can be deleted; as long as one snapshot remains, that’s all that’s necessary to rebuild. The snapshot can also be set to contain the MySQL master position, making for easy creation of replicated MySQL instances. Prior to this, backups were only done on replicated database instances dedicated to the task of backups: tars were created and shipped to S3, resulting in the backups being offline for the time necessary to create the tar, and then running behind until caught up on the binlog.
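
    For reference, an invocation looks roughly like the sketch below. The flags are from memory of the tool’s documentation, and the region, mount point and volume IDs are placeholders, so verify against ec2-consistent-snapshot --help before relying on it:

    # rough sketch only -- verify flags against the tool's own documentation
    ec2-consistent-snapshot \
      --region us-east-1 \
      --mysql \
      --freeze-filesystem /data \
      --description "nightly warm MySQL backup" \
      vol-xxxxxxxx vol-xxxxxxxx vol-xxxxxxxx vol-xxxxxxxx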

    So, after a lot of benchmarking, testing and dry-runs, everything was lined up for the migration; notifications were sent out to our clients suggesting a 2 hour downtime window (a realistic period of time, but with almost no room for mistakes) at a time that meant the least business impact. I’ve planned and executed a lot of migrations in my career, so it’s nothing new; however, thanks to an amazing team here, superiors who hold back on requesting constant updates (a difficult impulse to quell, I understand!), and the time taken to plan the process well, we completed the migration and had the databases back online almost exactly at the 2 hour mark. Replicated databases were created in the time after the migration, an easy process thanks to the snapshots, and were ready to go in a much shorter time than I’m used to.

    Having our databases consolidated has made a huge difference: there are fewer places to monitor and manage, and it’s much easier for our devs to work with one central database server rather than 5 different servers. Thanks to having them in a central location, I’ve been able to focus on optimizing MySQL and we’ve squeezed time off our queries. Backups are taken of both our replicated servers as well as our master — this is huge for me! The fact that I can take backups of our live production database, with almost no impact* to our application, lets me sleep peacefully at night. (Note: “almost no impact” means that the database is locked for the backup, however the time to snap is seconds, with the only lengthy part being the time taken to acquire the database lock.) Our recovery process has already been tested several times (the creation of replication instances and a new test master) and has been proven to result in perfect data. We’ve also now cleared the biggest hurdle in our migration to VPC AND we’ve gone from 10 instances to 3!

    Collection of trending data was a huge part of this migration: it helped me figure out what was necessary to make the migration, and it helped me justify my choices to those with questions. Zabbix has proven itself to me as a great way to do this, however it’s definitely not the only way; whatever gets you the numbers in a way that’s easy to interpret is great. Also, a confident and positive attitude (not just by you, but by the company as a whole, from counterparts to those who run the place) makes a massive difference when dealing with such mission-critical components.

    The work here is constant, and there will be mis-steps (not this time!), but as before I’m having a lot of fun learning and applying a lot of stuff that is new to me, or finally getting to test out ideas I’ve had in my previous jobs. Processes I set up as I came on board are paying off (monitoring & trending), and I’m proud of the changes I’ve put in place here so far. Next, while trying to continue improving the systems, I need to dig into the world of DevOps and see if I can’t better empower our devs by creating systems that they can work with more easily.

  • Big changes

    2013 has started with a bang for me; I was offered and have taken the position of “Director of Tech Ops” with a start-up run by some good friends. I’ve just finished my first month and have some thoughts.

    This has represented a major change in everything to do with my job: my previous workplace moved very slowly, and had plateaued in terms of growth within the systems department. It was a great place, with cool people, but we were limited by an extremely restrictive budget and a dwindling dev team, which meant expansion and software experimentation were difficult and spent a long time in the pipeline. The need to maintain PCI-DSS compliance also created huge policy overhead; it takes a lot of time to ensure that policies are updated, followed, documented and reviewed, and the temptation is to reduce change in the environment and application in order to reduce the work needed to maintain the accompanying policies.

    The new company is the classic start-up: hungry for growth, fast to change and deploy, willing (and able) to try new technologies. We have a strong and dedicated dev team, constantly pushing out updates and improving our application; my challenge will be to keep pace with them and to enable them to perform even better.

    Coming on board has meant getting up to speed very quickly on AWS (EC2 primarily at this point) in order to get a handle on our usage and ensure that we’re using the instances we have online efficiently. Getting monitoring up as quickly as possible has helped tremendously, giving me solid metrics to measure our instances by. Cacti was my initial tool, as I’ve had a lot of experience with it in the past: it’s easy to read the graphs and relatively easy to bring online quickly. That provided basic data while I figured out how to install and use Zabbix, which gives us access to as many metrics as we can think of, plus awesome triggers/alerts based on those numbers. It’s an ongoing process for sure, and the migration from a Nagios-based background to the Zabbix setup/terminology is not the easiest, but it’s been entirely worth it and I can see huge potential moving forward.

    Effective monitoring is something I’ve found to be more and more important as I progress in my systems career: knowing what’s going on in your environment, knowing how changes trend, and maximizing the effectiveness of your alerting tools is key. Don’t alert for things that nobody’s going to do anything about; informative alerts might seem like a good idea early on, but they have a tendency to clutter inboxes rapidly and reduce the impact of alerts from your monitoring system. Zabbix has proven great because we can easily visualize any of the metrics being pulled in for information purposes, and thereby restrict the alerts to critical events. Alerts MUST result in an action; if they don’t, then you need to adjust your thresholds or disable the trigger.
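
    To make that concrete, the only triggers we keep are ones someone will actually act on. A hypothetical example (host, item and threshold invented for illustration) would be a sustained-load trigger like:

    {db01:system.cpu.load[percpu,avg1].avg(300)}>4

    Everything else stays as a collected, graphable item with no trigger attached at all.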

    Thanks to working in an environment where changes are welcomed, I’ve been able to deploy New Relic for our application profiling, which is giving both myself and our dev team great insight into how our app performs in the wild: where slowdowns occur and how we might address them. It’s easy to deploy, and we’ve experienced fantastic support from the New Relic team throughout our trial of their software. I’m looking forward to seeing how the profiling in New Relic impacts our application in the weeks to come.

    Now the pressure is on to add to and improve upon our existing systems structure: auto-scaling, system configuration automation (I’m leaning towards Puppet), fine-tuning of monitoring, integration of monitoring with auto-scaling to allow for a minimum of work when a new instance comes online (or is terminated), benchmarking our instances to determine the EC2 instance types that best suit the various parts of our application… Many, many projects to choose from, and the best part is that I work in a place where I have the freedom and the encouragement to tackle and experiment with as many of them as I want and see a use for. Our use of AWS has lots of room for growth: right now it’s quite basic and straightforward, so another task is to look at how to leverage some of the other services (beyond plain EC2) on offer there.

    Can’t wait. 2013 is looking to be an amazing year, full of learning and improvement.

  • PCI-DSS Audit time

    Once a year we undergo our PCI-DSS audit; it’s largely a time filled with document and procedure review, along with a healthy dose of meetings while we prepare for the audit. This year’s been great: already being 2.0 compliant means there are no changes for us, so we get to take the opportunity to review our existing policies and procedures with an eye to making them even better, refining wording, streamlining procedures and touching up methodologies. As much frustration as all the overhead has caused for our small company (where in the past we’ve been used to very immediate turnaround on tasks), we’ve learned a lot and improved much of what we do. Documentation has been the answer to just about everything: ensuring that descriptions are in place for how to do things, approval that they ought to be done that way, and records of what the results were. It has all vastly improved our ability to weather a disaster and replicate lost work. For a company as small as ours, PCI-DSS audits can still be a little stress-inducing, however they’re also great times for reflection and re-evaluation.

  • Push email and more

    A few pieces of software I’ve used in the last while have made my life much easier:

    • Z-Push and its Zimbra backend, which have made it easy to set up push email for our Zimbra mail server; this is much nicer for sending out monitoring alerts to our various devices. I came across a great walk-through that gets you up and going with a minimum of fuss: http://vwiki.co.uk/Z-Push_v2_with_Zimbra
    • Nexenta is the backbone of our new office network storage, super easy to use and a great alternative to buying specialized hardware. Thanks to Shane from EZP.net for his suggestion and advice.
    • Icinga-Web, an awesome new front-end for Icinga that is even better than the clean-up they did of the old Nagios web interface. All Ajax and Web 2.0 and whatnot.
  • Monitoring

    It’s been a long time since I’ve configured monitoring from the ground up. Typically, when and if it needs doing, the timing has been such that it makes a great project for a new junior admin. Nothing has the potential for teaching a new staff member about the network structure like monitoring does: you learn where all the servers and network hardware are, what services run where (and to a lesser extent why), what’s dependent upon what, who’s responsible for what… A good admin should be able to come out of a monitoring project with a lot of information about the company they work for; it provides a great excuse to question the structure of both the network and the staff.

    Recently I decided that our existing monitoring solution wasn’t cutting it, and didn’t justify the expense of a license fee. Icinga (http://www.icinga.org) is what I’ve chosen in the end; it seems to do a good job of taking the now industry-standard Nagios and making it just a bit nicer to use. There are still features missing from most monitoring solutions that I’d love to see (managing on-call contact rotation easily, and through a web interface, would be one of the biggest), but they’re largely things that we have either already built in-house workarounds for, or can easily do so.

    I had forgotten the amount of work that building a Nagios-based monitor can be: so many configuration files. Every time I think a host is fully defined I seem to see another service I could be monitoring, another dependency I could be setting, another escalation necessary to ensure that someone is notified when something fails. This being a network I’m familiar with, it’s taking some work to make sure I don’t gloss over things too: services that I take for granted and don’t explicitly think of monitoring. Each server is requiring a double and triple take. I’m also trying not to reference our existing monitoring too much, since it’d be (relatively) easy to just copy everything, but I suspect there’s stuff in there that’s not monitored properly, or worse, stuff that’s not monitored at all, as things have changed, new services been added, and new conditions created. So the trick is to make sure we’re still catching all the old stuff, while not being blinded to the new stuff.
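
    For anyone who hasn’t had the pleasure, each host turns into a stack of object definitions roughly like the following (the hostnames, groups and thresholds here are invented for illustration):

    # illustrative Nagios/Icinga object definitions -- names are made up
    define host {
        use        linux-server
        host_name  web01
        address    192.0.2.10
    }

    define service {
        use                  generic-service
        host_name            web01
        service_description  HTTP
        check_command        check_http
        contact_groups       sysadmins
    }

    define serviceescalation {
        host_name             web01
        service_description   HTTP
        first_notification    3
        last_notification     0
        notification_interval 30
        contact_groups        oncall-managers
    }

    Multiply that by every host and every service, add the dependencies between them, and the volume of configuration becomes clear.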

    The great thing is that, while this is typically a project I’d give a new staff member, it’s provided a lot of new awareness for myself, someone who’s been working with this set-up for years; I’m seeing potential projects for improvement of our services and network everywhere as I re-assess each server/component for monitoring – something that should probably be done more often, but that’s hard to find the excuse to do.

    Back to editing masses of definitions; definitely looking forward to cutting over to the new monitoring, and then tackling all the ideas bubbling up.