Chandler Hub Runbook

Overview

This is the operations runbook for the Chandler Hub service. It helps to:

This should not be a static document. As procedures are updated, or operational lessons are learned, this document should be updated.

This document can be found in OSAF's subversion repository, at:

This document is written in asciidoc format. It can be converted to HTML using this command line:

asciidoc -a toc -a toclevels=3 --out-file runbook.html runbook.txt

A copy of the HTML version (might be out of sync) is available here:

Support procedures

This section describes the basic guidelines to be used when a Chandler Hub problem is reported.

It should be noted that many outages are unique and end up outside the expected procedure. These steps are good guidelines, but do what you need to do.

Support agreements

Incident coordination

Incidents are coordinated by 1) a primary owner; and 2) an RT ticket (maybe bugzilla if appropriate).

If there is a network or server error, there needs to be a KEI support ticket. Anyone can create a ticket by sending email to the support alias. The first person to notice should send in a ticket. If you're not sure whether a ticket exists, send one in anyway; duplicates will be merged as needed. Dave, Paul, Jared, and Bear can own a network/server ticket in the KEI Request Tracker system. Support staff should check for an existing ticket in osafsrv or osafsrv-911 before opening a new one, and should own the ticket (via take, assign, or steal) before making server configuration changes or replying to the original requestor.

Incident sequence

Known issues

Shutdown may fail

On a loaded production instance, shutting down Cosmo using the wrappers may fail. You may very well need to kill -9 the process directly.

Errors don't always throw tracebacks

You don't always get a traceback when Cosmo throws a 500 error. Some API calls may fail without proper error notification or logging. It can be very difficult to figure out where to go with a problem report if no error is being thrown during a 500 error. Contact Cosmo developers for next steps.

Instances not service-managed; won't start/restart automatically

Runit-based service management has not yet been (re)implemented. There are no facilities in place to automatically restart a crashed Tomcat or start after a reboot.

Space limited; logical volumes not yet in use

The box is configured to provide LVM-based logical volumes; we will create large production databases in LV partitions mounted at /home or into /var directly. This logical volume scheme is not yet in place, so only the default 50GB / (slash) partition is in use. That gives plenty of room to grow after Preview, but before MySQL production gets to about 40GB, we will need to put logical volumes into place.
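
A minimal sketch of how the 40GB threshold above could be watched, written in Python. The function names, the 80% warning fraction, and the idea of measuring the MySQL data directory by walking it are assumptions for illustration; nothing like this is known to exist in bin/manage.

```python
import os

GB = 10**9
MYSQL_LIMIT_BYTES = 40 * GB  # the "about 40GB" threshold named in this runbook


def directory_size_bytes(root):
    """Sum the sizes of all regular files below root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                pass  # file vanished between listing and stat; ignore it
    return total


def needs_logical_volumes(used_bytes, limit_bytes=MYSQL_LIMIT_BYTES,
                          warn_fraction=0.8):
    """True once usage reaches warn_fraction of the limit (0.8 is an assumption)."""
    return used_bytes >= warn_fraction * limit_bytes
```

On a Debian host the data directory is usually /var/lib/mysql (an assumption here; check my.cnf), so `needs_logical_volumes(directory_size_bytes("/var/lib/mysql"))` would be the call to make, run as a user that can read that directory.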

Raw HTTP errors will be exposed

Graceful downtime messages have not been configured. It's very possible for users to see raw Apache or raw Tomcat errors. Sometimes API calls to Cosmo will return HTML-formatted errors in the body of error responses.

Shutdown/startup is slow

Startup and shutdown of the Cosmo Tomcat instance can be slow in large installations. Tomcat may take 90 seconds to stop and 45 seconds to start. Some transactions during startup are likely to fail, when parts of Snarf are running before the Cosmo webapp is.

Check bugzilla for known service issues

Search bugzilla.osafoundation.org for Product="Sharing Service" and Component="Known issues" for additional issues being tracked in production

Local files may be out of sync with Cosmo upstream

The production instance is customized in ways that differ from what developers and QA run. Hub customizations and replacement files may have drifted out of sync with Cosmo upstream.

JAVA_HOME can be tricky

All Java code (Tomcat, Cosmo, migration) depends on the JAVA_HOME variable (or on what is inferred when it is missing). Debian can have its own Java system, which might get called accidentally if you aren't careful. bin/manage uses /usr/local/java if no environment variable is set.
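
The fallback described above can be sketched in Python as follows. The helper names are hypothetical, and treating an empty JAVA_HOME the same as an unset one is an assumption; the only fact taken from this runbook is the /usr/local/java default.

```python
import os

DEFAULT_JAVA_HOME = "/usr/local/java"  # fallback named in this runbook


def effective_java_home(environ=None):
    """Return JAVA_HOME if set and non-empty, else the runbook's default."""
    env = os.environ if environ is None else environ
    return env.get("JAVA_HOME") or DEFAULT_JAVA_HOME


def java_binary(environ=None):
    """Path of the java executable that would be used for this environment."""
    return os.path.join(effective_java_home(environ), "bin", "java")
```

Printing `java_binary()` before a manual Tomcat or migration run is a quick way to confirm that the Debian system Java is not the one about to be picked up.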

Cosmo builds can fail during Maven repo setup

Maven downloads a variety of packages if it doesn't already have them cached, so the Cosmo build sometimes depends on remote sites, and a failure of a remote site can break it. These breakages can be intermittent or quite long-lasting. If the build of Cosmo breaks, the best route is to check with Mike Taylor.

Performing specific tasks

Use bin/manage

In the hub environment, most interactions with the Cosmo "daemons/instances/builds" are handled using the "bin/manage" script. This includes building, start/stop/restart, reconfiguring, database management, etc.

The bin/manage script relies on the conf/instances.conf configuration file. All information about the specific instances running on a server is defined in the instances.conf file. To create a new instance, you mostly just cut-and-paste a configuration stanza into a new block, and change the instance name and svn revision.
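
To illustrate the stanza format, the sketch below reads a small instances.conf with Python's standard configparser module, which accepts this `[name]` plus `key: value` layout. The hub-03 stanza is the example given later in this runbook, and the "production" block with its "current" variable follows the bin/manage usage text. The value `current: hub-03` and the helper names `load_instances` and `resolve_instance` are assumptions for illustration; this is not a description of how bin/manage itself reads the file.

```python
import configparser

SAMPLE_CONF = """\
[production]
current: hub-03

[hub-03]
tomcat_http_port: 8000
reverse_proxy_host: hub.chandlerproject.org
reverse_proxy_port: 443
cosmo_svn_url: http://svn.osafoundation.org/server/cosmo/tags/rel_0.7.0
cosmo_svn_revision: 5487
"""


def load_instances(text):
    """Parse instances.conf-style text into a ConfigParser object."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


def resolve_instance(parser, name):
    """Map the symbolic name "production" to the stanza it currently points at."""
    if name == "production":
        name = parser.get("production", "current")
    if not parser.has_section(name):
        raise KeyError(f"no stanza named {name!r} in instances.conf")
    return name


conf = load_instances(SAMPLE_CONF)
instance = resolve_instance(conf, "production")
print(instance, conf.getint(instance, "tomcat_http_port"))
```

A check like this is handy before a production swap: it confirms that the new stanza parses and that "production" points at a stanza that actually exists.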

Whenever possible, bin/manage simply constructs a regular Unix command-line and executes that. As such, you can usually recreate the exact setup as bin/manage by simply typing the same command lines in the same order.

In bin/manage output, the lines prefixed with !!! are notable. In general, they will show the actual commands run (with a CMD: prefix).

BE VERY CAREFUL WITH the "manage rebuild" option; it will drop the production table if the config file points to the production table. Take backups ("manage db_backup") and be very mindful when you get anywhere near production instances or any instances pointing at the production instance.

Note that bin/manage also attempts to manage which user it is run as. In general, production instances are run as a dedicated user, cosmo. bin/manage will change its user to match the targeted user if it can, and will decline to start otherwise. If you do operations directly as root, you may create permission problems later.

Instances can be named with most alphanumeric strings; spaces must not be used. Dashes are preferred over underscores (though the code will smash dashes into underscores before using them as database names.)
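
The naming rule above can be sketched in Python. The "cosmo_" prefix is inferred from the default database name cosmo_mig_test given for the mig-test instance later in this runbook; the exact character set accepted by bin/manage and both function names are assumptions.

```python
import re

# Letters, digits, dashes and underscores. The runbook only rules out spaces,
# so the precise accepted character set is an assumption.
_VALID_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")


def is_valid_instance_name(name):
    """True if name looks like a usable instance name (no spaces)."""
    return _VALID_NAME.fullmatch(name) is not None


def default_database_name(instance):
    """Dashes become underscores; the "cosmo_" prefix is inferred, not confirmed."""
    if not is_valid_instance_name(instance):
        raise ValueError(f"bad instance name: {instance!r}")
    return "cosmo_" + instance.replace("-", "_")
```

One consequence worth keeping in mind: under this rule the names mig-test and mig_test would map to the same database name, which is another reason to stick to dashes consistently.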

bin/manage has a built-in usage string, cut-and-pasted here on 2007-08-14:

Cosmo administration system: "bin/manage"

Command line syntax is:

  manage command[,command2,...] instance[,instance2,...]

You must specify one or more commands, and one or more instances.
Multiple commands and instances are separated with commas.

For example:

  manage stop test
  manage stop,log_nuke,print,start production
  manage rebuild trunk,mig-test,qa1

If you specify an instance of "production", it will be replaced with
the variable set in the "current" variable of the "production" block
of the configuration file.  The same substitution happens for the
"old" and "new" parameters.

The list of actions (start, stop, etc) can be abbreviated to the
shortest unique prefix (except for log and database operations).  So
this works:

  bin/manage sto,pri trunk

The manage script relies heavily on the instance configuration file,
named "conf/instances.conf".  Each Cosmo instance is defined in that
file.

In general, bin/manage should stop if a command errors out or
control-C is pressed.

The commands operate as follows:

build
  Create the named cosmo instances from source.

  The instance will be rooted in the directory set in the
  "instances_root" configuration parameter.  The instance will not be
  created if it already exists (unless the "nuke_dir_before" config
  parameter is set).

start
  Start the named Cosmo instances.

  Instances can be managed different ways (osafsrvctl, runit, etc).
  The "start" command will do the proper thing for each named
  instances.

  This command might fail is the TCP/IP ports are already bound by
  another daemon.

stop
  Stop the named Cosmo instances.

  The stop operation does not halt bin/manage if it fails.  This
  allows you to restart an already-stopped Cosmo instance.

restart
  Restart the named Cosmo instances.

  Exactly equivalent to "manage stop,start instance".

rebuild
  Build the named Cosmo instances again, adding stop/start.

  Exactly equivalent to "manage stop,print,build,start".

print
  Output a short block with key instance parameters.

  This will show the name, directories, and most configuration
  parameters that will be used for other operations.  It's helpful for
  knowing where files will be placed.

echo
  Output just the id of the named instances

  Works best with a single instance parameter, and a symbolic instance
  name like "production".  Then the results can be used in scripts
  (say in backticks) when needed for script automation.

processes
  Output a sysadmin-readable string describing the current Java processes.

  Currently, this is a hack wrapper around "ps | grep".

log_nuke
  Empty the Cosmo server log.

  The log emptied is the "osafsrv.log", located at
  ROOT/tomcat/logs/osafsrv.log.

db_nuke
  Drop and recreate an empty database for the named instances.

  This removes all data from the database used by the instance.  It
  doesn't recreate a fresh schema; it recreates and empty one, which
  Cosmo populates the first time it starts.  Alternatively, you can
  import a backup file into the empty database (or use the db_import
  command).

db_backup
  Take a database snapshot and store a new file in the backup dir.

  The backup directory used is set by the "db_backup_dir" variable in
  the configuration file.  The resulting file will contain the name of
  the database backed up and a timestamp.

db_import
  Reset (db_nuke) the database for the named instances,
  then import the specified file.

  This command takes a single third argument, a valid SQL file which
  is assumed will initialize a database and import all database from a
  given snapshot.  The file specified can be plaintext, gzip, or bzip2
  format and the import system will decompress if needed.

  The import operation will be timed.

db_migrate
  Run the schema migration operation against the database.

  This command runs the Cosmo-provided migration JAR file.  This
  migration system should be able take previous instances of Cosmo
  databases and apply SQL schema change operations on them to bring
  the database up to a version compatible with the named instance.
  The needed migration configuration file is built when the instance
  is built and should have all the proper needed values to match the
  instance.

  The migration operation will be timed.

remigrate
  Run all the standard "reset a migration test" steps.  It is equivalent to:

    bin/manage stop,build,db_import,db_migrate,print,start instances backup-file

  Remember to pass in the backup file to restore from as a third
  command-line parameter.  The backup file should be able to create
  all SQL and import data from a fresh database; it can be
  uncompressed, gzipped, or bzip2ed, as long as it has the standard
  file extension.

configure
  Apply customizations to the installation.

  Not guaranteed to be idempotent in some configuration.  This means
  running more than once may break some instances.  Others might be
  fine, depending on the configurations applied.  When an instance is
  built, it is also configured, so you do not generally need to use
  this command.
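
The command-line syntax in the usage text above (comma-separated commands, comma-separated instances, shortest-unique-prefix abbreviation except for log and database operations) can be sketched in Python like this. The command list is copied from the usage text; the function names and the exact error behaviour are assumptions, not the real bin/manage code.

```python
COMMANDS = (
    "build", "start", "stop", "restart", "rebuild", "print", "echo",
    "processes", "log_nuke", "db_nuke", "db_backup", "db_import",
    "db_migrate", "remigrate", "configure",
)
# Log and database operations must be typed in full.
FULL_NAME_ONLY = ("log_", "db_")


def resolve_command(token):
    """Expand an abbreviated command to its full name, or raise ValueError."""
    if token in COMMANDS:
        return token
    matches = [c for c in COMMANDS
               if c.startswith(token) and not c.startswith(FULL_NAME_ONLY)]
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous command {token!r}: {matches}")
    return matches[0]


def parse_manage_args(commands_arg, instances_arg):
    """Split "cmd1,cmd2" and "inst1,inst2" into two lists."""
    commands = [resolve_command(t) for t in commands_arg.split(",")]
    instances = instances_arg.split(",")
    return commands, instances
```

With this sketch, `parse_manage_args("sto,pri", "trunk")` yields the stop and print commands for the trunk instance, while a prefix such as "re" is rejected because it could mean restart, rebuild, or remigrate.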

Use runit

(FUTURE) Production instances are managed by the "runit" service manager. Essentially, it keeps Tomcat going if Cosmo dies or a box reboots. To control and examine runit services, use the "sv" command and the /var/service directory.

Use svn

The administrative system, at /home/hub, is a subversion working area of an OSAF repository. Changes to configuration files and administrative scripts should be tracked in svn. Check in your changes.

Check in changes to /home/hub when appropriate. Use the --username switch to svn. There may be permissions problems if you're not root, and root may interfere with other usage. Just a heads up that you might see errors when you check in.

Managing the production instance

The production instance of Cosmo should always be tracked in the configuration file. Some workflows use it during migration and "kick production" operations.

To get a decent amount of information about the current production instance, use:

ssh hub01.chandlerproject.org /home/hub/bin/manage print production

Determining if the production Java processes exist

Known issues are that the bin/manage "print" operation does not tell you the up/down status of the instance, and instances are not runit-managed. So we're down to pids and grep.

ssh hub01.chandlerproject.org
/home/hub/bin/manage print production
ps -eFHww |grep INSTANCE_NAME

OR

ssh hub01.chandlerproject.org /home/hub/bin/manage processes production

Obtaining information about the production instance

ssh hub01.chandlerproject.org sudo /home/hub/bin/manage print production

Stopping the production instance

ssh hub01.chandlerproject.org sudo /home/hub/bin/manage stop production

Note, per the known issues above, that the production instance may not stop by itself when under load. Use bin/manage processes production, kill, and kill -9 in that case.
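
If "bin/manage stop" leaves the Java process running, the manual escalation (kill, then kill -9) can be sketched in Python as below. The 90-second grace period comes from the shutdown time quoted under known issues; the function names and the injectable `is_alive` and `send_signal` hooks are assumptions added so the logic can be exercised without touching a real process. The pid itself would come from "bin/manage processes production".

```python
import os
import signal
import time


def pid_alive(pid):
    """True if a process with this pid exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user
    return True


def stop_with_escalation(pid, grace_seconds=90.0, poll_interval=1.0,
                         is_alive=pid_alive, send_signal=os.kill):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns "already stopped", "SIGTERM", or "SIGKILL" to say what it took.
    """
    if not is_alive(pid):
        return "already stopped"
    send_signal(pid, signal.SIGTERM)        # plain "kill"
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if not is_alive(pid):
            return "SIGTERM"
        time.sleep(poll_interval)
    try:
        send_signal(pid, signal.SIGKILL)    # "kill -9"
    except ProcessLookupError:
        return "SIGTERM"                    # exited just after the last check
    return "SIGKILL"
```

Giving SIGTERM the full grace period first matters because a kill -9 gives Tomcat no chance to shut down cleanly.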

Starting the production instance

ssh hub01.chandlerproject.org sudo /home/hub/bin/manage start production

Restarting the production instance

ssh hub01.chandlerproject.org sudo /home/hub/bin/manage restart production

The instance may not stop properly (see above under stopping), so restarts may also have problems.

Taking a database backup of production instance

ssh hub01.chandlerproject.org sudo /home/hub/bin/manage db_backup production

This will drop its output into /var/local/hub/db-backups as a timestamped file.

Updating Cosmo production

Performing a production update: no schema change, reuse database

In this operation, the goal is to update the Hub production instance of Cosmo while not touching the production database. For this case, the same production database is reused and there are no modifications to the database schema.

The standard downtime for this procedure is under 2 minutes. In general, such updates are not announced ahead-of-time but only after the fact. A small number of end-user failed operations is expected.

Assume that the existing production instance is named OLD and the instance you wish to update to is named NEW.

The command-line sequence used would be:

ssh hub01.chandlerproject.org
cd /home/hub
editor conf/instances.conf
  [create new configuration stanza for NEW]
  [set database_name appropriately]
  [set cosmo_svn_revision+url appropriately]
  [set tomcat_http_port to be same as OLD]
  [set reverse_proxy_host+port if needed]
sudo bin/manage build NEW
sudo bin/manage stop OLD && sudo bin/manage start NEW
svn --username bob ci conf/instances.conf

Performing a production update: database schema change

In this operation, the goal is to update the Hub production instance of Cosmo and change the database schema because the Cosmo upgrade requires it.

The standard downtime for this procedure is 30-60 minutes (slower and slower as the user base and database snapshots grow). A scheduled and announced downtime is standard procedure for this kind of update.

Assume that the existing production instance is named OLD and the instance you wish to update to is named NEW.

The command-line sequence used would be:

ssh hub01.chandlerproject.org
cd /home/hub
editor conf/instances.conf
  [create new configuration stanza for NEW]
  [set tomcat_http_port to be same as OLD]
  [set reverse_proxy_host+port if needed]
  [set cosmo_svn_revision+url appropriately]
  [set database_name if desired; will default to including instance name]
  (a specific example follows)

  [hub-03]
  tomcat_http_port: 8000
  reverse_proxy_host: hub.chandlerproject.org
  reverse_proxy_port: 443
  cosmo_svn_url: http://svn.osafoundation.org/server/cosmo/tags/rel_0.7.0
  cosmo_svn_revision: 5487

sudo bin/manage build NEW
[check through build output looking for errors]
sudo bin/manage stop OLD
[check that production service is down; going to web should show downtime page]
sudo bin/manage db_backup OLD
[the db_backup output will show where the backup file went;
 use that file in the next step (with .bz2 appended)]
sudo bin/manage db_import NEW BACKUP.sql.bz2
sudo bin/manage db_migrate NEW
sudo bin/manage start NEW
svn --username bob ci conf/instances.conf

Managing test instances of Cosmo

Rebuilding the test instance from Cosmo trunk

ssh lab.osaf.us sudo /home/hub/bin/manage rebuild trunk

Rebuild a test instance

The simplest way to rebuild a test instance is:

ssh lab.osaf.us /home/hub/bin/manage rebuild INSTANCE

For this to work, the INSTANCE used must have an entry in /home/hub/conf/instances.conf which allows rebuilding by removing existing instances. To do this, set nuke_dir_before to true.

You will probably need to edit instances.conf to set the desired revision number (unless using a symbolic revision like HEAD).

Create a migrated test instance

Here, we want to make a test instance on lab.osaf.us which contains migrated data. The expected time for this operation is about 10 minutes (at snapshot sizes as of 2007-08-10).

This operation assumes that Apache proxypass is configured on port 80 and pointing at the test instance port.

The command-line sequence used would be:

ssh hub01.chandlerproject.org
sudo bin/manage db_backup production
cd /var/local/hub/db-backups
scp LATEST.sql.bz2 lab.osaf.us:/tmp/migrate.sql.bz2

ssh lab.osaf.us
cd /home/hub
editor conf/instances.conf
  [create or update stanza for mig-test instance]
  [set tomcat_http_port to proxypass destination port: 8000]
  [set nuke_dir_before to true for easy restart]
  [set reverse_proxy_host to lab.osaf.us, port defaults to 80]
  [optionally set cosmo_svn_revision+url, defaults to trunk HEAD]
  [database name defaults to cosmo_mig_test]

  [mig-test]
  tomcat_http_port: 8000
  nuke_dir_before: true
  reverse_proxy_host: lab.osaf.us

sudo bin/manage remigrate mig-test /tmp/migrate.sql.bz2
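
The usage text says db_import and remigrate accept plaintext, gzip, or bzip2 backups and decompress as needed based on the standard file extension. A minimal Python sketch of that selection follows; the function name is hypothetical, and ".gz" is assumed to be the gzip extension meant (".bz2" is the one used throughout this runbook).

```python
import bz2
import gzip


def open_backup(path):
    """Open a SQL backup as text, picking the decompressor from the extension."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

Reading the first few lines of /tmp/migrate.sql.bz2 this way is a cheap sanity check that the copy from production is intact before starting a remigrate.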

Execute a production/migrated swap test

To "really" test Chandler Desktop support for server migrations, one needs to do a swap using the same hostname. As admin/tester, you change the server instance out from underneath Chandler to a new migrated instance and see if Chandler notices.

Note, support in Chandler Desktop for the ability to change the host/port that a collection points to would mostly eliminate the need for this operation.

ssh lab.osaf.us
cd /home/hub

[15 min before test to start]
sudo cp conf/apache2-lab-hub-redirect.conf /etc/apache2/sites-available/lab && sudo /etc/init.d/apache2 reload

[wait, as dogfooders restore their collections into Chandler Desktop]
[take a "bin/manage db_backup" on production, copy to lab as /tmp/migrate.sql.bz2]

sudo bin/manage remigrate mig-test /tmp/migrate.sql.bz2
sudo cp conf/apache2-lab.conf /etc/apache2/sites-available/lab && sudo /etc/init.d/apache2 reload

Create a dual-frozen migration snapshot testing instance

To properly run automated migration tests, one needs a frozen pre-migration snapshot Cosmo instance matched with a post-migration Cosmo instance imported and migrated from the same Hub snapshot.

To do this, you just build the two instances from different svn versions using bin/manage. Then update the snapshots like this:

ssh lab.osaf.us
cd /home/hub
sudo bin/manage stop mig-before,mig-after
sudo bin/manage build mig-after
sudo bin/manage db_import mig-before,mig-after /tmp/migrate.sql.bz2
sudo bin/manage db_migrate mig-after
sudo bin/manage start mig-before,mig-after

Updating the Apache reverse proxy configuration

See the description of the Apache layer below.

In particular, note that a single apache2.conf is used, not the Debian split-style. The Apache configuration file is change controlled, and any changes should be tracked, diffed, and checked in when live.

The OS-default /etc/apache2/apache2.conf is symlinked to /home/hub/conf/apache2-hub01.conf.

To update the Apache configuration, for example when the reverse proxy configuration needs updating:

ssh hub01.chandlerproject.org
cd /home/hub
editor conf/apache2-hub01.conf
svn diff
/etc/init.d/apache2 reload
svn --username bob ci conf/apache2-hub01.conf

Using the system's RAID card

On hub01.chandlerproject.org, the RAID card is an LSI 8408E; see hardware description for details.

/opt/MegaCli -help
/opt/MegaCli -AdpAllInfo -a0
/opt/MegaCli -PDList -a0
/opt/MegaCli -CfgSave -f filename -a0
/opt/MegaCli -CfgRestore -f filename -a0
/opt/MegaCli -AdpEventLog -GetEvents -f filename -a0

RAID monitoring is handled by the /etc/cron.d/raid-monitor cronjob. Running every 30 minutes, this job emails root if the -AdpAllInfo command returns values other than "0" for key lines.
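
The kind of check the cronjob performs can be sketched in Python as below. This is not the contents of /etc/cron.d/raid-monitor: the function name is hypothetical, and the list of key lines is an assumption about which counters in the -AdpAllInfo output matter, so adjust it to whatever the real job watches.

```python
# Counters assumed to be the "key lines"; the real cronjob may watch others.
KEY_LINES = ("Degraded", "Offline", "Critical Disks", "Failed Disks")


def nonzero_key_lines(adp_all_info_text, keys=KEY_LINES):
    """Return the "name : value" lines whose name is a key and value is not 0."""
    problems = []
    for line in adp_all_info_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in keys and value.strip() != "0":
            problems.append(line.strip())
    return problems
```

The text to check would be the captured output of `/opt/MegaCli -AdpAllInfo -a0`; an empty result means nothing to report, and anything else is worth mailing to root.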

Handling problems

Investigating a Nagios monitor alert/page

The Nagios monitor script tests full-stack Cosmo functionality every 10 minutes. The script does HTTP transactions that exercise the full Cosmo stack: network, Apache, Tomcat, Cosmo, MySQL. The three operations are DAV PUT, CMP GET, and MC GET. If any of them fails, something is seriously wrong. Most likely the network is problematic, the drives have filled up, MySQL is frozen, or something along those lines. If one of them fails but the web UI is fine, something is also seriously wrong: either the script is not testing what it thinks it is testing, or there is a unique Cosmo situation which requires emergency developer intervention.

Most likely, you'll see network-related troubles. If you can log in to the web UI and there are no errors in the Cosmo logs, the Hub is probably healthy.

If you get an alert, it's likely in one of these states:

Fixing a user's account

Log in to the Cosmo web UI as an administrator. Use the admin UI to find the user account, log in and poke around. From there you should be able to activate the account, change the password, delete the problem collection or item, and most other actions that may be needed to re-establish a working user account.

Investigating problem reports

The easiest way to start an "is this working" test is "can you log in to your personal Hub account" from wherever you are.

If you can't log in, then you've probably got a large outage and other people can't either. Investigate further with a critical priority.

If you can log in and do some operations but not others, then you've got a partial failure that probably affects lots of people. Investigate further with a high priority.

Try logging in as root if you can't log in as a regular user.

Try pinging hub01.chandlerproject.org to establish basic network and server availability.

If you can ping, try going to http://hub.chandlerproject.org/. You should get a login page.

If you can't get to port 80/443 but can get directly to the Cosmo instance's port via HTTP, then you probably have Apache problems. See if it's running and what's in the access+error logs.

To check Cosmo health, log in and check that the production process is running using the procedure above (bin/manage proc prod).

If Cosmo is running, examine the access logs to see if there's incoming traffic, examine the osafsrv.log to see if there are errors/tracebacks, examine network traces to see if Cosmo traffic has hidden errors. Look at the osafsrv.log files and access.log to see if there's an abnormal pattern of errors or transaction types. If there are access log entries accumulating and they have good HTTP response codes, chances are good many or most users are using the service successfully so at worst you're experiencing a partial outage.
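
One way to judge whether the access log shows "good HTTP response codes" is to tally status classes, sketched here in Python. It assumes the access log uses the Common Log Format, where the status code follows the quoted request line; that format, the function names, and the idea of an error fraction are assumptions, not something this runbook specifies.

```python
import re
from collections import Counter

# In Common Log Format the 3-digit status follows the quoted request line.
_STATUS = re.compile(r'"[^"]*" (\d{3}) ')


def status_classes(lines):
    """Count responses per status class ("2xx", "4xx", "5xx", ...)."""
    counts = Counter()
    for line in lines:
        match = _STATUS.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


def error_fraction(counts):
    """Share of 5xx responses among all counted responses."""
    total = sum(counts.values())
    return counts["5xx"] / total if total else 0.0
```

Feeding it the last few thousand lines of the current access log gives a rough picture: a steady stream of 2xx entries with a small 5xx share points to a partial outage at worst.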

If it looks like you need to bounce a Cosmo instance, use bin/manage like this:

ssh hub01.chandlerproject.org sudo /home/hub/bin/manage restart INSTANCE

You might want to look at syslog if the OS is perhaps being wonky: tail -f /var/log/syslog|less

If needed, there are full production network traces kept in rotating files here: /var/local/hub/network-trace. These can be very useful for seeing exactly what's happening on the service.

Problems building Cosmo

Sometimes Maven does weird things. To force a full clear, set nuke_maven_before in a stanza. bin/manage will blow away the Maven repository cache entirely and restart. Note that, depending on the state of remote Maven repositories, you may not be able to rebuild Cosmo from scratch after blowing away the repository; sometimes key packages are temporarily unavailable.

Pointing developers at the server logs

On the server, the log files are kept in these locations:

/home/cosmos/INSTANCE/tomcat/logs
/home/cosmos/INSTANCE/tomcat/logs/osafsrv.log
/var/local/hub/apache-logs
/var/local/hub/network-trace
/var/local/hub/db-backups
/var/service/INSTANCE/log/logs

System description

Public services provided

The Chandler Hub service provides:

Additional details include:

- A 2-hour timeout is configured for the web UI.
- There are no current limitations set on the storage limits or bandwidth consumption of account users.
- There's a welcome screen which includes a sign-in box and a link to create an account.
- The account creation requires first name, last name, email address, username, password, and a confirmation that the terms of service are agreed to.
- When a user logs in to the web UI, they are shown a dashboard view for their default collection.

Limitations include:

- You can't overlay multiple collections into one calendar or dashboard view.
- You cannot upload a monolithic *.ics file (via iCal 2.x) and then use the Hub web UI to operate on that calendar.

Software architecture

The service is provided by a Tomcat instance running the Cosmo Java webapp (supported by many Java packages), all sitting behind an Apache reverse proxy. The production servers are Debian Etch 64-bit. The server uses a hardware RAID 1+0 storage configuration. Outgoing email is sent by a local Postfix MTA.

Operating system

Debian pretty much everywhere. hub01.chandlerproject.org is Debian Etch AMD64, full 64-bit version. No hypervisor or any other kernel strangeness.

The OS is tracked using the specific release name, "etch" instead of the logical name "stable". This is because stable will eventually change, but the Hub servers should only be updated from an explicit action.

Currently, for boot procedure, the production Hub instance is started using /etc/rc.local. After runit service management is in use, this method would be removed.

Filesystem

The RAID card takes the 8 drives, forms a RAID 1+0 plus 2 hot-spare volume, and presents that single volume to the OS for partitioning. In this configuration, really only 3 of the drives are in use; the data is striped (RAID 0) across these 3x 15K RPM drives. All other drives are used for redundancy.

Hardware write-back cache is configured (this requires the use of the battery backup card).

So approximately 900GB is presented to the OS for partitioning. Using a ms-dos partition table layout, the following partitioning scheme is used:

- sda1 2GB /boot ext3 primary
- sda5 50GB / ext3 logical
- sda6 20GB /spare ext3 logical
- sda7 48GB swap logical
- sda8 (rest, 750GB) LVM physical region logical

$ fdisk -l /dev/sda
Disk /dev/sda: 896.9 GB, 896998047744 bytes
255 heads, 63 sectors/track, 109053 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1         243     1951866   83  Linux
/dev/sda2             244      109053   874016325    5  Extended
/dev/sda5             244        6322    48829536   83  Linux
/dev/sda6            6323        8754    19535008+  83  Linux
/dev/sda7            8755       14590    46877638+  82  Linux swap / Solaris
/dev/sda8           14591      109053   758774016   8e  Linux LVM
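
As a cross-check of the partition list against the fdisk listing, the block counts can be converted to decimal gigabytes. The Python sketch below copies the block counts from the listing above and relies on fdisk reporting the Blocks column in 1 KiB units; the variable and function names are only illustrative.

```python
# Block counts copied from the fdisk listing; the Blocks column is in 1 KiB units.
BLOCK_BYTES = 1024
PARTITION_BLOCKS = {
    "sda1": 1951866,
    "sda5": 48829536,
    "sda6": 19535008,
    "sda7": 46877638,
    "sda8": 758774016,
}


def size_gb(blocks):
    """Convert a 1 KiB block count to decimal gigabytes (10**9 bytes)."""
    return blocks * BLOCK_BYTES / 10**9


sizes = {name: round(size_gb(blocks)) for name, blocks in PARTITION_BLOCKS.items()}
for name, gb in sizes.items():
    print(f"{name}: about {gb} GB")
```

The first four partitions come out at the 2, 50, 20, and 48 GB quoted in the list, while sda8 works out somewhat larger than the rounded "750GB" figure.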

LVM configuration
- /dev/vg00/home00 250GB /home ext3
- /dev/vg00/var00 100GB /var/local ext3

The filesystem layout is Debian standard. These localizations should be noted:

Ext3 filesystems are mounted noatime by default, and are built with dir_index.

Reverse HTTP proxy

The well-known HTTP ports, 80 and 443, are managed by Apache. The user's browser contacts port 443, Apache picks up the phone, and acts as an HTTP proxy to the backend Tomcat+Cosmo daemon (which also speaks HTTP, but on a higher port).

Technically, an HTTP server sitting in front of an application like this is called a reverse proxy (as opposed to a forward proxy, which sits in front of many local users at a company, for instance). So in our case, Apache's mod_rewrite and mod_proxy are serving in a reverse proxy configuration to the Cosmo application.

Apache (we run 2.2) is in place to serve these functions:

Apache stores three logs for each port:

All Apache logs are managed using cronolog, which takes care of datestamped file rotation as well as symlinks to the current files. Apache logs are actually stored in /var/local/hub/apache-logs but are symlinked from the usual /var/log/apache2 location.

The Apache configuration does not use Debian's split configuration files, as those are very difficult to keep under change control. Instead, the entire Apache configuration is a single file, with all variables including VirtualHost blocks.

Application

The primary application of production is Chandler Server, aka Cosmo. Cosmo is a database-backed Java application, running inside a Tomcat container. Each instance of the Cosmo application is independent and rooted at a specific directory. All managed instances of Cosmo are rooted at subdirectories of /home/cosmos.

What we call a Cosmo instance here is really a large collection of interoperating components. As admin, you start and stop the Tomcat container, and Tomcat takes care of starting and stopping all the associated Java components.

The directory structure of each individual instance is like this:

/home/cosmos/[INSTANCE]/
- tomcat/
  - bin/
    - osafsrvctl
  - etc
    - cosmo.properties
  - logs/
    - osafsrv.log
    - access.YYYY-MM-DD.log
  - tomcat/
    - conf/
      - server.xml (main tomcat config file)
    - tomcat/
      - logs/ (rarely have anything)
- webapp/
  - WEB-INF/
    - classes/
      - MessageResources.properties
      - PimMessageResources.properties
    - jsp/
      - about.jsp
- build/
  - cosmo/, snarf/, migration/, pom.xml
- migration/
  - cosmo-migration-XXXXXXXXX-with-dependencies.jar
  - migration.properties

All Cosmo applications are Java based, and the standard Java runtime to use is BEA JRockit, available at /usr/local/java.

Database

The production database is MySQL 5.0, Debian Etch standard. All production tables are InnoDB with UTF-8 charsets.

We use a localized version of the MySQL daemon's configuration file at /etc/mysql/my.cnf.

The Cosmo relational database is organized around these primary concepts: users, items, attributes, stamps, tickets, subscriptions. As of Cosmo 0.7, the tables used are:

mysql> show tables;
+------------------------+
| Tables_in_cosmo_hub_02 |
+------------------------+
| attribute              |
| cal_property_index     |
| cal_timerange_index    |
| calendar_stamp         |
| collection_item        |
| content_data           |
| dictionary_values      |
| event_stamp            |
| item                   |
| multistring_values     |
| pwrecovery             |
| server_properties      |
| stamp                  |
| subscription           |
| ticket_privilege       |
| tickets                |
| tombstones             |
| user_preferences       |
| users                  |
+------------------------+
19 rows in set (0.07 sec)

Maintenance

The components of Hub maintenance include:

Environment

Details of the production hardware

The production hardware is the single "big server" named "hub01.chandlerproject.org".

Physical environment

hub01.chandlerproject.org sits in the OSAF/KEI colocation facility at 365 Main (San Francisco). The service provider is Global Netoptex. The cage is a half-rack labelled "8-20A" in colo 8 (4th floor). 3 people have physical access permission: Jared Rhine, Dave Cowen, Paul Lathrop. Those three can also request remote hands from GNi 24x7 if needed, via phone call or email to GNi support. The IP-accessible KVM+managed power (gni-kvm.kei.com, port #1) serves to make most remote hands needs unnecessary. GNi remote hands information is here:

https://info.kei.com/bin/view/Technology/GlobalNetoptex

The 4U server is on sliding rails. Each of the two hot-swappable power supplies (800-watt) is plugged into a separate power circuit, controlled by an IP-accessible power switch. You can unplug one power cord at a time to "walk the machine" between hosting locations.

A Knoppix 5.2 DVD is sitting in the DVD-R drive in the server. The BIOS is set to boot from SCSI drive before optical drive. If remote recovery is needed, one can use the IP KVM to boot into the BIOS, change the order of boot devices (putting CD higher in the list), then boot into Knoppix for recovery attempts.

Network

The hub01.chandlerproject.org box is plugged via 1x "Gig-E" (1Gb-Ethernet) NIC and CAT-5e cable to a managed gigabit switch. The switch is plugged upstream into GNi's switch fabric. GNi and 365 Main are very well connected, with a wide variety of peers.

hub01 sits on the following network:

hub01 IPv4: 64.127.108.178
Network:    64.127.108.128/26
Netmask:    255.255.255.192
Gateway:    64.127.108.129

There is no firewall in front of hub01.chandlerproject.org. Pings should work if the network and server are up and functional.
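
When editing the static network configuration, it can help to confirm that the addresses listed above are mutually consistent. The following is a minimal sketch using Python's standard ipaddress module; the function name describe_network is illustrative and not part of any Hub tooling.

```python
import ipaddress

# Values copied from the network listing above.
NETWORK = ipaddress.ip_network("64.127.108.128/26")
HOST = ipaddress.ip_address("64.127.108.178")
GATEWAY = ipaddress.ip_address("64.127.108.129")


def describe_network(network, host, gateway):
    """Return basic facts used to sanity-check a static IP configuration."""
    return {
        "netmask": str(network.netmask),
        "broadcast": str(network.broadcast_address),
        # Subtract the network and broadcast addresses.
        "usable_hosts": network.num_addresses - 2,
        "host_in_network": host in network,
        "gateway_in_network": gateway in network,
    }


if __name__ == "__main__":
    for key, value in describe_network(NETWORK, HOST, GATEWAY).items():
        print(f"{key}: {value}")
```

The computed netmask should agree with the Netmask line above, and both membership checks should be true.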

DNS

Chandler Hub really only takes up one DNS A record (forward DNS address), that of the production machine, hub01.chandlerproject.org at 64.127.108.178. A reverse PTR record is in place.

DNS primary for chandlerproject.org is managed via KEI IT systems on admin.kei.com as ns.osafoundation.org. DNS secondary is provided by ns2.osafoundation.org on makani.osafoundation.org. There is no third DNS secondary.

hub01.chandlerproject.org uses four servers for DNS resolution: first the OSAF primary on the same network, then GNi's two servers, and last the OSAF secondary.

ns.osafoundation.org    64.127.108.142  (admin.kei.com)
ns1.globalnetoptex.com  64.127.100.11
ns1.globalnetoptex.com  64.127.100.12
ns2.osafoundation.org   204.152.186.99  (makani.osafoundation.org)

Email

Incoming email (MX) is not configured for chandlerproject.org or hub.chandlerproject.org. Currently, contact email addresses are in the "osafoundation.org" domain.

The published "administrative contact address" for Chandler Hub is "hub-admin --AT-- osafoundation.org". Activation emails come from this address, and Cosmo renders it occasionally when it needs to print a contact address. Jared Rhine and Ted Leung are on the distribution list for that alias.

There are SPF DNS records in both chandlerproject.org and osafoundation.org which include the hub01.chandlerproject.org and the admin/VM server IP addresses, so outgoing email should pass ok if sent from Hub production or a helper/admin virtual machine.

Accounts

The people listed below have accounts on the production or test servers where production data may be found. Because such accounts may allow access to account data, a person must be a member of the "service-eyes-only" group (and mailing list) to have physical or network server access. Changes to the service-eyes-only list are approved only by the OSAF operations/management group "ops-wg".

Server access
- Jared Rhine
- Dave Cowen
- Paul Lathrop
- Mike Taylor
- Randy Letness

Service-eyes-only additional people
- Mikeal Rogers
- Morgen Sagen
- Andi Vajda

Monitoring

The Hub is monitored via Nagios, Munin, and some custom collection scripts.

The Nagios monitor is running on monitor.kei.com. The script is /home/hub/hub_runtime_check.py. It's not a working area, but a copy of the script is in svn as well at http://svn.osafoundation.org/sandbox/hub/trunk/libexec/nagios/hub_runtime_check.py

The check does three operations: DAV PUT, MC subscribe, and CMP user fetch.

All operations are on a 15-second timeout.

The output of the monitor goes to a hub-specific Nagios contactgroup, where it is distributed to people's pagers.
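
The overall shape of such a check can be sketched as follows. This is not the contents of hub_runtime_check.py, only an illustration of running several named operations with a shared timeout and mapping the result onto the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL). The run_checks and placeholder_check names are hypothetical, and the placeholder does no network I/O.

```python
# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2
TIMEOUT_SECONDS = 15  # matches the per-operation timeout described above


def run_checks(checks, timeout=TIMEOUT_SECONDS):
    """Run each (name, callable) pair; return (exit_code, status_line).

    Each callable receives the timeout in seconds and is expected to raise
    an exception on failure or timeout.
    """
    failures = []
    for name, check in checks:
        try:
            check(timeout)
        except Exception as exc:  # any failure makes the service CRITICAL
            failures.append(f"{name}: {exc}")
    if failures:
        return CRITICAL, "HUB CRITICAL - " + "; ".join(failures)
    names = ", ".join(name for name, _ in checks)
    return OK, f"HUB OK - {names} succeeded"


def placeholder_check(timeout):
    """Hypothetical stand-in; a real check would make an HTTP request here."""
    return None


if __name__ == "__main__":
    code, line = run_checks([
        ("DAV PUT", placeholder_check),
        ("MC subscribe", placeholder_check),
        ("CMP user fetch", placeholder_check),
    ])
    print(code, line)
```

A real plugin would print the status line and exit with the returned code so that Nagios can route the alert to the contactgroup.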

For detailed server stats, see:

For application and HTTP statistics, see:

Be sure to check the "Investigating a Nagios monitor alert/page" section for more details about monitoring.

Installation log

Rack machine, 2007-07-13
- Install rails
- Slide 4U into rails
- Connect 2x redundant power cables into managed power switches, ports XXX?
- Plug ethernet into
- Install Debian Etch 64-bit (AMD64) from Etch official netinst CD
- Use expert mode for installer
- Default charset of en_US.UTF8
- Unselect both default package sets at end of OS install to
  configure bare minimum package set
- Create "jared" user during OS install
- add "Frontend: teletype" to first stanza of /etc/debconf.conf
- edit /etc/apt/sources.list.  Prune to single "etch main" mirror and
  one security/updates mirror.  Use "etch", not "stable" for distrib
  name.  Remove CD-ROM and deb-src entries.  Remove contrib from
  security updates list.  Use mirrors.kernel.org/debian/ for upstream.
- apt-get update
- apt-get install sudo ssh
- Update /etc/sudoers to include "%staff ALL=(ALL) NOPASSWD: ALL"
- Add user 'jared' to group 'staff'
- Remove "jared" from lots of standard /etc/group entries
- Remove manual KVM, attach to IP KVM gni-kvm.kei.com, port #1
- Check remote KVM and ssh accessibility, sudo access

Install basic packages
- apt-get install ntp bind9-host debconf-english bzip2 lsof pciutils tct time
- apt-get install less subversion curl screen rsync multitail
- apt-get install bwm-ng iftop tcpflow netdiag ngrep traceroute-nanog tshark
- apt-get install emacs21-nox emacs-goodies-el
- apt-get install lvm2 runit
- apt-get install python2.5-dev python2.4-dev
- apt-get install apache2-mpm-prefork cronolog mysql-server-5.0 mtop
- apt-get install debsecan debsums dstat e2undel
- apt-get install postfix (select 2-Internet site,
  hub.chandlerproject.org mailname)
- apt-get install cron-apt
- Update /etc/cron-apt/config: MAILON=changes, MAILTO=cron-apt --AT-- osafoundation.org

Set up LVM for /home and /var/local
- pvcreate /dev/sda8
- vgcreate vg00 /dev/sda8
- lvcreate --size 250g --name home00 vg00
- lvcreate --size 100g --name var00 vg00
- mke2fs -j -O dir_index /dev/vg00/home00
- mke2fs -j -O dir_index /dev/vg00/var00
- Add to /etc/fstab: /dev/vg00/home00 /home ext3 noatime,errors=remount-ro 0 1
- Add to /etc/fstab: /dev/vg00/var00 /var/local ext3 noatime,errors=remount-ro 0 1
- Move /home/jared to /move/jared
- mount -a
- cp -a /move/jared into /home/; rm -rf /move
- Create "rletness" user; add to "staff" group

Install Java
- chmod 2775 /usr/local/src && chgrp staff /usr/local/src
- Download into /usr/local/src and install JRockit
  R27.3.0-jdk1.5.0_11-linux-x64.bin into /opt and /usr/local/java
- As jared, run ssh-keygen; transfer public key to
  svn.osafoundation.org to set up svn public key authentication

Set up /home/hub
- cd /home/jared && svn --no-auth-cache co svn+ssh://svn.osafoundation.org/svn/sandbox/hub/trunk hub
- sudo mv hub /home/
- cd /usr/local/src
- Download
  http://archive.apache.org/dist/maven/binaries/maven-2.0.7-bin.tar.gz
  into /usr/local/src.  Untar, move to /usr/local/maven-2.0.7, and
  symlink /usr/local/maven2 to maven-2.0.7.  cd /usr/local/bin && ln
  -s /usr/local/maven2/bin/mvn
- /home/hub/bin/manage build trunk

Install LSI RAID-card CLI
- Download into /usr/local/src
  http://www.lsi.com/support/downloads/megaraid/miscellaneous/Linux_MegaCLI_1.01.24.zip
  Linux CLI for LSI 8408E RAID card
- apt-get install unzip rpm
- cd /usr/local/src && mkdir lsi-cli && cp Linux_MegaCLI_1.01.24.zip
  lsi-cli && cd lsi-cli
- unzip Linux_MegaCLI_1.01.24.zip
- unzip MegaCliLin.zip
- Move to /opt/MegaCli
- apt-get install diffmon
- mkdir -p /var/local/raid-monitor
- editor /etc/cron.d/raid-monitor
    [hourly, /opt/MegaCli -AdpAllInfo -a0 > /var/local/raid-monitor/allinfo.log]
    [hourly+3min diffmon -c /etc/diffmon/diffmon.cf]
- rm /etc/cron.daily/diffmon
- editor /etc/diffmon/diffmon.cf
    [/var/local/raid-monitor/allinfo.log]

Tune packages
- sudo aptitude
- Mark all un-needed packages as dependencies (key M).  Remove swaths
  of un-needed packages, resulting in tuned package set

Set up file transfer
- sudo adduser filexfr
- sudo -u filexfr -i
- ssh-keygen
- exit
- sudo mkdir -p /var/service/filexfr/log/logs
- [create runit service]

Set up database backups
- sudo editor /etc/cron.d/hub-db-backup
  0 * * * * root /home/hub/bin/manage db_backup production > /dev/null
- sudo /etc/init.d/cron reload

Set up network trace
- sudo apt-get install tcpflow
- sudo mkdir -p /var/service/hub-network-log/log/logs
- [create runit service]

Set up package monitoring
- sudo apt-get install apticron
- [configure]

Uncommon situations

Restoring from backups

The backup files are (currently) straight MySQL files, so they can be piped into the mysql command-line tool.

On hub01.chandlerproject.org, backups can be found in the /var/local/hub/db-backups directory. They are taken every hour.

Using the runit service filexfr, copies of the backup files are rsynced off-host to the protected archive.osafoundation.org server in the /home/files/backups directory.
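
Before relying on a backup, it is worth confirming that the hourly job is still producing files. The sketch below finds the newest file in the backup directory and reports whether it is recent enough. The two-hour threshold and the function names are assumptions made for illustration; no such helper is known to exist in bin/manage.

```python
import time
from pathlib import Path

BACKUP_DIR = Path("/var/local/hub/db-backups")  # directory named above
MAX_AGE_SECONDS = 2 * 60 * 60  # assumed threshold: two missed hourly runs


def newest_backup(directory):
    """Return the most recently modified file in directory, or None."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


def backup_is_fresh(directory, max_age=MAX_AGE_SECONDS, now=None):
    """True if the newest backup was modified within max_age seconds."""
    latest = newest_backup(directory)
    if latest is None:
        return False
    now = time.time() if now is None else now
    return (now - latest.stat().st_mtime) <= max_age


if __name__ == "__main__":
    if BACKUP_DIR.is_dir():
        print(newest_backup(BACKUP_DIR), backup_is_fresh(BACKUP_DIR))
    else:
        print(f"{BACKUP_DIR} not found on this host")
```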

To restore, figure out the name of the mysql database you want to restore into, empty that database, and then pipe in the mysql backup file:

bzcat /tmp/backups.sql.bz2 | mysql -uroot dbname
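
Because the restore starts by emptying the target database, it may be prudent to confirm first that the compressed dump decompresses cleanly. The following is a minimal sketch using Python's standard bz2 module; dump_is_readable is a hypothetical helper name, and the check only validates the bzip2 stream, not the SQL inside it.

```python
import bz2


def dump_is_readable(path, chunk_size=1024 * 1024):
    """Decompress the whole file once; return (ok, decompressed_bytes)."""
    total = 0
    try:
        with bz2.open(path, "rb") as stream:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (OSError, EOFError):
        # OSError: invalid data stream or unreadable file;
        # EOFError: the file ends before the end of the bzip2 stream.
        return False, total
    return True, total


if __name__ == "__main__":
    print(dump_is_readable("/tmp/backups.sql.bz2"))
```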

You can also use bin/manage to manage and restore tables. That would look like:

ssh hub01.chandlerproject.org
cd /home/hub
sudo bin/manage db_import INSTANCE /tmp/cosmo_hub_01_20070724T070002.sql

MySQL has gotten corrupted

Each corruption incident is different, so no general procedure is given here. In general, recovering MySQL means trying to replay the transaction logs. See the MySQL documentation and search the web for how to proceed. In the worst case, remember that we have plain-text backups.

Cosmo won't start

A common cause is trying to bind to a port which is already in use. Check other instances. Stop, rebuild, or reconfigure your instances as appropriate in that case.
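
One quick way to test for a port conflict is to see whether anything already accepts connections on the port. The sketch below does this with a plain TCP connect from Python; port_is_listening is an illustrative name, and port 8000 is used only because it is the tomcat_http_port value in the example instance configurations later in this runbook.

```python
import socket


def port_is_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("port 8000 in use:", port_is_listening(8000))
```

This only detects listeners reachable on the given address; a tool such as lsof (installed on hub01 per the installation log) can then identify the owning process.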

Look in the osafsrv.log. If there are no indications there, you could try increasing the debugging level, but in the past there has almost always been a traceback indicating the problem.

It's possible for Tomcat to start, but Cosmo to not start. You should be seeing Tomcat HTTP errors upon access in those cases.

See if other instances can be started. If a particular instance has been damaged, you might be able to build a different instance to point to the appropriate database and port.

Java upgrade is needed

Sometimes the Java VM (JRE/JDK) needs to be updated, for instance to fix a Java bug or match the requirements of the Cosmo runtime.

We use the JDK provided by BEA named JRockit. The following procedure was documented for the first installation of JRockit on hub01.chandlerproject.org:

Changing production servers

Every once in a while, you may need to change the actual production server from the standard hub01.chandlerproject.org.

This was done in early August 2007, before the Chandler Hub update to the Chandler Server 0.7 release. The procedure used for the update follows. The procedure is odd because current tools can't really build a good 0.6.1.1, so the hub.chandlerproject.org production instance is simply tarballed up.

[reduce DNS expiry to 15m]
[confirm that hub.chandlerproject.org currently pointing to hub02's IP]

ssh hub01.chandlerproject.org
cd /home/hub
editor conf/instances.conf
  [instances]
  production: hub-02

  [hub-02]
  tomcat_http_port: 8000
  database_name: cosmo_hub_01
  reverse_proxy_host: hub.chandlerproject.org
  reverse_proxy_port: 443

ssh app-dogfood.ops.osaf.us
cd /home/osaf.us
sudo bin/manage stop hub-02
[confirm hub down page is being served]

cd /home/cosmos
tar zvcf /tmp/cosmo-hub-02.tar.gz --exclude 'access.*.log' --exclude 'osafsrv.log*' hub-02
sudo /home/osaf.us/bin/backup-mysql -d cosmo_hub_01 -D /var/local/osaf.us/db-backups
scp /tmp/cosmo-hub-02.tar.gz /var/local/osaf.us/db-backups/LATEST.sql.bz2 hub01.chandlerproject.org:/tmp

cd /home/osaf.us
editor conf/httpd-app-dogfood.conf
  [switch in block to proxypass all traffic to hub02.chandlerproject.org]
sudo bin/update-apache

ssh hub01.chandlerproject.org
cd /home/cosmos
cp /tmp/cosmo-hub-02.tar.gz .
mv /tmp/LATEST.sql.bz2 /tmp/migrate.sql.bz2
tar zxvf cosmo-hub-02.tar.gz
chown -R cosmo:cosmo hub-02
cd /home/hub
sudo bin/manage db_import production /tmp/migrate.sql.bz2
sudo bin/manage start production
sudo bin/update-apache XXX

ssh admin.kei.com
sudo editor /etc/bind/zones/
  [bump serial number]
  [change hub A record to be same as hub01]
sudo /etc/init.d/bind9 reload && sudo tail -f /var/log/syslog

ssh app-dogfood.ops.osaf.us
cd /home/osaf.us
sudo bin/update-apache
svn --username jared ci conf/httpd-app-dogfood.conf
scp [access logs into position; merge]

At this point, we have an instance that can be mostly managed by bin/manage (start, stop, db_backup), though the dogfood-03b instance can't be rebuilt or reconfigured.

The procedure that would be used if osaf.us 0.6-legacy were not in the picture would be closer to:

ssh hub01.chandlerproject.org
cd /home/hub
editor conf/instances.conf
  [instances]
  production: rc01-0.7
  old: rc01-0.7
  new: prod-0.7

  [prod-0.7]
  cosmo_svn_revision: 7777
  cosmo_svn_url: http://svn.osafoundation.org/server/cosmo/tags/rel_0.7.0
  tomcat_http_port: 8000
  reverse_proxy_host: hub.chandlerproject.org
  reverse_proxy_port: 443
  database_name: hub_prod
sudo bin/manage build new

ssh app-dogfood.ops.osaf.us
cd /home/osaf.us
sudo bin/manage stop hub-02
[check service down page and errors]
sudo bin/backup-mysql -d cosmo_hub_01 -D /var/local/osaf.us/db-backups
scp /var/local/osaf.us/db-backups/[the right file].sql.bz2 hub01.chandlerproject.org:/tmp/mig07.sql.bz2

ssh hub01.chandlerproject.org
cd /home/hub
sudo bin/manage db_import new /tmp/mig07.sql.bz2
sudo bin/manage db_migrate new
sudo bin/manage stop old
sudo bin/manage start new
editor conf/instances.conf
  [instances]
  production: prod-0.7
  old: prod-0.7
  new: trunk
svn --username bob ci conf/instances.conf

ssh app-dogfood.ops.osaf.us
cd /home/osaf.us
editor conf/httpd-app-dogfood.conf
  [use proxypass to hub01.chandlerproject.org]
bin/update-config
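
The instances.conf fragments above use a "key: value" layout with bracketed sections, which Python's standard configparser can read. The sketch below resolves an alias from the [instances] section to the settings of the instance it names. It assumes the aliases (production, old, new) simply hold section names; how bin/manage actually interprets the file is not documented here, and resolve_instance is a hypothetical helper.

```python
import configparser

# Sample text modelled on the instances.conf fragments shown above.
SAMPLE = """
[instances]
production: prod-0.7
old: prod-0.7
new: trunk

[prod-0.7]
tomcat_http_port: 8000
reverse_proxy_host: hub.chandlerproject.org
reverse_proxy_port: 443
database_name: hub_prod
"""


def resolve_instance(config_text, alias):
    """Map an alias from [instances] to (instance_name, settings dict)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    name = parser.get("instances", alias)
    if not parser.has_section(name):
        raise KeyError(f"alias {alias!r} points at undefined instance {name!r}")
    return name, dict(parser.items(name))


if __name__ == "__main__":
    name, settings = resolve_instance(SAMPLE, "production")
    print(name, settings["tomcat_http_port"], settings["database_name"])
```

Running a check like this before a cutover would catch an alias that points at a section which has not been defined yet, such as "new: trunk" in the sample.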

Sharing a GNU Screen session between multiple users

The procedure to share a GNU screen session between multiple users (helpful for shared administration and emergency team diagnostics) is:

[SCREEN-COMMAND-KEY] is [control-A] by default, but is sometimes changed (by the host in this case) to avoid stomping on the readline beginning-of-line keystroke.

SQL alchemy

bug #8614, #8385
delete from subscription where not exists (select id from item where uid=collectionuid);
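
Before running a delete like this against production, the same condition can be used in a select to preview the affected rows. The sketch below demonstrates the idea on a toy schema in an in-memory SQLite database rather than MySQL; the two tables contain only the columns the statement touches, and the sample rows are invented for illustration.

```python
import sqlite3

# Toy schema: only the columns the delete statement touches.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, uid TEXT);
    CREATE TABLE subscription (id INTEGER PRIMARY KEY, collectionuid TEXT);
    INSERT INTO item VALUES (1, 'coll-a');
    INSERT INTO subscription VALUES (10, 'coll-a');    -- collection exists
    INSERT INTO subscription VALUES (11, 'coll-gone'); -- collection missing
""")

ORPHAN_FILTER = "NOT EXISTS (SELECT id FROM item WHERE uid = collectionuid)"

# Preview first: which subscriptions would the delete remove?
orphans = [row[0] for row in conn.execute(
    f"SELECT id FROM subscription WHERE {ORPHAN_FILTER} ORDER BY id")]

# Then run the delete itself and see what is left.
conn.execute(f"DELETE FROM subscription WHERE {ORPHAN_FILTER}")
remaining = [row[0] for row in conn.execute(
    "SELECT id FROM subscription ORDER BY id")]

print("would delete:", orphans, "remaining:", remaining)
```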

Duplicate ical uids:
select distinct collectionid from collection_item where itemid in (
select id from item i where itemtype='note' and icaluid is not null and
modifiesitemid is null and exists
  (select id from item i2, collection_item ci2 where ci2.itemid=i2.id and
i2.id!=i.id and i2.modifiesitemid is null and i2.icaluid=i.icaluid and
ci2.collectionid in (select collectionid from collection_item where
itemid=i.id))
)

--

It looks like on hub there are 26 modifications spanning 9 collections that are
out of sync with their master items.  This means the set of parent collections
for these modifications is different from the set of parent collections for the
master item.  This basically finds all modification items whose set of parents doesn't match
the set of master item parents.

The query I used to find this out:

SELECT i.id from item i where modifiesitemid is not null

and ((

 (select count(collectionid) from collection_item where itemid=i.id) !=
 (select count(collectionid) from collection_item where
itemid=i.modifiesitemid)

) or (

  (select count(collectionid) from collection_item where itemid=i.id) !=
  (select count(collectionid) from collection_item where itemid=i.id and
collectionid in (select collectionid from collection_item where
itemid=i.modifiesitemid))
))
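
To see how the two halves of that condition behave, the query can be run against a small invented data set. The sketch below uses an in-memory SQLite database with a cut-down version of the item and collection_item tables; it is an illustration only, and the row values are made up. The first comparison catches a modification that sits in a different number of collections than its master, and the second catches one with the same number but a different set.

```python
import sqlite3

QUERY = """
SELECT i.id FROM item i
WHERE i.modifiesitemid IS NOT NULL AND (
    (SELECT count(collectionid) FROM collection_item WHERE itemid = i.id) !=
    (SELECT count(collectionid) FROM collection_item
      WHERE itemid = i.modifiesitemid)
  OR
    (SELECT count(collectionid) FROM collection_item WHERE itemid = i.id) !=
    (SELECT count(collectionid) FROM collection_item
      WHERE itemid = i.id AND collectionid IN
        (SELECT collectionid FROM collection_item
          WHERE itemid = i.modifiesitemid))
)
ORDER BY i.id
"""


def out_of_sync_modifications(conn):
    """Return ids of modifications whose parents differ from the master's."""
    return [row[0] for row in conn.execute(QUERY)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, modifiesitemid INTEGER);
    CREATE TABLE collection_item (itemid INTEGER, collectionid INTEGER);
    INSERT INTO item VALUES (1, NULL), (2, 1), (3, 1), (4, NULL), (5, 4);
    INSERT INTO collection_item VALUES
        (1, 10), (1, 11),   -- master 1 is in collections 10 and 11
        (2, 10), (2, 11),   -- modification 2 matches its master
        (3, 10),            -- modification 3 is missing collection 11
        (4, 10),            -- master 4 is in collection 10
        (5, 12);            -- modification 5: same count, different set
""")

print(out_of_sync_modifications(conn))
```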

To remove all subscriptions where the ticket doesn't exist on the collection:

  delete s from subscription s where not exists (select i.id from item i, tickets t where i.id=t.itemid and t.ticketkey=s.ticketkey and i.uid=s.collectionuid)

To fix just a single user's subscriptions:

  delete s from subscription s where not exists (select i.id from item i, tickets t where i.id=t.itemid and t.ticketkey=s.ticketkey and i.uid=s.collectionuid) and '[USERNAME]' = (select username from users where id=s.ownerid)

--

A developer wanted to know what the triage status in the production
and migrated databases actually was.  After reviewing the list of
tables (see above), I checked that the "item" table had triage fields:

mysql> describe item;
+--------------------+---------------+------+-----+---------+----------------+
| Field              | Type          | Null | Key | Default | Extra          |
+--------------------+---------------+------+-----+---------+----------------+
| itemtype           | varchar(16)   | NO   | MUL |         |                |
| id                 | bigint(20)    | NO   | PRI | NULL    | auto_increment |
| createdate         | bigint(20)    | YES  |     | NULL    |                |
| modifydate         | bigint(20)    | YES  |     | NULL    |                |
| clientcreatedate   | bigint(20)    | YES  |     | NULL    |                |
| clientmodifieddate | bigint(20)    | YES  |     | NULL    |                |
| itemname           | varchar(255)  | NO   | MUL |         |                |
| displayname        | varchar(255)  | YES  |     | NULL    |                |
| uid                | varchar(255)  | NO   | UNI |         |                |
| version            | int(11)       | NO   |     |         |                |
| contentEncoding    | varchar(32)   | YES  |     | NULL    |                |
| contentLanguage    | varchar(32)   | YES  |     | NULL    |                |
| contentLength      | bigint(20)    | YES  |     | NULL    |                |
| contentType        | varchar(64)   | YES  |     | NULL    |                |
| lastmodifiedby     | varchar(255)  | YES  |     | NULL    |                |
| lastmodification   | int(11)       | YES  |     | NULL    |                |
| triagestatuscode   | int(11)       | YES  |     | NULL    |                |
| triagestatusrank   | decimal(12,2) | YES  |     | NULL    |                |
| isautotriage       | bit(1)        | YES  |     | NULL    |                |
| sent               | bit(1)        | YES  |     | NULL    |                |
| needsreply         | bit(1)        | YES  |     | NULL    |                |
| icaluid            | varchar(255)  | YES  |     | NULL    |                |
| modifiesitemid     | bigint(20)    | YES  | MUL | NULL    |                |
| ownerid            | bigint(20)    | NO   | MUL |         |                |
| contentdataid      | bigint(20)    | YES  | MUL | NULL    |                |
+--------------------+---------------+------+-----+---------+----------------+
25 rows in set (0.00 sec)

I saw three triage fields and a few ids.  I asked the developer whether they
had the Cosmo UUID or iCal UID.  They did, so I looked for rows:

mysql> select * from item where icaluid='5e5582f4-3026-11dc-caa7-b5b97574266e';

and I found that the three fields were NULL.

--

From bug #10516 for fixing UUID-titled events

find all modifications with UUID as title:
  select * from item where modifiesitemid is not null and displayName=uid

update all modifications with UUID as title to inherit from master (5 steps):
  1. create temp table to store affected item ids
  create table temp (id integer unsigned not null)
  2. populate temp table with affected item ids
  insert into temp (select id from item where modifiesitemid is not null and displayName=uid)
  3. fix affected items
  update item set displayName=null, modifydate=UNIX_TIMESTAMP()*1000, version=version+1 where id in (select id from temp)
  4. update collections of affected items so that the next sync will pull the changes
  update item set modifydate=UNIX_TIMESTAMP()*1000, version=version+1 where id in (select collectionid from collection_item where itemid in (select id from temp))
  5. get rid of temp table
  drop table temp
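
The updates above set modifydate to UNIX_TIMESTAMP()*1000, which suggests that the bigint date columns in the item table hold milliseconds since the Unix epoch. That is an inference from the expression, not something stated elsewhere in this runbook. Under that assumption, the following sketch converts between such values and readable UTC timestamps when inspecting rows by hand; both function names are illustrative.

```python
from datetime import datetime, timezone


def ms_to_datetime(ms):
    """Convert milliseconds since the Unix epoch to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def datetime_to_ms(moment):
    """Inverse of ms_to_datetime, comparable to UNIX_TIMESTAMP()*1000."""
    return int(moment.timestamp() * 1000)


if __name__ == "__main__":
    now_ms = datetime_to_ms(datetime.now(timezone.utc))
    print(now_ms, ms_to_datetime(now_ms).isoformat())
```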