Last update: 2007-08-29
Maintainer: Jared Rhine <jared@wordzoo.com>
Copyright: Open Source Applications Foundation, 2007
License: Creative Commons Attribution Only 3.0 (unported)
This is the operations runbook for the Chandler Hub service. It helps to:
standardize procedures so they are repeatable and best-practice
provide instructions to staff and volunteers helping to keep the service running
This should not be a static document. As procedures are updated, or operational lessons are learned, this document should be updated.
This document can be found in OSAF's subversion repository, at:
This document is written in asciidoc format. It can be converted to HTML using this command line:
asciidoc -a toc -a toclevels=3 --out-file runbook.html runbook.txt
A copy of the HTML version (might be out of sync) is available here:
This section describes the basic guidelines to be used when a Chandler Hub problem is reported.
It should be noted that many outages are unique and somehow end up outside the expected procedure. These steps are good guidelines, but do what you need to do.
Jared is first-tier pager; Bear is second-tier pager; Dave/Paul are third-tier pager. All are cell-phoneable.
(After Jared and Bear) Randy is first-tier for application support, Java, Tomcat, MySQL. If there are tracebacks/errors in the Java logs or in the UI, pull in Randy. Randy is cell-phone callable for emergencies.
Systems coverage is not expected to be in place 24x7, nor is it expected for systems folk in the pager tree to always be near a keyboard. Systems coverage is best-effort and no SLA is in place.
Incidents are coordinated by 1) a primary owner; and 2) an RT ticket (maybe bugzilla if appropriate).
If there is a network or server error, there needs to be a KEI support ticket. Anyone can create a ticket by sending email to the support alias. The first person to notice should send in a ticket. If you're not sure if there's a ticket, send one in anyway, and they will be merged as needed. Dave, Paul, Jared, and Bear can own a network/server ticket in the KEI Request Tracker system. Support staff should check if there's a ticket in osafsrv or osafsrv-911 before opening a new ticket, and should be owner for a ticket (via take, assign, or steal) before making server configuration changes or replying to the original requestor.
If you got your notice via Nagios, assume everyone else did too.
If you got a notice via Nagios, and the people above you in the pager chain have not responded within 20 minutes, please work the issue.
If you discovered this issue or someone else discovered it and let you know, proceed to creating the coordinating KEI IT RT ticket.
Check if a KEI RT ticket is open
Open KEI RT ticket
Take KEI RT ticket
Investigate Cosmo: Can I log in to my Hub account from a browser? Can I log in to the overlord account?
Investigate Tomcat: Is process running? Does Tomcat access log grow when there are connections?
Investigate Apache: Is port 80/443 open? Is Apache running? Are there issues in the error log? Are you seeing an Apache-redirect to a downtime page?
Investigate server: Can you log in to hub01? What's the load like? How's top? What does mysqladmin processlist look like? mtop?
Investigate network: Is machine pingable? Router? iftop or bwm-ng?
Send email to service-dev OSAF list. Characterize problem briefly, estimate outage, mention ticket numbers. Mention what resources you'll need, like Randy.
Try to get ahold of Ted, Katie, Philippe, Aparna by IRC (#osaf, #chandler, #cosmo, #osaf-qa), IM, email, cell phone.
If there's a hardware/network/service failure that will take 1 hour or more of outage, assign someone to notify the public. Notify chandler-users@ and blog.chandlerproject.org.
If outage looks to stretch into 4+ hours, ongoing incident management should be fully handed to, and handled by, an OSAF staff manager to coordinate ongoing scheduling, intra-team communication, etc.
Send updates and resolution emails to service-dev.
Merge RT tickets
On a loaded production instance, shutting down Cosmo using the wrappers may fail. You may very well need to kill -9 the process directly.
You don't always get a traceback when Cosmo throws a 500 error. Some API calls may fail without proper error notification or logging. It can be very difficult to figure out where to go with a problem report if no error is being thrown during a 500 error. Contact Cosmo developers for next steps.
Runit-based service management has not yet been (re)implemented. There are no facilities in place to automatically restart a crashed Tomcat or start after a reboot.
The box is configured to provide LVM-based logical volumes; we will create large production databases in LV partitions mounted at /home or into /var directly. This logical volume scheme is not yet in place, so only the default 50GB / (slash) partition is in use. That gives plenty of room to grow after Preview, but before MySQL production gets to about 40GB, we will need to put logical volumes into place.
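One way to keep an eye on the 40GB figure above is a small disk-usage check. The following is only a sketch: the function name check_db_size is hypothetical, and /var/lib/mysql as the MySQL data directory is an assumption (it is the usual Debian default, but this runbook does not state it).

```shell
#!/bin/sh
# Sketch: warn when a directory's disk usage reaches a threshold.
# check_db_size [DIR] [LIMIT_GB]
# The 40GB default is the figure quoted in this runbook.
# /var/lib/mysql as the data directory is an assumption (Debian default).
check_db_size() {
    dir="${1:-/var/lib/mysql}"
    limit_gb="${2:-40}"
    # du -sk prints the total size in kilobytes as the first field.
    used_kb=$(du -sk "$dir" 2>/dev/null | awk '{ print $1 }')
    if [ -z "$used_kb" ]; then
        echo "UNKNOWN: cannot read $dir"
        return 3
    fi
    limit_kb=$((limit_gb * 1024 * 1024))
    if [ "$used_kb" -ge "$limit_kb" ]; then
        echo "WARNING: $dir uses ${used_kb}KB, at or over ${limit_gb}GB"
        return 1
    fi
    echo "OK: $dir uses ${used_kb}KB, under ${limit_gb}GB"
}
```

Run as root (or as a user who can read the data directory), for example "check_db_size /var/lib/mysql 40". A WARNING result is the cue to plan the move onto logical volumes.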
Graceful downtime messages have not been configured. It's very possible for users to see raw Apache or raw Tomcat errors. Sometimes API calls to Cosmo will return HTML-formatted errors in the body of error responses.
Startup and shutdown of the Cosmo Tomcat instance can be slow in large installations. Tomcat may take 90 seconds to stop and 45 seconds to start. Some transactions during startup are likely to fail, when parts of Snarf are running before the Cosmo webapp is.
Search bugzilla.osafoundation.org for Product="Sharing Service" and Component="Known issues" for additional issues being tracked in production
The production instance is customized in ways that differ from what developers and QA run. It is possible for Hub customizations and replacement files to have gotten out of sync with Cosmo upstream.
All Java code (Tomcat, Cosmo, migration) depends on the JAVA_HOME variable (or what is inferred when it is missing). Debian can have its own Java system which might get accidentally called if you aren't careful. bin/manage uses /usr/local/java if no environment variable is set.
Since Maven downloads a variety of packages if it doesn't already have them cached, and sometimes there is a dependence on a remote site, a failure of a remote site can break the Cosmo build. These breakages can be intermittent or quite long-lasting. If the build of Cosmo breaks, the best route is to check with Mike Taylor.
In the hub environment, most interactions with the Cosmo "daemons/instances/builds" are handled using the "bin/manage" script. This includes building, start/stop/restart, reconfiguring, database management, etc.
The bin/manage script is reliant on the conf/instances.conf configuration file. All information about the specific instances running on a server is defined in the instances.conf file. To create a new instance, you mostly just cut-and-paste a configuration stanza into a new block, and change the instance name and svn revision.
Whenever possible, bin/manage simply constructs a regular Unix command-line and executes that. As such, you can usually recreate the exact setup as bin/manage by simply typing the same command lines in the same order.
In bin/manage output, the lines prefixed with !!! are notable. In general, they will show the actual commands run (with a CMD: prefix).
BE VERY CAREFUL WITH the "manage rebuild" option; it will drop the production table if the config file points to the production table. Take backups ("manage db_backup") and be very mindful when you get anywhere near production instances or any instances pointing at the production instance.
Note that bin/manage also attempts to manage which user it is run as. In general, production instances are run as a dedicated user, cosmo. bin/manage will either decline to start or, if it can, change its user to match the targeted user. If you do operations directly as root, you may create permission problems later.
Instances can be named with most alphanumeric strings; spaces must not be used. Dashes are preferred over underscores (though the code will smash dashes into underscores before using them as database names.)
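The naming rule above can be sketched in a few lines of shell. This is an illustration only, not the code bin/manage uses: the function name instance_to_dbname is hypothetical, the "cosmo_" prefix is inferred from database names that appear elsewhere in this runbook (cosmo_mig_test, cosmo_hub_02), and the strict character check is an assumption about what "most alphanumeric strings" allows.

```shell
#!/bin/sh
# Hypothetical helper: derive a default database name from an instance name.
# Assumes a "cosmo_" prefix (as in cosmo_mig_test); not taken from bin/manage.
instance_to_dbname() {
    instance="$1"
    # Reject empty names and anything other than letters, digits, dash,
    # or underscore (this also rejects spaces).
    case "$instance" in
        ''|*[!A-Za-z0-9_-]*)
            echo "invalid instance name: '$instance'" >&2
            return 1
            ;;
    esac
    # Smash dashes into underscores before using the name for a database.
    printf 'cosmo_%s\n' "$(printf '%s' "$instance" | sed 's/-/_/g')"
}
```

For example, "instance_to_dbname mig-test" prints cosmo_mig_test, while a name containing a space is rejected with a non-zero exit status.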
bin/manage has a built-in usage string, cut-and-pasted here on 2007-08-14:
Cosmo administration system: "bin/manage"
Command line syntax is:
manage command[,command2,...] instance[,instance2,...]
You must specify one or more commands, and one or more instances.
Multiple commands and instances are separated with commas.
For example:
manage stop test
manage stop,log_nuke,print,start production
manage rebuild trunk,mig-test,qa1
If you specify an instance of "production", it will be replaced with
the variable set in the "current" variable of the "production" block
of the configuration file. The same substitution happens for the
"old" and "new" parameters.
The list of actions (start, stop, etc) can be abbreviated to the
shortest unique prefix (except for log and database operations). So
this works:
bin/manage sto,pri trunk
The manage script relies heavily on the instance configuration file,
named "conf/instances.conf". Each Cosmo instance is defined in that
file.
In general, bin/manage should stop if a command errors out or
control-C is pressed.
The commands operate as follows:
build
Create the named cosmo instances from source.
The instance will be rooted in the directory set in the
"instances_root" configuration parameter. The instance will not be
created if it already exists (unless the "nuke_dir_before" config
parameter is set).
start
Start the named Cosmo instances.
Instances can be managed different ways (osafsrvctl, runit, etc).
The "start" command will do the proper thing for each named
instances.
This command might fail is the TCP/IP ports are already bound by
another daemon.
stop
Stop the named Cosmo instances.
The stop operation does not halt bin/manage if it fails. This
allows you to restart an already-stopped Cosmo instance.
restart
Restart the named Cosmo instances.
Exactly equivalent to "manage stop,start instance".
rebuild
Build the named Cosmo instances again, adding stop/start.
Exactly equivalent to "manage stop,print,build,start".
print
Output a short block with key instance parameters.
This will show the name, directories, and most configuration
parameters that will be used for other operations. It's helpful for
knowing where files will be placed.
echo
Output just the id of the named instances
Works best with a single instance parameter, and a symbolic instance
name like "production". Then the results can be used in scripts
(say in backticks) when needed for script automation.
processes
Output a sysadmin-readable string describing the current Java processes.
Currently, this is a hack wrapper around "ps | grep".
log_nuke
Empty the Cosmo server log.
The log emptied is the "osafsrv.log", located at
ROOT/tomcat/logs/osafsrv.log.
db_nuke
Drop and recreate an empty database for the named instances.
This removes all data from the database used by the instance. It
doesn't recreate a fresh schema; it recreates and empty one, which
Cosmo populates the first time it starts. Alternatively, you can
import a backup file into the empty database (or use the db_import
command).
db_backup
Take a database snapshot and store a new file in the backup dir.
The backup directory used is set by the "db_backup_dir" variable in
the configuration file. The resulting file will contain the name of
the database backed up and a timestamp.
db_import
Reset (db_nuke) the database for the named instances,
then import the specified file.
This command takes a single third argument, a valid SQL file which
is assumed will initialize a database and import all database from a
given snapshot. The file specified can be plaintext, gzip, or bzip2
format and the import system will decompress if needed.
The import operation will be timed.
db_migrate
Run the schema migration operation against the database.
This command runs the Cosmo-provided migration JAR file. This
migration system should be able take previous instances of Cosmo
databases and apply SQL schema change operations on them to bring
the database up to a version compatible with the named instance.
The needed migration configuration file is built when the instance
is built and should have all the proper needed values to match the
instance.
The migration operation will be timed.
remigrate
Run all the standard "reset a migration test" steps. It is equivalent to:
bin/manage stop,build,db_import,db_migrate,print,start instances backup-file
Remember to pass in the backup file to restore from as a third
command-line parameter. The backup file should be able to create
all SQL and import data from a fresh database; it can be
uncompressed, gzipped, or bzip2ed, as long as it has the standard
file extension.
configure
Apply customizations to the installation.
Not guaranteed to be idempotent in some configuration. This means
running more than once may break some instances. Others might be
fine, depending on the configurations applied. When an instance is
built, it is also configured, so you do not generally need to use
this command.
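The "shortest unique prefix" rule from the usage text can be illustrated with a short shell sketch. It is not the implementation inside bin/manage: the function name resolve_action is hypothetical, and the list of abbreviable actions is an assumption built from the commands above with the log and database operations left out.

```shell
#!/bin/sh
# Sketch: resolve an abbreviated action to its full name by unique prefix.
# The action list is copied from the usage text; log_* and db_* operations
# are left out because the usage text says they cannot be abbreviated.
ABBREVIABLE="build start stop restart rebuild print echo processes remigrate configure"

resolve_action() {
    abbrev="$1"
    match=""
    count=0
    for action in $ABBREVIABLE; do
        case "$action" in
            "$abbrev"*)
                # Remember the candidate and count how many actions match.
                match="$action"
                count=$((count + 1))
                ;;
        esac
    done
    if [ "$count" -eq 1 ]; then
        echo "$match"
    else
        echo "ambiguous or unknown action: $abbrev" >&2
        return 1
    fi
}
```

With this list, "sto" resolves to stop and "pri" to print, matching the "bin/manage sto,pri trunk" example, while "re" is rejected because restart, rebuild, and remigrate all share that prefix.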
(FUTURE) Production instances are managed by the "runit" service manager. Essentially, it keeps Tomcat going if Cosmo dies or a box reboots. To control and examine runit services, use the "sv" command and the /var/service directory.
The administrative system, at /home/hub, is a subversion working area of an OSAF repository. Changes to configuration files and administrative scripts should be tracked in svn. Check in your changes.
Check in changes to /home/hub when appropriate. Use the --username switch to svn. There may be permissions problems if you're not root, and root may interfere with other usage, so you might see errors when you check in.
The production instance of Cosmo should always be tracked in the configuration file. Some workflows use it during migration and "kick production" operations.
To get a decent amount of information about the current production instance, use:
ssh hub01.chandlerproject.org /home/hub/bin/manage print production
Known issues are that the bin/manage "print" operation does not tell you the up/down status of the instance, and instances are not runit-managed. So we're down to pids and grep.
ssh hub01.chandlerproject.org /home/hub/bin/manage print production
ps -eFHww | grep INSTANCE_NAME

OR

ssh hub01.chandlerproject.org /home/hub/bin/manage processes production
ssh hub01.chandlerproject.org sudo /home/hub/bin/manage print production
ssh hub01.chandlerproject.org sudo /home/hub/bin/manage stop production
Note, per the known issues above, that the production instance may not stop by itself when under load. Use bin/manage processes production, kill, and kill -9 in that case.
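The kill, then kill -9 fallback can be sketched as a small shell function. This is a generic pattern rather than part of bin/manage: the function name stop_with_escalation is hypothetical, and the 90-second default grace period is an assumption borrowed from the Tomcat stop time quoted in the known issues.

```shell
#!/bin/sh
# Sketch: ask a process to exit with SIGTERM, wait, then fall back to SIGKILL.
# stop_with_escalation PID [GRACE_SECONDS]
# The 90-second default mirrors the Tomcat stop time quoted in this runbook
# and is otherwise an assumption.
stop_with_escalation() {
    pid="$1"
    grace="${2:-90}"
    # If the signal cannot be delivered, the process is already gone.
    kill -TERM "$pid" 2>/dev/null || return 0
    waited=0
    while [ "$waited" -lt "$grace" ]; do
        # kill -0 only tests whether the process still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        waited=$((waited + 1))
    done
    echo "pid $pid still running after ${grace}s; sending SIGKILL" >&2
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

Find the PID first with "bin/manage processes production", then run the function as a user allowed to signal that process. Starting with SIGTERM gives Tomcat a chance to shut down cleanly before the hard kill.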
ssh hub01.chandlerproject.org sudo /home/hub/bin/manage start production
ssh hub01.chandlerproject.org sudo /home/hub/bin/manage restart production
The instance may not stop properly (see above under stopping), so restarts may also have problems.
ssh hub01.chandlerproject.org sudo /home/hub/bin/manage db_backup production
This will drop its output into /var/local/hub/db-backups as a timestamped file.
In this operation, the goal is to update the Hub production instance of Cosmo while not touching the production database. For this case, the same production database is reused and there are no modifications to the database schema.
The standard downtime for this procedure is under 2 minutes. In general, such updates are not announced ahead-of-time but only after the fact. A small number of end-user failed operations is expected.
Assume that the existing production instance is named OLD and the instance you wish to update to is named NEW.
The command-line sequence used would be:
ssh hub01.chandlerproject.org
cd /home/hub
editor conf/instances.conf
  [create new configuration stanza for NEW]
  [set database_name appropriately]
  [set cosmo_svn_revision+url appropriately]
  [set tomcat_http_port to be same as OLD]
  [set reverse_proxy_host+port if needed]
sudo bin/manage build NEW
sudo bin/manage stop OLD && sudo bin/manage start NEW
svn --username bob ci conf/instances.conf
In this operation, the goal is to update the Hub production instance of Cosmo and change the database schema because the Cosmo upgrade requires it.
The standard downtime for this procedure is 30-60 minutes (slower and slower as the user base and database snapshots grow). A scheduled and announced downtime is standard procedure for this kind of update.
Assume that the existing production instance is named OLD and the instance you wish to update to is named NEW.
The command-line sequence used would be:
ssh hub01.chandlerproject.org
cd /home/hub
editor conf/instances.conf
  [create new configuration stanza for NEW]
  [set tomcat_http_port to be same as OLD]
  [set reverse_proxy_host+port if needed]
  [set cosmo_svn_revision+url appropriately]
  [set database_name if desired; will default to including instance name]
  (a specific example follows)
  [hub-03]
  tomcat_http_port: 8000
  reverse_proxy_host: hub.chandlerproject.org
  reverse_proxy_port: 443
  cosmo_svn_url: http://svn.osafoundation.org/server/cosmo/tags/rel_0.7.0
  cosmo_svn_revision: 5487
sudo bin/manage build NEW
  [check through build output looking for errors]
sudo bin/manage stop OLD
  [check that production service is down; going to web should show downtime page]
sudo bin/manage db_backup OLD
  [the db_backup output will show where the backup file went; use that file in the next step (with .bz2 appended)]
sudo bin/manage db_migrate NEW BACKUP.sql.bz2
sudo bin/manage start NEW
svn --username bob ci conf/instances.conf
ssh lab.osaf.us sudo /home/hub/bin/manage rebuild trunk
The simplest way to rebuild a test instance is:
ssh lab.osaf.us /home/hub/bin/manage rebuild INSTANCE
For this to work, the INSTANCE used must have an entry in /home/hub/conf/instances.conf which allows rebuilding by removing existing instances. To do this, set nuke_dir_before to true.
You will probably need to edit instances.conf to set the desired revision number (unless using a symbolic revision like HEAD).
Here, we want to make a test instance on lab.osaf.us which contains migrated data. The expected time for this operation is about 10 minutes (at snapshot sizes as of 2007-08-10).
This operation assumes that Apache proxypass is configured on port 80 and pointing at the test instance port.
The command-line sequence used would be:
ssh hub01.chandlerproject.org
sudo bin/manage db_backup production
cd /var/local/hub/db-backups
scp LATEST.sql.bz2 lab.osaf.us:/tmp/migrate.sql.bz2
ssh lab.osaf.us
cd /home/hub
editor conf/instances.conf
  [create or update stanza for mig-test instance]
  [set tomcat_http_port to proxypass destination port: 8000]
  [set nuke_dir_before to true for easy restart]
  [set reverse_proxy_host to lab.osaf.us, port defaults to 80]
  [optionally set cosmo_svn_revision+url, defaults to trunk HEAD]
  [database name defaults to cosmo_mig_test]
  [mig-test]
  tomcat_http_port: 9999
  nuke_dir_before: true
  reverse_proxy_host: lab.osaf.us
sudo bin/manage remigrate mig-test /tmp/migrate.sql.bz2
To "really" test Chandler Desktop support for server migrations, one needs to do a swap using the same hostname. As admin/tester, you change the server instance out from underneath Chandler to a new migrated instance and see if Chandler notices.
Note, support in Chandler Desktop for the ability to change the host/port that a collection points to would mostly eliminate the need for this operation.
ssh lab.osaf.us
cd /home/hub
[15 min before test to start]
sudo cp conf/apache2-lab-hub-redirect.conf /etc/apache2/sites-available/lab && sudo /etc/init.d/apache2 reload
[wait, as dogfooders restore their collections into Chandler Desktop]
[take a "bin/manage db_backup" on production, copy to lab as /tmp/migrate.sql.bz2]
sudo bin/manage remigrate mig-test /tmp/migrate.sql.bz2
sudo cp conf/apache2-lab.conf /etc/apache2/sites-available/lab && sudo /etc/init.d/apache2 reload
To properly run automated migration tests, one needs a frozen pre-migration snapshot Cosmo instance matched with a post-migration Cosmo instance imported and migrated from the same Hub snapshot.
To do this, you just build the two instances from different svn versions using bin/manage. Then update the snapshots like this:
ssh lab.osaf.us
cd /home/hub
sudo bin/manage stop mig-before,mig-after
sudo bin/manage build mig-after
sudo bin/manage db_import mig-before,mig-after /tmp/migrate.sql.bz2
sudo bin/manage db_migrate mig-after
sudo bin/manage start mig-before,mig-after
See the description of the Apache layer below.
In particular, note that a single apache2.conf is used, not the Debian split-style. The Apache configuration file is change controlled, and any changes should be tracked, diffed, and checked in when live.
The OS-default /etc/apache2/apache2.conf is symlinked to /home/hub/conf/apache2-hub01.conf.
To update the Apache configuration (for example, because the reverse proxy configuration needs updating):
ssh hub01.chandlerproject.org
cd /home/hub
editor conf/apache2-hub01.conf
svn diff
/etc/init.d/apache2 reload
svn --username bob ci conf/apache2-hub01.conf
On hub01.chandlerproject.org, the RAID card is an LSI 8408E; see hardware description for details.
/opt/MegaCli -help
/opt/MegaCli -AdpAllInfo -a0
/opt/MegaCli -PDList -a0
/opt/MegaCli -CfgSave -f filename -a0
/opt/MegaCli -CfgRestore -f filename -a0
/opt/MegaCli -AdpEventLog -GetEvents -f filename -a0
RAID monitoring is handled by the /etc/cron.d/raid-monitor cronjob. Running every 30 minutes, this job emails to root if the -AdpAllInfo command returns lines other than "0" for key lines.
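The check that the cron job performs can be sketched as a filter over the -AdpAllInfo output. This is not the contents of the actual raid-monitor job: the function name raid_nonzero_lines is hypothetical, and the four watched key names are assumptions about which "key lines" matter.

```shell
#!/bin/sh
# Sketch: print every watched counter whose value is not 0.
# Reads "Key : Value" lines (as printed by -AdpAllInfo) on stdin.
# The watched key names are assumptions; the real cron job may differ.
raid_nonzero_lines() {
    awk -F':' '
        {
            # Trim surrounding whitespace from the key and the value.
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
        }
        (key == "Degraded" || key == "Offline" ||
         key == "Critical Disks" || key == "Failed Disks") && val != "0" {
            print key ": " val
        }
    '
}
```

A cron entry could pipe "/opt/MegaCli -AdpAllInfo -a0" into this function. Because cron only sends mail when a job produces output, an all-zero result stays silent and any non-zero counter ends up in root's mailbox.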
The Nagios monitor script tests full-stack Cosmo functionality every 10 minutes. The script does HTTP transactions that exercise the full Cosmo stack: network, Apache, Tomcat, Cosmo, MySQL. The three operations are DAV PUT, CMP GET, and MC GET. If any of them fail, something is seriously wrong. Most likely the network is problematic, the drives have filled up, MySQL is frozen, or something along those lines. If one of these fails but the web UI is fine, something is also seriously wrong: either the script is not testing what it thinks it is testing, or there's a unique Cosmo situation which requires emergency developer intervention.
Most likely, you'll see network related troubles. If you can log in to the web UI and there's no errors in the Cosmo logs, the Hub is probably healthy.
If you get an alert, it's likely in one of these states:
Host down. If Nagios can't ping hub01.chandlerproject.org (GNi) from monitor.kei.com (543 Howard), you should get a "host down" type of alert. Investigate the network.
Timeout - Network latency? Network down? Server overloaded? This is generally when Nagios itself kills the test because it ran too long. All monitor tests are supposed to have socket timeouts of 15 seconds.
PUT test failed. The monitor uploads a 1000-byte file to the same filename (/dav/hub-test/dav-put/put-test.txt) every run, via HTTP PUT. If the PUT takes longer than 3 seconds, a Nagios WARNING will be issued. If the PUT takes longer than 10 seconds, a Nagios CRITICAL alert will be sent. The PUT must also return a 204 HTTP code to be considered successful; otherwise the monitor will go off.
CMP user GET failed. This is a basic Cosmo admin operation. It fetches /cmp/account for the authenticated hub-test user. It should return essentially the same thing every time; anything that's not a 200 response is considered a failure.
MC subscribe failed. The subscribe operation is basic to Chandler Desktop operations. There's a hand-set-up collection in the Hub's hub-test account. Each monitor run also does a "subscribe" to this collection via an HTTP GET on /mc/collection/[see monitor script].
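The PUT thresholds described in the list above map naturally onto the standard Nagios plugin states (exit code 0 for OK, 1 for WARNING, 2 for CRITICAL). The sketch below is not the monitor script itself: the function name classify_put is hypothetical, and treating a non-204 response as CRITICAL is an assumption, since the text only says the monitor "will go off".

```shell
#!/bin/sh
# Sketch: map a PUT result onto a Nagios state using the runbook thresholds.
# classify_put HTTP_CODE ELAPSED_SECONDS  ->  prints OK, WARNING, or CRITICAL
# Treating a non-204 response as CRITICAL is an assumption.
classify_put() {
    code="$1"
    elapsed="$2"
    if [ "$code" != "204" ]; then
        echo "CRITICAL"
        return 2
    fi
    # awk handles fractional seconds, which plain sh arithmetic cannot.
    awk -v t="$elapsed" 'BEGIN {
        if (t > 10)     { print "CRITICAL"; exit 2 }
        else if (t > 3) { print "WARNING";  exit 1 }
        else            { print "OK";       exit 0 }
    }'
}
```

For instance, "classify_put 204 4.5" prints WARNING and returns 1, while a 204 response in under 3 seconds prints OK and returns 0.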
Log in to the Cosmo web UI as an administrator. Use the admin UI to find the user account, log in and poke around. From there you should be able to activate the account, change the password, delete the problem collection or item, and most other actions that may be needed to re-establish a working user account.
The easiest way to start a "is this working" test is "can you log in to your personal Hub account" from wherever you are.
If you can't log in, then you've probably got a large outage and other people can't either. Investigate further with a critical priority.
If you can log in and do some operations but not others, then you've got a partial failure that probably affects lots of people. Investigate further with a high priority.
Try logging in as root if you can't log in as a regular user.
Try pinging hub01.chandlerproject.org to establish basic network and server availability.
If you can ping, try going to http://hub.chandlerproject.org/. You should get a login page.
If you can't get to port 80/443 but can get directly to the Cosmo instance's port via HTTP, then you probably have Apache problems. See if it's running and what's in the access+error logs.
To check Cosmo health, log in and check that the production process is running using the procedure above (bin/manage proc prod).
If Cosmo is running, examine the access logs to see if there's incoming traffic, examine the osafsrv.log to see if there are errors/tracebacks, examine network traces to see if Cosmo traffic has hidden errors. Look at the osafsrv.log files and access.log to see if there's an abnormal pattern of errors or transaction types. If there are access log entries accumulating and they have good HTTP response codes, chances are good many or most users are using the service successfully so at worst you're experiencing a partial outage.
If it looks like you need to bounce a Cosmo instance, use bin/manage like this:
ssh hub01.chandlerproject.org sudo /home/hub/bin/manage restart INSTANCE
You might want to look at syslog if the OS is perhaps being wonky: tail -f /var/log/syslog|less
If needed, there are full production network traces kept in rotating files here: /var/local/hub/network-trace. These can be very useful to seeing exactly what's happening on the service.
Sometimes Maven does weird things. To force a full clear, set nuke_maven_before in a stanza. bin/manage will blow away the maven repository cache entirely and restart. (Note, depending on the state of remote Maven repositories, if you try to rebuild Cosmo after blowing away the repository, you may not be able to rebuild from scratch). Sometimes key packages are temporarily unavailable.
On the server, the log files are kept in these locations:
/home/cosmos/INSTANCE/tomcat/logs
/home/cosmos/INSTANCE/tomcat/logs/osafsrv.log
/var/local/hub/apache-logs
/var/local/hub/network-trace
/var/local/hub/db-backups
/var/service/INSTANCE/log/logs
The Chandler Hub service provides:
Free Cosmo instance available continuously
Database storage space
Backup of database storage
Web UI for CRUD operations on PIM collections (authenticated and anonymous access)
Calendar view and dashboard (list) view available for each collection
Upload/download of PIM resources (events, todos); incoming/outgoing bandwidth
Superuser administrative interface
SSL channel encryption (preferred, but not required)
Additional details include:
A 2-hour timeout is configured for the web UI.
There are currently no limits set on the storage or bandwidth consumption of account users.
There's a welcome screen which includes a sign-in box and a link to create an account.
The account creation requires first name, last name, email address, username, password, and a confirmation that the terms of service are agreed to.
When a user logs in to the web UI, they are shown a dashboard view for their default collection.
Limitations include:
You can't overlay multiple collections into one calendar or dashboard view.
You cannot upload a monolithic *.ics file (via iCal 2.x) and then use the Hub web UI to operate on that calendar.
The service is provided by a Tomcat instance running the Cosmo Java webapp (supported by many Java packages), all sitting behind an Apache reverse proxy. The production servers are Debian Etch 64-bit. The server uses a hardware RAID 1+0 storage configuration. Outgoing email is sent by a local Postfix MTA.
Debian pretty much everywhere. hub01.chandlerproject.org is Debian Etch AMD64, full 64-bit version. No hypervisor or any other kernel strangeness.
The OS is tracked using the specific release name, "etch" instead of the logical name "stable". This is because stable will eventually change, but the Hub servers should only be updated from an explicit action.
Currently, for boot procedure, the production Hub instance is started using /etc/rc.local. After runit service management is in use, this method would be removed.
The RAID card takes the 8 drives, forms a RAID 1+0 plus 2 hot-spare volume, and presents that single volume to the OS for partitioning. In this configuration, really only 3 of the drives are in use; the data is striped (RAID 0) across these 3x 15K RPM drives. All other drives are used for redundancy.
Hardware write-back cache is configured (this requires the use of the battery backup card).
So approximately 900GB is presented to the OS for partitioning. Using a ms-dos partition table layout, the following partitioning scheme is used:
- sda1  2GB            /boot                ext3  primary
- sda5  50GB           /                    ext3  logical
- sda6  20GB           /spare               ext3  logical
- sda7  48GB           swap                       logical
- sda8  (rest, 750GB)  LVM physical region        logical

$ fdisk /dev/sda

Disk /dev/sda: 896.9 GB, 896998047744 bytes
255 heads, 63 sectors/track, 109053 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1         243     1951866   83  Linux
/dev/sda2             244      109053   874016325    5  Extended
/dev/sda5             244        6322    48829536   83  Linux
/dev/sda6            6323        8754    19535008+  83  Linux
/dev/sda7            8755       14590    46877638+  82  Linux swap / Solaris
/dev/sda8           14591      109053   758774016   8e  Linux LVM

LVM configuration

- /dev/vg00/home00  250GB  /home       ext3
- /dev/vg00/var00   100GB  /var/local  ext3
The filesystem layout is Debian standard. These localizations should be noted:
/home : local storage, 250GB mounted from lvm, so snapshots available for consistent backups, production databases should move here when they outgrow / partition
/home/hub : working area of administrative tools/scripts
/var/local/hub : local var directory; contains subdirectories as needed
/var/local/hub/db-backups : mysqldump database backups
/var/local/hub/network-traces : ongoing production network dump
/var/local/hub/apache-logs : apache proxy log files
/etc/cron.d : local cron jobs should be placed here
/usr/local/java : the default JAVA_HOME for local apps
/usr/local/PACKAGE : local packages should get symlinks here, e.g. /usr/local/maven2
/opt/PACKAGE : vendor-provided packages installed here, e.g. JRockit
Ext3 filesystems are mounted noatime by default, and are built with dir_index.
The well-known HTTP ports, 80 and 443, are managed by Apache. The user's browser contacts port 443, Apache picks up the phone, and acts as an HTTP proxy to the backend Tomcat+Cosmo daemon (which also speaks HTTP, but on a higher port).
Technically, an HTTP server sitting in front of an application like this is called a reverse proxy (as opposed to a forward proxy, which sits in front of many local users at a company, for instance). So in our case, Apache mod_rewrite and mod_proxy are serving in a reverse proxy configuration to the Cosmo application.
Apache (we run 2.2) is in place to serve these functions:
SSL decryption
Detailed HTTP logging such as microseconds per request
Ability to throttle HTTP traffic if it were to become necessary
General purpose mod_rewrite engine allows familiar manipulation of URL namespace, to block certain URLs, support legacy operations, etc.
Apache stores three logs for each port:
access log (standard Apache Extended Common Log Format)
error log (Apache error)
detail log (customized local format containing about everything you can get out of an Apache log, in easy one-line-per-request format)
All Apache logs are managed using cronolog, which takes care of datestamped file rotation as well as symlinks to the current files. Apache logs are actually stored in /var/local/hub/apache-logs but are symlinked from the usual /var/log/apache2 location.
Apache configuration does not use Debian's split configuration files, since that layout is very difficult to keep under change control. Instead, the entire Apache configuration is a single file containing everything, including all variables and VirtualHost blocks.
The primary application of production is Chandler Server, aka Cosmo. Cosmo is a database-backed Java application, running inside a Tomcat container. Each instance of the Cosmo application is independent and rooted at a specific directory. All managed instances of Cosmo are rooted at subdirectories of /home/cosmos.
What we call a Cosmo instance here is really a large collection of interoperating components. As admin, you start and stop the Tomcat container, and Tomcat takes care of starting and stopping all the associated Java components.
The directory structure of each individual instance is like this:
/home/cosmos/[INSTANCE]/
- tomcat/
- bin/
- osafsrvctl
- etc
- cosmo.properties
- logs/
- osafsrv.log
- access.YYYY-MM-DD.log
- tomcat/
- conf/
- server.xml (main tomcat config file)
- tomcat/
- logs/ (rarely have anything)
- webapp/
- WEB-INF/
- classes/
- MessageResources.properties
- PimMessageResources.properties
- jsp/
- about.jsp
- build/
- cosmo/, snarf/, migration/, pom.xml
- migration/
- cosmo-migration-XXXXXXXXX-with-dependencies.jar
- migration.properties
All Cosmo applications are Java based, and the standard Java runtime to use is BEA JRockit, available at /usr/local/java.
The production database is MySQL 5.0, Debian Etch standard. All production tables are InnoDB with UTF-8 charsets.
We use a localized version of the MySQL daemon's configuration file at /etc/mysql/my.cnf.
The Cosmo relational database is structured around these primary concepts: users, items, attributes, stamps, tickets, subscriptions. As of Cosmo 0.7, the tables used are:
mysql> show tables;
+------------------------+
| Tables_in_cosmo_hub_02 |
+------------------------+
| attribute              |
| cal_property_index     |
| cal_timerange_index    |
| calendar_stamp         |
| collection_item        |
| content_data           |
| dictionary_values      |
| event_stamp            |
| item                   |
| multistring_values     |
| pwrecovery             |
| server_properties      |
| stamp                  |
| subscription           |
| ticket_privilege       |
| tickets                |
| tombstones             |
| user_preferences       |
| users                  |
+------------------------+
19 rows in set (0.07 sec)
The components of Hub maintenance include:
Cosmo updates: Updating Cosmo, using the bin/manage script. Both schema-changing updates and non-schema-changing updates
SVN checkins: Checking changes into the OSAF svn tree, in particular the /home/hub directory
Apache logs: Rotating and pruning the Apache access, error, and detail log files. See /var/local/hub/apache-logs
Network dumps: Maintaining a rotating network dump log for protocol-level investigation of errors. See /var/local/hub/network-traces
Database backups: Taking regular backups of the MySQL production databases; triggered via cron and created using mysqldump. See /etc/cron.d and bin/manage db_backup.
The production hardware is the single "big server" named "hub01.chandlerproject.org".
VAR/reseller: ASA Computers
2x Xeon Clovertown 5355 2.66GHz (quad core each, 8 cores total)
16x 2GB DIMMs ECC (32GB RAM total)
8x Seagate Cheetah 15K.5 300GB 3.5" ST3300655SS 5 Year Warranty
Keyboard, video, mouse plugged into gni-kvm.kei.com port #1
hub01.chandlerproject.org sits in the OSAF/KEI colocation facility at 365 Main (San Francisco). The service provider is Global Netoptex. The cage is a half-rack labelled "8-20A" in colo 8 (4th floor). 3 people have physical access permission: Jared Rhine, Dave Cowen, Paul Lathrop. Those three can also request remote hands from GNi 24x7 if needed, via phone call or email to GNi support. The IP-accessible KVM+managed power (gni-kvm.kei.com, port #1) serves to make most remote hands needs unnecessary. GNi remote hands information is here:
https://info.kei.com/bin/view/Technology/GlobalNetoptex
The 4U server is on sliding rails. Each of the two hot-swappable power supplies (800-watt) is plugged into a separate power circuit, controlled by an IP-accessible power switch. You can unplug one power cord at a time to "walk the machine" between hosting locations.
A Knoppix 5.2 DVD is sitting in the DVD-R drive in the server. The BIOS is set to boot from SCSI drive before optical drive. If remote recovery is needed, one can use the IP KVM to boot into the BIOS, change the order of boot devices (putting CD higher in the list), then boot into Knoppix for recovery attempts.
The hub01.chandlerproject.org box is plugged via 1x "Gig-E" (1Gb-Ethernet) NIC and CAT-5e cable to a managed gigabit switch. The switch is plugged upstream into GNi's switch fabric. GNi and 365 Main are very well connected, with a wide variety of peers.
hub01 sits on the following network:
hub01 IPv4: 64.127.108.178
Network:    64.127.108.128/26
Netmask:    255.255.255.192
Gateway:    64.127.108.129
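When editing interface or DNS settings, it can help to confirm that the addresses above are mutually consistent. The following sketch uses Python's standard ipaddress module; the helper name on_hub_network is made up for illustration.

```python
import ipaddress

# Network as listed in this runbook.
NETWORK = ipaddress.ip_network("64.127.108.128/26")


def on_hub_network(address):
    """Return True if `address` falls inside the hub01 network."""
    return ipaddress.ip_address(address) in NETWORK


if __name__ == "__main__":
    print("netmask:  ", NETWORK.netmask)
    print("broadcast:", NETWORK.broadcast_address)
    for label, address in [("hub01", "64.127.108.178"),
                           ("gateway", "64.127.108.129")]:
        print(label, address, "on network:", on_hub_network(address))
```

A /26 covers 64 addresses, so the netmask of 255.255.255.192 and the gateway at .129 agree with the network given above.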
There is no firewall in front of hub01.chandlerproject.org. Pings should work if the network and server are up and functional.
Chandler Hub really only takes up one DNS A record (forward DNS address), that of the production machine, hub01.chandlerproject.org at 64.127.108.178. A reverse PTR record is in place.
DNS primary for chandlerproject.org is managed via KEI IT systems on admin.kei.com as ns.osafoundation.org. DNS secondary is provided by ns2.osafoundation.org on makani.osafoundation.org. There is no third DNS secondary.
hub01.chandlerproject.org uses four servers for DNS resolution: the first is the OSAF primary on the same network, the next two are GNi's servers, and the last is the OSAF secondary.
ns.osafoundation.org    64.127.108.142  (admin.kei.com)
ns1.globalnetoptex.com  64.127.100.11
ns1.globalnetoptex.com  64.127.100.12
ns2.osafoundation.org   204.152.186.99  (makani.osafoundation.org)
Incoming email (MX) is not configured for chandlerproject.org or hub.chandlerproject.org. Currently, contact email addresses are in the "osafoundation.org" domain.
The published "administrative contact address" for Chandler Hub is "hub-admin --AT-- osafoundation.org". Activation emails come from this address, and Cosmo renders it occasionally when it needs to print a contact address. Jared Rhine and Ted Leung are on the distribution list for that alias.
There are SPF DNS records in both chandlerproject.org and osafoundation.org which include the hub01.chandlerproject.org and the admin/VM server IP addresses, so outgoing email should pass ok if sent from Hub production or a helper/admin virtual machine.
The people listed below have accounts on the production or test servers where production data may be found. Because of the possibility of access to account data, a person must be a member of the "service-eyes-only" group (and mailing list) to have physical/network server access. Changes to the service-eyes-only list are approved only by the OSAF operations/management group "ops-wg".
Server access
- Jared Rhine
- Dave Cowen
- Paul Lathrop
- Mike Taylor
- Randy Letness
Service-eyes-only additional people
- Mikeal Rogers
- Morgen Sagen
- Andi Vajda
The Hub is monitored via Nagios, Munin, and some custom collection scripts.
The Nagios monitor is running on monitor.kei.com. The script is /home/hub/hub_runtime_check.py. It's not a working area, but a copy of the script is in svn as well at http://svn.osafoundation.org/sandbox/hub/trunk/libexec/nagios/hub_runtime_check.py
The check does three operations: DAV PUT, MC subscribe, and CMP user fetch.
All operations are on a 15-second timeout.
The output of the monitor goes to a hub-specific Nagios contactgroup, where it is distributed to people's pagers.
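For orientation, a Nagios check of this shape can be sketched as below. This is not the contents of hub_runtime_check.py: the three check functions are empty placeholders, and only the exit-code convention (0 for OK, 2 for CRITICAL) and the 15-second timeout come from Nagios practice and this runbook.

```python
OK, CRITICAL = 0, 2        # standard Nagios plugin exit codes
TIMEOUT_SECONDS = 15       # per-operation timeout described in this runbook


def run_checks(checks, timeout=TIMEOUT_SECONDS):
    """Run each (name, callable) pair; return (exit_code, status_line).

    Each callable receives the timeout and raises an exception on failure.
    """
    failures = []
    for name, check in checks:
        try:
            check(timeout)
        except Exception as exc:
            failures.append("%s: %s" % (name, exc))
    if failures:
        return CRITICAL, "CRITICAL - " + "; ".join(failures)
    return OK, "OK - " + ", ".join(name for name, _ in checks) + " succeeded"


# Hypothetical placeholders; the real checks talk to the Hub over HTTP.
def dav_put(timeout):
    pass


def mc_subscribe(timeout):
    pass


def cmp_user_fetch(timeout):
    pass


def main():
    code, line = run_checks([("DAV PUT", dav_put),
                             ("MC subscribe", mc_subscribe),
                             ("CMP user fetch", cmp_user_fetch)])
    print(line)
    return code  # a real plugin would pass this value to sys.exit()


if __name__ == "__main__":
    main()
```

Nagios reads the first line of output as the status text and the process exit code as the state, so a single failing operation is enough to page the contactgroup.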
For detailed server stats, see:
For application and HTTP statistics, see:
Be sure to check the "Investigating a Nagios monitor alert/page" section for more details about monitoring.
2007-Mar: Purchase machine from ASA computers, and install at colo (RT#8182)
2007-Mar through July: Track RMA problems with the original hardware delivered by KEI (RT#8487, RT#8606)
2007-Jul: Set up hub01 server (RT#7711)
Rack machine, 2007-07-13
- Install rails
- Slide 4U into rails
- Connect 2x redundant power cables into managed power switches, ports XXX?
- Plug ethernet into
- Install Debian Etch 64-bit (AMD64) from Etch official netinst CD
- Use expert mode for installer
- Default charset of en_US.UTF8
- Unselect both default package sets at end of OS install to
configure bare minimum package set
- Create "jared" user during OS install
- add "Frontend: teletype" to first stanza of /etc/debconf.conf
- edit /etc/apt/sources.list. Prune to single "etch main" mirror and
one security/updates mirror. Use "etch", not "stable" for distrib
name. Remove CD-ROM and deb-src entries. Remove contrib from
security updates list. Use mirrors.kernel.org/debian/ for upstream.
- apt-get update
- apt-get install sudo ssh
- Update /etc/sudoers to include "%staff ALL=(ALL) NOPASSWD: ALL"
- Add user 'jared' to group 'staff'
- Remove "jared" from lots of standard /etc/group entries
- Remove manual KVM, attach to IP KVM gni-kvm.kei.com, port #1
- Check remote KVM and ssh accessibility, sudo access
Install basic packages
- apt-get install ntp bind9-host debconf-english bzip2 lsof pciutils tct time
- apt-get install less subversion curl screen rsync multitail
- apt-get install bwm-ng iftop tcpflow netdiag ngrep traceroute-nanog tshark
- apt-get install emacs21-nox emacs-goodies-el
- apt-get install lvm2 runit
- apt-get install python2.5-dev python2.4-dev
- apt-get install apache2-mpm-prefork cronolog mysql-server-5.0 mtop
- apt-get install debsecan debsums dstat e2undel
- apt-get install postfix (select 2-Internet site,
hub.chandlerproject.org mailname)
- apt-get install cron-apt
- Update /etc/cron-apt/config: MAILON=changes, MAILTO=cron-apt --AT-- osafoundation.org
Set up LVM for /home and /var/local
- pvcreate /dev/sda8
- vgcreate vg00 /dev/sda8
- lvcreate --size 250g --name home00 vg00
- lvcreate --size 100g --name var00 vg00
- mke2fs -j -O dir_index /dev/vg00/home00
- mke2fs -j -O dir_index /dev/vg00/var00
- Add to /etc/fstab: /dev/vg00/home00 /home ext3 noatime,errors=remount-ro 0 1
- Add to /etc/fstab: /dev/vg00/var00 /var/local ext3 noatime,errors=remount-ro 0 1
- Move /home/jared to /move/jared
- mount -a
- cp -a /move/jared into /home/; rm -rf /move
- Create "rletness" user; add to "staff" group
Install Java
- chmod 2775 /usr/local/src && chgrp staff /usr/local/src
- Download into /usr/local/src and install JRockit
R27.3.0-jdk1.5.0_11-linux-x64.bin into /opt and /usr/local/java
- As jared, run ssh-keygen; transfer public key to
svn.osafoundation.org to set up svn public key authentication
Set up /home/hub
- cd /home/jared && svn --no-auth-cache co svn+ssh://svn.osafoundation.org/svn/sandbox/hub/trunk hub
- sudo mv hub /home/
- cd /usr/local/src
- Download
http://archive.apache.org/dist/maven/binaries/maven-2.0.7-bin.tar.gz
into /usr/local/src. Untar, move to /usr/local/maven-2.0.7, and
symlink /usr/local/maven2 to maven-2.0.7. cd /usr/local/bin && ln
-s /usr/local/maven2/bin/mvn
- /home/hub/bin/manage build trunk
Install LSI RAID-card CLI
- Download into /usr/local/src
http://www.lsi.com/support/downloads/megaraid/miscellaneous/Linux_MegaCLI_1.01.24.zip
Linux CLI for LSI 8408E RAID card
- apt-get install unzip rpm
- cd /usr/local/src && mkdir lsi-cli && cp Linux_MegaCLI_1.01.24.zip
lsi-cli && cd lsi-cli
- unzip Linux_MegaCLI_1.01.24.zip
- unzip MegaCliLin.zip
- Move to /opt/MegaCli
- apt-get install diffmon
- mkdir -p /var/local/raid-monitor
- editor /etc/cron.d/raid-monitor
[hourly, /opt/MegaCli -AdpAllInfo -a0 > /var/local/raid-monitor/allinfo.log]
[hourly+3min diffmon -c /etc/diffmon/diffmon.cf]
- rm /etc/cron.daily/diffmon
- editor /etc/diffmon/diffmon.cf
[/var/local/raid-monitor/allinfo.log]
Tune packages
- sudo aptitude
- Mark all un-needed packages as dependencies (key M). Remove swaths
of un-needed packages, resulting in tuned package set
Set up file transfer
- sudo adduser filexfr
- sudo -u filexfr -i
- ssh-keygen
- exit
- sudo mkdir -p /var/service/filexfr/log/logs
- [create runit service]
Set up database backups
- sudo editor /etc/cron.d/hub-db-backup
0 * * * * root /home/hub/bin/manage db_backup production > /dev/null
- sudo /etc/init.d/cron reload
Set up network trace
- sudo apt-get install tcpflow
- sudo mkdir -p /var/service/hub-network-log/log/logs
- [create runit service]
Set up package monitoring
- sudo apt-get install apticron
- [configure]
The backup files are (currently) straight MySQL files, so they can be piped into the mysql command-line tool.
On hub01.chandlerproject.org, backups can be found in /var/local/hub/db-backups directory. They are taken every hour.
Using the runit service filexfr, copies of the backup files are rsynced off-host to the protected archive.osafoundation.org server in the /home/files/backups directory.
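Since backups are hourly, a quick way to spot a stalled backup job is to look at the timestamp in the newest file name. The sketch below assumes names of the form DBNAME_YYYYMMDDTHHMMSS.sql (optionally with .bz2), inferred from the example file name cosmo_hub_01_20070724T070002.sql used in this runbook; the two-hour staleness threshold is an arbitrary choice for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed naming pattern, inferred from one example file name.
STAMP = re.compile(r"_(\d{8}T\d{6})\.sql(?:\.bz2)?$")


def backup_time(filename):
    """Return the datetime embedded in a backup file name, or None."""
    match = STAMP.search(filename)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%dT%H%M%S")


def newest_backup(filenames):
    """Return the name with the latest embedded timestamp, or None."""
    stamped = [(backup_time(name), name) for name in filenames]
    stamped = [pair for pair in stamped if pair[0] is not None]
    return max(stamped)[1] if stamped else None


def is_stale(filename, now, max_age=timedelta(hours=2)):
    """True if the backup is older than `max_age` or has no timestamp."""
    taken = backup_time(filename)
    return taken is None or now - taken > max_age


if __name__ == "__main__":
    names = ["cosmo_hub_01_20070724T060002.sql",
             "cosmo_hub_01_20070724T070002.sql"]
    latest = newest_backup(names)
    print(latest, "stale:", is_stale(latest, datetime.now()))
```

In practice the list of names would come from os.listdir("/var/local/hub/db-backups").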
To restore, figure out the name of the mysql database you want to restore into, empty that database, and then pipe in the mysql backup file:
bzcat /tmp/backups.sql.bz2 | mysql -uroot dbname
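If the restore needs to be scripted, the same pipe can be driven from Python so that the exit status of the mysql client is checked. This is a sketch only; the function name restore is made up, and the command shown in the comment simply mirrors the shell pipeline above.

```python
import bz2
import shutil
import subprocess


def restore(backup_path, command):
    """Decompress `backup_path` and stream it to `command` on stdin.

    Returns the exit status of the command; non-zero means the restore failed.
    """
    with bz2.open(backup_path, "rb") as dump:
        proc = subprocess.Popen(command, stdin=subprocess.PIPE)
        try:
            shutil.copyfileobj(dump, proc.stdin)
        finally:
            proc.stdin.close()
        return proc.wait()


# Equivalent of the shell pipeline above (not executed here):
#   status = restore("/tmp/backups.sql.bz2", ["mysql", "-uroot", "dbname"])
#   if status != 0: stop and investigate before going any further
```

As with the shell version, the target database should be emptied first.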
You can also use bin/manage to manage and restore tables. That would look like:
ssh hub01.chandlerproject.org
cd /home/hub
sudo bin/manage db_import INSTANCE /tmp/cosmo_hub_01_20070724T070002.sql
This situation is too unique each time to be able to generalize procedures. In general, to recover MySQL, you'll be trying to replay the transaction logs. See the MySQL documentation and google for information about how to proceed. Remember we have plain-text backups, worst case.
Common cause is trying to bind to a port which is already used. Check other instances. Stop, rebuild, or reconfigure your instances as appropriate in that case.
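A quick way to test for a port conflict before starting an instance is to try connecting to the port locally. The sketch below is a generic probe, not part of bin/manage; port 8000 is used only because it appears as a tomcat_http_port value elsewhere in this runbook.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return probe.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("port 8000 in use:", port_in_use(8000))
```

If the port is taken, lsof (installed as part of the basic package set) can identify which process holds it.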
Look in the osafsrv.log. If there are no indications, you could try increasing the debugging level, but in the past there has almost always been a traceback indicating the problem.
It's possible for Tomcat to start, but Cosmo to not start. You should be seeing Tomcat HTTP errors upon access in those cases.
See if other instances can be started. If a particular instance has been damaged, you might be able to build a different instance to point to the appropriate database and port.
Sometimes the Java VM (JRE/JDK) needs to be updated, for instance to fix a Java bug or match the requirements of the Cosmo runtime.
We use the JDK provided by BEA named JRockit. The following procedure was documented for the first installation of JRockit on hub01.chandlerproject.org:
Head to www.bea.com in a browser
Select the "all products" view from the nav menu "Products -> All Products"
Follow the "JRockit" product link
Look for the "download" link, in the upper right as of 2007-07-15
Select the latest appropriate JRockit version (JRockit 5.0 as of 2007-07-15)
Accept the binary license agreement
Find the latest 64-bit Linux version of the JDK
Save the URL for that version (right-click, "copy link location")
Log in to machine to install
cd /usr/local/src
wget [PASTED URL HERE]
Make the .bin file executable: chmod +x jrockit-*.bin
Run the installer: ./jrockit-*.bin
Skip the intro page: hit enter
Accept the license: type 1 then hit enter
Enter install directory: /opt/jrockit-[SAME-AS-PACKAGE]; hit enter, then enter again to accept
Create /usr/local/java link: cd /usr/local; rm java; ln -s /opt/jrockit-[YOURS] java
Applications which need Java should set JAVA_HOME=/usr/local/java. It isn't set globally to reduce global config and to force you to think about JAVA_HOME in your app
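For a tool that launches Java programs, setting JAVA_HOME per application can look like the sketch below. The helper name app_environment is hypothetical, and prepending JAVA_HOME/bin to PATH is an added assumption rather than something this runbook prescribes.

```python
import os

JAVA_HOME = "/usr/local/java"  # default JAVA_HOME for local apps on hub01


def app_environment(base=None):
    """Return a copy of the environment with JAVA_HOME set for one app."""
    env = dict(os.environ if base is None else base)
    env["JAVA_HOME"] = JAVA_HOME
    # Assumption: also put the JDK's bin directory first on PATH.
    env["PATH"] = os.path.join(JAVA_HOME, "bin") + os.pathsep + env.get("PATH", "")
    return env


if __name__ == "__main__":
    env = app_environment()
    print("JAVA_HOME =", env["JAVA_HOME"])
    # A launcher would then pass it explicitly, for example:
    #   subprocess.run(["some-java-app"], env=env)
```

Because the environment is built per launch, switching one application to a different JDK does not affect the others.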
Every once in a while, you may need to change the actual production server from the standard hub01.chandlerproject.org.
This was done in early August 2007 before the Chandler Hub update to the Chandler Server 0.7 release. The procedure used for the update follows. The procedure is odd because current tools can't really build a good 0.6.1.1. So the hub.chandlerproject.org production instance will simply be tarballed up.
[reduce DNS expiry to 15m]
[confirm that hub.chandlerproject.org currently pointing to hub02's IP]
ssh hub01.chandlerproject.org
cd /home/hub
editor conf/instances.conf
  [instances]
  production: hub-02
  [hub-02]
  tomcat_http_port: 8000
  database_name: cosmo_hub_01
  reverse_proxy_host: hub.chandlerproject.org
  reverse_proxy_port: 443
ssh app-dogfood.ops.osaf.us
cd /home/osaf.us
sudo bin/manage stop hub-02
[confirm hub down page is being served]
cd /home/cosmos
tar zvcf /tmp/cosmo-hub-02.tar.gz --exclude access.*.log --exclude osafsrv.log* hub-02
sudo /home/osaf.us/bin/backup-mysql -d cosmo_hub_01 -D /var/local/osaf.us/db-backups
scp /tmp/cosmo-hub-02.tar.gz /var/local/osaf.us/db-backups/LATEST.sql.bz2 hub01.chandlerproject.org:/tmp
cd /home/osaf.us
editor conf/httpd-app-dogfood.conf
[switch in block to proxypass all traffic to hub02.chandlerproject.org]
sudo bin/update-apache
ssh hub01.chandlerproject.org
cd /home/cosmos
cp /tmp/cosmo-hub-02.tar.gz .
mv /tmp/LATEST.sql.bz2 /tmp/migrate.sql.bz2
tar zxvf cosmo-hub-02.tar.gz
chown -R cosmo:cosmo hub-02
cd /home/hub
sudo bin/manage db_import production /tmp/migrate.sql.bz2
sudo bin/manage start production
sudo bin/update-apache XXX
ssh admin.kei.com
sudo editor /etc/bind/zones/
[bump serial number]
[change hub A record to be same as hub01]
sudo /etc/init.d/bind9 reload && sudo tail -f /var/log/syslog
ssh app-dogfood.ops.osaf.us
cd /home/osaf.us
sudo bin/update-apache
svn --username jared ci conf/httpd-app-dogfood.conf
scp [access logs into position; merge]
At this point, we have an instance that can be mostly managed by bin/manage (start, stop, db_backup), though the dogfood-03b instance can't be rebuilt or reconfigured.
The procedure that would be used if osaf.us 0.6-legacy were not in the picture would be closer to:
ssh hub01.chandlerproject.org
cd /home/hub
editor conf/instances.conf
  [instances]
  production: rc01-0.7
  old: rc01-0.7
  new: prod-0.7
  [prod-0.7]
  cosmo_svn_revision: 7777
  cosmo_svn_url: http://svn.osafoundation.org/server/cosmo/tags/rel_0.7.0
  tomcat_http_port: 8000
  reverse_proxy_host: hub.chandlerproject.org
  reverse_proxy_port: 443
  database_name: hub_prod
sudo bin/manage build new
ssh app-dogfood.ops.osaf.us
cd /home/osaf.us
sudo bin/manage stop hub-02
[check service down page and errors]
sudo bin/backup-mysql -d cosmo_hub_01 -D /var/local/osaf.us/db-backups
scp /var/local/osaf.us/db-backups/[the right file].sql.bz2 hub01.chandlerproject.org:/tmp/mig07.sql.bz2
ssh hub01.chandlerproject.org
cd /home/hub
sudo bin/manage db_import new /tmp/mig07.sql.bz2
sudo bin/manage db_migrate new
sudo bin/manage stop old
sudo bin/manage start new
editor conf/instances.conf
  [instances]
  production: prod-0.7
  old: prod-0.7
  new: trunk
svn --username bob ci conf/instances.conf
ssh app-dogfood.ops.osaf.us
cd /home/osaf.us
editor conf/httpd-app-dogfood.conf
[use proxypass to hub01.chandlerproject.org]
bin/update-config
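The instances.conf fragments in these procedures map an alias such as production to a named instance section, which then carries the port and database settings. The sketch below reads that layout with Python's standard configparser module. Whether bin/manage parses the file this way is an assumption; the sample text is copied from the first procedure above and the function name resolve is made up.

```python
import configparser

SAMPLE = """\
[instances]
production: hub-02

[hub-02]
tomcat_http_port: 8000
database_name: cosmo_hub_01
reverse_proxy_host: hub.chandlerproject.org
reverse_proxy_port: 443
"""


def resolve(conf_text, alias):
    """Return (instance_name, settings) for an alias in [instances]."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    name = parser["instances"][alias]   # e.g. production -> hub-02
    return name, dict(parser[name])     # settings of the named instance


if __name__ == "__main__":
    name, settings = resolve(SAMPLE, "production")
    print(name, settings["tomcat_http_port"], settings["database_name"])
```

Repointing an alias is then a one-line change in [instances], which is why the procedures edit that section just before stopping old and starting new.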
The procedure to share a GNU screen session between multiple users (helpful for shared administration and emergency team diagnostics) is:
The host (say, alice) starts a screen session on the selected host.
Do [SCREEN-COMMAND-KEY]:multiuser on[RET]
Do [SCREEN-COMMAND-KEY]:acladd bob[RET] (where "bob" is the Unix username of the person who will attach later)
Bob starts screen -x alice/
Bob is now "in" alice's screen session; he can do a [SCREEN-COMMAND-KEY]? for help and [SCREEN-COMMAND-KEY]" to move around between windows within that session. When alice and bob are on the same window, they will see instant updates of any keystrokes either of them makes.
[SCREEN-COMMAND-KEY] is [control-A] by default, but is sometimes changed (by the host in this case) to avoid stomping on the readline beginning-of-line keystroke.
bug #8614, #8385

delete from subscription where not exists (select id from item where uid=collectionuid);

Duplicate ical uids:

select distinct collectionid from collection_item where itemid in (
  select id from item i
  where itemtype='note' and icaluid is not null and modifiesitemid is null
  and exists (select id from item i2, collection_item ci2
              where ci2.itemid=i2.id and i2.id!=i.id and i2.modifiesitemid is null
              and i2.icaluid=i.icaluid
              and ci2.collectionid in (select collectionid from collection_item where itemid=i.id))
)

--

It looks like on hub there are 26 modifications spanning 9 collections that are out of sync with their master items. This means the set of parent collections for these modifications is different from the set of parent collections for the master item. This basically finds all modification items whose set of parents doesn't match the set of master item parents. The query I used to find this out:

SELECT i.id from item i where modifiesitemid is not null and ((
  (select count(collectionid) from collection_item where itemid=i.id) !=
  (select count(collectionid) from collection_item where itemid=i.modifiesitemid)
) or (
  (select count(collectionid) from collection_item where itemid=i.id) !=
  (select count(collectionid) from collection_item where itemid=i.id
   and collectionid in (select collectionid from collection_item where itemid=i.modifiesitemid))
))

To remove all subscriptions where the ticket doesn't exist on the collection:

delete s from subscription s where not exists (select i.id from item i, tickets t where i.id=t.itemid and t.ticketkey=s.ticketkey and i.uid=s.collectionuid)

To fix just a single user's subscriptions:

delete s from subscription s where not exists (select i.id from item i, tickets t where i.id=t.itemid and t.ticketkey=s.ticketkey and i.uid=s.collectionuid) and '[USERNAME]' = (select username from users where id=s.ownerid)

--

A developer wanted to know what the triage status in the production and migrated databases actually was.
After reviewing the list of tables (see above), I checked that the "item" table had triage fields:

mysql> describe item;
+--------------------+---------------+------+-----+---------+----------------+
| Field              | Type          | Null | Key | Default | Extra          |
+--------------------+---------------+------+-----+---------+----------------+
| itemtype           | varchar(16)   | NO   | MUL |         |                |
| id                 | bigint(20)    | NO   | PRI | NULL    | auto_increment |
| createdate         | bigint(20)    | YES  |     | NULL    |                |
| modifydate         | bigint(20)    | YES  |     | NULL    |                |
| clientcreatedate   | bigint(20)    | YES  |     | NULL    |                |
| clientmodifieddate | bigint(20)    | YES  |     | NULL    |                |
| itemname           | varchar(255)  | NO   | MUL |         |                |
| displayname        | varchar(255)  | YES  |     | NULL    |                |
| uid                | varchar(255)  | NO   | UNI |         |                |
| version            | int(11)       | NO   |     |         |                |
| contentEncoding    | varchar(32)   | YES  |     | NULL    |                |
| contentLanguage    | varchar(32)   | YES  |     | NULL    |                |
| contentLength      | bigint(20)    | YES  |     | NULL    |                |
| contentType        | varchar(64)   | YES  |     | NULL    |                |
| lastmodifiedby     | varchar(255)  | YES  |     | NULL    |                |
| lastmodification   | int(11)       | YES  |     | NULL    |                |
| triagestatuscode   | int(11)       | YES  |     | NULL    |                |
| triagestatusrank   | decimal(12,2) | YES  |     | NULL    |                |
| isautotriage       | bit(1)        | YES  |     | NULL    |                |
| sent               | bit(1)        | YES  |     | NULL    |                |
| needsreply         | bit(1)        | YES  |     | NULL    |                |
| icaluid            | varchar(255)  | YES  |     | NULL    |                |
| modifiesitemid     | bigint(20)    | YES  | MUL | NULL    |                |
| ownerid            | bigint(20)    | NO   | MUL |         |                |
| contentdataid      | bigint(20)    | YES  | MUL | NULL    |                |
+--------------------+---------------+------+-----+---------+----------------+
25 rows in set (0.00 sec)

I see three triage fields and a few ids. Asking the developer if they have the Cosmo UUID or iCal UID. They do, so I look for rows:

mysql> select * from item where icaluid='5e5582f4-3026-11dc-caa7-b5b97574266e';

and I found that the three fields were NULL.
--

From bug #10516, for fixing UUID-titled events.

Find all modifications with UUID as title:

select * from item where modifiesitemid is not null and displayName=uid

Update all modifications with UUID as title to inherit from master (5 steps):

1. Create temp table to store affected item ids

create table temp (id integer unsigned not null)

2. Populate temp table with affected item ids

insert into temp (select id from item where modifiesitemid is not null and displayName=uid)

3. Fix affected items

update item set displayName=null, modifydate=UNIX_TIMESTAMP()*1000, version=version+1 where id in (select id from temp)

4. Update collections of affected items so that sync will pull changes next sync

update item set modifydate=UNIX_TIMESTAMP()*1000, version=version+1 where id in (select collectionid from collection_item where itemid in (select id from temp))

5. Get rid of temp table

drop table temp
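Before running the bug #10516 fix against production, the five steps can be rehearsed on a throwaway database. The sketch below replays them against an in-memory SQLite database with a cut-down, made-up schema (only the columns the fix touches). SQLite stands in for MySQL here, so the working table is named affected and the timestamp is passed in from Python instead of using UNIX_TIMESTAMP().

```python
import sqlite3


def build_toy_db():
    """Create a tiny stand-in for the item and collection_item tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE item (id INTEGER PRIMARY KEY, uid TEXT, displayName TEXT,
                           modifiesitemid INTEGER, modifydate INTEGER,
                           version INTEGER);
        CREATE TABLE collection_item (collectionid INTEGER, itemid INTEGER);
        INSERT INTO item VALUES (1, 'c1', 'Work', NULL, 0, 0);      -- collection
        INSERT INTO item VALUES (2, 'm1', 'Meeting', NULL, 0, 0);   -- master
        INSERT INTO item VALUES (3, 'u-3', 'u-3', 2, 0, 0);         -- broken mod
        INSERT INTO item VALUES (4, 'u-4', 'Moved', 2, 0, 0);       -- healthy mod
        INSERT INTO collection_item VALUES (1, 2), (1, 3), (1, 4);
    """)
    return conn


def fix_uuid_titles(conn, now_ms):
    """Replay the five steps; return the number of modifications fixed."""
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE affected (id INTEGER NOT NULL)")       # 1
    cur.execute("INSERT INTO affected SELECT id FROM item "
                "WHERE modifiesitemid IS NOT NULL AND displayName = uid") # 2
    cur.execute("UPDATE item SET displayName = NULL, modifydate = ?, "
                "version = version + 1 "
                "WHERE id IN (SELECT id FROM affected)", (now_ms,))       # 3
    cur.execute("UPDATE item SET modifydate = ?, version = version + 1 "
                "WHERE id IN (SELECT collectionid FROM collection_item "
                "WHERE itemid IN (SELECT id FROM affected))", (now_ms,))  # 4
    fixed = cur.execute("SELECT COUNT(*) FROM affected").fetchone()[0]
    cur.execute("DROP TABLE affected")                                    # 5
    conn.commit()
    return fixed


if __name__ == "__main__":
    db = build_toy_db()
    print("modifications fixed:", fix_uuid_titles(db, 1188345600000))
```

Step 4 bumps the version of each parent collection as well as the modification itself, which is what makes clients pick up the change on their next sync.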