Email server 2018/08/19
What happened to the email server on Friday, August 17, 2018?
During the week we had been getting reports that the server was slow. Checking the logs and stats on the server did not reveal anything out of the ordinary. On Friday we got a couple of reports of the server hanging for long periods of time. Since I could not find anything wrong, I decided to reboot the server. After about 30 minutes of it trying to shut down, I concluded that a process must be hung. I powered off the computer and powered it back on. When it came back up, the partition holding all the mail did not mount. I shut down again, physically unplugged power to the server and drives, reconnected everything, and powered back on. Still no mail partition.
I ran Disk Utility and got errors that there were unrecoverable problems with the B-tree. I copied DiskWarrior and TechTool Pro from my Mac to the email server and tried running them, but they did not see the damaged partition. I pulled the RAID from the server room and took it to Gordon’s office, where he started looking for tools to repair the partition. I connected another RAID to the mail server and started a restore from the CrashPlan backup.
At around 6 pm we had still had no luck restoring the partition. We connected the RAID to a spare Mac mini, installed Kerio mail server on it, and also connected another RAID. The RAID with the bad partition also had a partition that the mail server used for a nightly backup. The last backup had finished at 3 am on Friday. I decided to try to restore from that backup while the CrashPlan backup was restoring to the original Mac mini and the new RAID. Based on the time it took to process one 2GB backup file, we computed that the restore would take about 8 hours.
Saturday morning at 7 am the restore from the nightly backup was done. I decided to check DiskWarrior’s website for updates to the utility and found that there was one. I downloaded the update and installed it at 7:30 am. It was able to see the damaged partition, and I told it to start a repair. It finished recreating the directory at 3:30 am Sunday, and I told it to use the recreated directory. Once it put the new directory in place, it started fixing damaged files. As of this writing, 2:30 pm Sunday, it is still in that last step. It does not appear to be stuck: watching system.log, I can see that it is still working.
As of this writing the CrashPlan restore has restored 787GB out of 1.4TB.
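The restore figures above can be turned into a rough progress and time-remaining estimate. The following is only a minimal sketch: it assumes 1.4TB means 1400GB, and it assumes the restore keeps running at its average rate so far. The function names are hypothetical, and the elapsed time has to be supplied by whoever runs it, since the exact start time of the CrashPlan restore is not recorded here.

```python
def restore_progress(restored_gb, total_gb):
    """Return the fraction of the restore that is complete (0.0 to 1.0)."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    return restored_gb / total_gb


def hours_remaining(restored_gb, total_gb, elapsed_hours):
    """Linear extrapolation: assumes the average rate so far continues."""
    if restored_gb <= 0 or elapsed_hours <= 0:
        raise ValueError("need some progress and elapsed time to extrapolate")
    rate_gb_per_hour = restored_gb / elapsed_hours
    return (total_gb - restored_gb) / rate_gb_per_hour


RESTORED_GB = 787   # from the report
TOTAL_GB = 1400     # assumption: 1.4TB taken as 1400GB

progress = restore_progress(RESTORED_GB, TOTAL_GB)
print(f"{progress:.1%} restored, {TOTAL_GB - RESTORED_GB} GB remaining")
```

Calling hours_remaining(RESTORED_GB, TOTAL_GB, elapsed_hours) with the actual hours since the restore started gives an estimate of the time left. Real restore rates vary with file sizes and network load, so the result is a guide rather than a deadline.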
OPTIONS RIGHT NOW.
1. Stop the partition recovery so that I can remove the RAID that has the mail server as of 3 am on Friday morning. All mail from Friday is on https://mailarchive.ihmc.us and could be forwarded to the users.
2. Let the partition recovery continue and then rsync the files to the RAID that has the mail as of 3am.
3. Start up a new empty server so that people can start emailing again, and then, when the partition is repaired, bring it online as an archived mail server.
4. A combination of 1 and 3.
When I went to the ARI conference in Michigan a few months ago, one of the big topics was that research companies were moving their email to either Gmail or the Microsoft Cloud platform. I know that many universities have moved to Gmail for their mail hosting; I think UWF moved about 3 years ago. I have been researching a move to a hosted system for the past three months and was just having a hard time with the recurring cost.
While sitting and watching the restore plod along, I continued to research. The company we use for spam/virus filtering, AppRiver, offers hosted email. I had looked at it in the past but did not bring it up because of the price. Our price would be about $120 per mailbox per year. There might be some users who only need Hosted Exchange Lite Plus, which would be half the cost.
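To put the recurring cost in concrete terms, the pricing above can be sketched as a small calculation. This is an illustration only: the $120 figure is the approximate per-mailbox price quoted above, the Lite Plus price is taken as exactly half of it, and the mailbox counts in the example are hypothetical placeholders, not our actual user counts.

```python
FULL_PRICE_PER_YEAR = 120.0                     # approximate price per mailbox
LITE_PRICE_PER_YEAR = FULL_PRICE_PER_YEAR / 2   # Lite Plus at half the cost


def annual_cost(full_mailboxes, lite_mailboxes=0):
    """Estimated yearly cost in dollars for a mix of full and Lite Plus mailboxes."""
    if full_mailboxes < 0 or lite_mailboxes < 0:
        raise ValueError("mailbox counts cannot be negative")
    return (full_mailboxes * FULL_PRICE_PER_YEAR
            + lite_mailboxes * LITE_PRICE_PER_YEAR)


# Hypothetical counts, for illustration only.
example_full = 70
example_lite = 30
print(f"${annual_cost(example_full, example_lite):,.2f} per year")
```

Substituting the real number of mailboxes in each tier gives the yearly figure to compare against the cost of running the onsite server.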
In the wee hours of Sunday morning I had an idea: we could have two email servers, one for students/interns and another for researchers/admin. The researchers/admin could be on AppRiver's hosted service, which would be high availability. Students/interns would remain on our onsite email server. Here is the URL for AppRiver's hosted service for more info: <https://www.appriver.com/services/secure-hosted-exchange/>
Not only are they hosted, but they are HIPAA compliant and take email security seriously, something I don’t think Gmail and Microsoft Cloud are as serious about. AppRiver is SSAE 16 / ISAE-3402 Type II SOC 1 compliant, which has to do with auditing. (“A reporting framework through which organizations can communicate relevant useful information about the effectiveness of their cybersecurity risk management program and CPAs can report on such information to meet the cybersecurity information needs of a broad range of stakeholders.”)