Over the past week, a series of events compounded to cause several issues with our residential VoIP service.
You've likely seen your adapter losing its connection and requiring a daily reboot, poor voicemail call quality, and a Friday morning outage, among other things. To be frank, we've quite possibly had more issues in the past week than we did in our entire two-year BETA testing period.
I'd like to personally apologize for the unacceptable level of service you've experienced over the last week and give you some insight into what happened.
Service Credit
We've already issued a 14-day credit in the form of a renewal extension to all residential accounts. While this doesn't change the fact that your service was unreliable this week, we feel it's appropriate in this situation.
What Happened
A little over a week ago, we began to see reports of poor call quality and static from some of our newest customers. As these reports continued to mount, we looked for the cause and eventually discovered that a firmware issue on some of the newest adapters was causing the static. Firmware is basically the software that runs on the adapter to keep it operating. When we upgraded the firmware on the affected adapters to the newest version, most saw an immediate call quality improvement, so we made the decision to push the new firmware out to all adapters to resolve the call quality issue.
Because we felt a sense of urgency to push out this new firmware, so that the new customers with the issue would not think our call quality is always bad, we did not test it as thoroughly as we should have. We enabled all adapters to receive the update automatically and pushed it out. For a while, all seemed well except for a few isolated cases of issues with the update.
A few days later, we began seeing a large number of reports of adapters losing their connection to our network. When this happened, users would need to reboot them to get them back online. After searching for the cause, we initially concluded that this was related to a load situation caused by our rapid growth. We noticed that during hourly backups, the load was getting extremely high on some servers, so we restructured the backups and the issue seemed resolved. The backup load definitely was the cause for some customers, but it turned out not to be the main cause.
The next day, we began to see the reports again despite the fact that there were no ongoing load issues. As it turns out, the new firmware update sporadically prevented the adapters from automatically reconnecting to our network after updates. So when an adapter synced up periodically throughout the day to download the latest configuration, or was idle for a while, it would sometimes not re-register with us until it was rebooted. This is not how it's supposed to function. Once we finally discovered this was the true issue, we disabled automatic firmware updates, set the adapters to reconnect every minute so they wouldn't go idle, and made some changes to make our system more forgiving with connections. Once customers rebooted their adapters and picked up the changes, this seemed to resolve the issue entirely.
Since it had already been a difficult week for some customers, we tried to avoid any additional issues by identifying other areas in which we could foresee potential problems. We began restructuring our core database clusters, since we knew that with our rapid growth we needed to optimize them. When these changes were made, human error came into play and a config file was set up incorrectly on some new database servers: a setting was left at its default when it should have been set differently. While this error did not have an immediate impact on anything, it did on Friday morning when we reached a connection limit on a primary database server. When this happened, it caused somewhat of a chain reaction and brought our database cluster offline. If this config file had been configured correctly, this would not have happened. In short, human error in the configuration changes resulted in a system-wide outage that lasted until we restored database access.
On Thursday, we also began seeing several issues with voicemail call quality. This turned out to be related to a malfunctioning Sonus gateway on an upstream provider's network that was incorrectly setting up the timing for RTP packets when interacting with our voicemail system. We applied a temporary fix on our end by proxying the audio and correcting it. The upstream carrier has since fixed that gateway, and this issue was fully resolved on Friday.
Future
Ironically, our biggest mistake here was being too aggressive in trying to resolve issues as they came up, applying fixes which had not been tested as thoroughly as they should have been.
The firmware update was applied to address a call quality issue some users reported. The hourly backups that caused some adapters to lose their connection were themselves a preventative measure. The database changes, and the resulting human error with the config that caused Friday's outage, were put in place very quickly to prevent any additional issues, since there had already been so many earlier in the week.
While we generally do a lot of BETA testing for new features, we did not do nearly as much testing with these fixes because we were trying to resolve issues rapidly. We're just not used to customers having problems, and we want to resolve them immediately when they do come up. Additionally, our sales volume has been approximately 6-7x what it was during our launch, and we did not want these issues to last long and give new customers a negative first impression.
In the future, we're going to do more testing on fixes, even if that means non-critical issues won't get resolved as quickly as we'd like, in order to avoid even bigger issues.
We're also going to decentralize some of our database infrastructure to avoid similar domino effects in the future, and we'll work on improving customer communication in outage situations.
Once again, on behalf of VOIPo, I apologize for the level of service you received this past week. We appreciate your continued confidence in VOIPo and anticipate a much smoother experience in the future.
If you are experiencing any problems, please contact us so we can get them resolved for you.