VOIPo Service Update - Recent Service Levels

**VOIPoTim** · 02-07-2009 03:26 AM

Over the past week, we've seen a series of events that compounded to cause several issues with our residential VoIP service.

You've likely seen your adapter losing its connection and requiring a daily reboot, poor voicemail call quality, and a Friday morning outage, among other things. To be frank, we've quite possibly had more issues in the past week than we did in our entire two-year BETA testing period.

I'd like to personally apologize for the unacceptable level of service you've experienced over the last week and give you some insight into what happened.

Service Credit

We've already issued a 14-day credit in the form of a renewal extension to all residential accounts. While this doesn't change the fact that your service was unreliable this week, we feel it's appropriate in this situation.

What Happened

A little over a week ago, we began to see some reports of poor call quality and static from some of our newest customers. As these reports continued to mount, we began to look for the cause and eventually discovered that some of the newest adapters had an issue with the firmware that was causing some static. Firmware is basically what runs on the adapter to keep it operating. When we upgraded the firmware on the adapters to the newest version, most saw an immediate call quality improvement, so we made the decision to update them all and push out new firmware to resolve the call quality issue.

Given the sense of urgency to push out this new firmware so the new customers with the issue would not think our call quality is always bad, we did not test it as thoroughly as we should have. We enabled all adapters to receive the update automatically and pushed it out. For a while, all seemed well except some very isolated cases of issues with the update.

A few days later, we began seeing a large number of reports of adapters losing their connection to our network. When this happened, users would need to reboot them to get them back online. After searching for the cause, we initially concluded that this was related to a load situation caused by our rapid growth. We noticed that during hourly backups, the load was getting extremely high on some servers, so we restructured the backups and the issue seemed resolved. The backup load definitely was the cause for some customers, but it turned out not to be the main cause.

The next day, we began to see the reports again despite the fact that there were no ongoing load issues. As it turns out, the new firmware sporadically prevented the adapters from automatically reconnecting to our network. When an adapter synced up periodically throughout the day to download its latest configuration, or had been idle for a while, it would sometimes fail to re-register with us until it was rebooted. This is not how it's supposed to function. Once we finally discovered this was the true issue, we disabled automatic firmware updates, set the adapters to reconnect every minute so they wouldn't go idle, and made some changes to make our system more forgiving with connections. Once customers rebooted their adapters and picked up the changes, this seemed to resolve the issue entirely.

In an effort to avoid any additional issues, since it had already been a difficult week for some customers, we worked on identifying any other areas where we could foresee potential problems. We began restructuring our core database clusters since we knew that, with our rapid growth, we needed to optimize them. When these changes were made, human error came into play: a config file was set up incorrectly on some new database servers, leaving a setting at its default when it should have been changed. This error had no immediate impact, but on Friday morning we reached a connection limit on a primary database server. That set off a chain reaction and brought our database cluster offline, resulting in a system-wide outage until we restored database access. If the config file had been set up correctly, this would not have happened.
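As a rough illustration of how a setting left at its default could be caught before a server goes live, here is a minimal sketch of a pre-deployment check. The setting names, default values, and the `settings_left_at_default` helper are all hypothetical; the post does not name the database software or the setting involved.

```python
# Hypothetical defaults; the post does not say which database or setting was involved.
DEFAULTS = {"max_connections": 100, "wait_timeout": 28800}

# Hypothetical list of settings that must be changed from their defaults.
REQUIRED_OVERRIDES = {"max_connections"}


def settings_left_at_default(config, defaults=DEFAULTS, required=REQUIRED_OVERRIDES):
    """Return sorted names of required settings that are missing or still at the default."""
    return sorted(
        name
        for name in required
        # A missing key falls back to the default, so it is flagged as well.
        if config.get(name, defaults[name]) == defaults[name]
    )


if __name__ == "__main__":
    new_server_config = {"wait_timeout": 600}  # max_connections was never set
    print(settings_left_at_default(new_server_config))
```

A check along these lines could run against each new server's config and block the rollout whenever the returned list is not empty.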

On Thursday, we also began seeing several issues with voicemail call quality. This turned out to be related to a malfunctioning Sonus gateway on an upstream provider's network that was incorrectly setting up the timing for RTP packets when interacting with our voicemail system. We applied a temporary fix on our end by proxying the audio and correcting it. On a side note, the upstream carrier fixed that gateway and this issue was fully resolved on Friday.
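For readers curious what "correcting the timing" in a proxy might look like, here is a minimal sketch that rewrites the timestamp field of an RTP packet so it advances by a fixed step per sequence number. It is only an illustration: the post does not say exactly what the gateway got wrong or how the proxy corrected it, and the 160-sample step (8 kHz audio, 20 ms per packet) and the `rewrite_timestamp` helper are assumptions.

```python
import struct

# Fixed 12-byte RTP header: V/P/X/CC byte, M/PT byte, sequence, timestamp, SSRC.
RTP_HEADER = struct.Struct("!BBHII")

# Assumption: 8 kHz audio sent in 20 ms packets, so 160 samples per packet.
SAMPLES_PER_PACKET = 160


def rewrite_timestamp(packet, first_seq, base_ts, step=SAMPLES_PER_PACKET):
    """Return a copy of the packet with a timestamp derived from its sequence number."""
    b0, b1, seq, _old_ts, ssrc = RTP_HEADER.unpack_from(packet)
    offset = (seq - first_seq) & 0xFFFF  # tolerate 16-bit sequence wraparound
    new_ts = (base_ts + offset * step) & 0xFFFFFFFF  # timestamp is 32 bits
    # Everything after the fixed header (CSRC list, payload) is left untouched.
    return RTP_HEADER.pack(b0, b1, seq, new_ts, ssrc) + packet[RTP_HEADER.size:]
```

Deriving the timestamp from the sequence number is only reasonable for a constant-rate stream with no silence suppression; a real proxy would need to handle more cases than this.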

Future

Ironically, our biggest pitfall here was being too aggressive in trying to resolve issues as they came up, applying fixes that had not been tested as thoroughly as they should have been.

The firmware update was applied to address a call quality issue some users reported. The backups that caused some adapters to lose their connection were preventative. The database changes, along with the config error that caused Friday's outage, were put in place very quickly to prevent any additional issues since there had already been so many earlier in the week.

While we generally do a lot of BETA testing for new features, we did not do nearly as much testing with these fixes in an attempt to rapidly resolve issues. We're just not used to customers having problems and want to resolve them immediately when they do come up. Additionally, our sales volume has been approximately 6-7x what it was during our launch and we did not want these issues to last long and give new customers a negative first impression.

In the future, we're going to have to do more testing on fixes even if that means non-critical issues won't get resolved as quickly as we'd like in order to avoid even bigger issues.

We're also going to decentralize some of our database infrastructure to avoid similar domino effects in the future and work on improving customer communication in outage situations.

Once again, on behalf of VOIPo, I apologize for the level of service you received this past week. We appreciate your continued confidence in VOIPo and anticipate a much smoother experience in the future.

If you are experiencing any problems, please contact us so we can get them resolved for you.

**burris** · 02-07-2009 06:51 AM

For me, even with the few problems that luckily didn't seem to affect me very much, VOIPo is the very best VOIP service I have had in the last almost 4 years.

The support response is awesome...imagine getting a phone call on the VOIP line telling me a change was made and a reboot might be necessary. Of course it wasn't necessary, or the call would have never reached me.
Easy to get spoiled with this level of service..

**kevm** · 02-07-2009 09:51 PM

Originally Posted by burris

For me, even with the few problems that luckily didn't seem to affect me very much, VOIPo is the very best VOIP service I have had in the last almost 4 years.

The support response is awesome...imagine getting a phone call on the VOIP line telling me a change was made and a reboot might be necessary. Of course it wasn't necessary, or the call would have never reached me.
Easy to get spoiled with this level of service..

I agree that the level of service on this network had me wondering the first few days, but the speed at which things are resolved and the attention that is being paid to customers during this difficult time will make VOIPo rise above the rest long term. Thanks for being so forthcoming with all of the information, Tim.

**digger16309** · 02-08-2009 11:50 AM

I was adversely affected by the recently chronicled issues, and at a very stressful time for my family.

However, the failover process kept working, which meant all of the incoming calls did get through to another number. This is huge because though many of us take failover for granted, it is not available to everyone with VoIP. For example, my parents, who have Comcast Voice, do not have a failover option.

Despite the added stress of not having a working phone for several days, the level of communication I received was high, certainly miles above my previous provider.

I was not very happy at the time, but the communication here and through the ticketing system lets me put this behind me quickly.

**Xponder1** · 02-08-2009 03:24 PM

Seems I would not have had an issue Saturday if I had rebooted the ATA like they told us to. I did not bother since it came back up and was working. Lesson learned.

**voipinit** · 02-09-2009 07:48 AM

Is re-registering the ATA every minute really the fix? I am a BYOD customer and have my HT-502 set to register every 60 minutes, with no issues after the first fix (the load issue, when I lost registration twice in two days), and I haven't had to reboot since (a couple of weeks now). It just seems to me to be an unnecessary load on the network, unless you're talking about sending a keep-alive packet every minute as opposed to re-registration.
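To put rough numbers on the load concern, here is a small sketch that compares registration rates at the two intervals. The adapter count is made up for illustration, since the thread gives no customer numbers, and `registers_per_second` is a hypothetical helper.

```python
def registers_per_second(adapters, interval_seconds):
    """Average registrations per second if each adapter registers once per interval."""
    return adapters / interval_seconds


if __name__ == "__main__":
    adapters = 10_000  # hypothetical count, not a figure from the thread
    hourly = registers_per_second(adapters, 60 * 60)
    every_minute = registers_per_second(adapters, 60)
    print(f"every 60 min: {hourly:.2f}/s")
    print(f"every 1 min:  {every_minute:.2f}/s")
    print(f"ratio:        {every_minute / hourly:.0f}x")
```

Whatever the real adapter count is, going from a 60-minute to a 1-minute interval multiplies the average registration rate by 60. If each registration also involves an authentication challenge, the message count would be higher still.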

**polen_sj** · 02-09-2009 09:30 AM

I would agree. I'm probably missing something here, but it seems like increasing how *frequently* a re-registration is initiated would generate additional network traffic and server load. It would seem to me that the relevant parameter is how *long* the device waits for re-registration to be acknowledged. The HT-502 has been set to 30 seconds of wait time (default), and I wonder if that was not enough time when the server load was heavy last week due to the hourly backups. Just a thought.

**Xponder1** · 02-09-2009 10:46 AM

I think they are waiting on Grandstream to fix the 502. This is a band-aid and they are aware of the load it creates. They determined their network and servers could handle it for the short term. It will not stay that way.