This should be resolved now. We'd like to apologize for any inconvenience it caused and provide some insight into what happened.

Problem

As you may know, we rolled out new Grandstream firmware recently. In preparation for the firmware rollout, we made several systemwide changes to make our system compatible with the new firmware.

After testing all of those changes and the new firmware with positive results, we proceeded to roll out the firmware to all customers.

In the days following the firmware rollout, we started seeing isolated issues where some customers' devices would "unregister," which caused incoming calls to go to voicemail or a failover number and prevented outgoing calls from being made. When customers rebooted the device, it would reconnect and all would be well again.

Finding the Cause

We hadn't had this happen before, and our system had been working well for months prior, with no reports of anything like it outside of very isolated cases caused by misconfigurations on home routers or similar situations.

We immediately assumed these issues were related to the new firmware and worked with affected customers on what we thought was a resolution. Over the next few days, this pattern of registrations dropping continued for some customers, but it remained somewhat isolated, affecting what we estimate to be about 10% of our customer base.

When we started seeing reports of the same issue from customers who, for various reasons, had not had their firmware updated, we began going through all of the system changes made in preparation for the firmware update and reverting the ones we could, in case one of them had caused the issue.

In our logs, we could see the registrations dropping, but nothing that really explained why. At this point, we were also looking at whether our "clone line" implementation might be canceling out the other registration, given the pattern we'd seen of one line remaining registered while the other did not. The logs did not conclusively indicate this either, but we restructured the way the cloned line is handled anyway. That seemed to significantly reduce the issue, but we still had some reports of it happening.

Resolution

Overall, none of those fixes fully resolved the issue. Then a user finally pointed out that vPanel was running slowly at the same time his registration dropped. That prompted me to work more with our systems administrators, and less with our developers, to see what other kinds of issues might be at play.

As it turns out, one of our systems administrators had restructured our backups and was running hourly backups on all database servers so that we would have those in addition to replication in case something ever became corrupted. This was done to resolve a previous issue with backups and load.

When these new backups ran, they introduced database latency, which caused slower responses for registrations and for vPanel loading in general.

Grandstream adapters that happened to re-register (normally done every hour) during this few-minute window were the ones affected. The ATA would attempt to register and establish a connection, but the database latency would prevent it from fully completing the registration request. To further complicate things, the ATA sends an unregister request along with its registration request. In some cases the logs showed nothing, in some the registration was simply delayed, in some only one line registered, and in some the ATA received a 500 error response. We've now learned that Grandstream ATAs are not as tolerant of abnormal system conditions as some other devices, and at times they were confused enough that they didn't even attempt to re-register.

The fact that some devices didn't happen to re-register during that window explains why those customers never experienced the issue.
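
To give a rough picture of that failure mode (this is just an illustrative sketch, not our actual registration code, and the timeout value is an assumption), the effect of the latency looks something like this:

    import time

    SIP_TIMEOUT = 4.0  # assumed transaction timeout on the ATA side, in seconds

    def lookup_account(db_latency):
        # Stand-in for the registrar's database lookup; during the hourly
        # backup this call could take seconds instead of milliseconds.
        time.sleep(db_latency)
        return {"user": "1001", "expires": 3600}

    def handle_register(db_latency):
        # If the lookup outlasts the device's timeout, the device has already
        # given up (or been handed a 500) before we can confirm anything,
        # so from its point of view the registration never completed.
        start = time.time()
        account = lookup_account(db_latency)
        if time.time() - start > SIP_TIMEOUT:
            return "registration never completed"
        return "200 OK, expires=%d" % account["expires"]

    print(handle_register(db_latency=0.05))  # normal hour: registers fine
    print(handle_register(db_latency=6.0))   # during the backup window: fails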

In the midst of all the troubleshooting, we also shortened the re-registration interval to every 5 minutes, in case the issue was simply that keep-alives were being sent too infrequently. This put even more users into the few-minute backup window, since everyone was now reconnecting every 5 minutes. That's why the problem gradually picked up, and why even some long-time users who had never seen it in two years suddenly saw it.
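
Some back-of-the-envelope numbers show why (the three-minute window length here is an assumed figure, not a measured one):

    # Rough illustration of how the shorter interval increased exposure.
    backup_window = 3.0   # assumed minutes of elevated database latency each hour

    for interval in (60.0, 5.0):   # re-registration every 60 min vs. every 5 min
        chance = min(backup_window / interval, 1.0)
        print("every %2.0f min -> %2.0f%% of devices hit the window each hour"
              % (interval, chance * 100))
    # every 60 min ->  5% of devices hit the window each hour
    # every  5 min -> 60% of devices hit the window each hour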

Service Credit

While this did not affect all customers, it should not have happened and was a big issue for those of you that it did affect.

It's very hard to determine exactly which customers were impacted, but we know it was more than we would have liked in any event.

We're going to issue a service credit in the form of a renewal extension of 14 days for all current residential accounts that are active as of today. This will be applied at some point within the next week.

Current Issues

We're pretty confident that this particular issue is resolved. Our support volume dropped by 90% immediately after we changed the backups in question.

If you're experiencing any issues along these lines now, it is likely a different issue entirely. If that's the case, please contact support so we can work with you to address it.

Future

This new backup process wasn't communicated well internally, so our developers, who were focused on the recent slew of changes they'd made themselves, largely overlooked it while looking for the cause.

We're working to bridge that gap and keep the two teams more in sync with each other. We've also started including all changes made by systems administrators in our internal development changelog. With that more comprehensive picture, it will be easier to identify issues like this and correct them more quickly.

We've also restructured the backups so that the hourly backups are only run on the secondary (replicated) database servers, not on every server in every node, to avoid this in the future.
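
As a purely illustrative sketch (the database engine, hostname, and paths here are assumptions, not our actual configuration), the idea is that the hourly dump now reads from a replica rather than from the primary server answering registrations:

    import datetime
    import subprocess

    # Illustrative only: assumes a MySQL-style setup and a replica named
    # "db-replica-1". The point is simply that the hourly dump reads from
    # the replicated copy, so any latency it creates never touches the
    # primary server handling registration requests.
    outfile = "/backups/hourly-%s.sql" % datetime.datetime.now().strftime("%Y%m%d%H")
    with open(outfile, "wb") as out:
        subprocess.run(
            ["mysqldump",
             "--single-transaction",   # consistent snapshot without locking tables
             "--host", "db-replica-1", # hypothetical replica, not the primary
             "--all-databases"],
            stdout=out, check=True)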

No service is perfect, and this demonstrates that we can have issues from time to time as well. That said, it's always our goal to resolve those issues quickly.

Once again, we apologize for any inconvenience this caused and want to thank you for your continued confidence in VOIPo.