Devices Losing Registration



VOIPoTim
01-30-2009, 07:02 PM
This should be resolved now. We'd like to apologize for any inconvenience it caused and provide some insight into what happened.

Problem

As you may know, we rolled out new Grandstream firmware recently. In preparation for the firmware rollout, we made several systemwide changes to make our system compatible with the new firmware.

After successfully testing all of those changes and the new firmware with positive results, we proceeded to roll out the firmware to all customers.

In the days following the firmware rollout, we started seeing isolated reports of customers' devices "unregistering," which sent calls to voicemail or a failover number and prevented outgoing calls from being made. When customers rebooted the device, it would reconnect and all would be well again.

Finding the Cause

We hadn't seen this before. Our system had been working amazingly well for months, and the only prior reports of anything like this were very isolated cases caused by misconfigured home routers or similar situations.

We immediately assumed these issues were related to the new firmware and worked with affected customers to come to what we thought was a resolution. Over the next few days, this pattern of registrations dropping continued for some customers, but it was still somewhat isolated, affecting what we estimate to be about 10% of our customer base.

When we started seeing reports of the same issue from customers that for various reasons did not have firmware updated, we then began going through all system changes that had been made in preparation for the firmware update and reverting the ones we could in case they caused the issue.

In our logs, we saw the registrations dropping, but nothing really to explain why. At this point, we also were looking at our "clone line" implementation possibly canceling the other registration out given some of the patterns we'd seen of one line remaining registered with the other not. Still, logs did not conclusively indicate this. At that point, we re-structured the way the cloned line is handled. This seemed to significantly reduce the issue, but we still had some reports of it happening.

Resolution

Overall, none of those fixes seemed to work. Then a user finally pointed out that vPanel was running slow at the same time his registration dropped. This prompted me to work with our systems admins more and less with developers to see what kind of other issues could be there.

As it turns out, one of our systems administrators had restructured our backups and was doing hourly backups on all database servers so we would have those in addition to the replication in case something ever became corrupt. This was done to resolve a previous issue with backups and load.

While these new backups were running, they introduced database latency, which slowed responses for registrations and for vPanel loading in general.

Apparently the Grandstream adapters that re-registered (normally done every hour) during this period of a few minutes were affected. The ATA would attempt to register and establish a successful connection, but the latency in the database access would then prevent the registration request from fully completing. To further complicate things, the ATA would send an unregister request along with its registration request. The symptoms varied: in some cases the logs showed nothing, in some the registration was just delayed, in some only 1 line registered, and in some the ATA gave out a 500 Error message as well. We've now learned that GS ATAs are not as tolerant of abnormal system conditions as some other devices, and sometimes they didn't even attempt to re-register because they were so confused.
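
As a rough illustration of why an unregister sent alongside a register is risky when the database is slow, here is a toy model. It is not VOIPo's registrar code. The class, the timeout value and the idea that removing a binding is cheap while adding one needs a timely database write are all assumptions made for the sketch. The one standard fact it leans on is that in SIP (RFC 3261) a REGISTER with an expiry of 0 removes a binding.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRegistrar:
    """Hypothetical registrar: removals are cheap, additions need a DB write."""
    db_latency_s: float              # assumed time for one database write
    timeout_s: float = 2.0           # assumed limit before the server gives up
    bindings: dict = field(default_factory=dict)

    def register(self, user: str, contact: str, expires: int) -> int:
        if expires == 0:
            # Expires of 0 means "remove my binding" (an unregister).
            self.bindings.pop(user, None)
            return 200
        if self.db_latency_s > self.timeout_s:
            # The write did not finish in time; report a server error.
            return 500
        self.bindings[user] = contact
        return 200


def reregister(registrar: ToyRegistrar, user: str, contact: str):
    """Send an unregister followed by a register, as the ATA appeared to do."""
    statuses = [
        registrar.register(user, contact, expires=0),
        registrar.register(user, contact, expires=3600),
    ]
    return statuses, user in registrar.bindings


healthy = ToyRegistrar(db_latency_s=0.1, bindings={"line1": "old-contact"})
slow = ToyRegistrar(db_latency_s=5.0, bindings={"line1": "old-contact"})

print(reregister(healthy, "line1", "sip:line1@192.0.2.10"))
print(reregister(slow, "line1", "sip:line1@192.0.2.10"))
```

In this model the healthy registrar ends with the line still registered, while the slow one accepts the unregister, fails the register with a 500, and leaves the line with no binding until the next successful attempt or a reboot.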

The fact that some users didn't register during that period would explain why they did not experience this issue.

In the midst of all the troubleshooting, we also shortened the re-registration interval to 5 minutes so that everyone was re-registering every 5 minutes, in case the issue was keep-alives being sent too infrequently. This put even more users into the backup window of a few minutes, since EVERYONE was now re-connecting every 5 minutes. That's why it gradually picked up, and why even some of our long-time users who had never seen this in 2 years suddenly saw it.
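
The arithmetic behind this can be sketched with a simple model. It assumes each device re-registers on a fixed interval with a random offset, and it assumes a 6-minute backup window. The post only says "a few minutes"; 6 is a hypothetical value chosen because it lines up with the roughly 10% estimate above. The function name is made up for the example.

```python
def fraction_hit(reregister_every_min: float, backup_window_min: float) -> float:
    """Share of devices with at least one re-registration inside one backup window.

    Assumes a fixed re-registration interval with a uniformly random offset,
    so a window of length w catches at least one attempt with probability
    min(1, w / interval).
    """
    if reregister_every_min <= 0 or backup_window_min < 0:
        raise ValueError("interval must be positive and window non-negative")
    return min(1.0, backup_window_min / reregister_every_min)


ASSUMED_WINDOW_MIN = 6.0  # hypothetical; the post only says "a few minutes"

hourly = fraction_hit(60, ASSUMED_WINDOW_MIN)    # original hourly re-registration
five_min = fraction_hit(5, ASSUMED_WINDOW_MIN)   # after the change to 5 minutes

print(f"hourly re-registration: {hourly:.0%} of devices per backup")
print(f"5-minute re-registration: {five_min:.0%} of devices per backup")
```

Under these assumptions, hourly re-registration exposes about 10% of devices to each backup, and because the backups were also hourly it would be the same devices every time. With a 5-minute interval, every device makes at least one attempt inside the window.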

Service Credit

While this did not affect all customers, it should not have happened and was a big issue for those of you that it did affect.

It's very hard to determine which specific customers were impacted, but we know that more were than we would have liked in any event.

We're going to issue a service credit in the form of a renewal extension of 14 days for all current residential accounts that are active as of today. This will be applied at some point within the next week.

Current Issues

We're pretty confident that this particular issue is resolved. Our support volume dropped by 90% immediately after we changed the backups in question.

If you're experiencing any issues along these lines now, it is likely another issue entirely. If that's the case, please contact support so we can work with you to address it.

Future

This new backup process wasn't communicated well internally, so it was pretty much overlooked by our developers who were more focused on the recent slew of changes they'd made when looking for the cause.

We're working to bridge the gap to keep the two teams more in sync with each other. We've also started including all changes by sys admins in our internal changelog for development. This way, it's easier to identify these issues and correct them more quickly, based on the more comprehensive picture.

We've also restructured this so the hourly backups are only being done on the secondary replicated database servers and not every server in every node to avoid this in the future.

No service is perfect and this demonstrates that we can have issues from time to time as well. With that being said, it's always our goal to resolve those issues quickly.

Once again, we apologize for any inconvenience this caused and want to thank you for your continued confidence in VOIPo.

burris
01-30-2009, 07:12 PM
Tim..

I take my hat off to you..

You and your company are a class act!!

dcshobby
01-30-2009, 07:57 PM
Thank you for that great explanation of what was happening with our service. It's funny how something so minute like that can cause so many issues for so many people. It's good that you're going to increase communications between your internal people to avoid issues like this again. Thanks and keep up the great work!

scott2020
01-30-2009, 09:03 PM
Fantastic work! We all had faith that you and your team would figure it out quickly. That is what separates VOIPo from the rest of the pack. You show that you care about your customers and put together a great combination of technical infrastructure and brains behind the operation. Thank all of your team for all of us!

Brian
01-30-2009, 09:40 PM
Not only is the problem resolution incredible, but the fact that you would type up such a detailed post freely admitting specifics about the company's internal workings is unheard of today - thanks for keeping things so down-to-earth Tim!

fisamo
01-30-2009, 10:15 PM
This detailed account of the issue, and the service credit, are two reasons I am a VOIPo supporter. Note that supporter does not equal 'fanboy'... :cool: I won't recommend VOIPo to everyone (as you might notice on BBR), but it's a solid service for anyone who's considering VoIP at all.

In all honesty, I would not be surprised to see issues come up like this again over the next few months as your systems and processes (not just hardware/software, but policies, etc) are put into practice and you learn what works and what doesn't. If you continue to tackle problems as you did this one, you have a bright future ahead.

kevm
01-31-2009, 03:40 PM
fisamo. How are you. You were so helpful last year when I signed up for ATT CV. Now I am over here with a virtual number getting ready to port ATTCV over. I am glad to see you here. You were a wealth of informative information at the other site. While I am very computer literate in general, VOIP is not a strong suit at this point. I'm working on it.
I must say that while I have had my share of issues setting this up, Tim and the team have been more than top notch as far as response and suggestions, taking care of issues as quickly as possible. They are always quick to respond and very pleasant to deal with. I have not sent the porting documents to them yet, but I have the virtual number forwarded to the number I will be porting and that seems to be working out well.
Some issues with simul ring and call forwarding that Tim and I have been back and forth on today. I am sure it will be resolved.
Anyway fisamo. Good to see you here. Look forward to talking to you on this forum.

Kevin

beatbox32
01-31-2009, 05:50 PM
Thanks for the in-depth explanation! I'm in IT, so I understand when changes cause issues and the need to keep good change control records. It sounds like a good learning experience and I look forward to continued solid service in the future. Thanks for the excellent support and keep up the good work.

Vumes
02-01-2009, 01:17 AM
Yes, thank you Tim for the explanation, and caring so much about your customers. With mass growth, problems are bound to come up, but you and your team are right on top of things. Thanks again Tim for all your hard work and dedication to your customers.

kevm
02-01-2009, 11:13 AM
I second this response..

Kev

dvijen
02-02-2009, 11:27 AM
Very informative and open communication. Thanks.
FYI, I lost incoming calls at 11:10 AM this morning. Upon checking vPanel, it appeared that registration occurred right at that time; however, only one line got registered. Upon rebooting the adapter, it registered both lines, and incoming calls started getting through. Maybe it happened because I called right at the time when it was registering?
Thought you would like to know
Regards
dvijen

TomP
02-02-2009, 11:45 AM
I also appreciate the detailed communication. Thanks Tim.

On the down side, I have not been able to receive incoming calls since Friday. Perhaps there's a lingering problem that's still affecting a few users?

VOIPoTim
02-02-2009, 11:55 AM
If that's the case, please contact us for assistance. Definitely should not be down.

TomP
02-02-2009, 12:15 PM
I did. Just waiting for a reply. Ticket #YBL-824360. Thx.

NY Tel Guy
02-02-2009, 08:23 PM
I just noticed that in my GS 502 FXS Port 1 shows:

Primary SIP Server: sip.voipowelcome.com

Outbound Proxy: (the field is blank)

FXS Port 2 shows:
Primary SIP Server: sip.voipowelcome.cpm
Outbound Proxy: sip.voipowelcome.com

How come the field for the proxy server is blank for line 1 ? yet filled in for line 2?
This is a device that is fully provisioned by Voipo FYI.

VOIPoTim
02-02-2009, 09:33 PM
We actually stopped using the outbound proxy in most cases. It's not needed. All that's really for is to send calls out to an alternate proxy besides the one the user registered to.



Vumes
02-03-2009, 05:33 PM
I seem to have lost registration again. I called home about 10 minutes ago, and it didn't ring there according to the wife. I then checked my vPanel and it did not show anything registered. I went into my computer at home and checked and the GS did in fact show it was registered. I rebooted the device anyway and vPanel shows registered again. Made a call home, and all is well again. I haven't had this problem in a while until today.

KayakinMike
02-03-2009, 06:03 PM
I've got this same problem today. I submitted a ticket to support.
Mike

KayakinMike
02-03-2009, 06:25 PM
And I'm already fixed, thanks to the spectacular VoIPo support!
Mike

NY Tel Guy
02-03-2009, 08:22 PM
I seem to have lost registration again. I called home about 10 minutes ago, and it didn't ring there according to the wife. I then checked my vPanel and it did not show anything registered. I went into my computer at home and checked and the GS did in fact show it was registered. I rebooted the device anyway and vPanel shows registered again. Made a call home, and all is well again. I haven't had this problem in a while until today.
You know, could VOIPo look at my adapter, or that of someone else who is NOT having reg problems, and compare its settings to those of the people who are having problems? (Just a suggestion, since through all of this mine has worked 100% of the time.)

Just a thought.....

HelpNYC
02-03-2009, 08:24 PM
Wow!! Great work... I just noticed the problem today. My mother wanted to call my sister in Florida. Well! No go... after I did a reboot, the ATA kicked in.... she's on the phone as we speak :)

VOIPoTim
02-03-2009, 08:47 PM
This issue unfortunately came back...although now with significantly fewer people being affected. Right now we have about a dozen tickets about it compared to hundreds last time.

We have some people testing a workaround to stabilize things as we work with Grandstream on this. If you'd like to help with that, here is info: http://forums.voipo.com/showthread.php?p=8960#post8960

We appreciate everyone's patience. I'm confident this will be resolved shortly and service levels can get back to normal.

digger16309
02-04-2009, 09:02 AM
Oh great. This was fixed for me last night on reboot, now it is down again - no registration means failover. And that means I cannot get in touch with my wife, just home from the hospital, and because she does not have her cell phone on I cannot call and check on her. Nice.

I don't know if this helps you all or not, but when my device is registering, it is doing so with the IP address of my router instead of the IP address I have assigned to it. It used to register with the assigned IP, but not anymore.

VOIPoTim
02-04-2009, 09:36 AM
At this point, we have it isolated and are continuing to work with Grandstream on it.

We applied a fix for our BETA testers last night. Once we confirm with them that it resolved the problem for those who were affected, without causing any other issues, we'll roll it out to everyone (today).

Once that's done and all the facts are on the table, I'll update everyone fully.

Xponder1
02-04-2009, 11:19 AM
Oh great. This was fixed for me last night on reboot, now it is down again - no registration means failover. And that means I cannot get in touch with my wife, just home from the hospital, and because she does not have her cell phone on I cannot call and check on her. Nice.

I don't know if this helps you all or not, but when my device is registering, it is doing so with the IP address of my router instead of the IP address I have assigned to it. It used to register with the assigned IP, but not anymore.

You will have to check and make sure your settings didn't get knocked out. Personally I leave mine set to DHCP, but I removed the Linksys firmware from my WRT54GS V6 and installed DD-WRT. Using it I can specify this MAC address = this IP so it always comes up correct. Perhaps the current firmware on your router has a feature like this, or maybe you can replace it (the firmware) with something better.

digger16309
02-04-2009, 01:53 PM
DHCP on my router was causing conflict issues because the router was assigning the same IP address to multiple devices. So I manually configured all devices instead.

VOIPoTim
02-04-2009, 02:04 PM
We've implemented a fix and have contacted all customers who reported an issue about it.

If you've seen this issue, reboot your adapter one more time and then it will very likely clear up.

So far, this has had a 100% success rate, but we'd like to see more customers that were experiencing it confirming that it fixes things for them.

If after a reboot, you continue to see this issue, please contact us.

KenH
02-04-2009, 04:23 PM
Everyone but me. I have been monitoring this and the DSLreport forums and am aware that something was done, but I was never actually notified.

Since I have temporarily disabled provisioning, can you tell me if it's just a change to the ATA config or something that actually requires me to reboot and receive an update?

Thank you.

Ken

VOIPoTim
02-04-2009, 04:31 PM
It's a change to the ATA config.