|
Hello,
We have redundancy through several providers.
The problem today was caused by XO Communications,
which also caused problems for some of our other providers.
We were not the only ones affected.
Here is the root cause report covering a portion of the issue:
Hello.
As many of you know, the Suavemente network has suffered over the past few days from a number of problems in our routing system. I will now explain what these problems were and how they were resolved.
We have had three routing issues:
1) Intermittent missing connections: Some of you experienced networks that would stop responding intermittently and then come back as suddenly as they left. This problem was due to a TCAM issue. Our routing tables overflowed because of the rapidly increasing number of routes on the Internet. Since we take full routes from our providers to enhance reachability for our customers, the full BGP table grew until it consumed all available routing memory, which is shared with the internal routing tables and the ARP table. As a result, when the ARP entry for a server expired normally, the router could not install a new entry in the ARP table because no memory was left, leaving that server unreachable.
Solution: After we detected this problem, we implemented special filtering that greatly reduced the size of our BGP routing table. We do not expect any further problems from this issue in the near or distant future.
2) Downtime due to a faulty link: We experienced a blackout on some of our routers due to a faulty link. This problem was caused entirely by an XO Communications technician who, without our knowledge, forced the link to a configuration that did not match our own. The link appeared as UP in our systems on both ends, but no traffic could pass through it because of the misconfiguration. We implemented emergency links to support as many of our affected customers as possible until the issue was resolved.
Solution: XO Communications finally located the wrong configuration and corrected it.
3) Routing loops on some of our IP addresses: Some customers experienced routing loops on some IP addresses. This was due to a setting on an internal routing protocol which was disabled at the time we fixed the TCAM issue. We diagnosed the problem and corrected the setting within our systems. No more routing loops will be created by this issue.
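To make issue 1 easier to picture, here is a toy Python model of a fixed-size memory pool shared by BGP routes and ARP entries. It is only a sketch under stated assumptions: the class `SharedTable`, its capacity of 10 slots, the example prefixes, and the filter rule are all hypothetical, and it does not represent the actual router hardware or the filtering that was deployed.

```python
class SharedTable:
    """Toy model: one fixed-size pool shared by BGP routes and ARP entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.bgp_routes = set()
        self.arp_entries = {}

    def used(self):
        return len(self.bgp_routes) + len(self.arp_entries)

    def add_route(self, prefix):
        # A new route is only installed if a slot is free.
        if prefix in self.bgp_routes:
            return True
        if self.used() >= self.capacity:
            return False
        self.bgp_routes.add(prefix)
        return True

    def learn_arp(self, ip, mac):
        # A new ARP entry competes for the same slots as the routes.
        if ip not in self.arp_entries and self.used() >= self.capacity:
            return False
        self.arp_entries[ip] = mac
        return True

    def expire_arp(self, ip):
        self.arp_entries.pop(ip, None)

    def filter_routes(self, keep):
        # Hypothetical stand-in for route filtering: keep only matching prefixes.
        self.bgp_routes = {p for p in self.bgp_routes if keep(p)}


def demo():
    table = SharedTable(capacity=10)
    table.learn_arp("10.0.0.5", "aa:bb:cc:00:00:05")

    # The BGP table grows until the pool is full.
    for i in range(20):
        table.add_route(f"198.51.{i}.0/24")

    # The server's ARP entry expires normally, and a route takes the freed slot.
    table.expire_arp("10.0.0.5")
    table.add_route("203.0.113.0/24")

    # Re-learning the ARP entry now fails: no memory is left.
    relearn_before = table.learn_arp("10.0.0.5", "aa:bb:cc:00:00:05")

    # After filtering shrinks the route table, the ARP entry fits again.
    table.filter_routes(lambda p: int(p.split(".")[2]) < 4)
    relearn_after = table.learn_arp("10.0.0.5", "aa:bb:cc:00:00:05")
    return relearn_before, relearn_after
```

Calling `demo()` in this model shows the ARP entry being refused while the pool is full and accepted once the route table has been cut down.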
We currently have no reports of any other major or minor routing issues within our systems. We are working around the clock to enhance our network monitoring in order to provide you with the quality of service you expect from Suavemente Inc.
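As an illustration of the kind of check that would catch issue 2, where a link reports UP but passes no traffic, here is a minimal Python sketch that combines the reported link state with the result of test probes sent across the link. The function name `classify_link`, its parameters, the status labels, and the 50% threshold are all assumptions made for this example; it does not describe the monitoring actually in use.

```python
def classify_link(oper_up, probes_sent, probes_received, min_success=0.5):
    """Classify a link from its reported state plus data-plane probe results.

    oper_up:          True if the link is reported as UP.
    probes_sent:      number of test packets sent across the link.
    probes_received:  number of those packets that were answered.
    min_success:      hypothetical threshold for an acceptable reply ratio.
    """
    if not oper_up:
        return "down"
    if probes_sent == 0:
        # No probe data: the UP state alone proves nothing about traffic.
        return "unknown"
    if probes_received / probes_sent < min_success:
        # Reported UP, but traffic is not getting through.
        return "up-but-not-passing-traffic"
    return "healthy"
```

For example, `classify_link(True, 10, 0)` returns `"up-but-not-passing-traffic"`, the situation described in issue 2, while `classify_link(True, 10, 10)` returns `"healthy"`.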
Thank you.
Roy Diaz
IP Routing Tech Support
Suavemente Inc.
SplitInfinity Networks
|