Gemhdar (Confirmed User)
11-14-2005, 08:39 PM
We wanted to take this opportunity to update you and fully explain the events that occurred on our network this past weekend.

Part of operating a large-scale, growing, living network is constant upgrades. Upgrades on the distribution and access layers of our network are quite common and generally uneventful. Upgrades to our core infrastructure are a major task and occur every 12-18 months. Our core is a fully meshed 60Gbps (6x10Gbps) backbone that directs our peering traffic out to two local carrier hotels. Before an upgrade is made to our core, the following steps are taken:


1.) 1 month out: General design implementation and conference between our IP Engineering department and the manufacturer's hardware and software experts. During this meeting, so-called "roll back" procedures are developed and intermediate versions are established in the event of an emergency during the upgrade. Additionally, all major changes to our network are officially frozen until after the upgrade.

2.) 20 days out: Test implementation using a standby device. For this task, we implement the suggested hardware and/or software changes on a standby switch router and mirror our traffic onto this device. Inbound/outbound route and traffic continuity is tested at this point (a minimal sketch of such a check follows this list).

3.) 10 days out: Second conference with the manufacturer to resolve any potential version mismatches and to confirm the manufacturer's commitment to the upgrade as recommended.

4.) 1 day out: Second "mirror" test to ensure no conflicts exist in the new switch router software.
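In practice, the step-2 continuity check comes down to comparing what the production core and the mirrored standby each believe about the network. Below is a minimal sketch of such a comparison in Python; the prefixes, interface names, and plain-dict representation are invented for illustration, and in reality the tables would be polled from the devices over SNMP or scraped from the CLI.

    # A minimal sketch (illustrative only) of a mirror continuity check:
    # route tables captured from the production core and from the standby
    # switch router running the candidate OS are compared prefix by prefix.

    prod_routes   = {"10.0.0.0/8": "ge-1/1", "192.0.2.0/24": "ge-1/2"}
    mirror_routes = {"10.0.0.0/8": "ge-1/1", "192.0.2.0/24": "ge-2/1"}

    def compare_routes(prod, mirror):
        """Yield human-readable discrepancies between the two route tables."""
        for prefix, next_hop in prod.items():
            if prefix not in mirror:
                yield f"missing on mirror: {prefix}"
            elif mirror[prefix] != next_hop:
                yield f"next-hop drift on {prefix}: {next_hop} -> {mirror[prefix]}"
        for prefix in mirror.keys() - prod.keys():
            yield f"unexpected on mirror: {prefix}"

    for issue in compare_routes(prod_routes, mirror_routes):
        print(issue)   # e.g. "next-hop drift on 192.0.2.0/24: ge-1/2 -> ge-2/1"

Any discrepancy would be investigated before declaring the candidate software safe to deploy.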

For this upgrade, all of the above tasks were completed. During step #2, Foundry supplied us with a tweaked version of the operating system for both the switch chassis and the individual 10Gbps cards that handle our long haul transport. On the same CD, they supplied us with an archival copy of the OS currently running on the switches. This was to act as our emergency "roll back" procedure. The version of software currently running on these NetIron 40Gs was implemented by Foundry and required a live patch to operate properly.
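The firmware surprise described later in this post (item 16) is exactly the kind of mismatch that verifying rollback media up front is meant to catch. Below is a minimal sketch, with hypothetical file names and a placeholder digest, of checking a vendor-supplied rollback image against a published checksum before the window opens; it is an illustration, not our actual procedure.

    # Verify a rollback image on CD against a vendor-published checksum.
    # The path and digest below are placeholders, not real artifacts.

    import hashlib

    def sha256_of(path):
        """Stream the image file through SHA-256 and return the hex digest."""
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    image_path    = "cd/netiron40g-rollback.bin"   # hypothetical
    published_sha = "0123abcd..."                  # placeholder vendor digest

    if sha256_of(image_path) != published_sha:
        raise SystemExit("rollback image does not match the expected build")
    print("rollback image verified")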

The events of the upgrade are as follows:

1.) The upgrade began. All traffic was failed over to our secondary core infrastructure while the primary devices were removed from production and upgraded.

2.) The upgrade to the primary core infrastructure was completed.

3.) Traffic was failed over 100% to the primary core infrastructure. QoS samples were taken via IronView and SolarWinds. Latency, packet loss, and SNMP thresholds were all deemed acceptable.

4.) Secondary core was upgraded.

5.) Traffic was moved 100% to the secondary core infrastructure. QoS samples were taken via IronView and SolarWinds. Latency, packet loss, and SNMP thresholds were all deemed acceptable.

6.) The above simulated failures on the network (e.g., 100% of traffic on either the primary or the secondary infrastructure). The next test was to release the network to normal operation, so traffic was balanced across the two core devices per our normal operating preferences, essentially 50/50. QoS samples were taken via IronView and SolarWinds. Latency, packet loss, and SNMP thresholds were all deemed acceptable (a sketch of this threshold check appears after this timeline). At this point, the core upgrade was considered complete. A few more non-service-impacting but high-risk housekeeping items were completed, specifically the installation of and migration to new fiber transport within our facility to accommodate the opening of our new DC3 in Tampa.

7.) QoS measurements were once again taken. At this time our technicians noticed moderate packet loss and what was deemed at the time to be a slow memory leak.

8.) Foundry was also monitoring the boxes and had begun working the issue. They initially attributed it to an "affiliation issue" in the OSPF downstream and HSRP lateral relationship between the core clusters. Because of this, all further maintenance within the window was suspended.

9.) The perceived memory leak worsened dramatically over the next hour. Foundry requested that we connect a standby, non-production switch via a mirror, fail one core out of production, restart it, and see whether this temporarily stopped the leak.

10.) The reboot was performed, but the situation almost instantly became just as severe. Processor load on the outbound 10Gbps interface line cards was pegged at 99% on the rebooted core, and 90% on the unrebooted core.

11.) Processor load intensified to 99% on all outbound 10Gbps interfaces, >90% on the internal interfaces, and 99% on the switch management modules.

12.) Foundry was given 15 minutes to propose a course of action; since none was forthcoming, the decision was made internally to revert to the old code distributed on the pre-upgrade CD.

13.) After steps 1-10 above were repeated in reverse, traffic stabilized for less than 10 minutes. Amazingly, the high CPU load continued on the previously operational code.

14.) At approximately 11am EST Sunday, at the urging of the manufacturer, the routers were reconfigured and reinstalled using the old configs and old code; however, the problem continued to manifest itself after a few minutes of operation.

15.) Sago management requested that the "old patched" software be loaded on the routers.

16.) Steps 1-10 above were commenced to load the "old patched" OS on the routers. The first router to be upgraded failed, and we found the ASICs refused the old code completely. Foundry investigated this issue and determined that the software upgrade implemented earlier apparently included a firmware upgrade that WAS NOT mentioned by Foundry during the pre-upgrade meetings.

17.) Foundry committed to producing code that would recreate the older firmware on the 10Gbps cards, thereby allowing a reload of the "old patched" software instead of the "old" software provided by Foundry on CD.

18.) While waiting for the production of this software, a Sago engineer noticed that no memory was allocated to IPv4 routes. Further, no CAM page files were devoted to storing routes or buffering for IPv4 (a sanity check for this condition is sketched after this timeline).

19.) Foundry investigated this finding and determined that only IPv6 allocations were being made. It was acknowledged that this was a typographical error made during the custom production of our software, and that it existed in both the "old" and "new" software versions provided on CD.

20.) A conference call was held, and Foundry determined that, given currently available resources, fixing the memory allocation issues was more feasible in a short timeframe. We elected to take their advice. The first software revisions became available at approximately 9PM EST Sunday and were implemented within the hour by following steps 1-10 above.

21.) 3 more updates/fixes became available throughout the night. The last was implemented around 6AM EST Monday.

22.) Each of these gradually reduced the persistent packet loss issues.

23.) The final packet loss issue involved an improperly computed equation causing an irreversible imbalance on only two single-Gbps outbound connections. This caused contained packet loss for outbound routes preferring those connections. Essentially, the switches were computing that those pipes were larger than 1Gbps and were trying to force more traffic out of them than was physically possible (a worked example of this miscalculation follows this timeline).

24.) This fix was implemented between 8PM and 9PM this evening.
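For the curious, the acceptance test referenced in items 3, 5, and 6 amounts to comparing sampled QoS metrics against fixed thresholds. The sketch below illustrates the idea; the threshold values and sample numbers are invented, and the real measurements came from IronView and SolarWinds polling the cores over SNMP.

    # Gate a maintenance step on sampled QoS metrics (illustrative values).

    THRESHOLDS = {
        "latency_ms": 2.0,       # max acceptable latency
        "packet_loss_pct": 0.1,  # max acceptable loss
        "cpu_pct": 75.0,         # max acceptable line-card CPU
    }

    samples = {                  # in practice, from the SNMP pollers
        "latency_ms": 0.4,
        "packet_loss_pct": 0.0,
        "cpu_pct": 23.0,
    }

    failures = {k: v for k, v in samples.items() if v > THRESHOLDS[k]}
    if failures:
        print(f"HOLD the window: {failures} exceed thresholds")
    else:
        print("QoS samples within thresholds; proceed to the next step")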

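The allocation problem found in items 18 and 19 is also the sort of thing a mechanical sanity check can catch. Below is a minimal sketch with a made-up profile format; real NetIron CAM profiles are vendor-specific, so the dict merely stands in for whatever the custom build actually encodes.

    # Flag a memory/CAM profile that gives IPv4 nothing (made-up format).

    cam_profile = {
        "ipv6-routes": 245760,   # illustrative count
        "ipv4-routes": 0,        # the typo: no IPv4 route allocation at all
        "ipv4-buffers": 0,       # and no IPv4 buffering
    }

    required_nonzero = ["ipv4-routes", "ipv4-buffers"]

    zeroed = [key for key in required_nonzero if cam_profile.get(key, 0) == 0]
    if zeroed:
        raise SystemExit(f"profile cannot carry IPv4 traffic: {zeroed} are zero")
    print("CAM profile allocates resources to IPv4")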

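Finally, the item-23 failure mode can be demonstrated with a toy model. The numbers below are invented, and this is not Foundry's actual load-balancing computation; it only shows how overstating a link's size concentrates loss on that link even when the network as a whole has headroom.

    # Toy model: proportional load balancing over believed link capacities.
    # Every link here is really 1Gbps, but two are believed to be 2.5Gbps.

    ACTUAL_GBPS = 1.0

    believed = {
        "outbound-1": 1.0,
        "outbound-2": 1.0,
        "outbound-3": 2.5,   # improperly computed capacity
        "outbound-4": 2.5,   # improperly computed capacity
    }

    offered_total = 3.5      # Gbps of demand; below the real 4Gbps aggregate

    weight_sum = sum(believed.values())
    for name, cap in believed.items():
        assigned = offered_total * cap / weight_sum   # proportional share
        dropped = max(0.0, assigned - ACTUAL_GBPS)    # beyond line rate is lost
        print(f"{name}: assigned {assigned:.2f}Gbps, dropped {dropped:.2f}Gbps")

With these numbers, the two honest links run half idle while each overstated link is handed 1.25Gbps and drops 0.25Gbps, even though total demand is below real aggregate capacity: the loss stays contained to routes preferring the miscomputed connections, exactly as described above.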
At this time, all issues are resolved. Any customers still experiencing problems should immediately contact [email protected], as your issue is unrelated to anything above.

We sincerely apologize for the obvious problems this has caused you. Our network's performance during this incident is unacceptable and contrary to the normal way our company operates. A relentless investigation into why this occurred will continue to ensure that our customers are never subjected to such an incident in the future.

(This post was edited and cut because it was too large; it continues below.)