GoFuckYourself.com - Adult Webmaster Forum

GoFuckYourself.com - Adult Webmaster Forum (https://gfy.com/index.php)
-   Fucking Around & Business Discussion (https://gfy.com/forumdisplay.php?f=26)
-   -   Hosted at Sagonet? Whole &%$! datacenter is down (https://gfy.com/showthread.php?t=539894)

Boss Traffic Jim 11-14-2005 08:26 AM

Quote:

Originally Posted by johndoebob
Who hosts there anyway? Their lines are the worst I've seen yet, timeout galore.

If you host at Sagonet you're just asking for problems. The money you save can't be worth the constant downtime you're getting there.

I was going to say something like that as well. :2 cents:

chowda 11-14-2005 11:03 AM

my box is up again

tdp-Cool-content 11-14-2005 11:09 AM

yes there is a connection now, but isn't it slower than before? I have a feeling it takes ages to load my site now. can anyone else see this on their site?

theFeTiShLaDy 11-14-2005 11:17 AM

that sucks if they were down for more than 5 hours.

Gemhdar 11-14-2005 12:17 PM

Here is a brief update of what happened during the scheduled upgrade and what is going on at this point. If anyone has any further questions please feel free to email me or contact me directly via Instant Messenger...

With that being said...

The network upgrade began prior to Saturday night / Sunday morning, with verification from Foundry that the new upgrades had been successfully implemented for our specific routers and environment.

Beginning Saturday night, the dual core routers were individually removed from HSRP (Hot Standby Router Protocol) and one was taken down. Upgrades were completed and tested, the router was brought back online, and traffic was switched to pass over it; completion was successful and traffic flowed normally. The second core router was then brought down, upgraded, tested, and brought back online, and HSRP was finally re-enabled. Traffic was flowing properly over both routers, and other routine maintenance upgrades were started (replacing some cables and such in various portions of the network).
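For readers unfamiliar with this kind of rolling core upgrade, the sketch below shows the sort of reachability and packet-loss check an operator might run after failing traffic back onto an upgraded router, before moving on to the second one. It is only an illustration: the target addresses and the 1% loss threshold are hypothetical, and it relies on the system ping utility rather than any tooling Sago or Foundry actually used.

```python
# Minimal sketch: check reachability and packet loss after failing traffic
# back onto an upgraded core router. Targets and the loss threshold are
# hypothetical; this is not Sago's or Foundry's actual verification tooling.
import re
import subprocess

TARGETS = ["192.0.2.1", "198.51.100.1"]  # hypothetical monitor points
MAX_LOSS_PCT = 1.0                       # assumed acceptable loss threshold

def ping_loss(host: str, count: int = 20) -> float:
    """Return the percent packet loss reported by the system ping utility."""
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
    return float(match.group(1)) if match else 100.0

def verify_after_failover() -> bool:
    """Ping every monitor point and report whether loss stayed within bounds."""
    ok = True
    for host in TARGETS:
        loss = ping_loss(host)
        status = "OK" if loss <= MAX_LOSS_PCT else "FAIL"
        print(f"{host}: {loss:.1f}% loss [{status}]")
        ok = ok and loss <= MAX_LOSS_PCT
    return ok

if __name__ == "__main__":
    if not verify_after_failover():
        print("Loss above threshold - hold off on upgrading the second router.")
```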

Traffic then began showing packet loss due to memory leaks that caused CPU overloads in both routers, and engineers began investigating the problem and identifying a solution. It was confirmed that the software upgrade itself had issues, and attempted rollbacks to the previous software were unsuccessful. Though not mentioned in the upgrade reports, there was an underlying firmware upgrade that took place, which prevented us from rolling the software back to the previous versions; moving forward was the only alternative. Patches were applied to allow traffic to begin flowing through the routers again while we worked on the problem at hand. Significant packet loss was still present, but traffic was flowing. We then worked with Foundry's engineers to have new code installed on the routers, which was done last evening at approximately 9 PM. From that base OS load, patches and tweaks continued throughout the night and still continue as we receive reports of remaining packet loss. Though most customers have reported improving performance, we are still aware of issues and are resolving them one by one to restore previous conditions and allow the planned upgrades to our network to continue.

Though this is frustrating, we are all doing the very best we can to restore service as quickly as possible. Additional plans are underway for continued upgrades and service offerings, and later this week we will announce long-awaited information on network and backbone provider additions. We expect the majority of issues to be resolved later today; a specific ETA is not available, but things will be fixed as quickly as possible. We appreciate your continued patience and understanding at this time. Full technical reports will be provided once our network engineers are available to discuss the matter, as their only priority right now is to resolve any remaining problems and return the network to its previous state.

xdcdave 11-14-2005 12:59 PM

Why did the techs from Foundry, who were supposedly on site, not bring replacement hardware with them? Why did Sago not have replacement hardware on site?

The amount of downtime we experienced yesterday was completely unacceptable, and frankly, I hope Sago sees a major loss of clients from this preventable outage.

Gemhdar 11-14-2005 08:39 PM

We wanted to take this opportunity to update you and fully explain the events that occurred on our network this past weekend.

Part of operating a large-scale, growing, living network is constant upgrades. Upgrades on the distribution and access layers of our network are quite common and generally uneventful. Upgrades to our core infrastructure are a major task and occur on a major level every 12-18 months. Our core is a fully meshed 60Gbps (6x10Gbps) backbone that directs our peering traffic out to two local carrier hotels. Before an upgrade is made to our core, the following steps are taken:


1.) 1 month out - General design implementation and conference between our IP Engineering department and the manufacturer's hardware and software experts. During this meeting, so-called "roll back" procedures are developed and median versions are established in the event of an emergency during the upgrade. Additionally, all major changes to our network are officially frozen until after the upgrade.

2.) 20 days out - Test implementation using a standby device. For this task, we implement the suggested hardware and/or software changes on a standby switch router and mirror our traffic on this device. Inbound/outbound route and traffic continuity is tested at this point.

3.) 10 days out - Second conference with the manufacturer to finalize any potential version mismatches and confirm the manufacturer's commitment to the upgrade as recommended.

4.) 1 day out - Second "mirror" test to ensure no conflicts exist in the new switch router software (a rough sketch of the kind of comparison involved follows this list).
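To make the "mirror" continuity tests mentioned in steps 2 and 4 a bit more concrete, here is a rough sketch of the kind of comparison involved: diffing route snapshots taken from the production device and the mirrored standby and flagging prefixes that differ. The snapshot file names and one-line "prefix next-hop" format are hypothetical; real testing would be done with vendor tooling rather than a script like this.

```python
# Rough sketch of a route-continuity comparison between a production
# switch router and a mirrored standby. Snapshot files are assumed to
# contain one "prefix next-hop" pair per line (a hypothetical format);
# illustrative only, not a vendor or Sago tool.
from pathlib import Path

def load_routes(path: str) -> dict[str, str]:
    """Parse 'prefix next-hop' lines into a {prefix: next_hop} map."""
    routes = {}
    for line in Path(path).read_text().splitlines():
        parts = line.split()
        if len(parts) >= 2:
            routes[parts[0]] = parts[1]
    return routes

def compare(prod_file: str, standby_file: str) -> None:
    """Report prefixes missing, extra, or pointing at a different next hop."""
    prod, standby = load_routes(prod_file), load_routes(standby_file)
    missing = sorted(set(prod) - set(standby))
    extra = sorted(set(standby) - set(prod))
    changed = sorted(p for p in set(prod) & set(standby) if prod[p] != standby[p])
    print(f"prefixes only on production:      {len(missing)}")
    print(f"prefixes only on standby:         {len(extra)}")
    print(f"prefixes with different next-hop: {len(changed)}")

if __name__ == "__main__":
    # Hypothetical snapshot files exported before the test window.
    compare("routes_production.txt", "routes_standby.txt")
```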

For this upgrade, all of the above tasks were completed. During step #2, Foundry supplied us with a tweaked version of the operating system for both the switch chassis and the individual 10Gbps cards that handle our long haul transport. On this same CD, they supplied us with a revised hardcopy of the current OS on the switches. This was to act as our emergency "roll back" procedure. The version of software currently running on these NetIron40Gs was implemented by Foundry and required a live patch to operate properly.

The events of the upgrade are as follows:

1.) Upgrade began. All traffic was failed over to our secondary core infrastructure while the primary devices were removed from production and upgraded.

2.) Upgrade to primary core infrastructure was completed

3.) Traffic was failed over 100% to the primary core infrastructure. QoS samples were taken via IronView and SolarWinds. Minimum latency, packet loss, and SNMP thresholds were deemed acceptable.

4.) Secondary core was upgraded.

5.) Traffic was moved 100% to secondary core infrastructure. QoS samples were taken via IronView and SolarWinds. Minimum latency, packet loss, and SNMP thresholds were deemed acceptable.

6.) The above simulated a failure on the network (e.g., 100% of traffic on either the primary or the secondary infrastructure). The next test performed was to release the network as normally operated; therefore, traffic was balanced based on normal operating preferences across the two core devices, essentially 50/50. QoS samples were taken via IronView and SolarWinds. Minimum latency, packet loss, and SNMP thresholds were deemed acceptable. At this point, the core upgrade was considered completed. A few more non-service-impacting but high-risk housekeeping items were completed, specifically the installation of and migration to some new fiber transport within our facility to accommodate the opening of our new DC3 in Tampa.

7.) QoS measurements were once again taken; at this time our technicians noticed very moderate packet loss and what was deemed at the time to be a slow memory leak.

8.) Foundry was also monitoring the boxes and had begun working the issue. They initially attributed this to an "affiliation issue" via the OSPF downstream and HSRP lateral relationship between the core clusters. Because of this, all further maintenance was suspended within the window.

9.) The perceived memory leak worsened dramatically over the next hour. Foundry requested that we connect a standby non production switch via a mirror while failing one core out of production, performing a restart, and seeing if this temporarily stopped the leak.

10.) The reboot was performed, but the situation almost instantly became just as severe. Processor load on the outbound 10Gbps interface line cards was pegged at 99% on the rebooted core, and 90% on the unrebooted core.

11.) Processor load intensified to 99% on all outbound 10Gbps interfaces, >90% on the internal interfaces, and 99% on the switch management modules.

12.) Foundry was given 15 minutes to propose a course of action, but since no satisfactory course of action was forthcoming, the decision was made internally to revert to the old code distributed on the pre-upgrade CD.

13.) After steps 1-10 above were repeated, in reverse, traffic stabilized for less than 10 minutes. Amazingly, the high CPU load continued on previously operational code.

14.) At approximately 11 AM EST Sunday, at the urging of the manufacturer, the routers were reconfigured and reinstalled using old configs and old code; however, the problem continued to manifest itself after a few minutes of operation.

15.) Sago management requested that the "old patched" software be loaded on the routers.

16.) Steps 1-10 above were commenced to load the "old patched" OS on the routers. The first router to be upgraded failed, and we found the ASICs refused the old code completely. Foundry investigated this issue and determined that the software upgrade implemented earlier apparently included a firmware upgrade that WAS NOT mentioned during the pre-upgrade meetings by Foundry.

17.) Foundry committed to produce code that would recreate the older firmware on the 10Gbps cards, thereby allowing a reload of the "old patched" software instead of the "old" software provided by Foundry on CD.

18.) While waiting for the production of this software, a Sago engineer noticed that no memory was allocated to IPv4 routes. Further, no CAM page files were devoted to storing routes or buffering for IPv4.

19.) Foundry investigated this finding and determined that only IPv6 allocations were being made. It was acknowledged that this was a typographical error made during the custom production of our software and existed in both the "old" and "new" software versions provided on CD.

20.) A conference call was held, and Foundry determined that given currently available resources, fixing the memory allocation issues was more feasible in a short timeframe. We elected to take their advice. The first software revisions became available at approximately 9PM EST Sunday and were implemented within the hour by following steps 1-10 above.

21.) 3 more updates/fixes became available throughout the night. The last was implemented around 6AM EST Monday.

22.) These each gradually reduced the persistent packet loss issues.

23.) The final packet loss issue involved an improperly computed equation causing an irreversible imbalance on only 2 single-Gbps outbound connections. This caused contained packet loss for outbound routes preferring those connections. Essentially, the switches were computing that those pipes were larger than 1Gbps and trying to force more traffic out of them than was feasibly possible (a rough sketch of this overcommit effect follows this list).

24.) This fix was implemented between 8 PM and 9 PM this evening.
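As a back-of-the-envelope illustration of the overcommit described in item 23, the toy sketch below splits a hypothetical 9 Gbps of outbound demand across links in proportion to what the balancer believes each can carry. When a 1 Gbps link is recorded as larger than it really is, it is offered more traffic than it can pass, and the excess shows up as loss confined to that link. All figures are invented, not taken from Sago's routers.

```python
# Toy illustration of item 23: if the balancer believes a 1 Gbps link is
# larger than it really is, that link is offered a share of traffic it
# cannot carry, and the excess shows up as localized packet loss.
# All capacities and the 9 Gbps demand figure are invented.

ACTUAL_GBPS = {"ge-1": 1.0, "ge-2": 1.0, "tenge-1": 10.0}
BELIEVED_GBPS = {"ge-1": 2.5, "ge-2": 1.0, "tenge-1": 10.0}  # ge-1 overstated

def offered_load(total_gbps: float, believed: dict[str, float]) -> dict[str, float]:
    """Split total outbound traffic proportionally to believed capacity."""
    total_capacity = sum(believed.values())
    return {link: total_gbps * cap / total_capacity for link, cap in believed.items()}

def loss_report(total_gbps: float) -> None:
    """Show where offered load exceeds real capacity."""
    for link, offered in offered_load(total_gbps, BELIEVED_GBPS).items():
        actual = ACTUAL_GBPS[link]
        dropped = max(0.0, offered - actual)
        print(f"{link}: offered {offered:.2f} Gbps, capacity {actual:.1f} Gbps, "
              f"dropped ~{dropped:.2f} Gbps")

if __name__ == "__main__":
    loss_report(total_gbps=9.0)  # hypothetical aggregate outbound demand
```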


At this time, all issues are resolved. Any customers still experiencing problems should immediately contact [email protected], as their issue is unrelated to anything above.

We sincerely apologize for the obvious problems this has caused you. Our network's performance during this incident is unacceptable and contrary to the normal way our company operates. A relentless investigation into why this occurred will continue to ensure that our customers are never subjected to such an incident in the future.

(post was edited and cut as it was too large; continued below)

Gemhdar 11-14-2005 08:40 PM

(continued from above)

The following changes have already been implemented:

1.) Mandated the presence of a third-party auditor to review any proposed changes to our core infrastructure by either Sago or any of our manufacturers. We rely heavily on input from the experts that make the network gear we use, but that strategy utterly failed us in this incident. A search was begun today for a firm capable of readily assisting us with a network as robust and large as ours.

2.) A full audit, to be completed by the end of the month by the above-mentioned firm, of all custom software provided or implemented by any manufacturer in our network.

3.) A requirement that any manufacturer providing onsite or telephone assistance for a scheduled network window include enough personnel to reserve a 24-hour shift of engineers who were available during the planning phases of the upgrade. Fatigue was an issue yesterday.

4.) An internal requirement that no more than 50% of senior engineers be involved in any single task within an 8-hour period. Again, this is to combat fatigue.

5.) While we have standby devices onsite, we are now requiring any mirrored traffic tests to include all devices at that layer in the network interacting with each other.

6.) A purchase was made today for packet-generating hardware capable of simulating our load for testing purposes. Previously, packet generation was done on a scaled-down basis to test mostly DDoS-related safeguards prior to implementation.

This incident, like all our upgrades, was planned in excruciating detail. The upgrade was supposed to be a short, painless operation, and it was widely believed by both Sago and Foundry engineers that this could be accomplished without any impact to customer operations. A wide network upgrade window was only scheduled as a formality, given the gravity of the changes being made.

We cannot stress enough that everything in our power will be done to ensure there is never a recurrence of this event. Network performance as of late has not been near our company's expectations or capabilities, and it is impossible to express the level of commitment our staff has to showing how well our network will perform in the future.

As many of you are wondering, there was a reason for this upgrade. While many of these upgrades were slated to be announced later, as we DO NOT HAVE DEFINITIVE INSTALLATION TIMEFRAMES (please note the emphasis on this) from our transport carriers, these changes will be implemented beginning December 1st, with a projected finish by the end of Q1 '06.

So, as an early announcement, please know that we are implementing many of the following projects:

1.) Local peering for our Tampa datacenter with 17 other carriers, including 3 backbones.

2.) Transport between our Tampa facility and new 100,000 square foot Atlanta facility

3.) Merger of our Tampa and Atlanta Datacenter networks via dark fiber.

4.) Addition of 4 of the following backbones (final announcement will be made later this week): Savvis, Global Crossing, BTN, Telsia, and/or Cogent (for onnet Cogent traffic ONLY).

5.) Establishment of peering with 10+ other providers and 2+ carriers in Atlanta.

The following project is currently underway with an unknown timeline:

Implementation of a hard line network to New York City - 60 Hudson, to establish peering and better European transport.

Once completed, these changes will give us what we feel is one of the most stable and highest performance footprints of any provider in our competitive spectrum. No amount of announcements or press releases will prove this to you after the events of the last few days; only performance will. That will be our primary objective over the coming weeks and months.

In the meantime, we can only offer our assurances and continue to update you as to our findings related to this incident. Our extensive apologies can only go so far, so we look forward to the opportunity to prove our abilities to you. If you have any questions, please contact me directly.

d00t 11-14-2005 08:56 PM

Let me be the first to say ... isn't that great... but what are you going to do for clients now? At the end of the day nobody cares what procedure you used... we just want our boxes accessible. 27 hours of downtime/extreme packet loss is totally unacceptable, regardless of how planned out it may have been or not.

chaze 11-14-2005 09:02 PM

Sounds like you're growing fast. My experience is shit happens, but you have to answer the phone when it does.

Last time we had a DNS error and a huge part of our network was affected, I had a couple of friends come down to help cover all the phone lines. The thing never stopped, but at least everyone was on the same page.

Glad everything is back.

We are also looking to buy another cage; we have plenty of room in LA, but I think having another location (maybe with you guys in FL) will help out in case of emergencies.

Mr.Right - Banned For Life 11-14-2005 09:33 PM

50...... where are you woj

milan 11-14-2005 09:59 PM

Sago Networks - Sorry to hear about these issues, I hope you guys are getting there.
My recommendation: use Juniper for the core. Foundry sucks, and their support is even worse. We were supposed to have some upgrades as well but stopped at the last minute, as the Foundry support team are morons.

I hope you guys are up and running again, I feel the pain and stress...

Gemhdar 11-14-2005 10:28 PM

Quote:

Originally Posted by d00t
Let me be the first to say ... isn't that great... but what are you going to do for clients now? At the end of the day nobody cares what procedure you used... we just want our boxes accessible. 27 hours of downtime/extreme packet loss is totally unacceptable, regardless of how planned out it may have been or not.

D00t,

Make sure you email your request to [email protected] and cc myself for the SLA violations. They will be sorted out starting tomorrow morning and taken care of ASAP.
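For anyone trying to gauge the size of an SLA claim here, the quick calculation below turns the 27-hour figure quoted above into a monthly availability number and compares it with an assumed 99.9% uptime target; the target is only an example, as Sago's actual SLA terms are not stated in this thread.

```python
# Back-of-the-envelope availability math for the outage. The 27-hour
# figure comes from the post quoted above; the 99.9% target is an
# assumed example, not Sago's actual SLA.
HOURS_IN_MONTH = 30 * 24          # 720 hours in a 30-day month
downtime_hours = 27.0
sla_target_pct = 99.9             # hypothetical monthly uptime target

availability = 100.0 * (1 - downtime_hours / HOURS_IN_MONTH)
allowed_downtime_hours = HOURS_IN_MONTH * (1 - sla_target_pct / 100.0)

print(f"Measured availability: {availability:.2f}%")                              # ~96.25%
print(f"Allowed downtime at {sla_target_pct}%: {allowed_downtime_hours:.2f} hours")  # 0.72 hours
```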

chowda 11-14-2005 10:49 PM

Quote:

Originally Posted by Sago Networks
D00t,

Make sure you email your request to [email protected] and cc myself for the SLA violations. They will be sorted out starting tomorrow morning and taken care of ASAP.


i wonder if a complainer like myself can get something too. :Oh crap

Gemhdar 11-14-2005 11:05 PM

Quote:

Originally Posted by chowda
i wonder if a complainer like myself can get something too. :Oh crap

Hey Chowda,

same holds true for [email protected] and cc me :)

RonUSMC 11-14-2005 11:06 PM

I only use theplanet.com; they are the largest colo in the country. Their datacenters look like Wal-Marts. 24-hour phone support that picks up on the 2nd ring 99% of the time.

webair 11-14-2005 11:43 PM

Quote:

Originally Posted by Sago Networks
Hey Chowda,

same holds true for [email protected] and cc me :)


hey jason good to see you around man :thumbsup

Gemhdar 11-15-2005 06:46 AM

Thanks Mike,

I try. :)

