01-08-2008, 08:37 AM
Brad Mitchell

Hi Everyone

I've been out of town skiing since Friday and just now finally have the time to respond to all of the posts. This is a very detailed response, and for those of you who take the time to read it, I thank you in advance!

Cheers,

Brad

-------------------


Hi Sam,

Just a few comments on your post, because you seem to have some misconceptions as to what the FCP does and does not do.

>> However, unfortunately the FCP is not the be-all end-all of network performance - it's not a panacea, nor is it
>> even adequate on its own. It is certainly not a replacement for qualified, top-tier network engineers.

First off, I totally agree with this comment. The FCP is certainly not a "set it and forget it" appliance like a Ronco oven. It cannot and does not replace the need for skilled network engineers, quality equipment, proper network design, and a diverse mix of quality transit providers and peers. If you simply plug an FCP (or any other intelligent traffic engineering device) into a poorly designed network that lacks real diversity and has a poor native routing policy, you will not get much value out of the device.

The FCP's real value proposition is for multi-homed networks that already have the bases above covered. It augments a solid network environment by providing a level of automated, real-time, qualitative route analysis and policy adjustment that cannot be matched by nearly ANY amount of manual (human) effort -- no matter what your budget may be for traffic engineers.

Not to get overly technical, but the ugly truth is that native BGP4 couldn't care less about actual path performance. There are no objective performance metrics built into the protocol. It is not aware of path latency, link speed, congestion, errors, etc. -- it only knows that a path exists or that it does not. Simply having a big mix of tier-one carriers without properly optimizing your tables buys you very little.

The only real clues you have (and they're not very good) in evaluating the anticipated performance of different transit paths to a given destination are 'hop' count (AS-path length) and, potentially, the MEDs received from your transit carriers. The problem is that neither of these is generally a reliable indication of latency or any other real link-quality attribute. Both can be manipulated by outside parties for reasons you can't see (e.g., did someone prepend their AS-path 5x via carrier "X" because it's of poor quality, or because it's expensive and they don't want to use it? Did your upstream provider lower the MED it sends you on such-and-such a path because that path is better, or because it's cheaper?).
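
To make that concrete, here's a rough Python sketch of the heart of the BGP best-path decision. This is purely my illustration -- real implementations have more tiebreak steps (weight, eBGP vs. iBGP, IGP metric, router-id, etc.), and the route attributes below are made-up examples. Notice that nothing in it measures the path:

    # Illustrative only: a stripped-down BGP best-path tiebreak.
    # Nothing here is a performance measurement.
    def better_route(a, b):
        # Higher local preference wins (set by local policy, not performance)
        if a["local_pref"] != b["local_pref"]:
            return a if a["local_pref"] > b["local_pref"] else b
        # Shorter AS-path wins (a "hop count" anyone can pad by prepending)
        if len(a["as_path"]) != len(b["as_path"]):
            return a if len(a["as_path"]) < len(b["as_path"]) else b
        # Lower MED wins (a value your neighbor chooses to send you)
        if a["med"] != b["med"]:
            return a if a["med"] < b["med"] else b
        return a  # further tiebreaks omitted

    # Hypothetical routes: the first happens to be the slower path in reality,
    # but the comparator has no way to know that.
    r1 = {"local_pref": 100, "as_path": [701, 3356], "med": 0}
    r2 = {"local_pref": 100, "as_path": [174, 2914, 6453], "med": 0}
    print(better_route(r1, r2))  # picks r1 on AS-path length alone

It will happily choose a congested, high-latency path over a clean one as long as the AS-path is a hop shorter.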

Bottom line: for the most part, traffic engineers don't have a whole lot to go on (performance-wise) without using tools external to the protocol itself to make any real, effective judgments when trying to optimize native BGP policy for better end-user performance -- other than making sweeping, large-scale generalizations like "Provider X has really poor quality coverage of Europe, so let's discourage egress to all the RIPE-issued blocks via that carrier." It is a fairly tedious process to sniff out poorly performing paths and move them to another carrier, especially since you have no way of knowing whether, ten minutes after you moved traffic for a given destination off of carrier "X" and onto carrier "Y", the original carrier repaired a circuit or added capacity, etc. Perhaps the path you just changed would have been better off left alone. You'll never know unless you happen to look at it again.

The reality is that, other than those sweeping generalizations (which are often not beneficial at all), most deliberate routing policy changes made by traffic engineers -- even at the largest and most highly skilled and staffed network providers -- are done reactively: either in response to a large-scale (obvious) incident, or based on a customer complaint that triggers an internal analysis which discovers "hey... we do have a much better way to reach network Z," etc. You would quickly run out of money if you tried paying a bunch of traffic engineers to sit all day watching your active traffic with NetFlow or a sniffer and making a series of surgical adjustments to your routing policy in order to guarantee your customers the best product you could deliver given your carrier mix. It simply isn't done.

This is the beauty of an automated route analysis/control platform like the FCP. It is actually looking in real time at real conversations taking place between hosts on your network and a remote network. If the conversations out to a specific ASN are significant in number, size, or duration, an answering host within that network is 'flagged' for analysis. The FCP then 'probes' (similar to a UDP traceroute) ALL of your available transit provider links and obtains their current performance characteristics (latency, packet loss, etc.) in reaching that destination network. If a significantly better-performing link is found in your mix of carriers than the one you're already using, and if moving the traffic over to that link won't oversubscribe any links or kill you on costs relative to your own 95ths, then it's moved. If an hour later those conditions have changed and another provider now offers the best performance, it's moved again.
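
For what it's worth, here is a very rough Python sketch of that decision loop. This is my own illustration of the logic, not Internap's code -- the provider names, thresholds, and the probe function are all hypothetical stand-ins:

    # Hypothetical sketch of an automated egress-optimization pass.
    # probe_destination() stands in for the FCP-style per-provider probe.
    PROVIDERS = ["carrier_a", "carrier_b", "carrier_c"]

    def probe_destination(provider, dest_ip):
        """Stand-in: return (latency_ms, loss_pct) to dest_ip via this provider."""
        raise NotImplementedError("replace with real per-provider probing")

    def pick_egress(dest_ip, current, link_util, commit_headroom):
        results = {p: probe_destination(p, dest_ip) for p in PROVIDERS}
        # Prefer the lowest loss, then the lowest latency.
        best = min(results, key=lambda p: (results[p][1], results[p][0]))
        cur_lat, cur_loss = results[current]
        best_lat, best_loss = results[best]
        # Only move if the win is meaningful AND the target link has both
        # bandwidth headroom and room under its 95th-percentile commit.
        meaningful = (cur_loss - best_loss > 1.0) or (cur_lat - best_lat > 10.0)
        if best != current and meaningful and link_util[best] < 0.85 and commit_headroom[best] > 0:
            return best    # inject a more-specific route pointing at this carrier
        return current     # otherwise leave the native BGP choice alone

Run something like that continuously against the destinations actually carrying your traffic, re-evaluating as conditions change, and you have what no team of humans can replicate by hand.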

It gets even better. Let's say an upstream provider creates a blackhole condition in reaching a given remote network. If the FCP notices a sudden increase in packet loss on an existing conversation between a host on your network and one on the newly blackholed network (remember, it is seeing SPANed <port-mirrored> traffic from all of your physical provider links for analysis), it will instantly check whether the host is reachable via any other provider and move away from the blackhole if that's an option given your other available carriers.
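
Again, a hedged sketch of just that trigger (mine, not theirs -- the loss numbers and window are invented): the reaction is nothing more exotic than watching loss on the mirrored flows and re-probing immediately when it spikes, instead of waiting for the next scheduled pass:

    # Hypothetical blackhole reaction, building on pick_egress() above.
    # loss_history: per-destination loss percentages estimated from the
    # SPANed (port-mirrored) traffic, oldest first.
    def on_flow_sample(dest_ip, loss_history, current, link_util, commit_headroom):
        recent = loss_history[-3:]                   # last few intervals
        older = loss_history[:-3]
        baseline = sum(older) / len(older) if older else 0.0
        if recent and min(recent) - baseline > 20:   # sustained jump in loss (%)
            # Don't wait for the next scheduled analysis: re-probe every
            # carrier now and move off the broken path if any other
            # provider can actually reach the destination.
            return pick_egress(dest_ip, current, link_util, commit_headroom)
        return current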

Now, what the FCP does not do:

It has NOTHING to do with ingress routing, only egress. It cannot make your ISP send traffic to us in a way that it doesn't want to. Through native policy, I can try to influence how your ISP sends me traffic (such as in the earlier example, by prepending my ASN, etc.), but your provider can choose to ignore this, or treat it differently than I intended.
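
To illustrate with the same toy comparator from earlier (again, made-up numbers): even if I prepend my AS-path several times to make a route look unattractive, the receiving network evaluates its own local-pref before it ever looks at path length, so it can simply override my hint:

    # Why prepending is only a suggestion: the neighbor's local policy wins.
    # 64512/64513 are private ASNs used here purely for illustration.
    mine_prepended = {"local_pref": 200, "as_path": [64512, 64512, 64512, 64512], "med": 0}
    other_route    = {"local_pref": 100, "as_path": [64513], "med": 0}
    print(better_route(mine_prepended, other_route))  # still prefers the prepended route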

So, on to your examples -- all of which are examples of ingress routing from our point of view:

1) Your traceroutes indicate that your provider is either Qwest itself, or is forwarding traffic to Qwest at the very first hop (I can't tell, since you removed it from the post). It appears that Qwest has no direct peering in Tampa, FL (not unusual) -- so they are backhauling your traffic into Washington, DC. That seems to take around 40ms or so to get from Tampa to WDC and bop around a few hops until we reach Qwest's edge router with L3 at hop 5. Then, on L3's network, we bounce around DC some more and pass through Atlanta on the way back down to Miami -- for a total RTT of 64ms. 64ms, given this geographic path, does not seem that unreasonable to me. However, it does seem that the lion's share of the path latency is on the outbound Qwest leg from Tampa to DC. The return using L3 from DC into Miami is almost half the latency, but again, in my opinion, both are reasonable given the circumstances.
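
If you want to do that kind of leg-by-leg attribution yourself, it is just differences of the cumulative hop RTTs. The figures below are approximations pulled from the trace described above, and per-hop deltas are only indicative, since the return path for an intermediate hop can differ from the end host's:

    # Rough leg attribution from cumulative traceroute RTTs (approximate).
    qwest_handoff_rtt = 40   # ms at the Qwest -> L3 edge in Washington
    end_host_rtt      = 64   # ms at the destination host in Miami
    print("Qwest leg (Tampa -> WDC):", qwest_handoff_rtt, "ms")
    print("L3 leg (WDC -> Miami):", end_host_rtt - qwest_handoff_rtt, "ms")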

If by chance you were looking at the reported latency figure at hop 12 in the first trace -- your traceroute just happened to coincide with the execution of a high-priority process on one of our routers' CPUs (likely the BGP scanner), which is giving you a 'false' latency figure. You'll notice that it clears up by the next hop, and your end-host latency is reported at the more believable 64ms. (The same goes for hop 8 of your second trace -- which is NOT to or on our network.)
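
A quick way to sanity-check a trace for that kind of artifact: a hop whose reported RTT jumps way up but does not carry forward to the hops after it is almost always the router's control plane de-prioritizing your probe, not real forwarding latency. A small sketch (the hop values below are invented to mirror the shape of the situation, not your actual trace):

    # Flag hops whose RTT spike does NOT persist to later hops -- a classic
    # sign of a control-plane artifact (BGP scanner, rate-limited ICMP, etc.).
    def suspicious_hops(rtts, threshold_ms=100):
        flagged = []
        for i in range(1, len(rtts) - 1):
            if rtts[i] - rtts[i - 1] > threshold_ms and rtts[i + 1] < rtts[i] - threshold_ms:
                flagged.append(i + 1)   # report 1-based hop numbers
        return flagged

    trace = [2, 12, 38, 41, 44, 52, 55, 58, 60, 61, 62, 480, 64]
    print(suspicious_hops(trace))       # -> [12]: that spike isn't real path latency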

Again, this is all ingress routing from our perspective, so the FCP is simply not involved. If you're looking for better performance from your network into Miami, you might consider another provider or try talking Qwest into adding local peering.

2) I'm not sure what the traces to the other hosts accomplish. They are to completely different networks/geographies, and honestly none of the RTTs are that bad. As you indicate, Tampa is indeed physically closer to Miami than it is to New York -- however, networks are often not engineered as the crow flies. In this particular case, Qwest is putting your outbound traffic to us on a somewhat latent link to Washington and then handing it over to L3, who turns around and sends it to Miami by way of Atlanta in half the time.

3) Just for kicks, I checked our path options to reach you (or as close to you as I could get without your IP address). Specifically, I did a manual path analysis to Qwest's router in Tampa (tpa-core-01.inet.qwest.net <205.171.27.101>) using the FCP on our network, and I have some interesting results to share. I will first point out that this particular path (to reach 205.171.0.0/18) is NOT currently being engineered by the FCP, likely because we're not exchanging sufficient traffic with that network. So, based on our native policy, we are currently routing that prefix via Level3, and our path RTT is 64ms -- the same as yours on the way in (it looks like the identical path in reverse, for that matter).


(continued...)
__________________
President at MojoHost | brad at mojohost dot com | Skype MojoHostBrad
71 industry awards for hosting and professional excellence since 1999