view Side-By-Side changes
Internet Engineering Task ForceA. S. MaunderGagan L. Choudhury Internet DraftCisco SystemsVera D. Sapozhnikova Expires inAugust, 2001 draft-ietf-ospf-scalability-00.txt G. ChoudhuryMay, 2003 AT&TLabs March, 2001draft-ietf-ospf-scalability-02.txt Anurag S. Maunder Sanera Systems Vishwas Manral Netplane Systems November, 2002 Explicit Marking and Prioritized Treatment of SpecificIGPOSPF Packets for FasterIGPConvergence and Improved Network Scalability and Stability<draft-ietf-ospf-scalability-00.txt>Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents asInternet-Drafts.Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed atwww.ietf.org/ietf/1id-abstracts.txt.http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed atwww.ietf.org/shadow.html.http://www.ietf.org/shadow.html. Distribution of this memo is unlimited.Copyright Notice Copyright (C) The Internet Society (2000). All Rights Reserved.AbstractThere has been a lot of interest in the networking community to allow for fast failure detection followed by the fast restoration and recovery. It may be possible to provide fast recovery using special mechanisms; however, there is a strong interest in addressingIn thisissue at a more fundamental level i.e. at IGP convergence because it addresses the problem at a much broader scale. Faster IGP convergence inevitably requires faster detection by using smaller hello interval timers (unless one relies on link level detection which is not always possible), fast flooding and more frequent SPF calculations. However,draft weprovide analytic and simulation results*propose the following mechanisms toshow that this compromisesimprove the scalability and stability of OSPF-based network: (1) Process thenetwork, mainly becauseHello packetsreceivedat arouter are indistinguishable fromhigher priority compared to otherpackets and may experience long queueing delays during a sudden burst of many LSA updates.OSPF packets. Inthis draft we suggest a need fororder to facilitate this, explicitly mark the Helloand potentially somepackets, to differentiate them from otherIGP packetsOSPF packets. One way of special marking is tobe marked explicitly so that efficient Maunder,use a different Diffserv Choudhury et. al.Expires: August, 2001 [page[Page 1]implementations can detect and act upon these messagesInternet Draft Explicit Marking May, 2003 codepoint for Hello packets compared to other OSPF packets. (2) In the absence of special marking, or ina priority fashion thus allowing significant reductionaddition to it, use other mechanisms inconvergence timeorder not to miss Hello packets. One example is to treat any packet received over a link as a surrogate for a Hello packet (an implicit Hello) forIGP while maintaining network stability. The figures and graphs are missing fromtheASCII versionpurpose of keeping thedraft.link alive. (3) Thepdf versionssame type ofthis draft canexplicit marking and prioritized treatment may befound in the Internet- Drafts repository. 1 Motivation The motivation of this draft isbeneficial toaddress two key issues: (1) Fast restoration under failure conditions (2) Increased network scalability and stability The motivation for allowing fast restoration under failure conditionsother OSPF packets as well. One important example issimilar to the one provided in [1]draft-alaettinoglu- isis-convergence-00.txt. The theoretical limit for link-state routing protocols to re-route is in link propagation time scales, i.e., in tensLSA acknowledgment packet that can reduce retransmissions during periods ofmilliseconds. However, in practice it takes seconds to tenscongestion. Other examples include (a) Database description (DBD) packet from a slave that is used as an acknowledgement, and (b) LSAs carrying intra-area topology change information. It is possible that some implementations are already using one or more ofseconds to detectthelink failure and disseminate this informationabove mechanisms in order not to miss thenetwork followed by the convergence on the new set of paths. This is an inordinately long periodprocessing oftransient time for missioncriticaltraffic destined to the non-reachable nodespackets during periods of congestion. However, we suggest thenetwork. One componentabove mechanisms to be included as part of thelong re-route time is the link failure detection timestandard so that all implementations can benefit from them. Table ofbetween 20 and 30 seconds through three missed Hello packets with the typical hello intervalContents 1. Introduction...................................................2 2. The Network Under Simulation...................................5 3. Simulation Results ............................................7 4. Observations on Simulation Results ...........................11 5. Need for Prioritized Treatment of10 seconds (between 30Critical OSPF Packets and40 seconds if missed hello threshold is 4). This component would be much shorter in the presence of link level detection, but as pointed out in [1]draft- alaettinoglu-isis-convergence-00.txt it does not work in some cases. For example, a device driver may detect the link level failure but failSpecial Marking tonotify itFacilitate That............................12 6. Summary.......................................................13 7. Acknowledgments...............................................14 8. References....................................................14 9. Authors' Addresses............................................15 1. Introduction Due tothe IGP level. Also, if a router fails behind a switchworld-wide increased traffic demand, data networks are ever increasing ina switched environment then even though the switch gets the link level notification it cannot communicate thatsize in terms of number of nodes, number of links, adjacencies per node and Link State Database size. Our motivation is toother routers. Therefore for faster reliable detection atimprove theIGP level, one hasability of large networks toreducewithstand thehello interval. Reference [1]draft- alaettinoglu-isis-convergence-00.txt suggests that this be reduced to belowsimultaneous or near-simultaneous update of asecond, perhaps even to tenslarge number ofmilliseconds. A second component of the long re-route time is delayed SPF (shortest-path- first) computation. The typical delay value is 5 seconds but needs to be reduced significantly to have sub-second rerouting. The second issue we address is the ability of a network to withstand the simultaneous or near-simultaneous update of a large number of link-state- advertisementlink-state-advertisement messages, or LSAs. We call this event, an LSA storm. An LSA storm may begeneratedinitiated due to many reasons. Here are some examples: (a) one or more link failures due to fiber cuts, Choudhury et. al. [Page 2] Internet Draft Explicit Marking May, 2003 (b) one or more node failures for some reason, e.g.,failed power supplysoftware crash or some type of disaster in anoffice,office complex hosting many nodes, (c) requirement of taking down and later bringing back many nodes during a software/hardware upgrade, (d) near-synchronization of the once-in-30-minutes refresh instants of some types of LSAs, (e) refresh of all LSAs in the system during a change in software version.The LSA storm tendsIn addition todrivethenode CPU utilization to 100% forLSAs generated as aperioddirect result oftime and the durationlink/node failures, there may be other indirect LSAs as well. One example in MPLS networks is traffic engineering LSAs generated at other links as a result ofthis period increases with the sizesignificant change in reserved bandwidth resulting from rerouting of Label Switched Paths (LSPs) that went down during the link/node failure. The LSA storm causes high CPU and memory utilization at the nodeadjacency, i.e., the number of trunks connectedprocessors causing incoming packets toit. During this periodbe delayed or dropped. Delayed acknowledgements (beyond the retransmission timer value) results in retransmissions, and delayed Hello packetsreceived at(beyond thenode would see high delays and if this delay exceeds typically three or four hello intervals Maunder, et. al. Expires: August, 2001 [page 2] (typically 30 or 40 seconds)Router-Dead interval) results in links being declared down. A trunk-down event causes Router LSA generation by its end-point nodes. If traffic engineering LSAs are used for each link thenthe associated trunkthat type of LSAs would also bedeclared down. Depending ongenerated by theimplementation, there may be other impacts of a long CPU busy periodend-point nodes and potentially elsewhere aswell. For example,well due to significant changes ina reliable node architecture with an active and a standby processor, a processor-switch may result during an extended CPU-busy period which may mean that allreserved bandwidths at other links caused by theadjacencies would be lostfailure andneed to be re- established. Bothreroute of LSPs originally using theabove eventsfailed trunk. Eventually, when the link recovers that wouldcause more database synchronization with neighborsalso trigger additional Router andnetwork-widetraffic engineering LSAs. The retransmissions and additional LSAflooding whichgenerations result inturn might cause extended CPU-busy periods at other nodes. This may cause unstable behavior in the network for an extended period of timefurther CPU andpotentiallymemory usage, essentially causing ameltdown inpositive feedback loop. We define theextreme case. Due to world- wide increased traffic demand, data networks are ever increasingLSA storm size as the number of LSAs insize. Asthenetwork size grows, a bigger LSAoriginal storm anda higher adjacency at certain nodes would be more likely and so would increasenot counting any additional LSAs resulting from theprobability of unstable behavior. One way to addressfeedback loop described above. If thescalability issueLSA storm isto divide the network hierarchically into different areas so that flooding of LSAs remains localized within areas. However, this approach increasestoo large then thenetwork management and design complexity and less optimal routing between areas. Also area 0positive feedback loop mentioned above maysee the flooding ofbe large enough to indefinitely sustain a largenumber of summary LSAsCPU andsome of the new protocols may not work well undermemory utilization at many network nodes, thereby driving thehierarchical system. Thus it is importantnetwork toallowan unstable state. In the past, networkto grow towards as large a size as possible under a single area. The undesirable impact of large LSA storms is understoodoutage events have been reported inthe networking communityIP andit is well known thatATM networks using link-state protocols such as OSPF, IS-IS, PNNI or some proprietary variants. See, for example [Ref1-Ref4]. In many of these examples, large scale flooding of LSAs or other similar control messages (either naturally ordue to a bug) hastriggered by some bug or inappropriate Choudhury et. al. [Page 3] Internet Draft Explicit Marking May, 2003 procedure) have been partly or fully responsible forseveralnetworkevents in the past causing a meltdown or a near-meltdown. Recently, proposals haveinstability and outage. It has beensubmittedsuggested [Ref5] toavoid synchronization of LSA refreshes [2]draft-ietf-ospf-refresh-guide-01.txt andreduceflooding overheadthe Hello interval and Router-Dead interval significantly incase more than one interface goesorder for OSPF tothe same neighbor [3] draft-ietf-ospf-isis-flood-opt-00.txt,detect link failures and[4]draft- ietf-ospf-ppp-flood-00.txt. In this proposal werecoveries faster. Reduction of Router-Dead interval wouldlike tomakethe point that reducing hello intervals andit even morefrequent SPF computation would in fact reduce network scalability and stability. We will use a simple and approximate but easy-to-understand analytic modellikely forthis purpose.links to be declared down due to missed Hellos. Wewill alsouse amore involvedsimulationmodel. Next, we would likemodel tomake the pointshow thatmany of the underlying causes of network scalability could be avoided if certain IGP messages could be specially marked and provided prioritized treatment. 2 Analytic Model for Delay seen By a Received Hello Packet Duringthere is a certain LSAStorm For every trunk interface, a node has to send and receive a Hello packet once every hello interval. Sending of a Hello packet can be triggeredstorm size threshold above which the network may show unstable behavior caused bya timer and it is possible to give higher prioritylarge number of retransmissions, link failures due totimer-driven jobs and thereby ensure that it is not excessively delayed even during extended CPU-busy periods. However, a receivedmissed Hellopacket cannot be easily distinguished from other IGP or IPpackets andtherefore is typically served in a first-come-first- served fashion.subsequent link recoveries. Wedo a simple and approximate analysis ofalso show that thedelay experienced by this packet during an LSA storm at a node with highest adjacency. LetÆs assume: ? S = Size ofLSA storm(i.e., number of LSAs in it). Also, it is assumed that eachsize causing instability may be substantially increased by providing prioritized treatment to Hello and LSAis carried in one LSU packet. ? L = Link adjacency ofAcknowledgment packets. Furthermore, if we prioritize Hello packets then even when thenode under consideration.network operates somewhat above the stability threshold, links are not declared down due to missed Hellos. This implies that even though there isMaunder, et. al. Expires: August, 2001 [page 3] assumedcontrol plane congestion due tobemany retransmissions, themaximumdata plane stays up and no new LSAs are generated (besides the ones in thenetwork. ? t1 = Time to send or receive one IGP packet over an interface (the same time is assumed fororiginal storm and the refreshes). Based on these observations we propose prioritized treatment of Hello,LSA, duplicateLSA acknowledgment andLSA acknowledgement even though in general there mayother critical OSPF packets and a special marking to facilitate that. One might argue that the scalability issue of large networks should besome differences.solved solely by dividing the network hierarchically into multiple areas so that flooding of LSAs remains localized within areas. However, thiswouldapproach increases the network management and design complexity and may result in less optimal routing between areas. Also, ASE LSAs are flooded throughout the AS and it may be agood approximationproblem ifmajoritythere are large numbers ofthe time is in the actthem. Furthermore, a large number ofreceiving or sendingsummary LSAs may need to be flooded across Areas anda relatively small parttheir numbers would increase significantly if multiple Area Border Routers are employed forpacket-type-specific work. Inthenumerical examples we assume t1 = 1 ms. ? t2 = Timepurpose of reliability. Thus it is important todo one SPF calculation. Forallow the network to grow towards as largenetwork, this timea size as possible under a single area. Our proposal here isusually in hundredssynergistic with a broader set ofmsscalability and stability improvement proposals. [Ref6, Ref7] proposes flooding overhead reduction inthe numerical examples we assume t2 = 200 ms. ? Hi = Hello interval. ? Si = minimum interval between successive SPF calculations. ? ro = Rate at which non-IGP work comescase more than one interface goes to thenode (e.g., forwarding of data packets). For the numerical examples we assume ro = 0.2. ? T = Total work broughtsame neighbor. [Ref8] proposes a mechanism for greatly reducing LSA refreshes in stable topologies. [Ref9] compares several restricted flooding algorithms in terms of their ability tothe node during the LSA storm. For each LSA update generated elsewhere, the node will receive one newwithstand large LSApacket over one interface, send an acknowledgement packet over that interface,storms andsend copiesrobustness to failure conditions. [Ref10] proposes a wide range of congestion control and failure recovery mechanisms. Choudhury et. al. [Page 4] Internet Draft Explicit Marking May, 2003 Section 2 describes theLSA packet over the remaining L-1 interfaces. Also, assuming that the implicit acknowledgement mechanism is in use,network under simulation and Section 3 provides thenode will subsequently receive either an acknowledgement orsimulation results. Section 4 gives the basic observations based on the simulation results. Section 5 explains the need for prioritized treatment of certain critical OSPF packets and special marking to facilitate that. Section 6 gives the summary. 2. The Network Under Simulation We generate aduplicate LSArandom network over a rectangular grid using a modified version of Waxman's algorithm [Ref11] that ensures that theremaining L-1 interfaces. So over each interface one packetnetwork issentconnected andone is received. It can be seen that the same would be true for self-generated LSAs. So the total workhas a pre-specified number of nodes, links, maximum number of neighbors perLSA update is 2*L*t1. Since there are S LSAs innode, and maximum number of adjacencies per node. The rectangular grid resembles thestorm, we get T = 2*S*L*t1 (1) In Equation (1) we ignore retransmissionscontinental U.S.A. with maximum one-way propagation delay ofLSAs30 ms incase acknowledgements are not received or processed within 5 seconds. This impactthe East-West direction andother details are taken into accountmaximum one-way propagation delay of 15 ms in thesimulation model to be presented later. ? T2 = Time period over which the work comes. Due to differencesNorth-South direction. We consider two different network sizes as explained inpropagation times and congestion at other nodes, it is possible for the work arrival time to be spread out overSection 3. The network has along interval. However, since we are considering theflat, single-area topology. Each nodewith highest adjacency, i.e., one with highest congestion, (thisisassuminga Router and each link is a point-to-point link connecting two routers. We assume thatall nodes have the same processing powernode CPU andaboutmemory (not thesame non-IGP workload) most oflink bandwidth) is thework will comemain bottleneck inone chunk. We verified this to be usually true using simulations. One part of T2the LSA flooding process. This will typically be true for high speed links (e.g., OC3 or above) and/or links where OSPF traffic gets an adequate Quality ofthe order of link propagation delay and we assume that there is a second part which is proportionalService (QoS) compared toT. Therefore we get, T2other traffic. Different Timers: LSA refresh interval =A + B*T (2) Where A and B are constants. For the numerical examples we assume A1800 seconds, Hello refresh interval = 10ms and B = 0.1. ? DSeconds, Router-Dead interval =Maximum delay experienced by a Hello packet during the40 seconds, LSAstorm. We assume first-come-first-served serviceretransmission interval: two values are considered, 10 seconds andhence the delay seen by the Hello packet would be the total outstanding work at the node at the arrival instant plus its own processing time. We assume5 Seconds (note thatoutstanding work steadily increases over Maunder, et. al. Expires: August, 2001 [page 4] the interval T2 and so the maximum delay is seen byaHello packet that comes nearretransmission is disabled on theendreceipt ofthis interval. We write downeither anapproximate expression for D and then explain the various terms on the right hand side: D = T û T2 + max(1,2*T2/Hi)*t1 + max(1,T2/Si)*t2 + ro*T2 (3) The first term is the total work brought in due to theexplicit acknowledgment or a duplicate LSAstorm. The second term is the workover thenode was able to finish since we are assumingsame interface thatit was continuously busy duringacts as an implicit acknowledgment) Minimum time between successive generation of theperiod T2. The third termsame LSA = 5 seconds, Minimum time between successive Dijkstra SPF calculations isthe total work due to the sending and receiving1 second. Packing ofHello packets during the period T2. Note that itLSAs: It is assumed thatat least one Hello packet is processed, i.e., itself. The fourth term is due to SPF processing duringfor any given node, the LSAs generated over a 1-second periodT2 and we assume that at leastare packed together to form an LSU but no more than 3 LSAs are packed in oneSPFLSU. LSU/Ack/Hello Processing Times: All processingis done. The last term is the total non- IGP work coming totimes are expressed in terms of thenode overparameter T. Two values of T are considered, 1 ms Choudhury et. al. [Page 5] Internet Draft Explicit Marking May, 2003 and 0.5 ms. In theinterval T2 ? Dmax = maximum allowed valuecase ofD, i.e., if D exceeds this value thena dedicated processor for processing OSPF packets theassociated link would be declared down. Inprocessing time reported represents thenumerical examples below we assume Dmax = 3*Hi (4)true processing time. If the processor does other work and only a fraction of its capacity can be dedicated to OSPF processing then weassumehave to inflate the processing time appropriately to get the effective processing time and in that case it is assumed that theprevious Hello packet was minimally delayed then exceeding Dmax really means four missed hellos sinceinflation factor is already taken into account as part of the reported processing time. The fixed time to send or receive any LSU, Ack or Hello packetunder study itself came after a period Hi.is T. In addition, a variable processing time is used for LSU and Ack depending on thenumerical examples below, both Dnumber andDmax change with choicetypes ofsystem parameters and we are mainly interested in identifying if D exceeds Dmax. For this purpose we define the following ratioLSAs packed. No variableDelay Ratio = D/Dmax (5) and identify if Delay Ratio exceeds 1. In Figures 1-3 we plotprocessing time is used for Hello. Variable processing time per Router LSA is (0.5 + 0.17L)T where L is theDelay Ratio as a functionnumber ofLSA Storm size with nodeadjacencies10, 20 and 50 respectively. All parameters exceptadvertised by the Router LSA. For other LSA types (e.g., ASE LSA or a "Link" LSA carrying traffic engineering information about a link), the variable processing time per LSA is 0.5T. Variable processing time for an Ack is 25% that of theonescorresponding LSA. It is to be notedexplicitly on the figuresthat if multiple LSAs areas stated earlier. Figure 1 assumes Hello packetspacked in a single LSU packet then the fixed processing time is needed only once but the variable processing time is needed for every10component of the packet. The processing time values we use are roughly in the same range of what has been observed in an operational network. LSU/Ack/Hello Priority: Two non-preemptive priority levels and three priority scenarios are considered. Within each priority level processing is FIFO with new packets of lower priority being dropped when the lower priority queue is full. The higher priority packets are never dropped. In Priority scenario 1, all LSUs/Acks/Hellos received at a node are queued at the lower priority. In Priority scenario 2, Hellos received at a node are queued at the higher priority but LSUs/Acks are queued at lower priority. In Priority scenario 3, Hellos and Acks received at a node are queued at the higher priority but LSUs are queued at lower priority. All packets generated internally to a node (usually triggered by a timer) are processed at the higher priority. This includes the initial LSA storm, LSA refresh, Hello refresh, LSA retransmission and new LSA generation after detection of a failure or recovery. Buffer Size for Incoming LSUs/Acks/Hellos (lower priority): Buffer Choudhury et. al. [Page 6] Internet Draft Explicit Marking May, 2003 size is assumed to be 2000 packets where a packet is either an Ack, LSU, or Hello. LSA Refresh: Each LSA is refreshed once in 1800 seconds and the refresh instants of various LSAs in the LSDB are assumed to be uniformly distributed over the 1800 seconds period, i.e., they are completely unsynchronized. If however, an LSA is generated as part of the initial LSA storm then it goes on a new refresh schedule of once in 1800 seconds starting from its generation time. LSA Storm Generation: As defined earlier, "LSA storm" is the simultaneous or near simultaneous generation of a large number of LSAs. In the case of only Router and ASE LSAs we normally assume that the number of ASE LSAs in the storm is about 4 times that of the Router LSAs, but the ratio is allowed to change if either the Router or the ASE LSAs have reached their maximum possible value. In the case of only Router and Link LSAs (carrying traffic engineering information) we normally assume that the number of Link LSAs in the storm is about 4 times that of the Router LSAs, but the ratio is allowed to change if either the Router or the Link LSAs have reached their maximum possible value. For any given LSA storm we keep generating LSAs starting from Node index 1 and moving upwards and stop until the correct number of LSAs of each type have been generated. The LSAs generated at any given node is assumed to start at an instant uniformly distributed between 20 and 30 seconds from the start of the simulation. Successive LSA generations at a node are assumed to be spaced apart by 400 ms. It is to be noted that during the period of observation there are other LSAs generated besides the ones in the storm. These include refresh of LSAs that are not part of the storm and LSAs generated due to possible link failures and subsequent possible link recoveries. Failure/Recovery of Links: If no Hello is received over a link (due to CPU/memory congestion) for longer than Router-Dead Interval then the link is declared down. At a later time, if Hellos are received then the link would be declared up. Whenever a link is declared up or down, one Router LSA is generated by each Router on the two sides of the point-to-point link. If "Link LSAs" carrying traffic engineering information is used then it is assumed that each Router would also generate a Link LSA. In this case it is also assumed that due to rerouting of LSPs, three other links in the network (selected randomly in the simulation) would have significant change in reserved bandwidth which would result in one Link LSA being generated by the routers on the two ends of each such link. 3. Simulation Results In this section we study the relative performance of the three Choudhury et. al. [Page 7] Internet Draft Explicit Marking May, 2003 Priority scenarios defined earlier (no priority to Hello or Ack, priority to Hello only, and priority to both Hello and Ack) with a range of Network sizes, LSA retransmission timer values, LSA types, processing time values and Hello/Router-Dead-Interval values: Network size: Two networks are considered. Network 1 has 100 nodes, 1200 links, maximum number of neighbors per node is 30 and maximum number of adjacencies per node is 50 (same neighbor may have more than one adjacencies). Network 2 has 50 nodes, 600 links, maximum number of neighbors per node is 25 and maximum number of adjacencies per node is 48. Dijkstra SPF calculation time for Network 1 is assumed to be 100 ms and that for Network 2 is assumed to be 70 ms. LSA Type: Each node has 1 Router LSA (Total of 100 for Network 1 and 50 for Network 2). There are no Network LSAs since all links are point-to-point links and no Summary LSAs since the network has only one area. Regarding other LSA types we consider two situations. In Situation 1 we assume that there are no ASE LSAs and each link has one "Link" LSA carrying traffic engineering information (Total of 2400 for Network 1 and 1200 for Network 2). In Situation 2 we assume that there are no "Link" LSAs and half of the nodes are ASA-Border nodes and each border node has 10 ASE LSAs (Total of 500 for Network 1 and 250 for Network 2). We identify Situation 1 as "Link LSAs" and Situation 2 as "ASE LSAs". LSA retransmission timer value: Two values are considered, 10 seconds and 5 seconds (default value). Processing time values: Processing times for LSUs, Acks and Hello packets have been previously expressed in terms of a common parameter T. Two values are considered for T, which are 1 ms and 0.5 ms respectively. Hello/Router-Dead-Interval: It is assumed that Router-Dead interval is four times the Hello interval. In one case it is assumed that Hello interval is 10 seconds and Router-Dead-Interval is 40 seconds (default values), and in the other case it is assumed that Hello interval is 2 seconds and Router-Dead-Interval is 8 seconds. Based on Network size, LSA type and processing time values we develop 6 Test cases as follows: Case 1: Network 1, Link LSAs, retransmission timer = 10 sec., T = 1 ms, Hello/Router-Dead-Interval = 10/40 sec. Case 2: Network 1, ASE LSAs, retransmission timer = 10 sec., T = 1 ms, Hello/Router-Dead-Interval = 10/40 sec. Case 3: Network 1, Link LSAs, retransmission timer = 5 sec., Choudhury et. al. [Page 8] Internet Draft Explicit Marking May, 2003 T = 1 ms, Hello/Router-Dead-Interval = 10/40 sec. Case 4: Network 1, Link LSAs, retransmission timer = 10 sec., T = 0.5 ms, Hello/Router-Dead-Interval = 10/40 sec. Case 5: Network 1, Link LSAs, retransmission timer = 10 sec., T = 1 ms, Hello/Router-Dead-Interval = 2/8 sec. Case 6: Network 2, Link LSAs, retransmission timer = 10 sec., T = 1 ms, Hello/Router-Dead-Interval = 10/40 sec. For each case and for each Priority scenario we study the network stability as a function of the size of the LSA storm. The stability is determined by looking at the number of non-converged LSUs as a function of time. An example is shown in Table 1 for Case 1 and Priority scenario 1 (No priority to Hellos or Acks). =========|========================================================== | Number of Non-Converged LSUs in the Network at Time(in sec) LSA | STORM |====|=====|=====|=====|=====|=====|=====|=====|========|== SIZE |10s | 20s | 30s | 35s | 40s | 50s | 60s | 80s | 100s | =========|====|=====|=====|=====|=====|=====|=====|=====|========|== 100 | 0 | 0 | 24 | 29 | 24 | 1 | 0 | 1 | 1 | (Stable)| | | | | | | | | | ---------|----|-----|-----|-----|-----|-----|-----|-----|--------|-- 140 | 0 | 0 | 35 | 48 | 46 | 27 | 14 | 1 | 1 | (Stable)| | | | | | | | | | ---------|----|-----|-----|-----|-----|-----|-----|-----|--------|-- 160 | 0 | 0 | 38 | 57 | 55 | 40 | 26 | 65 | 203 | (Unstable) | | | | | | | | | =========|========================================================== Table 1: Network Stability Vs. LSA Storm (Case 1, No priority to Hello/Ack) The LSA storm starts a little after 20 seconds andSPF calculation every 5 seconds which are typical default values today. With a node adjacencyso for some period of10,time after that theDelay Ratio is below 1 even withnumber of non-converged LSUs should stay high and then come down for a stable network. This happens for LSA storms of sizes 100 and 140. With an LSA storm of size1000. However, with a node adjacency of 20,160, theDelay Ratio exceeds 1 at around a stormnumber ofsize 800non-converged LSUs stay high indefinitely due to repeated retransmissions, link failures due to missed Hellos for more than the Router-Dead interval which generates additional LSAs andwith a node adjacency of 50,also due to subsequent link recoveries which again generate additional LSAs. We define network stability threshold as theDelay Ratio exceeds 1 at around amaximum allowable LSA stormofsize325. Figure 1: Delay Ratio with Hello Every 10 Seconds, SPF Every 5 Seconds, Dmax = 30 seconds Infor which the number of Choudhury et. al. [Page 9] Internet Draft Explicit Marking May, 2003 non-converged LSUs come down to alarge network itlow level after some time. It turns out that for this example the stability threshold isnot unusual to have LSA storms150. The network behavior as a function ofsize several hundreds sincethe LSAdatabasestorm sizemaycan beseveral thousands. Thiscategorized as follows: (1) If the LSA storm isparticularly true ifwell below the stability threshold then the CPU/memory congestion lasts only for a short period and during this period there aremany type 5 LSAsvery few retransmissions, very few dropped OSPF packets andthereno link failures due to missed Hellos. This type of LSA storms arespecial LSAs for carrying information about available bandwidth at trunks as is commonobserved routinely inATMoperational networks andmight be used in MPLS-basednetworksas well. Figure 2 decreasesrecover from them easily. (2) If thehello interval to 2 seconds and SPF calculation is done once a second.LSA stormthresholds are significantly reduced. Specifically, with a node adjacency of 10, Maunder, et. al. Expires: August, 2001 [page 5]is just below theDelay Ratio exceeds 1 at around a storm of size 310; with a node adjacency of 20,stability threshold then theDelay Ratio exceeds 1 at aroundCPU/memory congestion lasts for astorm of size 160;longer period andwith a node adjacency of 50, the Delay Ratio exceeds 1 at around a stormduring this period there may be considerable amount ofsize only 65. Figure 2: Delay Ratio withretransmissions and dropped OSPF packets. If HelloEvery 2 Seconds, SPF Every 1 Second, Dmax = 6 seconds Figure 3 decreasespackets are not given priority then there may also be some link failures due to missed Hellos. However, thehello interval even furthernetwork does go back to300 ms and SPF calculation is done once every 500 ms. LSA storm thresholds are really small now. Specifically, withanode adjacencystable state eventually. This type of10, the Delay Ratio exceeds 1 at around aLSA stormof size 40,may happen rarely in operational networks and they recover from it witha node adjacency of 20,some difficulty. (3) If theDelay Ratio exceeds 1 at around aLSA stormof size 20, and with a node adjacency of 50,is above the stability threshold then theDelay RatioCPU/memory congestion may last indefinitely unless some special procedure for relieving congestion isalready over 1 even with a stormfollowed. During this period there are considerable amount ofsize 10. Figure 3: Delay Ratio withretransmissions and dropped OSPF packets. If HelloEvery 300 ms, SPF Every 500 ms, Dmax = 900 ms Whenever Delay Ratio exceeds 1, the associatedpackets are not given priority then there would also be linkis declared down even if it is actually up and eventually other undesirable events start (e.g., trunk flapping and cascading of extended CPU overload periodsfailures due toother nodes). Therefore, themissed Hellos. This type of LSA stormthreshold at which the Delay Ratio exceeds 1mayalso roughly be consideredhappen very rarely in operational networks and usually some manual procedure such as taking down adjacencies in heavily congested nodes is needed. (4) If Hello packets are given priority then the network stabilitythreshold. Figures 1-3 show that the stabilitythresholdrapidly decreases asincreases, i.e., thehello interval and SPF computation interval decreases. One reason fornetwork can withstand a larger LSA storm. Furthermore, even if the network operates at or somewhat above this higher stability threshold, Hellos are still not missed and so there are no link failures. So even if there is congestion in theincreased CPU workcontrol plane due tomore frequent hello and SPF computations, butincreased retransmissions requiring some special procedures for congestion reduction, thedominant reason is that Dmax itself decreasesdata plane remains unaffected. (5) If both Hello andso a smaller CPU busy interval is needed to exceed it. Specifically, Dmax is 30 seconds in Figure 1, 6 Seconds in FigureAcknowledgement packets are given priority then the stability threshold increases even further. Choudhury et. al. [Page 10] Internet Draft Explicit Marking May, 2003 In Table 2and only 900 ms in Figure 3. It is clear fromwe show theabove examples that in order to maintainnetwork stabilityasthreshold for thehello interval decreases, it is necessaryfive different cases and for the three different priority scenarios defined earlier. |===========|========================================================| | | Maximum Allowable LSA Storm Size For | | Case |=================|==================|===================| | Number | No Priority toprovide faster prioritized treatment|Priority toreceivedHellopackets which can of course be only done if those packets can be distinguished from other IGP| Priority to Hello | | | Hello orIP packets.Ack | Only | and Ack | |===========|=================|==================|===================| | Case 1 | 150 | 190 | 250 | |___________|_________________|__________________|___________________| | Case 2 | 185 | 215 | 285 | |___________|_________________|__________________|___________________| | Case 3Simulation Study We have also developed a simulation model to capture more accurately the impact of an| 115 | 127 | 170 | |___________|_________________|__________________|___________________| | Case 4 | 320 | 375 | 580 | |___________|_________________|__________________|___________________| | Case 5 | 120 | 175 | 225 | |___________|_________________|__________________|___________________| | Case 6 | 185 | 224 | 285 | |___________|_________________|__________________|___________________| Table 2: Maximum Allowable LSAstormStorm for a Stable Network 4. Observations on Simulation Results Table 2 shows that in all cases prioritizing Hello packets increases thenodes of the network. It captures the actual congestion seen at various nodes, propagation delay between nodes and retransmissions in case an LSA is not acknowledged. It also tries to approximate a realnetworkimplementationstability threshold, anduses processing times that are roughlyinthe same orderaddition, prioritization ofmagnitude as measured inLSA Acknowledgment packets increases thereal network (ofstability threshold even further. The reasons for theorder of milliseconds). Thereabove observations aretwo categoriesas follows. The main sources ofIGP messages. Category one messages are triggered by a timer and include the Hello refresh,sustained CPU/memory congestion (or positive feedback loop) following an LSArefresh and retransmission packets. Category 2 messagesstorm arenot triggered by a timer and include received Hello, received(1) LSA retransmissions andreceived acknowledgements. Timer-triggered messages are given non-preemptive priority over the other type. A beneficial effect(2) links being declared down due to missed Hellos which in turn causes further LSA generation and future recovery ofthis strategy is that Hello packets are sent out with little delay even under intense CPU overload. However,thereceivedlink causing even more LSA generation. Prioritizing Hello packets avoids and practically eliminates thereceived acknowledgement packets may see long queueing delays under intense CPU overload. Figures 4 and 5 below show sample resultssecond source of congestion. Prioritizing Acknowledgements significantly reduces thesimulation study when applied to a Maunder, et. al. Expires: August, 2001 [page 6] network with about 300 nodes and 800 trunks. The hello intervalfirst source of congestion, i.e., LSA retransmissions. It isassumedto be5 seconds,noted that retransmissions can not be completely eliminated due to theminimum interval between successive SPF calculations is 1 second, and a trunk is declared down if no Hello packet is received for three successive hello intervals, i.e., 15 seconds. Duringfollowing reasons. Firstly, only thestudy, an LSA storm of size 300 and 600 (Figures 4 and 5 respectively)explicit Acknowledgments arecreated at instant of time 100 seconds. Threeprioritized but duplicate LSAs carrying implicit Acknowledgments arepacked within one LSU packetstill served at the lower priority. Secondly, LSAs may get greatly delayed or dropped at the input queue of receivers andittherefore Acknowledgments may not even get generated in which case prioritizing Acks would not help. Another factor to keep in mind isassumedthatthey remain packedsince Hellos and Acks are prioritized, thesame way duringLSAs see bigger delay and potential for Choudhury et. al. [Page 11] Internet Draft Explicit Marking May, 2003 dropping. However, theflooding process. Besidessimulation results show that on thestorm, therewhole prioritizing Hello and LSA Acks are always beneficial and significantly improve the network stability threshold. Our simulation study also showed that in each of thenormal once-in- thirty-minutes LSA refreshes and those LSAs are packed one per LSU packet. We definecases, instead of prioritizing Hello packets if we treat any packet received over aquantity ôdispersionö which islink as a surrogate for a Hello packet (an implicit Hello) then we get about thenumber of LSUsame stability threshold as obtained with prioritizing Hello packets. If we prioritize Hello packetsgenerated inthen even when the networkbut not received and processed in at least one node. Figures 4 and 5 plot dispersion as a function of time. Before the LSA storm,operates somewhat above thedispersionstability threshold, links are not declared down due tonormal LSA refreshes remains small. As expected, right after the storm the dispersion jumps and then comes down againmissed Hellos. This implies that even though there is control plane congestion due to many retransmissions, thepre-storm level after some period of time. In Figure 4 with an LSA storm size 300, the ôheavy dispersion periodö lasted about 11 secondsdata plane stays up and notrunk losses were observed. In Figure 5 with an LSA storm of size 600,new LSAs are generated (besides theôheavy dispersion periodö lasted about 40 seconds. Some trunk losses were observed a little after 15 seconds withinones in theôheavy dispersion periodö but eventually all trunks recoveredoriginal storm and thedispersion came down to the pre-storm level. Figure 4: Dispersion Versus Time (LSA Storm Size = 300) Figure 5: Dispersion Versus Time (LSA Storm Size = 600) 4refreshes) 5. Need forSpecial Marking andPrioritized Treatment ofSpecific IGP packets The analyticCritical OSPF Packets andsimulation modelsSpecial Marking to Facilitate That The observations in the previous section clearly show thata major cause for unstable behavior in networks is receivedprioritizing Hello and LSA Acknowledgment packetsat a node getting queued behind other work broughtare greatly beneficial intoimproving thenode during an LSA stormscalability andmissing the deadlinestability oftypically three or four hello intervals. This need not happenlarge networks. In addition tooutgoing Hellothese packetsthat are triggered by a timer since the node CPU can giveitprioritized treatment. Clearly, if the received Hello packet canmay bespecially markedbeneficial todistinguish it fromtreat certain otherIGP and IP packets then they can also be given prioritized treatment and they would not miss the deadline even during a large LSA storm. Some specific field of IPOSPF packetsmay be used for this purpose. Besidesat theHello packets there may be other IGP packets that could also benefit from special marking and prioritized marking. We give two examples but clearly others are possible. ?higher priority as well. One exampleis the LSA acknowledgement packet. This packet disables retransmission and if a large queueing delay to this packet expires(during theretransmission timer (typical default value is 5 seconds) thendatabase exchange process between neighbors following aneedless retransmission will happen causing extra traffic load. Retransmission eventlink recovery) isusually rare due to the reliable nature of transmission links, but duringthe600 LSA storm simulation in Figure 5 many retransmission events were noted. Usually, retransmission events happen more with a longer CPU busy period. Clearly,Database Description packet from aspecial marking and prioritization Maunder, et. al. Expires: August, 2001 [page 7] ofslave that is used as an acknowledgment for theLSA acknowledgementprevious Database Description packetwould eliminate many needless retransmissions. ? A secondsent from the master. Another example is an LSA carrying abad news, i.e., a failurechange information which may trigger SPF calculation and rerouting ofa trunk or a node.Label Switched Paths. It is preferable to transmit this information faster than other LSAs in the network thateither carry good news orare just once-in-30-minutesrefreshes. The explicit identification can also be used torefreshes and typically would not triggerthe SPF calculation after processing LSAes carrying bad information. This will obviate theany route computation or route change. Given that there is a needof loweringfor providing prioritized treatment to certain OSPF packets, theSPF calculation interval under all circumstances and thus reducingnext natural question is how to facilitate this prioritization. If it is possible to examine the packet header (for the purpose of prioritization) much faster than processingoverhead. The example in this draft focussed explicitly onthecontrol domain.whole packet then prioritized treatment is possible without any protocol changes. However,it can easily be seenwe also propose thathaving an explicit identificationa special marking be used forcertain æchosenÆcategorizing all OSPF packetswill help minimize their drop probability in the traffic plane also. The explicit identification allows these controlinto one of two priority classes. It is also important to separately mark OSPF packets from other IP packets. One way to do this is to reserve two diffserv Choudhury et. al. [Page 12] Internet Draft Explicit Marking May, 2003 codepoints, one for higher priority OSPF packets and another one for lower priority OSPF packets. With this special marking it would beeasily distinguished from the dataeasy for OSPF implementers to treat Hello, LSA acknowledgment, and other critical OSPF packetsinat a higher priority and thereby significantly improve theline cardscalability andhence their processing (forwarding) can be expedited even under large traffic conditions. 5stability of networks using OSPF. 6. Summary In thisproposaldraft we point out thatifthe node processors of a largeLSA storm is generatednetwork may be subjected to a sustained CPU/Memory congestion as a result of a large LSA storm caused by some type of failure/recovery ofnodes/trunksnodes/links or synchronization amongrefreshes thenrefreshes. There is a certain LSA storm size threshold above which the network may show unstable behavior caused by large number of retransmissions, link failures due to missed Hello packetsreceived atand subsequent link recoveries. Using anodesimulation study we show that the LSA storm size causing instability maysee large queueing delaysbe substantially increased by providing prioritized treatment to Hello andmissLSA Acknowledgment packets. Furthermore, if we prioritize Hello packets then even when the network operates somewhat above the stability threshold, links are not declared down due to missed Hellos. This implies that even though there is control plane congestion due to many retransmissions, the data plane stays up and no new LSAs are generated (besides the ones in the original storm and the refreshes). Based on the above observations we propose the following: (1) Process the Hello packets at a higher priority compared to other OSPF packets. In order to facilitate this, explicitly mark thedeadlineHello packets, to differentiate them from other OSPF packets. One way oftypically threespecial marking is to use a different Diffserv codepoint for Hello packets compared to other OSPF packets. (2) In the absence of special marking, orfour hello intervals. This causesin addition to it, use other mechanisms in order not to miss Hello packets. One example is to treat any packet received over a link as a surrogate for a Hello packet (an implicit Hello) for the purpose of keeping thetrunklink alive. Our simulation study shows that this mechanism is just as effective as explicitly prioritizing Hello packets. (3) The same type of explicit marking and prioritized treatment may be beneficial tobe down andother OSPF packets as well. One important example ispotentially the beginningLSA acknowledgment packet that can reduce retransmissions during periods ofunstable behavior in the network. Thiscongestion. Our simulation Choudhury et. al. [Page 13] Internet Draft Explicit Marking May, 2003 study shows that prioritization of both Hello and LSA Acknowledgment packets isalready a concern in todayÆs network but would beconsiderably more effective than just prioritizing Hello packets. Other examples include (a) Database description (DBD) packet from amuch bigger concern if the hello intervalslave that is used as an acknowledgement, andminimum interval between SPF calculations(b) LSAs carrying intra-area topology change information. It is possible that some implementations aresubstantially reduced (belowalready using one orperhaps well below a second)more of the above mechanisms in order not toallow faster rerouting, as proposed in [1]draft-alaettinoglu-isis-convergence-00.txt. To avoid the above, we proposemiss theuseprocessing ofa special marking for Hello packets (perhaps using a special field in IP packets) so that they may be distinguished from other IGP and IPcritical packetsand provided a prioritized treatmentduringintense CPU overloadperiodscaused by LSA storms. We also point outof congestion. However, we suggest the above mechanisms to be included as part of the standard so thatother IGP packets couldall implementations can benefit fromspecial markings as well. Two examples are LSA acknowledgement packets and LSA packets carrying bad news. 5them. 7. AcknowledgmentsThe authorsWe would like tothank members of the High-Speed Packet Switching division of AT&Tacknowledge Jerry Ash, Margaret Chiosi, Elie Francis, Jeff Han, Beth Munson, Roshan Rao, Moshe Segal, Mike Wardlow, and Pat Wirth fortheir help during the study. 6collaboration and encouragement in our scalability improvement efforts for Link-State-Protocol based networks. 8. References[1] draft-alaettinoglu-isis-convergence-00.txt November, 2000 [2] draft-ietf-ospf-refresh-guide-01.txt July, 2000 [3] draft-ietf-ospf-isis-flood-opt-00.txt October, 2000 [4] draft-ietf-ospf-ppp-flood-00.txt November, 2000 Maunder,[Ref1] Pappalardo, D., "AT&T, customers grapple with ATM net outage," Network World, February 26, 2001. [Ref2] "AT&T announces cause of frame-relay network outage," AT&T Press Release, April 22, 1998. [Ref3] Cholewka, K., "MCI Outage Has Domino Effect," Inter@ctive Week, August 20, 1999. [Ref4] Jander, M., "In Qwest Outage, ATM Takes Some Heat," Light Reading, April 6, 2001. [Ref5] C. Alaettinoglu, V. Jacobson and H. Yu, "Towards Milli- second IGP Convergence," Work in Progress. [Ref6] A. Zinin and M. Shand, "Flooding Optimizations in Link-State Routing Protocols," Work in Progress. [Ref7] J. Moy, "Flooding over Parallel Point-to-Point Links," Work in progress. [Ref8] P. Pillay-Esnault, "OSPF Refresh and flooding reduction in Choudhury et. al.Expires: August, 2001 [page 8] 8[Page 14] Internet Draft Explicit Marking May, 2003 stable topologies," Work in progress. [Ref9] G. Choudhury, V. Manral, "LSA Flooding Optimization Algorithms and Their Simulation Study," Work in progress. [Ref10] J. Ash, G. Choudhury, V. Sapozhnikova, M. Sherif, A. Maunder, V. Manral, "Congestion Avoidance & Control for OSPF Networks", Work in Progress. [Ref11] B. M. Waxman, "Routing of Multipoint Connections," IEEE Journal on Selected Areas in Communications, 6(9):1617-1622, 1988. 9. Authors' AddressesAnurag S. Maunder Cisco Systems email: amaunder@cisco.comGagan L. Choudhury AT&TLabs,Room D5-3C21 200 Laurel Avenue Middletown, NJ, 07748 USA Phone: (732)420-3721 email: gchoudhury@att.com*The study was done whenVera D. Sapozhnikova AT&T Room C5-2C29 200 Laurel Avenue Middletown, NJ, 07748 USA Phone: (732)420-2653 email: sapozhnikova@att.com Anurag S. Maunderwas a Sr. Member of Techical Staff at AT&T.Sanera Systems 370 San Aleso Ave. Second Floor Sunnyvale, CA 94085 Phone: (408)734-6123 email: amaunder@sanera.net Vishwas Manral NetPlane 189, Prashasan Nagar, Road Number 72 Jubilee Hills, Hyderabad India email: Vishwasm@netplane.com Choudhury et. al. [Page 15] ----