Operations, Monitoring and Quality Assurance Plan
NOC Services. The GLORIAD team currently provides a 24x7 NOC service with Russian and American network engineers monitoring the Moscow-Chicago link and coordinating all trouble tickets through the network provider. The CNIC team monitors the Chicago-Hong Kong-Beijing segment of the network (and similarly manages trouble ticketing processes with providers).
For the larger GLORIAD program, plans for a distributed global NOC will be formalized to include a 24x7 sharing of responsibility for management of all network facilities on the GLORIAD ring and use of a common trouble ticketing system, BUGS [R54]. The operation of the NOC will go far beyond simple link monitoring and will be closely tied to the utilization monitoring, security monitoring, and performance and measurement monitoring activities (Sect 3.6). The monitoring systems will be integrated with the BUGS system to automatically file trouble tickets. If the various monitoring and measurement systems indicate, for example, packet loss exceeding a certain threshold, routing anomalies, or suspect activity or traffic patterns, tickets will be filed and alarms triggered so that the problem can be tracked through the GLORIAD backbone and the appropriate domestic networks. BUGS trouble tickets are time stamped, and a set of management reports can be run regularly showing the nature of problems reported, the source of each trouble ticket, and the amount of time required to resolve each problem, providing a valuable feedback loop.
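The threshold-driven ticket filing described above can be illustrated with a minimal Python sketch. It is not the BUGS interface: the `Ticket` record, the `check_packet_loss` and `hours_to_resolve` helpers, and the 1% loss threshold are all hypothetical, chosen only to show how a time-stamped ticket supports the resolution-time reports mentioned in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical threshold; the plan does not state a specific value.
PACKET_LOSS_THRESHOLD = 0.01  # 1% loss


@dataclass
class Ticket:
    """Illustrative ticket record (not the BUGS schema)."""
    opened_at: datetime
    source: str            # which monitoring system filed the ticket
    segment: str           # network segment the problem was seen on
    summary: str
    resolved_at: Optional[datetime] = None


def check_packet_loss(segment: str, sent: int, received: int, now: datetime,
                      threshold: float = PACKET_LOSS_THRESHOLD) -> Optional[Ticket]:
    """Return a Ticket if measured loss exceeds the threshold, else None."""
    if sent <= 0:
        return None  # nothing was measured, so nothing to report
    loss = (sent - received) / sent
    if loss > threshold:
        summary = f"packet loss {loss:.1%} exceeds {threshold:.1%}"
        return Ticket(opened_at=now, source="loss-monitor",
                      segment=segment, summary=summary)
    return None


def hours_to_resolve(ticket: Ticket) -> Optional[float]:
    """Time from opening to resolution, in hours (None while still open)."""
    if ticket.resolved_at is None:
        return None
    return (ticket.resolved_at - ticket.opened_at).total_seconds() / 3600


if __name__ == "__main__":
    now = datetime(2004, 6, 1, 12, 0, tzinfo=timezone.utc)
    ticket = check_packet_loss("Moscow-Chicago", sent=1000, received=950, now=now)
    if ticket is not None:
        ticket.resolved_at = now + timedelta(minutes=90)
        print(ticket.summary, "| resolved in", hours_to_resolve(ticket), "hours")
```

In a sketch like this, each monitoring system would call its own check function and hand any resulting ticket to the ticketing system; the time stamps are what make the management reports possible.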
The GLORIAD trouble ticketing system will be available to personnel and systems associated with partnering networks, enabling regular sharing of information about network changes, outages, performance issues and critical trouble tickets. The ability to create trouble tickets will be extended to the general user community, both from the ticketing system itself and from a general form available from the GLORIAD website [R55].
For coordination among network engineering staff located literally around the world, the US PIs have developed a web-based chat system that will be utilized by the on-duty network engineering team [R56]. This system will be integrated with the trouble ticketing system so that the issuance of all new trouble tickets will not only notify staff via email, but will also insert messages into the chat facility and its continuously updated archive facility.
The NOC staff will coordinate both with their GLORIAD counterparts and also with colleagues in operations of partnering networks.
Performance Issues and Quality Control. GLORIAD will provide two general classes of service: a Layer 3 routed service and a Layer 2 switched circuit service. For the routed service, the team will continue the approach it has taken for several years to one aspect of quality control: ensuring that the backbone capacity available to the routed service exceeds anticipated demand, which keeps packet loss minimal. The new architecture of GLORIAD, which makes it relatively simple to increase or decrease the capacity dedicated to the Layer 3 service, will help the network team assign appropriate capacities. Also, GLORIAD is deploying network equipment in both Chicago and Seattle, with Seattle providing major routing and circuit service to Asia and Chicago providing services towards Europe/Russia. The ability to handle network requirements locally in Seattle helps with some performance issues.
However, GLORIAD is going a step further in deploying a performance measurement infrastructure to gather more detailed service metrics. Partnering with the NLANR MNA team [L64], the GLORIAD team is committed to deploying a mesh of AMP [R57] boxes to measure and maintain a continual record of metrics such as packet loss, latency and jitter, as well as traceroutes between monitored pairs of hosts. The team will also continue to work with the MNA team to perform “one way” measurements where it is not practical to deploy actual AMP boxes (see GLORIAD/MNA’s current work on monitoring 45 Russian sites at [R58]).
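As an illustration of the metrics named above, the following Python sketch derives packet loss, latency and jitter from a series of round-trip probe results between one pair of hosts. It does not reflect how AMP computes its figures. The function name `summarize_probes` is hypothetical, and jitter is taken here, as a simplifying assumption, to be the mean absolute difference between consecutive answered probes; other definitions exist.

```python
from statistics import mean
from typing import Dict, List, Optional


def summarize_probes(rtts_ms: List[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarize one measurement run.

    rtts_ms holds one entry per probe sent: the round-trip time in
    milliseconds, or None if the probe was lost.
    """
    answered = [r for r in rtts_ms if r is not None]

    # Loss: fraction of probes that never came back.
    loss = 1 - len(answered) / len(rtts_ms) if rtts_ms else None

    # Latency: mean round-trip time of the answered probes.
    latency = mean(answered) if answered else None

    # Jitter (assumed definition): mean absolute change between
    # consecutive answered probes.
    deltas = [abs(b - a) for a, b in zip(answered, answered[1:])]
    jitter = mean(deltas) if deltas else None

    return {"loss": loss, "latency_ms": latency, "jitter_ms": jitter}


if __name__ == "__main__":
    print(summarize_probes([100.0, 110.0, None, 105.0]))
```

Storing one such summary per host pair per interval would give the continual record the text describes.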
Monitoring the various metrics involved in end-to-end application use (such as within the L2 service) is a difficult problem. Different classes of high-end applications have differing service requirements. For example, transfer of large data sets (terabyte+) requires very high throughput and protocol efficiency (increasingly suggesting use of protocols other than standard TCP because of its slow start-up and its congestion control, which is inappropriate for advanced services). Another set, including collaborative visualization, remote steering of computational runs and remote instrumentation control, requires very stable network performance with regard to message control, and is thus more sensitive to latency and jitter. Both are aided by dedicated end-to-end service, but issues remain about overall network throughput, latency and jitter, requiring monitoring at this level.
Performance Monitoring. The GLORIAD investigators developed a sophisticated system under the NaukaNet program, MADAS [R59], to monitor all data flows across the US-Russia network. This system now provides a complete history of all non-trivial flows (> 40K bytes) since the system began operating in September 1999. The investigators modified the system to include flows to China when the GLORIAD network began operating in January 2004. This web-accessible system provides a comprehensive look at use of the network, end-point to end-point, by protocol and, where possible, by general application class. It also maintains information on the number of flows, packets and bytes transferred and includes timestamps for the beginning and end of each flow, enabling useful statistical information over time about average network throughput (in general, but more importantly end-point to end-point). The system also relates information about the individual flows back to institutions, including their location (with latitude and longitude) and relation to associated institutions, thus providing a rich capability for studying (and graphically presenting) useful utilization and performance metrics between partnering institutions.
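To make the flow bookkeeping concrete, here is a minimal Python sketch of how per-flow records with byte counts and start/end timestamps yield end-point to end-point throughput. It is not the MADAS schema: the `Flow` fields and the `throughput_mbps` and `pair_summary` helpers are hypothetical, and "40K bytes" is read, as an assumption, as 40,000 bytes.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

# Assumption: "40K bytes" is interpreted as 40,000 bytes.
MIN_FLOW_BYTES = 40_000


@dataclass(frozen=True)
class Flow:
    """Illustrative flow record (not the MADAS schema)."""
    src: str
    dst: str
    protocol: str
    packets: int
    bytes: int
    start: float  # epoch seconds
    end: float    # epoch seconds


def throughput_mbps(flow: Flow) -> Optional[float]:
    """Average throughput of one flow in megabits per second."""
    duration = flow.end - flow.start
    if duration <= 0:
        return None  # zero-length flows have no meaningful rate
    return flow.bytes * 8 / duration / 1e6


def pair_summary(flows: Iterable[Flow]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Aggregate non-trivial flows by (source, destination) pair."""
    totals: Dict[Tuple[str, str], Dict[str, float]] = defaultdict(
        lambda: {"flows": 0, "bytes": 0, "seconds": 0.0})
    for f in flows:
        if f.bytes <= MIN_FLOW_BYTES:
            continue  # skip trivial flows, as the text describes
        entry = totals[(f.src, f.dst)]
        entry["flows"] += 1
        entry["bytes"] += f.bytes
        entry["seconds"] += max(f.end - f.start, 0.0)
    return dict(totals)


if __name__ == "__main__":
    flows = [
        Flow("site-a", "site-b", "tcp", 1000, 1_000_000, 0.0, 8.0),
        Flow("site-a", "site-b", "tcp", 10, 10_000, 0.0, 1.0),
    ]
    print(throughput_mbps(flows[0]))
    print(pair_summary(flows))
```

Joining the end-point pairs to an institution table (with location) would then support the institution-level views described above.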
In expanding its use under GLORIAD, the PIs will work with the NLANR/MNA team on enabling live packet header analysis to lessen reliance on the Cisco NetFlow product, which may not be able to scale appropriately to anticipated future line speeds. GLORIAD will work with the San Diego NLANR/MNA team [L64] to deploy a passive monitor box (PMA [R62]) at the GLORIAD/StarLight facility for this purpose. Data will be continuously exported, as now, to the same relational database for subsequent analysis and processing. In addition, the system will feature a rules-based component enabling certain events (routing anomalies, suspicious traffic patterns, etc.) to trigger alarms (and post items to the GLORIAD trouble ticketing system) so that action can be taken. This is of special interest to the PIs, who will devote personal time to its development.
The immediate identification (and notification of network engineering teams) of certain network abuses, such as ICMP floods and distributed denial of service attacks (those that can be easily identified), is an important goal. However, the same system will also be useful for identifying routing anomalies: for example, when traffic is being routed between Moscow and St. Petersburg via Chicago or, more commonly, when unidirectional data flows are noted (i.e., flows for which there is not even a control stream in the other direction). Such flows typically indicate data transiting the high performance network in one direction and a commodity link in the other, which often leads to poor performance.
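The unidirectional-flow check described above reduces to a simple set lookup, sketched below in Python. The function name and the reduction of a flow to a (source, destination) pair are illustrative assumptions; a real rule would likely also match on ports and time windows.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def unidirectional_flows(flows: Iterable[Pair]) -> List[Pair]:
    """Return (src, dst) pairs seen on the monitored link with no traffic
    at all observed in the reverse direction, sorted for stable output."""
    seen = set(flows)
    return sorted((src, dst) for (src, dst) in seen if (dst, src) not in seen)


if __name__ == "__main__":
    observed = [
        ("moscow-host", "chicago-host"),
        ("chicago-host", "moscow-host"),    # reverse direction present
        ("spb-host", "chicago-host"),       # no return traffic observed
    ]
    for src, dst in unidirectional_flows(observed):
        print(f"possible asymmetric route: {src} -> {dst}")
```

A pair flagged this way is only a candidate: the return traffic may simply be taking a commodity path that the monitor does not see, which is exactly the condition worth investigating.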
Using the MonALISA [R63] system and basic MRTG tools, the GLORIAD team also will monitor all Layer 2 switched Ethernet services for basic link utilization rates. This information will be tied back to the to-be-deployed scheduling system so that information is maintained on both user requests and resulting performance (at least on the basic throughput realized). This will be supplemented by a periodic survey of users registered for the scheduling system to gauge a more subjective perception of quality. This information will be tracked over time using a database developed for this purpose.
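Basic link utilization of the kind mentioned above is commonly derived from two successive readings of an interface byte counter. The Python sketch below shows that arithmetic under stated assumptions: the function name, the 64-bit default counter width, and the single-wrap handling are illustrative and do not describe how MonALISA or MRTG are implemented.

```python
def utilization(prev_octets: int, curr_octets: int, interval_s: float,
                capacity_bps: float, counter_bits: int = 64) -> float:
    """Fraction of link capacity used between two counter readings.

    prev_octets / curr_octets: successive readings of a byte counter.
    interval_s: seconds between the two readings.
    capacity_bps: nominal link capacity in bits per second.
    counter_bits: counter width, used to correct a single wrap-around.
    """
    if interval_s <= 0 or capacity_bps <= 0:
        raise ValueError("interval and capacity must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:
        # Counter wrapped once between readings (assumed at most once).
        delta += 2 ** counter_bits
    return delta * 8 / interval_s / capacity_bps


if __name__ == "__main__":
    # 18.75 GB transferred in 5 minutes on a 1 Gb/s circuit.
    print(utilization(0, 18_750_000_000, 300, 1_000_000_000))
```

Recording this figure against each scheduled circuit request would give the request-versus-realized-throughput history the text calls for.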
The GLORIAD team is quite enthused about its new partnership with the MonALISA team, which has committed to extended service to provide comprehensive network measurements for the GLORIAD infrastructure, including passive real time traffic values on each segment (this is already in place on the Chicago-Moscow and Chicago-Beijing links [R64]); active filtering and monitoring of interesting traffic patterns based on destinations or protocols; non-intrusive active measurements for end-to-end performance; and a dynamic network topology map and connections with peer networks. GLORIAD will also work with the MonALISA team on developing prediction tools that provide both short term and long term predictions to assist the to-be-deployed resource reservation services, and to provide anomaly alerts (integrated with GLORIAD’s trouble ticketing system) and automatically make extra measurements to assist in troubleshooting.
The GLORIAD team is also working with the Packeteer [R65] company on evaluating one of their advanced PacketShaper [R31] products, which enables evaluation of applications-level performance across the network - monitoring for network and application utilization, application response times, network/server delay and congestion metrics [R67]. This product also has integrated some interesting security issue detection capabilities, as well as the ability to dynamically shape traffic flows based on identifiable application classes (used, for example, for rate-limiting “music sharing” on campus networks). The feature of most interest to the GLORIAD team is its applications level monitoring of the routed IP service.
Network performance (and the necessary issue of performance monitoring) has been a critical issue for the GLORIAD team for the 6 years it and its predecessor network have been in operation. The GLORIAD team will provide tools for reporting and resolving service problems as well as maintaining good histories of various performance and utilization metrics, will improve on its already deployed strategies for actively monitoring for performance issues, and will maintain, on the “GLORIAD classroom”, detailed information about various tools that GLORIAD users themselves can use for monitoring and evaluating network performance. These resources are all aimed at ensuring that GLORIAD’s users receive the performance they require.
Use of International links. The same basic usage policy will be in place for GLORIAD as is for Little GLORIAD now - US science and education networks will peer (for the L3 service) with the CSTnet and RBnet networks (and the networks they carry). All agree not to pass traffic from other networks without informing consortium members. The GLORIAD team will depend upon its extensive utilization monitoring system to watch for unexpected uses – for example commercial traffic leaks, traffic from regions of the world that should not be routed, etc. These are all issues that the current team is well equipped to handle based on its several years’ experience with its own developed monitoring system [R29]. As the project develops, and as the resources of GLORIAD are made available to other networks, the usage issue is likely to become more complex. One of the working groups to be established on the advisory board for GLORIAD will deal with allocation and use, and will help revise and broaden policies to meet new demands.
Shared Infrastructure Issues. GLORIAD will manage its own circuits across its entire ring, except for utilizing capacity offered by the TransLight program between Seattle and Chicago for Years 1 and 2 [L65]. GLORIAD is providing both routing and switching equipment in Seattle [L63] to minimize the necessity of bringing all Asian traffic to Chicago - in fact, much of that traffic will be dealt with in the Seattle facility. Up to 2 Gb Ethernets will be available to GLORIAD by agreement with the TransLight team (although typically this will be more than is necessary). By Year 3 of the project when it upgrades to 10G service, GLORIAD will procure its own service from Seattle to Chicago.
Project Management. GLORIAD is a complex program with many people, requirements and deadlines. The project plan, timeline, organization of the advisory board, working group assignments, etc. will be maintained on the web site at: http://www.gloriad.org/theplan/.


