1. Symptoms
Today’s “patient” is a large securities company with 11 branch networks across the city. Engineers at the company’s information center described the symptoms as follows: for some time, the entire network had been suffering frequent interruptions during trading, sometimes two or three times in a single day. At first each interruption was brief and drew little attention; simple tests showed that the interruptions lasted from a few seconds to a dozen or so seconds, followed no clear pattern, and usually occurred while the market was open. The problem then kept worsening, and the interruptions grew more frequent.

Several experienced users complained to management that ordinarily straightforward online transactions had become unreliable: “When I press the transaction confirmation key, the computer does not respond at all, and I have no idea whether the transaction went through. I have to wait a while. I told my friends the trick is to wait about half a minute, and the computer will eventually display the transaction. It doesn’t work every time. This used to be rare, but recently it has happened to us several times; it seems to be getting worse day by day, and it’s making us quite nervous.”

Yesterday at 15:26, near the close of the afternoon session, the problem occurred again: market data display and updates were normal, but trading commands received no response, apart from a small number of transactions that still went through (a significant number remained blocked). All 11 branch trading networks reported the problem. The network administrators first suspected a fault in the central network and immediately checked the trading server in the computer center, but CPU utilization, protocol exchange, and packet exchange were all normal. Logging into the server again and running Ping tests also produced normal results. As closing time approached, the system was taken offline.

After the close, the trading network was kept running and the simulated trading function was started to diagnose the problem. A series of 40 simulated transactions run inside the computer center LAN all succeeded. Meanwhile, internal and external simulated transactions were run at three branch trading networks: internal transactions succeeded 100% of the time, but external transactions succeeded only about 15% of the time. It was now almost certain that the problem lay in the network itself. With the simulated trading running continuously, the computer center’s network management system was used to examine the working status of the network and the servers, and everything appeared normal. The switch ports connecting to the 11 branch networks were examined next; they carried traffic, but with occasional interruptions (about 3% of Ping tests went unanswered). The cable links between the switch and the server and network management machine on the same segment were checked with a DSP-100 cable analyzer, and no problems were found. This confirmed that the server’s network segment was working normally and shifted suspicion to a damaged switch port. The cable from the server segment was moved to another available port on the switch, the corresponding configuration was applied, and normal network operation was promptly restored; the issue disappeared. To everyone’s relief, no problems occurred during market hours the following day.
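The intermittent behavior of the branch-facing switch ports was quantified above as roughly 3% of Ping tests going unanswered. For illustration, a figure like that can be gathered with a simple scripted Ping loop such as the sketch below; it is only a minimal example, not the tool the engineers used, and the target address, probe count, and reliance on the Linux `ping` command are assumptions.

```python
import subprocess
import time

def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send a single ICMP echo request via the system ping command.

    Returns True if a reply was received. The '-c 1 -W <sec>' flags
    assume a Linux ping; adjust them for other platforms.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def measure_loss(host: str, probes: int = 200, interval_s: float = 0.5) -> float:
    """Probe `host` repeatedly and return the loss rate as a percentage."""
    lost = 0
    for _ in range(probes):
        if not ping_once(host):
            lost += 1
        time.sleep(interval_s)
    return 100.0 * lost / probes

if __name__ == "__main__":
    # Hypothetical address of a device behind the branch-facing switch port.
    target = "192.168.10.1"
    print(f"Loss to {target}: {measure_loss(target):.1f}%")
```

Run against several branch networks in turn, a loop like this makes an intermittent port stand out as a nonzero loss percentage even when individual Ping tests look normal.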
2. Diagnostic Process
At 19:50 that evening, the team arrived at the securities company and immediately started the system. The self-test showed no abnormalities, so the simulated trading system was started, and communication with the branch trading networks was normal. The network topology showed that each branch trading network was connected to the computer center’s local segment through DDN leased lines and routers. The MIBs of the routers were checked and showed no abnormalities or error records, and the MIBs of the switch ports were inspected one by one and likewise showed no anomalies or errors. The trading server and the network management machine were on the same network segment and connected to a switch port through an intelligent hub; the hub’s operation table showed normal data. An F683 network tester was attached to a hub port to monitor traffic continuously. The flow rate held at around 98%, suggesting the network was working normally and quite efficiently.

The problem was classified as a soft failure. It could be caused by hardware faults, application software problems, power supply equipment, external interference, or a combination of these factors. Since checks of the local network, logging into the server, and Ping tests all produced normal results, the problem was initially judged to be related to the hub’s network segment. To pinpoint it, simultaneous two-way traffic tests, channel performance tests, fault monitoring, ICMP Ping tests, and ICMP monitoring were run from a selected remote branch trading network and from the network management center. For ease of observation and comparison, the traffic frame length was set to 100 bytes, with a total load of approximately 30% (15% in each direction, around 10K each).

At 21:30, as expected, the fault occurred. ICMP Ping tests showed disruptions, and the working tables of the switch and router recorded pauses or interruptions in the data along with FCS frame errors. Opening the corresponding tables on the remote side showed the router receiving 17% of the traffic and the switch receiving only 2%, with ICMP Ping losses reaching 90%; ICMP monitoring showed an unreliability rate of approximately 97%. From the central site, the MIBs of the router and switch showed received traffic of only 0.5% to 0.9%. This indicated that data could reach the router but was not being forwarded smoothly through the switch’s port. Finally, the UPS output was checked with an F43 power harmonics tester and found to be within specification. The problem therefore had to lie with the switch.

Because the fault was intermittent, it was hypothesized that the switch’s fourth slot was at fault. To test this, the connection was moved to the spare port in the fifth slot, which immediately restored network operation. For confirmation, the connection was moved back to the fourth slot, and the problem did not reappear. In an effort to recreate it, the fourth slot was tapped with a wooden handle, and the fault appeared again, though only intermittently; even with continuous tapping it did not always show up. The circuit board in the fourth slot was then removed and examined, and a thick layer of black oxide was found on its connector pins. The pins were polished with #0000 fine sandpaper and cleaned with alcohol.
After the board was reinstalled, the problem was completely resolved and did not recur, even with further tapping. As a precaution, the other seven slots were inspected; none showed black oxide on their pins. Evidently a batch of substandard connectors had been used for the fourth slot during manufacturing, which made the switch itself a substandard unit. As a temporary measure, the available port in the fifth slot was used instead. The network kept running without downtime and was kept under continuous observation until the following day’s market close.
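The diagnosis above leaned on reading interface statistics (traffic levels and FCS error counts) from the router and switch MIBs at both ends of the link while the fault was active. As a rough illustration of that kind of polling, the sketch below samples standard IF-MIB counters by shelling out to the net-snmp `snmpget` utility; the management addresses, community string, and interface index are placeholders, and the engineers in this case used their network management system and an F683 tester rather than a script.

```python
import subprocess
import time

# Standard IF-MIB per-interface counters; the interface index passed in
# below is a placeholder and must match the port being monitored.
OIDS = {
    "ifInOctets": "IF-MIB::ifInOctets",
    "ifInErrors": "IF-MIB::ifInErrors",  # input errors, typically including FCS errors
}

def snmp_get(host: str, community: str, oid: str) -> int:
    """Fetch one counter value using the net-snmp snmpget CLI."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
        text=True,
    )
    return int(out.strip())

def poll_counters(host: str, community: str, if_index: int, interval_s: int = 10):
    """Print per-interval counter deltas so pauses and error bursts stand out."""
    prev = {name: snmp_get(host, community, f"{oid}.{if_index}")
            for name, oid in OIDS.items()}
    while True:
        time.sleep(interval_s)
        for name, oid in OIDS.items():
            cur = snmp_get(host, community, f"{oid}.{if_index}")
            print(f"{host} {name} delta over {interval_s}s: {cur - prev[name]}")
            prev[name] = cur

if __name__ == "__main__":
    # Hypothetical management address of the central switch, port in slot 4.
    poll_counters("192.168.1.2", "public", if_index=4)
```

Watching the per-interval deltas from the central site and a branch site side by side makes the asymmetry described above, with traffic reaching the router but barely passing the switch port, easy to spot.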
3. Conclusion
Network problems can be hardware failures, software failures, or a combination of both, and the symptoms alone are sometimes not enough to determine the root cause immediately. This problem was caused by a hardware fault: poor contact in the switch’s fourth slot. When equipment starts, its components run at lower temperatures and perform normally, but as they heat up and expand, the poor contact asserts itself and failures occur. As a result, the network only began to show problems a few hours after each market opening, once the equipment had warmed up; over time the failures became more frequent and lasted longer, moving from occasional to routine. Left alone, a fault like this often develops into a continuous hard failure, and hard failures are usually easier to diagnose.

Because the point of failure was on the side of the switch facing the central network, it was difficult to observe the working condition of the router and switch from the central site, and the problem was hard to pin down from the network management system alone. Monitoring the router and switch from the far side of the router was essential for a real-time assessment. Once the network showed clearly unbalanced traffic, combined with a 90% loss rate in the ICMP Ping tests and the ICMP monitoring results, the location of the fault became evident: it was indeed the switch.

Intermittent problems like this are referred to as “soft failures,” and they can be caused by software or hardware. They often require multi-point testing for accurate diagnosis, and portable test tools are essential for that work. The trend in network fault diagnosis is toward networked test tools and networked fault diagnostics. Although many network devices support limited network management functions, monitoring network performance and quickly locating faults requires a mix of fixed test tools (such as fixed probes and network management systems) and portable test tools (such as network testers and traffic analyzers). Keeping spares for critical network devices is essential, as it provides redundancy. Critical network equipment does not need to be the most expensive or feature-rich, but it should be reliable, well supported, and widely used, which simplifies technical support. Handing the maintenance of critical network equipment entirely to integrators or manufacturers is risky, since it surrenders control of the network’s fate to outside parties. It is therefore important to train in-house personnel properly and equip them with easy-to-use tools. This matters especially for junior and mid-level network maintenance technicians and engineers, who make up over 90% of the maintenance workforce but may be less familiar with hardware diagnostics.
4. Afterword
The network continued to operate normally for several consecutive days afterward. A newly purchased switch was put into service after it passed testing, and the original switch was kept as the backup device for the computer center.