1. Symptoms
It was the weekend, and I was getting ready to set off on a two-day trip to Huangshan with family and friends when I received an emergency call from the Network Hospital. I feared it would ruin my entire vacation. Sure enough, a bank had asked the hospital for help: the entire network in the West City area was paralyzed. Communication with the data center was severely disrupted; for reasons unknown, only the occasional transaction went through, and even those completed slowly. Since the data center's network management system was also down, there was no way to monitor the status of any network device.
2. Diagnostic Process
I hurriedly left my family and friends at the train station and headed for the bank's data center, talking with the head of the data center en route to understand the situation. The failure had occurred around 4:30 AM, about four hours earlier. The on-duty staff noticed an alarm in the network management system, and roughly 20 seconds later the network management server was essentially frozen. To investigate further, they restarted the system three times; each time, the network management server froze again within about 20 seconds. Both the main server and the network management server passed their self-tests.
Inquiries at the various branch offices showed that their internal networks were normal except for transaction processing, so the problem could be narrowed down to the computer systems at the data center. The data center ran HP's network management software, OpenView, but had no other network maintenance tools; once the network management system went down, the operations and maintenance staff were left with no effective means of troubleshooting.
The main servers for the East City and West City areas were connected through a switch on two separate segments, with the city's main settlement server on the same segment as the East City area. Observing the East City segment with an F683 network tester revealed abnormal traffic on port 4 of slot 3 of the Cisco Catalyst 5500 switch, the port connected to the segment holding the West City area's main server and the network management system. To examine that segment more closely, the F683 tester and a protocol analyzer (PI) were connected to it, revealing sustained traffic at 97% utilization, of which 98% were error frames. The error frames broke down into roughly 40% short frames, 58% long frames, and 2% frames with lengths between 50 and 60 bytes; the long frames ranged from 3,000 to 5,200 bytes. The tester reported the MAC addresses of the offending machines, but unfortunately the data center kept no MAC address inventory (only an IP address-to-hostname mapping table), so attempts to locate the corresponding machines by MAC address were unsuccessful.
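To put those numbers in context, a standard Ethernet frame must be at least 64 bytes and at most 1,518 bytes (untagged); both the 50-60 byte frames and the 3,000-5,200 byte frames reported by the tester therefore count as errors (runts and giants). The following is a minimal sketch, using hypothetical sample lengths rather than a real capture, of how frame lengths map onto those limits:

```python
# Classify Ethernet frame lengths against the standard limits.
# The sample lengths are hypothetical, chosen to mirror the sizes
# reported by the tester (50-60 byte runts, 3000-5200 byte giants).

ETH_MIN_FRAME = 64    # minimum legal Ethernet frame size in bytes
ETH_MAX_FRAME = 1518  # maximum legal (untagged) Ethernet frame size in bytes

def classify_frame(length: int) -> str:
    """Return 'runt', 'giant', or 'valid' for a frame length in bytes."""
    if length < ETH_MIN_FRAME:
        return "runt"    # short frame: below the 64-byte minimum
    if length > ETH_MAX_FRAME:
        return "giant"   # long frame: above the 1518-byte maximum
    return "valid"

if __name__ == "__main__":
    sample_lengths = [55, 60, 64, 512, 1518, 3000, 5200]  # hypothetical samples
    for length in sample_lengths:
        print(f"{length:5d} bytes -> {classify_frame(length)}")
```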
In an attempt to resolve the issue, we reinstalled the network card driver on the main server, but the problem persisted. When the server's port was checked with the F683 tester, the protocol was displayed as "Unknown." We then replaced the server's network card, reinstalled the driver with the correct settings, and rebooted the system, after which everything returned to normal.
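In this case the check was done with a hardware tester, but error counters can also be read from the host itself. Purely as an illustration, and assuming a Linux host (which the servers in this case may well not have been), a sketch like the one below reads per-interface error counters from /proc/net/dev; a rapidly climbing error count on an otherwise idle card is another hint of a failing NIC.

```python
# Minimal sketch: read per-interface packet and error counters from
# /proc/net/dev on a Linux host. Illustrative only; not the tool used
# in this case, and the servers were not necessarily Linux machines.

def read_interface_stats(path: str = "/proc/net/dev") -> dict:
    """Return {interface: {'rx_packets', 'rx_errs', 'tx_packets', 'tx_errs'}}."""
    stats = {}
    with open(path) as f:
        lines = f.readlines()[2:]  # skip the two header lines
    for line in lines:
        name, data = line.split(":", 1)
        fields = data.split()
        stats[name.strip()] = {
            "rx_packets": int(fields[1]),
            "rx_errs": int(fields[2]),
            "tx_packets": int(fields[9]),
            "tx_errs": int(fields[10]),
        }
    return stats

if __name__ == "__main__":
    for iface, s in read_interface_stats().items():
        print(f"{iface}: rx_errs={s['rx_errs']} tx_errs={s['tx_errs']}")
```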
3. Conclusion
The server's network card was damaged, so 98% of the frames it transmitted were errors and less than 1% of the data got through intact, which is why the occasional transaction could still complete. Excessively long frames effectively monopolize the shared medium, mainly slowing the network down or bringing it to a halt, while a flood of short frames disrupts the protocol processing of network devices and can cause them to fail (in practice, workstations are more sensitive to this). The network management server was overwhelmed and crashed about 20 seconds after it began receiving this high rate of error frames, which is why the network parameters could not be observed.
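A back-of-the-envelope calculation shows why oversized frames at near-saturation utilization starve other stations. Assuming, purely for illustration, a 10 Mbps shared segment (the actual segment speed is not stated above), a single 5,200-byte frame holds the medium for roughly 4.2 ms, about 80 times longer than a minimum 64-byte frame:

```python
# Back-of-the-envelope: how long one frame occupies a shared segment.
# The 10 Mbps link speed is an assumption for illustration; the case
# does not state the actual segment speed.

LINK_SPEED_BPS = 10_000_000  # assumed 10 Mbps shared segment

def frame_time_ms(frame_bytes: int, speed_bps: int = LINK_SPEED_BPS) -> float:
    """Milliseconds the medium is occupied transmitting one frame."""
    return frame_bytes * 8 / speed_bps * 1000

if __name__ == "__main__":
    for size in (64, 1518, 3000, 5200):  # minimum, maximum, and the observed giants
        print(f"{size:5d}-byte frame: {frame_time_ms(size):.3f} ms on the wire")
```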
Many devices check only some parameters during self-test (some parameters, especially certain physical ones, cannot be tested this way). In this case both the network management server and the main server passed their self-tests even though the network card's physical functions had in fact already failed: during the self-test the card still communicated normally with the operating system's protocol stack and maintained a minimal level of network activity on the strength of the less than 1% of frames that were still valid. Meanwhile, the other stations gradually failed under the "bombardment" of high-volume error frames.
4. Diagnostic Recommendations
Switches are effective for segmenting a network and isolating faults. Important devices such as the main server and the network management server should each use a dedicated switch port rather than sharing a hub with other devices; this makes it possible to isolate a faulty device quickly and reduces the losses caused by network downtime. Keeping up-to-date network topology diagrams also helps pinpoint the problem quickly when a switch failure occurs, improving the efficiency of maintenance work. Finally, MAC addresses are an essential part of network documentation: they greatly simplify not only troubleshooting faulty equipment but also the rapid identification of "malicious users," machines that have joined the network without authorization.
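As a starting point for such documentation, the sketch below shows one hypothetical way to build a MAC-to-IP table from a workstation's ARP cache (assuming the cache has already been populated, for example by pinging the subnet); the result could be kept alongside the existing IP-to-hostname list. The output filename is made up for the example.

```python
# Minimal sketch: build a MAC-to-IP table from the local ARP cache.
# Assumes the ARP cache already contains entries for the hosts of interest
# (e.g. after pinging the subnet). The output of `arp -a` differs slightly
# between Windows and Unix-like systems, so the regexes are kept loose.

import csv
import re
import subprocess

IP_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")
MAC_RE = re.compile(r"\b([0-9A-Fa-f]{2}(?:[:-][0-9A-Fa-f]{2}){5})\b")

def arp_table() -> list[tuple[str, str]]:
    """Return a list of (ip, mac) pairs parsed from `arp -a` output."""
    output = subprocess.run(["arp", "-a"], capture_output=True, text=True).stdout
    pairs = []
    for line in output.splitlines():
        ip, mac = IP_RE.search(line), MAC_RE.search(line)
        if ip and mac:
            pairs.append((ip.group(1), mac.group(1).lower()))
    return pairs

if __name__ == "__main__":
    # Write the table to a CSV file kept with the rest of the network
    # documentation (hypothetical filename).
    with open("mac_inventory.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ip_address", "mac_address"])
        writer.writerows(arp_table())
```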
5. Afterword
You probably wouldn't have guessed that just two hours later I was on another train bound for Huangshan, still in a good mood.