Background
Users have reported significant delays when using VNC. After the VNC client connects to the VNC server, touch and click responses are occasionally delayed by 500 ms to 2 seconds. A Wireshark capture showed zero windows, out-of-order packets, and retransmissions. The users asked why the server retransmitted after 500 ms instead of performing a fast retransmission. At first glance this looks like a common fault, but closer analysis of the VNC TCP delay revealed some distinct and unusual behavior. In this article, I will walk through that analysis step by step.
The case is taken from the official Wireshark Q&A forum:
https://osqa-ask.wireshark.org/questions/24275/tcp-retransmission-with-a-delay-time-of-two-seconds
VNC TCP Delay Analysis
Basic Information
The basic information of the trace file is as follows:
$ capinfos 283-02-Capture2.pcapng
File name: 283-02-Capture2.pcapng
File type: Wireshark/... - pcapng
File encapsulation: Ethernet
File timestamp precision: microseconds (6)
Packet size limit: file hdr: (not set)
Number of packets: 352
File size: 205 kB
Data size: 194 kB
Capture duration: 2.402748 seconds
First packet time: 2013-09-02 14:55:19.372951
Last packet time: 2013-09-02 14:55:21.775699
Data byte rate: 80 kBps
Data bit rate: 646 kbps
Average packet size: 551.68 bytes
Average packet rate: 146 packets/s
SHA256: 4604d1adafb045024f33636bf269d4b82077a59ae53a78b0dd2e17db1693a30a
RIPEMD160: 24e50fff31cda5eace4150de0146346cc493e45a
SHA1: c6f9b15dc2a532489a8fc56bbf465ee0b6497ce6
Strict time order: True
Capture oper-sys: 64-bit Windows 7, build 7600
Capture application: Dumpcap 1.10.1 (SVN Rev 50926 from /trunk-1.10)
Number of interfaces in file: 1
Interface #0 info:
Name = \Device\NPF_{A9EF82F5-FA0D-49F4-AF52-1C2066D04340}
Encapsulation = Ethernet (1 - ether)
Capture length = 65535
Time precision = microseconds (6)
Time ticks per second = 1000000
Time resolution = 0x06
Operating system = 64-bit Windows 7, build 7600
Number of stat entries = 0
Number of packets = 352
The user filtered the capture well: it contains a single TCP stream (Stream 0), which makes troubleshooting straightforward. Wireshark's Expert Information confirms the symptoms the user reported, listing zero windows, out-of-order packets, and retransmissions.
The only drawback is that the TCP session is incomplete: tcp.completeness == 12, which means that it contains only Data and Ack packets, without the three-way handshake or any other packets.
For details on the TCP session completeness feature, see the previous article.
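As a quick illustration of how the value 12 breaks down, here is a minimal Python sketch that decodes the tcp.completeness bitmask. The bit values follow Wireshark's documented conversation completeness flags; the function and constant names are my own and purely illustrative.

```python
# Bit values used by Wireshark's tcp.completeness field.
COMPLETENESS_FLAGS = {
    1: "SYN",
    2: "SYN-ACK",
    4: "ACK",
    8: "DATA",
    16: "FIN",
    32: "RST",
}


def decode_completeness(value: int) -> list[str]:
    """Return the names of the conversation parts present in the bitmask."""
    return [name for bit, name in COMPLETENESS_FLAGS.items() if value & bit]


# The value reported for this trace: only ACK (4) and DATA (8) are set.
print(decode_completeness(12))
```

A complete conversation that ends with FIN would instead show 31 (1 + 2 + 4 + 8 + 16).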
In-Depth Analysis of VNC TCP Delay
First, at the beginning of the trace file, the client 192.168.0.66 advertises a zero window to the server 192.168.0.10, which Wireshark marks as [TCP ZeroWindow].
Although the Window Scale factor cannot be determined because the trace file lacks the three-way handshake, an advertised window of 0 means the client's receive window is full regardless of that factor.
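The reason the missing scale factor does not matter here can be shown with a small sketch. It assumes the standard Window Scale rule from RFC 7323 (the real window is the 16-bit window field shifted left by the negotiated shift count, at most 14); the function name is illustrative.

```python
def effective_window(raw_window: int, shift: int) -> int:
    """Scale the 16-bit window field by the Window Scale shift count."""
    if not 0 <= raw_window <= 0xFFFF:
        raise ValueError("the window field is 16 bits wide")
    if not 0 <= shift <= 14:
        raise ValueError("the shift count is limited to 14")
    return raw_window << shift


# Whatever shift was negotiated in the missing handshake, zero stays zero.
windows = {shift: effective_window(0, shift) for shift in range(15)}
print(all(w == 0 for w in windows.values()))
```

For any non-zero window field the real size would be unknown without the handshake, but a zero window is unambiguous.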
Client to server direction
Seen from the capture point, there appears to be packet loss and retransmission in this direction. According to Wireshark, a TCP segment with Seq Num 2711 and Len 6 is missing, and No.239 is marked as a TCP retransmission.
The first question is: is No.239 really a timeout retransmission? Keep that question in mind for now. The same pattern recurs later, as shown below: No.245 is flagged as following a lost segment and No.254 is flagged as a retransmission, while the client keeps advertising a zero window.
This is where the deduction becomes demanding. Is No.254 a TCP timeout retransmission? (Ignore question 1 about No.239 for now.)
1. First of all, why is it not a fast retransmission? The client and server are on the same LAN and the RTT is only 1.x milliseconds, but because of the way the application protocol interacts, the client rarely sends segments that carry data. An individual lost or out-of-order segment is therefore not followed by enough data to produce a run of DUP ACKs, so the client's fast retransmission cannot be triggered.
2. But is it really a TCP timeout retransmission? If No.254 were a timeout retransmission, the original packet would have been sent right after No.238. However, the interval up to No.254 is only 3.6 ms, far below the minimum RTO of 200 ms, so Wireshark's judgment is not accurate. (This may be because the TCP three-way handshake was not captured, so there is no IRTT value for reference.)
3. If it is neither a fast retransmission nor a timeout retransmission, is No.254 an out-of-order packet? By the usual reasoning it would be judged as such, because only reordering can make a segment with a smaller Seq Num appear later.
4. But is this really the case? In theory, the IP IDs of all packets from client 192.168.0.66 increase as they leave the source, so a packet with a smaller Seq Num should also have a smaller IP ID. If No.254 were the out-of-order packet assumed in step 3, its IP ID should fall after the IP ID of No.238 and before the IP ID of No.248. However, as shown in the figure below, this is not the case. The IP IDs of all packets from client 192.168.0.66, including No.254, follow the order of the packets in the capture, increasing from top to bottom. What does this indicate?
5. If the analysis in steps 1-4 is correct, this suggests a problem in the kernel protocol stack of client 192.168.0.66: segmentation is in order at the TCP level but the packets are out of order at the IP level, so a segment with a smaller TCP Seq Num ends up with a larger IP ID. However, such a conclusion is hard to believe for a system kernel protocol stack implementation…
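The timing check and the IP ID check used in the deduction above can be sketched in Python as follows. The 3.6 ms gap and the 200 ms minimum RTO come from the analysis; the frame tuples are hypothetical placeholders (not values read from the trace), and the sketch assumes, as the article does, that the sender increments one IP ID counter per packet.

```python
MIN_RTO_S = 0.200  # minimum RTO assumed in the analysis above


def could_be_rto_retransmission(gap_s: float, min_rto_s: float = MIN_RTO_S) -> bool:
    """A timeout retransmission cannot follow the original sooner than the minimum RTO."""
    return gap_s >= min_rto_s


def ip_id_follows_capture_order(ip_ids: list[int]) -> bool:
    """True if each IP ID is ahead of the previous one, allowing 16-bit wraparound."""
    return all(0 < ((b - a) & 0xFFFF) < 0x8000 for a, b in zip(ip_ids, ip_ids[1:]))


# The 3.6 ms gap is far too short for a timeout retransmission.
print(could_be_rto_retransmission(0.0036))

# Hypothetical (frame number, IP ID, TCP seq) tuples in capture order.
frames = [(238, 1001, 5000), (248, 1002, 5020), (254, 1003, 5010)]
ids_in_order = ip_id_follows_capture_order([f[1] for f in frames])
seqs_in_order = all(a[2] <= b[2] for a, b in zip(frames, frames[1:]))

# IP IDs increase with capture order while sequence numbers do not:
# the combination that makes plain network reordering hard to accept.
print(ids_in_order, seqs_in_order)
```

With genuine network reordering of packets from a single counter, one would expect the IP IDs to be out of order together with the sequence numbers.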
Based on the above, and from my own understanding and the scenarios I can think of (other possibilities are not excluded), there are four preliminary conclusions:
a. There is a problem in the client's kernel protocol stack, and the IP and TCP levels are out of order with each other; No.254 is the original, out-of-order packet.
b. The problem is still in the client's kernel protocol stack, but going back to step 2, the client has lowered the minimum RTO below 200 ms, so a timeout retransmission is triggered after all; No.254 is a timeout retransmission.
c. There is an intermediate device on the path between the client and the server that rewrites the IP ID field of the packets it forwards, so if reordering occurs, the pattern above results; No.254 is the original, out-of-order packet with its IP ID rewritten.
d. There is an intermediate device on the path between the client and the server that rewrites the IP ID field of the packets it forwards, so if a packet is lost, the pattern above also results; No.254 is a timeout retransmission with its IP ID rewritten.
Some readers may still remember question 1: what about No.239? Is it the same as No.254? It looks similar, but there is one difference in detail. Careful readers may have noticed the ACK Num of No.239 in the figure above: it is smaller than the ACK Num of the preceding packet. This means No.239 is not a timeout retransmission but an original packet. Because of this ACK Num problem, however, No.239 differs from No.254 in that it was not accepted normally by the server.
After that, triggered by DUP ACK, the client performed a real fast retransmission on No.277, and the server received and confirmed it normally.
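The ACK Num argument relies on the fact that a host's cumulative ACK number never goes backwards, so a packet built later should not carry a smaller ACK Num than one built earlier. Below is a minimal sketch of that check. The packet values are hypothetical (frame 3 plays the role of No.239), and sequence number wraparound is ignored for brevity.

```python
def stale_ack_frames(packets):
    """Yield frame numbers whose ACK number is below the highest ACK already sent.

    `packets` holds (frame_number, ack_number) pairs from one sender,
    listed in capture order.
    """
    highest = None
    for frame, ack in packets:
        if highest is not None and ack < highest:
            yield frame
        else:
            highest = ack


# Hypothetical ACK numbers; frame 3 carries an older ACK than frame 2.
packets = [(1, 9000), (2, 9500), (3, 9000), (4, 9600)]
stale = list(stale_ack_frames(packets))
print(stale)
```

A frame flagged this way was most likely built before the packets that precede it in the capture, which is consistent with an original packet that arrived late rather than a retransmission.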
Combined with the clues from packet No.239, the conclusion narrows down to two possibilities:
a. There is a problem in the client's kernel protocol stack, and the IP and TCP levels are out of order with each other; No.254 and No.239 are original, out-of-order packets.
b. There is an intermediate device on the path between the client and the server that rewrites the IP ID field of the packets it forwards. If reordering occurs, the pattern above results; No.254 and No.239 are original, out-of-order packets with rewritten IP IDs.
These conclusions are based on what the packets actually show, on my own understanding, and on step-by-step deduction after a good deal of thought. I cannot say that they are necessarily correct, and because this is a case from the Internet, the actual situation cannot be verified.
Server to client direction
Turning to the server's transmission direction, there are also some unusual points. After the client 192.168.0.66 sent ACK 135753, the server 192.168.0.10 sent 5 consecutive MSS-sized data packets. The client 192.168.0.66 did not acknowledge them and instead replied with 4 TCP ZeroWindow packets, because its receive window was 0 and it could not acknowledge the data.
Then the server 192.168.0.10 resends the 5 MSS-sized data packets. Wireshark marks the first 4 as out of order and the last one as a retransmission, but because the interval is so short, this classification is wrong. In theory, all 5 packets should be retransmissions. But are they timeout retransmissions or fast retransmissions?
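The claim that all 5 packets are retransmissions rests on a simple observation: the same sequence ranges are sent a second time. A minimal sketch of that check is shown below. The starting Seq Num 135753 comes from the trace; the MSS of 1460 and the frame numbers 1 to 10 are hypothetical placeholders.

```python
def find_resent_segments(segments):
    """Return (frame, first_frame) pairs for ranges that were already sent earlier.

    `segments` holds (frame_number, seq, length) tuples in capture order.
    """
    seen = {}
    resent = []
    for frame, seq, length in segments:
        key = (seq, length)
        if key in seen:
            resent.append((frame, seen[key]))
        else:
            seen[key] = frame
    return resent


MSS = 1460          # hypothetical; the handshake with the MSS option is not in the trace
FIRST_SEQ = 135753  # Seq Num of the first unacknowledged segment

# Two bursts of 5 full-sized segments covering the same sequence ranges.
first_burst = [(frame, FIRST_SEQ + i * MSS, MSS) for i, frame in enumerate(range(1, 6))]
second_burst = [(frame, FIRST_SEQ + i * MSS, MSS) for i, frame in enumerate(range(6, 11))]

resent = find_resent_segments(first_burst + second_burst)
print(resent)
```

This classification looks only at sequence ranges, not at timing, so it does not answer whether the retransmissions were triggered by a timeout or by duplicate ACKs.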
Without access to the original environment this cannot be verified further, so the possible outcomes of the analysis are mainly as follows:
- There is a detail to consider on the server side. After receiving the TCP ZeroWindow notification from the client, the server still retransmits (whether by timeout or by fast retransmission). In a standard protocol stack, data should not be sent during the zero window period, apart from zero-window probes. This raises a question: how does the server's protocol stack behave in this special situation? Does it keep sending packets regardless of the zero window as long as the conditions for fast or timeout retransmission are met, or does it stop sending in response to the zero window notification even when those conditions are met?
- Judging from the actual packets, the server continues to send during the zero window period. So, back to the previous question: is it a timeout retransmission or a fast retransmission?
- If it is a timeout retransmission, the result matches the derivation for the client above: if the minimum RTO of 200 ms has been lowered in the system kernel protocol stack, a timeout retransmission will be triggered.
- If it is a fast retransmission, the server is treating the TCP ZeroWindow packets as DUP ACKs and triggering a fast retransmission on that basis, which is also rather strange behavior.
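One way to reason about the fast retransmission branch is to apply the duplicate ACK definition from RFC 5681: the sender has outstanding data, the ACK carries no data, SYN and FIN are off, the ACK number equals the highest one received, and the advertised window equals that of the previous ACK. The sketch below applies that rule to a hypothetical sequence of ACKs. Whether the server's stack follows exactly this rule, and whether the ZeroWindow packets in the trace carried no data, are assumptions; the window value 1460 is made up.

```python
from dataclasses import dataclass

FAST_RETRANSMIT_THRESHOLD = 3  # duplicate ACKs needed to trigger fast retransmit


@dataclass
class Ack:
    ack: int              # acknowledgment number
    window: int           # advertised window
    payload_len: int = 0  # bytes of data carried by the segment
    syn: bool = False
    fin: bool = False


def is_duplicate_ack(prev: Ack, cur: Ack, snd_una: int, outstanding: bool) -> bool:
    """Apply the five duplicate ACK conditions from RFC 5681."""
    return (
        outstanding
        and cur.payload_len == 0
        and not (cur.syn or cur.fin)
        and cur.ack == snd_una
        and cur.window == prev.window
    )


def count_duplicates(acks, snd_una, outstanding=True):
    """Count ACKs that qualify as duplicates of their predecessor."""
    return sum(is_duplicate_ack(p, c, snd_una, outstanding) for p, c in zip(acks, acks[1:]))


# Hypothetical: one normal ACK, then four zero-window ACKs for the same Seq Num.
acks = [Ack(135753, 1460)] + [Ack(135753, 0) for _ in range(4)]
dup_count = count_duplicates(acks, snd_una=135753)
print(dup_count, dup_count >= FAST_RETRANSMIT_THRESHOLD)
```

Under these assumptions the first ZeroWindow packet does not count because it changes the window, but the following three do, which would be just enough to reach the usual threshold. This is only one possible reading and cannot be confirmed from the trace alone.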
Continuing the analysis: because the client advertises a zero window and cannot acknowledge the data, the server, having first sent Seq Num 135753 in No.240, retransmits it for the first time in No.249. Normally, the second retransmission of Seq Num 135753 would be expected at No.265, but that packet is missing and does not appear until No.297. Its IP ID still increases normally, and note that its ACK Num, 4209, has also advanced by this point.
Combining the symptoms and the deductions for both the client and the server, the final conclusion may be as follows: there is an intermediate device on the path between the client and the server that rewrites the IP ID field of the packets it forwards. When an unusual reordering occurs at the same time, retransmissions or spurious retransmissions result. Because the client is also advertising a zero window and cannot acknowledge data, the symptoms repeat until the client's window recovers, at which point the problem disappears.
Summary
Analyzing VNC TCP delay issues without a concrete environment is challenging. The case highlights complex interactions between packet loss, retransmissions, and zero window states. Further investigation is necessary to clarify the underlying causes and facilitate effective troubleshooting.