Last update: 29/04/2005 19:30:00

About this article

First of all, I would like to apologize for my bad English. If you would like to correct this article, feel free to do so and send it back to me; I would be most grateful! (Update 15/04/2005: thanks to Daniel Grandjean, who submitted a corrected version!)

I am going to write about TCP/IP and general network matters, but I am not a TCP/IP expert and I will certainly misname or misspell some words and concepts. Please let me know if you read something absurd!

Purpose of this article

The purpose of this article is to explain what I think can only be called a bug or misconfiguration in the default TCP/IP settings of Microsoft Windows Server 2003 Standard Edition. With this article, I would like to raise awareness about this problem so that something can be done to correct it, either by finding a "hack" or by having Microsoft patch its OS.

Introduction

A few months back, I wanted to set up a Windows Server 2003 machine to stream high quality Windows Media Video (WMV) files. The videos were encoded using multi-stream technology, so they effectively contained 4 streams each: 250 kbit/s, 400 kbit/s, 700 kbit/s and 1400 kbit/s.

Everything was fine, except that on my 2 Mbit/s ADSL connection I wasn't able to watch the videos at the highest quality. At the beginning I blamed poor network conditions, bad network hardware and several other things, until the day I tested with a 20 Mbit/s ADSL connection. I realized that I couldn't get more than 2-3 Mbit/s on a standard HTTP transfer from the same server. It was not a problem with the client, because I could use the full 20 Mbit/s with several other servers in the same data center sharing the same connection.

It was then obvious that there was something wrong with Windows Server 2003, which I will refer to from now on as W2003.

Answers to Frequently Asked Questions

  • It was not an application-specific problem

The same behaviour was observed with an HTTP server, an FTP server, a video streaming server and Windows file-sharing (direct-hosted connection).

  • It was not a high-level protocol problem (HTTP, FTP, RTSP, SMB, ...)

Different applications used different protocols, so they were not to blame either.

  • It was not a server hardware problem

The server was tested with two Intel NICs and one 3Com NIC. Tests were also done on totally different servers.

  • No QoS was turned on

The QoS functions of W2003 were turned off, and there was no QoS or firewall equipment between the box and the network.

  • Network conditions were good

In more than 3 months, no significant packet loss or fragmentation was observed. The packets all matched the advertised MTU.

  • The wiring was OK

On the LAN (RTT < 1 ms), W2003 achieved 97 Mbit/s on average with a 100 Mbit/s NIC.

  • Disabling the TCP Nagle algorithm (no delay) didn't help.
  • Update 22/04/2005: Service Pack 1 doesn't help.
  • Update 29/04/2005: it is not caused by ADSL connections, since I have also tested with high-bandwidth symmetrical connections.

Testing conditions

Over the next few weeks, I conducted several other tests to isolate the problem.

All the tests showed that:

  • the only affected low-level protocol was TCP/IP
  • UDP and ICMP were not affected
  • bandwidth adds up if several files are downloaded simultaneously (with 2-3 downloads I can effectively use 2 Mbit/s and thus fill the available client-side bandwidth on my ADSL connection)
  • the higher the RTT, the lower the achieved bandwidth (a back-of-the-envelope check of these numbers follows the list):
      • available bandwidth 1 Mbit/s, RTT 60 ms -> 100% of the bandwidth used
      • available bandwidth 2 Mbit/s, RTT 60 ms -> ~80% of the bandwidth used
      • available bandwidth 2 Mbit/s, RTT 90 ms -> ~60% of the bandwidth used
      • available bandwidth 100 Mbit/s, RTT 13 ms -> ~15% of the bandwidth used
      • available bandwidth 100 Mbit/s, RTT <1 ms -> 100% of the bandwidth used
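
These figures look very much like what you would get if the server never kept more than a fixed amount of data "in flight" (sent but not yet acknowledged), whatever the RTT. As a back-of-the-envelope check, here is a small Python script of mine (not part of the original tests) that computes the throughput ceiling imposed by such a limit, roughly window / RTT, for the RTTs listed above. The window values are assumptions picked purely for illustration:

    # If a sender never keeps more than `window` bytes unacknowledged, its
    # throughput is capped at roughly window / RTT.
    # The window sizes below are assumptions, chosen only for illustration.

    RTTS_MS = [60, 90, 13, 1]            # RTTs from the list above
    WINDOWS_KB = [12, 16, 24, 32, 64]    # hypothetical in-flight limits

    for rtt_ms in RTTS_MS:
        for window_kb in WINDOWS_KB:
            bits = window_kb * 1024 * 8              # window size in bits
            mbit_per_s = bits / (rtt_ms / 1000.0) / 1e6
            print(f"RTT {rtt_ms:3d} ms, window {window_kb:2d} KB "
                  f"-> ceiling ~{mbit_per_s:.1f} Mbit/s")
        print()

If the limit were somewhere around 12-24 KB, the ceiling would be about 1.6 Mbit/s at 60 ms RTT with a 12 KB limit, about 1.1 Mbit/s at 90 ms, and about 15 Mbit/s at 13 ms with a 24 KB limit, all close to the percentages measured above, whereas a full 64 KB window would allow roughly 40 Mbit/s on the 13 ms path.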

"A Picture Is Worth A Thousand Words"

Here are two bandwidth graphs from Ethereal that show the bandwidth usage (bytes/s) from the server side. The two servers shared the same connection:

X-axis: seconds. Y-axis: bytes.

Linux server (Debian Sarge, kernel 2.6) :

Windows Server 2003 Standard Edition :

What does Microsoft have to say?

When I started realizing that it might be a bug, I decided to contact Microsoft and tell them about it. The only way I knew of doing that was through the "Professional Support" service, at $99 per incident.

So I exchanged one email a day with them for almost 3 weeks. I told them everything I wrote in this article and sent them many packet capture files from both the Linux server and the Windows server. Their final conclusion about the differences between Linux and Windows was:

"It seems that the Windows 2003 server delays the send out actions because the congestion window is reached. Clearly, the Linux uses a different congestion window mechanism. To prevent congestion on a WAN connection, congestion window is adopted to hold the send actions on the server side to prevent potential network congestions."

So they basically say that Windows slows down the packets to prevent "potential network congestion". Well, then it's not a bug, it's a feature, right?!

Update 29/04/2005: Even if it really is a feature, I would argue that the congestion avoidance algorithm used by Windows Server 2003 is far too pro-active: it avoids congestion even when no congestion is about to happen. Or maybe we have different definitions of "network congestion". Anyway, what I do know is that I get 2 Mbit/s from a Linux server, and with exactly the same network conditions I get a little more than 1 Mbit/s from Windows. Why can't Windows take advantage of the available bandwidth as well as Linux does?
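
To explain why the "it is just congestion avoidance" answer does not convince me: with the textbook algorithm (RFC 2581-style slow start and congestion avoidance), a sender that sees no packet loss keeps growing its congestion window every round trip until it can fill the receiver's advertised 64 KB window. The toy Python model below is my own illustration of that generic behaviour, not of whatever Windows Server 2003 actually implements; the initial window and slow-start threshold are arbitrary assumptions:

    # Toy model of textbook slow start + congestion avoidance (RFC 2581 style)
    # on a loss-free path. This does NOT model the actual Windows Server 2003
    # implementation; initial window and ssthresh are arbitrary assumptions.

    MSS = 1460            # typical Ethernet segment size in bytes
    RWND = 64 * 1024      # receive window advertised by the client (64 KB)
    SSTHRESH = 32 * 1024  # assumed slow-start threshold

    cwnd = 2 * MSS        # assumed initial congestion window
    rtts = 0
    while cwnd < RWND:
        rtts += 1
        if cwnd < SSTHRESH:
            cwnd *= 2      # slow start: double the window every round trip
        else:
            cwnd += MSS    # congestion avoidance: grow by one MSS per round trip
        print(f"after {rtts:2d} RTTs: cwnd ~ {min(cwnd, RWND) / 1024:.0f} KB")

In this model the window passes 30 KB after 4 round trips and reaches the 64 KB receive window after about 17, i.e. roughly one second at a 60 ms RTT. A connection that keeps only 20-30 KB outstanding for the whole duration of a transfer, with no packet loss, is therefore doing something much more conservative than the textbook algorithm.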

The Professional Support guy told me he would forward this case to the Product Team and "let them re-evaluate the current congestion window algorithm", but he didn't promise that the behaviour would change in the future, since "the current algorithm may have been considered as the safest one to prevent congestion".

Update 29/04/2005: The support guy told me today that "The product team thinks that the current algorithm is still the best method to avoid possible congestion on the network. There is currently no plan to change the behavior."

Trying to isolate the problem

I studied the packet captures and I think I was able to find the reason for this reduced-bandwidth problem. The server seems to wait for ACKs after sending roughly 20-30 KB of data, although the client has advertised a 64 KB TCP window. Only when the client sends the required ACKs does the server resume the stream of data. And it can take a full round trip between the moment the server stops sending data and the moment it receives the required ACKs. This explains why the actual throughput is so closely correlated with the RTT between client and server.

From my tests, I can say that this unexpected stall of the TCP stream is responsible for the reduced throughput. What I don't understand is why the server stops sending data when it still hasn't filled the client's receive window!?

Update 29/04/2005: As pointed out by several people after my comment on Slashdot, this behaviour is most likely related to the congestion avoidance mechanism.
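
For anyone who wants to verify this on their own captures: the sketch below is my own, written with the scapy Python library, and the capture file name and IP addresses are placeholders to fill in. It tracks how many bytes the server has sent that the client has not yet acknowledged; if the behaviour described above is what is happening, this figure should stall around 20-30 KB instead of climbing toward the 64 KB advertised window. Roughly the same picture can also be obtained from Ethereal's TCP stream graphs.

    # Estimate "bytes in flight" (sent by the server but not yet acknowledged
    # by the client) over time, from a packet capture, using scapy.
    # The capture file name and the addresses below are placeholders.
    # Sequence number wraparound is ignored; good enough for a short capture.

    from scapy.all import rdpcap, IP, TCP

    SERVER_IP = "192.0.2.1"   # hypothetical server address
    CLIENT_IP = "192.0.2.2"   # hypothetical client address

    highest_seq_sent = None   # highest (sequence number + payload) sent by the server
    highest_ack_seen = None   # highest ACK number received from the client

    for pkt in rdpcap("transfer.pcap"):
        if IP not in pkt or TCP not in pkt:
            continue
        tcp = pkt[TCP]
        if pkt[IP].src == SERVER_IP:
            end = tcp.seq + len(tcp.payload)
            if highest_seq_sent is None or end > highest_seq_sent:
                highest_seq_sent = end
        elif pkt[IP].src == CLIENT_IP and tcp.flags & 0x10:   # ACK flag set
            if highest_ack_seen is None or tcp.ack > highest_ack_seen:
                highest_ack_seen = tcp.ack
        if highest_seq_sent is not None and highest_ack_seen is not None:
            in_flight = highest_seq_sent - highest_ack_seen
            print(f"{float(pkt.time):10.3f}s  in flight: {in_flight} bytes")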

Conclusion

It definitely looks like a bug to me, but what strikes me most is that Google knows nothing about it! What about all the Windows Media streaming providers out there?! Even if they never served 1400 kbit/s streams, I cannot believe they never ran a bandwidth test on their servers.

Eventually, I came to accept the idea that Windows Server 2003, an OS designed for server tasks, is not able to fill a 2 Mbit/s ADSL connection. Yes, I know it sounds incredible, but I have been looking for another conclusion for the past 3 months, without success.

How can you help ?

By testing and reporting back to me if you experience the same problem on your Windows Server 2003. This will help speed up convincing Microsoft that they DO have a bug and MUST do something about it.

Contact the author

You can contact me at "spada AT zaup.org".