Tuning Linux kernel for GPFS

November 2005

Nadia Derbey


The objective of this paper is to define a profile, in terms of kernel tunables, for Linux nodes using a GPFS file system.
First, it lists the various recommendations that have been collected from the references cited below ([1], [2], [3]).
Since these recommendations all come from official IBM documentation, they are consistent with one another. Thus, they can all be collected into a single profile.

Reference [1] suggests the following (the sysctl names given here are the standard Linux names matching the descriptions in the source):

kernel parameter                     recommended value
ifconfig <interface> mtu 9000 up     (command, not a sysctl)
    MTU size for the communication adapter (if GPFS is configured over
    Gigabit Ethernet, in order to enable Jumbo Frames)
net.ipv4.tcp_window_scaling          1
    Enable TCP to negotiate the use of window scaling (> 64 KB buffers)
    with the other end during connection setup
net.core.rmem_max                    8388608 (8 MB)
    Maximum receive buffer size.
    Overrides the tcp_rmem maximum value if max(tcp_rmem) > rmem_max
net.core.wmem_max                    8388608 (8 MB)
    Maximum send buffer size.
    Overrides the tcp_wmem maximum value if max(tcp_wmem) > wmem_max
net.ipv4.tcp_wmem                    minimum size, default size, maximum size
    Memory sizes of the TCP send buffers
net.ipv4.tcp_rmem                    minimum size, default size, maximum size
    Memory sizes of the TCP receive buffers
net.core.netdev_max_backlog
    Maximum number of received packets that will be queued for
    processing before the input queue is considered congested
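The settings above can be applied with the sysctl utility. A minimal sketch follows; the sysctl names are the standard Linux ones inferred from the descriptions above, only the values explicitly given in reference [1] (window scaling and the 8 MB buffer maxima) are set, and the tcp_rmem/tcp_wmem triplets are left to the administrator since the source lists them only as minimum/default/maximum:

```shell
#!/bin/sh
# Sketch: apply the TCP tuning from reference [1] (must run as root).

# Enable TCP window scaling (windows > 64 KB)
sysctl -w net.ipv4.tcp_window_scaling=1

# Maximum receive and send socket buffer sizes: 8 MB
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608

# Jumbo Frames on the GPFS interface (Gigabit Ethernet only);
# <interface> is a placeholder for the actual adapter name:
# ifconfig <interface> mtu 9000 up
```

These commands take effect immediately but do not survive a reboot; the equivalent lines can be placed in /etc/sysctl.conf to make them persistent.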

Reference [2] suggests the following (SLES 9 specific):

On the SUSE LINUX ES 9 distribution, it is recommended that you adjust the vm.min_free_kbytes kernel tunable. This tunable controls the amount of free memory that the Linux kernel keeps available (i.e. not used in any kernel caches). When vm.min_free_kbytes is set to its default value, some configurations may encounter memory exhaustion symptoms even though free memory should in fact be available. Setting vm.min_free_kbytes to a higher value (the Linux sysctl utility can be used for this purpose), on the order of 5-6% of the total amount of physical memory, should help to avoid such a situation.

kernel parameter                     recommended value
vm.min_free_kbytes                   >= 5% of RAM, <= 6% of RAM
    Used to force the VM to keep a minimum number of kilobytes free.
    This number is used to compute the number of reserved free pages
    for each memory zone in the system.
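As a sketch of the sizing rule above (using the 5% lower bound from the table, and assuming a standard /proc/meminfo, whose MemTotal field reports physical memory in kB):

```shell
#!/bin/sh
# Sketch: derive ~5% of physical RAM (in kB) for vm.min_free_kbytes.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
min_free_kb=$((total_kb * 5 / 100))
echo "suggested vm.min_free_kbytes = $min_free_kb"

# Apply it (as root):
# sysctl -w vm.min_free_kbytes=$min_free_kb
```

For example, on a node with 16 GB of RAM (16777216 kB), this yields 838860 kB, i.e. roughly 800 MB kept free.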

Reference [3] suggests the following:

Some customers with large GPFS Linux-only clusters (128 or more nodes) have experienced occasional command lockups when starting or stopping the GPFS subsystem. This problem was identified as being caused by occasional failures of the Linux TCP layer to correctly handle listen queue overflows on passive TCP sockets. This behavior is detailed below, along with guidance on reducing the problem.

As a part of the normal GPFS startup procedure, many (or all) nodes in the cluster may try to communicate with the primary or secondary GPFS cluster data server node and attempt to execute a command on that node through a remote shell. This procedure involves establishing a connection to the server TCP socket created by the remote shell daemon on each of these nodes. If the listen queue of the server socket is not large enough, a queue overflow will occur if too many nodes try to initiate remote shell connections simultaneously.

In a large GPFS environment, prolonged listen queue overflows can present a substantial problem. If the listen queue on the remote shell server is kept full for an extended period of time, server-side TCP starts dropping incoming connection requests without notifying clients about its decision to drop them. This leaves some of the connections in a half-open state, thereby causing some of the remote shell client processes to hang.

Preventive steps that may reduce the likelihood that prolonged listen queue overflows will occur in your GPFS environment include:
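One mitigation consistent with the analysis above is to enlarge the kernel limits that govern TCP listen queues, and to verify whether overflows are actually occurring. This is a sketch under stated assumptions: the particular sysctl values below are illustrative, not taken from reference [3].

```shell
#!/bin/sh
# Sketch (run as root): enlarge the TCP listen queue limits.
# net.core.somaxconn caps the backlog a server may pass to listen();
# net.ipv4.tcp_max_syn_backlog bounds the queue of half-open
# (SYN_RECV) connections. Values here are illustrative.
sysctl -w net.core.somaxconn=1024
sysctl -w net.ipv4.tcp_max_syn_backlog=4096

# Check whether listen queue overflows have occurred
# (look for "listen queue" overflow / drop counters):
netstat -s | grep -i listen
```

Note that somaxconn only caps the backlog value an application passes to listen(); the remote shell daemon itself must also request a backlog large enough for the cluster size.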

End of Document