Libtune API design

Version 0.6
2005/08/22

Nadia Derbey
Nadia.Derbey@bull.net



1. Overview
2. Installing an application
    2.1. Checking the requirements
        2.1.1. Installation requirements
            2.1.1.1. Hardware requirements
            2.1.1.2. Distribution requirements
            2.1.1.3. Software requirements
            2.1.1.4. Communication requirements
        2.1.2. Hyper-threading support
        2.1.3. Memory requirements
        2.1.4. Swap requirements
        2.1.5. Disk requirements
        2.1.6. File system requirements
    2.2. Tuning the system-wide parameters
        2.2.1. Shared memory parameters
        2.2.2. Message queues parameters
        2.2.3. Semaphores parameters
        2.2.4. File system parameters
        2.2.5. Network parameters
        2.2.6. Memory parameters
    2.3. Setting the per-user limits
    2.4. Setting the per-process parameters
    2.5. Getting statistics
        2.5.1. System-wide statistics
        2.5.2. Per-process statistics
3. Summary
4. The tunables database
    4.1. The tunables database generation
    4.2. Accessing the database
    4.3. Portability issues
5. The libtune API
    5.1. Alternative
    5.2. Access rights and required privileges
    5.3. Getting information (tun_get)
        5.3.1. Parameters
        5.3.2. Returned values
    5.4. Setting information (tun_set)
        5.4.1. Parameters
        5.4.2. Returned values
    5.5. Locating information (tun_locate)
        5.5.1. Parameters
        5.5.2. Returned values
    5.6. Getting the key word for a location (tun_get_kwd)
        5.6.1. Parameters
        5.6.2. Returned values
    5.7. Getting help information (tun_help)
        5.7.1. Parameters
        5.7.2. Returned values
    5.8. Updating TUNDB_D (tun_update)
        5.8.1. Parameters
        5.8.2. Returned values
6. Deliverables





1. Overview

Accessing kernel tunables, system information and resource consumption is needed during the whole life cycle of an application, starting from its installation. This access is usually implemented through installation and supervision scripts. Unfortunately, the following issues have been identified:
  1. These scripts are rarely portable, since they get, set and change values that are represented by objects that may change from one distribution to another, or even from one release to the next within the same distribution.
  2. There are many ways of accessing the kernel configuration and tunables: procfs, sysfs, existing syscalls or library routines, etc.
This raises the need for a standard, well-defined API for manipulating the kernel configuration and tunables, on which software products can rely.
The goal of this design is to define a standard API that unifies the various ways Linux developers access kernel tunables, system information and resource consumption. The libtune API should be built on top of the existing mechanisms, instead of replacing them, in order to maintain backward compatibility. As noted above, this API will be useful during the whole life of an application.
In the following chapters, we first present what is done today when an application is installed and while it is running. Then, the libtune API is presented.

2. Installing an application

Generally, installing an application can be divided into the following four tasks:
  1. Check the requirements
  2. Tune the system-wide parameters (kernel, vm, file system, networking, etc)
  3. Set the per-user limits
  4. Set the per-process parameters
The checking part (1) can be considered a read-only task, while the tuning (2) and setting (3, 4) parts are fully write tasks (though a read step can be added in order to check that the parameters have been correctly set).
The following chapters detail the information that should be checked or set during an application installation, and describe where that information can be found on the machine. All the information has been collected in the following documents:

2.1. Checking the requirements

2.1.1. Installation requirements

2.1.1.1. Hardware requirements
This part describes the hardware the product is supported on, or has been tested on. We need to distinguish between the two, because the first can be considered mandatory, while the second is advisory only.

Requirement                            Where to get the information from
-----------                            ---------------------------------
CPU type the product is supported on   /proc/cpuinfo
CPU architecture (32 / 64 bits)
machine model
disk model                             /sys/bus/scsi/devices/0:0:0:0, for example
                                       (info extraction can be built upon libsysfs)
2.1.1.2. Distribution requirements
Requirement        Where to get the information from
-----------        ---------------------------------
kernel version     /proc/sys/kernel/osrelease: take the first 3 components
                   ($(VERSION).$(PATCHLEVEL).$(SUBLEVEL)); e.g. in
                   "2.6.9-1.667smp" only take "2.6.9"
distribution       /proc/version: this file contains the linux_banner string,
                   built as follows:
                     "Linux version " UTS_RELEASE " (" LINUX_COMPILE_BY "@" LINUX_COMPILE_HOST ") (" LINUX_COMPILER ") " UTS_VERSION "\n";
                   The LINUX_COMPILER variable (obtained by $(CC) -v) should
                   contain the distribution level (it is not known whether this
                   is always true):
                     gcc -v | tail -1
                     gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)
compiler version   /proc/version (see above)
2.1.1.3. Software requirements
2.1.1.4. Communication requirements

2.1.2. Hyper-threading support

Requirement                                           Where to get the information from
-----------                                           ---------------------------------
Number of sibling CPUs on the same physical CPU,
for architectures which use hyper-threading           /proc/cpuinfo (siblings)
Package id of a logical CPU (if hyper-threading)      /proc/cpuinfo (physical id)

2.1.3. Memory requirements

Requirement                            Expressed in   Where to get the information from
-----------                            ------------   ---------------------------------
Memory space                           KB or MB       /proc/meminfo (MemTotal)
Size of a huge page                    kB             /proc/meminfo (Hugepagesize)
Total number of huge pages             # of pages     /proc/meminfo (HugePages_Total)
Total number of available huge pages   # of pages     /proc/meminfo (HugePages_Free)

2.1.4. Swap requirements

Requirement   Expressed in                                  Where to get the information from
-----------   ------------                                  ---------------------------------
Swap space    KB or MB, or percentage of the memory space   /proc/meminfo (SwapTotal)

2.1.5. Disk requirements

Requirement                                       Where to get the information from
-----------                                       ---------------------------------
Free disk space                                   statfs()
Disk size limitation (max size supported for GPFS)   /proc/partitions

2.1.6. File system requirements

Max file system size

2.2. Tuning the system-wide parameters

2.2.1. Shared memory parameters

Parameter                                         Where to set / get the value
---------                                         ----------------------------
Maximum number of shm segment ids                 /proc/sys/kernel/shmmni
Maximum shm segment size (in bytes)               /proc/sys/kernel/shmmax
Maximum number of shm segment pages system-wide   /proc/sys/kernel/shmall
Minimum shm segment size (in bytes)               ipcs -lm / shmctl(IPC_INFO) (struct shminfo.shmmin)

2.2.2. Message queues parameters

Parameter                             Where to set the value
---------                             ----------------------
Maximum size of a message queue       /proc/sys/kernel/msgmnb
Maximum number of message queue ids   /proc/sys/kernel/msgmni

2.2.3. Semaphores parameters

Parameter                                                     Where to set the value
---------                                                     ----------------------
Max number of semaphore identifiers (semmni)                  /proc/sys/kernel/sem (4th field)
Max number of semaphores per id (semmsl)                      /proc/sys/kernel/sem (1st field)
Max number of semaphores in system (semmns = semmni * semmsl) /proc/sys/kernel/sem (2nd field)
Max number of operations per semop call (semopm)              /proc/sys/kernel/sem (3rd field)
All previous parameters together                              /proc/sys/kernel/sem

2.2.4. File system parameters

Parameter                        Where to set the value
---------                        ----------------------
Maximum number of file handles   /proc/sys/fs/file-max

2.2.5. Network parameters

Parameter                                                    Where to set the value
---------                                                    ----------------------
Maximum receive buffer size                                  /proc/sys/net/core/rmem_max
Maximum send buffer size                                     /proc/sys/net/core/wmem_max
Maximum number of received packets that will be
processed before resulting in congestion                     /proc/sys/net/core/netdev_max_backlog
Minimum / default / maximum memory size of the TCP
receive buffers                                              /proc/sys/net/ipv4/tcp_rmem
Minimum / default / maximum memory size of the TCP
send buffers                                                 /proc/sys/net/ipv4/tcp_wmem
Enable TCP to negotiate the use of window scaling
(> 64K buffers) with the other end during connection setup   /proc/sys/net/ipv4/tcp_window_scaling
Timeout for a FIN packet before the socket is forcibly
closed                                                       /proc/sys/net/ipv4/tcp_fin_timeout
Maximum number of queued connection requests which have
still not received an ACK from the connecting client         /proc/sys/net/ipv4/tcp_max_syn_backlog
Allow TIME-WAIT sockets to be reused for new connections     /proc/sys/net/ipv4/tcp_tw_reuse
Local port range that is used by TCP and UDP to choose
the local port                                               /proc/sys/net/ipv4/ip_local_port_range
MTU size                                                     ifconfig, or
                                                             /proc/sys/net/ipv6/conf/<interface>/mtu

2.2.6. Memory parameters

Parameter                                     Expressed in                      Where to set the value
---------                                     ------------                      ----------------------
Kernel policy for memory allocation           0, 1, 2                           /proc/sys/vm/overcommit_memory
Percentage of the memory that should be
added to the swap to determine the maximum
address space that is allowed to be
committed                                     percentage of physical memory     /proc/sys/vm/overcommit_ratio
Swappiness                                    [0, 100]                          /proc/sys/vm/swappiness
Number of configured huge pages               number of pages                   /proc/sys/vm/nr_hugepages

2.3. Setting the per-user limits

Parameter                                  Default value               How to set the value    How to set the value
                                                                       (command level)         (syscall level)
---------                                  -------------               --------------------    --------------------
Number of open file descriptors
(hard limit)                               _SC_OPEN_MAX                ulimit -Hn              setrlimit(RLIMIT_NOFILE)
  Note: should be set to less than /proc/sys/fs/file-max
Number of open file descriptors
(soft limit)                               _SC_OPEN_MAX                ulimit -Sn              setrlimit(RLIMIT_NOFILE)
Maximum number of processes (hard limit)   _SC_CHILD_MAX               ulimit -Hu              setrlimit(RLIMIT_NPROC)
Maximum number of processes (soft limit)   _SC_CHILD_MAX               ulimit -Su              setrlimit(RLIMIT_NPROC)
Maximum stack size (soft limit)            _STK_LIM                    ulimit -Ss              setrlimit(RLIMIT_STACK)
Maximum size of the data segment
(soft limit)                               unlimited (RLIM_INFINITY)   ulimit -Sd              setrlimit(RLIMIT_DATA)
Maximum size of virtual memory
(soft limit)                               (datalim / 1024L) +
                                           (stacklim / 1024L)          ulimit -Sv              setrlimit(RLIMIT_AS)

2.4. Setting the per-process parameters

Parameter                                          Where to set the value
---------                                          ----------------------
Base address for shared libraries (specific to
some distros, like RH)                             /proc/<pid>/mapped_base

2.5. Getting statistics

After the application installation has completed and while it is running, there is a need to get some system statistics about resource consumptions and utilization, in order to appropriately tune the kernel parameters.

2.5.1. System-wide statistics

These statistics are usually taken from /proc. Only the /proc entries that have not yet been presented above are described hereafter:

Parameter                    Where to get the value
---------                    ----------------------
Time spent in user mode      /proc/stat (column 1 + column 2)
Time spent in kernel mode    /proc/stat (column 3)
Time spent idle              /proc/stat (column 4)
Number of context switches   /proc/stat (ctxt)
Free memory                  /proc/meminfo (MemFree)
Total amount of swap space   /proc/meminfo (SwapTotal)
Free swap area               /proc/meminfo (SwapFree)
Network device statistics    /proc/net/dev

2.5.2. Per-process statistics

Parameter
Where to set the value
Status information about a given process
/proc/<pid>/stat

3. Summary

From what has been presented in the first chapters of this paper, libtune should be able to set and get information in the following ways:

4. The tunables database

We saw in the previous chapter that invoking the libtune API to set or get information is equivalent to one of:
Performing these operations is done with the support of what we will call the "tunables database" (TUNDB). This is actually an array, indexed by the data that will be accessed. Each entry of TUNDB is described as follows:

The following table shows an example of how the TUNDB is filled for each case presented in the previous chapter:

Each entry below gives, for one example, the characteristics of the underlying object: name scope, contents scope, attributes, location, field number or line delimiter, and strategy routine.

Data format: existing routine

  Example: min shm segment size
    name scope / contents scope:    system wide / N/A
    attributes:                     N/A
    location:                       N/A
    field number or line delimiter: N/A
    strategy routine:               strat_shmmin

Data format: entire file contents

  Example: max number of shm segment ids
    name scope / contents scope:    system wide / system wide
    attributes:                     ATTR_NAME_SYS_WIDE | ATTR_CONT_SYS_WIDE
    location:                       /proc/sys/kernel/shmmni
    field number or line delimiter: N/A
    strategy routine:               strat_file

  Example: shared libraries base address for a process
    name scope / contents scope:    per process / system wide
    attributes:                     ATTR_NAME_PER_PROC | ATTR_CONT_SYS_WIDE
    location:                       /proc/%d/mapped_base
    field number or line delimiter: N/A
    strategy routine:               strat_file

  Example: MTU size
    name scope / contents scope:    per network interface / system wide
    attributes:                     ATTR_NAME_PER_NETINT | ATTR_CONT_SYS_WIDE
    location:                       /proc/sys/net/ipv6/conf/%s/mtu
    field number or line delimiter: N/A
    strategy routine:               strat_file

Data format: single line

  Example: max number of semaphores per id
    name scope / contents scope:    system wide / system wide
    attributes:                     ATTR_NAME_SYS_WIDE | ATTR_CONT_SYS_WIDE
    location:                       /proc/sys/kernel/sem
    field number or line delimiter: 1
    strategy routine:               strat_sub_file_line

  Example: ppid of a process
    name scope / contents scope:    per process / system wide
    attributes:                     ATTR_NAME_PER_PROC | ATTR_CONT_SYS_WIDE
    location:                       /proc/%d/stat
    field number or line delimiter: 4
    strategy routine:               strat_sub_file_line

Data format: one line of data per single object

  Example: number of received packets for a network interface
    name scope / contents scope:    system wide / per network interface
    attributes:                     ATTR_NAME_SYS_WIDE | ATTR_CONT_PER_NETINT
    location:                       /proc/net/dev
    field number or line delimiter: 2
    strategy routine:               strat_sub_file_lines

Data format: several lines, each line containing a fixed string followed by the corresponding value

  Example: total swap space
    name scope / contents scope:    system wide / system wide
    attributes:                     ATTR_NAME_SYS_WIDE | ATTR_CONT_SYS_WIDE
    location:                       /proc/meminfo
    field number or line delimiter: "SwapTotal"
    strategy routine:               strat_sub_file_block

  Example: sleep average time of a process
    name scope / contents scope:    per process / system wide
    attributes:                     ATTR_NAME_PER_PROC | ATTR_CONT_SYS_WIDE
    location:                       /proc/%d/status
    field number or line delimiter: "SleepAVG"
    strategy routine:               strat_sub_file_block

Data format: block of data for each object

  Example: package id of a CPU
    name scope / contents scope:    system wide / per CPU
    attributes:                     ATTR_NAME_SYS_WIDE | ATTR_CONT_PER_CPU
    location:                       /proc/cpuinfo
    field number or line delimiter: "physical id"
    strategy routine:               strat_sub_file_block

Note: files with a variable location should be present only once in the database: it is not reasonable to create a new entry for the same pseudo-file associated with each different identifier. For example, in the case of the process-associated pseudo-files (/proc/<pid>/*), we would end up with at least PID_MAX * (around 20) files! This is why a printf-like format has been chosen for the file location.

4.1. The tunables database generation

The objects accessed through the TUNDB database presented in the previous chapter can be classified into the following 2 classes:
This classification leads to splitting the TUNDB database into the following "sub-databases":
The following discussion only concerns TUNDB_S6, the database that is to be dynamically generated. In order to generate TUNDB_S6, we have to choose between:
  1. making each newly registered pseudo-file under /sys or /proc register itself in TUNDB_S6.
    • procfs pseudo-files: create_proc_entry() can be changed to fill TUNDB_S6 with each newly created pseudo-file. And remove_proc_entry() can be changed to do the reverse operation.
    • sysfs pseudo-files: sysfs_create() can be changed to fill TUNDB_S6 with each newly created pseudo-file, and sysfs_hash_and_remove() can be changed to do the reverse operation.
    The advantage of this solution is that the data base is always in a state that reflects that of the underlying pseudo filesystems. Its drawback is that it requires a change at the kernel level.
  2. scanning the procfs and sysfs pseudo-filesystems in order to discover their tree structure and fill TUNDB_S6 with the set of pseudo-files. The advantage of this solution is that it can be developed completely outside the kernel. Its drawback is that there are cases where the key words will not perfectly reflect the pseudo-filesystem tree structure, even if the scanning is done periodically. Indeed, some drivers or modules generate their own files under /proc and may not be loaded at the time the scanning is done. This means that the corresponding generated files would not be manageable by libtune until a new scan is done. On the other hand, adding and removing files from the pseudo filesystems is not such a frequent operation, except for files associated with running processes. But these are files whose names will be present in TUNDB_S6 from the very beginning, in their variable format (/proc/%d/cmdline). So it seems reasonable to consider such a "latency" acceptable.
So the second solution is the one that will be kept. It will be implemented through a daemon that will periodically scan the pseudo filesystems and update TUNDB_S6 according to the results of its scan. Actually, TUNDB_S6 is split into:

The structure of TUNDB is shown in the following scheme:




The exchanges between the daemon and the library to update TUNDB_D are summarized in the following scheme:




To prevent the TUNDB_D contents from disappearing as soon as libtune is unloaded, they are actually stored in a file (/var/tuned/tundb_d) that is mapped by the library when needed. This file is initialized by a binary (libtuninit) after libtune has been installed.
When the tuned daemon is first started, it asks the library to map /var/tuned/tundb_d into TUNDB_D.
Then the daemon periodically scans the /proc and /sys pseudo filesystems (ignoring the /proc/<pid>/* files).
If it detects that one or more files have been added since it last ran, it asks the library to add the corresponding entries to TUNDB_D. If it detects that one or more files have been removed since it last ran, it asks the library to remove the corresponding entries from TUNDB_D.

4.2. Accessing the database

In order for an application to get or set system information, it should address an entry into TUNDB, through its index. This index enables the library to find all the needed information to get or set the addressed tunable (attributes, file location if any, associated strategy routines). Addressing a TUNDB entry is obvious for the static part (TUNDB_S1 to TUNDB_S6): the indexes can easily be defined as constants in a documented header file.
Then, since the static part of TUNDB is actually made of 6 distinct arrays, we need to convert the documented constant into an index into the appropriate array. This is done as follows:
  1. The maximum number of indexes into each TUNDB array is the same (TUNDB_MAX = 0x400).
  2. An array called TUNLIMITS contains the following information for each TUNDB array:
  3. When a predefined documented keyword is referenced:
    1. it is first divided by TUNDB_MAX. This gives us the array where the corresponding entry should be found.
    2. the value of the first index for this array is subtracted from the keyword value. This gives us the actual index into the array to access the needed information.
The following scheme summarizes the process that has just been described:



TUNDB_D, on its side, is the automatically generated part of TUNDB. For this array, each new index is an increment of the last existing index in TUNDB. Moreover, in order for applications to know which index to use for which file, a set of commands and API interfaces should be implemented to query information from the database (note that a help field exists for each entry in the database).

4.3. Portability issues

We saw in the overview of this document that the main problem for installation or supervision scripts is a portability issue:
That is why a set of database initialization files will be maintained in a tree structure sorted by distribution, then by release, then by architecture. Obviously, a single index into the database should give access to the same information across distributions, releases and architectures. The choice of the right source files is made at compilation time.

5. The libtune API

The API that comes out of the previous chapters is quite simple:
Since it is not POSIX compliant, this API is not intended to be integrated into the glibc: it will be a completely separate API.

5.1. Alternative

Given the limited number of actions during an installation, an alternative to the proposed API would be to define a single entry point per wanted action. For example: This would be feasible as long as the number of actions remains under a reasonable limit. But the problem with this kind of API is that it has to be extended each time a new action is needed. The tun_set() / tun_get() solution is the most generic one, which is why it is the one we kept: it can be used not only during application installation but also, for example, by a daemon in charge of periodically collecting statistics and adjusting the kernel parameters based on its observations.

5.2. Access rights and required privileges

The query interfaces are not restricted to specific users.

The interfaces used to get or set information, on the other hand, are subject to the same access rights as the underlying object:

5.3. Getting information (tun_get)

size_t tun_get(int keyword, void *identifier, char **out_buff, size_t *out_sz)

5.3.1. Parameters

This routine takes the following parameters:

5.3.2. Returned values

On success, tun_get() returns the number of characters read, including the terminating null character but not including the EOF character. This value can be used to handle embedded null characters in the data read.
On failure, tun_get() returns -1 and errno is set accordingly.

5.4. Setting information (tun_set)

size_t tun_set(int keyword, void *identifier, char *in_buff, size_t in_sz, char **out_buff, size_t *out_sz)

5.4.1. Parameters

This routine takes the following parameters:

5.4.2. Returned values

On success, tun_set() returns the number of output characters, including the terminating null character but not including the EOF character. This value can be used to handle embedded null characters in the data returned.
On failure, tun_set() returns -1 and errno is set accordingly.

5.5. Locating information (tun_locate)

This routine can be used, given an index into the TUNDB database, to locate the underlying pseudo-file. It is only meaningful for indexes that correspond to information managed through pseudo-files.

int tun_locate(int keyword, char **location, int *loc_sz)

5.5.1. Parameters

This routine takes the following parameters:

5.5.2. Returned values

5.6. Getting the key word for a location (tun_get_kwd)

This is the reverse of the preceding operation: given a location, it returns the associated TUNDB index (to be used, for example, in a set / get operation). Actually, since several indexes may share the same location in TUNDB, this routine returns a set of indexes. Example of entries that share the same underlying location:

int tun_get_kwd(char *location, int **keywords, int *nb_keywords)

5.6.1. Parameters

This routine takes the following parameters:

5.6.2. Returned values

5.7. Getting help information (tun_help)

This routine can be used, given an index into the TUNDB database, to return the corresponding help string.

int tun_help(int keyword, char **help, int *help_sz)

5.7.1. Parameters

This routine takes the following parameters:

5.7.2. Returned values

5.8. Updating TUNDB_D (tun_update)

This routine is for internal use only: it is meant to be called by the tuned daemon to initialize, update and clean the TUNDB_D array (the dynamic part of TUNDB).

int tun_update(int cmd, char *fname)

5.8.1. Parameters

This routine takes the following parameters:

5.8.2. Returned values

The returned values depend on the requested action, as follows:

cmd          Returned value on success                             Returned value on failure
---          -------------------------                             -------------------------
TUN_INIT     0                                                     -1
TUN_CLEAN    0                                                     0
TUN_ADD      >= 0 (associated keyword = D_IDX_FIRST + new index)   -1
TUN_REMOVE   0                                                     0

6. Deliverables

These are the remaining phases for the libtune API: