Libtune API design

Version 0.6
2005/08/22

Nadia Derbey
Nadia.Derbey@bull.net



1. Overview
2. Installing an application
    2.1. Checking the requirements
        2.1.1. Installation requirements
            2.1.1.1. Hardware requirements
            2.1.1.2. Distribution requirements
            2.1.1.3. Software requirements
            2.1.1.4. Communication requirements
        2.1.2. Hyper-threading support
        2.1.3. Memory requirements
        2.1.4. Swap requirements
        2.1.5. Disk requirements
        2.1.6. File system requirements
    2.2. Tuning the system-wide parameters
        2.2.1. Shared memory parameters
        2.2.2. Message queues parameters
        2.2.3. Semaphores parameters
        2.2.4. File system parameters
        2.2.5. Network parameters
        2.2.6. Memory parameters
    2.3. Setting the per-user limits
    2.4. Setting the per-process parameters
    2.5. Getting statistics
        2.5.1. System-wide statistics
        2.5.2. Per-process statistics
3. Summary
4. The tunables database
    4.1. The tunables database generation
    4.2. Accessing the database
    4.3. Portability issues
5. The libtune API
    5.1. Alternative
    5.2. Access rights and required privileges
    5.3. Getting information (tun_get)
        5.3.1. Parameters
        5.3.2. Returned values
    5.4. Setting information (tun_set)
        5.4.1. Parameters
        5.4.2. Returned values
    5.5. Locating information (tun_locate)
        5.5.1. Parameters
        5.5.2. Returned values
    5.6. Getting the key word for a location (tun_get_kwd)
        5.6.1. Parameters
        5.6.2. Returned values
    5.7. Getting help information (tun_help)
        5.7.1. Parameters
        5.7.2. Returned values
    5.8. Updating TUNDB_D (tun_update)
        5.8.1. Parameters
        5.8.2. Returned values
6. Deliverables





1. Overview

Accessing kernel tunables, system information and resource consumption is needed during the whole life cycle of an application, starting from its installation. This access is usually implemented through installation and supervision scripts. Unfortunately, the following issues have been identified:
  1. These scripts are rarely portable, since they get, set and change values that are represented by objects that may change from one distribution to another, or even from one release to the next within the same distribution.
  2. There are many ways of accessing the kernel configuration and tunables: procfs, sysfs, existing syscalls or library routines, etc.
This raises the need for a standard, well-defined API for manipulating the kernel configuration and tunables, on which software products can rely.
The goal of this design is to define a standard API that unifies the various ways Linux developers access kernel tunables, system information and resource consumption. The libtune API should be built on top of the existing mechanisms, instead of replacing them, in order to maintain backward compatibility. As noted above, this API will be useful during the whole life of an application.
In the following chapters, we first present what is done today when an application is installed and while it is running. Then, the libtune API is presented.

2. Installing an application

Generally, installing an application can be divided into the following four tasks:
  1. Check the requirements
  2. Tune the system-wide parameters (kernel, vm, file system, networking, etc)
  3. Set the per-user limits
  4. Set the per-process parameters
The checking part (1) can be considered a read-only task, while the tuning (2) and setting (3, 4) parts are fully write tasks (though a read step can be added in order to check that the parameters have been correctly set).
The following chapters detail the information that should be checked or set during an application installation, and describe where that information can be found on the machine. All the information has been collected in the following documents:

2.1. Checking the requirements

2.1.1. Installation requirements

2.1.1.1. Hardware requirements
This part describes the hardware the product is supported on, or has been tested on. We need to distinguish between the two, because the first can be considered mandatory, while the second is advisory only.

Requirement                            Where to get the information from
-----------                            ---------------------------------
CPU type the product is supported on   /proc/cpuinfo
CPU architecture (32 / 64 bits)
machine model
disk model                             /sys/bus/scsi/devices/0:0:0:0, for example
                                       (info extraction can be built upon libsysfs)
2.1.1.2. Distribution requirements
Requirement        Where to get the information from
-----------        ---------------------------------
kernel version     /proc/sys/kernel/osrelease: take the first 3 components
                   ($(VERSION).$(PATCHLEVEL).$(SUBLEVEL)); e.g. in
                   "2.6.9-1.667smp" only take "2.6.9"
distribution       /proc/version: this file contains the linux_banner string,
                   built as follows:
                     "Linux version " UTS_RELEASE " (" LINUX_COMPILE_BY "@" LINUX_COMPILE_HOST ") (" LINUX_COMPILER ") " UTS_VERSION "\n";
                   The LINUX_COMPILER variable (obtained by $(CC) -v) should
                   contain the distribution level (it is not known whether this
                   is always true):
                     gcc -v | tail -1
                     gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)
compiler version   /proc/version (see above)
2.1.1.3. Software requirements
2.1.1.4. Communication requirements

2.1.2. Hyper-threading support

Requirement                                           Where to get the information from
-----------                                           ---------------------------------
Number of sibling CPUs on the same physical CPU,
for architectures which use hyper-threading           /proc/cpuinfo (siblings)
Package id of a logical CPU (if hyper-threading)      /proc/cpuinfo (physical id)

2.1.3. Memory requirements

Requirement                            Expressed in   Where to get the information from
-----------                            ------------   ---------------------------------
Memory space                           KB or MB       /proc/meminfo (MemTotal)
Size of a huge page                    kB             /proc/meminfo (Hugepagesize)
Total number of huge pages             # of pages     /proc/meminfo (HugePages_Total)
Total number of available huge pages   # of pages     /proc/meminfo (HugePages_Free)

2.1.4. Swap requirements

Requirement   Expressed in                                  Where to get the information from
-----------   ------------                                  ---------------------------------
Swap space    KB or MB, or percentage of the memory space   /proc/meminfo (SwapTotal)

2.1.5. Disk requirements

Requirement                                       Where to get the information from
-----------                                       ---------------------------------
Free disk space                                   statfs()
Disk size limitation (max size supported for GPFS)   /proc/partitions

2.1.6. File system requirements

Max file system size

2.2. Tuning the system-wide parameters

2.2.1. Shared memory parameters

Parameter                                         Where to set / get the value
---------                                         ----------------------------
Maximum number of shm segment ids                 /proc/sys/kernel/shmmni
Maximum shm segment size (in bytes)               /proc/sys/kernel/shmmax
Maximum number of shm segment pages system-wide   /proc/sys/kernel/shmall
Minimum shm segment size (in bytes)               ipcs -lm / shmctl(IPC_INFO) (struct shminfo.shmmin)

2.2.2. Message queues parameters

Parameter                             Where to set the value
---------                             ----------------------
Maximum size of a message queue       /proc/sys/kernel/msgmnb
Maximum number of message queue ids   /proc/sys/kernel/msgmni

2.2.3. Semaphores parameters

Parameter                                                     Where to set the value
---------                                                     ----------------------
Max number of semaphore identifiers (semmni)                  /proc/sys/kernel/sem (4th field)
Max number of semaphores per id (semmsl)                      /proc/sys/kernel/sem (1st field)
Max number of semaphores in system (semmns = semmni * semmsl) /proc/sys/kernel/sem (2nd field)
Max number of operations per semop call (semopm)              /proc/sys/kernel/sem (3rd field)
All previous parameters together                              /proc/sys/kernel/sem

2.2.4. File system parameters

Parameter                        Where to set the value
---------                        ----------------------
Maximum number of file handles   /proc/sys/fs/file-max

2.2.5. Network parameters

Parameter                                                    Where to set the value
---------                                                    ----------------------
Maximum receive buffer size                                  /proc/sys/net/core/rmem_max
Maximum send buffer size                                     /proc/sys/net/core/wmem_max
Maximum number of received packets that will be
processed before resulting in congestion                     /proc/sys/net/core/netdev_max_backlog
Minimum / default / maximum memory size of the TCP
receive buffers                                              /proc/sys/net/ipv4/tcp_rmem
Minimum / default / maximum memory size of the TCP
send buffers                                                 /proc/sys/net/ipv4/tcp_wmem
Enable TCP to negotiate the use of window scaling
(> 64K buffers) with the other end during connection setup   /proc/sys/net/ipv4/tcp_window_scaling
Timeout for a FIN packet before the socket is forcibly
closed                                                       /proc/sys/net/ipv4/tcp_fin_timeout
Maximum number of queued connection requests which have
still not received an ACK from the connecting client         /proc/sys/net/ipv4/tcp_max_syn_backlog
Allow TIME-WAIT sockets to be reused for new connections     /proc/sys/net/ipv4/tcp_tw_reuse
Local port range that is used by TCP and UDP to choose
the local port                                               /proc/sys/net/ipv4/ip_local_port_range
MTU size                                                     ifconfig, or
                                                             /proc/sys/net/ipv6/conf/<interface>/mtu

2.2.6. Memory parameters

Parameter                                     Expressed in                      Where to set the value
---------                                     ------------                      ----------------------
Kernel policy for memory allocation           0, 1, 2                           /proc/sys/vm/overcommit_memory
Percentage of the memory that should be
added to the swap to determine the maximum
address space that is allowed to be
committed                                     percentage of physical memory     /proc/sys/vm/overcommit_ratio
Swappiness                                    [0, 100]                          /proc/sys/vm/swappiness
Number of configured huge pages               number of pages                   /proc/sys/vm/nr_hugepages

2.3. Setting the per-user limits

Parameter                                  Default value               How to set the value    How to set the value
                                                                       (command level)         (syscall level)
---------                                  -------------               --------------------    --------------------
Number of open file descriptors
(hard limit)                               _SC_OPEN_MAX                ulimit -Hn              setrlimit(RLIMIT_NOFILE)
  Note: should be set to less than /proc/sys/fs/file-max
Number of open file descriptors
(soft limit)                               _SC_OPEN_MAX                ulimit -Sn              setrlimit(RLIMIT_NOFILE)
Maximum number of processes (hard limit)   _SC_CHILD_MAX               ulimit -Hu              setrlimit(RLIMIT_NPROC)
Maximum number of processes (soft limit)   _SC_CHILD_MAX               ulimit -Su              setrlimit(RLIMIT_NPROC)
Maximum stack size (soft limit)            _STK_LIM                    ulimit -Ss              setrlimit(RLIMIT_STACK)
Maximum size of the data segment
(soft limit)                               unlimited (RLIM_INFINITY)   ulimit -Sd              setrlimit(RLIMIT_DATA)
Maximum size of virtual memory
(soft limit)                               (datalim / 1024L) +
                                           (stacklim / 1024L)          ulimit -Sv              setrlimit(RLIMIT_AS)

2.4. Setting the per-process parameters

Parameter                                          Where to set the value
---------                                          ----------------------
Base address for shared libraries (specific to
some distros, like RH)                             /proc/<pid>/mapped_base

2.5. Getting statistics

After the application installation has completed and while it is running, there is a need to get some system statistics about resource consumptions and utilization, in order to appropriately tune the kernel parameters.

2.5.1. System-wide statistics

These statistics are usually taken from /proc. Only the /proc entries that have not yet been presented above are described hereafter:

Parameter                    Where to get the value
---------                    ----------------------
Time spent in user mode      /proc/stat (column 1 + column 2)
Time spent in kernel mode    /proc/stat (column 3)
Time spent idle              /proc/stat (column 4)
Number of context switches   /proc/stat (ctxt)
Free memory                  /proc/meminfo (MemFree)
Total amount of swap space   /proc/meminfo (SwapTotal)
Free swap area               /proc/meminfo (SwapFree)
Network device statistics    /proc/net/dev

2.5.2. Per-process statistics

Parameter
Where to set the value
Status information about a given process
/proc/<pid>/stat

3. Summary

From what has been presented in the first chapters of this paper, libtune should be able to set and get information in the following ways:

4. The tunables database

We saw in the previous chapter that invoking the libtune API to set or get information is equivalent to one of:
Performing these operations is done with the support of what we will call the "tunables database" (TUNDB). This is actually an array, indexed by the data that will be accessed. Each entry of TUNDB is described as follows:

The following table shows an example of how the TUNDB is filled for each case presented in the previous chapter:

Each entry below gives, for one example, the characteristics of the underlying object: name scope, contents scope, attributes, location, field number or line delimiter, and strategy routine.

Data format: existing routine

  Example: min shm segment size
    name scope / contents scope:    system wide / N/A
    attributes:                     N/A
    location:                       N/A
    field number or line delimiter: N/A
    strategy routine:               strat_shmmin

Data format: entire file contents

  Example: max number of shm segment ids
    name scope / contents scope:    system wide / system wide
    attributes:                     ATTR_NAME_SYS_WIDE | ATTR_CONT_SYS_WIDE
    location:                       /proc/sys/kernel/shmmni
    field number or line delimiter: N/A
    strategy routine:               strat_file

  Example: shared libraries base address for a process
    name scope / contents scope:    per process / system wide
    attributes:                     ATTR_NAME_PER_PROC | ATTR_CONT_SYS_WIDE
    location:                       /proc/%d/mapped_base
    field number or line delimiter: N/A
    strategy routine:               strat_file

  Example: MTU size
    name scope / contents scope:    per network interface / system wide
    attributes:                     ATTR_NAME_PER_NETINT | ATTR_CONT_SYS_WIDE
    location:                       /proc/sys/net/ipv6/conf/%s/mtu
    field number or line delimiter: N/A
    strategy routine:               strat_file

Data format: single line

  Example: max number of semaphores per id
    name scope / contents scope:    system wide / system wide
    attributes:                     ATTR_NAME_SYS_WIDE | ATTR_CONT_SYS_WIDE
    location:                       /proc/sys/kernel/sem
    field number or line delimiter: 1
    strategy routine:               strat_sub_file_line

  Example: ppid of a process
    name scope / contents scope:    per process / system wide
    attributes:                     ATTR_NAME_PER_PROC | ATTR_CONT_SYS_WIDE
    location:                       /proc/%d/stat
    field number or line delimiter: 4
    strategy routine:               strat_sub_file_line

Data format: one line of data per single object

  Example: number of received packets for a network interface
    name scope / contents scope:    system wide / per network interface
    attributes:                     ATTR_NAME_SYS_WIDE | ATTR_CONT_PER_NETINT
    location:                       /proc/net/dev
    field number or line delimiter: 2
    strategy routine:               strat_sub_file_lines

Data format: several lines, each line containing a fixed string followed by the corresponding value

  Example: total swap space
    name scope / contents scope:    system wide / system wide
    attributes:                     ATTR_NAME_SYS_WIDE | ATTR_CONT_SYS_WIDE
    location:                       /proc/meminfo
    field number or line delimiter: "SwapTotal"
    strategy routine:               strat_sub_file_block

  Example: sleep average time of a process
    name scope / contents scope:    per process / system wide
    attributes:                     ATTR_NAME_PER_PROC | ATTR_CONT_SYS_WIDE
    location:                       /proc/%d/status
    field number or line delimiter: "SleepAVG"
    strategy routine:               strat_sub_file_block

Data format: block of data for each object

  Example: package id of a CPU
    name scope / contents scope:    system wide / per CPU
    attributes:                     ATTR_NAME_SYS_WIDE | ATTR_CONT_PER_CPU
    location:                       /proc/cpuinfo
    field number or line delimiter: "physical id"
    strategy routine:               strat_sub_file_block

Note: files with a variable location should be present only once in the database: it is not reasonable to create a new entry for the same pseudo-file associated with each different identifier. For example, in the case of the process-associated pseudo-files (/proc/<pid>/*), we would end up with at least PID_MAX * (around 20) files! This is why a printf-like format has been chosen for the file location.

4.1. The tunables database generation

The objects accessed through the TUNDB database presented in the previous chapter can be classified into the following 2 classes:
This classification leads to splitting the TUNDB database into the following "sub-databases":
The following discussion only concerns TUNDB_S6, the database that is to be dynamically generated. In order to generate TUNDB_S6, we have to choose between:
  1. making each newly registered pseudo-file under /sys or /proc register itself in TUNDB_S6.
    • procfs pseudo-files: create_proc_entry() can be changed to fill TUNDB_S6 with each newly created pseudo-file. And remove_proc_entry() can be changed to do the reverse operation.
    • sysfs pseudo-files: sysfs_create() can be changed to fill TUNDB_S6 with each newly created pseudo-file, and sysfs_hash_and_remove() can be changed to do the reverse operation.
    The advantage of this solution is that the data base is always in a state that reflects that of the underlying pseudo filesystems. Its drawback is that it requires a change at the kernel level.
  2. scanning the procfs and sysfs pseudo-filesystems in order to discover their tree structure and fill TUNDB_S6 with the set of pseudo-files. The advantage of this solution is that it can be developed completely outside the kernel. Its drawback is that there are cases where the key words will not perfectly reflect the pseudo-filesystem tree structure, even if the scanning is done periodically. Indeed, some drivers or modules generate their own files under /proc and may not be loaded at the time the scanning is done. This means that the corresponding generated files would not be manageable by libtune until a new scan is done. On the other hand, adding and removing files from the pseudo filesystems is not such a frequent operation, except for files associated with running processes. But these are files whose names will be present in TUNDB_S6 from the very beginning, in their variable format (/proc/%d/cmdline). So it seems reasonable to consider such a "latency" acceptable.
So the second solution is the one that will be kept. It will be implemented through a daemon that will periodically scan the pseudo filesystems and update TUNDB_S6 according to the results of its scan. Actually, TUNDB_S6 is split into:

The structure of TUNDB is shown in the following scheme:




The exchanges between the daemon and the library to update TUNDB_D are summarized in the following scheme:




To prevent the TUNDB_D contents from disappearing as soon as libtune is unloaded, they are actually stored in a file (/var/tuned/tundb_d) that is mapped by the library when needed. This file is initialized by a binary (libtuninit) after libtune has been installed.
When the tuned daemon is first started, it asks the library to map /var/tuned/tundb_d into TUNDB_D.
Then the daemon periodically scans the /proc and /sys pseudo filesystems (ignoring the /proc/<pid>/* files).
If it detects that one or more files have been added since it last ran, it asks the library to add the corresponding entries to TUNDB_D. If it detects that one or more files have been removed since it last ran, it asks the library to remove the corresponding entries from TUNDB_D.

4.2. Accessing the database

In order for an application to get or set system information, it should address an entry into TUNDB, through its index. This index enables the library to find all the needed information to get or set the addressed tunable (attributes, file location if any, associated strategy routines). Addressing a TUNDB entry is obvious for the static part (TUNDB_S1 to TUNDB_S6): the indexes can easily be defined as constants in a documented header file.
Then, since the static part of TUNDB is actually made of 6 distinct arrays, we need to convert the documented constant into an index into the appropriate array. This is done as follows:
  1. The maximum number of indexes into each TUNDB array is the same (TUNDB_MAX = 0x400).
  2. An array called TUNLIMITS contains the following information for each TUNDB array:
  3. When a predefined documented keyword is referenced:
    1. it is first divided by TUNDB_MAX. This gives us the array where the corresponding entry should be found.
    2. the value of the first index for this array is subtracted from the keyword value. This gives us the actual index into the array to access the needed information.
The following scheme summarizes the process that has just been described:



TUNDB_D, on its side, is the automatically generated part of TUNDB. For this array, each new index is an increment of the last existing index in TUNDB. Moreover, in order for applications to know which index to use for which file, a set of commands and API interfaces should be implemented to query information from the database (note that a help field exists for each entry in the database).

4.3. Portability issues

We saw in the overview of this document that the main problem for installation or supervision scripts is a portability issue:
That is why a set of database initialization files will be maintained in a tree structure sorted by distribution, then by release, then by architecture. Obviously, a single index into the database should give access to the same information across distributions, releases and architectures. The choice of the right source files is made at compilation time.

5. The libtune API

The API that comes out of the previous chapters is quite simple:
Since it is not POSIX compliant, this API is not intended to be integrated into the glibc: it will be a completely separate API.

5.1. Alternative

Given the limited number of actions during an installation, an alternative to the proposed API would be to define a single entry point per wanted action. For example: This would be feasible as long as the number of actions remains under a reasonable limit. But the problem with this kind of API is that it has to be extended each time a new action is needed. The tun_set() / tun_get() solution is the most generic one, which is why it is the one we kept: it can be used not only during application installation but also, for example, by a daemon in charge of periodically collecting statistics and adjusting the kernel parameters based on its observations.

5.2. Access rights and required privileges

The query interfaces are not restricted to specific users.

The interfaces used to get or set information, on the other hand, are subject to the same access rights as the underlying object:

5.3. Getting information (tun_get)

size_t tun_get(int keyword, void *identifier, char **out_buff, size_t *out_sz)

5.3.1. Parameters

This routine takes the following parameters:

5.3.2. Returned values

On success, tun_get() returns the number of characters read, including the terminating null character but not including the EOF character. This value can be used to handle embedded null characters in the data read.
On failure, tun_get() returns -1 and errno is set accordingly.

5.4. Setting information (tun_set)

size_t tun_set(int keyword, void *identifier, char *in_buff, size_t in_sz, char **out_buff, size_t *out_sz)

5.4.1. Parameters

This routine takes the following parameters:

5.4.2. Returned values

On success, tun_set() returns the number of output characters, including the terminating null character but not including the EOF character. This value can be used to handle embedded null characters in the data returned.
On failure, tun_set() returns -1 and errno is set accordingly.

5.5. Locating information (tun_locate)

This routine can be used, given an index into the TUNDB database, to locate the underlying pseudo-file. It is only meaningful for indexes that correspond to information managed through pseudo-files.

int tun_locate(int keyword, char **location, int *loc_sz)

5.5.1. Parameters

This routine takes the following parameters:

5.5.2. Returned values

5.6. Getting the key word for a location (tun_get_kwd)

This is the reverse of the preceding operation: given a location, it returns the associated TUNDB index (to be used, for example, in a set / get operation). Actually, since several indexes may share the same location in TUNDB, this routine returns a set of indexes. Example of entries that share the same underlying location:

int tun_get_kwd(char *location, int **keywords, int *nb_keywords)

5.6.1. Parameters

This routine takes the following parameters:

5.6.2. Returned values

5.7. Getting help information (tun_help)

This routine can be used, given an index into the TUNDB database, to return the corresponding help string.

int tun_help(int keyword, char **help, int *help_sz)

5.7.1. Parameters

This routine takes the following parameters:

5.7.2. Returned values

5.8. Updating TUNDB_D (tun_update)

This routine is for internal use only: it is meant to be called by the tuned daemon to initialize, update and clean the TUNDB_D array (the dynamic part of TUNDB).

int tun_update(int cmd, char *fname)

5.8.1. Parameters

This routine takes the following parameters:

5.8.2. Returned values

The returned values depend on the requested action, as follows:

cmd          Returned value on success                             Returned value on failure
---          -------------------------                             -------------------------
TUN_INIT     0                                                     -1
TUN_CLEAN    0                                                     0
TUN_ADD      >= 0 (associated keyword = D_IDX_FIRST + new index)   -1
TUN_REMOVE   0                                                     0

6. Deliverables

These are the remaining phases for the libtune API: