Наши партнеры

Книги по Linux (с отзывами читателей)

Библиотека сайта rus-linux.net

After this documentation was released in July 2003, I was approached by Prentice Hall and asked to write a book on the Linux VM under the Bruce Peren's Open Book Series.

The book is available and called simply "Understanding The Linux Virtual Memory Manager". There is a lot of additional material in the book that is not available here, including details on later 2.4 kernels, introductions to 2.6, a whole new chapter on the shared memory filesystem, coverage of TLB management, a lot more code commentary, countless other additions and clarifications and a CD with lots of cool stuff on it. This material (although now dated and lacking in comparison to the book) will remain available although I obviously encourge you to buy the book from your favourite book store :-) . As the book is under the Bruce Perens Open Book Series, it will be available 90 days after appearing on the book shelves which means it is not available right now. When it is available, it will be downloadable from http://www.phptr.com/perens so check there for more information.

To be fully clear, this webpage is not the actual book.

Next: 9.3 Objects Up: 9. Slab Allocator Previous: 9.1 Caches Contents Index

Subsections

9.2.1 Storing the Slab Descriptor
9.2.2 Slab Creation
9.2.3 Tracking Free Objects
9.2.4 Initialising the kmem_bufctl_t Array
9.2.5 Finding the Next Free Object
9.2.6 Updating kmem_bufctl_t
9.2.7 Calculating the Number of Objects on a Slab
9.2.8 Slab Destroying

9.2 Slabs

This section will describe how a slab is structured and managed. The struct which describes it is much simpler than the cache descriptor, but how the slab is arranged is considerably more complex. It is declared as follows:

typedef struct slab_s {
        struct list_head        list;
        unsigned long           colouroff;
        void                    *s_mem;
        unsigned int            inuse;
        kmem_bufctl_t           free;
} slab_t;

The fields in this simple struct are as follows:

: list This is the linked list the slab belongs to. This will be one of slab_full, slab_partial or slab_free from the cache manager;
: colouroff This is the colour offset from the base address of the first object within the slab. The address of the first object is s_mem + colouroff;
: s_mem This gives the starting address of the first object within the slab;
: inuse This gives the number of active objects in the slab;
: free This is an array of bufctls used for storing locations of free objects. See Section for further details.

The reader will note that given the slab manager or an object within the slab, there does not appear to be an obvious way to determine what slab or cache they belong to. This is addressed by using the list field in the struct page that makes up the cache. SET_PAGE_CACHE() and SET_PAGE_SLAB() use the next and prev fields on the page $\rightarrow$ list to track what cache and slab an object belongs to. To get the descriptors from the page, the macros GET_PAGE_CACHE() and GET_PAGE_SLAB() are available. This set of relationships is illustrated in Figure 9.7

**Figure 9.7:** Page to Cache and Slab Relationship
$\includegraphics[]{graphs/pageslabcache.ps}$

The last issue is where the slab management struct is kept. Slab managers are kept either on (CFLGS_OFF_SLAB set in the static flags) or off-slab. Where they are placed are determined by the size of the object during cache creation.

9.2.1 Storing the Slab Descriptor

If the objects are larger than a threshold (512 bytes on x86), CFGS_OFF_SLAB is set in the cache flags and the slab descriptor is kept off-slab in one of the sizes cache (see Section 9.4). The selected sizes cache is large enough to contain the struct slab_t and kmem_cache_slabmgmt() allocates from it as necessary. This limits the number of objects that can be stored on the slab because there is limited space for the bufctls but that is unimportant as the objects are large and so there should not be many stored in a single slab.

**Figure 9.8:** Slab With Descriptor On-Slab
$\includegraphics[width=17cm]{graphs/slablayout_onslab.ps}$

Alternatively, the slab manger is reserved at the beginning of the slab. When stored on-slab, enough space is kept at the beginning of the slab to store both the slab_t and the kmem_bufctl_t array. The array is responsible for tracking where the next free object is stored and is discussed later in the chapter. The objects are stored after the kmem_bufctl_t array.

Figure 9.8 should help clarify what a slab with the descriptor on-slab looks like and Figure 9.9 illustrates how a cache uses a sizes cache to store the slab descriptor when the descriptor is kept off-slab.

**Figure 9.9:** Slab With Descriptor Off-Slab
$\includegraphics[width=17cm]{graphs/slablayout_offslab.ps}$

9.2.2 Slab Creation

**Figure 9.10:** Call Graph: `kmem_cache_grow()`
$\includegraphics[width=13cm]{graphs/kmem_cache_grow.ps}$

At this point, we have seen how the cache is created, but on creation, it is an empty cache with empty lists for its slab_full, slab_partial and slabs_free. New slabs are allocated to a cache by calling the function kmem_cache_grow(). This is frequently called ``cache growing'' and occurs when no objects are left in the slabs_partial list and there are no slabs in slabs_free. The tasks it fulfills are

Perform basic sanity checks to guard against bad usage;
Calculate colour offset for objects in this slab;
Allocate memory for slab and acquire a slab descriptor;
Link the pages used for the slab to the slab and cache descriptors described in Section 9.2;
Initialise objects in the slab;
Add the slab to the cache.

9.2.3 Tracking Free Objects

The slab allocator has got to have a quick and simple means of tracking where free objects are on the partially filled slabs. It achieves this by using a kmem_bufctl_t array that is associated with each slab manager as obviously it is up to the slab manager to know where its free objects are.

Historically, and according to the paper describing the slab allocator [#!bonwick94!#], kmem_bufctl_t was a linked list of objects. In Linux 2.2.x, this struct was a union of three items, a pointer to the next free object, a pointer to the slab manager and a pointer to the object. Which it was depended on the state of the object.

Today, the slab and cache an object belongs to is determined by the struct page and kmem_bufctl_t is simply an integer array of object indices. The number of elements in the array is the same as the number of objects on the slab.

141 typedef unsigned int kmem_bufctl_t;

As the array is kept after the slab descriptor and there is no pointer to the first element directly, a helper macro slab_bufctl() is provided.

163 #define slab_bufctl(slabp) \
164         ((kmem_bufctl_t *)(((slab_t*)slabp)+1))

This seemingly cryptic macro is quite simple when broken down. The parameter slabp is a pointer to the slab manager. The block ((slab_t*)slabp)+1 casts slabp to a slab_t struct and adds 1 to it. This will give a slab_t * pointer to the beginning of the kmem_bufctl_t array. (kmem_bufctl_t *) recasts that pointer back to the required type. The results in blocks of code that contain slab_bufctl(slabp)[i]. Translated, that says ``take a pointer to a slab descriptor, offset it with slab_bufctl() to the beginning of the kmem_bufctl_t array and return the i element of the array''.

The index to the next free object in the slab is stored in slab_t $\rightarrow$ free eliminating the need for a linked list to track free objects. When objects are allocated or freed, this pointer is updated based on information in the kmem_bufctl_t array.

9.2.4 Initialising the `kmem_bufctl_t` Array

When a cache is grown, all the objects and the kmem_bufctl_t array on the slab are initialised. The array is filled with the index of each object beginning with 1 and ending with the marker BUFCTL_END. For a slab with 5 objects, the elements of the array would look like Figure 9.11.

**Figure 9.11:** Initialised `kmem_bufctl_t` Array
$\includegraphics[]{graphs/bufctl_init.ps}$

The value 0 is stored in slab_t $\rightarrow$ free as the 0 object is the first free object to be used. The idea is that for a given object n, the index of the next free object will be stored in kmem_bufctl_t[n]. Looking at the array above, the next object free after 0 is 1. After 1, there are two and so on. As the array is used, this arrangement will make the array act as a LIFO for free objects.

9.2.5 Finding the Next Free Object

kmem_cache_alloc() will call kmem_cache_alloc_one_tail() when allocating an object to perform the ``real'' work of updating the kmem_bufctl_t() array.

slab_t $\rightarrow$ free has the index of the first free object. The index of the next free object is at kmem_bufctl_t[slab_t $\rightarrow$ free]. In code terms, this looks like

1253         objp = slabp->s_mem + slabp->free*cachep->objsize;
1254         slabp->free=slab_bufctl(slabp)[slabp->free];

slabp $\rightarrow$ s_mem is the index of the first object on the slab. slabp $\rightarrow$ free is the index of the object to allocate and it has to be multiplied by the size of an object.

The index of the next free object is stored at kmem_bufctl_t[slabpfree]. There is no pointer directly to the array hence the helper macro slab_bufctl() is used. Note that the kmem_bufctl_t array is not changed during allocations but that the elements that are unallocated are unreachable. For example, after two allocations, index 0 and 1 of the kmem_bufctl_t array are not pointed to by any other element.

9.2.6 Updating kmem_bufctl_t

The kmem_bufctl_t list is only updated when an object is freed in the function kmem_cache_free_one(). The array is updated with this block of code:

1451         unsigned int objnr = (objp-slabp->s_mem)/cachep->objsize;
1452 
1453         slab_bufctl(slabp)[objnr] = slabp->free;
1454         slabp->free = objnr;

objp is the object about to be freed and objnr is its index. kmem_bufctl_t[objnr] is updated to point to the current value of slabp $\rightarrow$ free, effectively placing the object pointed to by free on the pseudo linked list. slabp $\rightarrow$ free is updated to the object being freed so that it will be the next one allocated.

9.2.7 Calculating the Number of Objects on a Slab

During cache creation, the function kmem_cache_estimate() is called to estimate how many objects may be stored on a single slab taking into account whether the slab descriptor must be stored on-slab or off-slab and the size of each kmem_bufctl_t needed to track if an object is free or not. It returns the number of objects that may be stored and how many bytes are wasted. The number of wasted bytes is important if cache colouring is to be used.

The calculation is quite basic and takes the following steps

Initialise wastage to be the total size of the slab, ${\mathrm{PAGE\_SIZE}}^{\mathrm{gfp\_order}}$ ;
Subtract the amount of space required to store the slab descriptor;
Count up the number of objects that may be stored. Include the size of the kmem_bufctl_t if the slab descriptor is stored on the slab. Keep increasing the size of i until the slab is filled;
Return the number of objects and bytes wasted.

9.2.8 Slab Destroying

When a cache is being shrunk or destroyed, the slabs will be deleted. As the objects may have destructors, these must be called, so the tasks of this function are:

If available, call the destructor for every object in the slab;
If debugging is enabled, check the red marking and poison pattern;
Free the pages the slab uses.

The call graph at Figure 9.12 is very simple.

**Figure 9.12:** Call Graph: `kmem_slab_destroy()`
$\includegraphics[width=7cm]{graphs/kmem_slab_destroy.ps}$

Next: 9.3 Objects Up: 9. Slab Allocator Previous: 9.1 Caches Contents Index

Mel 2004-02-15