Библиотека сайта rus-linux.net
The book is available and called simply "Understanding The Linux Virtual Memory Manager". There is a lot of additional material in the book that is not available here, including details on later 2.4 kernels, introductions to 2.6, a whole new chapter on the shared memory filesystem, coverage of TLB management, a lot more code commentary, countless other additions and clarifications and a CD with lots of cool stuff on it. This material (although now dated and lacking in comparison to the book) will remain available although I obviously encourge you to buy the book from your favourite book store :-) . As the book is under the Bruce Perens Open Book Series, it will be available 90 days after appearing on the book shelves which means it is not available right now. When it is available, it will be downloadable from http://www.phptr.com/perens so check there for more information.
To be fully clear, this webpage is not the actual book.
Next: 9.2 Slabs Up: 9. Slab Allocator Previous: 9. Slab Allocator   Contents   Index
Subsections
- 9.1.1 Cache Descriptor
- 9.1.2 Cache Static Flags
- 9.1.3 Cache Dynamic Flags
- 9.1.4 Cache Allocation Flags
- 9.1.5 Cache Colouring
- 9.1.6 Cache Creation
- 9.1.7 Cache Reaping
- 9.1.8 Cache Shrinking
- 9.1.9 Cache Destroying
9.1 Caches
One cache exists for each type of object that is to be cached. For a full
list of caches available on a running system, run cat /proc/slabinfo
. This file gives some basic information on the caches. An excerpt from the
output of this file looks like:
slabinfo - version: 1.1 (SMP) kmem_cache 80 80 248 5 5 1 : 252 126 urb_priv 0 0 64 0 0 1 : 252 126 tcp_bind_bucket 15 226 32 2 2 1 : 252 126 inode_cache 5714 5992 512 856 856 1 : 124 62 dentry_cache 5160 5160 128 172 172 1 : 252 126 mm_struct 240 240 160 10 10 1 : 252 126 vm_area_struct 3911 4480 96 112 112 1 : 252 126 size-64(DMA) 0 0 64 0 0 1 : 252 126 size-64 432 1357 64 23 23 1 : 252 126 size-32(DMA) 17 113 32 1 1 1 : 252 126 size-32 850 2712 32 24 24 1 : 252 126
Each of the column fields correspond to a field in the struct
kmem_cache_s
structure. The columns listed in the excerpt above are:
- cache-name A human readable name such as ``tcp_bind_bucket'';
- num-active-objs Number of objects that are in use;
- total-objs How many objects are available in total including unused;
- obj-size The size of each object, typically quite small;
- num-active-slabs Number of slabs containing objects that are active;
- total-slabs How many slabs in total exist;
- num-pages-per-slab The pages required to create one slab, typically 1.
If SMP is enabled like in the example excerpt, two more columns will be displayed after a colon. They refer to the per CPU cache described in Section 9.5. The columns are:
- limit This is the number of free objects the pool can have before
half of it is given to the global free pool;
- batchcount The number of objects allocated for the processor in a block
when no objects are free.
To speed allocation and freeing of objects and slabs they are arranged into
three lists; slabs_full
, slabs_partial
and
slabs_free
. slabs_full
has all its objects in use.
slabs_partial
has free objects in it and so is a prime candidate
for allocation of objects. slabs_free
has no allocated objects
and so is a prime candidate for slab destruction.
9.1.1 Cache Descriptor
All information describing a cache is stored in a struct
kmem_cache_s
declared in mm/slab.c
. This is an extremely large
struct and so will be described in parts.
190 struct kmem_cache_s { 193 struct list_head slabs_full; 194 struct list_head slabs_partial; 195 struct list_head slabs_free; 196 unsigned int objsize; 197 unsigned int flags; 198 unsigned int num; 199 spinlock_t spinlock; 200 #ifdef CONFIG_SMP 201 unsigned int batchcount; 202 #endif 203
Most of these fields are of interest when allocating or freeing objects.
- slabs_* These are the three lists where the slabs are stored as
described in the previous section;
- objsize This is the size of each object packed into the slab;
- flags These flags determine how parts of the allocator will behave
when dealing with the cache. See Section 9.1.2;
- num This is the number of objects contained in each slab;
- spinlock A spinlock protecting the structure from concurrent accessses;
- batchcount This is the number of objects that will be allocated in
batch for the per-cpu caches as described in the previous section.
206 unsigned int gfporder; 209 unsigned int gfpflags; 210 211 size_t colour; 212 unsigned int colour_off; 213 unsigned int colour_next; 214 kmem_cache_t *slabp_cache; 215 unsigned int growing; 216 unsigned int dflags; 217 219 void (*ctor)(void *, kmem_cache_t *, unsigned long); 222 void (*dtor)(void *, kmem_cache_t *, unsigned long); 223 224 unsigned long failures; 225
This block deals with fields of interest when allocating or freeing slabs from the cache.
- gfporder This indicates the size of the slab in pages. Each slab
consumes
pages as these are the allocation sizes
the buddy allocator provides;
- gfpflags The GFP flags used when calling the buddy allocator to allocate
pages are stored here. See Section 7.4 for a full list;
- colour Each slab stores objects in different cache lines if possible.
Cache colouring will be further discussed in Section 9.1.5;
- colour_off This is the byte alignment to keep slabs at. For example,
slabs for the size-X caches are aligned on the L1 cache;
- colour_next This is the next colour line to use. This value wraps back
to 0 when it reaches
colour
; - growing This flag is set to indicate if the cache is growing or
not. If it is, it is much less likely this cache will be
selected to reap free slabs under memory pressure;
- dflags These are the dynamic flags which change during the cache
lifetime. See Section 9.1.3;
- ctor A complex object has the option of providing a constructor function
to be called to initialise each new object. This is a pointer
to that function and may be NULL;
- dtor This is the complementing object destructor and may be NULL;
- failures This field is not referred to anywhere in the code.
227 char name[CACHE_NAMELEN]; 228 struct list_head next;
These are set during cache creation
- name This is the human readable name of the cache;
- next This is the next cache on the cache chain.
229 #ifdef CONFIG_SMP 231 cpucache_t *cpudata[NR_CPUS]; 232 #endif
- cpudata This is the per-cpu data and is discussed further in Section 9.5.
233 #if STATS 234 unsigned long num_active; 235 unsigned long num_allocations; 236 unsigned long high_mark; 237 unsigned long grown; 238 unsigned long reaped; 239 unsigned long errors; 240 #ifdef CONFIG_SMP 241 atomic_t allochit; 242 atomic_t allocmiss; 243 atomic_t freehit; 244 atomic_t freemiss; 245 #endif 246 #endif 247 };
These figures are only available if the CONFIG_SLAB_DEBUG
option is set during compile time. They are all beancounters and not of
general interest. The statistics for /proc/slabinfo
are calculated
when the proc entry is read by another process by examining every slab used
by each cache rather than relying on these fields to be available.
- num_active The current number of active objects in the cache is stored
here;
- num_allocations A running total of the number of objects that have been
allocated on this cache is stored in this field;
- high_mark This is the highest value
num_active
has had to date; - grown This is the number of times
kmem_cache_grow()
has been called; - reaped The number of times this cache has been reaped is kept here;
- errors This field is never used;
- allochit This is the total number of times an allocation has used
the per-cpu cache;
- allocmiss To complement
allochit
, this is the number of times an allocation has missed the per-cpu cache; - freehit This is the number of times a free was placed on a per-cpu
cache;
- freemiss This is the number of times an object was freed and placed on
the global pool.
9.1.2 Cache Static Flags
A number of flags are set at cache creation time that remain the same for
the lifetime of the cache. They affect how the slab is structured and how
objects are stored within it. All the flags are stored in a bitmask in the
flags
field of the cache descriptor. The full list of possible
flags that may be used are declared in linux/slab.h
.
There are three principle sets. The first set is internal flags which are
set only by the slab allocator and is listed in Table 9.1. The only relevant flag is the CFGS_OFF_SLAB
flag which determines where the slab descriptor is stored.
The second set are set by the cache creator and they determine how the allocator treats the slab and how objects are stored. They are listed in Table 9.2.
The last flags are only available if the compile option
CONFIG_SLAB_DEBUG
is set. They determine what additional checks
will be made to slabs and objects and are primarily of interest only when
new caches are being developed.
To prevent callers using the wrong flags a CREATE_MASK
is
defined in mm/slab.c
consisting of all the allowable flags. When
a cache is being created, the requested flags are compared against the
CREATE_MASK
and reported as a bug if invalid flags are used.
9.1.3 Cache Dynamic Flags
The dflags
field has only one flag, DFLGS_GROWN
,
but it is important. The flag is set during kmem_cache_grow()
so that kmem_cache_reap()
will be unlikely to choose the cache
for reaping. When the function does find a cache with this flag set, it
skips the cache and removes the flag.
9.1.4 Cache Allocation Flags
These flags correspond to the GFP page flag options for allocating pages for
slabs. Callers sometimes call with either SLAB_*
or GFP_*
flags, but they really should use only SLAB_*
flags. They correspond
directly to the flags described in Section 7.4 so will not
be discussed in detail here. It is presumed the existence of these flags are
for clarity and in case the slab allocator needed to behave differently in
response to a particular flag but in reality, there is no difference.
9.1.5 Cache Colouring
To utilise hardware cache better, the slab allocator will offset objects
in different slabs by different amounts depending on the amount of space
left over in the slab. The offset is in units of BYTES_PER_WORD
unless SLAB_HWCACHE_ALIGN
is set in which case it is aligned to
blocks of L1_CACHE_BYTES
for alignment to the L1 hardware cache.
During cache creation, it is calculated how many objects can fit on a slab (see Section 9.2.7) and how many bytes would be wasted. Based on wastage, two figures are calculated for the cache descriptor
- colour This is the number of different offsets that can be used;
- colour_off This is the multiple to offset each objects by in the slab.
With the objects offset, they will use different lines on the associative hardware cache. Therefore, objects from slabs are less likely to overwrite each other in memory.
The result of this is best explained by an example. Let us say that
s_mem
(the address of the first object) on the slab is 0 for
convenience, that 100 bytes are wasted on the slab and alignment is to be
at 32 bytes to the L1 Hardware Cache on a Pentium II.
In this scenario, the first slab created will have its objects start at 0.
The second will start at 32, the third at 64, the fourth at 96 and the fifth
will start back at 0. With this, objects from each of the slabs will not
hit the same hardware cache line on the CPU. The value of colour
is 3 and colour_off
is 32.
9.1.6 Cache Creation
The function kmem_cache_create()
is responsible for creating new
caches and adding them to the cache chain. The tasks that are taken to create
a cache are
- Perform basic sanity checks for bad usage;
- Perform debugging checks if
CONFIG_SLAB_DEBUG
is set; - Allocate a
kmem_cache_t
from thecache_cache
slab cache ; - Align the object size to the word size;
- Calculate how many objects will fit on a slab;
- Align the slab size to the hardware cache;
- Calculate colour offsets ;
- Initialise remaining fields in cache descriptor;
- Add the new cache to the cache chain.
Figure 9.2 shows the call graph relevant to the creation of a cache; each function is fully described in the Code Commentary.
9.1.7 Cache Reaping
When a slab is freed, it is placed on the slabs_free
list for future use. Caches do not automatically shrink themselves
so when kswapd notices that memory is tight, it calls
kmem_cache_reap()
to free some memory. This function is responsible
for selecting a cache that will be required to shrink its memory usage. It
is worth noting that cache reaping does not take into account what memory
node or zone is under pressure. This means that with a NUMA or high memory
machine, it is possible the kernel will spend a lot of time freeing memory
from regions that are under no memory pressure but this is not a problem
for architectures like the x86 which has only one bank of memory.
The call graph in Figure 9.3 is deceptively
simple as the task of selecting the proper cache to reap is quite
long. In the event that there are numerous caches in the system,
only REAP_SCANLEN
9.1 caches are examined in each call. The last
cache to be scanned is stored in the variable clock_searchp
so as not to examine the same caches repeatedly. For each scanned cache,
the reaper does the following
- Check flags for
SLAB_NO_REAP
and skip if set; - If the cache is growing, skip it;
- if the cache has grown recently (
DFLGS_GROWN
is set indflags
), skip it but clear the flag so it will be reaped the next time; - Count the number of free slabs in
slabs_free
and calculate how many pages that would free in the variablepages
; - If the cache has constructors or large slabs, adjust
pages
to make it less likely for the cache to be selected; - If the number of pages that would be freed exceeds
REAP_PERFECT
, free half of the slabs inslabs_free
; - Otherwise scan the rest of the caches and select the one that would
free the most pages for freeing half of its slabs in
slabs_free
.
9.1.8 Cache Shrinking
When a cache is selected to shrink itself, the steps it takes are simple and brutal
- Delete all objects in the per CPU caches;
- Delete all slabs from
slabs_free
unless the growing flag gets set.
Linux is nothing, if not subtle.
Two varieties of shrink functions are provided with confusingly
similar names. kmem_cache_shrink()
removes all slabs from
slabs_free
and returns the number of pages freed as a result. This
is the principal function exported for use by the slab allocator users.
The second function __kmem_cache_shrink()
frees all slabs from
slabs_free
and then verifies that slabs_partial
and slabs_full
are empty. This is for internal use only and is
important during cache destruction when it doesn't matter how many pages
are freed, just that the cache is empty.
9.1.9 Cache Destroying
When a module is unloaded, it is responsible for destroying any cache with
the function kmem_cache_destroy()
. It is important that the cache
is properly destroyed as two caches of the same human-readable name are not
allowed to exist. Core kernel code often does not bother to destroy its
caches as their existence persists for the life of the system. The steps
taken to destroy a cache are
- Delete the cache from the cache chain;
- Shrink the cache to delete all slabs;
- Free any per CPU caches (
kfree()
); - Delete the cache descriptor from the
cache_cache
.
Footnotes
- ...
REAP_SCANLEN
9.1 REAP_SCANLEN
is statically defined as 10.
Next: 9.2 Slabs Up: 9. Slab Allocator Previous: 9. Slab Allocator   Contents   Index Mel 2004-02-15