Linux Device Drivers, 2nd Edition
By Alessandro Rubini & Jonathan Corbet
2nd Edition June 2001
0-59600-008-1, Order Number: 0081
586 pages, $39.95
Chapter 7
Getting Hold of Memory
Contents:
The Real Story of kmalloc
Lookaside Caches
get_free_page and Friends
vmalloc and Friends
Boot-Time Allocation
Backward Compatibility
Quick Reference
Thus far, we have used kmalloc and kfree for the allocation and freeing of memory. The Linux kernel offers a richer set of memory allocation primitives, however. In this chapter we look at other ways of making use of memory in device drivers and at how to make the best use of your system's memory resources. We will not get into how the different architectures actually administer memory. Modules are not involved in issues of segmentation, paging, and so on, since the kernel offers a unified memory management interface to the drivers. In addition, we won't describe the internal details of memory management in this chapter, but will defer it to "Memory Management in Linux" in Chapter 13, "mmap and DMA".
The Real Story of kmalloc
The kmalloc allocation engine is a powerful tool, and easily learned because of its similarity to malloc. The function is fast -- unless it blocks -- and it doesn't clear the memory it obtains; the allocated region still holds its previous content. The allocated region is also contiguous in physical memory. In the next few sections, we talk in detail about kmalloc, so you can compare it with the memory allocation techniques that we discuss later.
The Flags Argument
The first argument to kmalloc is the size of the block to be allocated. The second argument, the allocation flags, is much more interesting, because it controls the behavior of kmalloc in a number of ways.
The most-used flag, GFP_KERNEL, means that the allocation (internally performed by calling, eventually, get_free_pages, which is the source of the GFP_ prefix) is performed on behalf of a process running in kernel space. In other words, this means that the calling function is executing a system call on behalf of a process. Using GFP_KERNEL means that kmalloc can put the current process to sleep waiting for a page when called in low-memory situations. A function that allocates memory using GFP_KERNEL must therefore be reentrant. While the current process sleeps, the kernel takes proper action to retrieve a memory page, either by flushing buffers to disk or by swapping out memory from a user process.
Other flags can be used in place of or in addition to GFP_KERNEL and GFP_ATOMIC, although those two cover most of the needs of device drivers. All the flags are defined in <linux/mm.h>: individual flags are prefixed with a double underscore, like __GFP_DMA; collections of flags lack the prefix and are sometimes called allocation priorities.
GFP_KERNEL
GFP_BUFFER
GFP_ATOMIC
GFP_USER
Used to allocate memory on behalf of the user. It may sleep, and is a low-priority request.
GFP_HIGHUSER
__GFP_DMA
This flag requests memory usable in DMA data transfers to/from devices. Its exact meaning is platform dependent, and the flag can be OR'd to either GFP_KERNEL or GFP_ATOMIC.
__GFP_HIGHMEM
Memory zones
Version 2.4 of the kernel knows about three memory zones: DMA-capable memory, normal memory, and high memory. While allocation normally happens in the normal zone, setting either of the bits just mentioned requires memory to be allocated from a different zone. The idea is that every computer platform that must know about special memory ranges (instead of considering all RAM equivalent) will fall into this abstraction.
Whenever a new page is allocated to fulfill the kmalloc request, the kernel builds a list of zones that can be used in the search. If __GFP_DMA is specified, only the DMA zone is searched: if no memory is available at low addresses, allocation fails. If no special flag is present, both normal and DMA memory are searched; if __GFP_HIGHMEM is set, then all three zones are used to search for a free page.
The Size Argument
The kernel manages the system's physical memory, which is available only in page-sized chunks. As a result, kmalloc looks rather different than a typical user-space malloc implementation. A simple, heap-oriented allocation technique would quickly run into trouble; it would have a hard time working around the page boundaries. Thus, the kernel uses a special page-oriented allocation technique to get the best use from the system's RAM.
You can find the exact values used for the allocation blocks in mm/kmalloc.c (with the 2.0 kernel) or mm/slab.c (in current kernels), but remember that they can change again without notice. The trick of allocating less than 4 KB works well for scull with all 2.x kernels, but it's not guaranteed to be optimal in the future.
In any case, the maximum size that can be allocated by kmalloc is 128 KB -- slightly less with 2.0 kernels. If you need more than a few kilobytes, however, there are better ways than kmalloc to obtain memory, as outlined next.
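The zone search described under "Memory zones" can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions: the enum, the flag bits, and the function build_zone_list are hypothetical stand-ins rather than the kernel's own definitions, and the order in which zones are tried (most specific zone first) is an assumption made for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the three zones known to the 2.4 kernel. */
enum zone { ZONE_DMA_SK, ZONE_NORMAL_SK, ZONE_HIGHMEM_SK };

/* Illustrative flag bits only; the real values live in <linux/mm.h>. */
#define SK_GFP_DMA     0x01u
#define SK_GFP_HIGHMEM 0x02u

/*
 * Fill `list` with the zones that may be searched for the given flags
 * and return how many entries were written (at most 3).
 */
size_t build_zone_list(unsigned int flags, enum zone list[3])
{
    size_t n = 0;

    if (flags & SK_GFP_DMA) {
        /* DMA requests never fall back to another zone. */
        list[n++] = ZONE_DMA_SK;
        return n;
    }
    if (flags & SK_GFP_HIGHMEM)
        list[n++] = ZONE_HIGHMEM_SK;
    /* Ordinary requests may use normal memory, then DMA memory. */
    list[n++] = ZONE_NORMAL_SK;
    list[n++] = ZONE_DMA_SK;
    return n;
}
```

With no special flag the model yields two zones, with the high-memory bit it yields all three, and with the DMA bit it yields only the DMA zone, so a DMA request fails as soon as low memory is exhausted.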
Lookaside Caches
A device driver often ends up allocating many objects of the same size, over and over. Given that the kernel already maintains a set of memory pools of objects that are all the same size, why not add some special pools for these high-volume objects? In fact, the kernel does implement this sort of lookaside cache. Device drivers normally do not exhibit the sort of memory behavior that justifies using a lookaside cache, but there can be exceptions; the USB and ISDN drivers in Linux 2.4 use caches.
Linux memory caches have a type of kmem_cache_t and are created with a call to kmem_cache_create:
    kmem_cache_t * kmem_cache_create(const char *name, size_t size,
        size_t offset, unsigned long flags,
        void (*constructor)(void *, kmem_cache_t *, unsigned long flags),
        void (*destructor)(void *, kmem_cache_t *, unsigned long flags) );
SLAB_NO_REAP
SLAB_HWCACHE_ALIGN
SLAB_CACHE_DMA
This flag requires each data object to be allocated in DMA-capable memory.
The constructor and destructor arguments to the function are optional functions (but there can be no destructor without a constructor); the former can be used to initialize newly allocated objects and the latter can be used to "clean up" objects prior to their memory being released back to the system as a whole.
Once a cache of objects is created, you can allocate objects from it by calling kmem_cache_alloc:
    void *kmem_cache_alloc(kmem_cache_t *cache, int flags);
To free an object, use kmem_cache_free:
    void kmem_cache_free(kmem_cache_t *cache, const void *obj);
To destroy the cache itself once it is no longer needed, use kmem_cache_destroy:
    int kmem_cache_destroy(kmem_cache_t *cache);
One side benefit to using lookaside caches is that the kernel maintains statistics on cache usage. There is even a kernel configuration option that enables the collection of extra statistical information, but at a noticeable runtime cost. Cache statistics may be obtained from /proc/slabinfo.
A scull Based on the Slab Caches: scullc
    /* Allocate a quantum using the memory cache */
    if (!dptr->data[s_pos]) {
        dptr->data[s_pos] = kmem_cache_alloc(scullc_cache, GFP_KERNEL);
        if (!dptr->data[s_pos])
            goto nomem;
        memset(dptr->data[s_pos], 0, scullc_quantum);
    }
And these lines release memory:
    for (i = 0; i < qset; i++)
        if (dptr->data[i])
            kmem_cache_free(scullc_cache, dptr->data[i]);
    kfree(dptr->data);
To support use of scullc_cache, these few lines are included in the file at proper places:
    /* declare one cache pointer: use it for all devices */
    kmem_cache_t *scullc_cache;

    /* init_module: create a cache for our quanta */
    scullc_cache = kmem_cache_create("scullc", scullc_quantum,
            0, SLAB_HWCACHE_ALIGN, NULL, NULL); /* no ctor/dtor */
    if (!scullc_cache) {
        result = -ENOMEM;
        goto fail_malloc2;
    }

    /* cleanup_module: release the cache of our quanta */
    kmem_cache_destroy(scullc_cache);
The main differences in passing from scull to scullc are a slight speed improvement and better memory use. Since quanta are allocated from a pool of memory fragments of exactly the right size, their placement in memory is as dense as possible, as opposed to scull quanta, which bring in an unpredictable memory fragmentation.
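To make the idea of a pool of same-sized objects concrete, here is a toy user-space model of a lookaside cache. It is not how the kernel's slab allocator is implemented; struct obj_cache and the obj_cache_* functions are hypothetical names invented for this sketch, and malloc stands in for the page allocator.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical model of a lookaside cache: every object has the same
 * size, and freed objects are parked on a list for quick reuse. */
struct obj_cache {
    size_t obj_size;     /* size of each object, in bytes */
    void *free_list;     /* singly linked list of idle objects */
    unsigned long hits;  /* allocations satisfied from the list */
};

void obj_cache_init(struct obj_cache *c, size_t obj_size)
{
    /* An idle object stores the "next" pointer in its first bytes,
     * so objects must be at least pointer-sized. */
    c->obj_size = obj_size < sizeof(void *) ? sizeof(void *) : obj_size;
    c->free_list = NULL;
    c->hits = 0;
}

void *obj_cache_alloc(struct obj_cache *c)
{
    void *obj = c->free_list;

    if (obj) {
        c->free_list = *(void **)obj;  /* pop the first idle object */
        c->hits++;
        return obj;
    }
    return malloc(c->obj_size);        /* list empty: get new memory */
}

void obj_cache_free(struct obj_cache *c, void *obj)
{
    *(void **)obj = c->free_list;      /* push onto the idle list */
    c->free_list = obj;
}

void obj_cache_destroy(struct obj_cache *c)
{
    /* Hand every idle object back to the underlying allocator. */
    while (c->free_list) {
        void *obj = c->free_list;

        c->free_list = *(void **)obj;
        free(obj);
    }
}
```

In this model a freed quantum is handed back on the next allocation without touching the underlying allocator, which hints at where the speed improvement of a real cache comes from. The hits counter plays the role of the usage statistics mentioned earlier.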
get_free_page and Friends
If a module needs to allocate big chunks of memory, it is usually better to use a page-oriented technique. Requesting whole pages also has other advantages, which will be introduced later, in "The mmap Device Operation" in Chapter 13, "mmap and DMA".
To allocate pages, the following functions are available:
- get_zeroed_page
Returns a pointer to a new page and fills the page with zeros.
- __get_free_page
- __get_free_pages
- __get_dma_pages
Similar to get_free_pages, but guarantees that the allocated memory is DMA capable. If you use version 2.2 or later of the kernel, you can simply use __get_free_pages and pass the __GFP_DMA flag; if you want backward compatibility with 2.0, you need to call this function instead.
The prototypes for the functions follow:
    unsigned long get_zeroed_page(int flags);
    unsigned long __get_free_page(int flags);
    unsigned long __get_free_pages(int flags, unsigned long order);
    unsigned long __get_dma_pages(int flags, unsigned long order);
The flags argument works in the same way as with kmalloc; usually either GFP_KERNEL or GFP_ATOMIC is used, perhaps with the addition of the __GFP_DMA flag (for memory that can be used for direct memory access operations) or __GFP_HIGHMEM when high memory can be used. order is the base-two logarithm of the number of pages you are requesting or freeing (i.e., log2 N). For example, order is 0 if you want one page and 3 if you request eight pages. If order is too big (no contiguous area of that size is available), the page allocation will fail. The maximum value of order was 5 in Linux 2.0 (corresponding to 32 pages) and 9 with later versions (corresponding to 512 pages: 2 MB on most platforms). Anyway, the bigger order is, the more likely it is that the allocation will fail.
When a program is done with the pages, it can free them with one of the following functions. The first function is a macro that falls back on the second:
    void free_page(unsigned long addr);
    void free_pages(unsigned long addr, unsigned long order);
Although kmalloc(GFP_KERNEL) sometimes fails when there is no available memory, the kernel does its best to fulfill allocation requests. Therefore, it's easy to degrade system responsiveness by allocating too much memory. For example, you can bring the computer down by pushing too much data into a scull device; the system will start crawling while it tries to swap out as much as possible in order to fulfill the kmalloc request. Since every resource is being sucked up by the growing device, the computer is soon rendered unusable; at that point you can no longer even start a new process to try to deal with the problem. We don't address this issue in scull, since it is just a sample module and not a real tool to put into a multiuser system. As a programmer, you must nonetheless be careful, because a module is privileged code and can open new security holes in the system (the most likely is a denial-of-service hole like the one just outlined).
A scull Using Whole Pages: scullp
The following lines show how it allocates memory:
    /* Here's the allocation of a single quantum */
    if (!dptr->data[s_pos]) {
        dptr->data[s_pos] = (void *)__get_free_pages(GFP_KERNEL, dptr->order);
        if (!dptr->data[s_pos])
            goto nomem;
        memset(dptr->data[s_pos], 0, PAGE_SIZE << dptr->order);
    }
The code to deallocate memory in scullp, instead, looks like this:
    /* This code frees a whole quantum set */
    for (i = 0; i < qset; i++)
        if (dptr->data[i])
            free_pages((unsigned long)(dptr->data[i]), dptr->order);
But the biggest advantage of __get_free_page is that the page is completely yours, and you could, in theory, assemble the pages into a linear area by appropriate tweaking of the page tables. For example, you can allow a user process to mmap memory areas obtained as single unrelated pages. We'll discuss this kind of operation in "The mmap Device Operation" in Chapter 13, "mmap and DMA", where we show how scullp offers memory mapping, something that scull cannot offer.
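Because the page-oriented functions in this section take an order rather than a byte count, a driver often needs to convert one into the other. The helper below is a user-space sketch of that arithmetic; order_for_size is a hypothetical name, and the page size and maximum order are passed in as parameters rather than taken from kernel headers.

```c
#include <stddef.h>

/*
 * Return the smallest order such that (page_size << order) >= size,
 * or -1 if the request would need an order above max_order.
 */
int order_for_size(size_t size, size_t page_size, int max_order)
{
    int order = 0;
    size_t span = page_size;   /* bytes covered by the current order */

    while (span < size) {
        if (order == max_order)
            return -1;         /* too big for a single allocation */
        span <<= 1;            /* each order doubles the page count */
        order++;
    }
    return order;
}
```

Assuming 4 KB pages, a request for eight pages gives order 3 and a request for 2 MB gives order 9, matching the figures quoted above. One byte more than 2 MB is rejected when max_order is 9.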
vmalloc and Friends
The next memory allocation function that we'll show you is vmalloc, which allocates a contiguous memory region in the virtual address space. Although the pages are not necessarily consecutive in physical memory (each page is retrieved with a separate call to __get_free_page), the kernel sees them as a contiguous range of addresses. vmalloc returns 0 (the NULL address) if an error occurs; otherwise, it returns a pointer to a linear memory area of size at least size.
    #include <linux/vmalloc.h>

    void * vmalloc(unsigned long size);
    void vfree(void * addr);
    void *ioremap(unsigned long offset, unsigned long size);
    void iounmap(void * addr);
It's worth stressing that memory addresses returned by kmalloc and get_free_pages are also virtual addresses. Their actual value is still massaged by the MMU (memory management unit, usually part of the CPU) before it is used to address physical memory. vmalloc is not different in how it uses the hardware, but rather in how the kernel performs the allocation task.
This difference can be perceived by comparing the pointers returned by the allocation functions. On some platforms (for example, the x86), addresses returned by vmalloc are just beyond the addresses that kmalloc uses. On other platforms (for example, MIPS and IA-64), they belong to a completely different address range. Addresses available for vmalloc are in the range from VMALLOC_START to VMALLOC_END. Both symbols are defined in <asm/pgtable.h>.
An example of a function that uses vmalloc is the create_module system call, which uses vmalloc to get space for the module being created. Code and data of the module are later copied to the allocated space using copy_from_user, after insmod has relocated the code. In this way, the module appears to be loaded into contiguous memory. You can verify, by looking in /proc/ksyms, that kernel symbols exported by modules lie in a different memory range than symbols exported by the kernel proper.
One minor drawback of vmalloc is that it can't be used at interrupt time because internally it uses kmalloc(GFP_KERNEL) to acquire storage for the page tables, and thus could sleep. This shouldn't be a problem -- if the use of __get_free_page isn't good enough for an interrupt handler, then the software design needs some cleaning up.
A scull Using Virtual Addresses: scullv
    /* Allocate a quantum using virtual addresses */
    if (!dptr->data[s_pos]) {
        dptr->data[s_pos] = (void *)vmalloc(PAGE_SIZE << dptr->order);
        if (!dptr->data[s_pos])
            goto nomem;
        memset(dptr->data[s_pos], 0, PAGE_SIZE << dptr->order);
    }
And these lines release memory:
    /* Release the quantum set */
    for (i = 0; i < qset; i++)
        if (dptr->data[i])
            vfree(dptr->data[i]);
    salma% cat /tmp/bigfile > /dev/scullp0; head -5 /proc/scullpmem
    Device 0: qset 500, order 0, sz 1048576
      item at e00000003e641b40, qset at e000000025c60000
           0:e00000003007c000
           1:e000000024778000
    salma% cat /tmp/bigfile > /dev/scullv0; head -5 /proc/scullvmem
    Device 0: qset 500, order 4, sz 1048576
      item at e0000000303699c0, qset at e000000025c87000
           0:a000000000034000
           1:a000000000078000
    salma% uname -m
    ia64

    rudo% cat /tmp/bigfile > /dev/scullp0; head -5 /proc/scullpmem
    Device 0: qset 500, order 0, sz 1048576
      item at c4184780, qset at c71c4800
           0:c262b000
           1:c2193000
    rudo% cat /tmp/bigfile > /dev/scullv0; head -5 /proc/scullvmem
    Device 0: qset 500, order 4, sz 1048576
      item at c4184b80, qset at c71c4000
           0:c881a000
           1:c882b000
    rudo% uname -m
    i686
The values show two different behaviors. On IA-64, physical addresses and virtual addresses are mapped to completely different address ranges (0xE and 0xA), whereas on x86 computers vmalloc returns virtual addresses just above the mapping used for physical memory.
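The way vmalloc presents scattered physical pages as one linear area can be illustrated with a toy page table. The sketch below is a user-space model only: toy_virt_to_phys, the frames array, and the fixed 4 KB page size are assumptions for illustration and say nothing about the kernel's real page-table layout.

```c
#include <stddef.h>

#define SK_PAGE_SIZE 4096UL  /* assumed page size for this model */

/*
 * Toy page table: frames[i] is the physical frame number backing the
 * i-th page of a virtually contiguous region. Returns the physical
 * address that corresponds to `offset` bytes into the region, or 0
 * if the offset lies outside the region.
 */
unsigned long toy_virt_to_phys(const unsigned long *frames, size_t npages,
                               unsigned long offset)
{
    unsigned long page = offset / SK_PAGE_SIZE;

    if (page >= npages)
        return 0;  /* not mapped in this model */
    return frames[page] * SK_PAGE_SIZE + offset % SK_PAGE_SIZE;
}
```

Two adjacent offsets on either side of a page boundary can translate to physical addresses that are far apart. That is why a vmalloc area is contiguous only from the point of view of code that uses virtual addresses.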
Boot-Time Allocation
If you really need a huge buffer of physically contiguous memory, you need to allocate it by requesting memory at boot time. This technique is inelegant and inflexible, but it is also the least prone to failure. Needless to say, a module can't allocate memory at boot time; only drivers directly linked to the kernel can do that.
Acquiring a Dedicated Buffer at Boot Time
When the kernel is booted, it gains access to all the physical memory available in the system. It then initializes each of its subsystems by calling that subsystem's initialization function, allowing initialization code to allocate a memory buffer for private use by reducing the amount of RAM left for normal system operation.
With version 2.4 of the kernel, this kind of allocation is performed by calling one of these functions:
    #include <linux/bootmem.h>

    void *alloc_bootmem(unsigned long size);
    void *alloc_bootmem_low(unsigned long size);
    void *alloc_bootmem_pages(unsigned long size);
    void *alloc_bootmem_low_pages(unsigned long size);
The bigphysarea Patch
Reserving High RAM Addresses
The last option for allocating contiguous memory areas, and possibly the easiest, is reserving a memory area at the end of physical memory (whereas bigphysarea reserves it at the beginning of physical memory). To this aim, you need to pass a command-line option to the kernel to limit the amount of memory being managed. For example, one of your authors uses mem=126M to reserve 2 megabytes in a system that actually has 128 megabytes of RAM. Later, at runtime, this memory can be allocated and used by device drivers.
The advantage of allocator over the bigphysarea patch is that there's no need to modify official kernel sources. The disadvantage is that you must change the command-line option to the kernel whenever you change the amount of RAM in the system. Another disadvantage, which makes allocator unsuitable in some situations, is that high memory cannot be used for some tasks, such as DMA buffers for ISA devices.
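The arithmetic behind the mem= example can be sketched in user space. The code below is a simplified illustration, not the kernel's own option parser: parse_mem_option, struct reserved_range, and reserved_above are hypothetical names, and the model assumes that RAM starts at physical address 0 and has no holes.

```c
#include <stdlib.h>
#include <string.h>

/* Parse a "mem=<number>[K|M|G]" option into bytes; returns 0 on a
 * malformed string. Simplified for illustration. */
unsigned long long parse_mem_option(const char *opt)
{
    char *end;
    unsigned long long value;

    if (strncmp(opt, "mem=", 4) != 0)
        return 0;
    value = strtoull(opt + 4, &end, 10);
    switch (*end) {
    case 'G': value <<= 10; /* fall through */
    case 'M': value <<= 10; /* fall through */
    case 'K': value <<= 10; break;
    case '\0': break;
    default: return 0;
    }
    return value;
}

/* Physical range left outside the kernel's management. */
struct reserved_range {
    unsigned long long base;  /* first unmanaged physical address */
    unsigned long long size;  /* number of unmanaged bytes */
};

struct reserved_range reserved_above(unsigned long long total_ram,
                                     const char *opt)
{
    struct reserved_range r = { 0, 0 };
    unsigned long long managed = parse_mem_option(opt);

    /* The reserved area begins where managed memory ends. */
    if (managed && managed < total_ram) {
        r.base = managed;
        r.size = total_ram - managed;
    }
    return r;
}
```

For a 128 MB machine booted with mem=126M, this model reports a 2 MB region starting at the 126 MB mark, which is the area a driver would then have to map before using it.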
Backward Compatibility
The lookaside cache functions were introduced in Linux 2.1.23, and were simply not available in the 2.0 kernel. Code that must be portable back to Linux 2.0 should stick with kmalloc and kfree. Moreover, kmem_cache_destroy was introduced during 2.3 development and has only been backported to 2.2 as of 2.2.18. For this reason scullc refuses to compile with a 2.2 kernel older than that.
Quick Reference
The functions and symbols related to memory allocation follow.
#include <linux/malloc.h>
void *kmalloc(size_t size, int flags);
void kfree(void *obj);
#include <linux/mm.h>
GFP_KERNEL
GFP_ATOMIC
__GFP_DMA
__GFP_HIGHMEM
kmalloc flags. __GFP_DMA and __GFP_HIGHMEM are flags that can be OR'd to either GFP_KERNEL or GFP_ATOMIC.
#include <linux/malloc.h>
kmem_cache_t *kmem_cache_create(char *name, size_t size, size_t offset, unsigned long flags, constructor(), destructor());
int kmem_cache_destroy(kmem_cache_t *cache);
Create and destroy a slab cache. The cache can be used to allocate several objects of the same size.
SLAB_NO_REAP
SLAB_HWCACHE_ALIGN
SLAB_CACHE_DMA
SLAB_CTOR_ATOMIC
SLAB_CTOR_CONSTRUCTOR
Flags that the allocator can pass to the constructor and the destructor functions.
void *kmem_cache_alloc(kmem_cache_t *cache, int flags);
void kmem_cache_free(kmem_cache_t *cache, const void *obj);
unsigned long get_zeroed_page(int flags);
unsigned long __get_free_page(int flags);
unsigned long __get_free_pages(int flags, unsigned long order);
unsigned long __get_dma_pages(int flags, unsigned long order);
The page-oriented allocation functions. get_zeroed_page returns a single, zero-filled page. All the other versions of the call do not initialize the contents of the returned page(s). __get_dma_pages is only a compatibility macro in Linux 2.2 and later (you can use __GFP_DMA instead).
void free_page(unsigned long addr);
void free_pages(unsigned long addr, unsigned long order);
#include <linux/vmalloc.h>
void * vmalloc(unsigned long size);
void vfree(void * addr);
#include <asm/io.h>
void * ioremap(unsigned long offset, unsigned long size);
void iounmap(void *addr);
These functions allocate or free a contiguous virtual address space. ioremap accesses physical memory through virtual addresses, while vmalloc allocates free pages. Regions mapped with ioremap are freed with iounmap, while pages obtained from vmalloc are released with vfree.
#include <linux/bootmem.h>
void *alloc_bootmem(unsigned long size);
void *alloc_bootmem_low(unsigned long size);
void *alloc_bootmem_pages(unsigned long size);
void *alloc_bootmem_low_pages(unsigned long size);
These functions allocate memory at boot time and are available only with version 2.4 of the kernel. The facility can only be used by drivers directly linked into the kernel image.
© 2001, O'Reilly & Associates, Inc.