Linux Device Drivers, 2nd Edition
By Alessandro Rubini & Jonathan Corbet
2nd Edition June 2001
0-59600-008-1, Order Number: 0081
586 pages, $39.95
Chapter 16
Physical Layout of the Kernel Source
Contents:
Booting the Kernel
Before Booting
The init Process
The kernel Directory
The fs Directory
The mm Directory
The net directory
ipc and lib
include and arch
Drivers
So far, we've talked about the Linux kernel from the perspective of writing device drivers. Once you begin playing with the kernel, however, you may find that you want to "understand it all." In fact, you may find yourself passing whole days navigating through the source code and grepping your way through the source tree to uncover the relationships among the different parts of the kernel.
This kind of "heavy grepping" is one of the tasks your authors perform quite often, and it is an efficient way to retrieve information from the source code. Nowadays you can even exploit Internet resources to understand the kernel source tree; some of them are listed in the Preface. But despite Internet resources, wise use of grep, less, and possibly ctags or etags can still be the best way to extract information from the kernel sources.
Booting the Kernel
The usual way to look at a program is to start where execution begins. As far as Linux is concerned, it's hard to tell where execution begins -- it depends on how you define "begins."
The first function called by start_kernel, after acquiring the kernel lock and printing the Linux banner string, is setup_arch. This allows platform-specific C-language code to run; setup_arch receives a pointer to the local command_line pointer in start_kernel, so it can make it point to the real (platform-dependent) location where the command line is stored. As the next step, start_kernel passes the command line to parse_options (defined in the same init/main.c file) so that the boot options can be honored.
The initial boot sequence can thus be summarized as follows:
start_kernel is called. It acquires the kernel lock, prints the banner, and calls setup_arch.
start_kernel initializes basic facilities and forks the init thread.
It is the task of the init thread to perform all other initialization. The thread is part of the same init/main.c file, and the bulk of the initialization (init) calls are performed by do_basic_setup. The function initializes all bus subsystems that it finds (PCI, SBus, and so on). It then invokes do_initcalls; device driver initialization is performed as part of the initcall processing.
The idea of init calls was added in version 2.3.13 and is not available in older kernels; it is designed to avoid hairy #ifdef conditionals all over the initialization code. Every optional kernel feature (device driver or whatever) must be initialized only if configured in the system, so the call to initialization functions used to be surrounded by #ifdef CONFIG_FEATURE and #endif. With init calls, each optional feature declares its own initialization function; the compilation process then places a reference to the function in a special ELF section. At boot time, do_initcalls scans the ELF section to invoke all the relevant initialization functions.
The effect on the initialization code can be seen by counting the lines that contain ifdef in init/main.c across kernel versions:
morgana% grep -c ifdef linux-2.[024]/init/main.c
linux-2.0/init/main.c:120
linux-2.2/init/main.c:246
linux-2.4/init/main.c:35
There is yet another advantage to putting the initialization code into a special section. Once initialization is complete, that code is no longer needed. Since this code has been isolated, the kernel is able to dump it and reclaim the memory it occupies.
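The init call idea can be imitated in user space. What follows is only a minimal sketch, not the kernel's own code: it assumes GCC or Clang on an ELF system linked with GNU ld or a compatible linker, which defines __start_ and __stop_ symbols for any section whose name is a valid C identifier. DEMO_INITCALL, demo_do_initcalls, and the two feature functions are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space imitation of init calls; these names are
 * illustrative and are not the kernel's own macros. */
typedef int (*demo_initcall_t)(void);

/* Store a pointer to fn in a dedicated ELF section (GCC/Clang extension). */
#define DEMO_INITCALL(fn) \
    static demo_initcall_t demo_initcall_##fn \
    __attribute__((used, section("demo_initcalls"))) = fn

/* Assumption: the linker provides these two symbols, marking the start
 * and the end of the section named "demo_initcalls". */
extern demo_initcall_t __start_demo_initcalls[];
extern demo_initcall_t __stop_demo_initcalls[];

static int features_ready;

/* Each optional feature declares its own initialization function... */
static int init_feature_a(void) { features_ready++; return 0; }
static int init_feature_b(void) { features_ready++; return 0; }

/* ...and registers it without touching any central list. */
DEMO_INITCALL(init_feature_a);
DEMO_INITCALL(init_feature_b);

/* Walk the section and call every registered function.
 * Returns the number of functions that were invoked. */
int demo_do_initcalls(void)
{
    int count = 0;
    demo_initcall_t *call;

    for (call = __start_demo_initcalls; call < __stop_demo_initcalls; call++) {
        assert(*call != NULL);
        (*call)();
        count++;
    }
    return count;
}

/* Report how many features have been initialized so far. */
int demo_features_ready(void)
{
    return features_ready;
}
```

Calling demo_do_initcalls once runs both registered functions. Adding a third feature requires only a new function and one more DEMO_INITCALL line, with no conditional code in the function that walks the section.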
Before Booting
A known limitation of the x86 platform is that the CPU can see only 640 KB of system memory when it is powered on, no matter how large your installed memory is. Dealing with the limitation requires the kernel to be compressed, and support for decompression is available in arch/i386/boot together with other code such as VGA mode setting. On the PC, because of this limit, you can't do anything with a vmlinux kernel image, and the file you actually boot is called zImage or bzImage; the boot sector described earlier is actually prepended to this file rather than to vmlinux. We won't spend more time on the booting process on the x86 platform, since you can choose from several boot loaders, and the topic is generally well discussed elsewhere.
Some platforms differ greatly in the layout of their boot code from the PC. Sometimes the code must deal with several variations of the same architecture. This is the case, for example, with ARM, MIPS, and M68k. These platforms cover a wide variety of CPU and system types, ranging from powerful servers and workstations down to PDAs or embedded appliances. Different environments require different boot code and sometimes even different ld scripts to compile the kernel image. Some of this support is not included in the official kernel tree published by Linus and is available only from third-party Concurrent Versions System (CVS) trees that closely track the official tree but have not yet been merged. Current examples include the SGI CVS tree for MIPS workstations and the LinuxCE CVS tree for MIPS-based palm computers. Nonetheless, we'd like to spend a few words on this topic because we feel it's an interesting one. Everything from start_kernel onward is based on this extra complexity but doesn't notice it.
Specific ld scripts and makefile rules are needed especially for embedded systems, and particularly for variants without a memory management unit, which are supported by uClinux. When you have no hardware MMU that maps virtual addresses to physical ones, you must link the kernel to be executed from the physical address where it will be loaded in the target platform. It's not uncommon in small systems to link the kernel so that it is loaded into read-only memory (usually flash memory), where it is directly activated at power-on time without the help of any boot loader.
When the kernel is executed directly from flash memory, the makefiles, ld scripts, and boot code work in tight cooperation. The ld rules place the code and read-only segments (such as the init calls information) into flash memory, while placing the data segments (data and block started by symbol (BSS)) in system RAM. The result is that the two sets are not consecutive. The makefile, then, offers special rules to coalesce all these sections into consecutive addresses and convert them to a format suitable for upload to the target system. Coalescing is mandatory because the data segment contains initialized data structures that must get written to read-only memory or otherwise be lost. Finally, assembly code that runs before start_kernel must copy over the data segment from flash memory to RAM (to the address where the linker placed it) and zero out the address range associated with the BSS segment. Only after this remapping has taken place can C-language code run.
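The last step described above, copying the data segment and clearing the BSS, is performed by assembly code in a real system. The C sketch below is only meant to make the two operations explicit and runnable on an ordinary host; struct demo_image_layout and demo_prepare_ram are hypothetical names, and on real hardware the addresses would come from symbols defined by the ld script rather than from a structure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of where the linker placed each segment.
 * Ordinary pointers are used so that the sketch can run anywhere. */
struct demo_image_layout {
    const unsigned char *data_in_flash; /* initialized data as stored in flash */
    unsigned char *data_in_ram;         /* address the linker assigned to data */
    size_t data_size;
    unsigned char *bss_in_ram;          /* address the linker assigned to BSS */
    size_t bss_size;
};

/* What must happen before C code can rely on its variables:
 * initialized data is copied to RAM and the BSS range is zeroed. */
void demo_prepare_ram(const struct demo_image_layout *layout)
{
    assert(layout != NULL);
    memcpy(layout->data_in_ram, layout->data_in_flash, layout->data_size);
    memset(layout->bss_in_ram, 0, layout->bss_size);
}
```

Until both operations are done, an initialized global variable would read back whatever happened to be in RAM at power-on, which is why no C-language code can run earlier.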
The init Process
When start_kernel forks out the init thread (implemented by the init function in init/main.c), it is still running in kernel mode, and so is the init thread. When all initializations described earlier are complete, the thread drops the kernel lock and prepares to execute the user-space init process. The file being executed resides in /sbin/init, /etc/init, or /bin/init. If none of those are found, /bin/sh is run as a recovery measure in case the real init got lost or corrupted. As an alternative, the user can specify on the kernel command line which file the init thread should execute.
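The search for an executable init can be sketched as a simple fallback chain. This is an illustration under stated assumptions, not the code in init/main.c: the callback stands in for execve (which does not return when it succeeds), all demo_ names are hypothetical, and the sketch assumes that a user-specified file that cannot be started falls through to the default list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Callback that tries to start a program and returns 0 on success.
 * It replaces execve so that the sketch can be exercised safely. */
typedef int (*demo_exec_fn)(const char *path);

/* Search order taken from the text; /bin/sh is the recovery fallback. */
static const char *const demo_init_paths[] = {
    "/sbin/init", "/etc/init", "/bin/init", "/bin/sh",
};

/* Return the first path that could be started, or NULL if all failed.
 * A file chosen by the user is tried before the built-in list. */
const char *demo_run_init(const char *user_choice, demo_exec_fn try_exec)
{
    size_t i;

    assert(try_exec != NULL);
    if (user_choice != NULL && try_exec(user_choice) == 0)
        return user_choice;
    for (i = 0; i < sizeof(demo_init_paths) / sizeof(demo_init_paths[0]); i++) {
        if (try_exec(demo_init_paths[i]) == 0)
            return demo_init_paths[i];
    }
    return NULL;
}

/* Sample callback simulating a damaged system where only /bin/sh is left. */
int demo_only_shell_exists(const char *path)
{
    return strcmp(path, "/bin/sh") == 0 ? 0 : -1;
}
```

With the sample callback, demo_run_init(NULL, demo_only_shell_exists) ends up selecting /bin/sh, which mirrors the recovery case mentioned above.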
The thread is able to invoke system calls while running in kernel mode because init/main.c has declared __KERNEL_SYSCALLS__ before including <asm/unistd.h>. The header defines special code that allows kernel code to invoke a limited number of system calls just as if it were running in user space. More information about kernel system calls can be found in http://www.linux.it/kerneldocs/ksys.
The kernel Directory
The most important such facility is scheduling. Thus, sched.c, together with <linux/sched.h>, can be considered the most important source file in the Linux kernel. In addition to the scheduler proper, implemented by schedule, the file defines the system calls that control process priorities and all the mechanisms for sleeping and waking.
The fork and exit system calls are implemented by two files that are named after them. They are comprehensive and well-structured files that deal with everything related to process creation and destruction.
Other facilities that are implemented in this directory are time handling (time.c), kernel timers (timer.c), signal delivery and handling (signal.c), module management and related system calls (module.c), the kmod thread (kmod.c), systemwide power management (pm.c), tasklets (softirq.c), and the panic function (panic.c).
The fs Directory
File handling is at the core of any Unix system, and the fs directory in Linux is the fattest of all directories. It includes all the filesystems supported by the current Linux version, each in its own subdirectory, as well as the most important system calls after fork and exit.
The execve system call lives in exec.c and relies on the various available binary formats to actually interpret the binary data found in the executable files. The most important binary format nowadays is ELF, implemented by binfmt_elf.c. binfmt_script.c supports the execution of interpreted files. After detecting the need for an interpreter (usually on the #! or "shebang" line), the file relies on the other binary formats to load the interpreter.
The fundamental system calls for file access are defined in open.c and read_write.c. The former also defines close and several other file-access system calls (chown, for instance). select.c implements select and poll. pipe.c and fifo.c implement pipes and named pipes. readdir.c implements the getdents system call, which is how user-space programs read directories (the name stands for "get directory entries"). Other programming interfaces to access directory data (such as the readdir interface) are all implemented in user space as library functions, based on the getdents system call.
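The interpreter detection mentioned above for script files can be sketched in a few lines. This is a simplified illustration and not the code of binfmt_script.c: demo_parse_shebang is a hypothetical function that only extracts the interpreter path from a buffer holding the beginning of the file, and it ignores any argument that follows the path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Extract the interpreter path from the first line of a file image.
 * Returns 0 and fills interp on success; returns -1 if the data does
 * not start with "#!" or the path does not fit in the buffer. */
int demo_parse_shebang(const char *data, size_t len,
                       char *interp, size_t interp_size)
{
    size_t start = 2, end;

    assert(data != NULL && interp != NULL);
    if (len < 3 || data[0] != '#' || data[1] != '!')
        return -1;
    /* Skip blanks between "#!" and the path. */
    while (start < len && (data[start] == ' ' || data[start] == '\t'))
        start++;
    /* The path ends at the first blank or at the end of the line. */
    end = start;
    while (end < len && data[end] != ' ' && data[end] != '\t' &&
           data[end] != '\n')
        end++;
    if (end == start || end - start >= interp_size)
        return -1;
    memcpy(interp, data + start, end - start);
    interp[end - start] = '\0';
    return 0;
}
```

Once the path is known, the real work is delegated: the interpreter itself is an ordinary executable, so it is loaded by whichever binary format recognizes it.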
Most system calls related to moving files around, such as mkdir, rmdir, rename, link, symlink, and mknod, are implemented in namei.c, which in turn lays its foundations on the directory entry cache that lives in dcache.c.
Of particular interest to device driver writers is devices.c, which implements the char and block driver registries and acts as dispatcher for all devices. It does so by implementing the generic open method that is used before the device-specific file_operations structure is fetched and used. read and write for block devices are implemented in block_dev.c, which in turn delegates to buffer.c everything related to buffer management.
There are several other files in this directory, but they are less interesting. The most important ones are inode.c and file.c, which manage the internal organization of file and inode data structures; ioctl.c, which implements ioctl; and dquot.c, which implements quotas.
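To make the dispatching role of devices.c more concrete, here is a reduced sketch of a driver registry indexed by major number, with a generic open that installs the driver's operations before delegating to them. Every demo_ name is hypothetical, and the structures are heavily trimmed stand-ins for the kernel's own; the real code differs in many details.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_MAJOR 256

struct demo_file;

/* Reduced stand-in for the kernel's file_operations structure. */
struct demo_file_operations {
    int (*open)(struct demo_file *filp);
};

struct demo_file {
    unsigned int major;                      /* device being opened */
    const struct demo_file_operations *f_op; /* set by the generic open */
    int opened_by_driver;                    /* set by the sample driver */
};

/* The registry: one slot per major number. */
static const struct demo_file_operations *demo_chrdevs[DEMO_MAX_MAJOR];

/* Record a driver's operations under its major number. */
int demo_register_chrdev(unsigned int major,
                         const struct demo_file_operations *fops)
{
    assert(fops != NULL);
    if (major >= DEMO_MAX_MAJOR || demo_chrdevs[major] != NULL)
        return -1;
    demo_chrdevs[major] = fops;
    return 0;
}

/* Generic open: look up the driver, install its operations, delegate. */
int demo_generic_open(struct demo_file *filp)
{
    assert(filp != NULL);
    if (filp->major >= DEMO_MAX_MAJOR || demo_chrdevs[filp->major] == NULL)
        return -1; /* no driver registered for this major number */
    filp->f_op = demo_chrdevs[filp->major];
    return filp->f_op->open ? filp->f_op->open(filp) : 0;
}

/* Sample driver used to exercise the registry. */
static int demo_sample_open(struct demo_file *filp)
{
    filp->opened_by_driver = 1;
    return 0;
}

const struct demo_file_operations demo_sample_fops = { demo_sample_open };
```

After the generic open has run, every later operation on the file goes straight to the driver's own methods; the registry is consulted only once per open.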
The mm Directory
The last major directory of kernel source files is devoted to memory management. The files in this directory implement all the data structures that are used throughout the system to manage memory-related issues. While memory management is founded on registers and features specific to a given CPU, we've already seen in Chapter 13, "mmap and DMA" how most of the code has been made platform independent. Interested users can check how asm/arch-arch/mm implements the lowest level for a specific computer platform.
The kmalloc/kfree memory allocation engine is defined in slab.c. This file is a completely new implementation that replaces what used to live in kmalloc.c. The latter file doesn't exist anymore after version 2.0.
The other important allocation tool, vmalloc, and the function that lies behind them all, get_free_pages, are defined in vmalloc.c and page_alloc.c, respectively. Both are pretty straightforward and make interesting reading.
In addition to allocation services, a memory management system must offer memory mappings. After all, mmap is the foundation of many system activities, including the execution of a file. The actual sys_mmap function doesn't live here, though. It is buried in architecture-specific code, because system calls with more than five arguments need special handling in relation to CPU registers. The function that implements mmap for all platforms is do_mmap_pgoff, defined in mmap.c. The same file implements sys_sendfile and sys_brk. The latter may look unrelated, because brk is used to raise the maximum virtual address usable by a process. Actually, Linux (and most current Unices) creates new virtual address space for a process by mapping pages from /dev/zero.
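The idea of obtaining fresh memory by mapping /dev/zero can also be tried from user space. The sketch below does not show how the kernel implements brk; it only illustrates, with the standard open and mmap calls, that a private mapping of /dev/zero yields zero-filled pages that the process can then modify. demo_map_zero_pages is a hypothetical helper name.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return len bytes of zero-filled, private memory backed by /dev/zero,
 * or NULL on failure. Release the memory with munmap(ptr, len). */
void *demo_map_zero_pages(size_t len)
{
    void *mem;
    int fd;

    assert(len > 0);
    fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return NULL;
    mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return mem == MAP_FAILED ? NULL : mem;
}
```

Because the mapping is private, writes made by the process go to its own copy of the pages and never reach the device.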
The mechanisms for mapping a regular file into memory have been placed in filemap.c; the file acts on pretty low-level data structures within the memory management system. mprotect and mremap are implemented in two files of the same names; memory locking appears in mlock.c.
When a process has several memory maps active, you need an efficient way to look for free areas in its memory address space. To this end, all memory maps of a process are laid out in an Adelson-Velski-Landis (AVL) tree. The software structure is implemented in mmap_avl.c.
Swap file initialization and removal (i.e., the swapon and swapoff system calls) are in swapfile.c. The scope of swap_state.c is the swap cache, and page aging is in swap.c. What is known as swapping is not defined here. Instead, it is part of managing memory pages, implemented by the kswapd thread.
The net directory
The net directory in the Linux file hierarchy is the repository for the socket abstraction and the network protocols; these features account for a lot of code, since Linux supports several different network protocols. Each protocol (IP, IPX, and so on) lives in its own subdirectory; the directory for IP is called ipv4 because it represents version 4 of the protocol. The new standard (not yet in wide use as we write this) is called ipv6 and is implemented in Linux as well. Unix-domain sockets are treated as just another network protocol; their implementation can be found in the unix subdirectory.
The network implementation in Linux is based on the same file operations that act on device files. This is natural, because network connections (sockets) are described by normal file descriptors. The file socket.c is the locus of the socket file operations. It dispatches the system calls to one of the network protocols via a struct proto_ops structure. This structure is defined by each network protocol to map system calls to its specific, low-level data handling operations.
The ethernet and bridge directories are used to implement specific low-level functionalities, specifically, the Ethernet-related helper functions described in Chapter 14, "Network Drivers", and bridging functionality.
sunrpc and khttpd are peculiar because they include kernel-level implementations of tasks that are usually carried out in user space.
The two remaining source files within net are sysctl_net.c and netsyms.c. The former is the back end of the sysctl mechanism, and the latter is just a list of EXPORT_SYMBOL declarations. There are several such files all over the kernel, usually one in each major directory.
ipc and lib
The smallest directories (in size) in the Linux source tree are ipc and lib. The former is an implementation of the System V interprocess communication primitives, namely semaphores, message queues, and shared memory; they often get forgotten, but many applications use them (especially shared memory). The latter directory includes generic support functions, similar to the ones available in the standard C library.
The generic library functions are a very small subset of those available in user space, but cover the indispensable things you generally need to write code: string functions (including simple_strtol to convert a string to a long integer) and <ctype.h> functions. The most important file in this directory is vsprintf.c; it implements the function of the same name, which sits at the core of sprintf and printk. Another important file is inflate.c, which includes the decompressing code of gzip.
include and arch
In a quick overview of the kernel source code, there's little to say about headers and architecture-specific code. Header files have been introduced all over the book, so their role (and the separation between include/linux and include/asm) should already be clear.
Drivers
Current Linux kernels support a huge number of devices. Device drivers account for half of the size of the source tree (actually two-thirds if you exclude architecture-specific code that you are not using). They account for almost 1500 C-language files and more than 800 headers.
The drivers directory itself doesn't host any source file, only subdirectories (and, obviously, a makefile).
Structuring the huge amount of source code is not easy, and the developers haven't followed any strict rules. The original division between drivers/char and drivers/block is inefficient nowadays, and more directories have been created according to several different requirements. Still, the most generic char and block drivers are found in drivers/char and drivers/block, so we'll start by visiting those two.
drivers/char
The generic tty layer (as well as line disciplines, tty software drivers, and similar features) is implemented in this directory. console.c defines the linux terminal type (by implementing its specific escape sequences and keyboard encoding). vt.c defines the virtual consoles, including code for switching from one virtual console to another. Selection support (the cut-and-paste capability of the Linux text console) is implemented by selection.c; the default line discipline is implemented by n_tty.c.
There are other files that, despite what you might expect, are device independent. lp.c implements a generic parallel port printer driver that includes a console-on-line-printer capability. It remains device independent by using the parport device driver to map operations to actual hardware (as seen in Figure 2-2). Similarly, keyboard.c implements the higher levels of keyboard handling; it exports the handle_scancode function so that platform-specific keyboard drivers (like pc_keyb.c, in the same directory) can benefit from generalized management. mem.c implements /dev/mem, /dev/null, and /dev/zero, basic resources you can't do without.
drivers/block
A relatively new entry in this directory is blkpg.c (added as of 2.3.3). The file implements generic code for partition and geometry handling in block devices. Its code, together with the fs/partitions directory described earlier, replaces what was earlier part of "generic hard disk" support. The file called genhd.c still exists, but now includes only the generic initialization function for block drivers (similar to the one for char drivers that is part of mem.c). One of the public functions exported by blkpg.c is blk_ioctl, covered by "The ioctl Method" in Chapter 12, "Loading Block Drivers".
In addition to the hardware-dependent device drivers you would expect to find in drivers/block, the directory also includes software device drivers that are inherently cross-platform, just like the sbull and spull drivers that we introduced in this book. They are the RAM disk rd.c, the "network block device" nbd.c, and the loopback block device loop.c. The loopback device is used to mount files as if they were block devices. (See the manpage for mount, where it describes the -o loop option.) The network block device can be used to access remote resources as block devices (thus allowing, for example, a remote swap device).
drivers/ide
The IDE family of device drivers used to live in drivers/block but has expanded to the point where they were moved into a separate directory. As a matter of fact, the IDE interface has been enhanced and extended over time in order to support more than just conventional hard disks. For example, IDE tapes are now supported as well.
drivers/md
This directory is concerned with implementing RAID functionality and the Logical Volume Manager abstraction. The code registers its own char and block major numbers, so it can be considered a driver just like those traditional drivers; nonetheless, the code has been kept separate because it has nothing to do with direct hardware management.
drivers/cdrom
This directory hosts the generic CD-ROM interface. Both the IDE and SCSI cdrom drivers rely on drivers/cdrom/cdrom.c for some of their functionality. The main entry points to the file are register_cdrom and unregister_cdrom; the caller passes them a pointer to struct cdrom_device_info as the main object involved in CD-ROM management.
drivers/scsi
Everything related to the SCSI bus has always been placed in this directory. This includes both controller-independent support for specific devices (such as hard drives and tapes) and drivers for specific SCSI controller boards.
Management of the SCSI bus interface is scattered in several files: scsi.c, hosts.c, scsi_ioctl.c, and a dozen more. If you are interested in the whole list, you'd better browse the makefile, where scsi_mod-objs is defined. All public entry points to this group of files have been collected in scsi_syms.c.
Host adapters (i.e., SCSI controller hardware) can be plugged into the core system by calling scsi_register_module with an argument of MODULE_SCSI_HA. Most drivers currently do that by using the scsi_module.c facility to register themselves: the driver's source file defines its (static) data structures and then includes scsi_module.c. This file defines standard initialization and cleanup functions, based on <linux/init.h> and the init calls mechanisms. This technique allows drivers to serve as either modules or compiled-in functions without any #ifdef lines.
drivers/net
As you might expect, this directory is the home for most interface adapters. Unlike drivers/scsi, this directory doesn't include the actual communication protocols, which live in the top-level net directory tree. Nonetheless, there's still some bit of software abstraction implemented in drivers/net, namely, the implementation of the various line disciplines used by serial-based network communication.
drivers/sound
Like drivers/scsi and drivers/net, this directory includes all the drivers for sound cards. The contents of the directory are somewhat similar to the SCSI directory: a few files make up the core sound system, and individual device drivers stack on top of it. The core sound system is in charge of requesting the major number SOUND_MAJOR and dispatching any use of it to the underlying device drivers. A hardware driver plugs into the core by calling sound_install_audiodrv, declared in dev_table.c.
drivers/video
Here you find all the frame buffer video devices. The directory is concerned with video output, not video input. Like drivers/sound, the whole directory implements a single char device driver; a core frame buffer system dispatches actual access to the various frame buffers available on the computer.
The entry point to /dev/fb devices is in fbmem.c. The file registers the major number and maintains an internal list of which frame buffer device is in charge of each minor number. A hardware driver registers itself by calling register_framebuffer, passing a pointer to struct fb_info. The data structure includes everything that's needed for specific device management. It includes the open and release methods, but no read, write, or mmap; these methods are implemented in a generalized way in fbmem.c itself.
When the first frame buffer device is registered, the function register_framebuffer calls take_over_console (exported by drivers/char/console.c) in order to actually set up the current frame buffer as the system console. At boot time, before frame buffer initialization, the console is either the native text screen or, if none is there, the first serial port. The command line starting the kernel, of course, can override the default by selecting a specific console device. Kernel developers created take_over_console to add support for frame buffer consoles without complicating the boot code. (Usually frame buffer drivers depend on PCI or equivalent support, so they can't be active too early during the boot process.) The take_over_console feature, however, is not limited to frame buffers; it's available to any code involving any hardware. If you want to transmit kernel messages using a Morse beeper or UDP network packets, you can do that by calling take_over_console from your kernel module.
drivers/input
Input management is another facility meant to simplify and standardize activities that are common to several drivers, and to offer a unified interface to user space. The core file here is called input.c. It registers itself as a char driver using INPUT_MAJOR as its major number. Its role is collecting events from low-level device drivers and dispatching them to higher layers.
drivers/media
This directory, introduced as of version 2.4.0-test7, collects other communication media, currently radio and video input devices. Both the media/radio and media/video drivers currently stack on video/videodev.c, which implements the "Video For Linux" API.
video/videodev.c is a generic container. It requests a major number and makes it available to hardware drivers. Individual low-level drivers register by calling video_register_device. They pass a pointer to their own struct video_device and an integer that specifies the type of device. Supported devices are frame grabbers (VFL_TYPE_GRABBER), radios (VFL_TYPE_RADIO), teletext devices (VFL_TYPE_VTX), and undecoded vertical-blank information (VFL_TYPE_VBI).
Bus-Specific Directories
Directories devoted to external buses include drivers/usb, drivers/pcmcia, drivers/parport (generic cross-platform parallel port support, which defines a whole new class of device drivers), drivers/isdn (all ISDN controllers supported by Linux and their common support functions), drivers/atm (the same, for ATM network connections), and drivers/ieee1394 (FireWire).
Platform-Specific Directories
Other Subdirectories
There are other subdirectories in drivers, but they are, in our opinion, currently of minor interest and very specific use. drivers/mtd implements a Memory Technology Device layer, which is used to manage solid-state disks (flash memories and other kinds of EEPROM). drivers/i2c offers an implementation of the i2c protocol, which is the "Inter Integrated Circuit" two-wire bus used internally by several modern peripherals, especially frame grabbers. drivers/i2o, similarly, handles I2O devices (a proprietary high-speed communication standard for certain PCI devices, which has been unveiled under pressure from the free software community). drivers/pnp is a collection of common ISA Plug-and-Play code from various drivers, but fortunately the PnP hack is not really used nowadays by manufacturers.
© 2001, O'Reilly & Associates, Inc.