Linux Device Drivers, 2nd Edition
By Alessandro Rubini & Jonathan Corbet
2nd Edition June 2001
0-59600-008-1, Order Number: 0081
586 pages, $39.95
Chapter 3
Char Drivers

Contents:
The Design of scull
Major and Minor Numbers
File Operations
The file Structure
open and release
scull's Memory Usage
A Brief Introduction to Race Conditions
read and write
Playing with the New Devices
The Device Filesystem
Backward Compatibility
Quick Reference

The Design of scull
The first step of driver writing is defining the capabilities (the mechanism) the driver will offer to user programs. Since our "device" is part of the computer's memory, we're free to do what we want with it. It can be a sequential or random-access device, one device or many, and so on.
- scull0 to scull3
Four devices each consisting of a memory area that is both global and persistent. Global means that if the device is opened multiple times, the data contained within the device is shared by all the file descriptors that opened it. Persistent means that if the device is closed and reopened, data isn't lost. This device can be fun to work with, because it can be accessed and tested using conventional commands such as cp, cat, and shell I/O redirection; we'll examine its internals in this chapter.
- scullpipe0 to scullpipe3
Four FIFO (first-in-first-out) devices, which act like pipes. One process reads what another process writes. If multiple processes read the same device, they contend for data. The internals of scullpipe will show how blocking and nonblocking read and write can be implemented without having to resort to interrupts. Although real drivers synchronize with their devices using hardware interrupts, the topic of blocking and nonblocking operations is an important one and is separate from interrupt handling (covered in Chapter 9, "Interrupt Handling").
- scullsingle
- scullpriv
- sculluid
- scullwuid
Major and Minor Numbers
Char devices are accessed through names in the filesystem. Those names are called special files or device files or simply nodes of the filesystem tree; they are conventionally located in the /dev directory. Special files for char drivers are identified by a "c" in the first column of the output of ls -l. Block devices appear in /dev as well, but they are identified by a "b." The focus of this chapter is on char devices, but much of the following information applies to block devices as well.
If you issue the ls -l command, you'll see two numbers (separated by a comma) in the device file entries before the date of last modification, where the file length normally appears. These numbers are the major device number and minor device number for the particular device. The following listing shows a few devices as they appear on a typical system. Their major numbers are 1, 4, 7, and 10, while the minors are 1, 3, 5, 64, 65, and 129.
    crw-rw-rw- 1 root   root     1,   3 Feb 23  1999 null
    crw------- 1 root   root    10,   1 Feb 23  1999 psaux
    crw------- 1 rubini tty      4,   1 Aug 16 22:22 tty1
    crw-rw-rw- 1 root   dialout  4,  64 Jun 30 11:19 ttyS0
    crw-rw-rw- 1 root   dialout  4,  65 Aug 16 00:00 ttyS1
    crw------- 1 root   sys      7,   1 Feb 23  1999 vcs1
    crw------- 1 root   sys      7, 129 Feb 23  1999 vcsa1
    crw-rw-rw- 1 root   root     1,   5 Feb 23  1999 zero

The minor number is used only by the driver specified by the major number; other parts of the kernel don't use it, and merely pass it along to the driver. It is common for a driver to control several devices (as shown in the listing); the minor number provides a way for the driver to differentiate among them.
Version 2.4 of the kernel, though, introduced a new (optional) feature, the device file system or devfs. If this file system is used, management of device files is simplified and quite different; on the other hand, the new filesystem brings several user-visible incompatibilities, and as we are writing it has not yet been chosen as a default feature by system distributors. The previous description and the following instructions about adding a new driver and special file assume that devfs is not present. The gap is filled later in this chapter, in "The Device Filesystem".
When devfs is not being used, adding a new driver to the system means assigning a major number to it. The assignment should be made at driver (module) initialization by calling the following function, defined in <linux/fs.h>:
    int register_chrdev(unsigned int major, const char *name,
                        struct file_operations *fops);

The return value indicates success or failure of the operation. A negative return code signals an error; a 0 or positive return code reports successful completion. The major argument is the major number being requested, name is the name of your device, which will appear in /proc/devices, and fops is the pointer to an array of function pointers, used to invoke your driver's entry points, as explained in "File Operations", later in this chapter.
The next question is how to give programs a name by which they can request your driver. A name must be inserted into the /dev directory and associated with your driver's major and minor numbers.
    mknod /dev/scull0 c 254 0

Dynamic Allocation of Major Numbers
Some major device numbers are statically assigned to the most common devices. A list of those devices can be found in Documentation/devices.txt within the kernel source tree. Because many numbers are already assigned, choosing a unique number for a new driver can be difficult -- there are far more custom drivers than available major numbers. You could use one of the major numbers reserved for "experimental or local use,"[14] but if you experiment with several "local" drivers or you publish your driver for third parties to use, you'll again experience the problem of choosing a suitable number.
The disadvantage of dynamic assignment is that you can't create the device nodes in advance because the major number assigned to your module can't be guaranteed to always be the same. This means that you won't be able to use loading-on-demand of your driver, an advanced feature introduced in Chapter 11, "kmod and Advanced Modularization". For normal use of the driver, this is hardly a problem, because once the number has been assigned, you can read it from /proc/devices.
A typical /proc/devices file looks like the following:
    Character devices:
      1 mem
      2 pty
      3 ttyp
      4 ttyS
      6 lp
      7 vcs
     10 misc
     13 input
     14 sound
     21 sg
    180 usb

    Block devices:
      2 fd
      8 sd
     11 sr
     65 sd
     66 sd

The script to load a module that has been assigned a dynamic number can thus be written using a tool such as awk to retrieve information from /proc/devices in order to create the files in /dev.
    #!/bin/sh
    module="scull"
    device="scull"
    mode="664"

    # invoke insmod with all arguments we were passed
    # and use a pathname, as newer modutils don't look in . by default
    /sbin/insmod -f ./$module.o $* || exit 1

    # remove stale nodes
    rm -f /dev/${device}[0-3]

    major=`awk "\\$2==\"$module\" {print \\$1}" /proc/devices`

    mknod /dev/${device}0 c $major 0
    mknod /dev/${device}1 c $major 1
    mknod /dev/${device}2 c $major 2
    mknod /dev/${device}3 c $major 3

    # give appropriate group/permissions, and change the group.
    # Not all distributions have staff; some have "wheel" instead.
    group="staff"
    grep '^staff:' /etc/group > /dev/null || group="wheel"

    chgrp $group /dev/${device}[0-3]
    chmod $mode /dev/${device}[0-3]

The last few lines of the script may seem obscure: why change the group and mode of a device? The reason is that the script must be run by the superuser, so newly created special files are owned by root. The permission bits default so that only root has write access, while anyone can get read access. Normally, a device node requires a different access policy, so in some way or another access rights must be changed. The default in our script is to give access to a group of users, but your needs may vary. Later, in the section "Access Control on a Device File" in Chapter 5, "Enhanced Char Driver Operations", the code for sculluid will demonstrate how the driver can enforce its own kind of authorization for device access. A scull_unload script is then available to clean up the /dev directory and remove the module.
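The lookup that the awk command performs can also be sketched in C. The fragment below is only an illustration: find_char_major is a hypothetical helper, not part of scull, and it works on text that has already been read from /proc/devices. Unlike the awk one-liner, it looks only at the "Character devices:" section, so a block driver with the same name is not matched.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan /proc/devices-style text for a driver name
 * and return its major number, or -1 if the name is not listed among
 * the character devices.
 */
int find_char_major(const char *text, const char *name)
{
    int in_char_section = 0;
    const char *line = text;

    assert(text != NULL && name != NULL);
    while (*line) {
        const char *end = strchr(line, '\n');
        size_t full = end ? (size_t)(end - line) : strlen(line);
        char buf[128];
        char dev[64];
        int major;
        size_t len = full < sizeof(buf) - 1 ? full : sizeof(buf) - 1;

        /* work on a private, NUL-terminated copy of the current line */
        memcpy(buf, line, len);
        buf[len] = '\0';

        if (strncmp(buf, "Character devices:", 18) == 0)
            in_char_section = 1;
        else if (strncmp(buf, "Block devices:", 14) == 0)
            in_char_section = 0;
        else if (in_char_section &&
                 sscanf(buf, "%d %63s", &major, dev) == 2 &&
                 strcmp(dev, name) == 0)
            return major;

        line += full + (end ? 1 : 0);   /* step past the newline, if any */
    }
    return -1;
}
```

A loader written in C could pass the returned number to mknod in the same way the script does; the shell version remains the simpler choice for everyday use.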
As an alternative to using a pair of scripts for loading and unloading, you could write an init script, ready to be placed in the directory your distribution uses for these scripts.[15] As part of the scull source, we offer a fairly complete and configurable example of an init script, called scull.init; it accepts the conventional arguments -- either "start" or "stop" or "restart" -- and performs the role of both scull_load and scull_unload.
If repeatedly creating and destroying /dev nodes sounds like overkill, there is a useful workaround. If you are only loading and unloading a single driver, you can just use rmmod and insmod after the first time you create the special files with your script: dynamic numbers are not randomized, and you can count on the same number to be chosen if you don't mess with other (dynamic) modules. Avoiding lengthy scripts is useful during development. But this trick, clearly, doesn't scale to more than one driver at a time.
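Why the same number comes back can be seen with a toy model of the allocator. The sketch below is a user-space illustration, not kernel code: every sketch_ name is hypothetical, and it assumes a table of 255 slots that is scanned downward from the top when major 0 is requested, which is how we understand the 2.4 implementation to behave.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_CHRDEV 255   /* assumed table size; majors 1..254 usable */
#define SKETCH_EBUSY      16    /* stand-in for the kernel's EBUSY */
#define SKETCH_EINVAL     22    /* stand-in for the kernel's EINVAL */

/* One slot per major number; NULL means the major is free. */
static const char *sketch_chrdevs[SKETCH_MAX_CHRDEV];

/*
 * Toy model of register_chrdev: major 0 asks for a dynamic number,
 * found by scanning downward from the top of the table.
 */
int sketch_register(unsigned int major, const char *name)
{
    assert(name != NULL);
    if (major == 0) {
        for (major = SKETCH_MAX_CHRDEV - 1; major > 0; major--) {
            if (sketch_chrdevs[major] == NULL) {
                sketch_chrdevs[major] = name;
                return (int)major;   /* dynamic: report the number chosen */
            }
        }
        return -SKETCH_EBUSY;        /* every major is taken */
    }
    if (major >= SKETCH_MAX_CHRDEV)
        return -SKETCH_EINVAL;
    if (sketch_chrdevs[major] != NULL)
        return -SKETCH_EBUSY;        /* requested number already in use */
    sketch_chrdevs[major] = name;
    return 0;                        /* static request succeeded */
}

/* Toy model of unregister_chrdev: free the slot again. */
void sketch_unregister(unsigned int major)
{
    if (major < SKETCH_MAX_CHRDEV)
        sketch_chrdevs[major] = NULL;
}
```

In this model, unloading and reloading a single module always yields the same major, while loading two dynamic modules in a different order swaps their numbers, which is exactly the limitation described above.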
Here's the code we use in scull's source to get a major number:
    result = register_chrdev(scull_major, "scull", &scull_fops);
    if (result < 0) {
        printk(KERN_WARNING "scull: can't get major %d\n",scull_major);
        return result;
    }
    if (scull_major == 0) scull_major = result; /* dynamic */

Removing a Driver from the System
When a module is unloaded from the system, the major number must be released. This is accomplished with the following function, which you call from the module's cleanup function:
    int unregister_chrdev(unsigned int major, const char *name);

[17] The word oops is used as both a noun and a verb by Linux enthusiasts.
In addition to unloading the module, you'll often need to remove the device files for the removed driver. The task can be accomplished by a script that pairs to the one used at load time. The script scull_unload does the job for our sample device; as an alternative, you can invoke scull.init stop.
If dynamic device files are not removed from /dev, there's a possibility of unexpected errors: a spare /dev/framegrabber on a developer's computer might refer to a fire-alarm device one month later if both drivers used a dynamic major number. "No such file or directory" is a friendlier response to opening /dev/framegrabber than the new driver would produce.
dev_t and kdev_t
So far we've talked about the major number. Now it's time to discuss the minor number and how the driver uses it to differentiate among devices.
Every time the kernel calls a device driver, it tells the driver which device is being acted upon. The major and minor numbers are paired in a single data type that the driver uses to identify a particular device. The combined device number (the major and minor numbers concatenated together) resides in the field i_rdev of the inode structure, which we introduce later. Some driver functions receive a pointer to struct inode as the first argument. So if you call the pointer inode (as most driver writers do), the function can extract the device number by looking at inode->i_rdev.
The information about kdev_t is confined in <linux/kdev_t.h>, which is mostly comments. The header makes instructive reading if you're interested in the reasoning behind the code. There's no need to include the header explicitly in the drivers, however, because <linux/fs.h> does it for you.
The following macros and functions are the operations you can perform on kdev_t:
- MAJOR(kdev_t dev);
- MINOR(kdev_t dev);
- MKDEV(int ma, int mi);
- kdev_t_to_nr(kdev_t dev);
- to_kdev_t(int dev);
As long as your code uses these operations to manipulate device numbers, it should continue to work even as the internal data structures change.
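To make the idea of a combined device number concrete, here is a user-space sketch of how such macros can be written. It is an assumption-laden illustration: the SKETCH_ names are hypothetical, and the layout (a 16-bit value with the major number in the high byte and the minor number in the low byte) is the one we believe 2.4 uses internally. Driver code should keep using MAJOR, MINOR, and MKDEV rather than relying on any layout.

```c
#include <assert.h>

/* Assumed layout: 16-bit device number, major in the high byte. */
#define SKETCH_MINORBITS 8
#define SKETCH_MINORMASK ((1U << SKETCH_MINORBITS) - 1)

typedef unsigned short sketch_dev_t;   /* hypothetical stand-in for kdev_t */

#define SKETCH_MAJOR(dev)    ((unsigned int)((dev) >> SKETCH_MINORBITS))
#define SKETCH_MINOR(dev)    ((unsigned int)((dev) & SKETCH_MINORMASK))
#define SKETCH_MKDEV(ma, mi) ((sketch_dev_t)(((ma) << SKETCH_MINORBITS) | (mi)))

/* Round-trip check: splitting and recombining yields the same pair. */
int sketch_roundtrip_ok(unsigned int major, unsigned int minor)
{
    sketch_dev_t dev;

    assert(major < 256 && minor < 256);   /* both must fit in one byte */
    dev = SKETCH_MKDEV(major, minor);
    return SKETCH_MAJOR(dev) == major && SKETCH_MINOR(dev) == minor;
}
```

With this layout, the ttyS0 entry from the earlier listing (major 4, minor 64) would be stored as 0x0440.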
File Operations
In the next few sections, we'll look at the various operations a driver can perform on the devices it manages. An open device is identified internally by a file structure, and the kernel uses the file_operations structure to access the driver's functions. The structure, defined in <linux/fs.h>, is an array of function pointers. Each file is associated with its own set of functions (by including a field called f_op that points to a file_operations structure). The operations are mostly in charge of implementing the system calls and are thus named open, read, and so on. We can consider the file to be an "object" and the functions operating on it to be its "methods," using object-oriented programming terminology to denote actions declared by an object to act on itself. This is the first sign of object-oriented programming we see in the Linux kernel, and we'll see more in later chapters.
- loff_t (*llseek) (struct file *, loff_t, int);
The llseek method is used to change the current read/write position in a file, and the new position is returned as a (positive) return value. The loff_t is a "long offset" and is at least 64 bits wide even on 32-bit platforms. Errors are signaled by a negative return value. If the function is not specified for the driver, a seek relative to end-of-file fails, while other seeks succeed by modifying the position counter in the file structure (described in "The file Structure" later in this chapter).
- ssize_t (*read) (struct file *, char *, size_t, loff_t *);
- ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
- int (*readdir) (struct file *, void *, filldir_t);
- unsigned int (*poll) (struct file *, struct poll_table_struct *);
The poll method is the back end of two system calls, poll and select, both used to inquire if a device is readable or writable or in some special state. Either system call can block until a device becomes readable or writable. If a driver doesn't define its poll method, the device is assumed to be both readable and writable, and in no special state. The return value is a bit mask describing the status of the device.
- int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
- int (*mmap) (struct file *, struct vm_area_struct *);
- int (*open) (struct inode *, struct file *);
- int (*flush) (struct file *);
The flush operation is invoked when a process closes its copy of a file descriptor for a device; it should execute (and wait for) any outstanding operations on the device. This must not be confused with the fsync operation requested by user programs. Currently, flush is used only in the network file system (NFS) code. If flush is NULL, it is simply not invoked.
- int (*release) (struct inode *, struct file *);
- int (*fsync) (struct inode *, struct dentry *, int);
- int (*fasync) (int, struct file *, int);
- int (*lock) (struct file *, int, struct file_lock *);
The lock method is used to implement file locking; locking is an indispensable feature for regular files, but is almost never implemented by device drivers.
- ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
- ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
These methods, added late in the 2.3 development cycle, implement scatter/gather read and write operations. Applications occasionally need to do a single read or write operation involving multiple memory areas; these system calls allow them to do so without forcing extra copy operations on the data.
- struct module *owner;
The scull device driver implements only the most important device methods, and uses the tagged format to declare its file_operations structure:
    struct file_operations scull_fops = {
        llseek:     scull_llseek,
        read:       scull_read,
        write:      scull_write,
        ioctl:      scull_ioctl,
        open:       scull_open,
        release:    scull_release,
    };

The owner field can be initialized in the same tagged format:

    owner: THIS_MODULE,

That approach works, but only on 2.4 kernels. A more portable approach is to use the SET_MODULE_OWNER macro, which is defined in <linux/module.h>. scull performs this initialization as follows:

    SET_MODULE_OWNER(&scull_fops);

This macro works on any structure that has an owner field; we will encounter this field again in other contexts later in the book.
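The "object with methods" idea behind file_operations can be tried out in user space. The following is a minimal sketch under stated assumptions: struct sketch_file and struct sketch_ops are hypothetical, cut-down stand-ins for the kernel structures, and the initializer uses the standard C99 spelling (.read = ...) instead of the gcc-specific tagged format (read: ...) shown above. Fields that are not named in the initializer are set to NULL, which is how a driver leaves a method unimplemented.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

struct sketch_file;      /* hypothetical stand-in for struct file */

/* A cut-down method table in the spirit of file_operations. */
struct sketch_ops {
    ssize_t (*read)(struct sketch_file *, char *, size_t);
    ssize_t (*write)(struct sketch_file *, const char *, size_t);
    int (*open)(struct sketch_file *);
};

struct sketch_file {
    const struct sketch_ops *f_op;   /* the "methods" of this object */
};

/* A read method that behaves like reading /dev/zero. */
ssize_t sketch_zero_read(struct sketch_file *f, char *buf, size_t count)
{
    (void)f;
    memset(buf, 0, count);
    return (ssize_t)count;
}

/* Only read is set; the fields left out are initialized to NULL. */
const struct sketch_ops sketch_zero_ops = {
    .read = sketch_zero_read,
};

/* Dispatch as a caller would: a NULL method means "not supported". */
ssize_t sketch_do_read(struct sketch_file *f, char *buf, size_t count)
{
    assert(f != NULL && f->f_op != NULL);
    if (f->f_op->read == NULL)
        return -EINVAL;
    return f->f_op->read(f, buf, count);
}
```

The caller never names sketch_zero_read directly; it only follows the pointer stored in the table, which is what lets one piece of dispatch code serve every driver.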
The file Structure
struct file, defined in <linux/fs.h>, is the second most important data structure used in device drivers. Note that a file has nothing to do with the FILEs of user-space programs. A FILE is defined in the C library and never appears in kernel code. A struct file, on the other hand, is a kernel structure that never appears in user programs.
The file structure represents an open file. (It is not specific to device drivers; every open file in the system has an associated struct file in kernel space.) It is created by the kernel on open and is passed to any function that operates on the file, until the last close. After all instances of the file are closed, the kernel releases the data structure. An open file is different from a disk file, represented by struct inode.
- mode_t f_mode;
The file mode identifies the file as either readable or writable (or both), by means of the bits FMODE_READ and FMODE_WRITE. You might want to check this field for read/write permission in your ioctl function, but you don't need to check permissions for read and write because the kernel checks before invoking your method. An attempt to write without permission, for example, is rejected without the driver even knowing about it.
- loff_t f_pos;
The current reading or writing position. loff_t is a 64-bit value (long long in gcc terminology). The driver can read this value if it needs to know the current position in the file, but should never change it (read and write should update a position using the pointer they receive as the last argument instead of acting on filp->f_pos directly).
- unsigned int f_flags;
These are the file flags, such as O_RDONLY, O_NONBLOCK, and O_SYNC. A driver needs to check the flag for nonblocking operation, while the other flags are seldom used. In particular, read/write permission should be checked using f_mode instead of f_flags. All the flags are defined in the header <linux/fcntl.h>.
- struct file_operations *f_op;
The operations associated with the file. The kernel assigns the pointer as part of its implementation of open, and then reads it when it needs to dispatch any operations. The value in filp->f_op is never saved for later reference; this means that you can change the file operations associated with your file whenever you want, and the new methods will be effective immediately after you return to the caller. For example, the code for open associated with major number 1 (/dev/null, /dev/zero, and so on) substitutes the operations in filp->f_op depending on the minor number being opened. This practice allows the implementation of several behaviors under the same major number without introducing overhead at each system call. The ability to replace the file operations is the kernel equivalent of "method overriding" in object-oriented programming.
- void *private_data;
The open system call sets this pointer to NULL before calling the open method for the driver. The driver is free to make its own use of the field or to ignore it. The driver can use the field to point to allocated data, but then must free memory in the release method before the file structure is destroyed by the kernel. private_data is a useful resource for preserving state information across system calls and is used by most of our sample modules.
- struct dentry *f_dentry;
The directory entry (dentry) structure associated with the file. Dentries are an optimization introduced in the 2.1 development series. Device driver writers normally need not concern themselves with dentry structures, other than to access the inode structure as filp->f_dentry->d_inode.
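The advice about f_flags is easier to follow with the flag values in hand. The access mode is a small field inside the flags word, not a set of independent bits, and on Linux O_RDONLY is 0, so a test such as flags & O_RDONLY can never succeed. The helpers below are a user-space sketch with hypothetical names; they take a plain flags word rather than a struct file.

```c
#include <assert.h>
#include <fcntl.h>

/* Hypothetical helpers operating on an open(2)-style flags word. */

/* True when the file was opened write-only (not read-write). */
int sketch_is_write_only(unsigned int f_flags)
{
    /* mask out the access-mode field, then compare the whole field */
    return (f_flags & O_ACCMODE) == O_WRONLY;
}

/* True when nonblocking operation was requested. */
int sketch_is_nonblocking(unsigned int f_flags)
{
    /* O_NONBLOCK is an ordinary single-bit flag */
    return (f_flags & O_NONBLOCK) != 0;
}
```

A driver's read or write method would make the second kind of test on filp->f_flags before deciding whether it may sleep.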
open and release
Now that we've taken a quick look at the fields, we'll start using them in real scull functions.
The open Method
The open method is provided for a driver to do any initialization in preparation for later operations. In addition, open usually increments the usage count for the device so that the module won't be unloaded before the file is closed. The count, described in "The Usage Count" in Chapter 2, "Building and Running Modules", is then decremented by the release method.
In most drivers, open should perform the following tasks:
Check for device-specific errors (such as device-not-ready or similar hardware problems)
Initialize the device, if it is being opened for the first time
Identify the minor number and update the f_op pointer, if necessary
Allocate and fill any data structure to be put in filp->private_data
In scull, most of the preceding tasks depend on the minor number of the device being opened. Therefore, the first thing to do is identify which device is involved. We can do that by looking at inode->i_rdev.
We've already talked about how the kernel doesn't use the minor number of the device, so the driver is free to use it at will. In practice, different minor numbers are used to access different devices or to open the same device in a different way. For example, /dev/st0 (minor number 0) and /dev/st1 (minor 1) refer to different SCSI tape drives, whereas /dev/nst0 (minor 128) is the same physical device as /dev/st0, but it acts differently (it doesn't rewind the tape when it is closed). All of the tape device files have different minor numbers, so that the driver can tell them apart.
A driver never actually knows the name of the device being opened, just the device number -- and users can play on this indifference to names by aliasing new names to a single device for their own convenience. If you create two special files with the same major/minor pair, the devices are one and the same, and there is no way to differentiate between them. The same effect can be obtained using a symbolic or hard link, and the preferred way to implement aliasing is creating a symbolic link.
The scull driver uses the minor number like this: the most significant nibble (upper four bits) identifies the type (personality) of the device, and the least significant nibble (lower four bits) lets you distinguish between individual devices if the type supports more than one device instance. Thus, scull0 is different from scullpipe0 in the top nibble, while scull0 and scull1 differ in the bottom nibble.[19] Two macros (TYPE and NUM) are defined in the source to extract the bits from a device number, as shown here:
    #define TYPE(dev)   (MINOR(dev) >> 4)  /* high nibble */
    #define NUM(dev)    (MINOR(dev) & 0xf) /* low nibble */

For each device type, scull defines a specific file_operations structure, which is placed in filp->f_op at open time. The following code shows how multiple fops are implemented:

    struct file_operations *scull_fop_array[]={
        &scull_fops,      /* type 0 */
        &scull_priv_fops, /* type 1 */
        &scull_pipe_fops, /* type 2 */
        &scull_sngl_fops, /* type 3 */
        &scull_user_fops, /* type 4 */
        &scull_wusr_fops  /* type 5 */
    };
    #define SCULL_MAX_TYPE 5

    /* In scull_open, the fop_array is used according to TYPE(dev) */
    int type = TYPE(inode->i_rdev);

    if (type > SCULL_MAX_TYPE) return -ENODEV;
    filp->f_op = scull_fop_array[type];

The code for scull_open follows:

    int scull_open(struct inode *inode, struct file *filp)
    {
        Scull_Dev *dev; /* device information */
        int num = NUM(inode->i_rdev);
        int type = TYPE(inode->i_rdev);

        /*
         * If private data is not valid, we are not using devfs
         * so use the type (from minor nr.) to select a new f_op
         */
        if (!filp->private_data && type) {
            if (type > SCULL_MAX_TYPE) return -ENODEV;
            filp->f_op = scull_fop_array[type];
            return filp->f_op->open(inode, filp); /* dispatch to specific open */
        }

        /* type 0, check the device number (unless private_data valid) */
        dev = (Scull_Dev *)filp->private_data;
        if (!dev) {
            if (num >= scull_nr_devs) return -ENODEV;
            dev = &scull_devices[num];
            filp->private_data = dev; /* for other methods */
        }

        MOD_INC_USE_COUNT;  /* Before we maybe sleep */
        /* now trim to 0 the length of the device if open was write-only */
        if ( (filp->f_flags & O_ACCMODE) == O_WRONLY) {
            if (down_interruptible(&dev->sem)) {
                MOD_DEC_USE_COUNT;
                return -ERESTARTSYS;
            }
            scull_trim(dev); /* ignore errors */
            up(&dev->sem);
        }

        return 0;          /* success */
    }

The calls to down_interruptible and up can be ignored for now; we will get to them shortly.
The only real operation performed on the device is truncating it to a length of zero when the device is opened write-only. This is performed because, by design, overwriting a scull device with a shorter file results in a shorter device data area. This is similar to the way opening a regular file for writing truncates it to zero length. The operation does nothing if the device is opened for reading (read-only or read-write).
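The nibble split that scull_open relies on can be tried in user space. The sketch below works on a bare minor number instead of a kdev_t, and it replaces the array of file_operations pointers with an array of names; the mapping of array slots to device names is our assumption, based on the order of scull_fop_array and the device list at the start of this chapter. All sketch_ and SKETCH_ identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_TYPE(minor) ((minor) >> 4)    /* high nibble */
#define SKETCH_NUM(minor)  ((minor) & 0xf)   /* low nibble */
#define SKETCH_MAX_TYPE    5

/* Names standing in for the per-type file_operations structures. */
static const char *const sketch_type_names[] = {
    "scull", "scullpriv", "scullpipe", "scullsingle", "sculluid", "scullwuid"
};

/* Return the personality for a minor number, or NULL (think -ENODEV). */
const char *sketch_personality(unsigned int minor)
{
    unsigned int type = SKETCH_TYPE(minor);

    assert(minor <= 0xff);          /* minors are one byte wide here */
    if (type > SKETCH_MAX_TYPE)
        return NULL;
    return sketch_type_names[type];
}
```

Under these assumptions, minor 0x21 selects the pipe personality (type 2) and is instance 1 of that type, while a minor such as 0x60 names no device type at all and is rejected.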
The release Method
The role of the release method is the reverse of open. Sometimes you'll find that the method implementation is called device_close instead of device_release. Either way, the device method should perform the following tasks:
The basic form of scull has no hardware to shut down, so the code required is minimal:[20]
    int scull_release(struct inode *inode, struct file *filp)
    {
        MOD_DEC_USE_COUNT;
        return 0;
    }

scull's Memory Usage
Before introducing the read and write operations, we'd better look at how and why scull performs memory allocation. "How" is needed to thoroughly understand the code, and "why" demonstrates the kind of choices a driver writer needs to make, although scull is definitely not typical as a device.
The implementation chosen for scull is not a smart one. The source code for a smart implementation would be more difficult to read, and the aim of this section is to show read and write, not memory management. That's why the code just uses kmalloc and kfree without resorting to allocation of whole pages, although that would be more efficient.
![]()
Figure 3-1. The layout of a scull device
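The layout implies a simple mapping from a byte offset to a position in the structure: which list item, which slot in that item's quantum set, and which byte within the quantum. The same arithmetic appears in scull_read and scull_write later in this chapter. The sketch below isolates it in a user-space function; struct sketch_pos and sketch_locate are hypothetical names, and the quantum and quantum-set sizes are passed in as parameters rather than taken from a device.

```c
#include <assert.h>

/* Where a byte offset lands in the list / quantum-set / quantum layout. */
struct sketch_pos {
    long item;    /* index of the list item */
    long s_pos;   /* slot within that item's quantum set */
    long q_pos;   /* byte offset within the quantum */
};

struct sketch_pos sketch_locate(long offset, int quantum, int qset)
{
    struct sketch_pos p;
    long itemsize = (long)quantum * qset;   /* bytes held by one list item */
    long rest;

    assert(offset >= 0 && quantum > 0 && qset > 0);
    p.item = offset / itemsize;             /* whole list items skipped */
    rest = offset % itemsize;               /* bytes into the final item */
    p.s_pos = rest / quantum;
    p.q_pos = rest % quantum;
    return p;
}
```

For example, with a quantum of 4000 bytes and a quantum set of 1000 pointers (values chosen here for illustration), offset 4004321 falls in list item 1, slot 1, byte 321.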
The data structure used to hold device information is as follows:
    typedef struct Scull_Dev {
        void **data;
        struct Scull_Dev *next;   /* next list item */
        int quantum;              /* the current quantum size */
        int qset;                 /* the current array size */
        unsigned long size;
        devfs_handle_t handle;    /* only used if devfs is there */
        unsigned int access_key;  /* used by sculluid and scullpriv */
        struct semaphore sem;     /* mutual exclusion semaphore */
    } Scull_Dev;

The next code fragment shows in practice how Scull_Dev is used to hold data. The function scull_trim is in charge of freeing the whole data area and is invoked by scull_open when the file is opened for writing. It simply walks through the list and frees any quantum and quantum set it finds.

    int scull_trim(Scull_Dev *dev)
    {
        Scull_Dev *next, *dptr;
        int qset = dev->qset;   /* "dev" is not null */
        int i;

        for (dptr = dev; dptr; dptr = next) { /* all the list items */
            if (dptr->data) {
                for (i = 0; i < qset; i++)
                    if (dptr->data[i])
                        kfree(dptr->data[i]);
                kfree(dptr->data);
                dptr->data=NULL;
            }
            next=dptr->next;
            if (dptr != dev) kfree(dptr); /* all of them but the first */
        }
        dev->size = 0;
        dev->quantum = scull_quantum;
        dev->qset = scull_qset;
        dev->next = NULL;
        return 0;
    }

A Brief Introduction to Race Conditions
Now that you understand how scull's memory management works, here is a scenario to consider. Two processes, A and B, both have the same scull device open for writing. Both attempt simultaneously to append data to the device. A new quantum is required for this operation to succeed, so each process allocates the required memory and stores a pointer to it in the quantum set.
A semaphore is a general mechanism for controlling access to resources. In its simplest form, a semaphore may be used for mutual exclusion; processes using semaphores in the mutual exclusion mode are prevented from simultaneously running the same code or accessing the same data. This sort of semaphore is often called a mutex, from "mutual exclusion."
Semaphores in Linux are defined in <asm/semaphore.h>. They have a type of struct semaphore, and a driver should only act on them using the provided interface. In scull, one semaphore is allocated for each device, in the Scull_Dev structure. Since the devices are entirely independent of each other, there is no need to enforce mutual exclusion across multiple devices.
Semaphores must be initialized prior to use by passing a numeric argument to sema_init. For mutual exclusion applications (i.e., keeping multiple threads from accessing the same data simultaneously), the semaphore should be initialized to a value of 1, which means that the semaphore is available. The following code in scull's module initialization function (scull_init) shows how the semaphores are initialized as part of setting up the devices.
    for (i=0; i < scull_nr_devs; i++) {
        scull_devices[i].quantum = scull_quantum;
        scull_devices[i].qset = scull_qset;
        sema_init(&scull_devices[i].sem, 1);
    }

A process wishing to enter a section of code protected by a semaphore must first ensure that no other process is already there. Whereas in classical computer science the function to obtain a semaphore is often called P, in Linux you'll need to call down or down_interruptible. These functions test the value of the semaphore to see if it is greater than 0; if so, they decrement the semaphore and return. If the semaphore is 0, the functions will sleep and try again after some other process, which has presumably freed the semaphore, wakes them up.
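The counting rule just described can be modeled in a few lines of user-space C. This is deliberately not a working semaphore: it is a single-threaded sketch of the counter only, with hypothetical sketch_ names, and where the real down would put the process to sleep the model simply reports failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, single-threaded model of a semaphore's counter. */
struct sketch_sem {
    int count;
};

void sketch_sema_init(struct sketch_sem *sem, int val)
{
    assert(sem != NULL && val >= 0);
    sem->count = val;       /* 1 means "available" for mutual exclusion */
}

/*
 * Non-sleeping variant of "down": take the semaphore if it is available.
 * Returns 0 on success, nonzero where the real call would have slept.
 */
int sketch_down_trylock(struct sketch_sem *sem)
{
    assert(sem != NULL);
    if (sem->count > 0) {
        sem->count--;       /* greater than 0: decrement and proceed */
        return 0;
    }
    return 1;               /* already 0: someone else holds it */
}

/* "up": release the semaphore so that another taker can proceed. */
void sketch_up(struct sketch_sem *sem)
{
    assert(sem != NULL);
    sem->count++;
}
```

Initialized to 1, the model admits exactly one holder at a time: the first attempt succeeds, a second attempt fails until the holder calls sketch_up, and then the semaphore can be taken again.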
The down_interruptible function can be interrupted by a signal, whereas down will not allow signals to be delivered to the process. You almost always want to allow signals; otherwise, you risk creating unkillable processes and other undesirable behavior. A complication of allowing signals, however, is that you always have to check if the function (here down_interruptible) was interrupted. As usual, the function returns 0 for success and nonzero in case of failure. If the process is interrupted, it will not have acquired the semaphores; thus, you won't need to call up. A typical call to invoke a semaphore therefore normally looks something like this:
    if (down_interruptible (&sem))
        return -ERESTARTSYS;

A process that obtains a semaphore must always release it afterward. Whereas computer science calls the release function V, Linux uses up instead. A simple call like

    up (&sem);

increments the value of the semaphore and wakes up any processes waiting for it.

read and write
    ssize_t read(struct file *filp, char *buff,
                 size_t count, loff_t *offp);
    ssize_t write(struct file *filp, const char *buff,
                  size_t count, loff_t *offp);

Cross-space copies are performed in Linux by special functions, defined in <asm/uaccess.h>. Such a copy is either performed by a generic (memcpy-like) function or by functions optimized for a specific data size (char, short, int, long); most of them are introduced in "Using the ioctl Argument" in Chapter 5, "Enhanced Char Driver Operations".
The code for read and write in scull needs to copy a whole segment of data to or from the user address space. This capability is offered by the following kernel functions, which copy an arbitrary array of bytes and sit at the heart of every read and write implementation:
    unsigned long copy_to_user(void *to, const void *from,
                               unsigned long count);
    unsigned long copy_from_user(void *to, const void *from,
                                 unsigned long count);

The topic of user-space access and invalid user space pointers is somewhat advanced, and is discussed in "Using the ioctl Argument" in Chapter 5, "Enhanced Char Driver Operations". However, it's worth suggesting that if you don't need to check the user-space pointer you can invoke __copy_to_user and __copy_from_user instead. This is useful, for example, if you know you already checked the argument.
Whatever the amount of data the methods transfer, they should in general update the file position at *offp to represent the current file position after successful completion of the system call. Most of the time the offp argument is just a pointer to filp->f_pos, but a different pointer is used in order to support the pread and pwrite system calls, which perform the equivalent of lseek and read or write in a single, atomic operation.
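The bookkeeping a read method performs (clamp the request to the data available, copy, advance the position, return the count) can be sketched in user space. The fragment below is an illustration under stated assumptions: struct sketch_dev and sketch_read are hypothetical, the device is a flat in-memory buffer, and memcpy stands in for copy_to_user because no user/kernel boundary exists in this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/* Hypothetical in-memory "device" used only for this sketch. */
struct sketch_dev {
    const char *data;
    size_t size;
};

/*
 * Read up to count bytes starting at *offp, the way a read method
 * would, and leave *offp pointing just past the bytes transferred.
 */
ssize_t sketch_read(const struct sketch_dev *dev, char *buf,
                    size_t count, long long *offp)
{
    size_t pos;

    assert(dev != NULL && buf != NULL && offp != NULL && *offp >= 0);
    pos = (size_t)*offp;
    if (pos >= dev->size)
        return 0;                      /* at or past the end of the data */
    if (count > dev->size - pos)
        count = dev->size - pos;       /* partial read: less than asked */
    memcpy(buf, dev->data + pos, count);
    *offp += (long long)count;         /* update through the pointer */
    return (ssize_t)count;
}
```

Note that the function never touches a position stored in the file object itself; it works only through the pointer it is given, which is what allows the same method to serve pread-style calls that supply their own offset.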
Figure 3-2 represents how a typical read implementation uses its arguments.
![]()
Figure 3-2. The arguments to read
Although kernel functions return a negative number to signal an error, and the value of the number indicates the kind of error that occurred (as introduced in Chapter 2, "Building and Running Modules" in "Error Handling in init_module"), programs that run in user space always see -1 as the error return value. They need to access the errno variable to find out what happened. The difference in behavior is dictated by the POSIX calling standard for system calls and the advantage of not dealing with errno in the kernel.
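The user-space side of this convention is easy to observe. The sketch below calls read on a file descriptor that cannot be open; the kernel reports a "bad file descriptor" error, and the program sees a return value of -1 together with EBADF in errno. The function name sketch_read_bad_fd is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Call read(2) on a descriptor that is not open and report the pair
 * (return value, errno) that a user program observes.
 */
int sketch_read_bad_fd(int *saved_errno)
{
    char byte;
    ssize_t ret;

    assert(saved_errno != NULL);
    errno = 0;
    ret = read(-1, &byte, 1);   /* -1 is never a valid descriptor */
    *saved_errno = errno;       /* save it before any other call */
    return (int)ret;
}
```

The negative error code itself never reaches the program; only -1 and the errno value do.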
The read Method
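The return value for read is interpreted by the calling application program as follows:
- If the value equals the count argument passed to the read system call, the requested number of bytes has been transferred.
- If the value is positive but smaller than count, only part of the data has been transferred; the program will typically retry the read for the remainder.
- If the value is 0, end-of-file was reached.
- A negative value means there was an error; the value specifies which error occurred, according to <linux/errno.h>.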
    ssize_t scull_read(struct file *filp, char *buf, size_t count,
                    loff_t *f_pos)
    {
        Scull_Dev *dev = filp->private_data; /* the first list item */
        Scull_Dev *dptr;
        int quantum = dev->quantum;
        int qset = dev->qset;
        int itemsize = quantum * qset; /* how many bytes in the list item */
        int item, s_pos, q_pos, rest;
        ssize_t ret = 0;

        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
        if (*f_pos >= dev->size)
            goto out;
        if (*f_pos + count > dev->size)
            count = dev->size - *f_pos;

        /* find list item, qset index, and offset in the quantum */
        item = (long)*f_pos / itemsize;
        rest = (long)*f_pos % itemsize;
        s_pos = rest / quantum; q_pos = rest % quantum;

        /* follow the list up to the right position (defined elsewhere) */
        dptr = scull_follow(dev, item);

        if (!dptr->data)
            goto out; /* don't fill holes */
        if (!dptr->data[s_pos])
            goto out;

        /* read only up to the end of this quantum */
        if (count > quantum - q_pos)
            count = quantum - q_pos;

        if (copy_to_user(buf, dptr->data[s_pos]+q_pos, count)) {
            ret = -EFAULT;
            goto out;
        }
        *f_pos += count;
        ret = count;

      out:
        up(&dev->sem);
        return ret;
    }

The write Method
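write, like read, can transfer less data than was requested, according to the following rules for the return value:
- If the value equals count, the requested number of bytes has been transferred.
- If the value is positive but smaller than count, only part of the data has been transferred; the program will most likely retry writing the rest.
- If the value is 0, nothing was written. This result is not an error, and the caller may simply retry.
- A negative value means an error occurred; as for read, valid error values are those defined in <linux/errno.h>.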
The scull code for write deals with a single quantum at a time, like the read method does:
    ssize_t scull_write(struct file *filp, const char *buf, size_t count,
                    loff_t *f_pos)
    {
        Scull_Dev *dev = filp->private_data;
        Scull_Dev *dptr;
        int quantum = dev->quantum;
        int qset = dev->qset;
        int itemsize = quantum * qset;
        int item, s_pos, q_pos, rest;
        ssize_t ret = -ENOMEM; /* value used in "goto out" statements */

        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;

        /* find list item, qset index and offset in the quantum */
        item = (long)*f_pos / itemsize;
        rest = (long)*f_pos % itemsize;
        s_pos = rest / quantum; q_pos = rest % quantum;

        /* follow the list up to the right position */
        dptr = scull_follow(dev, item);
        if (!dptr->data) {
            dptr->data = kmalloc(qset * sizeof(char *), GFP_KERNEL);
            if (!dptr->data)
                goto out;
            memset(dptr->data, 0, qset * sizeof(char *));
        }
        if (!dptr->data[s_pos]) {
            dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
            if (!dptr->data[s_pos])
                goto out;
        }

        /* write only up to the end of this quantum */
        if (count > quantum - q_pos)
            count = quantum - q_pos;

        if (copy_from_user(dptr->data[s_pos]+q_pos, buf, count)) {
            ret = -EFAULT;
            goto out;
        }
        *f_pos += count;
        ret = count;

        /* update the size */
        if (dev->size < *f_pos)
            dev->size = *f_pos;

      out:
        up(&dev->sem);
        return ret;
    }

readv and writev
Unix systems have long supported two alternative system calls named readv and writev. These "vector" versions take an array of structures, each of which contains a pointer to a buffer and a length value. A readv call would then be expected to read the indicated amount into each buffer in turn. writev, instead, would gather together the contents of each buffer and put them out as a single write operation.
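From the application side, the gather and scatter behavior just described can be seen with a short experiment. This is only a sketch: the function name vector_round_trip and the strings it sends are our own, and it uses an ordinary pipe rather than a scull device.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical demo: gather two buffers into one writev() call on a pipe,
 * then scatter the bytes back into two caller-supplied buffers with readv().
 * Returns the number of bytes read back, or -1 on error.
 */
ssize_t vector_round_trip(char *first, size_t first_len,
                          char *second, size_t second_len)
{
    char head[] = "scull";
    char tail[] = "pipe";
    struct iovec out[2] = {
        { .iov_base = head, .iov_len = strlen(head) },
        { .iov_base = tail, .iov_len = strlen(tail) },
    };
    struct iovec in[2] = {
        { .iov_base = first,  .iov_len = first_len },
        { .iov_base = second, .iov_len = second_len },
    };
    ssize_t expected = (ssize_t)(strlen(head) + strlen(tail));
    ssize_t n = -1;
    int fds[2];

    assert(first != NULL && second != NULL);
    if (pipe(fds) < 0)
        return -1;
    /* gather: both buffers leave in a single call */
    if (writev(fds[1], out, 2) == expected)
        n = readv(fds[0], in, 2);   /* scatter: fill first, then second */
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

With a 3-byte first buffer and a larger second one, the nine bytes written should come back as "scu" in the first buffer and "llpipe" at the start of the second, showing that readv fills each buffer in turn.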
The prototypes for the vector operations are as follows:
    ssize_t (*readv) (struct file *filp, const struct iovec *iov,
                      unsigned long count, loff_t *ppos);
    ssize_t (*writev) (struct file *filp, const struct iovec *iov,
                      unsigned long count, loff_t *ppos);

Here, the filp and ppos arguments are the same as for read and write. The iovec structure, defined in <linux/uio.h>, looks like this:
    struct iovec
    {
        void *iov_base;
        __kernel_size_t iov_len;
    };

Many drivers, though, will gain no benefit from implementing these methods themselves. Thus, scull omits them. The kernel will emulate them with read and write, and the end result is the same.
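The idea behind such an emulation can be pictured with a user-space analogy. The sketch below is not the kernel's fallback code, which is not shown in this chapter; writev_by_loop is a hypothetical name, and the function simply issues one plain write per iovec segment.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * User-space analogy only, not kernel code: walk the iovec array and issue
 * one plain write() per segment, stopping at the first short or failed write.
 * Returns the total number of bytes written, or -1 if nothing was written
 * before an error occurred.
 */
ssize_t writev_by_loop(int fd, const struct iovec *iov, int count)
{
    ssize_t total = 0;
    int i;

    assert(iov != NULL || count == 0);
    for (i = 0; i < count; i++) {
        ssize_t n = write(fd, iov[i].iov_base, iov[i].iov_len);

        if (n < 0)
            return total ? total : -1;  /* report progress made so far, if any */
        total += n;
        if ((size_t)n < iov[i].iov_len)
            break;                      /* short write: stop here */
    }
    return total;
}
```

One caveat with this analogy is that each segment goes out in its own call, so other writers to the same descriptor could in principle interleave their data between segments.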
Playing with the New Devices
Once you are equipped with the four methods just described, the driver can be compiled and tested; it retains any data you write to it until you overwrite it with new data. The device acts like a data buffer whose length is limited only by the amount of real RAM available. You can try using cp, dd, and input/output redirection to test the driver.
The free command can be used to see how the amount of free memory shrinks and expands according to how much data is written into scull.
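Because scull_read and scull_write handle at most one quantum per call, a hand-written test program should be prepared for short transfers. The helpers below are a minimal sketch of the usual retry loops; the names write_all and read_all are our own and are not part of the scull sources.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all len bytes, retrying after partial writes; 0 on success, -1 on error. */
int write_all(int fd, const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: just retry */
            return -1;
        }
        if (n == 0)
            return -1;          /* no progress: give up rather than spin */
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read until len bytes have arrived or end-of-file; returns bytes read or -1. */
ssize_t read_all(int fd, char *buf, size_t len)
{
    size_t got = 0;

    assert(buf != NULL || len == 0);
    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;              /* end-of-file */
        got += (size_t)n;
    }
    return (ssize_t)got;
}
```

These helpers could be pointed at a descriptor opened on one of the scull nodes (for instance /dev/scull0, assuming the device nodes have been created); they behave the same way on pipes and regular files.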
The Device Filesystem
As suggested at the beginning of the chapter, recent versions of the Linux kernel offer a special filesystem for device entry points. The filesystem has been available for a while as an unofficial patch; it was made part of the official source tree in 2.3.46. A backport to 2.2 is available as well, although not included in the official 2.2 kernels.
Although use of the special filesystem is not widespread as we write this, the new features offer a few advantages to the device driver writer. Therefore, our version of scull exploits devfs if it is being used in the target system. The module uses kernel configuration information at compile time to know whether particular features have been enabled, and in this case we depend on CONFIG_DEVFS_FS being defined or not.
The main advantages of devfs are as follows:
Device entry points in /dev are created at device initialization and removed at device removal.
There is no need to allocate a major number for the device driver and deal with minor numbers.
To handle device creation and removal, the driver should call the following functions:
    #include <linux/devfs_fs_kernel.h>

    devfs_handle_t devfs_mk_dir (devfs_handle_t dir,
                const char *name, void *info);

    devfs_handle_t devfs_register (devfs_handle_t dir,
                const char *name, unsigned int flags,
                unsigned int major, unsigned int minor,
                umode_t mode, void *ops, void *info);

    void devfs_unregister (devfs_handle_t de);

The various arguments to the register/unregister functions are as follows:
- dir
- name
- flags
- major
- minor
- mode
- ops
- info
- de
A "devfs entry" obtained by a previous call to devfs_register.
The flags are used to select specific features to be enabled for the special file being created. Although the flags are briefly and clearly documented in <linux/devfs_fs_kernel.h>, it's worth introducing some of them.
- DEVFS_FL_NONE
- DEVFS_FL_DEFAULT
The former symbol is simply 0, and is suggested for code readability. The latter macro is currently defined to DEVFS_FL_NONE, but is a good choice to be forward compatible with future implementations of the filesystem.
- DEVFS_FL_AUTO_OWNER
- DEVFS_FL_SHOW_UNREG
- DEVFS_FL_HIDE
The former flag requests not to remove the device file from /dev when it is unregistered. The latter requests never to show it in /dev. The flags are not usually needed for normal devices.
- DEVFS_FL_AUTO_DEVNUM
- DEVFS_FL_NO_PERSISTENCE
It is possible to query the flags associated with a device or to change them at runtime. The following two functions perform the tasks:
    int devfs_get_flags (devfs_handle_t de, unsigned int *flags);
    int devfs_set_flags (devfs_handle_t de, unsigned int flags);

Using devfs in Practice
    /* If we have devfs, create /dev/scull to put files in there */
    scull_devfs_dir = devfs_mk_dir(NULL, "scull", NULL);
    if (!scull_devfs_dir) return -EBUSY; /* problem */

    for (i=0; i < scull_nr_devs; i++) {
        sprintf(devname, "%i", i);
        devfs_register(scull_devfs_dir, devname,
                       DEVFS_FL_AUTO_DEVNUM,
                       0, 0, S_IFCHR | S_IRUGO | S_IWUGO,
                       &scull_fops,
                       scull_devices+i);
    }

    if (scull_devices) {
        for (i=0; i<scull_nr_devs; i++) {
            scull_trim(scull_devices+i);
            /* the following line is only used for devfs */
            devfs_unregister(scull_devices[i].handle);
        }
        kfree(scull_devices);
    }

    /* once again, only for devfs */
    devfs_unregister(scull_devfs_dir);

The only extra task that needs to be performed in order to support both environments is dealing with initialization of filp->f_op and filp->private_data in the open device method. The former pointer is simply not modified, since the right file operations have been specified in devfs_register. The latter needs to be initialized by the open method only if it is NULL, since it will be NULL only if devfs is not being used.
    /*
     * If private data is not valid, we are not using devfs
     * so use the type (from minor nr.) to select a new f_op
     */
    if (!filp->private_data && type) {
        if (type > SCULL_MAX_TYPE) return -ENODEV;
        filp->f_op = scull_fop_array[type];
        return filp->f_op->open(inode, filp); /* dispatch to specific open */
    }

    /* type 0, check the device number (unless private_data valid) */
    dev = (Scull_Dev *)filp->private_data;
    if (!dev) {
        if (num >= scull_nr_devs) return -ENODEV;
        dev = &scull_devices[num];
        filp->private_data = dev; /* for other methods */
    }

    crw-rw-rw-   1 root   root   144,   1 Jan  1  1970 0
    crw-rw-rw-   1 root   root   144,   2 Jan  1  1970 1
    crw-rw-rw-   1 root   root   144,   3 Jan  1  1970 2
    crw-rw-rw-   1 root   root   144,   4 Jan  1  1970 3
    crw-rw-rw-   1 root   root   144,   5 Jan  1  1970 pipe0
    crw-rw-rw-   1 root   root   144,   6 Jan  1  1970 pipe1
    crw-rw-rw-   1 root   root   144,   7 Jan  1  1970 pipe2
    crw-rw-rw-   1 root   root   144,   8 Jan  1  1970 pipe3
    crw-rw-rw-   1 root   root   144,  12 Jan  1  1970 priv
    crw-rw-rw-   1 root   root   144,   9 Jan  1  1970 single
    crw-rw-rw-   1 root   root   144,  10 Jan  1  1970 user
    crw-rw-rw-   1 root   root   144,  11 Jan  1  1970 wuser

Portability Issues and devfs
The source files of scull are somewhat complicated by the need to be able to compile and run well with Linux versions 2.0, 2.2, and 2.4. This portability requirement brings in several instances of conditional compilation based on CONFIG_DEVFS_FS.
    #include <linux/devfs_fs_kernel.h>

    int init_module()
    {
        int result;

        /* request a major: does nothing if devfs is used */
        result = devfs_register_chrdev(major, "name", &fops);
        if (result < 0) return result;

        /* register using devfs: does nothing if not in use */
        devfs_register(NULL, "name", /* .... */ );
        return 0;
    }

    #ifdef CONFIG_DEVFS_FS /* only if enabled, to avoid errors in 2.0 */
    #include <linux/devfs_fs_kernel.h>
    #else
      typedef void * devfs_handle_t;  /* avoid #ifdef inside the structure */
    #endif

Nothing is defined in sysdep.h because it is very hard to implement this kind of hack generically enough to be of general use. Each driver should arrange for its own needs to avoid excessive #ifdef statements in function code. Also, we chose not to support devfs in the sample code for this book, with the exception of scull, in order to keep the code simple. We hope this discussion is enough to help readers exploit devfs if they want to.
Backward Compatibility
This chapter, so far, has described the kernel programming interface for version 2.4 of the Linux kernel. Unfortunately, this interface has changed significantly over the course of kernel development. These changes represent improvements in how things are done, but, once again, they also pose a challenge for those who wish to write drivers that are compatible across multiple versions of the kernel.
Insofar as this chapter is concerned, there are few noticeable differences between versions 2.4 and 2.2. Version 2.2, however, changed many of the prototypes of the file_operations methods from what 2.0 had; access to user space was greatly modified (and simplified) as well. The semaphore mechanism was not as well developed in Linux 2.0. And, finally, the 2.1 development series introduced the directory entry (dentry) cache.
Changes in the File Operations Structure
A number of factors drove the changes in the file_operations methods. The longstanding 2 GB file-size limit caused problems even in the Linux 2.0 days. As a result, the 2.1 development series started using the loff_t type, a 64-bit value, to represent file positions and lengths. Large file support was not completely integrated until version 2.4 of the kernel, but much of the groundwork was done earlier and had to be accommodated by driver writers.
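The arithmetic behind the 2 GB figure is easy to check. The sketch below assumes that the old limit came from holding file positions in a signed 32-bit integer; the function names are hypothetical, and the fixed-width types from <stdint.h> stand in for the kernel's own types.

```c
#include <assert.h>
#include <stdint.h>

/* Largest byte offset representable in a signed 32-bit file position. */
int64_t max_offset_32(void)
{
    return INT32_MAX;   /* 2^31 - 1, i.e., one byte short of 2 GB */
}

/* Nonzero if `offset` can be stored in a signed 32-bit integer without loss. */
int fits_in_32_bits(int64_t offset)
{
    return offset >= INT32_MIN && offset <= INT32_MAX;
}
```

An offset of exactly 2 GB (2147483648 bytes) is the first value that no longer fits, whereas a 64-bit type such as loff_t represents it without trouble.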
Another change introduced during 2.1 development was the addition of the f_pos pointer argument to the read and write methods. This change was made to support the POSIX pread and pwrite system calls, which explicitly set the file offset where data is to be read or written. Without these system calls, threaded programs can run into race conditions when moving around in files.
- int (*lseek) (struct inode *, struct file *, off_t, int);
Note that this method is called lseek in Linux 2.0, instead of llseek. The name change was made to recognize that seeks could now happen with 64-bit offset values.
- int (*read) (struct inode *, struct file *, char *, int);
- int (*write) (struct inode *, struct file *, const char *, int);
As mentioned, these functions in Linux 2.0 had the inode pointer as an argument, and lacked the position argument.
- void (*release) (struct inode *, struct file *);
In the 2.0 kernel, the release method could not fail, and thus returned void.
    /*
     * The following wrappers are meant to make things work with 2.0 kernels
     */
    #ifdef LINUX_20
    int scull_lseek_20(struct inode *ino, struct file *f,
                       off_t offset, int whence)
    {
        return (int)scull_llseek(f, offset, whence);
    }

    int scull_read_20(struct inode *ino, struct file *f, char *buf,
                      int count)
    {
        return (int)scull_read(f, buf, count, &f->f_pos);
    }

    int scull_write_20(struct inode *ino, struct file *f, const char *b,
                       int c)
    {
        return (int)scull_write(f, b, c, &f->f_pos);
    }

    void scull_release_20(struct inode *ino, struct file *f)
    {
        scull_release(ino, f);
    }

    /* Redefine "real" names to the 2.0 ones */
    #define scull_llseek scull_lseek_20
    #define scull_read scull_read_20
    #define scull_write scull_write_20
    #define scull_release scull_release_20
    #define llseek lseek
    #endif  /* LINUX_20 */

Two other incompatibilities are related to the file_operations structure. One is that the flush method was added during the 2.1 development cycle. Driver writers almost never need to worry about this method, but its presence in the middle of the structure can still create problems. The best way to avoid dealing with the flush method is to use the tagged initialization syntax, as we did in all the sample source files.
The other difference is in the way an inode pointer is retrieved from a filp pointer. Whereas modern kernels use a dentry (directory entry) data structure, version 2.0 had no such structure. Therefore, sysdep.h defines a macro that should be used to portably access an inode from a filp:
    #ifdef LINUX_20
    #  define INODE_FROM_F(filp) ((filp)->f_inode)
    #else
    #  define INODE_FROM_F(filp) ((filp)->f_dentry->d_inode)
    #endif

The Module Usage Count
In 2.2 and earlier kernels, the Linux kernel did not offer any assistance to modules in maintaining the usage count. Modules had to do that work themselves. This approach was error prone and required the duplication of a lot of work. It also encouraged race conditions. The new method is thus a definite improvement.
Code that is written to be portable, however, must be prepared to deal with the older way of doing things. That means that the usage count must still be incremented when a new reference is made to the module, and decremented when that reference goes away. Portable code must also work around the fact that the owner field did not exist in the file_operations structure in earlier kernels. The easiest way to handle that is to use SET_MODULE_OWNER, rather than working with the owner field directly. In sysdep.h, we provide a null SET_MODULE_OWNER for kernels that do not have this facility.
Changes in Semaphore Support
Semaphore support was less developed in the 2.0 kernel; support for SMP systems in general was primitive at that time. Drivers written for only that kernel version may not need to use semaphores at all, since only one CPU was allowed to be running kernel code at that time. Nonetheless, there may still be a need for semaphores, and it does not hurt to have the full protection needed by later kernel versions.
Most of the semaphore functions covered in this chapter existed in the 2.0 kernel. The one exception is sema_init; in version 2.0, programmers had to initialize semaphores manually. The sysdep.h header file handles this problem by defining a version of sema_init when compiled under the 2.0 kernel:
    #ifdef LINUX_20
    #  ifdef MUTEX_LOCKED /* Only if semaphore.h included */
    extern inline void sema_init (struct semaphore *sem, int val)
    {
        sem->count = val;
        sem->waking = sem->lock = 0;
        sem->wait = NULL;
    }
    #  endif
    #endif /* LINUX_20 */

Changes in Access to User Space
Finally, access to user space changed completely at the beginning of the 2.1 development series. The new interface has a better design and makes much better use of the hardware in ensuring safe access to user-space memory. But, of course, the interface is different. The 2.0 memory-access functions were as follows:
    void memcpy_fromfs(void *to, const void *from, unsigned long count);
    void memcpy_tofs(void *to, const void *from, unsigned long count);

The names of these functions come from the historical use of the FS segment register on the i386. Note that there is no return value from these functions; if the user supplies an invalid address, the data copy will silently fail. sysdep.h hides the renaming and allows you to portably call copy_to_user and copy_from_user.
Quick Reference
This chapter introduced the following symbols and header files. The list of the fields in struct file_operations and struct file is not repeated here.
- #include <linux/fs.h>
The "file system" header is the header required for writing device drivers. All the important functions are declared in here.
- int register_chrdev(unsigned int major, const char *name, struct file_operations *fops);
- int unregister_chrdev(unsigned int major, const char *name);
- kdev_t inode->i_rdev;
The device "number" for the current device is accessible from the inode structure.
- int MAJOR(kdev_t dev);
- int MINOR(kdev_t dev);
These macros extract the major and minor numbers from a device item.
- kdev_t MKDEV(int major, int minor);
This macro builds a kdev_t data item from the major and minor numbers.
- SET_MODULE_OWNER(struct file_operations *fops)
This macro sets the owner field in the given file_operations structure.
- #include <asm/semaphore.h>
- void sema_init (struct semaphore *sem, int val);
- int down_interruptible (struct semaphore *sem);
- void up (struct semaphore *sem);
Obtains a semaphore (sleeping, if necessary) and releases it, respectively.
- #include <asm/segment.h>
- #include <asm/uaccess.h>
segment.h defines functions related to cross-space copying in all kernels up to and including 2.0. The name was changed to uaccess.h in the 2.1 development series.
- unsigned long __copy_from_user (void *to, const void *from, unsigned long count);
- unsigned long __copy_to_user (void *to, const void *from, unsigned long count);
- void memcpy_fromfs(void *to, const void *from, unsigned long count);
- void memcpy_tofs(void *to, const void *from, unsigned long count);
These functions were used to copy an array of bytes from user space to kernel space and vice versa in version 2.0 of the kernel.
- #include <linux/devfs_fs_kernel.h>
- devfs_handle_t devfs_mk_dir (devfs_handle_t dir, const char *name, void *info);
- devfs_handle_t devfs_register (devfs_handle_t dir, const char *name, unsigned int flags, unsigned int major, unsigned int minor, umode_t mode, void *ops, void *info);
- void devfs_unregister (devfs_handle_t de);
These are the basic functions for registering devices with the device filesystem (devfs).
© 2001, O'Reilly & Associates, Inc.