Maximum RPM: Taking the Red Hat Package Manager to the Limit
Appendix A. Format of the RPM File
RPM File Format
While the following details concerning the actual format of an RPM package file were accurate at the time this was written, three points should be kept in mind:
The file format is subject to change.
If a package file is to be manipulated somehow, you are strongly urged to use the appropriate rpmlib routines to access the package file. Why? See the first point!
This appendix describes the most recent version of the RPM file format: version 3. The file(1) utility can be used to see a package's file format version.
With those caveats out of the way, let's take a look inside an RPM file…
Parts of an RPM File
Every RPM package file can be divided into four distinct sections. They are:
The lead.
The signature.
The header.
The archive.
Package files are written to disk in network byte order. If required, RPM will automatically convert to host byte order when the package file is read. Let's take a look at each section, starting with the lead.
The Lead
The lead is the first part of an RPM package file. In previous versions of RPM, it was used to store information used internally by RPM. Today, however, the lead's sole purpose is to make it easy to identify an RPM package file. For example, the file(1) command uses the lead. [1] All the information contained in the lead has been duplicated or superseded by information contained in the header. [2]
[Hex dump: the first bytes of the package file rpm-2.2.1-1.i386.rpm]
The first four bytes (edab eedb) are the magic values that identify the file as an RPM package file. Both the file command and RPM use these magic numbers to determine whether a file is legitimate or not.
The next two bytes (0300) indicate the RPM file format version. In this case, the file's major version number is 3, and the minor version number is 0. Versions of RPM later than 2.1 create version 3.0 package files.
The next two bytes (0000) determine what type of RPM file the file is. There are presently two types defined:
Binary package file (type = 0000)
Source package file (type = 0001)
In this case, the file is a binary package file.
The next two bytes (0001) are used to store the architecture that the package was built for. In this case, the number 1 refers to the i386 architecture. [3] In the case of a source package file, these two bytes should be ignored, as source packages are not built for a specific architecture.
The next sixty-six bytes (starting with 7270 6d2d) contain the name of the package. The name must end with a null byte, which leaves sixty-five bytes for RPM's usual name-version-release-style name. In this case, we can read the name from the right side of the output.
Since the name rpm-2.2.1-1 is shorter than the sixty-five bytes allocated for the name, the leftover bytes are filled with nulls.
The two bytes following the name (0001) represent the operating system for which this package was built. In this case, 1 equals Linux. As with the architecture-to-number translations, the operating system and corresponding code numbers can be found in the file /usr/lib/rpmrc.
The next two bytes (0005) indicate the type of signature used in the file. A type 5 signature is new to version 3 RPM files. The signature appears next in the file, but we need to discuss an additional detail before exploring the signature.
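The field-by-field walk through the lead can be sketched in Python with the struct module. This is only an illustration, not RPM's own code: the dictionary keys and function names are made up for the example, and the sixteen trailing reserved bytes are an assumption, chosen so that the lead totals 96 bytes (hex 60), which agrees with the signature's first index entry sitting at hex offset 70 after a sixteen-byte header structure header.

```python
import struct

# Big-endian (network byte order) layout of the lead as described in the text:
# 4-byte magic, major, minor, type, architecture, 66-byte name, OS, signature type.
# ASSUMPTION: 16 trailing reserved bytes, so that the lead is 96 (hex 60) bytes long.
LEAD_FORMAT = ">4sBBhh66shh16s"
LEAD_MAGIC = bytes.fromhex("edabeedb")


def parse_lead(data: bytes) -> dict:
    """Decode the lead from the first bytes of an RPM package file."""
    size = struct.calcsize(LEAD_FORMAT)
    if len(data) < size:
        raise ValueError("need at least %d bytes" % size)
    (magic, major, minor, pkg_type, archnum,
     name, osnum, sigtype, _reserved) = struct.unpack(LEAD_FORMAT, data[:size])
    if magic != LEAD_MAGIC:
        raise ValueError("not an RPM package file")
    return {
        "version": (major, minor),
        # Only types 0 (binary) and 1 (source) are defined in the text.
        "type": "source" if pkg_type == 1 else "binary",
        "archnum": archnum,
        # The name is null-terminated and padded with nulls to 66 bytes.
        "name": name.split(b"\0", 1)[0].decode("ascii"),
        "osnum": osnum,
        "signature_type": sigtype,
    }


def example_lead() -> bytes:
    """Rebuild the lead values discussed in the text (reserved bytes zeroed)."""
    # struct pads the short name and the empty reserved field with nulls.
    return struct.pack(LEAD_FORMAT, LEAD_MAGIC, 3, 0, 0, 1,
                       b"rpm-2.2.1-1", 1, 5, b"")
```

Calling parse_lead(example_lead()) gives back version (3, 0), a binary package, architecture 1, the name rpm-2.2.1-1, operating system 1, and signature type 5, matching the values read off the dump by hand.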
Wanted: A New RPM Data Structure
By looking at the C structure that defines the lead, and matching it with the bytes in an actual package file, it's trivial to extract the data from the lead. From a programming standpoint, it's also easy to manipulate data in the lead: it's simply a matter of using the element names from the structure. But there's a problem, and because of that problem the lead is no longer used internally by RPM.
The lead: An Abandoned Data Structure
What's the problem, and why is the lead no longer used by RPM? The answer to these questions is a single word: inflexibility. The technique of defining a C structure to access data in a file just isn't very flexible. Let's look at an example.
Flip back to the lead's C structure in the section called The Lead. Say, for example, that some software comes along, and it has a long name. A very long name. A name so long, in fact, that the 66 bytes defined in the structure element name just couldn't hold it.
What can we do? Well, we could certainly change the structure such that the name element would be 100 bytes long. But once a new version of RPM is created using this new structure, we have two problems:
The new version of RPM wouldn't be able to read package files written in the older format.
Any older version of RPM would be unable to install packages created with the newer version of RPM.
Not a very good situation! Ideally, we would like to somehow eliminate the requirement that the format of the data written to a package file be engraved in granite. We should be able to do the following things, all without losing compatibility with existing versions of RPM:
Add extra data to the file format.
Change the size of existing data.
Reorder the data.
Sounds like a big problem, but there's a solution…
Is There a Solution?
The solution is to standardize the method by which information is retrieved from a file. This is done by creating a well-defined data structure that contains easily searched information about the data, and then physically separating that information from the data.
When the data is required, it is found by using the easily searched information, which points to the data itself. The benefits are that the data can be placed anywhere in the file and that the format of the data itself can change.
The Solution: The Header Structure
The header structure is RPM's solution to the problem of easily manipulating information in a standard way. The header structure's sole purpose in life is to contain zero or more pieces of data. A file can have more than one header structure in it. In fact, an RPM package file has two — the signature, and the header. It was from this header that the header structure got its name.
There are three sections to each header structure. The first section is known as the header structure header. The header structure header is used to identify the start of a header structure, its size, and the number of data items it contains.
Following the header structure header is an area called the index. The index contains one or more index entries. Each index entry contains information about, and a pointer to, a specific data item.
After the index comes the store. It is in the store that the data items are kept. The data in the store is packed together as closely as possible. The order in which the data is stored is immaterial — a far cry from the C structure used in the lead.
The Header Structure in Depth
Let's take a more in-depth look at the actual format of a header structure, starting with the header structure header:
The Header Structure Header
The header structure header always starts with a three-byte magic number: 8e ad e8. Following this is a one-byte version number. Next are four bytes that are reserved for future expansion. After the reserved bytes, there is a four-byte number that indicates how many index entries exist in this header structure, followed by another four-byte number indicating how many bytes of data are part of the header structure.
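As a rough sketch, the sixteen-byte header structure header can be unpacked in Python as follows. The function and constant names are invented for this example and are not taken from rpmlib.

```python
import struct

HEADER_MAGIC = bytes.fromhex("8eade8")

# Big-endian: 3-byte magic, 1-byte version, 4 reserved bytes,
# 4-byte index entry count, 4-byte store size.
HSH_FORMAT = ">3sB4sII"


def parse_header_structure_header(data: bytes):
    """Return (version, number of index entries, store size in bytes)."""
    magic, version, _reserved, n_entries, store_size = struct.unpack(
        HSH_FORMAT, data[:struct.calcsize(HSH_FORMAT)])
    if magic != HEADER_MAGIC:
        raise ValueError("header structure magic not found")
    return version, n_entries, store_size
```

Because the magic, the entry count, and the store size are all found at fixed positions, a reader needs nothing else to know how large the index and the store are.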
The Index Entry
The header structure's index is made up of zero or more index entries. Each entry is sixteen bytes long. The first four bytes contain a tag, a numeric value that identifies what type of data is pointed to by the entry. The tag values change according to the header structure's position in the RPM file. A list of the actual tag values, and what they represent, will be included later in this appendix.
Following the tag is a four-byte type, which is a numeric value that describes the format of the data pointed to by the entry. The types and their values do not change from header structure to header structure. Here is the current list:
NULL = 0
CHAR = 1
INT8 = 2
INT16 = 3
INT32 = 4
INT64 = 5
STRING = 6
BIN = 7
STRING_ARRAY = 8
A few of the data types might need some clarification. The STRING data type is simply a null-terminated string, while the STRING_ARRAY is a collection of strings. Finally, the BIN data type is a collection of binary data. This is normally used to identify data that is longer than an INT, but not a printable STRING.
Next is a four-byte offset that contains the position of the data, relative to the beginning of the store. We'll talk about the store in just a moment.
Finally, there is a four-byte count that contains the number of data items pointed to by the index entry. There are a few wrinkles to the meaning of the count, and they center around the STRING and STRING_ARRAY data types. STRING data always has a count of 1, while STRING_ARRAY data has a count equal to the number of strings contained in the store.
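The sixteen-byte index entry described above maps naturally onto four big-endian 32-bit integers. The following Python sketch uses an illustrative IndexEntry tuple and parse_index function (both names are hypothetical), and treats all four fields as unsigned, which is an assumption of the example.

```python
import struct
from typing import List, NamedTuple

# Data type codes, as listed in the text.
TYPE_NAMES = {
    0: "NULL", 1: "CHAR", 2: "INT8", 3: "INT16", 4: "INT32",
    5: "INT64", 6: "STRING", 7: "BIN", 8: "STRING_ARRAY",
}

ENTRY_SIZE = 16  # four 4-byte fields


class IndexEntry(NamedTuple):
    tag: int     # identifies what the data is
    type: int    # one of the codes in TYPE_NAMES
    offset: int  # position of the data, relative to the start of the store
    count: int   # number of data items


def parse_index(data: bytes, n_entries: int) -> List[IndexEntry]:
    """Decode n_entries consecutive index entries from data."""
    return [IndexEntry(*struct.unpack_from(">IIII", data, i * ENTRY_SIZE))
            for i in range(n_entries)]
```

Keeping the type code separate from the tag is what lets a reader skip over entries whose tags it does not recognize: the type and count are enough to know how much data the entry covers.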
The Store
The store is where the data contained in the header structure is stored. Depending on the data type being stored, there are some details that should be kept in mind:
For STRING data, each string is terminated with a null byte.
For INT data, each integer is stored at the natural boundary for its type. A 64-bit INT is stored on an 8-byte boundary, a 16-bit INT is stored on a 2-byte boundary, and so on.
All data is in network byte order.
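To make the store rules concrete, here is a small Python sketch that fetches one data item from a store, given the type, offset, and count from its index entry. The function names are hypothetical, integers are decoded as unsigned (an assumption), and strings are decoded as Latin-1 simply so that decoding never fails.

```python
import struct

# struct format characters for the integer types (INT8, INT16, INT32, INT64).
INT_FORMATS = {2: "B", 3: "H", 4: "I", 5: "Q"}
CHAR, STRING, BIN, STRING_ARRAY = 1, 6, 7, 8


def read_strings(store: bytes, offset: int, count: int) -> list:
    """Read count consecutive null-terminated strings starting at offset."""
    strings = []
    for _ in range(count):
        end = store.index(b"\0", offset)  # each string ends with a null byte
        strings.append(store[offset:end].decode("latin-1"))
        offset = end + 1
    return strings


def read_item(store: bytes, type_code: int, offset: int, count: int):
    """Return the data an index entry points to, decoded according to its type."""
    if type_code in INT_FORMATS:
        # ">" selects network (big-endian) byte order with no padding.
        fmt = ">%d%s" % (count, INT_FORMATS[type_code])
        return list(struct.unpack_from(fmt, store, offset))
    if type_code == STRING:
        return read_strings(store, offset, 1)[0]
    if type_code == STRING_ARRAY:
        return read_strings(store, offset, count)
    if type_code in (CHAR, BIN):
        return store[offset:offset + count]
    return None  # NULL entries carry no data
```

For example, a store holding the bytes 0004 4c4f followed by the strings /bin/rpm and /etc/rpmrc yields [281679] for an INT32 entry at offset 0, and both filenames for a STRING_ARRAY entry at offset 4 with a count of 2.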
With all these details out of the way, let's take a look at the signature.
The Signature
The signature section follows the lead in the RPM package file. It contains information that can be used to verify the integrity, and optionally, the authenticity of the majority of the package file. The signature is implemented as a header structure.
You probably noticed the word, "majority", above. The information in the signature header structure is based on the contents of the package file's header and archive only. The data in the lead and the signature header structure are not included when the signature information is created, nor are they part of any subsequent checks based on that information.
While that omission might seem to be a weakness in RPM's design, it really isn't. In the case of the lead, since it is used only for easy identification of package files, any changes made to that part of the file would, at worst, leave the file in such a state that RPM wouldn't recognize it as a valid package file. Likewise, any changes to the signature header structure would make it impossible to verify the file's integrity, since the signature information would have been changed from its original value.
Analyzing the Signature Area
[Hex dump: the signature area of the package file rpm-2.2.1-1.i386.rpm]
The first three bytes (8ead e8) contain the magic number for the start of the header structure. The next byte (01) is the header structure's version.
As we discussed earlier, the next four bytes (0000 0000) are reserved. The four bytes after that (0000 0003) represent the number of index entries in the signature section, namely, three. Following that are four bytes (0000 00ac) that indicate how many bytes of data are stored in the signature. The hex value 00ac, when converted to decimal, means the store is 172 bytes long.
Following the first 16 bytes is the index. Each of the three index entries in this header structure consists of four 32-bit integers, in the following order:
Tag
Type
Offset
Count
The tag of the first index entry consists of its first four bytes (0000 03e8), which is 1000 when translated from hex. Looking in the RPM source directory at the file lib/signature.h, we find the definition of this tag.
So the tag we are studying is for a size signature. Let's continue.
The next four bytes (0000 0004) contain the data type. As we saw earlier, data type 4 means that the data stored for this index entry is a 32-bit integer. Skipping the next four bytes for a moment, the last four bytes (0000 0001) are the number of 32-bit integers pointed to by this index entry.
Now, let's go back to the four bytes prior to the count (0000 0000). This number is the offset, in bytes, at which the size signature is located. It has a value of zero, but the question is, zero bytes from what? The answer, although it doesn't do us much good, is that the offset is calculated from the start of the store. So first we must find where the store begins, and we can do that by performing a simple calculation.
Recall that the header structure header gives the number of index entries (0000 0003). Since we know that each index entry is sixteen bytes long (four for the tag, four for the type, four for the offset, and four for the count), we can multiply the number of entries (3) by the number of bytes in each entry (16), and obtain the total size of the index, which is 48 decimal, or 30 in hex. Since the first index entry starts at hex offset 70, we can simply add hex 30 to hex 70, and get, in hex, offset a0. So let's skip down to offset a0, and see what's there:
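The same calculation, written out as a few lines of Python purely as a check on the arithmetic (the function name is made up for the example):

```python
INDEX_ENTRY_SIZE = 16  # tag + type + offset + count, four bytes each


def store_start(first_entry_offset: int, n_entries: int) -> int:
    """File offset at which the store begins, right after the last index entry."""
    return first_entry_offset + n_entries * INDEX_ENTRY_SIZE


print(hex(3 * INDEX_ENTRY_SIZE))   # 0x30
print(hex(store_start(0x70, 3)))   # 0xa0
```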
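Since the size signature is a single 32-bit integer at offset zero in the store, the four bytes at offset a0 (0004 4c4f) should represent the size of this file. Converting to decimal, this is 281,679. Let's compare that with the size of the actual file.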
Hmmm, something's not right. Or is it? It looks like we're short by 336 bytes, or in hex, 150. Interesting how that's a nice round hex number, isn't it? For now, let's continue through the remainder of the index entries, and see if hex 150 pops up elsewhere.
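The second index entry describes BIN data located four bytes from the start of the store. The data starts with b025 (remember that offset of four!) and ends on the second line of the dump with 5375. This is a 128-bit MD5 checksum of the package file's header and archive sections.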
The third and final index entry has a tag value of 03ea (1002 in decimal, a PGP signature block) and is also a BIN data type. The data starts 20 decimal bytes from the start of the data area, which would put it at file offset b4 (in hex). It's a biggie: 152 bytes long! The data starts with the bytes 8900.
It ends with the bytes 4a9b. This is a 1,216-bit PGP signature block. It is also the end of the signature section. There are four null bytes following the last data item in order to round the size out so that it ends on an 8-byte boundary. This means that the offset of the next section starts at offset 150, in hex. Say, wasn't the size in the size signature off by 150 hex? Yes, the size in the signature is the size of the file, less the size of the lead and the signature sections.
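The offsets quoted in this walk-through can be cross-checked with a short Python sketch. The helper names are hypothetical. The starting offset of hex 60 for the signature is inferred from the first index entry sitting at hex offset 70, and the total file size is the figure implied by the 336-byte shortfall mentioned earlier, not a value read from a listing.

```python
HSH_SIZE = 16          # header structure header
INDEX_ENTRY_SIZE = 16  # one index entry


def header_structure_end(start: int, n_entries: int, store_size: int) -> int:
    """File offset just past the store of a header structure beginning at start."""
    return start + HSH_SIZE + n_entries * INDEX_ENTRY_SIZE + store_size


def align_up(offset: int, boundary: int = 8) -> int:
    """Round offset up to the next multiple of boundary."""
    return (offset + boundary - 1) // boundary * boundary


# Signature: starts at hex 60 (inferred), 3 index entries, hex ac bytes of store.
unpadded_end = header_structure_end(0x60, 3, 0xAC)
header_start = align_up(unpadded_end)

print(hex(unpadded_end))                # 0x14c
print(header_start - unpadded_end)      # 4 (the null padding bytes)
print(hex(header_start))                # 0x150

# The size signature covers everything from the header onward.
size_signature = 0x00044C4F
print(size_signature)                   # 281679
print(size_signature + header_start)    # 282015 (implied size of the whole file)
```

The store size also adds up: 4 bytes for the size, 16 for the MD5 checksum, and 152 for the PGP block come to 172, or hex ac.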
The Header
The header section contains all available information about the package. Entries such as the package's name, version, and file list are contained in the header. Like the signature section, the header is in header structure format. Unlike the signature, which has only three possible tag types, the header has more than sixty different tags. The list of currently defined tags appears later in this appendix in the section called Header Tag Listing. Be aware that the list of tags changes frequently; the definitive list appears in the RPM sources in lib/rpmlib.h.
Analyzing the Header
The header begins with the header structure magic number (8ead e8). The sixteen bytes, starting with the magic, are the header structure's header. They follow the same format as the header in the signature's header structure.
As before, the byte following the magic identifies this header structure as being in version 1 format. Following the four reserved bytes, we find the count of entries stored in the header (0000 0021). Converting to decimal, we find that there are 33 entries in the header. The next four bytes (0000 09d3), converted to decimal, tell us that there are 2,515 bytes of data in the store.
The first four bytes of the first index entry (0000 03e8) are the tag, which is the tag for the package name. The next four bytes indicate the data is type 6, or a null-terminated string. There's an offset of zero in the next four bytes, meaning that the data for this tag is first in the store. Finally, the last four bytes (0000 0001) show that the data count is 1, which is the only legal value for data of type STRING.
Since the data type for this entry is a null-terminated string, we need to keep reading bytes until we reach a byte whose numeric value is zero. We find the bytes 72, 70, 6d, and 00 (a null). Looking at the ASCII display on the right, we find that the bytes form the string rpm, which is the name of this package.
The byte at hex offset 509 is 2f, a "/". Reading up to the first null byte, we find that the first filename is /bin/rpm, followed by /etc/rpmrc. This continues on for 22 more filenames.
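The file offsets used in this section follow from the numbers in the header structure header. The Python sketch below is only a cross-check of that arithmetic, with an illustrative function name. It assumes the header begins at hex offset 150 and that the byte at hex 509 is the first byte of the filename list.

```python
HSH_SIZE = 16          # header structure header
INDEX_ENTRY_SIZE = 16  # one index entry


def layout(start: int, n_entries: int, store_size: int):
    """Return (index start, store start, end) file offsets of a header structure."""
    index_start = start + HSH_SIZE
    store_begin = index_start + n_entries * INDEX_ENTRY_SIZE
    return index_start, store_begin, store_begin + store_size


# Header: starts at hex 150, hex 21 (33) entries, hex 9d3 (2,515) bytes of store.
index_start, store_begin, header_end = layout(0x150, 0x21, 0x9D3)

print(hex(index_start))           # 0x160
print(hex(store_begin))           # 0x370
print(hex(0x509 - store_begin))   # 0x199 (store-relative offset of the first filename)
print(hex(header_end))            # 0xd43
```

The end of the header's store, hex d43, is exactly where the archive is said to begin, and the two filenames shown plus the 22 that follow make 24 strings in all.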
There are many more tags that we could decode, but they are all done in the same manner.
Header Tag Listing
For the most up-to-date list of header tags, look in the file lib/rpmlib.h in the latest version of the RPM sources.
The Archive
In this example, the archive starts at offset d43. According to the contents of /usr/lib/magic, the first two bytes of a gzipped file should be 1f8b, which is, in fact, what we see. The following byte (08) is the flag used by GNU zip to indicate the file has been compressed with gzip's "deflation" method. The byte at offset 8 from the start of the archive has a value of 02, which means that the archive has been compressed using gzip's maximum compression setting. The following byte contains a code indicating the operating system under which the archive was compressed. A 03 in this byte indicates that the compression ran under a UNIX-like operating system.
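The gzip fields mentioned here sit in the fixed ten-byte gzip member header, which can be decoded with the following Python sketch. The function name and dictionary keys are made up for the example. In the sample bytes, the flag byte and the four timestamp bytes are filled with zeros as placeholders, since the text does not give their values.

```python
import struct

GZIP_MAGIC = bytes.fromhex("1f8b")


def parse_gzip_header(data: bytes) -> dict:
    """Decode the fixed ten-byte header at the start of a gzip stream."""
    # Little-endian: magic (2), method (1), flags (1), mtime (4),
    # extra flags (1), operating system code (1).
    magic, method, flags, mtime, extra_flags, os_code = struct.unpack_from(
        "<2sBBIBB", data, 0)
    if magic != GZIP_MAGIC:
        raise ValueError("not gzip data")
    return {
        "method": method,            # 8 means deflate
        "flags": flags,
        "mtime": mtime,
        "extra_flags": extra_flags,  # 2 means maximum compression
        "os": os_code,               # 3 means a UNIX-like system
    }


# Sample header: flag and timestamp bytes are zero placeholders (assumption).
sample = bytes.fromhex("1f8b" "08" "00" "00000000" "02" "03")
print(parse_gzip_header(sample))
```

Unlike the rest of the package file, the gzip header stores its multi-byte timestamp in little-endian order, which is why the format string starts with "<" rather than ">".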
The remainder of the RPM package file is the compressed archive. After the archive is uncompressed, it is an ordinary cpio archive in SVR4 format with a CRC checksum.
Notes
[1] Please refer to the section called Identifying RPM files with the file(1) command for a discussion on identifying RPM package files with the file command.
[2] The header is discussed in the section called The Header.
[3] It should be noted that the architecture used internally by RPM is actually stored in the header. This value is strictly for file(1)'s use.