MINIX - A Technical Description

Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
From: ast@botter.UUCP
Newsgroups: comp.sys.ibm.pc
Subject: MINIX - A Technical Description
Message-ID: <1034@botter.cs.vu.nl>
Date: Thu, 15-Jan-87 17:54:20 EST
Article-I.D.: botter.1034
Posted: Thu Jan 15 17:54:20 1987
Date-Received: Sun, 18-Jan-87 00:19:40 EST
Reply-To: ast@cs.vu.nl (Andy Tanenbaum)
Distribution: world
Organization: VU Informatica, Amsterdam
Lines: 457


Below is an extract of the paper I will be presenting at UNIFORUM on Jan 22.
I may try to set up a BOF session on MINIX at USENIX for the benefit of
people not going to UNIFORUM (I originally submitted my paper to USENIX
but it was rejected because it didn't fit in anywhere).   Check the listings.
I am going to Palo Alto after the meetings, and won't be back reading my mail
until early February.

Well over 100 people have written me expressing interest in comp.os.minix.
I have communicated this to the net adminstrators.

Andy Tanenbaum
----------------------------Use troff -ms on this paper--------------------
.RP
.ND
.nr PD 0i
.tr ~ 
.de MX
\s-1MINIX\s+1\\$1
..
.LP
.TL
MINIX: A CHEAP UNIX CLONE WITH SOURCE CODE
.AU
Andrew S. Tanenbaum
.AI
Dept. of Mathematics and Computer Science
Vrije Universiteit
Amsterdam, The Netherlands
.sp 4
.NH 1
OVERVIEW OF THE MINIX SYSTEM ARCHITECTURE
.PP
.UX
is organized as a single executable program that is loaded into memory at
system boot time and then run.
.MX
is structured in a much more modular way, as a collection of processes
that communicate with each other and with user processes by sending and
receiving messages.
There are separate processes for the memory manager, the file system, for
each device driver, and for certain other system functions.
This structure enforces a better interface between the pieces.
The file system cannot, for example, accidentally
change the memory manager's tables because the file system and
memory manager each have their own private address spaces.
.PP
These system processes are each full-fledged processes, with their own
memory allocation, process table entry and state.
They can be run, blocked, and send messages, just as the user processes.
In fact, the memory manager and file system each run in user space as
ordinary processes.
The device drivers are all linked together with the kernel into the same
binary program, but they communicate with each other and with the other
processes by message passing.
.PP
When the system is compiled, four 
binary programs are independently created: the kernel
(including the driver processes), the memory manager, the file system,
and 
.I init 
(which reads 
.I /etc/ttys
and forks off the login processes).
In other words, compiling the system results in four distinct
.I a.out 
files.
When the system is booted, all four of these are read into memory from the
boot diskette.
.PP
It is possible, and in fact, normal, to modify, recompile, and relink, say,
the file system, without having to relink the other three pieces.
This design provides a high degree of modularity by dividing the system up
into independent pieces, each with a well-defined function and interface
to the other pieces.
The pieces communicate by sending and receiving messages.
.PP
The various processes are structured in four layers:
.sp
   4.~The user processes (top layer).
.br
   3.~The server processes (memory manager and file system).
.br
   2.~The device drivers, one process per device.
.br
   1.~Process and message handling (bottom layer).
.sp
Let us now briefly summarize the function of each layer.
.PP
Layer 1 is concerned with doing process management including CPU scheduling
and interprocess communication.
When a process does a SEND or RECEIVE, it traps to the kernel, which then
tries to execute the command.
If the command cannot be executed (e.g., a process does a RECEIVE and there
are no messages waiting for it), the caller is blocked until the command can be
executed, at which time the process is reactivated.
When an interrupt occurs, layer 1 converts it into a message to the
appropriate device driver, which will normally be blocked waiting for it.
The decision about which process to run when is also made in layer 1.
A priority algorithm is used, giving device drivers higher priority over
ordinary user processes, for example.
.PP
Layer 2 contains the device drivers, one process per major device.
These processes are part of the kernel's address space because they must
run in kernel mode to access I/O device registers and execute I/O
instructions.
Although the IBM PC does not have user mode/kernel mode, most other
machines do, so this decision has been made with an eye toward the
future.
To distinguish the processes within the kernel from those in user space,
the kernel processes are called 
.B tasks .
.PP
Layer 3 contains only two processes, the memory manager and the file
system.
They are both structured as 
.B servers ,
with the user processes as 
.B clients .
When a user process (i.e., a client) wants to execute a system call, it
calls, for example, the library procedure
.I read
with the file descriptor, buffer, and count.
The library procedure builds a message containing the system call number
and the parameters and sends it to the file system.
The client then blocks waiting for a reply.
When the file system receives the message, it carries it out and sends
back a reply containing the number of bytes read or the error code.
The library procedure gets the reply and returns the result to the caller
in the usual way.
The user is completely unaware of what is going on here, making it easy to
replace the local file system with a remote one.
.PP
Layer 4 contains the user programs.
When the system comes up,
.I init
forks off login processes, which then wait for input.
On a successful login, the shell is executed.
Processes can fork, resulting in a tree of processes, with 
.I init
at the root.
When CTRL-D is typed to a shell, it exits, and 
.I init
replaces the shell with another login process.
.NH 1
LAYER 1 - PROCESSES AND MESSAGES
.PP
The two basic concepts on which 
.MX
is built are processes and messages.
A process is an independently schedulable entity with its own process table
entry.
A message is a structure containing the sender's process number, a message
type field, and a variable part (a union) containing the parameters or
reply codes of the message.
Message size is fixed, depending on how big the union happens to be on the
machine in question.
On the IBM PC it is 24 bytes.
.PP
Three kernel calls are provided:
.sp
   - RECEIVE(source, &message)
.br
   - SEND(destination, &message)
.br
   - SENDREC(process, &message)
.sp
These are the only true system calls (i.e., traps to the kernel).
RECEIVE announces the willingness of the caller to accept a message from
a specified process, or ANY, if the RECEIVER will accept any message.
(From here on, ``process'' also includes the tasks.)
If no message is available, the receiving process is blocked.
SEND attempts to transmit a message to the destination process.
If the destination process is currently blocked trying to receive from the
sender, the kernel copies the message from the sender's buffer to the
receiver's buffer, and then marks them both as runnable.
If the receiver is not waiting for a message from the sender, the sender is
blocked.
.PP
The SENDREC primitive combines the functions of the other two.
It sends a message to the indicated process, and then blocks until a
reply has been received.
The reply overwrites the original message.
User processes use SENDREC to execute system calls by sending messages to
the servers and then blocking until the reply arrives.
.PP
There are two ways to enter the kernel.
One way is by the trap resulting from a process' attempt to send or receive
a message.
The other way is by an interrupt.
When an interrupt occurs, the registers and machine state of the currently
running process are saved in its process table entry.
Then a general interrupt handler is called with the interrupt number as
parameter.
This procedure builds a message of type INTERRUPT, copies it to the
buffer of the waiting task, marks that task as runnable (unblocked), and
then calls the scheduler to see who to run next.
.PP
The scheduler maintains three queues, corresponding to layers 2, 3, and 4,
respectively.
The driver queue has the highest priority, the server queue has middle priority,
and the user queue has lowest priority.
The scheduling algorithm is simple: find the highest priority queue that has
at least one process on it, and run the first process on that queue.
In this way, a clock interrupt will cause a process switch if the file
system was running, but not if the disk driver was running.
If the disk driver was running, the clock task will be put at the end of
the highest priority queue, and run when its turn comes.
.PP
In addition to this rule, once every 100 msec, the clock task checks to see
if the current process is a user process that has been running for at least
100 msec.
If so, that user is removed from the front of the user queue and put on the
back.
In effect, compute bound user processes are run using a round robin
scheduler.
Once started, a user process runs until either it blocks trying to send or
receive a message, or it has had 100 msec of CPU time.
This algorithm is simple, fair, and easy to implement.
.NH 1
LAYER 2 - DEVICE DRIVERS
.PP
Like all versions of
.UX
for the IBM PC,
.MX
does not use the ROM BIOS for input or output because the BIOS does not 
support interrupts.
Suppose a user forks off a compilation in the background and then calls the
editor.
If the editor tried to read from the terminal using the BIOS, the
compilation (and any other background jobs such as printing) would be
stopped dead in their tracks waiting for the the next character to be typed.
Such behavior may be acceptable in the MS-DOS world, but it certainly
is not in the
.UX 
world.
As a result, 
.MX
contains a complete set of drivers that duplicate the functions of the BIOS.
Like the rest of
.MX ,
these drivers are written in C, not assembly language.
.PP
This design has important implications for running 
.MX
on PC clones.
A clone whose hardware is not compatible with the PC down to the chip
level, but which tries to hide the differences by making the BIOS calls
functionally identical to IBM's will not run an unmodified
.MX
because 
.MX
does not use the BIOS.
.PP
Each device driver is a separate process in
.MX .
At present, the drivers include the clock driver, terminal driver, various
disk drivers (e.g., RAM disk, floppy disk), and printer driver.
Each driver has a main loop consisting of three actions:
.sp
   1.~Wait for an incoming message.
.br
   2.~Perform the request contained in the message.
.br
   3.~Send a reply message.
.sp
Request messages have a standard format, containing the opcode (e.g., READ,
WRITE, or IOCTL), the minor device number, the position (e.g., disk block
number), the buffer address, the byte count, and the number of the process
on whose behalf the work is being done.
.PP
As an example of where device drivers fit in, consider what happens
when a user wants to read from a file.
The user sends a message to the file system.
If the file system has the needed data in its buffer cache, they are copied
back to the user.
Otherwise, the file system sends a message to the disk task requesting that
the block be read into a buffer within the file system's address space
(in its cache).
Users may not send messages to the tasks directly.
Only the servers may do this.
.PP
.MX
supports a RAM disk.
In fact, the RAM disk is always used to hold the root device.
When the system is booted, after the operating system has been loaded, the user
is instructed to insert the root file system diskette.
The file system then sees how big it is, allocates the necessary memory,
and copies the diskette to the RAM disk.
Other file systems can then be mounted on the root device.
.PP
This organization puts important directories such as
.I /bin
and
.I /tmp
on the fastest device, and also makes it easy to work with either floppy
disks or hard disks or a mixture of the two by mounting them on
.I /usr
or
.I /user
or elsewhere.
In any event, the root device is always in the same place.
.PP
In the standard distribution, the RAM disk is about 240K, most of which
is full of parts of the C compiler.
In the 256K system, a much smaller RAM disk has to be used, which explains
why this version has no C compiler: there is no place to put it.
(The
.I /usr
diskette is completely full with the other utility programs and one of the
design goals was to make the system run on a 256K PC with 1 floppy disk.)
Users with an unusual configuration such as 256K and three hard disks are free to
juggle things around as they see fit.
.PP
The terminal driver is compatible with the standard V7 terminal driver.
It supports cooked mode, raw mode, and cbreak mode.
It also supports several escape sequences, such as cursor positioning and
reverse scrolling because the screen editor needs them.
.PP
The printer driver copies its input to the printer character for character
without modification.
It does not even convert line feed to carriage return + line feed.
This makes it possible to send escape sequences to graphics printers without
the driver messing things up.
.MX
does not spool output because floppy disk systems rarely have enough spare
disk space for the spooling directory.
Instead one normally would print a file
.I f
by saying
.sp
lpr <f &
.sp
to do the printing in the background.
The
.I lpr
program insert carriage returns, expands tabs, and so on, so it should only
be used for straight ASCII files.
On hard disk systems, a spooler would not be difficult to write.
.NH 1
LAYER 3 - SERVERS
.PP
Layer 3 contains two server processes: the memory manager and the file system.
They are both structured in the same way as the device drivers, that is a
main loop that accepts requests, performs them, and then replies.
We will now look at each of these in turn.
.PP
The memory manager's job is to handle those system calls that affect
memory allocation, as well as a few others.
These include FORK, EXEC, WAIT, KILL, and BRK.
The memory model used by 
.MX
is exceptionally simple in order to accommodate computers without any
memory management hardware.
When the shell forks off a process, a copy of the shell is made in memory.
When the child does an EXEC, the new core image is placed in memory.
Thereafter it is never moved.
.MX
does not swap or page.
.PP
The amount of memory allocated to the process is determined by a field in
the header of the executable file.
A program,
.I chmem ,
has been provided to manipulate this field.
When a process is started, the text segment is set at the very bottom of
the allocated memory area, followed by the data and bss.
The stack starts at the top of the allocated memory and grows downward.
The space between the bottom of the stack and the top of the data segment
is available for both segments to grow into as needed.
If the two segments meet, the process is killed.
.PP
In the past, before paging was invented, all memory allocation schemes
worked like this.
In the future, when even small microcomputers will use 32-bit CPUs and 
1M x 1 bit memory chips, the minimum feasible memory will be 4 megabytes and
this allocation scheme will probably become popular again due to its
inherent simplicity.
Thus the
.MX
scheme can be regarded as either hopelessly outdated or amazingly
futuristic, as you prefer.
.PP
The memory manager keeps track of memory using a list of holes.
When new memory is needed, either for FORK or for EXEC, it searches the
hole list and takes the first hole that is big enough (first fit).
When a process terminates, if it is adjacent to a hole on either side, the
process' memory and the hole are merged into a bigger hole.
.PP
The file system is really a remote file server that happens to be running
on the user's machine.
However it is straightforward to convert it into
a true network file server.
All that needs to be done is change the message interface and provide some
way of authenticating requests.
(In
.MX ,
the source field in the incoming message is trustworthy because it is
filled in by the kernel.)
When running remote, the
.MX
file server maintains state information, like RFS and unlike NFS.
.PP
The 
.MX
file system is similar to that of V7
.UX .
The i-node is slightly different, containing only 9 disk addresses instead of
13, and only 1 time instead of 3.
These changes reduce the i-node from 64 bytes to 32 bytes, to store more
i-nodes per disk block and reduce the size of the in-core i-node table.
.PP
Free disk blocks and free inodes are kept track of using bit maps rather
than free lists.
The bit maps for the root device and all mounted file systems are kept in
memory.
When a file grows, the system makes a definite effort to allocate the new
block as close as possible to the old ones, to minimize arm motion.
Disk storage is not necessarily allocated one block at a time.
A minor device can be configured to allocate 2, 4 (or more) contiguous blocks
whenever a block is allocated.
Although this wastes disk space, these
multiblock
.B zones
improve disk performance by keeping file blocks close together.
The standard parameters for
.MX
as distributed are 1K blocks and 1K zones (i.e., just 1 block per zone).
.PP
.MX
maintains a buffer cache of recently used blocks.
A hashing algorithm is used to look up blocks in the cache.
When an i-node block, directory block, or other critical block is modified,
it is written back to disk immediately.
Data blocks are only written back at the next SYNC or when the buffer is
needed for something else.
.PP
The
.MX
directory system and format is identical to that of V7
.UX .
File names are strings of up to 14 characters, and directories
can be arbitrarily long.
.NH 1
LAYER 4 - USER PROCESSES
.PP
This layer contains 
.I init ,
the shell, the editor, the compiler, the utilities, and all the user processes.
These processes may only send messages to the memory manager and the file
system, and these servers only accept valid system call requests.
Thus the user processes do not perceive
.MX
to be a general-purpose message passing system.
However, removing the one line of code that checks if the message destination
is valid would convert it into a much more general system (but less
.UX -like).